Third-Party AI Model Evaluation
Evaluate third-party AI models against defined performance, safety, and bias criteria before deploying them in enterprise workflows.
Objective
Ensure third-party AI models meet the organization's quality, safety, and fairness standards through independent testing before production deployment.
Maturity Levels
Initial
Third-party models are deployed based on vendor claims without independent evaluation.
Developing
Basic functional testing is performed but safety, bias, and adversarial robustness are not evaluated.
Defined
A documented evaluation framework is applied to all third-party models before deployment, covering performance, bias, and safety.
Managed
Evaluation results are retained and compared when models are updated; regression testing is performed at update time.
Optimizing
Evaluation methodology is benchmarked against industry standards; evaluation results are included in vendor scorecards.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Evaluation test plan and results for each third-party model, covering accuracy, bias, safety, and performance metrics
- Evaluation methodology documentation confirming tests were run against representative production data or realistic proxies
- Comparative evaluation records where multiple models were assessed before selection, with rationale for the choice made
- Re-evaluation records triggered by vendor model updates or material changes in production performance
- Sign-off records from AI Governance and relevant business stakeholders before third-party model deployment
Implementation Notes
Key steps
- Build an evaluation test suite specific to your use case; generic benchmarks do not predict performance on your actual data and tasks.
- Evaluate on your own data where possible: models that score well on public benchmarks may still perform poorly on your specific inputs and domains.
- Include bias and fairness evaluation as a required component, not an optional add-on; models that perform well overall may still have significant disparate impact on subgroups.
- Re-run evaluation when the vendor updates the model — treat model updates as new deployments requiring fresh validation.
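The steps above can be sketched as a small, re-runnable evaluation harness that scores a model's predictions and applies pass/fail thresholds, including a per-group bias check. This is a minimal illustration, not a prescribed implementation: the sample format, metric names, and threshold keys are assumptions chosen for the sketch.

```python
# Minimal sketch of a use-case-specific evaluation harness.
# Sample format and threshold keys are illustrative assumptions.

def evaluate(samples, thresholds):
    """Score predictions against labeled samples and apply pass thresholds.

    Each sample is a dict: {"label": bool, "pred": bool, "group": str},
    where True means policy-violating. Returns (metrics, passed).
    """
    tp = sum(1 for s in samples if s["label"] and s["pred"])
    fn = sum(1 for s in samples if s["label"] and not s["pred"])
    fp = sum(1 for s in samples if not s["label"] and s["pred"])
    tn = sum(1 for s in samples if not s["label"] and not s["pred"])

    metrics = {
        "accuracy": (tp + tn) / len(samples),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }

    # Bias check: spread of per-group false positive rates, in percentage points.
    group_fprs = []
    for g in {s["group"] for s in samples}:
        benign = [s for s in samples if s["group"] == g and not s["label"]]
        if benign:
            group_fprs.append(sum(s["pred"] for s in benign) / len(benign))
    metrics["fpr_disparity_pp"] = (
        (max(group_fprs) - min(group_fprs)) * 100 if group_fprs else 0.0
    )

    passed = (
        metrics["accuracy"] >= thresholds["accuracy"]
        and metrics["false_positive_rate"] <= thresholds["false_positive_rate"]
        and metrics["false_negative_rate"] <= thresholds["false_negative_rate"]
        and metrics["fpr_disparity_pp"] <= thresholds["fpr_disparity_pp"]
    )
    return metrics, passed
```

Keeping the harness as a standalone script means the same suite can be re-run unchanged when the vendor ships an update, directly supporting the re-evaluation step above.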
Example Implementation
AI team evaluating two competing LLM vendors for a customer-facing content moderation use case
Third-Party Model Evaluation Report — Content Moderation (excerpt)
Evaluation date: 2026-03-20 · Models evaluated: Vendor A (Moderation API v2) vs. Vendor B (Custom classifier)
Test suite: 2,000 labeled samples from production data (50% benign / 50% policy-violating across 6 categories)
Performance results:
| Metric | Vendor A | Vendor B | Pass Threshold |
|---|---|---|---|
| Overall accuracy | 94.2% | 91.8% | > 90% |
| False positive rate (benign → flagged) | 3.1% | 6.4% | < 5% |
| False negative rate (harmful → missed) | 2.8% | 4.2% | < 5% |
| Hate speech recall | 96.1% | 89.3% | > 92% |
| Latency p95 | 210ms | 380ms | < 500ms |
Bias evaluation (outcomes by demographic group in test set):
- Vendor A: no material disparity detected across gender, age, or geographic origin of content authors
- Vendor B: 7.2 pp higher false positive rate for content from non-English speakers — fails bias threshold (> 5 pp)
Safety evaluation: 50-case adversarial test (jailbreak attempts, prompt injection) — both vendors: 0 successful bypasses
Recommendation: Vendor A · Meets all thresholds including bias · Re-evaluation required if vendor updates model
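The pass thresholds in the excerpt above lend themselves to an automated gate, so that re-evaluation after a vendor update is a single scripted run rather than a manual report review. A minimal sketch follows; the threshold values mirror the report, but the function and field names are hypothetical.

```python
# Hypothetical sketch: encoding the report's pass thresholds as an automated gate.
# Threshold values come from the report excerpt; names are illustrative.

THRESHOLDS = {
    "overall_accuracy":    (0.90, "min"),  # > 90%
    "false_positive_rate": (0.05, "max"),  # < 5%
    "false_negative_rate": (0.05, "max"),  # < 5%
    "hate_speech_recall":  (0.92, "min"),  # > 92%
    "latency_p95_ms":      (500,  "max"),  # < 500 ms
    "fpr_disparity_pp":    (5.0,  "max"),  # bias gate: < 5 pp across groups
}

def gate(results):
    """Return the list of failed checks; an empty list means the model passes."""
    failures = []
    for name, (limit, kind) in THRESHOLDS.items():
        value = results[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{name}: {value} (needs {kind} {limit})")
    return failures

# Vendor B's figures from the excerpt: fails FPR, hate speech recall, and bias.
vendor_b = {
    "overall_accuracy": 0.918, "false_positive_rate": 0.064,
    "false_negative_rate": 0.042, "hate_speech_recall": 0.893,
    "latency_p95_ms": 380, "fpr_disparity_pp": 7.2,
}
print(gate(vendor_b))
```

Treating the bias disparity as just another gated metric, rather than a separate manual review, keeps fairness from becoming the optional add-on the implementation notes warn against.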
Control Details
- Control ID: PRC-003
- Domain: Procurement
- Typical owner: AI Engineering / AI Governance Team
- Implementation effort: High
- Agent-relevant: No
