Continuous Model Evaluation
Run ongoing evaluation pipelines against held-out test sets and curated adversarial examples to continuously measure model performance in production.
Objective
Automate evaluation on a regular cadence to detect performance degradation or behavioral changes that periodic manual evaluations would miss.
Maturity Levels
Initial
Evaluation is performed only at initial deployment; no ongoing evaluation exists.
Developing
Manual evaluations are conducted occasionally but are neither automated nor run on a defined schedule.
Defined
Automated evaluation pipelines run on a defined schedule with results reported to model owners.
Managed
Evaluation results are trended over time; significant changes trigger review and potential remediation.
Optimizing
Evaluation sets are continuously expanded based on production failure modes; evaluation coverage is a tracked metric.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Evaluation pipeline configuration and test suite documentation listing metrics, datasets, and pass thresholds
- Automated evaluation run history showing test results for each model version over a defined period
- Regression alert records for any evaluation run where performance declined from the previous version
- Human evaluation records for spot-checks of automated evaluation accuracy
- Evaluation dataset governance records confirming test sets are not contaminated by training data
Implementation Notes
Key steps
- Build and maintain a curated evaluation set that is held out from training and covers the full range of expected inputs including edge cases and adversarial examples.
- Run evaluation on every model version change, not just on a time schedule; behavioral regressions often appear immediately after an update (see the runner sketch after this list).
- Include human evaluation for subjective dimensions (helpfulness, tone, appropriateness) that automated metrics cannot capture reliably.
- Publish evaluation results to stakeholders as part of model governance reporting; results that stay within the engineering team provide weak governance assurance.
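The first two steps can be wired together in a small runner. Below is a minimal sketch, assuming a JSONL eval set with `input` fields, a caller-supplied `summarize` model client, and caller-supplied metric functions; the names, file path, and thresholds are illustrative, not a prescribed interface:

```python
import json
import statistics
from pathlib import Path

# Illustrative thresholds mirroring the example implementation below.
THRESHOLDS = {
    "rouge_l": ("min", 0.52),             # higher is better
    "hallucination_rate": ("max", 0.04),  # lower is better
    "format_pass_rate": ("min", 0.99),
}
HISTORY = Path("eval_runs.jsonl")  # hypothetical run-history store (audit evidence)

def run_eval(model_version, eval_set_path, summarize, metric_fns):
    """Score one model version against a held-out eval set and log the run."""
    with open(eval_set_path) as fh:
        samples = [json.loads(line) for line in fh if line.strip()]
    per_metric = {name: [] for name in metric_fns}
    for sample in samples:
        prediction = summarize(sample["input"])  # assumed model client
        for name, fn in metric_fns.items():
            per_metric[name].append(fn(prediction, sample))
    results = {name: statistics.mean(vals) for name, vals in per_metric.items()}

    failures = []
    for name, (direction, bound) in THRESHOLDS.items():
        value = results.get(name)
        if value is None:
            continue
        if (direction == "min" and value < bound) or (direction == "max" and value > bound):
            failures.append(f"{name}={value:.3f} breaches {direction} bound {bound}")

    record = {"model_version": model_version, "results": results, "failures": failures}
    with HISTORY.open("a") as fh:
        fh.write(json.dumps(record) + "\n")  # append-only history for auditors
    return record

def regressions(record, tolerance=0.02):
    """Flag metrics that moved the wrong way versus the previous model version."""
    runs = [json.loads(line) for line in HISTORY.open() if line.strip()]
    previous = next((r for r in reversed(runs)
                     if r["model_version"] != record["model_version"]), None)
    if previous is None:
        return []
    flags = []
    for name, value in record["results"].items():
        prev = previous["results"].get(name)
        if prev is None:
            continue
        direction = THRESHOLDS.get(name, ("min",))[0]
        worse = value < prev - tolerance if direction == "min" else value > prev + tolerance
        if worse:
            flags.append(name)
    return flags
```

Calling `run_eval` from both a weekly scheduler and the model release pipeline satisfies the second step, and the appended `eval_runs.jsonl` doubles as the run-history and regression-alert evidence listed above.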
Example Implementation
AI engineering team running weekly automated evals on a generative summarization model
Continuous Evaluation Pipeline — Document Summarization Model
Evaluation sets maintained:
| Set Name | Size | Composition | Update Cadence |
|---|---|---|---|
| Core regression | 500 samples | Representative production inputs; curated at launch | Quarterly additions |
| Adversarial | 120 samples | Injection attempts, edge cases, known failure modes | After every incident |
| Human-eval sample | 50 samples | Random production sample scored by human raters | Weekly |
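Keeping these sets genuinely held out from training data (the dataset governance evidence above) can be spot-checked with an exact-duplicate scan. A minimal sketch, assuming in-memory lists of raw texts; hash matching only catches verbatim overlap, so treat it as a floor rather than a guarantee:

```python
import hashlib

def _fingerprint(text):
    """Hash of case- and whitespace-normalized text for exact-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def contamination_report(train_texts, eval_texts):
    """Return eval items whose normalized text also appears in the training corpus."""
    train_hashes = {_fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if _fingerprint(t) in train_hashes]
```

Near-duplicates slip past exact hashing; n-gram overlap or MinHash-style fuzzy deduplication gives stronger assurance when access to the training corpus permits it.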
Automated metrics (run weekly + on every model version change):
| Metric | Threshold | Current | Trend |
|---|---|---|---|
| ROUGE-L vs. reference | > 0.52 | 0.57 | Stable |
| Hallucination rate (NLI-based) | < 4% | 2.8% | Stable |
| Format pass rate | > 99% | 99.6% | Stable |
| Latency p95 | < 800ms | 620ms | Stable |
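One way to approximate the NLI-based hallucination metric, assuming the Hugging Face transformers library; the checkpoint name and the `ENTAILMENT` label string are assumptions and vary by model:

```python
from transformers import pipeline

# Assumed MNLI-style checkpoint; swap in whatever NLI model the team has validated.
nli = pipeline("text-classification", model="roberta-large-mnli")

def hallucination_rate(pairs):
    """Fraction of (source, summary) pairs where the source does not entail the summary."""
    inputs = [{"text": source, "text_pair": summary}  # premise=source, hypothesis=summary
              for source, summary in pairs]
    outputs = nli(inputs, truncation=True)
    flagged = sum(1 for out in outputs if out["label"] != "ENTAILMENT")
    return flagged / len(inputs)
```

Scoring whole summaries against whole sources is coarse; splitting summaries into sentences and checking entailment per claim usually tracks human judgments more closely.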
Human eval (weekly sample): Raters score helpfulness and accuracy 1–5; target ≥ 4.0 average
Governance reporting: Results published to #model-evals Slack channel weekly; failures trigger release hold
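A sketch of the last-mile reporting, assuming a Slack incoming webhook (the environment variable name is a placeholder) and the `record` structure from the runner sketch above; exiting non-zero is one simple way to make a CI pipeline hold the release:

```python
import os
import sys
import requests

WEBHOOK_URL = os.environ["MODEL_EVALS_SLACK_WEBHOOK"]  # placeholder secret name

def report(record):
    """Post the run summary to #model-evals and block release on threshold breaches."""
    status = "FAILED" if record["failures"] else "passed"
    lines = [f"Eval run {record['model_version']}: {status}"]
    lines += [f"- {name}: {value:.3f}" for name, value in record["results"].items()]
    lines += [f"- breach: {failure}" for failure in record["failures"]]
    requests.post(WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)
    if record["failures"]:
        sys.exit(1)  # failing the CI job enforces the release hold
```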
Control Details
- Control ID
- MON-005
- Domain
- Monitoring & Drift
- Typical owner
- AI Engineering / AI Governance Team
- Implementation effort
- Medium effort
- Agent-relevant
- No
