Continuous Model Evaluation
Run ongoing evaluation pipelines against held-out test sets and curated adversarial examples to continuously measure model performance in production.
Objective
Automate evaluation on a regular cadence to detect performance degradation or behavioral changes that periodic manual evaluations would miss.
Maturity Levels
Initial
Evaluation is performed only at initial deployment; no ongoing evaluation exists.
Developing
Manual evaluations are conducted occasionally but are neither automated nor run on a defined schedule.
Defined
Automated evaluation pipelines run on a defined schedule with results reported to model owners.
Managed
Evaluation results are trended over time; significant changes trigger review and potential remediation.
Optimizing
Evaluation sets are continuously expanded based on production failure modes; evaluation coverage is a tracked metric.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Evaluation pipeline configuration and test suite documentation listing metrics, datasets, and pass thresholds
- Automated evaluation run history showing test results for each model version over a defined period
- Regression alert records for any evaluation run where performance declined from the previous version
- Human evaluation records for spot-checks of automated evaluation accuracy
- Evaluation dataset governance records confirming test sets are not contaminated by training data
Implementation Notes
Key steps
- Build and maintain a curated evaluation set that is held out from training and covers the full range of expected inputs including edge cases and adversarial examples.
- Run evaluation on every model version change, not just on a time schedule; behavioral regressions often appear immediately after an update (see the runner sketch after this list).
- Include human evaluation for subjective dimensions (helpfulness, tone, appropriateness) that automated metrics cannot capture reliably.
- Publish evaluation results to stakeholders as part of model governance reporting; results that stay within the engineering team provide weak governance assurance.
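The first two steps can be wired together in a small runner. Below is a minimal sketch, assuming a JSONL eval set with `input` fields, a caller-supplied `summarize` model client, and caller-supplied metric functions; the names, file path, and thresholds are illustrative, not a prescribed interface:

```python
import json
import statistics
from pathlib import Path

# Illustrative thresholds mirroring the example implementation below.
THRESHOLDS = {
    "rouge_l": ("min", 0.52),             # higher is better
    "hallucination_rate": ("max", 0.04),  # lower is better
    "format_pass_rate": ("min", 0.99),
}
HISTORY = Path("eval_runs.jsonl")  # hypothetical run-history store (audit evidence)

def run_eval(model_version, eval_set_path, summarize, metric_fns):
    """Score one model version against a held-out eval set and log the run."""
    with open(eval_set_path) as fh:
        samples = [json.loads(line) for line in fh if line.strip()]
    per_metric = {name: [] for name in metric_fns}
    for sample in samples:
        prediction = summarize(sample["input"])  # assumed model client
        for name, fn in metric_fns.items():
            per_metric[name].append(fn(prediction, sample))
    results = {name: statistics.mean(vals) for name, vals in per_metric.items()}

    failures = []
    for name, (direction, bound) in THRESHOLDS.items():
        value = results.get(name)
        if value is None:
            continue
        if (direction == "min" and value < bound) or (direction == "max" and value > bound):
            failures.append(f"{name}={value:.3f} breaches {direction} bound {bound}")

    record = {"model_version": model_version, "results": results, "failures": failures}
    with HISTORY.open("a") as fh:
        fh.write(json.dumps(record) + "\n")  # append-only history for auditors
    return record

def regressions(record, tolerance=0.02):
    """Flag metrics that moved the wrong way versus the previous model version."""
    runs = [json.loads(line) for line in HISTORY.open() if line.strip()]
    previous = next((r for r in reversed(runs)
                     if r["model_version"] != record["model_version"]), None)
    if previous is None:
        return []
    flags = []
    for name, value in record["results"].items():
        prev = previous["results"].get(name)
        if prev is None:
            continue
        direction = THRESHOLDS.get(name, ("min",))[0]
        worse = value < prev - tolerance if direction == "min" else value > prev + tolerance
        if worse:
            flags.append(name)
    return flags
```

Calling `run_eval` from both a weekly scheduler and the model release pipeline satisfies the second step, and the appended `eval_runs.jsonl` doubles as the run-history and regression-alert evidence listed above.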
Example Implementation
AI engineering team running weekly automated evals on a generative summarization model
Continuous Evaluation Pipeline — Document Summarization Model
Evaluation sets maintained:
| Set Name | Size | Composition | Update Cadence |
|---|---|---|---|
| Core regression | 500 samples | Representative production inputs; curated at launch | Quarterly additions |
| Adversarial | 120 samples | Injection attempts, edge cases, known failure modes | After every incident |
| Human-eval sample | 50 samples | Random production sample scored by human raters | Weekly |
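Keeping these sets genuinely held out from training data (the dataset governance evidence above) can be spot-checked with an exact-duplicate scan. A minimal sketch, assuming in-memory lists of raw texts; hash matching only catches verbatim overlap, so treat it as a floor rather than a guarantee:

```python
import hashlib

def _fingerprint(text):
    """Hash of case- and whitespace-normalized text for exact-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def contamination_report(train_texts, eval_texts):
    """Return eval items whose normalized text also appears in the training corpus."""
    train_hashes = {_fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if _fingerprint(t) in train_hashes]
```

Near-duplicates slip past exact hashing; n-gram overlap or MinHash-style fuzzy deduplication gives stronger assurance when access to the training corpus permits it.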
Automated metrics (run weekly + on every model version change):
| Metric | Threshold | Current | Trend |
|---|---|---|---|
| ROUGE-L vs. reference | > 0.52 | 0.57 | Stable |
| Hallucination rate (NLI-based) | < 4% | 2.8% | Stable |
| Format pass rate | > 99% | 99.6% | Stable |
| Latency p95 | < 800ms | 620ms | Stable |
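One way to approximate the NLI-based hallucination metric, assuming the Hugging Face transformers library; the checkpoint name and the `ENTAILMENT` label string are assumptions and vary by model:

```python
from transformers import pipeline

# Assumed MNLI-style checkpoint; swap in whatever NLI model the team has validated.
nli = pipeline("text-classification", model="roberta-large-mnli")

def hallucination_rate(pairs):
    """Fraction of (source, summary) pairs where the source does not entail the summary."""
    inputs = [{"text": source, "text_pair": summary}  # premise=source, hypothesis=summary
              for source, summary in pairs]
    outputs = nli(inputs, truncation=True)
    flagged = sum(1 for out in outputs if out["label"] != "ENTAILMENT")
    return flagged / len(inputs)
```

Scoring whole summaries against whole sources is coarse; splitting summaries into sentences and checking entailment per claim usually tracks human judgments more closely.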
Human eval (weekly sample): Raters score helpfulness and accuracy 1–5; target ≥ 4.0 average
Governance reporting: Results published to #model-evals Slack channel weekly; failures trigger release hold
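A sketch of the last-mile reporting, assuming a Slack incoming webhook (the environment variable name is a placeholder) and the `record` structure from the runner sketch above; exiting non-zero is one simple way to make a CI pipeline hold the release:

```python
import os
import sys
import requests

WEBHOOK_URL = os.environ["MODEL_EVALS_SLACK_WEBHOOK"]  # placeholder secret name

def report(record):
    """Post the run summary to #model-evals and block release on threshold breaches."""
    status = "FAILED" if record["failures"] else "passed"
    lines = [f"Eval run {record['model_version']}: {status}"]
    lines += [f"- {name}: {value:.3f}" for name, value in record["results"].items()]
    lines += [f"- breach: {failure}" for failure in record["failures"]]
    requests.post(WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)
    if record["failures"]:
        sys.exit(1)  # failing the CI job enforces the release hold
```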
Control Details
- Control ID
- MON-005
- Domain
- Monitoring & Drift
- Typical owner
- AI Engineering / AI Governance Team
- Implementation effort
- Medium effort
- Agent-relevant
- No
