
MON · Monitoring & Drift · MON-005 · Medium effort

Continuous Model Evaluation

Run ongoing evaluation pipelines against held-out test sets and curated adversarial examples to continuously measure model performance in production.

Objective

Automate evaluation on a regular cadence to detect performance degradation or behavioral changes that periodic manual evaluations would miss.

Maturity Levels

  • Level 1 (Initial): Evaluation is performed only at initial deployment; no ongoing evaluation exists.
  • Level 2 (Developing): Periodic manual evaluations are conducted but are not automated or on a defined schedule.
  • Level 3 (Defined): Automated evaluation pipelines run on a defined schedule with results reported to model owners.
  • Level 4 (Managed): Evaluation results are trended over time; significant changes trigger review and potential remediation.
  • Level 5 (Optimizing): Evaluation sets are continuously expanded based on production failure modes; evaluation coverage is a tracked metric.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Evaluation pipeline configuration and test suite documentation listing metrics, datasets, and pass thresholds (a configuration sketch follows this list)
  • Automated evaluation run history showing test results for each model version over a defined period
  • Regression alert records for any evaluation run where performance declined from the previous version
  • Human evaluation records for spot-checks of automated evaluation accuracy
  • Evaluation dataset governance records confirming test sets are not contaminated by training data
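
As a rough illustration of the first evidence item, the pipeline configuration can be captured as a versioned artifact along the following lines. This is a sketch only; the dataclasses, dataset paths, cron schedule, and threshold values are illustrative assumptions, not part of the control.

```python
from dataclasses import dataclass, field

@dataclass
class MetricSpec:
    """One automated metric with its pass threshold."""
    name: str                      # e.g. "rouge_l", "hallucination_rate"
    threshold: float               # pass/fail cut-off
    higher_is_better: bool = True

@dataclass
class EvalSuiteConfig:
    """Evaluation pipeline configuration an assessor could review."""
    model_id: str
    datasets: dict = field(default_factory=dict)   # set name -> path
    metrics: list = field(default_factory=list)
    schedule_cron: str = "0 6 * * MON"              # weekly run; also triggered on version change

# Illustrative values only; dataset paths and thresholds are assumptions.
CONFIG = EvalSuiteConfig(
    model_id="doc-summarizer",
    datasets={
        "core_regression": "evals/core_regression.jsonl",  # held-out production-representative samples
        "adversarial": "evals/adversarial.jsonl",          # curated failure modes and injection attempts
    },
    metrics=[
        MetricSpec("rouge_l", threshold=0.52),
        MetricSpec("hallucination_rate", threshold=0.04, higher_is_better=False),
        MetricSpec("format_pass_rate", threshold=0.99),
    ],
)
```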

Implementation Notes

Key steps

  • Build and maintain a curated evaluation set that is held out from training and covers the full range of expected inputs including edge cases and adversarial examples.
  • Run evaluation on every model version change, not just on a time schedule — behavioral regressions often appear immediately after an update (see the pipeline sketch after this list).
  • Include human evaluation for subjective dimensions (helpfulness, tone, appropriateness) that automated metrics cannot capture reliably.
  • Publish evaluation results to stakeholders as part of model governance reporting — evaluation results that stay within the engineering team provide weak governance assurance.
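
A minimal sketch of how these steps might fit together into a per-version evaluation run with regression alerting. Metric computation is assumed to happen elsewhere (the caller passes in a dict of metric values), and the history file name, record fields, and threshold dicts are illustrative.

```python
import json
from pathlib import Path
from typing import Optional

HISTORY = Path("eval_history.jsonl")  # append-only run history; doubles as an audit artifact

def _last_run() -> Optional[dict]:
    """Return the most recent run record, if any."""
    if not HISTORY.exists():
        return None
    lines = HISTORY.read_text().splitlines()
    return json.loads(lines[-1]) if lines else None

def evaluate_version(model_version: str,
                     metrics: dict,
                     thresholds: dict,
                     higher_is_better: dict) -> dict:
    """Compare one version's metric values to pass thresholds and to the previous run."""
    previous = _last_run()
    record = {"model_version": model_version, "metrics": metrics,
              "failures": [], "regressions": []}

    for name, value in metrics.items():
        better = higher_is_better.get(name, True)
        # Threshold check: hard pass/fail gate for this version.
        if (better and value < thresholds[name]) or (not better and value > thresholds[name]):
            record["failures"].append(name)
        # Regression check: did performance decline relative to the previous version?
        if previous and name in previous["metrics"]:
            prev = previous["metrics"][name]
            if (better and value < prev) or (not better and value > prev):
                record["regressions"].append(name)

    with HISTORY.open("a") as fh:
        fh.write(json.dumps(record) + "\n")  # each run appends one line of evidence
    return record
```

Invoked on every model version change as well as on the weekly schedule, the appended records double as the automated run history and regression alert evidence listed under Evidence Requirements.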

Example Implementation

AI engineering team running weekly automated evals on a generative summarization model

Continuous Evaluation Pipeline — Document Summarization Model

Evaluation sets maintained:

Set Name | Size | Composition | Update Cadence
Core regression | 500 samples | Representative production inputs; curated at launch | Quarterly additions
Adversarial | 120 samples | Injection attempts, edge cases, known failure modes | After every incident
Human-eval sample | 50 samples | Random production sample scored by human raters | Weekly
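
Because these sets must stay held out from training, the "not contaminated by training data" evidence item can be supported with a simple exact-match check. A minimal sketch, assuming JSONL files with an "input" field (paths and field name are illustrative; near-duplicates would need fuzzier matching):

```python
import hashlib
import json

def _fingerprint(text: str) -> str:
    """Whitespace- and case-normalized hash of a record's input text."""
    return hashlib.sha256(" ".join(text.split()).lower().encode("utf-8")).hexdigest()

def contamination_report(train_path: str, eval_path: str) -> list:
    """Return indices of evaluation samples whose input appears verbatim in training data."""
    with open(train_path) as fh:
        train_hashes = {_fingerprint(json.loads(line)["input"]) for line in fh}
    contaminated = []
    with open(eval_path) as fh:
        for i, line in enumerate(fh):
            if _fingerprint(json.loads(line)["input"]) in train_hashes:
                contaminated.append(i)
    return contaminated
```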

Automated metrics (run weekly + on every model version change):

Metric | Threshold | Current | Trend
ROUGE-L vs. reference | > 0.52 | 0.57 | Stable
Hallucination rate (NLI-based) | < 4% | 2.8% | Stable
Format pass rate | > 99% | 99.6% | Stable
Latency p95 | < 800 ms | 620 ms | Stable
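
For reference, the two model-quality metrics above might be computed roughly as follows. This is a sketch under assumptions: it uses the rouge_score package and the transformers text-classification pipeline with an MNLI-tuned model (microsoft/deberta-large-mnli here); any comparable NLI model and label mapping would do, and long source documents would need truncation or chunking before scoring.

```python
from rouge_score import rouge_scorer
from transformers import pipeline

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def rouge_l(reference: str, summary: str) -> float:
    """ROUGE-L F1 of the generated summary against the reference summary."""
    return _rouge.score(target=reference, prediction=summary)["rougeL"].fmeasure

def is_hallucinated(source_document: str, summary: str) -> bool:
    """Flag the summary as unsupported if the source does not entail it (NLI check)."""
    result = _nli({"text": source_document, "text_pair": summary})
    top = result[0] if isinstance(result, list) else result
    return top["label"] != "ENTAILMENT"   # label names follow this model's config

def hallucination_rate(pairs: list) -> float:
    """Fraction of (source, summary) pairs flagged as hallucinated."""
    flags = [is_hallucinated(src, summ) for src, summ in pairs]
    return sum(flags) / len(flags) if flags else 0.0
```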

Human eval (weekly sample): Raters score helpfulness and accuracy 1–5; target ≥ 4.0 average
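
The weekly human-eval check reduces to an average per rated dimension compared against the 4.0 target; a small sketch (field names are illustrative):

```python
from statistics import mean

def human_eval_passes(ratings: list, target: float = 4.0) -> bool:
    """ratings: one dict per rated sample, e.g. {"helpfulness": 4, "accuracy": 5}."""
    for dimension in ("helpfulness", "accuracy"):
        if mean(r[dimension] for r in ratings) < target:
            return False
    return True
```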

Governance reporting: Results published to #model-evals Slack channel weekly; failures trigger release hold
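
A small sketch of that reporting step, assuming a Slack incoming webhook and the run-record shape from the pipeline sketch earlier; the webhook URL is a placeholder, and the release hold would hook into whatever deployment gate the team already uses.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def report_and_gate(record: dict) -> bool:
    """Post the evaluation record to the evals channel; return True if release may proceed."""
    failed = bool(record["failures"])
    text = (f"Eval run for {record['model_version']}: "
            + ("all checks passed" if not failed
               else "FAILED: " + ", ".join(record["failures"])))
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)   # Slack incoming webhooks accept this JSON shape
    return not failed             # False => hold the release
```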

Control Details

Control ID: MON-005
Typical owner: AI Engineering / AI Governance Team
Implementation effort: Medium
Agent-relevant: No

Tags

model evaluation · continuous testing · evaluation pipeline · model governance