AI Performance Baseline
Establish documented, quantified performance baselines for production AI systems against which ongoing performance can be compared.
Objective
Enable reliable detection of performance degradation by defining what 'normal' looks like before deployment.
Maturity Levels
Initial
No performance baselines exist; degradation is detected only when users complain.
Developing
Informal performance expectations exist but are not quantified or documented.
Defined
Baselines are established at deployment, covering key metrics: accuracy, latency, error rate, and business outcome measures.
Managed
Baselines are updated when models are updated; performance trends are reviewed regularly.
Optimizing
Baselines are statistically derived and automatically adjusted for seasonality and distributional shifts.
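At the Optimizing level, "statistically derived" typically means thresholds computed from recent history rather than fixed once at deployment. A minimal sketch of one such approach, keeping separate per-weekday history so weekly seasonality does not trigger false alerts; the window size, minimum-history rule, and 3-sigma band are illustrative assumptions, not requirements of this control.

```python
from collections import defaultdict, deque
from statistics import mean, stdev

class SeasonalBaseline:
    """Per-weekday rolling baseline: alerts only on deviation from recent
    history of the *same* weekday, absorbing weekly seasonality."""

    def __init__(self, window: int = 8):
        # Keep the last `window` observations separately for each weekday.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, weekday: int, value: float) -> None:
        self.history[weekday].append(value)

    def is_anomalous(self, weekday: int, value: float) -> bool:
        obs = self.history[weekday]
        if len(obs) < 4:                     # too little history: never alert
            return False
        mu, sigma = mean(obs), stdev(obs)
        return abs(value - mu) > 3 * sigma   # outside a 3-sigma band

b = SeasonalBaseline()
for v in (0.84, 0.83, 0.85, 0.84, 0.83):
    b.update(weekday=5, value=v)             # Saturday readings
print(b.is_anomalous(weekday=5, value=0.60)) # True: far outside the band
```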
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Documented performance baseline with metric definitions, baseline values, and acceptable deviation thresholds per system
- Baseline measurement records from the evaluation period used to establish each baseline
- Monitoring dashboard or report showing current performance vs. baseline over a defined period
- Alert records for any metric that crossed a defined threshold, including response and resolution
- Baseline review records confirming thresholds are revisited after major model updates or distribution shifts
Implementation Notes
Key steps
- Define baselines before deployment, using held-out evaluation data — baselines established after deployment are contaminated by production drift.
- Include business outcome metrics alongside technical metrics: an AI model may maintain high technical accuracy while its business utility degrades.
- Set alert thresholds as a percentage deviation from baseline rather than as absolute values; relative thresholds carry over cleanly when baselines are re-established after model updates (see the sketch after this list).
- Stratify baselines by user segment, jurisdiction, or data category where performance may differ materially.
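A minimal sketch of how percentage-deviation thresholds and stratified baselines from the steps above might be encoded. The MetricBaseline structure, the metric names, and the in-person values are hypothetical illustrations, not part of any specific monitoring product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricBaseline:
    """One documented baseline: a value plus an allowed relative deviation."""
    baseline_value: float
    max_relative_drop: float   # e.g. 0.07 alerts on a drop of more than 7%
    higher_is_better: bool = True

    def alert_threshold(self) -> float:
        """Absolute value at which the metric should raise an alert."""
        if self.higher_is_better:
            return self.baseline_value * (1.0 - self.max_relative_drop)
        return self.baseline_value * (1.0 + self.max_relative_drop)

    def breaches(self, observed: float) -> bool:
        """True when the observed value is past the alert threshold."""
        if self.higher_is_better:
            return observed < self.alert_threshold()
        return observed > self.alert_threshold()

# Stratified baselines keyed by (segment, metric); the in-person values
# are hypothetical, standing in for what a monitoring runbook would hold.
baselines = {
    ("card_not_present", "precision"): MetricBaseline(0.84, 0.07),
    ("in_person", "precision"): MetricBaseline(0.91, 0.07),
    ("card_not_present", "false_positive_rate"): MetricBaseline(
        0.032, 0.50, higher_is_better=False),
}

if baselines[("card_not_present", "precision")].breaches(observed=0.77):
    print("ALERT: CNP precision below threshold; escalate to model owner")
```

Because each threshold is stored as a relative drop, re-establishing a baseline after a model update only changes baseline_value; the documented tolerance carries over unchanged.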
Example Implementation
MLOps team establishing baselines for a fraud detection model before production launch
Performance Baseline — Fraud Detection Model v2.1
Established at deployment (2026-04-01) using a 90-day holdout evaluation set:
| Metric | Baseline Value | Alert Threshold | Action |
|---|---|---|---|
| Precision | 0.84 | < 0.78 (7% drop) | Investigate; escalate to model owner |
| Recall | 0.79 | < 0.73 (7% drop) | Investigate; escalate to model owner |
| False positive rate | 3.2% | > 5.0% | Immediate escalation; consider rollback |
| Inference latency (p95) | 48ms | > 120ms | Engineering investigation |
| Fraud catch rate (business) | $2.1M/month | < $1.7M/month (trend over 2 weeks) | Business review + model assessment |
| Out-of-distribution rate | 1.8% | > 4.0% | Data team investigation |
Stratified baselines: Separate thresholds maintained for Card-Not-Present vs. In-Person transactions (documented in monitoring runbook)
Baseline refresh: Re-established after any model version update or when triggered by a drift alert
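As one way to make the table above machine-checkable, the sketch below encodes each row as a threshold, a comparison direction, and its documented action. The check_against_baseline helper and the metric key names are assumptions for illustration; the fraud catch rate is omitted because the table defines it as a two-week trend rather than a point-in-time check.

```python
# Hypothetical encoding of the baseline table above; thresholds and
# actions mirror the table, metric key names are illustrative.
BASELINE_FRAUD_V2_1 = {
    # metric: (threshold, direction, documented action)
    "precision":           (0.78,  "min", "Investigate; escalate to model owner"),
    "recall":              (0.73,  "min", "Investigate; escalate to model owner"),
    "false_positive_rate": (0.050, "max", "Immediate escalation; consider rollback"),
    "latency_p95_ms":      (120.0, "max", "Engineering investigation"),
    "ood_rate":            (0.040, "max", "Data team investigation"),
}

def check_against_baseline(current: dict[str, float]) -> list[str]:
    """Return the documented action for every metric outside its threshold."""
    alerts = []
    for metric, (threshold, direction, action) in BASELINE_FRAUD_V2_1.items():
        value = current.get(metric)
        if value is None:
            alerts.append(f"{metric}: no reading received; treat as an alert")
        elif direction == "min" and value < threshold:
            alerts.append(f"{metric}={value} below {threshold}: {action}")
        elif direction == "max" and value > threshold:
            alerts.append(f"{metric}={value} above {threshold}: {action}")
    return alerts

# Example reading from a daily monitoring job (recall has degraded).
print(check_against_baseline({
    "precision": 0.81, "recall": 0.71, "false_positive_rate": 0.036,
    "latency_p95_ms": 52.0, "ood_rate": 0.021,
}))
```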
Control Details
- Control ID
- MON-001
- Domain
- Monitoring & Drift
- Typical owner
- AI Engineering / MLOps
- Implementation effort
- Low effort
- Agent-relevant
- No
