AI Performance Baseline
Establish documented, quantified performance baselines for production AI systems against which ongoing performance can be compared.
Objective
Enable reliable detection of performance degradation by defining what 'normal' looks like before deployment.
Maturity Levels
Initial
No performance baselines exist; degradation is detected only when users complain.
Developing
Informal performance expectations exist but are not quantified or documented.
Defined
Baselines are established at deployment, covering key metrics: accuracy, latency, error rate, and business outcome measures.
Managed
Baselines are updated when models are updated; performance trends are reviewed regularly.
Optimizing
Baselines are statistically derived and automatically adjusted for seasonality and distributional shifts.
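At the Optimizing level, "statistically derived" typically means thresholds computed from recent history rather than fixed once at deployment. A minimal sketch of one such approach, keeping separate per-weekday history so weekly seasonality does not trigger false alerts; the window size, minimum-history rule, and 3-sigma band are illustrative assumptions, not requirements of this control.

```python
from collections import defaultdict, deque
from statistics import mean, stdev

class SeasonalBaseline:
    """Per-weekday rolling baseline: alerts only on deviation from recent
    history of the *same* weekday, absorbing weekly seasonality."""

    def __init__(self, window: int = 8):
        # Keep the last `window` observations separately for each weekday.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, weekday: int, value: float) -> None:
        self.history[weekday].append(value)

    def is_anomalous(self, weekday: int, value: float) -> bool:
        obs = self.history[weekday]
        if len(obs) < 4:                     # too little history: never alert
            return False
        mu, sigma = mean(obs), stdev(obs)
        return abs(value - mu) > 3 * sigma   # outside a 3-sigma band

b = SeasonalBaseline()
for v in (0.84, 0.83, 0.85, 0.84, 0.83):
    b.update(weekday=5, value=v)             # Saturday readings
print(b.is_anomalous(weekday=5, value=0.60)) # True: far outside the band
```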
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Documented performance baseline with metric definitions, baseline values, and acceptable deviation thresholds per system
- Baseline measurement records from the evaluation period used to establish each baseline
- Monitoring dashboard or report showing current performance vs. baseline over a defined period
- Alert records for any metric that crossed a defined threshold, including response and resolution
- Baseline review records confirming thresholds are revisited after major model updates or distribution shifts
Implementation Notes
Key steps
- Define baselines before deployment, using held-out evaluation data — baselines established after deployment are contaminated by production drift.
- Include business outcome metrics alongside technical metrics: an AI model may maintain high technical accuracy while its business utility degrades.
- Set alert thresholds as a percentage deviation from baseline rather than as absolute values; relative thresholds carry over cleanly when baselines are re-established after model updates (see the sketch after this list).
- Stratify baselines by user segment, jurisdiction, or data category where performance may differ materially.
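A minimal sketch of how percentage-deviation thresholds and stratified baselines from the steps above might be encoded. The MetricBaseline structure, the metric names, and the in-person values are hypothetical illustrations, not part of any specific monitoring product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricBaseline:
    """One documented baseline: a value plus an allowed relative deviation."""
    baseline_value: float
    max_relative_drop: float   # e.g. 0.07 alerts on a drop of more than 7%
    higher_is_better: bool = True

    def alert_threshold(self) -> float:
        """Absolute value at which the metric should raise an alert."""
        if self.higher_is_better:
            return self.baseline_value * (1.0 - self.max_relative_drop)
        return self.baseline_value * (1.0 + self.max_relative_drop)

    def breaches(self, observed: float) -> bool:
        """True when the observed value is past the alert threshold."""
        if self.higher_is_better:
            return observed < self.alert_threshold()
        return observed > self.alert_threshold()

# Stratified baselines keyed by (segment, metric); the in-person values
# are hypothetical, standing in for what a monitoring runbook would hold.
baselines = {
    ("card_not_present", "precision"): MetricBaseline(0.84, 0.07),
    ("in_person", "precision"): MetricBaseline(0.91, 0.07),
    ("card_not_present", "false_positive_rate"): MetricBaseline(
        0.032, 0.50, higher_is_better=False),
}

if baselines[("card_not_present", "precision")].breaches(observed=0.77):
    print("ALERT: CNP precision below threshold; escalate to model owner")
```

Because each threshold is stored as a relative drop, re-establishing a baseline after a model update only changes baseline_value; the documented tolerance carries over unchanged.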
Example Implementation
MLOps team establishing baselines for a fraud detection model before production launch
Performance Baseline — Fraud Detection Model v2.1
Established at deployment (2026-04-01) using a 90-day holdout evaluation set:
| Metric | Baseline Value | Alert Threshold | Action |
|---|---|---|---|
| Precision | 0.84 | < 0.78 (7% drop) | Investigate; escalate to model owner |
| Recall | 0.79 | < 0.73 (7% drop) | Investigate; escalate to model owner |
| False positive rate | 3.2% | > 5.0% | Immediate escalation; consider rollback |
| Inference latency (p95) | 48ms | > 120ms | Engineering investigation |
| Fraud catch rate (business) | $2.1M/month | < $1.7M/month (trend over 2 weeks) | Business review + model assessment |
| Out-of-distribution rate | 1.8% | > 4.0% | Data team investigation |
Stratified baselines: Separate thresholds maintained for Card-Not-Present vs. In-Person transactions (documented in monitoring runbook)
Baseline refresh: Re-established after any model version update or when triggered by a drift alert
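As one way to make the table above machine-checkable, the sketch below encodes each row as a threshold, a comparison direction, and its documented action. The check_against_baseline helper and the metric key names are assumptions for illustration; the fraud catch rate is omitted because the table defines it as a two-week trend rather than a point-in-time check.

```python
# Hypothetical encoding of the baseline table above; thresholds and
# actions mirror the table, metric key names are illustrative.
BASELINE_FRAUD_V2_1 = {
    # metric: (threshold, direction, documented action)
    "precision":           (0.78,  "min", "Investigate; escalate to model owner"),
    "recall":              (0.73,  "min", "Investigate; escalate to model owner"),
    "false_positive_rate": (0.050, "max", "Immediate escalation; consider rollback"),
    "latency_p95_ms":      (120.0, "max", "Engineering investigation"),
    "ood_rate":            (0.040, "max", "Data team investigation"),
}

def check_against_baseline(current: dict[str, float]) -> list[str]:
    """Return the documented action for every metric outside its threshold."""
    alerts = []
    for metric, (threshold, direction, action) in BASELINE_FRAUD_V2_1.items():
        value = current.get(metric)
        if value is None:
            alerts.append(f"{metric}: no reading received; treat as an alert")
        elif direction == "min" and value < threshold:
            alerts.append(f"{metric}={value} below {threshold}: {action}")
        elif direction == "max" and value > threshold:
            alerts.append(f"{metric}={value} above {threshold}: {action}")
    return alerts

# Example reading from a daily monitoring job (recall has degraded).
print(check_against_baseline({
    "precision": 0.81, "recall": 0.71, "false_positive_rate": 0.036,
    "latency_p95_ms": 52.0, "ood_rate": 0.021,
}))
```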
Control Details
- Control ID
- MON-001
- Domain
- Monitoring & Drift
- Typical owner
- AI Engineering / MLOps
- Implementation effort
- Low effort
- Agent-relevant
- No
