AI Governance Institute

Practical Governance for Enterprise AI

Monitoring & Drift · MON-001 · Low effort

AI Performance Baseline

Establish documented, quantified performance baselines for production AI systems against which ongoing performance can be compared.

Objective

Enable reliable detection of performance degradation by defining what 'normal' looks like before deployment.

Maturity Levels

1. Initial: No performance baselines exist; degradation is detected only when users complain.

2. Developing: Informal performance expectations exist but are not quantified or documented.

3. Defined: Baselines are established at deployment covering key metrics (accuracy, latency, error rate, and business outcome measures).

4. Managed: Baselines are updated whenever the model is updated; performance trends are reviewed regularly.

5. Optimizing: Baselines are statistically derived and automatically adjusted for seasonality and distributional shifts (see the sketch after this list).
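
At Level 5, thresholds come from the observed distribution of each metric rather than from fixed numbers. The sketch below shows one way this can look, assuming weekday-level seasonality and a 3-sigma band; the function names and the choice of k are illustrative, not prescribed by this control.

```python
# Illustrative sketch: statistically derived baseline bounds with a crude
# weekday seasonality adjustment. Assumes at least two observations per
# weekday; the 3-sigma band (k=3.0) is an arbitrary example choice.
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def seasonal_bounds(history: list[tuple[date, float]], k: float = 3.0) -> dict[int, tuple[float, float]]:
    """Derive (lower, upper) alert bounds per weekday from historical metric values."""
    by_weekday: dict[int, list[float]] = defaultdict(list)
    for day, value in history:
        by_weekday[day.weekday()].append(value)
    return {
        weekday: (mean(vals) - k * stdev(vals), mean(vals) + k * stdev(vals))
        for weekday, vals in by_weekday.items()
    }

def is_anomalous(day: date, value: float, bounds: dict[int, tuple[float, float]]) -> bool:
    """Flag a metric value that falls outside its weekday's derived band."""
    lower, upper = bounds[day.weekday()]
    return not (lower <= value <= upper)
```

The same idea extends to finer seasonal keys (hour of day, month) and to robust statistics such as medians when metric distributions are heavy-tailed.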

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Documented performance baseline with metric definitions, baseline values, and acceptable deviation thresholds per system (a machine-readable sketch follows this list)
  • Baseline measurement records from the evaluation period used to establish each baseline
  • Monitoring dashboard or report showing current performance vs. baseline over a defined period
  • Alert records for any metric that crossed a defined threshold, including response and resolution
  • Baseline review records confirming thresholds are revisited after major model updates or distribution shifts
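
Much of this evidence is simpler to produce when the baseline is stored as structured data rather than prose. Below is a minimal sketch of such a record in Python; the dataclass names and fields are assumptions for illustration, not a required schema.

```python
# Illustrative schema for a documented, machine-readable performance baseline.
# Field names are assumptions; adapt to the organization's evidence format.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MetricBaseline:
    name: str               # e.g. "precision"
    definition: str         # how the metric is computed
    baseline_value: float   # value measured on the evaluation set
    alert_threshold: float  # value at which an alert fires
    direction: str          # "below" or "above": which side of the threshold alerts
    action: str             # documented response when the threshold is crossed

@dataclass
class PerformanceBaseline:
    system: str
    model_version: str
    established: date
    evaluation_window_days: int
    metrics: list[MetricBaseline] = field(default_factory=list)
```

A record like this can double as the input to the monitoring job and as the artifact an assessor reviews.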

Implementation Notes

Key steps

  • Define baselines before deployment, using held-out evaluation data — baselines established after deployment are contaminated by production drift (see the sketch after this list).
  • Include business outcome metrics alongside technical metrics: an AI model may maintain high technical accuracy while its business utility degrades.
  • Set alert thresholds as a percentage deviation from baseline, not absolute values — this makes thresholds robust to model updates.
  • Stratify baselines by user segment, jurisdiction, or data category where performance may differ materially.
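
A minimal sketch of the first and third steps, assuming a hand-rolled precision metric and a 7% relative-drop threshold (both illustrative): measure each metric on the held-out set, then derive the alert floor as a fraction of the measured value.

```python
# Illustrative sketch: establish baseline values on held-out data and derive
# percentage-deviation alert floors. The metric function and the 7% figure
# are example choices, not requirements of this control.
from typing import Callable, Sequence

MetricFn = Callable[[Sequence[int], Sequence[int]], float]

def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def establish_baseline(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metrics: dict[str, MetricFn],
    max_relative_drop: float = 0.07,
) -> dict[str, dict[str, float]]:
    """Measure each metric on held-out data and derive a relative alert floor."""
    baseline = {}
    for name, fn in metrics.items():
        value = fn(y_true, y_pred)
        baseline[name] = {
            "baseline": value,
            # A relative floor stays meaningful across model updates,
            # unlike a hard-coded absolute value.
            "alert_below": value * (1.0 - max_relative_drop),
        }
    return baseline

# Usage: establish_baseline(holdout_labels, holdout_preds, {"precision": precision})
```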

Example Implementation

MLOps team establishing baselines for a fraud detection model before production launch

Performance Baseline — Fraud Detection Model v2.1

Established at deployment (2026-04-01) using a 90-day holdout evaluation set:

Metric                      | Baseline Value | Alert Threshold                      | Action
Precision                   | 0.84           | < 0.78 (7% drop)                     | Investigate; escalate to model owner
Recall                      | 0.79           | < 0.72 (9% drop)                     | Investigate; escalate to model owner
False positive rate         | 3.2%           | > 5.0%                               | Immediate escalation; consider rollback
Inference latency (p95)     | 48 ms          | > 120 ms                             | Engineering investigation
Fraud catch rate (business) | $2.1M/month    | < $1.7M/month (trend over 2 weeks)   | Business review + model assessment
Out-of-distribution rate    | 1.8%           | > 4.0%                               | Data team investigation

Stratified baselines: Separate thresholds are maintained for Card-Not-Present vs. In-Person transactions (documented in the monitoring runbook)

Baseline refresh: Baselines are re-established after any model version update or when triggered by a drift alert
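
A recurring job can mechanize the comparison this table documents. The sketch below hard-codes the example thresholds and logs a warning when one is crossed; the metric keys and the logging sink are illustrative assumptions.

```python
# Illustrative sketch: evaluate a snapshot of production metrics against the
# documented thresholds from the table above and surface the documented action.
import logging

logger = logging.getLogger("mon-001")

# (metric key, breach test, documented action), mirroring the example table
CHECKS = [
    ("precision",      lambda v: v < 0.78,  "Investigate; escalate to model owner"),
    ("recall",         lambda v: v < 0.72,  "Investigate; escalate to model owner"),
    ("false_pos_rate", lambda v: v > 0.050, "Immediate escalation; consider rollback"),
    ("latency_p95_ms", lambda v: v > 120,   "Engineering investigation"),
    ("ood_rate",       lambda v: v > 0.040, "Data team investigation"),
]

def run_baseline_checks(current: dict[str, float]) -> list[str]:
    """Return the actions triggered by the current metric snapshot."""
    triggered = []
    for name, breached, action in CHECKS:
        value = current.get(name)
        if value is not None and breached(value):
            logger.warning("MON-001 alert: %s=%.3f -> %s", name, value, action)
            triggered.append(action)
    return triggered
```

The business-outcome check (fraud catch rate over a 2-week trend) needs windowed data rather than a point value, so it would live in a separate trend job.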

Control Details

Control ID: MON-001
Typical owner: AI Engineering / MLOps
Implementation effort: Low
Agent-relevant: No

Tags

performance baseline · monitoring · MLOps · model performance