AI Graceful Degradation
Define and implement fallback behavior for AI systems when they are unavailable, underperforming, or producing outputs below acceptable quality thresholds.
Objective
Maintain continuity of critical operations and prevent harm when AI systems fail by ensuring a defined, tested fallback path always exists.
Maturity Levels
Initial
No fallback exists; AI system failures cause process failures.
Developing
Fallback paths exist informally for some use cases but are not documented or tested.
Defined
Fallback procedures are documented for all AI-dependent processes, including manual process alternatives and user communication templates.
Managed
Fallback activation is tracked; recovery time is measured against defined SLAs.
Optimizing
Fallback triggers are automated based on monitored thresholds; degradation scenarios are tested quarterly.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Degradation design documentation specifying fallback behaviors, trigger conditions, and user communication approach for each AI system
- Failover test records confirming fallback paths activate correctly under simulated failure conditions
- Degradation event logs showing instances where fallback mode was triggered in production
- User notification records confirming affected users were informed when AI capabilities were degraded
- Post-degradation review records for significant events, including root cause and time-to-recovery
Implementation Notes
Key steps
- For every AI-dependent process, define what happens when the AI is unavailable: a manual process, cached output, partial functionality, or a graceful error.
- Test fallback paths under realistic conditions before deployment; a fallback that is first found to be broken during an incident is costly.
- For customer-facing AI, prepare user communications for degraded mode: be transparent about what is unavailable and what users can do instead.
- Apply circuit breaker patterns for AI API calls: if an API returns errors above a threshold, fail fast to the fallback rather than queuing requests.
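The first step above (choosing between the AI output, a cached output, and a graceful error) can be sketched as an ordered fallback chain. This is a minimal illustration, not a prescribed design: `call_with_fallbacks`, `ai_recommend`, and `cached_recommend` are hypothetical names, and the simulated outage stands in for a real AI client failure.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical names throughout; none of these come from a real library.
Handler = Callable[[str], str]


def call_with_fallbacks(
    request: str,
    handlers: Sequence[Tuple[str, Handler]],
    graceful_error: str,
) -> Tuple[str, str]:
    """Try each handler in order; return (mode, response) from the first that succeeds."""
    for mode, handler in handlers:
        try:
            return mode, handler(request)
        except Exception:
            # A real system would log the failure here so that it shows up
            # in the degradation event logs.
            continue
    # Every handler failed: return the pre-written graceful error message.
    return "graceful_error", graceful_error


def ai_recommend(request: str) -> str:
    raise TimeoutError("AI service unavailable")  # simulated outage


def cached_recommend(request: str) -> str:
    return "cached bestsellers"


mode, response = call_with_fallbacks(
    "user-123",
    [("ai", ai_recommend), ("cached", cached_recommend)],
    graceful_error="Recommendations are temporarily unavailable.",
)
print(mode, response)
```

Returning the mode alongside the response lets the caller decide whether a user-facing notice is needed and gives the event log something concrete to record.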
Example Implementation
E-commerce platform using AI for product recommendations and customer support chat
Graceful Degradation Plan — AI Features
| Feature | Degraded Mode | Trigger | User Communication |
|---|---|---|---|
| Product recommendations | Show rule-based bestsellers (pre-computed) | AI API error rate > 10% for 2 min | None — seamless fallback |
| Support chat (AI response) | Route all chats to human queue | AI API unavailable or latency > 8s | "We're connecting you with a support specialist" |
| Search ranking (AI-enhanced) | Standard text-search ranking | AI scoring service unavailable | None — seamless fallback |
| Order anomaly detection | Queue all flagged orders for manual review | Model unavailable | Internal alert to Fraud team |
Circuit breaker config: AI API calls fail fast after 3 consecutive errors or timeout > 5s; circuit stays open for 60 seconds before retry
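The circuit breaker configuration above could look roughly like the following. This is a minimal sketch under stated assumptions: `CircuitBreaker` is a hypothetical class, the primary callable is expected to enforce its own 5-second timeout and raise when it is exceeded, and the clock is injectable only so the behavior can be demonstrated without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.open_seconds

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast: do not touch the AI API at all
        try:
            result = primary()  # assumed to raise on error or on timeout > 5s
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result


# Demonstration with a fake clock and a simulated failing AI call.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
calls = []


def failing_ai() -> str:
    calls.append("ai")
    raise TimeoutError("simulated timeout")


def bestsellers() -> str:
    return "bestsellers"


for _ in range(5):
    breaker.call(failing_ai, bestsellers)
print(len(calls))  # the AI is only attempted 3 times; calls 4 and 5 fail fast
```

After the 60-second open period a single trial call is let through; if it fails, the circuit re-opens immediately because the failure count is still at or above the threshold.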
Fallback test schedule: Quarterly — each fallback mode is triggered in staging and verified to activate correctly; results logged in degradation test register
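A quarterly drill of this kind might be automated by injecting a failing AI client and recording the outcome. The sketch below is illustrative only: `recommendations`, `broken_ai_client`, `run_drill`, and the in-memory `register` list are all hypothetical stand-ins for the real feature, the staging fault injection, and the degradation test register.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def recommendations(ai_client: Callable[[], List[str]]) -> Dict[str, Any]:
    """Hypothetical feature under test: AI recommendations with a bestseller fallback."""
    try:
        return {"mode": "ai", "items": ai_client()}
    except Exception:
        return {"mode": "degraded", "items": ["bestseller-1", "bestseller-2"]}


def broken_ai_client() -> List[str]:
    raise ConnectionError("simulated AI outage")


def run_drill(feature_name: str, register: List[Dict[str, Any]]) -> bool:
    """Force a failure, check that degraded mode activates, and log the result."""
    result = recommendations(broken_ai_client)
    passed = result["mode"] == "degraded" and len(result["items"]) > 0
    register.append({
        "feature": feature_name,
        "tested_at": datetime.now(timezone.utc).isoformat(),
        "passed": passed,
    })
    return passed


register: List[Dict[str, Any]] = []
drill_passed = run_drill("product_recommendations", register)
print(drill_passed, register[0]["feature"])
```

Each register entry doubles as a failover test record of the kind listed under Evidence Requirements.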
Communication template: Pre-drafted status page and in-app banner for extended AI outage (> 30 min) stored at /runbooks/ai-degradation-comms.md
Control Details
- Control ID: SAF-003
- Domain: Safety & Reliability
- Typical owner: AI Engineering / Operations
- Implementation effort: Medium
- Agent-relevant: Yes
