AI Governance Institute

Practical Governance for Enterprise AI

SAF · Safety & Reliability · SAF-003 · Medium effort · Agent-relevant

AI Graceful Degradation

Define and implement fallback behavior for AI systems when they are unavailable, underperforming, or producing outputs below acceptable quality thresholds.

Objective

Maintain continuity of critical operations and prevent harm when AI systems fail by ensuring a defined, tested fallback path always exists.

Maturity Levels

  • Level 1 (Initial): No fallback exists; AI system failures cause process failures.
  • Level 2 (Developing): Fallback paths exist informally for some use cases but are not documented or tested.
  • Level 3 (Defined): Fallback procedures are documented for all AI-dependent processes, including manual process alternatives and user communication templates.
  • Level 4 (Managed): Fallback activation is tracked; recovery time is measured against defined SLAs.
  • Level 5 (Optimizing): Fallback triggers are automated based on monitored thresholds; degradation scenarios are tested quarterly.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Degradation design documentation specifying fallback behaviors, trigger conditions, and user communication approach for each AI system
  • Failover test records confirming fallback paths activate correctly under simulated failure conditions
  • Degradation event logs showing instances where fallback mode was triggered in production
  • User notification records confirming affected users were informed when AI capabilities were degraded
  • Post-degradation review records for significant events, including root cause and time-to-recovery
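It is easier to produce the degradation event logs, notification records, and time-to-recovery figures listed above when every fallback activation is written as a structured record. The sketch below is minimal Python. It assumes a hypothetical `DegradationEvent` schema, with illustrative field names that this control does not prescribe. It also assumes the SLA is expressed as a maximum time-to-recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DegradationEvent:
    """One production fallback activation (hypothetical schema)."""

    system: str                            # e.g. "support-chat"
    trigger: str                           # condition that activated the fallback
    started_at: datetime                   # when degraded mode began
    users_notified: bool = False           # supports the user notification evidence
    recovered_at: Optional[datetime] = None
    root_cause: Optional[str] = None       # filled in by the post-degradation review

    def time_to_recovery(self) -> Optional[timedelta]:
        """Elapsed degraded time, or None while the fallback is still active."""
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.started_at

    def within_sla(self, sla: timedelta) -> Optional[bool]:
        """True/False once recovered; None for events that are still open."""
        ttr = self.time_to_recovery()
        return None if ttr is None else ttr <= sla
```

Consider a hypothetical event that opens at 10:00 and recovers at 10:45. Its time-to-recovery is 45 minutes, which meets a one-hour SLA but misses a 30-minute one. Open events return `None`, so they are never counted as met or missed.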

Implementation Notes

Key steps

  • For every AI-dependent process, define what happens when the AI is unavailable or underperforming: a manual process, cached output, partial functionality, or a graceful error message.
  • Test fallback paths under realistic conditions before deployment; a fallback that fails for the first time during a real incident turns a degraded service into an outage.
  • For customer-facing AI, prepare user communications for degraded mode: be transparent about what is unavailable and what users can do instead.
  • Apply circuit breaker patterns to AI API calls: when the error count or error rate for an API crosses a defined threshold, fail fast to the fallback rather than queuing requests behind a failing service.
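The first key step amounts to an ordered list of fallback paths, tried in turn. The sketch below is minimal Python. It assumes each path is a zero-argument callable that either returns a response or signals unavailability by raising an exception or returning `None`. The function `serve_with_fallback` and the mode names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fallback path: (mode name, zero-argument callable producing a response).
# A path signals "unavailable" by raising an exception or returning None.
FallbackPath = Tuple[str, Callable[[], Optional[str]]]


def serve_with_fallback(paths: Sequence[FallbackPath], error_message: str) -> Tuple[str, str]:
    """Try each path in priority order and return (mode, response).

    If every path is unavailable, return a graceful error message
    instead of letting the failure propagate to the user.
    """
    for mode, path in paths:
        try:
            response = path()
        except Exception:
            # Deliberately broad for the sketch: any failure moves on to the next path.
            continue
        if response is not None:
            return mode, response
    return "graceful_error", error_message
```

A caller might pass hypothetical paths such as `[("ai", call_model), ("cached", read_cache), ("manual", enqueue_for_agent)]`. The function returns the first mode that succeeds, so that mode string can also feed the degradation event log. Catching every exception keeps the sketch short; a production version would catch only the errors each path is known to raise.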

Example Implementation

E-commerce platform using AI for product recommendations, customer support chat, search ranking, and order anomaly detection

Graceful Degradation Plan — AI Features

Feature | Degraded Mode | Trigger | User Communication
--- | --- | --- | ---
Product recommendations | Show rule-based bestsellers (pre-computed) | AI API error rate > 10% for 2 min | None (seamless fallback)
Support chat (AI response) | Route all chats to human queue | AI API unavailable or latency > 8 s | "We're connecting you with a support specialist"
Search ranking (AI-enhanced) | Standard text-search ranking | AI scoring service unavailable | None (seamless fallback)
Order anomaly detection | Queue all flagged orders for manual review | Model unavailable | Internal alert to the Fraud team
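The product recommendations trigger (error rate above 10% for 2 minutes) is a sliding-window check, not a consecutive-error count. The sketch below is minimal Python and assumes a hypothetical `ErrorRateTrigger` helper that records each call outcome with a timestamp. The `min_calls` floor is an added assumption, not part of the plan above. It stops a handful of early errors in a quiet period from tripping the fallback.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class ErrorRateTrigger:
    """Sliding-window error-rate trigger (hypothetical helper).

    Signals fallback when the share of failed calls in the last `window_s`
    seconds exceeds `threshold`, e.g. > 10% over 2 minutes.
    """

    def __init__(self, threshold: float = 0.10, window_s: float = 120.0, min_calls: int = 20):
        self.threshold = threshold
        self.window_s = window_s
        self.min_calls = min_calls  # assumed floor: ignore windows with too few calls
        self._calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._calls.append((now, ok))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop calls that have aged out of the window.
        while self._calls and now - self._calls[0][0] > self.window_s:
            self._calls.popleft()

    def should_fallback(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self._calls)
        if total < self.min_calls:
            return False
        errors = sum(1 for _, ok in self._calls if not ok)
        return errors / total > self.threshold
```

The trigger uses a strict "greater than", so exactly 10 failures in 100 calls does not switch to bestsellers, but an eleventh failure within the same window does. Once the window empties, the trigger clears on its own.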

Circuit breaker config: AI API calls fail fast after 3 consecutive errors or a timeout > 5 s; the circuit stays open for 60 seconds before a retry is allowed
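The circuit breaker settings above map onto a small state machine. It starts closed, opens after 3 consecutive failures, and turns half-open once 60 seconds have passed. The sketch below is minimal Python; `CircuitBreaker` and `CircuitOpenError` are hypothetical names. It assumes the API client enforces the 5-second timeout itself, for example through its request timeout setting, and raises an exception that the breaker counts as one failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the AI API while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, open_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock  # injectable so tests can advance time without sleeping
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_seconds:
            return "half_open"  # retry window: the next call goes through as a trial
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state() == "open":
            # Fail fast: do not send (or queue) a request to a failing API.
            raise CircuitOpenError("AI API circuit open; use fallback")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.consecutive_failures += 1
        # A failed trial call while half-open reopens the circuit immediately.
        if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None
```

A caller wraps each AI request in `breaker.call(...)` and catches both the API error and `CircuitOpenError`, then serves the fallback, such as pre-computed bestsellers. The injectable `clock` lets fallback tests move past the 60-second open period instantly.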

Fallback test schedule: Quarterly — each fallback mode is triggered in staging and verified to activate correctly; results logged in degradation test register
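A quarterly fallback test can be scripted as fault injection. The AI client is replaced with stubs that fail under each trigger condition, and the test checks that the degraded mode activates. The sketch below is minimal Python and covers the support-chat row. It assumes a hypothetical `route_support_chat` router whose AI client raises `ConnectionError` when the API is unavailable and `TimeoutError` when no reply arrives within 8 seconds.

```python
from typing import Callable, Dict, List, Tuple, Union

HANDOFF_MESSAGE = "We're connecting you with a support specialist"
LATENCY_LIMIT_S = 8.0

# Hypothetical AI client signature: (question, timeout in seconds) -> answer.
AskAI = Callable[[str, float], str]


def route_support_chat(question: str, ask_ai: AskAI) -> Tuple[str, str]:
    """Return (route, message): the AI answer, or a handoff to the human queue."""
    try:
        return "ai", ask_ai(question, LATENCY_LIMIT_S)
    except (ConnectionError, TimeoutError):
        # Degraded mode: AI unavailable or too slow, so hand off to a person.
        return "human_queue", HANDOFF_MESSAGE


def _unavailable(question: str, timeout_s: float) -> str:
    raise ConnectionError("simulated: AI API unavailable")


def _too_slow(question: str, timeout_s: float) -> str:
    raise TimeoutError(f"simulated: no reply within {timeout_s} s")


def run_fallback_drill() -> List[Dict[str, Union[str, bool]]]:
    """Trigger each simulated failure and record whether the fallback activated."""
    results: List[Dict[str, Union[str, bool]]] = []
    for scenario, stub in [("api_unavailable", _unavailable), ("latency_over_8s", _too_slow)]:
        route, message = route_support_chat("Where is my order?", stub)
        passed = route == "human_queue" and message == HANDOFF_MESSAGE
        results.append({"scenario": scenario, "passed": passed})
    return results
```

Each dictionary returned by `run_fallback_drill` could be appended to the degradation test register as evidence that the fallback path was exercised.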

Communication template: Pre-drafted status page and in-app banner for extended AI outage (> 30 min) stored at /runbooks/ai-degradation-comms.md

Control Details

Control ID: SAF-003
Typical owner: AI Engineering / Operations
Implementation effort: Medium effort
Agent-relevant: Yes

Tags

graceful degradation · fallback · resilience · business continuity