Adversarial Robustness Testing
Systematically test AI systems against adversarial inputs, edge cases, and known attack techniques before deployment and on a recurring basis.
Objective
Identify vulnerabilities in AI system behavior under adversarial conditions before they are exploited in production.
Maturity Levels
Initial
No adversarial testing is performed; systems are tested only for functional correctness.
Developing
Ad hoc adversarial testing is performed by individual engineers without a defined scope or methodology.
Defined
A documented adversarial testing program is conducted before each major deployment, covering prompt injection, data poisoning, and output manipulation.
Managed
Testing results are tracked over time; findings are remediated with defined SLAs; testing scope expands as new attack techniques emerge.
Optimizing
Automated adversarial probing runs continuously in staging; novel attack techniques are sourced from threat intelligence.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- —Structured test plan including threat model, test categories, pass criteria, and tester role assignments
- —Test results report with findings, severity ratings, and current remediation status for each finding
- —Remediation tracking records confirming critical and high findings were resolved within defined SLAs
- —Retest results confirming all findings were addressed before the system was promoted to production
- —Testing cadence records confirming tests were conducted on schedule and triggered by model or prompt changes
Implementation Notes
Key steps
- Distinguish adversarial robustness testing from functional QA — the goal is to break the system, not confirm it works normally.
- Cover at minimum: prompt injection, jailbreaks, data extraction attempts, role confusion, and boundary violation tests.
- Use structured red-team methodology: assign a dedicated team, define the threat model, document findings formally, and track remediation.
- Retest after every significant model update or prompt change — robustness properties do not transfer automatically between versions.
Example Implementation
Financial institution testing an AI model for credit application decisions before production deployment
Adversarial Robustness Test Plan — Credit Decision Model v3.1
Test categories and pass criteria:
| Category | Test Cases | Pass Criterion |
|---|---|---|
| Prompt injection | 40 cases — instruction override attempts in free-text fields | 0 instructions followed |
| Jailbreak | 25 cases — role-play and persona switching attempts | 0 policy violations |
| Boundary probing | 30 cases — inputs at edge of training distribution | Graceful degradation, no confidence inflation |
| Data extraction | 20 cases — attempts to elicit training data or other applicants' information | 0 data leaks |
| Adversarial feature manipulation | 50 cases — systematically modified inputs designed to flip decision | < 5% unexpected flips |
Methodology: Dedicated red team (2 engineers, 1 external consultant); findings documented in structured report with severity ratings
Remediation SLAs: Critical findings block deployment; High findings require remediation plan before deployment; Medium tracked in backlog
Retest requirement: Any model update or prompt change triggers re-run of full test suite before promotion to production
Control Details
- Control ID
- SEC-005
- Domain
- Security
- Typical owner
- AI Security / Red Team
- Implementation effort
- High effort
- Agent-relevant
- Yes
