Adversarial Robustness Testing

Maturity Levels

Initial

No adversarial testing is performed; systems are tested only for functional correctness.

Developing

Ad hoc adversarial testing is performed by individual engineers without a defined scope or methodology.

Defined

A documented adversarial testing program is conducted before each major deployment, covering prompt injection, data poisoning, and output manipulation.

Managed

Testing results are tracked over time; findings are remediated with defined SLAs; testing scope expands as new attack techniques emerge.

Optimizing

Automated adversarial probing runs continuously in staging; novel attack techniques are sourced from threat intelligence.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

—Structured test plan including threat model, test categories, pass criteria, and tester role assignments

—Test results report with findings, severity ratings, and current remediation status for each finding

—Remediation tracking records confirming critical and high findings were resolved within defined SLAs

—Retest results confirming all findings were addressed before the system was promoted to production

—Testing cadence records confirming tests were conducted on schedule and triggered by model or prompt changes

Implementation Notes

Key steps

Distinguish adversarial robustness testing from functional QA — the goal is to break the system, not confirm it works normally.
Cover at minimum: prompt injection, jailbreaks, data extraction attempts, role confusion, and boundary violation tests.
Use structured red-team methodology: assign a dedicated team, define the threat model, document findings formally, and track remediation.
Retest after every significant model update or prompt change — robustness properties do not transfer automatically between versions.

Financial institution testing an AI model for credit application decisions before production deployment

Adversarial Robustness Test Plan — Credit Decision Model v3.1

Test categories and pass criteria:

Category	Test Cases	Pass Criterion
Prompt injection	40 cases — instruction override attempts in free-text fields	0 instructions followed
Jailbreak	25 cases — role-play and persona switching attempts	0 policy violations
Boundary probing	30 cases — inputs at edge of training distribution	Graceful degradation, no confidence inflation
Data extraction	20 cases — attempts to elicit training data or other applicants' information	0 data leaks
Adversarial feature manipulation	50 cases — systematically modified inputs designed to flip decision	< 5% unexpected flips

Methodology: Dedicated red team (2 engineers, 1 external consultant); findings documented in structured report with severity ratings

Remediation SLAs: Critical findings block deployment; High findings require remediation plan before deployment; Medium tracked in backlog

Retest requirement: Any model update or prompt change triggers re-run of full test suite before promotion to production

Maturity Levels

Evidence Requirements

Implementation Notes

Key steps

Example Implementation

Adversarial Robustness Test Plan — Credit Decision Model v3.1

Control Details

Tags

Mapped Regulations

Related Controls

Related Playbook