AI Governance Institute

aigovernance.com — Global AI Regulation & Framework Directory


Question 20 of 24

Is our AI red-teaming rigorous enough?

Defining pass/fail criteria for adversarial testing of high-risk AI systems before deployment, covering toxicity, data leakage, jailbreaking, and misuse scenarios.

Red-teaming is not optional for high-risk AI

Red-teaming (adversarial testing designed to find failure modes before deployment) is increasingly expected by regulators and required by internal governance standards for high-risk AI systems. The EU AI Act requires accuracy, robustness, and cybersecurity testing for high-risk systems, and NIST's AI RMF and its Generative AI Profile both address adversarial testing as a component of responsible deployment.

A red-team exercise that does not find anything is usually an exercise that did not try hard enough. Effective red-teaming requires people who are actively trying to make the system fail, with the creativity and persistence to find non-obvious failure modes rather than just checking a predetermined list.

What red-teaming should cover

For generative AI systems, red-teaming should cover: jailbreaking attempts designed to bypass safety controls and produce prohibited content; prompt injection attacks that attempt to override system instructions through user input; data extraction probes designed to elicit training data or confidential system information; and misuse scenarios specific to the deployment context, such as generating fraudulent documents or manipulating decisions.
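The categories above can be organized as a probe suite that is run against the system and scored automatically. The sketch below is a minimal, hypothetical harness: the `respond` and `violates` callables stand in for your model endpoint and your policy checker, and the probe strings are illustrative only.

```python
# Minimal sketch of a generative-AI red-team probe runner.
# The probe categories, prompts, and the respond/violates callables are
# illustrative assumptions, not a standard API.
from dataclasses import dataclass
from typing import Callable, List, Dict


@dataclass
class Probe:
    category: str   # e.g. "jailbreak", "prompt_injection", "data_extraction"
    prompt: str


def run_probes(probes: List[Probe],
               respond: Callable[[str], str],
               violates: Callable[[str, str], bool]) -> List[Dict[str, str]]:
    """Send each probe to the system and record policy-violating responses."""
    findings = []
    for p in probes:
        response = respond(p.prompt)
        if violates(p.category, response):
            findings.append({"category": p.category,
                             "prompt": p.prompt,
                             "response": response})
    return findings
```

In practice the probe list should be curated per deployment context and extended as new attack patterns emerge; a fixed list degenerates into the predetermined checklist the previous section warns against.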

For decision-making AI systems, adversarial testing should cover: edge cases at classification boundaries where small input changes produce dramatically different outputs; inputs designed to exploit known model weaknesses or biases; and robustness testing against distribution shift and out-of-distribution inputs that the model was not trained on.
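The edge-case check described above can be sketched as a boundary-stability probe: nudge each input feature by a small amount and flag decisions that flip. This is a simplified sketch assuming a `predict` function that returns a discrete label; real robustness testing would also cover multi-feature and distribution-shift perturbations.

```python
# Sketch of a boundary-stability check for a decision model: perturb each
# feature by +/- epsilon and record inputs whose decision flips.
# The predict-returns-a-label interface is an assumption for illustration.
from typing import Callable, List, Sequence, Tuple


def boundary_flips(predict: Callable[[Sequence[float]], int],
                   inputs: List[Sequence[float]],
                   epsilon: float = 0.01) -> List[Tuple[tuple, int, float]]:
    """Return (input, feature_index, delta) for every decision that flips
    under a single-feature perturbation of size epsilon."""
    unstable = []
    for x in inputs:
        base = predict(x)
        for i in range(len(x)):
            for delta in (-epsilon, epsilon):
                nudged = list(x)
                nudged[i] += delta
                if predict(nudged) != base:
                    unstable.append((tuple(x), i, delta))
    return unstable
```

A non-empty result is not automatically a failure (decision boundaries exist by definition), but flips concentrated near real applicant or case populations indicate fragility worth investigating.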

Pass/fail criteria and documentation

Define pass/fail criteria before red-teaming begins, not after. What constitutes an unacceptable failure for your specific use case? A content moderation system and a loan decision system have very different failure tolerances. Criteria should be specific, measurable, and tied to the risk level and regulatory requirements of the system.
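Pre-registering criteria can be as simple as committing a threshold table before the exercise and evaluating results against it afterward. The sketch below assumes per-category attack-success-rate thresholds; the category names and numbers are illustrative, not regulatory values.

```python
# Sketch: pass/fail criteria defined as data before red-teaming begins.
# Categories and thresholds are illustrative and must be set per system
# based on its risk level and regulatory requirements.
from typing import Dict, List, Tuple

CRITERIA: Dict[str, float] = {
    "jailbreak": 0.0,         # any successful jailbreak fails the gate
    "prompt_injection": 0.0,
    "data_extraction": 0.0,
    "toxicity": 0.01,         # up to 1% borderline outputs tolerated (example)
}


def deployment_gate(results: Dict[str, Tuple[int, int]]
                    ) -> Tuple[bool, List[Tuple[str, float]]]:
    """results maps category -> (successful_attacks, attempts).
    Returns (passed, list of failing categories with observed rates).
    Unknown categories default to a zero-tolerance threshold."""
    failures = []
    for category, (successes, attempts) in results.items():
        rate = successes / attempts if attempts else 0.0
        if rate > CRITERIA.get(category, 0.0):
            failures.append((category, rate))
    return (not failures, failures)
```

Encoding the criteria as data makes the pre-registration auditable: the thresholds can be version-controlled and timestamped before testing starts.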

Document all red-team findings, including the attacks attempted, the results, the severity assessment, and the remediation taken. Findings that do not result in immediate remediation should be tracked as known risks with owners and timelines. Re-test after remediation to verify that fixes are effective and have not introduced new failure modes. Red-team documentation becomes part of the system's audit trail and may be reviewed by regulators or in litigation.
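A finding record that captures the fields above might look like the following sketch. The field names are illustrative; align them with your organization's risk register and audit-trail conventions.

```python
# Sketch of a red-team finding record for the audit trail. Field names
# are illustrative assumptions, not a prescribed schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    attack: str                         # what was attempted
    result: str                         # what happened
    severity: str                       # e.g. "low" / "medium" / "high"
    remediation: Optional[str] = None   # fix taken, if any
    owner: Optional[str] = None         # accountable owner for known risks
    due: Optional[str] = None           # remediation deadline (ISO date)
    retest_passed: Optional[bool] = None

    def is_open_risk(self) -> bool:
        """A finding stays open until it is remediated AND the fix is
        verified by re-testing, per the process described above."""
        return not (self.remediation and self.retest_passed)
```

Keeping `retest_passed` as a separate field enforces the distinction between "we shipped a fix" and "we verified the fix works and introduced no new failure modes".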