Prompt Injection Prevention
Detect and block adversarial inputs designed to override AI system instructions, extract sensitive information, or cause the model to behave in unintended ways.
Objective
Protect AI systems from being hijacked through crafted inputs that manipulate model behavior beyond its intended scope.
Maturity Levels
Initial
No injection defenses exist; all user inputs are passed directly to the model.
Developing
Basic input filtering is in place but rules are incomplete and not formally tested.
Defined
Documented injection prevention controls include input validation, instruction separation, and output monitoring.
Managed
Injection attempt detection is automated; attempted attacks are logged and reviewed.
Optimizing
Red team exercises proactively test new injection techniques; defenses are updated continuously.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Input sanitization configuration listing active filtering rules and structural content-boundary enforcement
- Monthly red team test results: test case count, pass/fail outcome, and remediation records for any failures
- Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts
- Output monitoring configuration showing permitted-topic controls and flagging thresholds
- Post-remediation retest results confirming identified weaknesses were resolved before returning to production
Implementation Notes
Key steps
- Separate system instructions from user-supplied content structurally — never concatenate user input into the instruction context without clear demarcation.
- Implement input validation that flags common injection patterns (role-playing instructions, instruction override phrases, delimiter injection).
- Test for indirect injection: content retrieved from external sources (web, documents, emails) should be treated as untrusted data, not trusted instructions.
- Monitor outputs for signs of injection success: unexpected role changes, policy violations, out-of-scope content.
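The first and third steps above can be sketched together: keep instructions in the system role and wrap all user-supplied or retrieved content in labelled blocks so the model receives it as data, not instructions. This is a minimal illustration; the function and helper names are hypothetical, not part of any specific API.

```python
def escape_delimiters(text: str) -> str:
    # Prevent untrusted content from closing our labelled blocks early.
    return text.replace("<", "&lt;").replace(">", "&gt;")

def build_messages(system_prompt: str, user_question: str,
                   retrieved: list[str]) -> list[dict]:
    """Keep instructions exclusively in the system role; wrap untrusted
    retrieved content in labelled blocks in the user turn."""
    untrusted = "\n".join(
        f"<retrieved source_id={i}>\n{escape_delimiters(doc)}\n</retrieved>"
        for i, doc in enumerate(retrieved)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"{untrusted}\n\n<question>\n{user_question}\n</question>"},
    ]
```

Escaping angle brackets in retrieved content is one way to stop a malicious document from forging a closing tag and "breaking out" of its data block.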
Example Implementation
B2B SaaS company with a customer-facing AI assistant that processes user-supplied documents
Prompt Injection Prevention Controls — Document Q&A Assistant
Structural separation: User documents are passed in a clearly labelled <document> block in the user turn. System instructions live exclusively in the system prompt. The two are never concatenated.
Input validation rules (applied before model receives content):
- Flag and quarantine inputs containing: "ignore previous instructions", "you are now", "disregard your", "system prompt:", "new instructions:"
- Strip HTML/XML tags from user-supplied text before inserting it into the prompt template
- Enforce 8,000-token input limit per document chunk
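The three validation rules above could be implemented along these lines. The pattern list mirrors the phrases given in the control; it is illustrative, not exhaustive, and the function names are assumptions for the sketch.

```python
import re

# Flagged phrases taken from the rules above; real deployments would
# maintain and test a much larger pattern set.
INJECTION_PATTERNS = [
    r"ignore (all |any )?previous instructions",
    r"you are now",
    r"disregard your",
    r"system prompt:",
    r"new instructions:",
]
TAG_RE = re.compile(r"<[^>]+>")
MAX_TOKENS = 8000  # per-document-chunk input limit

def validate_chunk(text: str, token_count: int) -> tuple[bool, list[str]]:
    """Return (accepted, reasons); any reason means quarantine."""
    reasons = []
    lowered = text.lower()
    for pat in INJECTION_PATTERNS:
        if re.search(pat, lowered):
            reasons.append(f"flagged pattern: {pat}")
    if token_count > MAX_TOKENS:
        reasons.append("exceeds 8,000-token chunk limit")
    return (not reasons, reasons)

def strip_markup(text: str) -> str:
    # Remove HTML/XML tags before the text enters the prompt template.
    return TAG_RE.sub("", text)
```

Pattern matching of this kind catches only known phrasings; it is a first-line filter, not a substitute for structural separation and output monitoring.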
Output monitoring: Model responses are checked against a permitted topic allowlist. Responses discussing system instructions, other customers' data, or unrelated domains are logged and held for review.
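A minimal sketch of the hold-for-review logic, assuming some upstream component (a small classifier or keyword heuristic, not specified here) has assigned each response a topic label. The topic names and logging helper are hypothetical.

```python
# Illustrative permitted-topic allowlist for a document Q&A assistant.
PERMITTED_TOPICS = {"document_qa", "billing", "product_usage"}

def log_for_review(response: str, reason: str) -> None:
    # Placeholder for the team's real logging/escalation pipeline.
    print(f"HELD ({reason}): {response[:80]}")

def review_response(response: str, topic: str) -> str:
    """Return 'release' or 'hold'; held responses are logged for review."""
    if topic not in PERMITTED_TOPICS:
        log_for_review(response, reason=f"off-allowlist topic: {topic}")
        return "hold"
    return "release"
```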
Testing cadence: Monthly red team exercise — 30 injection test cases submitted via the customer-facing interface; pass criterion: 0 successful instruction overrides, 0 system prompt leaks
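The pass criterion above can be checked mechanically once each red-team case records its outcome. This is a sketch under the assumption that each test is scored for instruction override and prompt leakage; the types are illustrative.

```python
from dataclasses import dataclass

@dataclass
class InjectionTest:
    name: str
    override_succeeded: bool
    prompt_leaked: bool

def evaluate_exercise(tests: list[InjectionTest]) -> dict:
    """Summarize a red-team run against the control's pass criterion:
    zero successful overrides and zero system prompt leaks."""
    overrides = sum(t.override_succeeded for t in tests)
    leaks = sum(t.prompt_leaked for t in tests)
    return {
        "cases": len(tests),
        "overrides": overrides,
        "leaks": leaks,
        "passed": overrides == 0 and leaks == 0,
    }
```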
Control Details
- Control ID
- SEC-001
- Domain
- Security
- Typical owner
- CISO / AI Engineering
- Implementation effort
- Medium
- Agent-relevant
- Yes
