Prompt Injection Prevention
Detect and block adversarial inputs designed to override AI system instructions, extract sensitive information, or cause the model to behave in unintended ways.
Objective
Protect AI systems from being hijacked through crafted inputs that manipulate model behavior beyond its intended scope.
Maturity Levels
Initial
No injection defenses exist; all user inputs are passed directly to the model.
Developing
Basic input filtering is in place but rules are incomplete and not formally tested.
Defined
Documented injection prevention controls include input validation, instruction separation, and output monitoring.
Managed
Injection attempt detection is automated; attempted attacks are logged and reviewed.
Optimizing
Red team exercises proactively test new injection techniques; defenses are updated continuously.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Input sanitization configuration listing active filtering rules and structural content-boundary enforcement
- Monthly red team test results: test case count, pass/fail outcome, and remediation records for any failures
- Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts
- Output monitoring configuration showing permitted-topic controls and flagging thresholds
- Post-remediation retest results confirming identified weaknesses were resolved before returning to production
Implementation Notes
Key steps
- Separate system instructions from user-supplied content structurally — never concatenate user input into the instruction context without clear demarcation.
- Implement input validation that flags common injection patterns (role-playing instructions, instruction override phrases, delimiter injection).
- Test for indirect injection: content retrieved from external sources (web, documents, emails) should be treated as untrusted data, not trusted instructions.
- Monitor outputs for signs of injection success: unexpected role changes, policy violations, out-of-scope content.
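The first and third steps above can be sketched together: keep instructions in the system role and wrap all user-supplied or retrieved content in labelled blocks so the model receives it as data, not instructions. This is a minimal illustration; the function and helper names are hypothetical, not part of any specific API.

```python
def escape_delimiters(text: str) -> str:
    # Prevent untrusted content from closing our labelled blocks early.
    return text.replace("<", "&lt;").replace(">", "&gt;")

def build_messages(system_prompt: str, user_question: str,
                   retrieved: list[str]) -> list[dict]:
    """Keep instructions exclusively in the system role; wrap untrusted
    retrieved content in labelled blocks in the user turn."""
    untrusted = "\n".join(
        f"<retrieved source_id={i}>\n{escape_delimiters(doc)}\n</retrieved>"
        for i, doc in enumerate(retrieved)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"{untrusted}\n\n<question>\n{user_question}\n</question>"},
    ]
```

Escaping angle brackets in retrieved content is one way to stop a malicious document from forging a closing tag and "breaking out" of its data block.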
Example Implementation
B2B SaaS company with a customer-facing AI assistant that processes user-supplied documents
Prompt Injection Prevention Controls — Document Q&A Assistant
Structural separation: User documents are passed in a clearly labelled <document> block in the user turn. System instructions live exclusively in the system prompt. The two are never concatenated.
Input validation rules (applied before model receives content):
- Flag and quarantine inputs containing: "ignore previous instructions", "you are now", "disregard your", "system prompt:", "new instructions:"
- Strip HTML/XML tags from user-supplied text before inserting it into the prompt template
- Enforce 8,000-token input limit per document chunk
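The three validation rules above could be implemented along these lines. The pattern list mirrors the phrases given in the control; it is illustrative, not exhaustive, and the function names are assumptions for the sketch.

```python
import re

# Flagged phrases taken from the rules above; real deployments would
# maintain and test a much larger pattern set.
INJECTION_PATTERNS = [
    r"ignore (all |any )?previous instructions",
    r"you are now",
    r"disregard your",
    r"system prompt:",
    r"new instructions:",
]
TAG_RE = re.compile(r"<[^>]+>")
MAX_TOKENS = 8000  # per-document-chunk input limit

def validate_chunk(text: str, token_count: int) -> tuple[bool, list[str]]:
    """Return (accepted, reasons); any reason means quarantine."""
    reasons = []
    lowered = text.lower()
    for pat in INJECTION_PATTERNS:
        if re.search(pat, lowered):
            reasons.append(f"flagged pattern: {pat}")
    if token_count > MAX_TOKENS:
        reasons.append("exceeds 8,000-token chunk limit")
    return (not reasons, reasons)

def strip_markup(text: str) -> str:
    # Remove HTML/XML tags before the text enters the prompt template.
    return TAG_RE.sub("", text)
```

Pattern matching of this kind catches only known phrasings; it is a first-line filter, not a substitute for structural separation and output monitoring.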
Output monitoring: Model responses are checked against a permitted topic allowlist. Responses discussing system instructions, other customers' data, or unrelated domains are logged and held for review.
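A minimal sketch of the hold-for-review logic, assuming some upstream component (a small classifier or keyword heuristic, not specified here) has assigned each response a topic label. The topic names and logging helper are hypothetical.

```python
# Illustrative permitted-topic allowlist for a document Q&A assistant.
PERMITTED_TOPICS = {"document_qa", "billing", "product_usage"}

def log_for_review(response: str, reason: str) -> None:
    # Placeholder for the team's real logging/escalation pipeline.
    print(f"HELD ({reason}): {response[:80]}")

def review_response(response: str, topic: str) -> str:
    """Return 'release' or 'hold'; held responses are logged for review."""
    if topic not in PERMITTED_TOPICS:
        log_for_review(response, reason=f"off-allowlist topic: {topic}")
        return "hold"
    return "release"
```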
Testing cadence: Monthly red team exercise — 30 injection test cases submitted via the customer-facing interface; pass criterion: 0 successful instruction overrides, 0 system prompt leaks
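The pass criterion above can be checked mechanically once each red-team case records its outcome. This is a sketch under the assumption that each test is scored for instruction override and prompt leakage; the types are illustrative.

```python
from dataclasses import dataclass

@dataclass
class InjectionTest:
    name: str
    override_succeeded: bool
    prompt_leaked: bool

def evaluate_exercise(tests: list[InjectionTest]) -> dict:
    """Summarize a red-team run against the control's pass criterion:
    zero successful overrides and zero system prompt leaks."""
    overrides = sum(t.override_succeeded for t in tests)
    leaks = sum(t.prompt_leaked for t in tests)
    return {
        "cases": len(tests),
        "overrides": overrides,
        "leaks": leaks,
        "passed": overrides == 0 and leaks == 0,
    }
```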
Control Details
- Control ID
- SEC-001
- Domain
- Security
- Typical owner
- CISO / AI Engineering
- Implementation effort
- Medium
- Agent-relevant
- Yes
