AI Governance Institute logo
AI Governance Institute

Practical Governance for Enterprise AI

Security
SEC · SecuritySEC-001Medium effortAgent-relevant

Prompt Injection Prevention

Detect and block adversarial inputs designed to override AI system instructions, extract sensitive information, or cause the model to behave in unintended ways.

Objective

Protect AI systems from being hijacked through crafted inputs that manipulate model behavior beyond its intended scope.

Maturity Levels

1

Initial

No injection defenses exist; all user inputs are passed directly to the model.

2

Developing

Basic input filtering is in place but rules are incomplete and not formally tested.

3

Defined

Documented injection prevention controls include input validation, instruction separation, and output monitoring.

4

Managed

Injection attempt detection is automated; attempted attacks are logged and reviewed.

5

Optimizing

Red team exercises proactively test new injection techniques; defenses are updated continuously.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Input sanitization configuration listing active filtering rules and structural content-boundary enforcement
  • Monthly red team test results: test case count, pass/fail outcome, and remediation records for any failures
  • Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts
  • Output monitoring configuration showing permitted-topic controls and flagging thresholds
  • Post-remediation retest results confirming identified weaknesses were resolved before returning to production

Implementation Notes

Key steps

  • Separate system instructions from user-supplied content structurally — never concatenate user input into the instruction context without clear demarcation.
  • Implement input validation that flags common injection patterns (role-playing instructions, instruction override phrases, delimiter injection).
  • Test for indirect injection: content retrieved from external sources (web, documents, emails) should be treated as untrusted data, not trusted instructions.
  • Monitor outputs for signs of injection success: unexpected role changes, policy violations, out-of-scope content.

Example Implementation

B2B SaaS company with a customer-facing AI assistant that processes user-supplied documents

Prompt Injection Prevention Controls — Document Q&A Assistant

Structural separation: User documents are passed in a clearly labelled <document> block in the user turn. System instructions live exclusively in the system prompt. The two are never concatenated.

Input validation rules (applied before model receives content):

  • Flag and quarantine inputs containing: "ignore previous instructions", "you are now", "disregard your", "system prompt:", "new instructions:"
  • Strip HTML/XML tags from user-supplied text before injection into prompt template
  • Enforce 8,000-token input limit per document chunk

Output monitoring: Model responses are checked against a permitted topic allowlist. Responses discussing system instructions, other customers' data, or unrelated domains are logged and held for review.

Testing cadence: Monthly red team exercise — 30 injection test cases submitted via the customer-facing interface; pass criterion: 0 successful instruction overrides, 0 system prompt leaks

Control Details

Control ID
SEC-001
Domain
Security
Typical owner
CISO / AI Engineering
Implementation effort
Medium effort
Agent-relevant
Yes

Tags

prompt injectionAI securityinput validationadversarial inputs