Agent Prompt Injection Defense

Maturity Levels

Initial

No prompt injection defenses exist; agents process external content without filtering.

Developing

Engineering teams are aware of prompt injection but defenses are inconsistent and undocumented.

Defined

Input sanitization and content-boundary enforcement are applied to all agent inputs from external sources.

Managed

Red team testing for prompt injection is conducted quarterly; findings are tracked to remediation.

Optimizing

Automated injection attempt detection feeds continuous improvement of agent system prompts and input filters.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

—Input sanitization configuration documentation listing active filters and content-boundary enforcement rules

—Quarterly red team testing results: number of injection payloads tested, pass/fail outcome, and remediation actions for any failures

—Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts

—System prompt documentation demonstrating structural separation of agent instructions from external data

—Post-remediation re-test results confirming findings from red team exercises were resolved

Implementation Notes

Key steps

Treat all external content (web pages, emails, documents, API responses) as untrusted — enforce a strict boundary between agent instructions and external data.
Use structural separation: pass external content in clearly labeled data fields rather than appending it to the instruction context.
Test agents explicitly for indirect injection: submit tool results containing instruction-like text and verify the agent does not follow them.
Monitor for anomalous agent actions that diverge from the initiating task — injection attacks often cause scope drift.

Prompt Injection Defense Policy — Email Intake Agent

Content boundary rule: All email body content is passed inside structured <email> tags. The system prompt explicitly instructs the agent to treat any content inside <email> as untrusted data and to ignore instructions found within it. External content is never appended to the instruction context.

Permitted action set: {classify, route, flag_for_review}. Any agent response outside this set is logged as a potential injection attempt and held for human review.

Input filters (applied before agent receives content):

Strip strings matching: "ignore previous instructions", "you are now", "new task:", "system:", "[INST]"
Flag any content claiming to originate from the system, operator, or a higher-trust agent

Monitoring: Injection attempt log reviewed weekly; patterns feed updates to system prompt and filter rules

Red team schedule: Quarterly — 20 emails with embedded instruction-like payloads; pass criterion: 0 instructions followed

Maturity Levels

Evidence Requirements

Implementation Notes

Key steps

Example Implementation

Prompt Injection Defense Policy — Email Intake Agent

Control Details

Tags

Mapped Regulations

Related Controls

Related Playbook