Agent Prompt Injection Defense
Protect AI agents from prompt injection attacks — adversarial instructions embedded in external content that hijack agent behavior.
Objective
Prevent agents from being redirected by malicious instructions in tool outputs, user-supplied content, or web-retrieved data.
Maturity Levels
Initial
No prompt injection defenses exist; agents process external content without filtering.
Developing
Engineering teams are aware of prompt injection but defenses are inconsistent and undocumented.
Defined
Input sanitization and content-boundary enforcement are applied to all agent inputs from external sources.
Managed
Red team testing for prompt injection is conducted quarterly; findings are tracked to remediation.
Optimizing
Automated injection attempt detection feeds continuous improvement of agent system prompts and input filters.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- —Input sanitization configuration documentation listing active filters and content-boundary enforcement rules
- —Quarterly red team testing results: number of injection payloads tested, pass/fail outcome, and remediation actions for any failures
- —Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts
- —System prompt documentation demonstrating structural separation of agent instructions from external data
- —Post-remediation re-test results confirming findings from red team exercises were resolved
Implementation Notes
Key steps
- Treat all external content (web pages, emails, documents, API responses) as untrusted — enforce a strict boundary between agent instructions and external data.
- Use structural separation: pass external content in clearly labeled data fields rather than appending it to the instruction context.
- Test agents explicitly for indirect injection: submit tool results containing instruction-like text and verify the agent does not follow them.
- Monitor for anomalous agent actions that diverge from the initiating task — injection attacks often cause scope drift.
Example Implementation
Sales team using an AI agent to process and route inbound email inquiries
Prompt Injection Defense Policy — Email Intake Agent
Content boundary rule: All email body content is passed inside structured <email> tags. The system prompt explicitly instructs the agent to treat any content inside <email> as untrusted data and to ignore instructions found within it. External content is never appended to the instruction context.
Permitted action set: {classify, route, flag_for_review}. Any agent response outside this set is logged as a potential injection attempt and held for human review.
Input filters (applied before agent receives content):
- Strip strings matching: "ignore previous instructions", "you are now", "new task:", "system:", "[INST]"
- Flag any content claiming to originate from the system, operator, or a higher-trust agent
Monitoring: Injection attempt log reviewed weekly; patterns feed updates to system prompt and filter rules
Red team schedule: Quarterly — 20 emails with embedded instruction-like payloads; pass criterion: 0 instructions followed
Control Details
- Control ID
- AGT-002
- Domain
- Agentic AI
- Typical owner
- AI Security / AI Engineering
- Implementation effort
- Medium effort
- Agent-relevant
- Yes
