AI Governance Institute logo
AI Governance Institute

Practical Governance for Enterprise AI

Agentic AI
AGT · Agentic AIAGT-002Medium effortAgent-relevant

Agent Prompt Injection Defense

Protect AI agents from prompt injection attacks — adversarial instructions embedded in external content that hijack agent behavior.

Objective

Prevent agents from being redirected by malicious instructions in tool outputs, user-supplied content, or web-retrieved data.

Maturity Levels

1

Initial

No prompt injection defenses exist; agents process external content without filtering.

2

Developing

Engineering teams are aware of prompt injection but defenses are inconsistent and undocumented.

3

Defined

Input sanitization and content-boundary enforcement are applied to all agent inputs from external sources.

4

Managed

Red team testing for prompt injection is conducted quarterly; findings are tracked to remediation.

5

Optimizing

Automated injection attempt detection feeds continuous improvement of agent system prompts and input filters.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Input sanitization configuration documentation listing active filters and content-boundary enforcement rules
  • Quarterly red team testing results: number of injection payloads tested, pass/fail outcome, and remediation actions for any failures
  • Injection attempt detection logs reviewed on a defined cadence, with escalation records for confirmed attempts
  • System prompt documentation demonstrating structural separation of agent instructions from external data
  • Post-remediation re-test results confirming findings from red team exercises were resolved

Implementation Notes

Key steps

  • Treat all external content (web pages, emails, documents, API responses) as untrusted — enforce a strict boundary between agent instructions and external data.
  • Use structural separation: pass external content in clearly labeled data fields rather than appending it to the instruction context.
  • Test agents explicitly for indirect injection: submit tool results containing instruction-like text and verify the agent does not follow them.
  • Monitor for anomalous agent actions that diverge from the initiating task — injection attacks often cause scope drift.

Example Implementation

Sales team using an AI agent to process and route inbound email inquiries

Prompt Injection Defense Policy — Email Intake Agent

Content boundary rule: All email body content is passed inside structured <email> tags. The system prompt explicitly instructs the agent to treat any content inside <email> as untrusted data and to ignore instructions found within it. External content is never appended to the instruction context.

Permitted action set: {classify, route, flag_for_review}. Any agent response outside this set is logged as a potential injection attempt and held for human review.

Input filters (applied before agent receives content):

  • Strip strings matching: "ignore previous instructions", "you are now", "new task:", "system:", "[INST]"
  • Flag any content claiming to originate from the system, operator, or a higher-trust agent

Monitoring: Injection attempt log reviewed weekly; patterns feed updates to system prompt and filter rules

Red team schedule: Quarterly — 20 emails with embedded instruction-like payloads; pass criterion: 0 instructions followed

Control Details

Control ID
AGT-002
Typical owner
AI Security / AI Engineering
Implementation effort
Medium effort
Agent-relevant
Yes

Tags

prompt injectionagent securityadversarial inputsindirect injection