Reference Guide

Prompt Injection: Defenses

Defense in depth, applied — limiting input surface, tagged inputs, capability restriction, scoped credentials, human-in-the-loop gates, output validation, content provenance, sandboxing, monitoring, incident response. No silver bullets, layered defenses, sized to consequence.

← Back to Reference Hub

Every defense further down this list is harder than just not reading the content. Before you reach for tagged delimiters or output filters, ask whether the agent needs that surface at all. Does the summarization agent need to read the entire email or just the from/subject? Does the code agent need full web access or just the documentation URLs you've allowlisted? Does the document agent need to read DOCX comments and footnotes, or just the body text? Narrowing input is the highest-leverage defense because it eliminates entire classes of attack rather than mitigating them. It is also the defense people are most reluctant to apply, because it visibly reduces capability — but a narrower agent that works correctly is more valuable than a broader agent that periodically betrays its user.

Allowlists over blocklists — define what content is in scope, reject the rest
Strip metadata, comments, hidden fields from documents before reading
Limit web fetch to specific domains; treat search results as untrusted summaries, not direct content
Read structured fields (subject, sender) before opening bodies, where possible
Re-evaluate input scope every time a new tool or connector is added

Limitations: Real workflows often need the broader surface, and narrowing too aggressively kills product value. The technique works best when applied at design time; retrofitting it into a working system is expensive.

FoundationalDefensePre-Session

Wrapping untrusted content in explicit markers ('<<user_email>>...<<end_user_email>>') and instructing the model to treat content inside the markers as data, not instructions, measurably raises the bar for injection. The Claude API supports this pattern explicitly with system prompts that establish trust hierarchy. The bar is not infinite — the model can still be persuaded by sufficiently sophisticated content inside the markers, and attackers can sometimes impersonate the markers themselves. Treat tagged inputs as the first layer of defense, not the only one. Anyone who tells you 'just use XML tags and you're safe' is selling you something.

Tag untrusted content with explicit, hard-to-impersonate markers
Use the system prompt to define the trust hierarchy and how to treat each tag
Combine with output structure: ask the model to produce specific output formats that ignore in-content instructions
Anthropic specifically: use the documented prompt structure for tool use and document reading
Re-validate the tags on every model interaction — do not assume they survived a long context

Limitations: Tag impersonation is a known bypass class. Models trained on tagged inputs are more resistant, but no model is immune. This is a 'raise the cost of attack' technique, not a 'close the gap' technique.

DefensePrompt Layer

An agent that cannot send email cannot exfiltrate via email. An agent that cannot write to a database cannot poison a database. Capability restriction is the structural defense against the most damaging injection outcomes: even if the agent is fully compromised mid-session, the worst it can do is bounded by the tools it can call. Apply it per-workflow: a summarization agent gets read-only tools, an inbox-cleanup agent gets read + archive but not send, a code-review agent gets read + comment but not push. The friction of splitting agents into capability tiers pays off the first time an injection gets past your other defenses — which it will, eventually.

Match tool access to workflow — not to convenience
Read-only is dramatically safer than read-write; prefer it where possible
Network egress is the highest-leverage capability to restrict (no egress = no exfil)
Separate agents per privilege tier rather than one agent with all tools
Inventory current agent tool access annually; capabilities accumulate over time

Limitations: Real workflows resist splitting — users want one agent that does everything. Capability restriction is most effective when adopted from day one; retrofitting it requires re-platforming agents, which is painful.

FoundationalDefenseArchitecture

When an agent acts as the user, it should act with credentials scoped to the specific task — not the user's full identity. OAuth scopes that grant 'read all email' should be downgraded to 'read messages matching this label' where the API supports it. Service accounts for agent operations should have separate permissions from human accounts. Time-limited tokens are better than long-lived ones. The cost of an injection scales with the credentials the agent is wielding when it gets injected: a compromised agent with a full-access token can do far more harm than a compromised agent with a read-only, label-scoped, time-limited token.

OAuth scope minimization — request only the scopes the workflow needs
Per-agent service accounts, separate from user accounts
Time-limited credentials over long-lived API keys where the platform supports it
Audit logs for agent actions distinguishable from user actions
Revocation playbook — know how to kill agent credentials fast

Limitations: Many platforms have coarser scopes than ideal — "read email" is often not splittable into "read this folder." You work with what the platform offers and accept that some scopes are bigger than you wish.

DefenseIdentity Layer

The simplest and most effective defense against in-progress injection: before the agent takes a consequential action, show the user what it is about to do and require an explicit approval. The user catches the suspicious action because they can see it framed plainly ('Send email to attacker@example.com with body X — Approve?'), even if they could not have spotted the injection in the source content. The defense fails when overused: confirmation fatigue trains users to approve everything. Reserve gates for actions that match a consequence threshold — sending email, moving money, deleting, sharing externally, scheduling — not for every read or every minor edit.

Gate every action that touches the outside world or is hard to reverse
Show the action in user-meaningful terms (the actual email body, the actual amount, the actual recipient)
Do not gate routine reads or minor edits — fatigue kills the defense
Distinguish "this is the agent acting" from "this is the user acting" in UI
Default off-mode confirmations to ON; require deliberate opt-out per workflow

Limitations: Confirmation fatigue is the killer. Over-gating teaches users to approve without reading, which makes the defense useless. Tuning is per-workflow and per-user; one-size-fits-all gating fails both directions.

DefenseUX LayerIn-Session

Between the model's decision and the tool call, insert validation: is the agent about to call a tool with arguments that contain content from untrusted sources? Is the email recipient in the user's contacts or a stranger? Is the URL the agent is about to fetch in the allowlist? Is the file the agent is about to delete one the user explicitly named? Output validation catches the injection at the action layer rather than the input layer — useful because it works even when input-side defenses fail. The technique pairs especially well with capability restriction: the same code that enforces 'this agent can only email people in the user's contacts' is also the code that catches the injection trying to email attackers.

Pre-flight check tool calls against policy before executing them
Validate arguments against allowlists (recipients, URLs, file paths)
Anomaly-detect arguments that contain injected-looking content
Block tool calls that compose untrusted content with high-consequence actions
Log every blocked call — even unsuccessful injections are signal

Limitations: Hard to write the right validators. Too strict and the agent cannot do real work; too loose and injections slip through. Generally implementable for closed-vocabulary actions (email, file ops, payments) and harder for open-vocabulary actions (web fetch, code generation).

DefenseAction Layer

When content enters the agent's context, tag it with where it came from: user-typed, system prompt, internal-doc, external-email, web-fetch, third-party-tool. Policy can then reference these tags: 'do not call high-consequence tools when the request involved content tagged external-email,' or 'require user confirmation when the answer is informed by content tagged web-fetch.' Provenance is the prerequisite for almost every other defense — without it, the system has no way to know what to be careful about. It is also one of the hardest defenses to retrofit because most agent frameworks do not track provenance natively.

Tag every piece of content as it enters context (source, trust level, timestamp)
Carry tags forward through summarization and tool composition
Surface provenance in audit logs and (where relevant) in user-facing UI
Policy hooks consume the tags — "only allow capability X when no content with tag Y is in context"
Re-tag when content moves between trust levels (e.g., user explicitly approves an external doc)

Limitations: Frameworks need to support it natively or it does not survive long context. Most current frameworks (2026) only have partial provenance support. The architecture pattern is right but the implementation reality is uneven.

DefenseArchitectureEmerging

Below the application layer, the operating environment can enforce limits the agent cannot override: network egress allowlists, filesystem mounts narrowed to specific paths, container-level isolation, no shell access. The Anthropic Cowork model is informative here — file operations require explicit per-action approval for deletes, code runs in a VM separate from the host, but computer-use bypasses these because it interacts with the actual screen. The pattern: every layer of sandboxing closes one class of attack; none of them close all classes; the security posture of the system is the union of which layers you have in place and which you have left open.

Network egress allowlists at the container or VM level
Filesystem mounts narrowed to working directories, not full home
Separate sandbox for code execution from sandbox for file access
Per-tool capability isolation — one tool getting compromised does not compromise the others
Be aware which capabilities bypass the sandbox (computer use, screen access, system-level MCPs)

Limitations: Sandboxing fights with capability — the more sandboxed the agent, the less useful for tasks that genuinely need broad access. Computer-use modes specifically bypass most software sandboxing because they operate above the OS.

DefenseInfrastructure

Accept that defense-in-depth will be incomplete and design for detection of the injections that get through. Log every tool call with arguments, every action taken, every content source that influenced a response. Build dashboards for the patterns that correlate with injection: sudden topic shifts, unexpected tool calls, accesses to resources the user did not mention, output that contains URLs to unfamiliar domains. The goal is not real-time prevention (that is what the other layers are for) but post-hoc visibility so that an injection attack is caught hours after it happens rather than days or weeks. The faster you see it, the smaller the blast radius.

Tool-call audit logs with arguments and source attribution
Action diff reviews — "what changed because of this agent run?"
Anomaly patterns: sudden topic shift, unrequested resource access, output URLs to unknown domains
Alerting on injection-shaped events without paging on every false positive
Regular log review on a scheduled cadence even when no alerts fire

Limitations: Monitoring is reactive — it catches incidents after they happen, not before. Useful precisely because the proactive defenses are imperfect. Volume of agent activity makes raw log review impractical; you need either dashboards or LLM-assisted summarization.

DefenseDetectionAlways Applicable

Every prompt-injection incident is intelligence — for you (this content reached this surface), for your vendor (this attack pattern slipped past their defenses), and for the broader ecosystem (other organizations face the same surface). Have an incident playbook: stop the run immediately, preserve the conversation and tool-call logs, identify which content surface delivered the payload, rotate any credentials that were in scope, and report to the vendor's security channel. Anthropic accepts reports at security@anthropic.com and through the in-app feedback button; competitors have similar channels. Reports feed back into model training and content classifiers — the defenses everyone gets sharper because of your report.

Pre-defined playbook: stop, preserve, identify, rotate, report
Stop first, investigate second — gathering more evidence during the attack worsens the blast radius
Preserve full conversation history and tool-call audit log before clearing the session
Identify the entry surface (which document/page/message carried the payload)
Report upward to the AI vendor and downward to your team for shared learning

Limitations: Most teams do not have an incident playbook for AI agents the way they do for traditional security incidents. Build it before you need it. The first incident is a bad time to figure out who to call.

DefenseIncident Response

Tagged delimiters are a starting line, not a defense in themselves

Every other week a vendor announces 'we now use XML tags to prevent prompt injection.' Tags help — they raise the bar — but they do not close the gap, and treating them as the defense is the most common mistake teams make. Use the tagging your platform offers, AND restrict capabilities, AND gate consequential actions, AND monitor. The shape of a mature defense is layered; the shape of an immature one is a single technique with vendor confidence behind it.

Defense	Layer	Cost to implement	Coverage
Limit input surface	Input	Low at design time, high to retrofit	Highest — eliminates whole classes
Tagged/structured inputs	Prompt	Low	Medium — raises the bar but bypassable
Capability restriction	Architecture	High — requires splitting agents	Highest — bounds blast radius regardless of input
Scoped credentials	Identity	Medium — platform-dependent	High where granular scopes exist
Human-in-the-loop gates	UX	Low to implement, high in user friction if over-applied	High where applied; fatigue kills it if over-applied
Output filtering	Action	Medium — requires validators per action type	High for closed-vocab actions, lower for open-vocab
Content provenance	Architecture	High — framework-dependent	Prerequisite for several other defenses
Sandboxing	Infrastructure	Medium	High for software-mediated tools, low for computer-use
Monitoring	Detection	Medium — log + dashboard	Reactive — does not prevent
Incident response	Process	Low — written playbook	Minimizes blast radius of incidents that happen

No single defense is sufficient. The shape of a mature defense is at least one layer from input/prompt/architecture/UX/action and at least one layer from detection/process. Most real systems stack 4–6 of these and accept that 2 of them will fail in any given incident.

You are about to ship an agent that summarizes incoming customer feedback.Limit Input Surface + Capability Restriction — the agent does not need to send email or take any action; restrict it to read + write-to-internal-doc only. Cap the input at message body + subject (skip attachments unless explicitly requested). These two cover the lethal trifecta by removing ingredient 2 (exfil).

Your existing agent has read/write access to the whole Drive folder. A teammate asks for justification.Scoped Credentials + Capability Restriction — re-scope to the specific subfolders the agent actually touches. Use a service account scoped to those folders. Audit log every write. The 'we trust the model' answer is not the right answer; the right answer is 'we restrict what the model can do regardless of trust.'

Confirmation fatigue: users are clicking 'approve' on every action without reading.Human-in-the-Loop Gates — your gating is over-applied. Reserve gates for consequential actions (email send, money move, external share, delete). Drop gates on routine reads and minor edits. Better one gate the user reads than ten gates they autopilot through.

Investigating an incident where the agent attempted to email a file to an unfamiliar address.Output Filtering + Monitoring — the validation should have caught the email recipient not being in the user's contacts; check whether that validator exists or fired. Monitoring should show the source-content that influenced the action; trace back to which document or web page carried the payload. Then close the gap: tighter validator, narrower input scope, or both.

Your agent uses three third-party MCP servers. Engineering wants to add a fourth.Limit Input Surface + Content Provenance — vet the fourth MCP like a dependency, not a feature. Tag its outputs with `source:mcp:<name>` and add a policy that high-consequence actions cannot be triggered while content from that source is in context. If the framework does not support tagging, that is a flag to address before adding more MCPs.

You have decent in-session defenses but no incident playbook.Incident Response and Reporting — write the playbook this week, before you need it. Five steps: stop, preserve, identify, rotate, report. Pin it in your team's docs. The first incident is a bad time to figure out who to call.

Stack at least one input-layer, one architecture-layer, and one process-layer defense

A useful minimum for any agent: narrow the input (what content does the agent actually need?), restrict the capability (what actions can it take given that content?), and have an incident playbook (what do you do when the first two fail?). Three layers, one from each tier. Add UX gating, output validation, and monitoring as the consequence of the workflow rises. Skip none of the three baseline layers — they cover the most attack surface for the least cost.

Prompt Injection: Defenses

Limit the Input Surface

Tagged and Structured Inputs

Capability Restriction

Scoped Credentials

Human-in-the-Loop Gates

Output Filtering and Validation

Content Provenance and Trust Tagging

Sandboxing the Execution Environment

Monitoring and Detection

Incident Response and Reporting

Tagged delimiters are a starting line, not a defense in themselves

Stack at least one input-layer, one architecture-layer, and one process-layer defense