Prompt Injection: Defenses
Defense in depth, applied — limiting input surface, tagged inputs, capability restriction, scoped credentials, human-in-the-loop gates, output validation, content provenance, sandboxing, monitoring, incident response. No silver bullets, layered defenses, sized to consequence.
← Back to Reference HubEvery defense further down this list is harder than just not reading the content. Before you reach for tagged delimiters or output filters, ask whether the agent needs that surface at all. Does the summarization agent need to read the entire email or just the from/subject? Does the code agent need full web access or just the documentation URLs you've allowlisted? Does the document agent need to read DOCX comments and footnotes, or just the body text? Narrowing input is the highest-leverage defense because it eliminates entire classes of attack rather than mitigating them. It is also the defense people are most reluctant to apply, because it visibly reduces capability — but a narrower agent that works correctly is more valuable than a broader agent that periodically betrays its user.
- Allowlists over blocklists — define what content is in scope, reject the rest
- Strip metadata, comments, hidden fields from documents before reading
- Limit web fetch to specific domains; treat search results as untrusted summaries, not direct content
- Read structured fields (subject, sender) before opening bodies, where possible
- Re-evaluate input scope every time a new tool or connector is added
Limitations: Real workflows often need the broader surface, and narrowing too aggressively kills product value. The technique works best when applied at design time; retrofitting it into a working system is expensive.
Wrapping untrusted content in explicit markers ('<<user_email>>...<<end_user_email>>') and instructing the model to treat content inside the markers as data, not instructions, measurably raises the bar for injection. The Claude API supports this pattern explicitly with system prompts that establish trust hierarchy. The bar is not infinite — the model can still be persuaded by sufficiently sophisticated content inside the markers, and attackers can sometimes impersonate the markers themselves. Treat tagged inputs as the first layer of defense, not the only one. Anyone who tells you 'just use XML tags and you're safe' is selling you something.
- Tag untrusted content with explicit, hard-to-impersonate markers
- Use the system prompt to define the trust hierarchy and how to treat each tag
- Combine with output structure: ask the model to produce specific output formats that ignore in-content instructions
- Anthropic specifically: use the documented prompt structure for tool use and document reading
- Re-validate the tags on every model interaction — do not assume they survived a long context
Limitations: Tag impersonation is a known bypass class. Models trained on tagged inputs are more resistant, but no model is immune. This is a 'raise the cost of attack' technique, not a 'close the gap' technique.
An agent that cannot send email cannot exfiltrate via email. An agent that cannot write to a database cannot poison a database. Capability restriction is the structural defense against the most damaging injection outcomes: even if the agent is fully compromised mid-session, the worst it can do is bounded by the tools it can call. Apply it per-workflow: a summarization agent gets read-only tools, an inbox-cleanup agent gets read + archive but not send, a code-review agent gets read + comment but not push. The friction of splitting agents into capability tiers pays off the first time an injection gets past your other defenses — which it will, eventually.
- Match tool access to workflow — not to convenience
- Read-only is dramatically safer than read-write; prefer it where possible
- Network egress is the highest-leverage capability to restrict (no egress = no exfil)
- Separate agents per privilege tier rather than one agent with all tools
- Inventory current agent tool access annually; capabilities accumulate over time
Limitations: Real workflows resist splitting — users want one agent that does everything. Capability restriction is most effective when adopted from day one; retrofitting it requires re-platforming agents, which is painful.
When an agent acts as the user, it should act with credentials scoped to the specific task — not the user's full identity. OAuth scopes that grant 'read all email' should be downgraded to 'read messages matching this label' where the API supports it. Service accounts for agent operations should have separate permissions from human accounts. Time-limited tokens are better than long-lived ones. The cost of an injection scales with the credentials the agent is wielding when it gets injected: a compromised agent with a full-access token can do far more harm than a compromised agent with a read-only, label-scoped, time-limited token.
- OAuth scope minimization — request only the scopes the workflow needs
- Per-agent service accounts, separate from user accounts
- Time-limited credentials over long-lived API keys where the platform supports it
- Audit logs for agent actions distinguishable from user actions
- Revocation playbook — know how to kill agent credentials fast
Limitations: Many platforms have coarser scopes than ideal — "read email" is often not splittable into "read this folder." You work with what the platform offers and accept that some scopes are bigger than you wish.
The simplest and most effective defense against in-progress injection: before the agent takes a consequential action, show the user what it is about to do and require an explicit approval. The user catches the suspicious action because they can see it framed plainly ('Send email to attacker@example.com with body X — Approve?'), even if they could not have spotted the injection in the source content. The defense fails when overused: confirmation fatigue trains users to approve everything. Reserve gates for actions that match a consequence threshold — sending email, moving money, deleting, sharing externally, scheduling — not for every read or every minor edit.
- Gate every action that touches the outside world or is hard to reverse
- Show the action in user-meaningful terms (the actual email body, the actual amount, the actual recipient)
- Do not gate routine reads or minor edits — fatigue kills the defense
- Distinguish "this is the agent acting" from "this is the user acting" in UI
- Default off-mode confirmations to ON; require deliberate opt-out per workflow
Limitations: Confirmation fatigue is the killer. Over-gating teaches users to approve without reading, which makes the defense useless. Tuning is per-workflow and per-user; one-size-fits-all gating fails both directions.
Between the model's decision and the tool call, insert validation: is the agent about to call a tool with arguments that contain content from untrusted sources? Is the email recipient in the user's contacts or a stranger? Is the URL the agent is about to fetch in the allowlist? Is the file the agent is about to delete one the user explicitly named? Output validation catches the injection at the action layer rather than the input layer — useful because it works even when input-side defenses fail. The technique pairs especially well with capability restriction: the same code that enforces 'this agent can only email people in the user's contacts' is also the code that catches the injection trying to email attackers.
- Pre-flight check tool calls against policy before executing them
- Validate arguments against allowlists (recipients, URLs, file paths)
- Anomaly-detect arguments that contain injected-looking content
- Block tool calls that compose untrusted content with high-consequence actions
- Log every blocked call — even unsuccessful injections are signal
Limitations: Hard to write the right validators. Too strict and the agent cannot do real work; too loose and injections slip through. Generally implementable for closed-vocabulary actions (email, file ops, payments) and harder for open-vocabulary actions (web fetch, code generation).
When content enters the agent's context, tag it with where it came from: user-typed, system prompt, internal-doc, external-email, web-fetch, third-party-tool. Policy can then reference these tags: 'do not call high-consequence tools when the request involved content tagged external-email,' or 'require user confirmation when the answer is informed by content tagged web-fetch.' Provenance is the prerequisite for almost every other defense — without it, the system has no way to know what to be careful about. It is also one of the hardest defenses to retrofit because most agent frameworks do not track provenance natively.
- Tag every piece of content as it enters context (source, trust level, timestamp)
- Carry tags forward through summarization and tool composition
- Surface provenance in audit logs and (where relevant) in user-facing UI
- Policy hooks consume the tags — "only allow capability X when no content with tag Y is in context"
- Re-tag when content moves between trust levels (e.g., user explicitly approves an external doc)
Limitations: Frameworks need to support it natively or it does not survive long context. Most current frameworks (2026) only have partial provenance support. The architecture pattern is right but the implementation reality is uneven.
Below the application layer, the operating environment can enforce limits the agent cannot override: network egress allowlists, filesystem mounts narrowed to specific paths, container-level isolation, no shell access. The Anthropic Cowork model is informative here — file operations require explicit per-action approval for deletes, code runs in a VM separate from the host, but computer-use bypasses these because it interacts with the actual screen. The pattern: every layer of sandboxing closes one class of attack; none of them close all classes; the security posture of the system is the union of which layers you have in place and which you have left open.
- Network egress allowlists at the container or VM level
- Filesystem mounts narrowed to working directories, not full home
- Separate sandbox for code execution from sandbox for file access
- Per-tool capability isolation — one tool getting compromised does not compromise the others
- Be aware which capabilities bypass the sandbox (computer use, screen access, system-level MCPs)
Limitations: Sandboxing fights with capability — the more sandboxed the agent, the less useful for tasks that genuinely need broad access. Computer-use modes specifically bypass most software sandboxing because they operate above the OS.
Accept that defense-in-depth will be incomplete and design for detection of the injections that get through. Log every tool call with arguments, every action taken, every content source that influenced a response. Build dashboards for the patterns that correlate with injection: sudden topic shifts, unexpected tool calls, accesses to resources the user did not mention, output that contains URLs to unfamiliar domains. The goal is not real-time prevention (that is what the other layers are for) but post-hoc visibility so that an injection attack is caught hours after it happens rather than days or weeks. The faster you see it, the smaller the blast radius.
- Tool-call audit logs with arguments and source attribution
- Action diff reviews — "what changed because of this agent run?"
- Anomaly patterns: sudden topic shift, unrequested resource access, output URLs to unknown domains
- Alerting on injection-shaped events without paging on every false positive
- Regular log review on a scheduled cadence even when no alerts fire
Limitations: Monitoring is reactive — it catches incidents after they happen, not before. Useful precisely because the proactive defenses are imperfect. Volume of agent activity makes raw log review impractical; you need either dashboards or LLM-assisted summarization.
Every prompt-injection incident is intelligence — for you (this content reached this surface), for your vendor (this attack pattern slipped past their defenses), and for the broader ecosystem (other organizations face the same surface). Have an incident playbook: stop the run immediately, preserve the conversation and tool-call logs, identify which content surface delivered the payload, rotate any credentials that were in scope, and report to the vendor's security channel. Anthropic accepts reports at security@anthropic.com and through the in-app feedback button; competitors have similar channels. Reports feed back into model training and content classifiers — the defenses everyone gets sharper because of your report.
- Pre-defined playbook: stop, preserve, identify, rotate, report
- Stop first, investigate second — gathering more evidence during the attack worsens the blast radius
- Preserve full conversation history and tool-call audit log before clearing the session
- Identify the entry surface (which document/page/message carried the payload)
- Report upward to the AI vendor and downward to your team for shared learning
Limitations: Most teams do not have an incident playbook for AI agents the way they do for traditional security incidents. Build it before you need it. The first incident is a bad time to figure out who to call.
Tagged delimiters are a starting line, not a defense in themselves
| Defense | Layer | Cost to implement | Coverage |
|---|---|---|---|
| Limit input surface | Input | Low at design time, high to retrofit | Highest — eliminates whole classes |
| Tagged/structured inputs | Prompt | Low | Medium — raises the bar but bypassable |
| Capability restriction | Architecture | High — requires splitting agents | Highest — bounds blast radius regardless of input |
| Scoped credentials | Identity | Medium — platform-dependent | High where granular scopes exist |
| Human-in-the-loop gates | UX | Low to implement, high in user friction if over-applied | High where applied; fatigue kills it if over-applied |
| Output filtering | Action | Medium — requires validators per action type | High for closed-vocab actions, lower for open-vocab |
| Content provenance | Architecture | High — framework-dependent | Prerequisite for several other defenses |
| Sandboxing | Infrastructure | Medium | High for software-mediated tools, low for computer-use |
| Monitoring | Detection | Medium — log + dashboard | Reactive — does not prevent |
| Incident response | Process | Low — written playbook | Minimizes blast radius of incidents that happen |
Stack at least one input-layer, one architecture-layer, and one process-layer defense
A useful minimum for any agent: narrow the input (what content does the agent actually need?), restrict the capability (what actions can it take given that content?), and have an incident playbook (what do you do when the first two fail?). Three layers, one from each tier. Add UX gating, output validation, and monitoring as the consequence of the workflow rises. Skip none of the three baseline layers — they cover the most attack surface for the least cost.