Reference Guide

Prompt Injection: Mental Models

The eight mental models that organize every other prompt-injection defense — data vs instructions confusion, the lethal trifecta, direct vs indirect injection, the agent surface as attack surface, trust boundaries, capability vs autonomy, defense in depth, and the Stranger Test.

← Back to Reference Hub

Traditional software draws a hard line between code (instructions the machine executes) and data (content the machine processes). Language models do not. Every token in context is read with the same eye — the user's question, the system prompt, the tool result, the document the agent just opened, the email it just read. The model decides moment-to-moment what to treat as direction and what to treat as content, and an attacker who can write any of that content can try to shift the boundary. This is why prompt injection is not solvable the way SQL injection is solvable: you cannot escape your way out of a problem that exists at the interpretation layer rather than the parsing layer.

Every token the model sees is potentially executable as an instruction
There is no syntactic marker that reliably distinguishes "this is what the user wants" from "this is content the user wants me to work with"
Vendor mechanisms (system prompts, role tags, content tags) raise the bar but do not close the gap
The problem worsens as context windows grow — more content means more opportunities for an injected payload to be reached
Mitigation must be defense-in-depth at the application layer, not a parser fix at the model layer

Limitations: This framing is uncomfortable for engineers coming from traditional software security, where input sanitization is a solved enough problem. The fact that there is no fix at the model layer often gets read as defeatism — it is not. It is the starting point for understanding which layer the fixes have to live at.

FoundationalMental Model

Prompt injection is dangerous when three properties hold simultaneously: the agent has access to private or sensitive data, the agent can exfiltrate (send data outside the user's control), and the agent is exposed to untrusted content. Drop any one ingredient and the attack surface collapses. A read-only agent on public data has nothing to steal. A sandboxed agent with no network can be tricked but cannot exfiltrate. An agent that only ever sees content the user themselves wrote has no injection channel. This framing is unusually portable: it works for chat assistants, code assistants, browser agents, file-reading agents, and connector-mediated agents alike, and it gives engineers a checklist they can apply during design rather than a vocabulary they have to memorize.

Ingredient 1: access to private or sensitive data the attacker wants
Ingredient 2: a path for data to leave the trust boundary (network call, tool output, email send, screenshot, anything outbound)
Ingredient 3: exposure to content an attacker can shape (web page, email body, document, tool result)
Remove any one ingredient and you reduce the attack from "real risk" to "annoying but contained"
Use it during design reviews: "Does this feature add ingredient 1, 2, or 3 to a path that already has the other two?"

Limitations: Like any heuristic it oversimplifies. A determined attacker can chain across separate agents (one has trifecta-element-2, another has trifecta-element-1) — the model still applies but you have to draw the trust boundary around the whole system, not a single agent. It also doesn't help with reputational harm where no data is exfiltrated but the agent says something embarrassing publicly.

FoundationalDesign Tool

Direct prompt injection is what most people picture: a user types 'ignore your previous instructions and do X.' This is interesting for jailbreaks but limited as a real-world threat because the only person it harms is the user themselves. Indirect prompt injection is the asymmetric version: an attacker plants instructions in content that the model will later read on behalf of a victim. A poisoned web page, a malicious email, a booby-trapped PDF, a planted Slack message. The victim's own agent then encounters the payload while doing legitimate work. Almost every serious prompt-injection incident in the wild is indirect, because that is the version that lets an attacker target other people.

Direct: user types the injection — limited blast radius (themselves)
Indirect: attacker plants the injection in content the victim's agent will later read
Indirect attacks ride along on legitimate workflows ("summarize my email," "read this PDF")
The victim has no awareness of the attack — there is no malicious-looking message to flag
The attacker does not need access to the victim's account, only to content the victim's agent will reach

Limitations: The line between direct and indirect blurs in shared workspaces — a colleague's Slack message is technically content from someone else, but most people don't treat it as attacker-controlled. Threat modeling has to define explicitly which sources are trusted, which are not, and where the boundary sits.

FoundationalThreat Pattern

A chat-only LLM has a small attack surface: the user, the system prompt, the conversation history. An agent with tools has a surface equal to everything those tools can reach. Connect Gmail, and every email becomes potential injection content. Connect a browser, and every web page does. Connect a file reader, and every document does. Connect an MCP server, and every tool result returned by that server does. The agent does not need to be jailbroken or social-engineered — it just needs to read content from a surface an attacker controls, in the course of doing its legitimate job. This is the structural reason agents are riskier than chat: not because they are smarter, but because their input surface is dramatically larger.

Surface = union of everything every connected tool can reach
Adding a tool to an agent adds its entire input surface to the trust boundary
Read-only tools still expand the attack surface (input poisoning) even if not the exfil surface
Surface grows non-linearly: each new tool also enables new chains with existing tools
Treat tool connection decisions with the same gravity as opening a new network port

Limitations: Pure surface-counting underweights the practical value tools provide — an agent with one tool may be safer but also vastly less useful. The real question is whether the surface added is one you can monitor and reason about, not whether to minimize tool count for its own sake.

FoundationalArchitecture

Every agent system has trust boundaries, even if no one has drawn them. The user is one trust level. The system prompt is another. The model's own outputs are a third. Tool results from an internal API differ from tool results from a third-party MCP server. Documents the user uploads themselves differ from documents pulled in from email. If you cannot name the trust levels in your system, you cannot reason about which injection attacks matter and which do not. Drawing the boundaries explicitly is half the work; the other half is making sure each boundary's content is treated according to its level — sanitized, tagged, scoped, or denied capabilities as appropriate.

List every source of content the agent reads: user, system, each tool, each connector, each MCP
Assign each source a trust level — 'trusted' (system prompt), 'semi-trusted' (user), 'untrusted' (external content)
Capabilities granted should reflect trust — untrusted content should never reach high-capability tool calls without a gate
Document the boundaries in code comments or design docs so they survive refactors
Re-audit when adding tools or connectors — every new source has its own implicit trust level

Limitations: Real trust is rarely binary. A user uploads a document — is that trusted (because the user chose it) or untrusted (because the document might have come from anywhere)? Most teams end up with three or four levels and accept the messiness.

ArchitectureDesign Tool

Per-action confirmation is the most accessible defense against mid-task injection — the user sees the proposed action and can refuse it. Removing that gate ('Act without asking' modes, autonomous loops, scheduled tasks) trades injection resistance for speed. This is a legitimate tradeoff: not every workflow needs the gate, and over-gating creates its own failure mode (confirmation fatigue, where users approve everything to make the prompts go away). The principle: grant the minimum autonomy the workflow actually requires, and keep gates on for anything that touches untrusted content or high-consequence tools.

Per-action confirmation is the cheapest defense against in-progress injection
Autonomous modes ("Act without asking," scheduled tasks, multi-step agents) bypass that defense
Reserve autonomy for: short tasks, trusted content, easily reversible actions, supervised runs
Keep gates for: long tasks, content from external sources, irreversible actions (email send, payments, deletes)
Confirmation fatigue is real — over-gating teaches users to click through, undoing the benefit

Limitations: There is no clean rule for where the line sits. Teams converge on it by trial and incident review. The honest version: 'we keep gates on for tools that touch external content or take consequential actions, and we keep them off for read-only summarization of trusted documents.'

DefenseMode Setting

Every individual defense against prompt injection has been bypassed by some attack: tagged delimiters can be impersonated, system prompts can be overridden, output filters can be evaded, sandboxes can be escaped, human review can be fatigued. The reason the field is not hopeless is that defenses compound — an attacker has to bypass them in sequence, and most attacks fail at one or another layer. The mental shift required: stop searching for the defense that 'solves' prompt injection (none does), and start asking 'what is layer one, two, three, four, and is there at least one layer between every plausible attack and every consequential action?' This is identical to how mature security thinking handles every other class of attack.

Input layer: limit what content the agent reads at all
Boundary layer: tag/structure content so the model can distinguish levels (helps, not enough alone)
Capability layer: restrict what tools an agent can call given what content it has read
Output layer: validate or filter what the agent is about to do or say
Gate layer: human confirmation on the highest-consequence actions
Monitoring layer: log everything; detect drift after the fact even when prevention fails

Limitations: Defense-in-depth costs latency, dev time, and user experience. Over-engineering it for a low-risk workflow is its own failure mode. Layer count should match consequence — a summary tool can run with fewer layers than a tool that can spend money.

FoundationalDefense

Before connecting a tool or granting a permission, ask: 'If a stranger could write whatever they wanted into this surface, what is the worst thing they could get my agent to do?' If the worst case is 'waste my time,' the surface is probably fine. If the worst case is 'exfiltrate my data,' 'send messages on my behalf,' or 'spend money,' the surface needs gates. This is not a substitute for threat modeling — it is the warm-up for it. It works well in design reviews because non-security people can answer it, and it surfaces the indirect-injection cases that pure functional thinking misses.

Apply at design time, before code: "What is the worst thing a stranger writing into this content could trigger?"
Apply at install time, before connecting an MCP or plugin: "What is the worst thing the maintainer of this server could trigger if they turned malicious?"
Apply during incident triage: "What surface did this attack come through, and does that surface have a 'stranger' on the other side?"
Maps cleanly to the lethal trifecta — if the worst case is bad, you have all three ingredients
Use as a discussion frame in design reviews, especially with PMs and non-security engineers

Limitations: A blunt heuristic, not a methodology. Misses sophisticated chain attacks where the worst single-step outcome looks benign but multi-step composition produces real harm. Useful as the first filter, not the only one.

Design ToolTriage

Prompt injection is not solved at the model layer — and that is the most useful thing to know

Engineers coming from traditional security keep looking for the sanitization library that fixes prompt injection. There isn't one and won't be one in the short term, because the issue is at the interpretation layer where the model decides what counts as instruction versus content. Accepting this redirects energy from 'find the magic fix' to 'stack defenses at the application layer.' Every page in this module is downstream of that single redirection.

Mental model	What it helps you do	Where it applies	Limits
Instructions vs. data confusion	Stop expecting a model-layer fix; design at the application layer	Every system using an LLM	Reframes the problem — does not solve it
Lethal trifecta	Decide whether a feature adds real risk	Design, code review, install decisions	Heuristic — oversimplifies chains
Direct vs. indirect injection	Focus defense on the asymmetric (indirect) attacks	Threat modeling, content sourcing	Boundary between "user content" and "external content" can be fuzzy
Agent surface = attack surface	Treat new tool connections like new network ports	Architecture review, tool wiring	Undercounts the value tools provide; not just about minimizing count
Trust boundaries	Name where trust ends; tag content accordingly	System design, code structure	Real trust is rarely binary
Capability vs. autonomy	Choose the right autonomy level per workflow	Mode settings, UX gates, scheduled tasks	Confirmation fatigue is a real failure mode
Defense-in-depth	Stop searching for the single defense; stack them	All defense planning	Layer count should match consequence — not maximalism
The Stranger Test	Quick design-review filter	Design and install decisions	Misses multi-step chain attacks

The eight models compose. The trifecta tells you when you have a problem, agent-surface thinking tells you where, trust boundaries tell you how to scope, defense-in-depth tells you how many layers to stack, and the Stranger Test gets you started talking about it.

A teammate wants to connect a Gmail MCP to their coding agent so it can read product feedback.The Lethal Trifecta — does the agent have access to sensitive data (yes, the codebase), can it exfiltrate (yes, it can write commits or PRs), and is it now exposed to untrusted content (yes, every email body)? All three ingredients. Either remove an ingredient (no commit/PR capability while Gmail is connected) or add a confirmation gate to anything outbound.

Your team is debating whether 'Act without asking' is safe for a refactor across a folder of internal source files.Capability vs. Autonomy — internal files are trusted content, the action is reversible (git), the run is short. Reasonable to flip the autonomy gate off. Compare to: 'Act without asking' across a folder of PDFs from external vendors — different content trust level, different decision.

A vendor proposes an MCP server that 'just reads your support tickets and helps you triage.'The Stranger Test — every customer who has ever filed a ticket is now writing into your agent's context. Some of those customers might be hostile. The 'stranger' here is anyone who has ever submitted a ticket. Triage capabilities are probably fine; ticket-summary-then-reply capabilities are not, because the strangers can shape the replies.

An engineer is frustrated that they cannot 'just fix prompt injection' by sanitizing input.Instructions vs. Data Confusion — there is no parser to sanitize against. The fix lives at the application layer: trust boundaries, capability restriction, gates. Redirect the energy from 'find the sanitization library' to 'pick which layers of defense-in-depth this workflow earns.'

Your agent works fine in dev. Then someone adds a 'read recent web search results' tool. What changed?The Agent Surface as Attack Surface — the surface just expanded to include any web page the search returns, which means anyone who can rank for any query the agent might run. Re-apply the trifecta check now that ingredient 3 (untrusted content) is materially larger.

Three-person team wants to ship an internal agent fast and is asking which one defense to build first.Defense-in-Depth — the question is malformed. The right shape is 'what is the minimum stack for this consequence level?' For an internal-only read-only summarization agent, two layers might be enough. For an agent that can send email, three or four. Pick layers in proportion to what the agent can do, not in proportion to how scared you are.

If you only remember one thing, remember the trifecta

Access to sensitive data + ability to exfiltrate + exposure to untrusted content is the combination that makes prompt injection a real risk. Drop any one ingredient and the attack surface collapses. Use it at design time (does this feature add an ingredient?), at install time (does this MCP add an ingredient?), and at triage time (which ingredient enabled the incident?). Everything else on this page elaborates on or sits underneath this single check.

Prompt Injection: Mental Models

Instructions vs. Data Confusion

The Lethal Trifecta

Direct vs. Indirect Injection

The Agent Surface as Attack Surface

Trust Boundaries

The Capability vs. Autonomy Tradeoff

Defense-in-Depth, Not Silver Bullets

The Stranger Test

Prompt injection is not solved at the model layer — and that is the most useful thing to know

If you only remember one thing, remember the trifecta