Prompt Injection: Mental Models
The eight mental models that organize every other prompt-injection defense — data vs instructions confusion, the lethal trifecta, direct vs indirect injection, the agent surface as attack surface, trust boundaries, capability vs autonomy, defense in depth, and the Stranger Test.
← Back to Reference HubTraditional software draws a hard line between code (instructions the machine executes) and data (content the machine processes). Language models do not. Every token in context is read with the same eye — the user's question, the system prompt, the tool result, the document the agent just opened, the email it just read. The model decides moment-to-moment what to treat as direction and what to treat as content, and an attacker who can write any of that content can try to shift the boundary. This is why prompt injection is not solvable the way SQL injection is solvable: you cannot escape your way out of a problem that exists at the interpretation layer rather than the parsing layer.
- Every token the model sees is potentially executable as an instruction
- There is no syntactic marker that reliably distinguishes "this is what the user wants" from "this is content the user wants me to work with"
- Vendor mechanisms (system prompts, role tags, content tags) raise the bar but do not close the gap
- The problem worsens as context windows grow — more content means more opportunities for an injected payload to be reached
- Mitigation must be defense-in-depth at the application layer, not a parser fix at the model layer
Limitations: This framing is uncomfortable for engineers coming from traditional software security, where input sanitization is a solved enough problem. The fact that there is no fix at the model layer often gets read as defeatism — it is not. It is the starting point for understanding which layer the fixes have to live at.
Prompt injection is dangerous when three properties hold simultaneously: the agent has access to private or sensitive data, the agent can exfiltrate (send data outside the user's control), and the agent is exposed to untrusted content. Drop any one ingredient and the attack surface collapses. A read-only agent on public data has nothing to steal. A sandboxed agent with no network can be tricked but cannot exfiltrate. An agent that only ever sees content the user themselves wrote has no injection channel. This framing is unusually portable: it works for chat assistants, code assistants, browser agents, file-reading agents, and connector-mediated agents alike, and it gives engineers a checklist they can apply during design rather than a vocabulary they have to memorize.
- Ingredient 1: access to private or sensitive data the attacker wants
- Ingredient 2: a path for data to leave the trust boundary (network call, tool output, email send, screenshot, anything outbound)
- Ingredient 3: exposure to content an attacker can shape (web page, email body, document, tool result)
- Remove any one ingredient and you reduce the attack from "real risk" to "annoying but contained"
- Use it during design reviews: "Does this feature add ingredient 1, 2, or 3 to a path that already has the other two?"
Limitations: Like any heuristic it oversimplifies. A determined attacker can chain across separate agents (one has trifecta-element-2, another has trifecta-element-1) — the model still applies but you have to draw the trust boundary around the whole system, not a single agent. It also doesn't help with reputational harm where no data is exfiltrated but the agent says something embarrassing publicly.
Direct prompt injection is what most people picture: a user types 'ignore your previous instructions and do X.' This is interesting for jailbreaks but limited as a real-world threat because the only person it harms is the user themselves. Indirect prompt injection is the asymmetric version: an attacker plants instructions in content that the model will later read on behalf of a victim. A poisoned web page, a malicious email, a booby-trapped PDF, a planted Slack message. The victim's own agent then encounters the payload while doing legitimate work. Almost every serious prompt-injection incident in the wild is indirect, because that is the version that lets an attacker target other people.
- Direct: user types the injection — limited blast radius (themselves)
- Indirect: attacker plants the injection in content the victim's agent will later read
- Indirect attacks ride along on legitimate workflows ("summarize my email," "read this PDF")
- The victim has no awareness of the attack — there is no malicious-looking message to flag
- The attacker does not need access to the victim's account, only to content the victim's agent will reach
Limitations: The line between direct and indirect blurs in shared workspaces — a colleague's Slack message is technically content from someone else, but most people don't treat it as attacker-controlled. Threat modeling has to define explicitly which sources are trusted, which are not, and where the boundary sits.
A chat-only LLM has a small attack surface: the user, the system prompt, the conversation history. An agent with tools has a surface equal to everything those tools can reach. Connect Gmail, and every email becomes potential injection content. Connect a browser, and every web page does. Connect a file reader, and every document does. Connect an MCP server, and every tool result returned by that server does. The agent does not need to be jailbroken or social-engineered — it just needs to read content from a surface an attacker controls, in the course of doing its legitimate job. This is the structural reason agents are riskier than chat: not because they are smarter, but because their input surface is dramatically larger.
- Surface = union of everything every connected tool can reach
- Adding a tool to an agent adds its entire input surface to the trust boundary
- Read-only tools still expand the attack surface (input poisoning) even if not the exfil surface
- Surface grows non-linearly: each new tool also enables new chains with existing tools
- Treat tool connection decisions with the same gravity as opening a new network port
Limitations: Pure surface-counting underweights the practical value tools provide — an agent with one tool may be safer but also vastly less useful. The real question is whether the surface added is one you can monitor and reason about, not whether to minimize tool count for its own sake.
Every agent system has trust boundaries, even if no one has drawn them. The user is one trust level. The system prompt is another. The model's own outputs are a third. Tool results from an internal API differ from tool results from a third-party MCP server. Documents the user uploads themselves differ from documents pulled in from email. If you cannot name the trust levels in your system, you cannot reason about which injection attacks matter and which do not. Drawing the boundaries explicitly is half the work; the other half is making sure each boundary's content is treated according to its level — sanitized, tagged, scoped, or denied capabilities as appropriate.
- List every source of content the agent reads: user, system, each tool, each connector, each MCP
- Assign each source a trust level — 'trusted' (system prompt), 'semi-trusted' (user), 'untrusted' (external content)
- Capabilities granted should reflect trust — untrusted content should never reach high-capability tool calls without a gate
- Document the boundaries in code comments or design docs so they survive refactors
- Re-audit when adding tools or connectors — every new source has its own implicit trust level
Limitations: Real trust is rarely binary. A user uploads a document — is that trusted (because the user chose it) or untrusted (because the document might have come from anywhere)? Most teams end up with three or four levels and accept the messiness.
Per-action confirmation is the most accessible defense against mid-task injection — the user sees the proposed action and can refuse it. Removing that gate ('Act without asking' modes, autonomous loops, scheduled tasks) trades injection resistance for speed. This is a legitimate tradeoff: not every workflow needs the gate, and over-gating creates its own failure mode (confirmation fatigue, where users approve everything to make the prompts go away). The principle: grant the minimum autonomy the workflow actually requires, and keep gates on for anything that touches untrusted content or high-consequence tools.
- Per-action confirmation is the cheapest defense against in-progress injection
- Autonomous modes ("Act without asking," scheduled tasks, multi-step agents) bypass that defense
- Reserve autonomy for: short tasks, trusted content, easily reversible actions, supervised runs
- Keep gates for: long tasks, content from external sources, irreversible actions (email send, payments, deletes)
- Confirmation fatigue is real — over-gating teaches users to click through, undoing the benefit
Limitations: There is no clean rule for where the line sits. Teams converge on it by trial and incident review. The honest version: 'we keep gates on for tools that touch external content or take consequential actions, and we keep them off for read-only summarization of trusted documents.'
Every individual defense against prompt injection has been bypassed by some attack: tagged delimiters can be impersonated, system prompts can be overridden, output filters can be evaded, sandboxes can be escaped, human review can be fatigued. The reason the field is not hopeless is that defenses compound — an attacker has to bypass them in sequence, and most attacks fail at one or another layer. The mental shift required: stop searching for the defense that 'solves' prompt injection (none does), and start asking 'what is layer one, two, three, four, and is there at least one layer between every plausible attack and every consequential action?' This is identical to how mature security thinking handles every other class of attack.
- Input layer: limit what content the agent reads at all
- Boundary layer: tag/structure content so the model can distinguish levels (helps, not enough alone)
- Capability layer: restrict what tools an agent can call given what content it has read
- Output layer: validate or filter what the agent is about to do or say
- Gate layer: human confirmation on the highest-consequence actions
- Monitoring layer: log everything; detect drift after the fact even when prevention fails
Limitations: Defense-in-depth costs latency, dev time, and user experience. Over-engineering it for a low-risk workflow is its own failure mode. Layer count should match consequence — a summary tool can run with fewer layers than a tool that can spend money.
Before connecting a tool or granting a permission, ask: 'If a stranger could write whatever they wanted into this surface, what is the worst thing they could get my agent to do?' If the worst case is 'waste my time,' the surface is probably fine. If the worst case is 'exfiltrate my data,' 'send messages on my behalf,' or 'spend money,' the surface needs gates. This is not a substitute for threat modeling — it is the warm-up for it. It works well in design reviews because non-security people can answer it, and it surfaces the indirect-injection cases that pure functional thinking misses.
- Apply at design time, before code: "What is the worst thing a stranger writing into this content could trigger?"
- Apply at install time, before connecting an MCP or plugin: "What is the worst thing the maintainer of this server could trigger if they turned malicious?"
- Apply during incident triage: "What surface did this attack come through, and does that surface have a 'stranger' on the other side?"
- Maps cleanly to the lethal trifecta — if the worst case is bad, you have all three ingredients
- Use as a discussion frame in design reviews, especially with PMs and non-security engineers
Limitations: A blunt heuristic, not a methodology. Misses sophisticated chain attacks where the worst single-step outcome looks benign but multi-step composition produces real harm. Useful as the first filter, not the only one.
Prompt injection is not solved at the model layer — and that is the most useful thing to know
| Mental model | What it helps you do | Where it applies | Limits |
|---|---|---|---|
| Instructions vs. data confusion | Stop expecting a model-layer fix; design at the application layer | Every system using an LLM | Reframes the problem — does not solve it |
| Lethal trifecta | Decide whether a feature adds real risk | Design, code review, install decisions | Heuristic — oversimplifies chains |
| Direct vs. indirect injection | Focus defense on the asymmetric (indirect) attacks | Threat modeling, content sourcing | Boundary between "user content" and "external content" can be fuzzy |
| Agent surface = attack surface | Treat new tool connections like new network ports | Architecture review, tool wiring | Undercounts the value tools provide; not just about minimizing count |
| Trust boundaries | Name where trust ends; tag content accordingly | System design, code structure | Real trust is rarely binary |
| Capability vs. autonomy | Choose the right autonomy level per workflow | Mode settings, UX gates, scheduled tasks | Confirmation fatigue is a real failure mode |
| Defense-in-depth | Stop searching for the single defense; stack them | All defense planning | Layer count should match consequence — not maximalism |
| The Stranger Test | Quick design-review filter | Design and install decisions | Misses multi-step chain attacks |
If you only remember one thing, remember the trifecta
Access to sensitive data + ability to exfiltrate + exposure to untrusted content is the combination that makes prompt injection a real risk. Drop any one ingredient and the attack surface collapses. Use it at design time (does this feature add an ingredient?), at install time (does this MCP add an ingredient?), and at triage time (which ingredient enabled the incident?). Everything else on this page elaborates on or sits underneath this single check.