Reference Guide

Prompt Injection: Mental Models

The eight mental models that organize every other prompt-injection defense — data vs instructions confusion, the lethal trifecta, direct vs indirect injection, the agent surface as attack surface, trust boundaries, capability vs autonomy, defense in depth, and the Stranger Test.

← Back to Reference Hub

Traditional software draws a hard line between code (instructions the machine executes) and data (content the machine processes). Language models do not. Every token in context is read with the same eye — the user's question, the system prompt, the tool result, the document the agent just opened, the email it just read. The model decides moment-to-moment what to treat as direction and what to treat as content, and an attacker who can write any of that content can try to shift the boundary. This is why prompt injection is not solvable the way SQL injection is solvable: you cannot escape your way out of a problem that exists at the interpretation layer rather than the parsing layer.

  • Every token the model sees is potentially executable as an instruction
  • There is no syntactic marker that reliably distinguishes "this is what the user wants" from "this is content the user wants me to work with"
  • Vendor mechanisms (system prompts, role tags, content tags) raise the bar but do not close the gap
  • The problem worsens as context windows grow — more content means more opportunities for an injected payload to be reached
  • Mitigation must be defense-in-depth at the application layer, not a parser fix at the model layer

Limitations: This framing is uncomfortable for engineers coming from traditional software security, where input sanitization is a solved enough problem. The fact that there is no fix at the model layer often gets read as defeatism — it is not. It is the starting point for understanding which layer the fixes have to live at.

FoundationalMental Model

Prompt injection is dangerous when three properties hold simultaneously: the agent has access to private or sensitive data, the agent can exfiltrate (send data outside the user's control), and the agent is exposed to untrusted content. Drop any one ingredient and the attack surface collapses. A read-only agent on public data has nothing to steal. A sandboxed agent with no network can be tricked but cannot exfiltrate. An agent that only ever sees content the user themselves wrote has no injection channel. This framing is unusually portable: it works for chat assistants, code assistants, browser agents, file-reading agents, and connector-mediated agents alike, and it gives engineers a checklist they can apply during design rather than a vocabulary they have to memorize.

  • Ingredient 1: access to private or sensitive data the attacker wants
  • Ingredient 2: a path for data to leave the trust boundary (network call, tool output, email send, screenshot, anything outbound)
  • Ingredient 3: exposure to content an attacker can shape (web page, email body, document, tool result)
  • Remove any one ingredient and you reduce the attack from "real risk" to "annoying but contained"
  • Use it during design reviews: "Does this feature add ingredient 1, 2, or 3 to a path that already has the other two?"

Limitations: Like any heuristic it oversimplifies. A determined attacker can chain across separate agents (one has trifecta-element-2, another has trifecta-element-1) — the model still applies but you have to draw the trust boundary around the whole system, not a single agent. It also doesn't help with reputational harm where no data is exfiltrated but the agent says something embarrassing publicly.

FoundationalDesign Tool

Direct prompt injection is what most people picture: a user types 'ignore your previous instructions and do X.' This is interesting for jailbreaks but limited as a real-world threat because the only person it harms is the user themselves. Indirect prompt injection is the asymmetric version: an attacker plants instructions in content that the model will later read on behalf of a victim. A poisoned web page, a malicious email, a booby-trapped PDF, a planted Slack message. The victim's own agent then encounters the payload while doing legitimate work. Almost every serious prompt-injection incident in the wild is indirect, because that is the version that lets an attacker target other people.

  • Direct: user types the injection — limited blast radius (themselves)
  • Indirect: attacker plants the injection in content the victim's agent will later read
  • Indirect attacks ride along on legitimate workflows ("summarize my email," "read this PDF")
  • The victim has no awareness of the attack — there is no malicious-looking message to flag
  • The attacker does not need access to the victim's account, only to content the victim's agent will reach

Limitations: The line between direct and indirect blurs in shared workspaces — a colleague's Slack message is technically content from someone else, but most people don't treat it as attacker-controlled. Threat modeling has to define explicitly which sources are trusted, which are not, and where the boundary sits.

FoundationalThreat Pattern

A chat-only LLM has a small attack surface: the user, the system prompt, the conversation history. An agent with tools has a surface equal to everything those tools can reach. Connect Gmail, and every email becomes potential injection content. Connect a browser, and every web page does. Connect a file reader, and every document does. Connect an MCP server, and every tool result returned by that server does. The agent does not need to be jailbroken or social-engineered — it just needs to read content from a surface an attacker controls, in the course of doing its legitimate job. This is the structural reason agents are riskier than chat: not because they are smarter, but because their input surface is dramatically larger.

  • Surface = union of everything every connected tool can reach
  • Adding a tool to an agent adds its entire input surface to the trust boundary
  • Read-only tools still expand the attack surface (input poisoning) even if not the exfil surface
  • Surface grows non-linearly: each new tool also enables new chains with existing tools
  • Treat tool connection decisions with the same gravity as opening a new network port

Limitations: Pure surface-counting underweights the practical value tools provide — an agent with one tool may be safer but also vastly less useful. The real question is whether the surface added is one you can monitor and reason about, not whether to minimize tool count for its own sake.

FoundationalArchitecture

Every agent system has trust boundaries, even if no one has drawn them. The user is one trust level. The system prompt is another. The model's own outputs are a third. Tool results from an internal API differ from tool results from a third-party MCP server. Documents the user uploads themselves differ from documents pulled in from email. If you cannot name the trust levels in your system, you cannot reason about which injection attacks matter and which do not. Drawing the boundaries explicitly is half the work; the other half is making sure each boundary's content is treated according to its level — sanitized, tagged, scoped, or denied capabilities as appropriate.

  • List every source of content the agent reads: user, system, each tool, each connector, each MCP
  • Assign each source a trust level — 'trusted' (system prompt), 'semi-trusted' (user), 'untrusted' (external content)
  • Capabilities granted should reflect trust — untrusted content should never reach high-capability tool calls without a gate
  • Document the boundaries in code comments or design docs so they survive refactors
  • Re-audit when adding tools or connectors — every new source has its own implicit trust level

Limitations: Real trust is rarely binary. A user uploads a document — is that trusted (because the user chose it) or untrusted (because the document might have come from anywhere)? Most teams end up with three or four levels and accept the messiness.

ArchitectureDesign Tool

Per-action confirmation is the most accessible defense against mid-task injection — the user sees the proposed action and can refuse it. Removing that gate ('Act without asking' modes, autonomous loops, scheduled tasks) trades injection resistance for speed. This is a legitimate tradeoff: not every workflow needs the gate, and over-gating creates its own failure mode (confirmation fatigue, where users approve everything to make the prompts go away). The principle: grant the minimum autonomy the workflow actually requires, and keep gates on for anything that touches untrusted content or high-consequence tools.

  • Per-action confirmation is the cheapest defense against in-progress injection
  • Autonomous modes ("Act without asking," scheduled tasks, multi-step agents) bypass that defense
  • Reserve autonomy for: short tasks, trusted content, easily reversible actions, supervised runs
  • Keep gates for: long tasks, content from external sources, irreversible actions (email send, payments, deletes)
  • Confirmation fatigue is real — over-gating teaches users to click through, undoing the benefit

Limitations: There is no clean rule for where the line sits. Teams converge on it by trial and incident review. The honest version: 'we keep gates on for tools that touch external content or take consequential actions, and we keep them off for read-only summarization of trusted documents.'

DefenseMode Setting

Every individual defense against prompt injection has been bypassed by some attack: tagged delimiters can be impersonated, system prompts can be overridden, output filters can be evaded, sandboxes can be escaped, human review can be fatigued. The reason the field is not hopeless is that defenses compound — an attacker has to bypass them in sequence, and most attacks fail at one or another layer. The mental shift required: stop searching for the defense that 'solves' prompt injection (none does), and start asking 'what is layer one, two, three, four, and is there at least one layer between every plausible attack and every consequential action?' This is identical to how mature security thinking handles every other class of attack.

  • Input layer: limit what content the agent reads at all
  • Boundary layer: tag/structure content so the model can distinguish levels (helps, not enough alone)
  • Capability layer: restrict what tools an agent can call given what content it has read
  • Output layer: validate or filter what the agent is about to do or say
  • Gate layer: human confirmation on the highest-consequence actions
  • Monitoring layer: log everything; detect drift after the fact even when prevention fails

Limitations: Defense-in-depth costs latency, dev time, and user experience. Over-engineering it for a low-risk workflow is its own failure mode. Layer count should match consequence — a summary tool can run with fewer layers than a tool that can spend money.

FoundationalDefense

Before connecting a tool or granting a permission, ask: 'If a stranger could write whatever they wanted into this surface, what is the worst thing they could get my agent to do?' If the worst case is 'waste my time,' the surface is probably fine. If the worst case is 'exfiltrate my data,' 'send messages on my behalf,' or 'spend money,' the surface needs gates. This is not a substitute for threat modeling — it is the warm-up for it. It works well in design reviews because non-security people can answer it, and it surfaces the indirect-injection cases that pure functional thinking misses.

  • Apply at design time, before code: "What is the worst thing a stranger writing into this content could trigger?"
  • Apply at install time, before connecting an MCP or plugin: "What is the worst thing the maintainer of this server could trigger if they turned malicious?"
  • Apply during incident triage: "What surface did this attack come through, and does that surface have a 'stranger' on the other side?"
  • Maps cleanly to the lethal trifecta — if the worst case is bad, you have all three ingredients
  • Use as a discussion frame in design reviews, especially with PMs and non-security engineers

Limitations: A blunt heuristic, not a methodology. Misses sophisticated chain attacks where the worst single-step outcome looks benign but multi-step composition produces real harm. Useful as the first filter, not the only one.

Design ToolTriage

Prompt injection is not solved at the model layer — and that is the most useful thing to know

Engineers coming from traditional security keep looking for the sanitization library that fixes prompt injection. There isn't one and won't be one in the short term, because the issue is at the interpretation layer where the model decides what counts as instruction versus content. Accepting this redirects energy from 'find the magic fix' to 'stack defenses at the application layer.' Every page in this module is downstream of that single redirection.