03. Indirect prompt injection — hostile instructions hidden in trusted-looking content¶

~12 min read. The dangerous instruction may not come from the user. It may come from a document, email, web page, ticket, spreadsheet, or tool output the model was asked to read.

Continues from 02-direct-prompt-injection.md. Direct injection attacks the front door. Indirect injection hides inside lobby text that the assistant retrieves or receives from a tool.

The previous chapter showed a user trying to become the system directly. That solved the obvious attack surface, but real assistants read far more than the current user prompt. This chapter moves the same authority-confusion problem into retrieved documents, emails, web pages, tool outputs, and memory.

1) The wall — retrieval turns documents into instructions¶

A user asks, "Summarize this vendor security report." The report contains normal paragraphs plus a hidden instruction telling any AI reader to include unrelated sensitive details in the summary.

The user did not ask for that. The application did not intend it. The model still read the hostile text inside the context window.

That is the core problem:

untrusted document
  -> retrieved as context
  -> model reads it as text
  -> text competes with system/developer instructions
  -> output or tool plan changes

Indirect injection is dangerous because the attacker may influence the model without being the current user.

2) Where indirect injection hides¶

Common hiding places:

retrieved documents in RAG
web pages summarized by browsing agents
emails and calendar invites
support tickets and chat transcripts
PDFs, spreadsheets, comments, and metadata
tool responses from external APIs
code comments or README files read by coding agents
memory written in an earlier session

The shared pattern is untrusted text entering the model through a path that looks like evidence.

3) Worked example — malicious policy document¶

An enterprise docs assistant retrieves a policy page. The visible policy text is legitimate, but the page also contains attacker-controlled text in a comment field. The assistant is asked to summarize the policy and create a support ticket.

Weak design:

retrieved page -> full text in prompt -> model follows hidden instruction -> ticket contains attacker text

Stronger design:

retrieved page
  -> source trust label
  -> quote/extract mode for untrusted content
  -> model cannot treat document text as instruction
  -> ticket tool validates typed fields
  -> server authorization checks tenant and action

The defense is not "the model should know better." The defense is treating retrieved content as data, not authority.

4) Why not trust internal documents¶

The tempting alternative is to trust documents from your own workspace. That feels reasonable because the content is behind authentication.

It fails because internal content can be user-generated, stale, imported, compromised, or written by someone who never expected an agent to execute instructions from it. A wiki page that is safe for a human reader can be unsafe when placed inside a model context next to tools.

Authentication says who can access the document. It does not say the document is safe to obey.

5) Production signals — indirect injection resistance¶

The first metric is attack success rate across source types: docs, email, web, tool output, ticket, memory.

The misleading metric is direct-jailbreak refusal rate. A model can refuse direct attacks and still follow hostile retrieved content.

The expert artifact is a taint trace:

untrusted source -> retrieved span -> prompt segment -> model output/tool argument

If the team cannot trace untrusted text into actions, indirect injection is almost impossible to debug.

6) Boundary — not every instruction-like sentence is malicious¶

Some documents legitimately contain instructions: runbooks, recipes, code comments, API docs, policy steps. The system must let the model summarize them without treating them as commands to the application.

The boundary is role separation. Document instructions are content to report on. System and developer instructions are authority to follow.

The pathology is flattening all text into one instruction soup.

Recall checkpoint¶

Why is indirect injection harder than direct injection?
Which sources can carry hostile instructions?
Why does authentication not make content safe to obey?
What is a taint trace?

Interview Q&A¶

Q: How do you defend against indirect prompt injection in RAG? A: Treat retrieved content as untrusted data, label source trust, separate content from instructions, constrain tool calls with schemas and authorization, and red-team across document sources.

Common wrong answer to avoid: "Only retrieve trusted internal docs." Internal docs can still be user-generated or compromised.

Q: What is the key artifact for debugging indirect injection? A: A taint trace from source content to prompt segment to output or tool argument.

Common wrong answer to avoid: "Just inspect the final answer." The attack path lives in how content influenced the model.

Q: Why is quoting safer than obeying? A: Quoting treats document text as content. Obeying lets document text act as application authority.

Common wrong answer to avoid: "The model can infer which instructions are real." Role separation should not depend only on inference.

Apply now (10 min)¶

Model the exercise. Draw an indirect injection path from uploaded document to ticket creation and mark the hard controls.

Your turn. Pick one RAG source and decide whether it is trusted content, untrusted content, or privileged configuration.

Reproduce from memory. Explain why retrieved text is evidence, not authority.

What you should remember¶

This chapter explained indirect prompt injection. The important idea is that attackers can place instructions in content the model later reads as evidence.

Carry this diagnostic forward: every RAG or tool-output path needs source labels, role separation, taint traces, and hard tool boundaries.

Remember:

Indirect injection hides in content, not the current user prompt.
Internal documents are not automatically safe to obey.
Retrieved text should be treated as data.
Taint traces make hidden influence debuggable.

Bridge. Indirect injection attacks context authority. Jailbreaks attack policy behavior more directly by pressuring the model to cross refusal boundaries. → 04-jailbreaks-and-policy-pressure.md