05. Data exfiltration and secrets: the model must not become a leak path
~12 min read. The model may need access to private context to help the user. That does not mean the user should be able to extract any context the model sees.
Continues from 04-jailbreaks-and-policy-pressure.md. Jailbreaks test policy boundaries. Exfiltration tests whether sealed envelopes stay sealed.
The previous chapter treated refusal as useful but probabilistic behavior. That matters for unsafe content, but confidentiality creates a different pressure: the model may need to read private context without being allowed to reveal it. This chapter moves from "will it answer?" to "what can it leak?"
1) The wall: access for reasoning is not permission to reveal
An assistant summarizes a customer account. The prompt includes internal notes, retrieved documents, tool outputs, and maybe hidden system instructions. The user asks a clever question that tries to make the assistant reveal those hidden details.
The security boundary cannot be "the model saw it, so it may say it."
model may read for task
≠ model may reveal to user
≠ model may use in tool arguments
≠ model may store in memory
The sealed-envelope pattern says some data can influence safe reasoning without being directly disclosed.
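The four permissions above can be sketched as independent grants on each piece of context. This is a minimal illustration, not a prescribed design: the `Permission` flags, `ContextItem`, and `allowed` are hypothetical names introduced here, and a real system would enforce these checks in the serving layer rather than trust the model to honor them.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Permission(Flag):
    """Separate grants; holding one never implies another."""
    NONE = 0
    READ_FOR_TASK = auto()
    REVEAL_TO_USER = auto()
    USE_IN_TOOL_ARGS = auto()
    STORE_IN_MEMORY = auto()


@dataclass(frozen=True)
class ContextItem:
    name: str
    value: str
    grants: Permission


def allowed(item: ContextItem, action: Permission) -> bool:
    """True only if this exact action was granted for the item."""
    return action in item.grants


# Hypothetical example: the note may inform reasoning and nothing else.
note = ContextItem(
    name="internal_note",
    value="Customer disputed two charges last quarter.",
    grants=Permission.READ_FOR_TASK,
)

print(allowed(note, Permission.READ_FOR_TASK))   # True
print(allowed(note, Permission.REVEAL_TO_USER))  # False
```

Modeling the grants as separate flags makes the default explicit: an item that is readable is still not revealable, storable, or usable in tool arguments until someone grants that on purpose.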
2) What can be exfiltrated
Sensitive AI-system data includes:
- system prompts and hidden policies
- API keys, tokens, and credentials
- private customer documents
- another tenant's retrieved chunks
- tool outputs beyond user authorization
- internal notes, risk scores, or moderation labels
- memory values from other users or contexts
- chain-of-thought or private reasoning traces when stored
- logs and traces containing prompt data
The defense starts by not placing unnecessary secrets in the model context. What the model never sees, it cannot leak in output.
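One way to keep stray credentials out of context is to scan every chunk before the prompt is assembled and drop anything that trips a rule. The sketch below assumes two illustrative patterns (the AWS access key ID prefix format and a PEM private-key header); `SECRET_PATTERNS`, `find_secrets`, and `assemble_prompt` are hypothetical names, and a production scanner would need a vetted, much larger rule set.

```python
import re

# Illustrative rules only; real deployments need a maintained rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of every rule that matches the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def assemble_prompt(system: str, context_chunks: list[str], user_msg: str) -> str:
    """Build the prompt from chunks that pass the scan.

    Chunks with any hit are dropped whole rather than partially redacted,
    because a matched header may be followed by secret material the
    pattern itself does not cover. The user message is not scanned here.
    """
    safe_chunks = [chunk for chunk in context_chunks if not find_secrets(chunk)]
    return "\n\n".join([system, *safe_chunks, user_msg])
```

Pattern scanning catches only secrets that look like secrets. It complements, and does not replace, deciding up front which sources are allowed to feed the prompt.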
3) Worked example: support account notes
A support assistant can read internal notes to decide whether to escalate a customer. The user should not see those notes verbatim.
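Weak design:
raw internal note pasted into prompt
-> system prompt says "do not reveal"
-> model output returned to user unchecked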
Stronger design:
internal note stored behind policy
-> tool returns only allowed fields
-> prompt labels private evidence
-> output policy blocks verbatim disclosure
-> audit camera records attempted reveal
The strongest design avoids putting raw secrets in the prompt at all. Use derived signals when possible: for example, pass the model a yes/no escalation flag instead of the note text.
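A derived signal can be sketched as a tool that returns an allowlisted view plus a computed flag, so the raw note and score never enter the prompt. Everything here is an assumption for illustration: the `AccountRecord` fields, the `ALLOWED_FIELDS` allowlist, and the 0.7 threshold are hypothetical.

```python
from dataclasses import asdict, dataclass

# Hypothetical allowlist: the only raw fields the model may see.
ALLOWED_FIELDS = {"account_id", "plan", "open_tickets"}


@dataclass(frozen=True)
class AccountRecord:
    account_id: str
    plan: str
    open_tickets: int
    internal_note: str   # never leaves the server
    risk_score: float    # never leaves the server


def account_view_for_model(record: AccountRecord, escalation_threshold: float = 0.7) -> dict:
    """Return allowlisted fields plus a derived boolean, not the raw inputs."""
    view = {k: v for k, v in asdict(record).items() if k in ALLOWED_FIELDS}
    # Derived signal: the model learns the outcome, not the score or the note.
    view["needs_escalation"] = record.risk_score >= escalation_threshold
    return view
```

The allowlist fails closed: a field added to the record later stays hidden until someone adds it to `ALLOWED_FIELDS` deliberately. Note that a derived signal still leaks one bit about the underlying data, so it deserves its own disclosure decision.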
4) Why not rely on "do not reveal secrets" instructions
The tempting alternative is a system prompt rule: "Never reveal confidential information."
That rule is necessary but weak by itself. It depends on the model recognizing every secret, every paraphrase, every authorization context, and every adversarial framing.
Harder controls include:
- secret scanning before prompt assembly
- least-privilege retrieval
- tenant-scoped tools
- field-level filtering
- output data-loss checks
- trace redaction
- no credentials in prompts
- server-side authorization on every data fetch
The model should not carry the burden alone.
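Several of these controls meet in the retrieval path. The sketch below shows tenant scoping enforced on the server before any matching happens, under stated assumptions: `Session`, `Chunk`, and `retrieve` are hypothetical names, and the substring match is a stand-in for a real search backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user_id: str
    # Set by the server from the authenticated session, never from model output.
    tenant_id: str


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    doc_id: str
    text: str


def retrieve(session: Session, query: str, index: list[Chunk]) -> list[Chunk]:
    """Filter by tenant first so other tenants' chunks are never candidates."""
    scoped = [chunk for chunk in index if chunk.tenant_id == session.tenant_id]
    # Placeholder relevance check; a real system would query a search index
    # with the tenant filter applied inside the query itself.
    return [chunk for chunk in scoped if query.lower() in chunk.text.lower()]
```

The important detail is where `tenant_id` comes from. If the model or the user can supply it as a tool argument, the filter is decoration; it has to be bound to the authenticated session on every fetch.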
5) Production signals: exfiltration risk
The first metric is the sensitive-field exposure rate: the fraction of outputs and traces in which a sensitive field value appears.
The misleading metric is prompt secrecy. Hiding system prompts helps, but the bigger risk may be customer data or tool outputs.
The expert signal is asset-level testing: for each sensitive field, can any prompt, retrieved doc, tool output, memory path, or log surface expose it?
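Asset-level testing can be sketched with canary values: plant a unique marker in each sensitive field, run a set of attack prompts against every surface, and report how often each marker comes back. The harness below is a hedged illustration; `Surface`, `exposure_report`, and the shape of the inputs are assumptions, and exact-match detection will miss paraphrased or encoded leaks.

```python
from typing import Callable

# A surface takes an attack prompt and returns the text that surface emitted
# (chat output, a trace line, a memory dump, and so on).
Surface = Callable[[str], str]


def exposure_report(
    canaries: dict[str, str],
    surfaces: dict[str, Surface],
    attacks: list[str],
) -> dict[tuple[str, str], float]:
    """Return the exposure rate for every (field, surface) pair."""
    if not attacks:
        raise ValueError("at least one attack prompt is required")
    report: dict[tuple[str, str], float] = {}
    for surface_name, run in surfaces.items():
        outputs = [run(attack) for attack in attacks]
        for field, canary in canaries.items():
            leaked = sum(canary in output for output in outputs)
            report[(field, surface_name)] = leaked / len(outputs)
    return report
```

Any nonzero cell names both the asset and the path that exposed it, which is more actionable than a single aggregate leak rate.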
6) Boundary: transparency versus confidentiality
Some products need transparency. Users may deserve to know why a decision was made. But transparency should reveal policy, citations, and allowed evidence, not raw secrets or another tenant's data.
The pathology is confusing explainability with disclosure. A system can explain a decision without dumping the sealed envelope.
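Explaining without dumping the envelope can be sketched as building the explanation only from evidence marked disclosable. The `Evidence` type and `explain` function are hypothetical, and whether to mention that withheld factors exist at all is a product decision this sketch merely assumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    label: str
    detail: str
    disclosable: bool  # decided by policy, not by the model


def explain(decision: str, policy_ref: str, evidence: list[Evidence]) -> str:
    """Compose a user-facing explanation from policy and allowed evidence only."""
    shown = [item for item in evidence if item.disclosable]
    withheld = len(evidence) - len(shown)
    lines = [f"Decision: {decision}", f"Policy: {policy_ref}"]
    lines += [f"- {item.label}: {item.detail}" for item in shown]
    if withheld:
        # Acknowledge that more was considered without describing it.
        lines.append(f"- {withheld} additional internal factor(s) were considered but cannot be shared.")
    return "\n".join(lines)
```

Because the withheld items are filtered out before the text is composed, there is nothing for a clever follow-up question to extract from this output.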
Recall checkpoint
- Why is model visibility not disclosure permission?
- Which data types can be exfiltrated?
- Why should secrets be kept out of prompts when possible?
- How can transparency and confidentiality coexist?
Interview Q&A
Q: How do you prevent an AI assistant from leaking secrets? A: Minimize secrets in context, enforce server-side authorization, filter fields, scope retrieval, scan outputs, redact traces, and test asset-level exfiltration paths.
Common wrong answer to avoid: "Tell the model not to reveal secrets." Instruction is one layer, not a boundary.
Q: What is the sealed-envelope pattern? A: Data may influence allowed reasoning without being directly revealed, stored, or used outside its authorized purpose.
Common wrong answer to avoid: "If the model needs it, the user can see it." Reading for reasoning and disclosing to the user are different permissions.
Q: Why should API keys never be placed in model context? A: Any text in context can potentially be echoed, transformed, logged, or used in unintended tool arguments.
Common wrong answer to avoid: "The system prompt will keep them hidden." Prompt rules are probabilistic.
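Keeping keys out of context usually means the tool executor, not the model, attaches the credential. The sketch below assumes a hypothetical `lookup_order` tool, a hypothetical `ORDERS_API_KEY` environment variable, and a placeholder URL; it builds the outbound request without sending it.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict  # produced by the model; must never carry credentials


def build_request(call: ToolCall, env: Mapping[str, str] = os.environ) -> dict:
    """Attach the credential server-side, after the model has finished."""
    if call.name != "lookup_order":
        raise ValueError(f"unknown tool: {call.name}")
    api_key = env["ORDERS_API_KEY"]  # hypothetical variable name
    return {
        "url": "https://orders.internal.example/lookup",  # placeholder endpoint
        "headers": {"Authorization": f"Bearer {api_key}"},
        "json": {"order_id": str(call.arguments["order_id"])},
    }


def result_for_model(response_body: dict) -> dict:
    """Pass back only allowlisted fields so tokens in responses stay out of context."""
    return {key: response_body[key] for key in ("order_id", "status") if key in response_body}
```

The model sees the tool name, its own arguments, and the filtered result. The key exists only in the executor's process, so no prompt can make the model echo it.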
Apply now (10 min)
Model the exercise. List every sensitive field in a support assistant and mark whether the model needs the raw value, a derived signal, or no access.
Your turn. Pick one AI workflow and write an exfiltration test for each asset category.
Reproduce from memory. Explain the difference between access for reasoning and permission to reveal.
What you should remember
This chapter explained exfiltration and secrets. The important idea is that AI systems blur read access and output generation, so confidentiality must be enforced outside model obedience.
Carry this diagnostic forward: never put a secret in context unless you can explain why the model must see it and what prevents disclosure.
Remember:
- Model visibility is not disclosure permission.
- Keep raw secrets out of prompts when possible.
- Server-side authorization remains mandatory.
- Explainability does not require dumping sealed envelopes.
Bridge. Data leakage is dangerous, but many AI systems can do more than speak. Next we secure the action layer: tools. → 06-tool-abuse-and-action-boundaries.md