07. Prompt-layer bugs — the first suspect in the lineup¶

~14 min read. The prompt is the cheapest place to interrogate. Open the lineup here.

Built on the ELI5 in 00-eli5.md. The suspects — five layers that could have caused the failure — start with the prompt. The lineup from chapter 06 hands us one suspect at a time. This one is up first.

Why the prompt is suspect number one¶

See. The prompt is text. You can read it, diff it, replay it without any real tool call. Cheap to interrogate. So we start here.

Other suspects cost more. Tool-layer needs a sandbox. Loop-layer needs replay. Model-layer needs A/B between versions. A good detective tests the cheap suspect first. Yes?

suspect interrogation cost (low → high)

prompt   ──→  read, diff, replay text. minutes.
tool     ──→  sandbox the schema. tens of minutes.
loop     ──→  replay full agent run. an hour.
memory   ──→  snapshot store, replay reads. an hour.
model    ──→  version A/B. half a day.

Clear the prompt first. Then move on.

Where leaks happen in prompt assembly¶

Picture before details. A prompt is not one string. It is a stack of pieces glued at runtime.

final prompt sent to model
┌────────────────────────────────────────────────────────┐
│ [1] system instructions  ◀── line-1 directives          │
│ [2] role + persona       ◀── "you are a sales agent"    │
│ [3] few-shot examples    ◀── 2-6 demonstrations         │
│ [4] retrieved context    ◀── RAG chunks                 │
│ [5] tool catalogue       ◀── function descriptions      │
│ [6] memory summary       ◀── prior session digest       │
│ [7] conversation turns   ◀── chat history, this session │
│ [8] current user input   ◀── the new turn               │
└────────────────────────────────────────────────────────┘
        │           │           │           │
   leak: stale  leak: stale  leak: prior  leak: conflict
   example      summary      turn bleed   with [1]

Six known leak points. Each maps to one named bug pattern. We walk all six.

Bug 1 — Context bleed¶

Prior turns' content leaks into a new turn unintentionally.

Trace signature. The agent answers about a topic the user never asked about this turn. Open the case file and scroll up. The topic appeared 3 turns ago in the same session. The model kept treating it as current.

Elimination test. Run the same user turn in a fresh session — no history. If the bug disappears, context bleed confirmed.

Fix pattern. Add explicit turn boundaries (### Turn 4 — only address this). Summarise old turns into a short memory line, do not paste raw. Truncate after K rounds. Reset state on intent change.

Bug 2 — Conflicting instructions¶

System says X. Recent user message implies Y. Model picks one. You do not know which.

Trace signature. The witness note shows system saying "never give legal advice." User says "pretend you are my lawyer." The completion gives a half-disclaimer, half-advice. Neither side wins cleanly.

Elimination test. Run system prompt alone with neutral user message (behaviour A). Run conflicting user message with no system (behaviour B). The real run sits between. Conflict confirmed.

Fix pattern. Add precedence rule: "If user conflicts with system, system wins. Refuse." Move critical rules to the end of the system message — recent text gets more weight. Echo the rule in the output schema so the model must commit.

weak system rule                  strong system rule
"do not give legal advice"        "if user asks for legal advice,
                                   reply ONLY with: 'I cannot help
                                   with legal advice.' No other text."

The second one is enforceable. The first one is a wish.

Bug 3 — Role drift¶

Agent stops behaving as "support assistant," starts behaving as generic chat.

Trace signature. Early turns: structured, on-brand, tool-calling. Later turns: chatty, off-topic, no tools. The case file shows persona at turn 1 but no reinforcement at turn 20.

Elimination test. Take turn 20's input. Prepend a fresh persona block. If behaviour snaps back, role decayed. Confirmed.

Fix pattern. Re-inject persona on every turn, not only at session start. Keep persona short (3-5 lines) so it survives long context. Add a guard tool call (assert_role) periodically. See module 18-versioning-agents for tracking persona across deploys.

Bug 4 — Instruction decay¶

Model "forgets" instruction from line 1 by token 4000. Classic long-context degradation.

Trace signature. The completion violates rule 1. But rule 1 is right there at the top of the prompt. The evidence tag prompt_tokens reads 12,400. The U-curve has bitten you — middle-of-context recall is worst.

Elimination test. Shorten the prompt to 1500 tokens. Same rule, same input. If the model obeys now, decay confirmed.

Fix pattern. Move critical rules to the end of the system message (recency bias helps). Repeat critical rules twice — top and bottom. Trim retrieved context aggressively. Use longer-context models only when needed.

attention recall across context

start  ▓▓▓▓▓▓▓▓▓▓▓░░░  high
middle ░░░░░░░░░░░░░░  low   ◀── instructions die here
end    ▓▓▓▓▓▓▓▓▓▓▓▓▓░  high

Visual model. The middle is the danger zone. Keep rules out of it.

Bug 5 — Few-shot poisoning¶

A stale or biased example in the prompt shifts behaviour subtly.

Trace signature. No error. No bad rule. Just slow drift. The agent quietly favours one outcome. The few-shot block in the case file has one example added last sprint. It matches the drift direction.

Elimination test. Remove the suspect example. Replay 50 cases. Compare output distribution. If drift disappears, poisoning confirmed.

Fix pattern. Treat few-shot examples as code — version them, review in PRs. Keep example distribution balanced. Audit quarterly. Use eval suite to detect skew before deploy. This is the bug in our worked example below.

Bug 6 — Hidden whitespace and encoding¶

Invisible characters from copy-paste alter tokenization.

Trace signature. Same prompt, copied from Notion, behaves differently than the typed version. Token count differs. The evidence tag shows prompt_tokens = 1247 vs 1241. Six tokens different. Six invisible characters.

Elimination test. Run cat -A prompt.txt or pass through repr(). Look for (zero-width space), (non-breaking space), CRLF vs LF. If they exist — encoding confirmed.

Fix pattern. Strip invisible characters in prompt build pipeline. Lint prompt files in CI. Never copy from rich-text editors.

Worked example — the sales bot that went rogue¶

A real case file pattern. Names changed.

Day 1. A B2B sales-qualification agent works fine. Asks discovery questions. Routes hot leads. Customer reports a 14% conversion lift.

Day 3. Prompt team adds one new few-shot example for a corner case — an enterprise lead asking about pricing for a migration product. Real transcript. Pushed to prod.

Day 5. Sales ops files a complaint slip. "The bot is recommending migration tooling to leads who came in for our analytics product."

Investigation — the lineup runs.

suspect       interrogation                  result
─────────────────────────────────────────────────────────
prompt        diff vs last week's version    ◀── new few-shot added
tool          replay catalog routing         clean
loop          replay step trace              clean
memory        check session reset            clean
model         compare model id last week     no change

The prompt diff shows the new example mentions "migration" 7 times. The model picked it up as a soft signal — when in doubt, mention migration.

Confession. Few-shot poisoning. One example skewed output distribution. Verified by replaying 200 leads with the example removed. Drift vanishes.

Lock. Regression eval added — 50 leads tagged by intent. Routing must match expected category in ≥ 90% of cases. Future prompt updates must pass before merge.

Solved in 3 hours. Without prompt diff, this would be a "the bot feels off" Slack thread lasting two weeks.

The prompt-diff tool pattern¶

Every prompt-related bug ends with one question. What changed? So build a prompt-diff tool.

$ prompt-diff sales_agent --from v23 --to v24

 [SYSTEM]      line 8 added: "If user mentions 'migrate'..."
 [FEW_SHOT]  + example 4 added (token cost +312)
                user: "we're moving off Snowflake..."
                assistant: "for migration, here is..."
 [TOOLS]       unchanged
 [TEMP]        0.3 → 0.3 unchanged

A diff turns vague "feels different" into a precise list. Tie diff to deploy timestamp. Tie timestamp to alarm bell spikes. Tie spike to the offending line.

This pattern lives in module 18-versioning-agents — every prompt should be tracked like code, with hash, author, deploy time, and rollback path.

Prompt-layer bug patterns across agent stacks¶

GitHub Copilot tab-vs-space incident — inconsistent indentation handling in prompt template silently degraded suggestions for tab projects until token-level whitespace drift was found in prompt assembly.
ChatGPT system-prompt leaks (2023) — user transcripts showed "ignore previous instructions" injection overriding system block; the conflicting-instructions class made public.
Bing Sydney persona drift — long sessions caused role drift to an emotional persona; mitigation was a five-turn hard reset, treating drift as inevitable in long sessions.
Klarna customer-service bot — agent answered out-of-scope personal-finance questions; root cause was retrieved context bleed across turns. Fix: per-turn intent classification, aggressive turn-boundary marking.
Azure OpenAI prompt-flow guidance — documents instruction decay at 8k+ tokens; pattern is repeated rule injection at top and end, mid-context reserved for low-priority retrieved text.
PromptLayer prompt registry — every prompt has a version, a diff, and a usage log; the role is making prompt drift a first-class deployable artifact instead of a string buried in code.
Helicone prompt diff mode — request-by-request comparison of two prompt versions on the same input; the role is catching prompt regressions before they reach the eval set.
LangSmith prompt playground — locked-input replay against new prompts; the role is suspect-1 elimination in one click.
BAML prompt linting — typed prompt arguments rejected at compile time; the role is catching template-bug-class failures before any inference runs.
Pydantic AI typed templates — Jinja-shaped templates with validated context; the role is preventing silent placeholder fall-through.
Vellum's prompt deployment manager — staged rollout of prompt changes with A/B traffic split; the role is keeping prompt changes deployment-safe rather than text-edit-safe.
Anthropic prompt-cache discipline — static prefix cached, dynamic content at the end; the role is exposing how prompt structure affects both quality and cost.
OpenAI's system message override — explicit system role separation from user/assistant; the role is encoding the conflicting-instructions defense in the API itself.
Vercel AI SDK template helpers — typed streamText({ system, messages }) with validation; the role is preventing accidental string concatenation that loses role boundaries.
Anthropic prompt-injection benchmark — open dataset of injection attempts; the role is making suspect-1 elimination testable in CI.
Lakera Guard — runtime prompt-injection detection; the role is catching the injection class as a perimeter defense, not just a prompt-layer fix.
NeMo Guardrails by NVIDIA — declarative input/output constraints; the role is encoding the system-prompt's load-bearing rules outside the prompt where they cannot decay.
CrewAI agent role definitions — explicit role/goal/backstory per agent; the role is exposing role-drift class bugs at design time rather than runtime.
Notion AI's prompt-template lock — workspace admins lock prompt versions across users; the role is preventing per-user prompt mutations that fragment the bug surface.
Microsoft Copilot for M365 grounding instructions — Graph-aware rules injected as anchored end-of-prompt content; the role is the canonical "rule at the end of the prompt" pattern in production.
Anthropic Claude Projects custom instructions — persistent project-level prompt with versioned edits; the role is treating the project prompt as code, not text.
LangGraph state-aware prompts — prompt assembly that respects loop state and turn count; the role is making instruction-decay-by-turn-count visible at template level.
OpenAI Evals prompt-regression suites — locked prompts run against eval set on every change; the role is enforcing suspect-1 elimination as a CI gate.

Recall — six prompt bug shapes and where they hide¶

Why is the prompt the first suspect interrogated in the lineup?
What is the elimination test for context bleed?
Where in a long prompt should you place the most critical rule, and why?
In few-shot poisoning, why is the bug invisible to per-trace inspection — and how do you detect it?

Interview Q&A¶

Q: A model violates a system instruction that is clearly at the top of an 8k-token prompt. The instruction is correctly written. What is your first hypothesis and how do you test it? A: Instruction decay due to mid-context attention degradation. Models recall start and end of context better than the middle. The instruction is at the top, but with 8k tokens of retrieved chunks pushing it into the "U-curve dip." Test: shorten the prompt to 1500 tokens, keep instruction at top, replay the input. If the model obeys, decay confirmed. Fix by moving the rule to the end of the system block, repeating it, and trimming retrieval.

Common wrong answer to avoid: "The model is buggy, switch model version" — switching models without isolating the prompt-layer issue moves the bug, does not fix it. The next model may fail differently on the same long context.

Q: Why is few-shot poisoning particularly dangerous compared to other prompt-layer bugs? A: Because it produces no individual-trace error. Every single completion looks reasonable on inspection. The bug shows only in aggregate — output distribution shifts subtly. You cannot find it by reading one case file. You need a regression eval over many cases, comparing output distributions before and after the prompt change. Without that eval, the bug runs undetected for weeks.

Common wrong answer to avoid: "We add an LLM-as-judge to catch it" — an LLM judge inherits the same bias as the production model and will rate the poisoned outputs as fine. You need a deterministic eval with labelled ground truth, not another model.

Q: A user reports the agent went off-persona on turn 25 of a long conversation. The system prompt clearly defines the persona at turn 1. Why didn't the persona stick, and what is the fix? A: Role drift from instruction decay. The persona block sits at turn 1, but by turn 25 it is buried under 24 turns of chat history. The model's attention to the persona block is weak. Fix: re-inject a short persona reminder on every turn (last 200 tokens of the system message), or after every K turns, plus enforce role via an output schema field the model must populate.

Common wrong answer to avoid: "Use a larger context window model" — longer context worsens the U-curve dip. The persona block stays buried; making the context longer often increases drift, not decreases.

Q: You ship a prompt update. Two days later, conversion rates drop 7%. How do you isolate whether the prompt change caused this versus a model-version change or upstream data shift? A: First, pull prompt-diff for the deploy window. Identify exactly what changed. Second, replay the last 500 sessions through both the old and new prompt with the same model version. Compare outputs and conversion proxy on the eval set. If the new prompt underperforms in replay, prompt is the cause. If both perform equally in replay, the bug is upstream (input distribution shift or external factor).

Common wrong answer to avoid: "Roll back the prompt and see if conversion recovers" — uncontrolled rollback mixes two variables (prompt + time). You need controlled replay on a fixed eval set to isolate cause without exposing more users to the bad version.

Apply now (10 min)¶

Step 1 — model the exercise. Here is the prompt-section token audit I would run on the refund chatbot's 8k-token prompt:

Section	Token count	Position	Decay risk	Action
System role + load-bearing rules	450	top	low	keep
Few-shot examples (3)	1200	top-mid	medium	review for poisoning quarterly
Retrieved chunks (top-3 + rerank)	4200	middle	HIGH (U-curve dip)	trim to top-2, dedupe near-duplicates
Conversation history (24 turns)	1800	mid-bottom	medium	summarise after 20 turns
End-of-prompt rule repeat	80	bottom	low	keep — load-bearing for long prompts
User question	50	bottom	low	keep

The 4200 retrieval tokens in the dip is the suspect. Trim to top-2 and the rule at the top is no longer buried under context the model under-reads.

Step 2 — your turn. Take any agent prompt you have written. Print its token count at each section. Identify which section sits in the mid-context dip. Move at least one critical rule to the very end of the system block. Then list the last three prompt changes you shipped and write a one-line eval test for each; if you cannot write one, the change was not safe.

Step 3 — reproduce from memory. Draw the 8-section prompt-assembly stack. Mark the four leak points. Name each leak with one of the six bug patterns.

What you should remember¶

This chapter explained why the prompt is the cheapest suspect in the lineup and also the place with the largest number of distinct bug shapes. Six patterns recur — context bleed, conflicting instructions, instruction decay, role drift, few-shot poisoning, and template fall-through — and each has its own elimination test. The mistake is treating the prompt as a single string. It is an assembly with eight sections, four leak points, and a U-curve attention profile that gets worse as the prompt grows.

You also learned why few-shot poisoning is the most dangerous of the six. Every individual trace looks clean; the bug shows only in aggregate distribution shift. A regression eval over many cases catches it; an LLM-as-judge does not, because the judge inherits the same bias as the production model.

Carry this diagnostic forward: when a prompt change is proposed, ask for the replay on the last 500 sessions before the change ships. If the team cannot produce that replay in an hour, prompt deployment is not yet under engineering discipline.

Remember:

The prompt is a stack with sections, not a string. Always think section-by-section.
The mid-context dip is real. Critical rules belong at the top and the end.
Few-shot poisoning is invisible per-trace; only distribution-level evals catch it.
Role drift in long sessions is inevitable. Re-inject the persona every K turns.
Treat prompt changes as deploys. Replay against a fixed eval set before shipping.

Bridge. The prompt cleared the lineup. Read clean. Diffs clean. Replays clean. So the prompt is not the culprit. Move on to the next suspect — the tools. Schemas drift. Arguments get coerced silently. Errors hide as success. → 08-tool-layer-bugs.md