06. The layer-isolation lineup: five suspects, one at a time
~14 min read. The spine of agent debugging. Cheap eliminations first, expensive ones last.
Built on the ELI5 in 00-eli5.md. The lineup — the systematic march past five suspects — is how we go from a reproduced failure to a verified confession without guessing.
The picture before the protocol
See. You have a reproduced bug. You have the case file open. The agent did the wrong thing. Now what?
Wrong move — stare at the final answer and guess. Right move — walk the lineup.
There are exactly five suspects in any agent crime. Always the same five. Always in the same order.
reproduced bug
       │
       ▼
┌─────────────┐
│  1. PROMPT  │── eliminate? ──→ guilty → STOP (chapter 07)
└──────┬──────┘ innocent
       ▼
┌─────────────┐
│  2. TOOL    │── eliminate? ──→ guilty → STOP (chapter 08)
└──────┬──────┘ innocent
       ▼
┌─────────────┐
│  3. LOOP    │── eliminate? ──→ guilty → STOP (chapter 09)
└──────┬──────┘ innocent
       ▼
┌─────────────┐
│  4. MEMORY  │── eliminate? ──→ guilty → STOP (chapter 10)
└──────┬──────┘ innocent
       ▼
┌─────────────┐
│  5. MODEL   │── eliminate? ──→ guilty → STOP (chapter 11)
└─────────────┘
cheap ────────────────────→ expensive
Top of the lineup is the prompt. Bottom is the model. Why this order? Because we walk from cheap-to-eliminate to expensive-to-eliminate.
Swapping a prompt — five minutes. Swapping the model — a day of evals. So we start where elimination is fast.
The lineup protocol: one question for each suspect
For every suspect, you ask one and only one question:
If I REMOVE this suspect from the picture, does the bug still appear?
If yes — innocent, move on. If no — guilty, stop. The confession is at this layer.
Simple, no? But discipline matters. People skip ahead on a hunch — "must be the model, the new version came out yesterday." They jump to suspect 5. Three days later, the bug is still there. The actual cause was suspect 1, a stray sentence that conflicted with a tool description. Walk the lineup. In order. Always.
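The protocol above can be sketched as a short loop. This is a minimal illustration, not a framework API: `walk_lineup` and the `bug_persists_without` mapping are hypothetical names, and each callable is assumed to re-run the failing case with one suspect removed.

```python
from typing import Callable, Optional

# The fixed lineup order from this chapter: cheapest elimination first.
LINEUP = ["prompt", "tool", "loop", "memory", "model"]


def walk_lineup(bug_persists_without: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first guilty layer, or None if every suspect is cleared.

    bug_persists_without[name]() re-runs the failing case with that one
    suspect removed and returns True if the bug still appears.
    """
    for suspect in LINEUP:
        if not bug_persists_without[suspect]():
            return suspect  # removing it cured the bug: guilty, stop here
    return None


# Toy outcomes, hard-coded for illustration; real tests re-run the agent.
outcomes = {
    "prompt": lambda: True,   # bug still there with the minimal prompt
    "tool": lambda: True,     # bug still there with the mocked tool
    "loop": lambda: False,    # bug gone in single-call mode
    "memory": lambda: True,
    "model": lambda: True,
}
culprit = walk_lineup(outcomes)
print(culprit)  # loop
```

Note that the memory and model tests are never invoked here. The walk stops at the first confession, which is the whole point of ordering by cost.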
Suspect 1: the prompt
The first suspect is the prompt. The system message. The role. The instructions. The few-shot examples.
How to eliminate? Strip it down. Replace your 800-word system prompt with the bare minimum needed to make the task make sense.
ORIGINAL PROMPT             MINIMAL PROMPT
"You are a careful          "You are a refund agent.
refund agent. Always        Use the tools to issue
verify identity. Never      refunds when asked."
issue refunds over $500
without approval. Use
polite tone. Always         (75 words removed)
explain your reasoning.
..."
Re-run the failing input with the minimal prompt. Same bug? Prompt is innocent. Different behavior? Prompt is guilty — chapter 07 walks you through which sentence betrayed you.
Cost: 5 minutes. Cheapest eliminator. Always first.
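As a rough sketch, the suspect-1 test is one re-run and one comparison. Here `run_agent`, `fake_agent`, and `is_bug` are hypothetical stand-ins for your own harness, not any framework's API.

```python
from typing import Callable


def prompt_is_guilty(
    run_agent: Callable[[str, str], str],
    minimal_prompt: str,
    failing_input: str,
    is_bug: Callable[[str], bool],
) -> bool:
    """Re-run the failing input under the minimal prompt.

    Returns True when the bug disappears, i.e. the original prompt is guilty.
    """
    output = run_agent(minimal_prompt, failing_input)
    return not is_bug(output)


# Hypothetical stand-in agent: it misbehaves only when the prompt
# contains a rule that conflicts with the task.
def fake_agent(system_prompt: str, user_input: str) -> str:
    if "never issue refunds" in system_prompt.lower():
        return "refused"
    return "refund issued"


MINIMAL = "You are a refund agent. Use the tools to issue refunds when asked."
guilty = prompt_is_guilty(
    fake_agent,
    MINIMAL,
    "Refund $50 on order o_42",
    is_bug=lambda out: out == "refused",
)
print(guilty)  # True
```

Keeping `is_bug` as an explicit predicate forces you to write down what "same bug" means before the re-run, so the verdict is not decided by eyeballing the output.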
Suspect 2: the tool
The second suspect is the tool. The function. Its schema. Its actual implementation. The way arguments are coerced. The way errors come back.
How to eliminate? Mock it. Replace the real tool with a stub that returns a known-good response.
import stripe

# real tool: calls the payment provider
def issue_refund(order_id, amount):
    return stripe.Refund.create(...)

# mocked for the lineup: same signature, known-good response, no network call
def issue_refund(order_id, amount):
    return {"status": "ok", "refund_id": "re_test_123",
            "amount": amount}
Now run again. Same bug? Tool is innocent — the agent was doing the wrong thing even when the tool was perfect. Different behavior? Tool was guilty — chapter 08 covers schema drift, argument coercion, silent error swallowing.
Cost: 15-30 minutes for a clean mock. Second cheapest.
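A mock is more useful when it also records what the agent asked for. The sketch below assumes a hypothetical `RecordingMock` class; the response shape simply mirrors the stub above.

```python
from typing import Any


class RecordingMock:
    """Stub tool: returns a known-good response and logs every call."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def issue_refund(self, order_id: str, amount: float) -> dict[str, Any]:
        # Record the arguments so the debugger can inspect them afterwards.
        self.calls.append({"order_id": order_id, "amount": amount})
        return {"status": "ok", "refund_id": "re_test_123", "amount": amount}


mock = RecordingMock()
result = mock.issue_refund("o_42", 500)
print(mock.calls)  # [{'order_id': 'o_42', 'amount': 500}]
```

If the recorded arguments are already wrong, the tool is innocent and the bug is upstream of the call.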
Suspect 3: the loop
The third suspect is the control flow. The ReAct loop. The plan-execute scaffold. Retries. Stopping conditions. Max-step limits.
How to eliminate? Bypass the loop. Run the agent as a single call. One prompt, one model output, one tool call, done. No iteration.
LOOP MODE                     SINGLE-CALL MODE
plan → tool → observe →       prompt + tools →
plan → tool → observe →       one response with
plan → final                  one tool_use block
                              (no second turn)
If the bug still appears with no loop, the loop is innocent. If the bug only appears with the loop running, the loop is guilty — chapter 09 unpacks runaway loops, premature stops, and oscillation.
Cost: 1-2 hours. You usually need to write a small harness that calls the model once with no iteration.
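A single-call harness can be very small. In this sketch `call_model` is an injected, hypothetical function standing in for one provider SDK call, and the `tool_calls` response shape is an assumption, not any vendor's actual format.

```python
from typing import Any, Callable

ModelFn = Callable[[str, list[dict[str, Any]], str], dict[str, Any]]


def single_call(
    call_model: ModelFn,
    system_prompt: str,
    tools: list[dict[str, Any]],
    user_input: str,
) -> dict[str, Any]:
    """Call the model exactly once and return its first proposed tool call.

    No retries, no output-parsing layer, no second turn.
    Returns an empty dict when the model proposes no tool call.
    """
    response = call_model(system_prompt, tools, user_input)
    tool_calls = response.get("tool_calls", [])
    return tool_calls[0] if tool_calls else {}


# Hypothetical stand-in for a real model call; the response shape is assumed.
def fake_model(system_prompt: str, tools: list[dict[str, Any]], user_input: str) -> dict[str, Any]:
    return {
        "tool_calls": [
            {"name": "issue_refund", "args": {"order_id": "o_42", "amount": 50}}
        ]
    }


first = single_call(
    fake_model,
    "You are a refund agent.",
    [{"name": "issue_refund"}],
    "Refund $50 on order o_42",
)
print(first["args"]["amount"])  # 50
```

Passing the model function in as an argument keeps the harness outside the agent framework, so no hidden control flow comes along for the ride.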
Suspect 4: memory
The fourth suspect is memory. The conversation history. Cross-session state. Retrieved documents. Embeddings cache. User preferences pulled from a DB.
How to eliminate? Clear everything. Run the failing case in a fresh session, with no prior turns, no retrieved context, no user profile.
WITH MEMORY                     CLEAN SLATE
- 12 prior turns                - 0 prior turns
- 5 retrieved docs              - 0 retrieved docs
- user_pref={"vip":true}        - no user_pref
- session_summary="..."         - no summary
Bug remains? Memory is innocent. Bug disappears? Memory is guilty — chapter 10 covers stale state, cross-session leakage, retrieval drift, embedding staleness.
Cost: 30 minutes if your harness supports it cleanly. Several hours if memory is tangled into your agent loop.
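The clean-slate run is easiest when every memory source lives in one context object. A minimal sketch, assuming a hypothetical `AgentContext` whose fields mirror the four memory sources in the comparison above:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class AgentContext:
    """Everything the agent sees for one run (hypothetical structure)."""

    user_input: str
    prior_turns: tuple[str, ...] = ()
    retrieved_docs: tuple[str, ...] = ()
    user_pref: dict[str, Any] = field(default_factory=dict)
    session_summary: str = ""


def clean_slate(ctx: AgentContext) -> AgentContext:
    """Keep only the failing input; drop every memory source."""
    return AgentContext(user_input=ctx.user_input)


with_memory = AgentContext(
    user_input="Refund $50 on order o_42",
    prior_turns=tuple(f"turn {i}" for i in range(12)),
    retrieved_docs=tuple(f"doc {i}" for i in range(5)),
    user_pref={"vip": True},
    session_summary="earlier conversation summary",
)
fresh = clean_slate(with_memory)
print(len(fresh.prior_turns), len(fresh.retrieved_docs), fresh.user_pref)  # 0 0 {}
```

If your agent reads memory from several places instead of one object, that tangle is what pushes this test from 30 minutes to several hours.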
Suspect 5: the model
The fifth and final suspect is the model itself. The version. The temperature. The provider. The system fingerprint.
How to eliminate? Swap. Claude Sonnet → Claude Opus. GPT-4o → GPT-4.1. Same family, different tier or version. Or even cross-vendor (Claude → GPT), if the tool schemas allow.
Bug persists across all swaps? Model is innocent — the cause is elsewhere, you missed it earlier. Bug only appears on the original model? Model is guilty — chapter 11 covers version regressions, capability cliffs, refusals.
Cost: a full day of evals to know the swap is fair. The most expensive eliminator. Last in the lineup for a reason.
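The verdict logic can be written down so nobody declares a confession too early. This sketch assumes a hypothetical `model_verdict` helper and placeholder swap names. It treats a single swap as inconclusive, since one swap is only one data point.

```python
def model_verdict(bug_on_original: bool, bug_on_swaps: dict[str, bool]) -> str:
    """Classify the suspect-5 test from per-model reproduction results.

    bug_on_swaps maps each alternate model to True if the bug reproduced on it.
    """
    if not bug_on_original:
        return "not reproduced"
    if len(bug_on_swaps) < 2:
        return "inconclusive"  # one swap is one data point
    if all(bug_on_swaps.values()):
        return "innocent"      # bug persists everywhere: cause is elsewhere
    if not any(bug_on_swaps.values()):
        return "guilty"        # bug appears only on the original model
    return "mixed"             # some swaps fail too: look closer


print(model_verdict(True, {"swap_a": False}))                   # inconclusive
print(model_verdict(True, {"swap_a": False, "swap_b": False}))  # guilty
print(model_verdict(True, {"swap_a": True, "swap_b": True}))    # innocent
```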
The decision tree: which suspect to interrogate first
Sometimes you have a hint from chapter 01's failure taxonomy. The bug class points to a likely suspect. Use this to prioritize within the lineup, but never skip a suspect entirely.
failure taxonomy class                  likely suspect
─────────────────────────────           ───────────────
wrong tool chosen                   ──→ prompt (tool descriptions)
tool returned wrong data            ──→ tool
agent looped 12 times               ──→ loop
agent forgot user's earlier turn    ──→ memory
agent refused a benign request      ──→ model
output format suddenly wrong        ──→ prompt or model
cost spiked 3x without code change  ──→ model (version drift)
Even with the hint, walk the lineup in order. The hint shifts your expectation, not your protocol. Many bugs hide one layer earlier than they look.
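One way to encode "the hint shifts your expectation, not your protocol" is to attach the hint as a flag and leave the order alone. The `HINTS` mapping copies the table above; `plan_walk` is a hypothetical helper.

```python
LINEUP = ["prompt", "tool", "loop", "memory", "model"]

# Failure class -> suspects the taxonomy points at (from the table above).
HINTS: dict[str, list[str]] = {
    "wrong tool chosen": ["prompt"],
    "tool returned wrong data": ["tool"],
    "agent looped 12 times": ["loop"],
    "agent forgot user's earlier turn": ["memory"],
    "agent refused a benign request": ["model"],
    "output format suddenly wrong": ["prompt", "model"],
    "cost spiked 3x without code change": ["model"],
}


def plan_walk(failure_class: str) -> list[tuple[str, bool]]:
    """Return the full lineup, in order, with each suspect flagged if hinted."""
    expected = HINTS.get(failure_class, [])
    return [(suspect, suspect in expected) for suspect in LINEUP]


for suspect, hinted in plan_walk("agent looped 12 times"):
    print(f"{suspect:<7} {'<- expected culprit' if hinted else ''}")
```

The returned list always contains all five suspects in the fixed order. The hint only changes which row you expect to confess.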
Worked example: the refund agent returns the wrong amount
A complaint slip comes in. "User asked for a $50 refund. Agent issued $500." We open the case file. Reproduction works — same input, same wrong output.
Walk the lineup.
Suspect 1: Prompt. Swap to a minimal prompt: "You are a refund agent. Use tools to refund customers." Re-run. Agent still issues $500. Prompt: innocent.
Suspect 2: Tool. Mock issue_refund to log its arguments and return success. Re-run. The mock is called with amount=500. Tool: innocent (the tool is being given $500 to spend; the bug is upstream).
Suspect 3: Loop. Bypass the loop — single call. The model responds with one tool call: issue_refund(order_id="o_42", amount=50). Correct amount! Bug disappears.
LOOP MODE trace                     SINGLE-CALL trace
─────────────                       ─────────────────
plan → get_orders → amount=50       plan → issue_refund(amount=50)
plan → get_invoice → total=500      ✓ correct
plan → issue_refund(amount=500)
✗ wrong
Confession found. The loop is guilty. Without the second tool call, behavior is correct. The bug is that the planner is using the wrong field from a second observation. Why exactly? That is chapter 09's territory — but the layer is identified.
We did NOT swap the model. We did NOT touch memory. We stopped at suspect 3. That is the lineup working.
Layer-isolation patterns in production debugging
- Anthropic — Claude evals team: bisects regressions across prompt → tool definition → model snapshot, in that order, before escalating to model-team investigation.
- OpenAI — incident triage for ChatGPT: the on-call playbook walks the agent stack system-prompt → tool-registry → planner → context → model-snapshot, mirroring the cheap-first lineup.
- GitHub Copilot — regression hunting: isolates prompt-template and retrieval-context changes before touching the underlying model, because model swaps cost weeks of A/B testing.
- LangChain / LangSmith debug protocol: the official playbook is "diff the prompt, mock the tool, single-step the loop, clear memory, then swap model" — the lineup by another name.
- Inflection AI — agent-isolation testing for Pi: runs each suspect layer in isolation as part of CI so a regression in one layer cannot be masked by another.
- Cursor's bug-triage loop: isolates context-window changes before swapping the model when code-completion quality regresses; the role is making the prompt the default first suspect.
- LangGraph node-by-node replay: lets a developer pause at one node and substitute inputs; the role is making single-node isolation a first-class debugger action.
- Promptfoo CLI — runs the same prompt against multiple models and providers; the role is making suspect-5 elimination a one-command operation.
- BAML playground — type-checked prompt + tool isolation with locked outputs; the role is exposing prompt and tool bugs at compile time instead of run time.
- Vellum's prompt sandbox — A/B variants of a prompt against a fixed dataset; the role is collapsing the suspect-1 interrogation into one screen.
- Pydantic AI — typed agent runs that surface parse failures separately from generation failures; the role is making "loop vs model" disambiguation explicit.
- Helicone diff mode — request-by-request prompt comparison; the role is letting a debugger see exactly what prompt change correlates with the regression.
- OpenAI Evals diff — eval suite reruns on two model snapshots; the role is the canonical suspect-5 elimination test.
- Anthropic console workbench — replay a request with a modified prompt; the role is making prompt-isolation a one-click action.
- AWS Bedrock model invocation logs — request/response pairs filterable by model version; the role is letting an SRE bisect by model snapshot without writing custom infra.
- Azure AI Studio prompt flow — node-graph debugger where a single node can be substituted; the role is making layer-isolation the default UI metaphor.
- MCP server inspector — view tool inputs/outputs in isolation from agent loop; the role is mocking-tool isolation without code changes.
- Honeycomb's BubbleUp on LLM spans — identifies which span dimension correlates with failure; the role is the statistical version of the lineup.
- Arize Phoenix's compare-traces — side-by-side trace inspection; the role is exposing which span differs between a working and failing trace.
- Datadog APM dimension drilldown — filter traces by model version, tool version, prompt template; the role is the SRE-flavoured version of the lineup walk.
- CrewAI's per-role debug mode — runs each agent in isolation; the role is the multi-agent variant of suspect isolation.
Recall: walking the lineup cold
- What is the eliminating question asked of each suspect in the lineup?
- Why is the prompt checked before the model, even when the new model version was just rolled out?
- In the refund worked example, which suspect confessed and how was it found?
- If chapter 01's taxonomy says "tool returned wrong data," which suspect is most likely — and do you still walk the rest?
Interview Q&A
Q: Why walk the layers in the order prompt → tool → loop → memory → model, not the other way around? A: Cost of elimination. Swapping a prompt is five minutes; swapping a model is a day of evals. Each step rules out a suspect cheaply before we spend on the next. You also reduce confounders — if you swap the model first and the bug disappears, you still do not know if the prompt was also broken.
Common wrong answer to avoid: "Because the prompt is the most likely cause" — the order is about cost of investigation, not likelihood. The model is sometimes the actual cause; we just check it last because the check is expensive.
Q: A new model version shipped yesterday and the bug started today. Why not jump straight to suspect 5? A: Correlation is not causation. The model rollout may have been coincidental — maybe a prompt template was also updated, or a tool schema changed in the same deploy. Walking the lineup catches the actual cause. Jumping to the model risks a model rollback that does not fix the bug and burns a week.
Common wrong answer to avoid: "Skip ahead when you have a strong hunch" — hunches are how you decide what to test, not what to skip. The whole point of the lineup is to stop guessing.
Q: How do you eliminate the loop as a suspect when your agent framework is built around iteration? A: Write a single-call harness outside the framework. Take the same prompt, same tool definitions, same input. Call the model once. Inspect what it would have done on turn one. If the bug is absent in that single call, the loop introduced it. If the bug is present, the loop is innocent.
Common wrong answer to avoid: "Just set max_steps=1 in the framework" — many frameworks still wrap the call in retry, output-parsing, and error-handling logic. A true elimination removes the whole control flow, not just iteration count.
Q: You reach suspect 5, swap the model, and the bug disappears. Is the model guilty? A: Probably, but verify. Swap to a third model. If the bug only appears on the original, the original model is guilty. If the bug appears on the original and one alternate but not the third, you may have a capability cliff specific to two models. Always test at least two swaps before declaring a model confession.
Common wrong answer to avoid: "Yes, ship the new model" — one swap is one data point. A capability difference between two models could also be triggered by a prompt the agent is generating that happens to be brittle on the original — fixing the prompt may be cheaper than a model migration.
Apply now (10 min)
Step 1 — model the exercise. Here is the lineup worksheet I would build for a known agent bug — "refund bot fails on enterprise multi-product invoices":
| Suspect | Elimination test | Estimated time | What confessing looks like |
|---|---|---|---|
| Prompt | strip system prompt to the four load-bearing rules; rerun | 10 min | bug disappears with minimal prompt → suspect 1 guilty |
| Tool | mock invoice-lookup to return the captured fixture; rerun | 20 min | bug disappears → tool returned wrong data live |
| Loop | call model once, no framework, with the same prompt + tool outputs | 30 min | bug disappears in single-call → loop introduced it |
| Memory | clear conversation summary, user profile, retrieval cache; rerun | 30 min | bug disappears with clean memory → stale state guilty |
| Model | swap to Claude Sonnet 4.6; rerun on the same fixture; then swap to a third | 1–2 days | bug disappears on swap → model guilty (verify with second swap) |
Total walk: about 90 minutes up to suspect 4 (10 + 20 + 30 + 30); suspect 5 is the only multi-day cost. The whole lineup is structured so the cheap eliminations happen first.
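The running cost of the walk is simple arithmetic on the worksheet's estimates. A small sketch, using the minute values from the table above and leaving out suspect 5 because its cost is measured in days:

```python
from itertools import accumulate

# Estimated minutes per elimination test, copied from the worksheet.
ESTIMATES_MIN = {"prompt": 10, "tool": 20, "loop": 30, "memory": 30}

# Cumulative minutes spent by the time each suspect is cleared.
running = list(accumulate(ESTIMATES_MIN.values()))
for suspect, total in zip(ESTIMATES_MIN, running):
    print(f"{suspect:<7} cleared after {total:>3} min")

print(running[-1])  # 90
```

Most of the budget is still unspent when the first three suspects are cleared, which is why a confession at suspect 3 feels cheap.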
Step 2 — your turn. Pick one agent bug you have seen — at work or in a side project. Write the same five rows. For each row, estimate the time-cost honestly; if any row is "I cannot test this", the lineup has revealed a missing tool or fixture rather than a model bug.
Step 3 — reproduce from memory. Draw the five-suspect column from top to bottom. Next to each, write the elimination test in five words. Below the column, draw an arrow labeled "cheap → expensive" pointing down.
What you should remember
This chapter explained the order a debugger walks the agent stack when something is wrong. The lineup is prompt → tool → loop → memory → model, in that order, and the order is about cost of elimination, not about likelihood of guilt. A prompt diff is five minutes; a model swap is a day or more of evals. Walking the cheap suspects first means most bugs confess before the expensive interrogation ever begins.
You also learned why correlation with a recent change is not enough to skip the walk. A new model shipped yesterday and a bug appeared today is not a model bug — it is a coincidence until the cheaper suspects are cleared. Skipping ahead is how teams roll back the wrong thing and lose a week.
Carry this diagnostic forward: when somebody proposes a model swap to fix a regression, ask which of the four cheaper suspects have already been eliminated. If the answer is fewer than four, the lineup has not been walked, and the swap will probably not fix the bug.
Remember:
- The order prompt → tool → loop → memory → model is fixed. Cost of elimination, not likelihood of guilt, sets the order.
- "The new model rolled out yesterday" is correlation, not a confession. Walk anyway.
- Each elimination test must be clean: a single-call harness outside the framework beats max_steps=1, because the framework still wraps the call in retry and parsing.
- A model that confesses on one swap needs a second swap to verify. One swap is one data point.
- A suspect you cannot eliminate is a missing fixture, not a guilty one. Fix the instrumentation before drawing conclusions.
Bridge. First suspect up — the prompt. Cheapest to eliminate, but home to subtle bugs: context bleed, conflicting instructions, role drift. Time to interrogate. → 07-prompt-layer-bugs.md