00. Debugging Agents in Production — The Five-Year-Old Version

You designed the agent in module 16. This module shows you what to do when it misbehaves in production at 3 AM.

Imagine a detective at a crime scene.

A user has filed a complaint slip — "your agent refunded the wrong customer." That is the call. The detective does not panic and does not guess. The detective opens the case file — the full trail of every step the agent took for that user, with timestamps, with parents and children. The detective reads each witness note — every span, every tool call, every model output. The detective lines up the suspects — was it the prompt? The tool? The loop? The memory? The model? Each one is checked, in order, until the confession emerges. A root cause. Not a guess.

Then the detective writes a lock — a regression eval — so this exact crime cannot happen again. Future versions of the agent must pass this eval before launch. The crime is solved, and the city is safer than before.

That is debugging an agent in production. Not staring at print statements. Not guessing in Slack. A disciplined investigation, supported by traces, evals, and a tool chain that lets one engineer go from complaint to root cause to permanent fix in an hour, not a week.

Most teams cannot debug their agents. They look at the final output, scratch their heads, and roll back the model version. By the end of this module, you should be able to take any agent failure and follow a five-step flow: complaint → trace → suspect → confession → lock. Every step has tools. Every step has pitfalls. We walk through them all.

The detective metaphor is not soft. Agents fail through sequences, not single calls. The final answer is just the smoke. The fire started earlier. To find it, you need the case file and the lineup. Print statements miss the sequence. They miss the parent-child links. They miss what time of day the call happened. They miss whether the model version changed mid-rollout. Real debugging needs a real detective wall.

The placeholders you will see called back

Placeholder	Meaning
the case board	the dashboard or system-wide view of agent health
the case file	one end-to-end trace for a request or session
the witness note	one span inside the trace
the evidence tag	metadata on a span — user, model, tenant, cost
the crime statistics	metrics rolled up over many requests
the alarm bell	alerts that fire when something dangerous changes
the complaint slip	a customer report linked to the exact trace
the suspects	the five layers — prompt, tool, loop, memory, model
the lineup	systematic layer-by-layer isolation of which suspect broke
the confession	the root cause — verified, not guessed
the lock	the regression eval that prevents this bug from returning
the cold case	a drift-over-time failure where no single trace looks wrong
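
To make the first few placeholders concrete, here is a minimal sketch of a case file as plain Python data. The `Span` class, its field names, and the sample refund trace are all hypothetical, invented for illustration; real tracing backends define their own schemas. The point is only that parent-child links let you rebuild the order of events, and evidence tags let you filter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One witness note: a single step inside a trace (hypothetical schema)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    tags: dict = field(default_factory=dict)  # evidence tags


def render(trace, parent_id=None, depth=0):
    """Flatten the parent-child tree into indented lines, one per span."""
    lines = []
    children = sorted(
        (s for s in trace if s.parent_id == parent_id),
        key=lambda s: s.start_ms,
    )
    for span in children:
        lines.append("  " * depth + span.name)
        lines.extend(render(trace, span.span_id, depth + 1))
    return lines


# A made-up case file for the wrong-refund complaint.
case_file = [
    Span("a", None, "handle_request", 0, {"user": "u-17"}),
    Span("b", "a", "llm.plan", 5, {"model": "model-v2"}),
    Span("c", "a", "tool.lookup_customer", 40, {"tenant": "acme"}),
    Span("d", "a", "tool.issue_refund", 90, {"tenant": "acme"}),
]

trail = render(case_file)
print("\n".join(trail))

# Evidence tags narrow the search to the witness notes that matter.
acme_notes = [s.name for s in case_file if s.tags.get("tenant") == "acme"]
print(acme_notes)
```

Keeping spans as a flat list with a `parent_id` mirrors how trace data is commonly exported; the tree is rebuilt only when someone needs to read it.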

Top resources

OpenAI Cookbook — Tracing and debugging agents — https://cookbook.openai.com/examples/agents_sdk/tracing
LangSmith docs — Tracing and debugging — https://docs.smith.langchain.com/
Arize Phoenix — LLM observability — https://docs.arize.com/phoenix
OpenTelemetry semantic conventions for GenAI — https://opentelemetry.io/docs/specs/semconv/gen-ai/
Braintrust — Eval-driven debugging — https://www.braintrust.dev/docs
Hamel Husain — Field guide to AI evals — https://hamel.dev/blog/posts/evals/
Eugene Yan — Patterns for building LLM-based systems — https://eugeneyan.com/writing/llm-patterns/

What's coming

01-failure-taxonomy.md — The 8 ways an agent fails in production. Naming the bug before fixing it.
02-from-complaint-to-trace.md — Turning a user complaint into a specific trace ID. Tagging, correlation, retention.
03-reading-a-trace.md — Span anatomy. Parent-child links. Time and cost breakdown.
04-llm-specific-traces.md — Prompt, completion, token, latency. What goes into a useful LLM span.
05-reproducing-the-failure.md — Capturing inputs, seeds, model versions, time-of-day variance.
06-layer-isolation-lineup.md — The lineup — systematic prompt → tool → loop → memory → model elimination.
07-prompt-layer-bugs.md — Context bleed, conflicting instructions, role drift, instruction-following decay.
08-tool-layer-bugs.md — Schema drift, silent argument coercion, hallucinated arguments, errors masked as success.
09-loop-layer-bugs.md — Runaway, premature stop, infinite retry, oscillation between two tools.
10-memory-bugs.md — Stale state, cross-session leakage, retrieval drift, embedding staleness.
11-model-layer-bugs.md — Version regressions, capability cliffs, temperature drift, refusals.
12-multi-agent-handoff-bugs.md — Lost context, deadlock, role confusion across agents.
13-drift-detection.md — The cold case: "is it still behaving like yesterday?" Distribution shift in inputs and outputs.
14-latency-cost-regressions.md — When the bug is not "wrong" but "slow" or "expensive."
15-debugging-tools-workflow.md — LangSmith / Phoenix / Braintrust — concrete debugging loops in each.
16-span-tagging-for-debugging.md — Evidence tags that make trace search fast — user, model, tenant, version.
17-regression-eval-as-lock.md — Turning every fixed bug into a permanent lock — the eval-set discipline.
18-postmortem-for-agents.md — The agent-specific incident template. What differs from a classical SRE postmortem.
19-data-privacy-retention.md — Protecting user data while still keeping useful traces.
20-honest-admission.md — What remains hard: non-determinism, subjective output, emergent multi-agent failure.

See the whole detective wall once more.

A complaint slip points at one trace. The case file holds the full story. Witness notes are the spans inside it. Evidence tags let you filter to the bug.

The lineup walks the suspects — prompt, tool, loop, memory, model — one at a time. The confession is the verified root cause.
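
The lineup can be sketched as an ordered list of checks, one per suspect, stopping at the first layer that fails. Everything below is an assumption made for illustration: the trace summary keys, the check logic, and the `run_lineup` helper are hypothetical, and real checks would inspect actual spans rather than a hand-written dict.

```python
from typing import Callable, Optional

# A check returns True when its layer looks healthy for this trace.
Check = Callable[[dict], bool]


def run_lineup(trace: dict, checks: list) -> Optional[str]:
    """Walk the suspects in order; return the first layer whose check fails."""
    for layer, check in checks:
        if not check(trace):
            return layer
    return None  # every check passed; the evidence is not in these checks


# Hypothetical summary extracted from one case file.
trace = {
    "prompt_has_conflicts": False,
    "tool_customer_id": "c-42",
    "expected_customer_id": "c-24",
    "loop_steps": 4,
    "max_steps": 10,
    "memory_age_s": 30,
    "max_memory_age_s": 3600,
    "model_version": "model-v2",
    "pinned_model_version": "model-v2",
}

# Same order as the module: prompt, tool, loop, memory, model.
lineup = [
    ("prompt", lambda t: not t["prompt_has_conflicts"]),
    ("tool", lambda t: t["tool_customer_id"] == t["expected_customer_id"]),
    ("loop", lambda t: t["loop_steps"] <= t["max_steps"]),
    ("memory", lambda t: t["memory_age_s"] <= t["max_memory_age_s"]),
    ("model", lambda t: t["model_version"] == t["pinned_model_version"]),
]

suspect = run_lineup(trace, lineup)
print(suspect)  # tool
```

A failed check is a lead, not yet a confession. The root cause still has to be verified, for example by reproducing the failure with that layer fixed.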

The lock is the regression eval that ensures the bug never returns. The cold case is a drift that hides in aggregate trends, not in any single trace.
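
A lock can be as small as one captured case plus an expected answer. The sketch below assumes a hypothetical agent step, `pick_refund_target`, and a single invented case; it shows the shape of a regression eval, not any particular eval framework.

```python
def pick_refund_target(ticket: dict) -> str:
    """Stand-in for the agent step under test (hypothetical).

    The fixed version refunds the customer on the order,
    not the customer who filed the ticket.
    """
    return ticket["order"]["customer_id"]


# Each entry is one lock, captured from an incident.
# This single case is invented for illustration.
REGRESSION_CASES = [
    {
        "name": "refund-goes-to-order-owner",
        "ticket": {"reporter_id": "c-42", "order": {"customer_id": "c-24"}},
        "expected": "c-24",
    },
]


def run_regression_evals(agent_step, cases) -> list:
    """Return the names of failing cases; an empty list means every lock holds."""
    return [c["name"] for c in cases if agent_step(c["ticket"]) != c["expected"]]


failures = run_regression_evals(pick_refund_target, REGRESSION_CASES)
print(failures)
```

Run against the buggy behavior (refunding `reporter_id`), the same case fails by name, which is what lets a release gate block a version that reintroduces the bug.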

The alarm bell wakes you when crime spikes. The case board shows the health of the city.

That is the debugger's vocabulary. The rest of the module fills in each step.

Bridge. Before we open a trace, we must name the bug. Different failures need different evidence. The eight-failure taxonomy comes first. → 01-failure-taxonomy.md