20. Honest admission — what debugging agents in production still cannot solve

~14 min read. The detective wall is much better than it was three years ago. It is not yet honest. Here is what the case board still does not see, and what no eval, no postmortem, no trace can fully fix.

Built on the ELI5 in 00-eli5.md. The case board — the dashboard of every case ever solved — looks impressive from a distance. Up close, there are cold cases the lineup cannot crack, confessions that may not be true, and locks that close one door while another quietly opens. This chapter is the senior engineer's honest account of what we still cannot debug.


Non-determinism makes some bugs un-reproducible

Classical SRE has a comforting rule: same input, same code, same state, same output. That rule is broken for LLM agents. Temperature is non-zero in many deployments, so token sampling is random by design. Provider-side batching changes the order of floating-point operations, so outputs can differ slightly even at temperature zero. Providers can also update the weights behind a model name without notice. So the same prompt, on the same model name, on the same day, can produce two different answers.

classical bug                 agent bug
┌──────────────────┐         ┌────────────────────────────┐
│ same input       │         │ same input                 │
│   ↓              │         │   ↓                        │
│ same output      │         │ sometimes wrong, sometimes │
│ always           │         │ right, no obvious pattern  │
└──────────────────┘         └────────────────────────────┘

The honest truth is this. We can be sure something broke — the complaint slip is real. We may never be sure why — the case file on replay shows a different answer. The confession is sometimes to a different crime than the one the user saw. First honest limit.

The discipline that makes peace with this: capture seeds when available; snapshot the full prompt, context, tool outputs, and model version at bug time; and replay many times rather than once. Treat the bug as a distribution of behaviours. It costs more, but it is the only honest path.
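Replaying a snapshot many times can be sketched in a few lines. This is a minimal illustration, not a real harness: `replay_distribution` and `fake_agent` are hypothetical names, and `fake_agent` is a seeded stand-in for a real agent call against the captured prompt, context, and tool outputs.

```python
import random
from collections import Counter
from typing import Callable


def replay_distribution(run_once: Callable[[], str], n: int = 20) -> dict[str, float]:
    """Replay the same snapshot n times and return the share of each distinct answer."""
    counts = Counter(run_once() for _ in range(n))
    return {answer: count / n for answer, count in counts.most_common()}


# Hypothetical stand-in for the real agent: gives the wrong amount some of the time.
rng = random.Random(7)


def fake_agent() -> str:
    return "refund $40" if rng.random() < 0.7 else "refund $400"


dist = replay_distribution(fake_agent, n=20)
for answer, share in dist.items():
    print(f"{share:.0%}  {answer}")
```

The useful output is the share of each answer, not a single pass or fail. A bug that shows up in some share of replays is characterised even when it cannot be reproduced on demand.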

"Wrong" is sometimes a judgment call no eval can codify

A request can be operationally healthy. Fast. Cheap. No error. Valid JSON. Still the answer can be wrong in a way that no automated check catches. Wrong tone. Wrong emphasis. Subtle factual error a non-expert reviewer would miss. Politically incorrect for the user's region. Legally fine but reputationally bad.

The lineup has nothing to test here. The prompt is fine. The tool returned correct data. The loop terminated. The memory was clean. The model did not refuse. And yet the answer is wrong — to a senior reader who knows the domain.

operationally green             semantically red
┌─────────────────────┐         ┌──────────────────────────────┐
│ status = ok         │         │ tone = condescending          │
│ latency = 1.2 s     │   AND   │ emphasis = wrong stakeholder  │
│ cost = $0.004       │         │ nuance = stripped             │
│ json parse = pass   │         │ legal review = uncomfortable  │
└─────────────────────┘         └──────────────────────────────┘

So what do we do? We sample. We send traces to human reviewers. We build rubric-based evals. We use LLM-as-judge — knowing it has blind spots. None of these are sufficient. All of these together are still partial. Honest interview answer: "semantic correctness is judged, not measured, and our debugging stack reflects that limit."
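One way to sample "passed" traces for human review is to hash the trace id, so the same trace always gets the same decision. This is a sketch under assumptions: `needs_human_review`, the trace id format, and the 5% rate are illustrative and not taken from any particular tracing tool.

```python
import hashlib


def needs_human_review(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


# Hypothetical trace ids for operationally green requests.
trace_ids = [f"trace-{i}" for i in range(10_000)]
selected = [t for t in trace_ids if needs_human_review(t)]
share = len(selected) / len(trace_ids)
print(f"{len(selected)} of {len(trace_ids)} traces go to reviewers ({share:.1%})")
```

Hash-based selection needs no stored state and is stable across reruns, which makes the reviewed sample auditable. It still only catches semantic failures at the sampled rate.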

Emergent multi-agent failure escapes single-agent tests

Now the harder problem. Single-agent eval can be quite strong: pass an input, check an output, judge it. Multi-agent systems do not work that way. A planner calls a coder. The coder calls a reviewer. The reviewer asks for memory recall. The memory call influences the next planner choice. The bug is in the interaction pattern, and no single agent is wrong on its own.

planner          coder          reviewer        memory
   │                │                │              │
   ├─call──────────→│                │              │
   │                ├─call──────────→│              │
   │                │                ├─recall──────→│
   │                │                │←────state────┤
   │                │←──── modified review ─────────┤
   │←── modified plan ──────────────────────────────┤
 wrong final answer — but every individual agent passed eval

The lineup cannot find a single suspect. Every agent is operationally fine. The failure lives in the graph, not the node. Our debugging tools are built for nodes. This is one of the largest unsolved areas in production agent debugging.
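A toy version of "every node passes, the graph fails" looks like this. It is a sketch with invented pieces: the `Node` type, the two agents, and the cents-versus-dollars mismatch are all hypothetical and only illustrate a failure that lives on the edge between two agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    run: Callable[[dict], dict]
    check: Callable[[dict], bool]  # node-level eval


def run_graph(nodes: list[Node], state: dict) -> tuple[dict, dict[str, bool]]:
    """Run nodes in order, recording each node's own eval result."""
    node_results = {}
    for node in nodes:
        state = node.run(state)
        node_results[node.name] = node.check(state)
    return state, node_results


# The planner emits an amount in cents; the coder assumes it is in dollars.
planner = Node(
    "planner",
    run=lambda s: {**s, "amount": s["order_total_cents"]},
    check=lambda s: s["amount"] > 0,
)
coder = Node(
    "coder",
    run=lambda s: {**s, "refund_usd": float(s["amount"])},
    check=lambda s: s["refund_usd"] > 0,
)

final, node_results = run_graph([planner, coder], {"order_total_cents": 4000})

# Graph-level eval: checks the end-to-end invariant, not any single node.
graph_ok = final["refund_usd"] == final["order_total_cents"] / 100

print(node_results)
print("graph-level eval passed:", graph_ok)
```

Each node's check is reasonable on its own. Only an invariant stated over the whole run sees the unit mismatch, which is why a graph-level eval has to be written separately from the node-level ones.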

Cold cases — drift is observable, root-causing it is guesswork

Drift detection is easier than drift attribution. We can see that intent distribution shifted. We can see refusal rates climbed. We can see retrieval quality dropped on the long-tail queries. But why? Did the user base change? Did a competitor's launch shift query patterns? Did the provider silently update the model? Did a prompt change six weeks ago slowly compound through cached memory?

The cold case is a drift the case board can detect. Root-causing it is often guesswork, dressed up as analysis. A senior engineer admits this: "We know it shifted. We have three theories. We can confirm none of them with our current evidence." That is the honest answer.
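The detection half can be sketched with a simple distance between two intent distributions. This is one possible metric among several: total variation distance is an assumption here, and the intent labels and counts are made up for illustration.

```python
from collections import Counter


def total_variation(baseline: list[str], current: list[str]) -> float:
    """Half the sum of absolute differences between two label distributions (0 to 1)."""
    p, q = Counter(baseline), Counter(current)
    labels = set(p) | set(q)
    return 0.5 * sum(
        abs(p[label] / len(baseline) - q[label] / len(current)) for label in labels
    )


# Hypothetical intent labels from two time windows.
baseline = ["refund"] * 70 + ["status"] * 30
current = ["refund"] * 50 + ["status"] * 30 + ["cancel"] * 20

drift = total_variation(baseline, current)
print(f"intent drift: {drift:.2f}")
```

The number says that the distribution moved and by how much. It says nothing about whether users, a competitor, the provider, or an old prompt change moved it, which is exactly the attribution gap.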

The "perfect trace" is still a fantasy

Trace everything, redact nothing, keep forever. Some teams say this aspirationally. The math does not allow it. Storage is finite. Privacy law is real. Sampling is forced. Redaction is forced. Retention windows are forced.

the impossible triangle
        completeness
            /\
           /  \
          /    \
         /______\
   privacy    cost

You can have any two. You cannot have all three. The debugger making peace with this triangle is mature. The debugger denying it is dangerous. Every production agent runs on a partial case file — by necessity. Sometimes the witness note you need was sampled out. Sometimes the raw tool payload was redacted before export. Sometimes the trace was already deleted by retention. This is permanent.
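The storage side of the triangle is simple arithmetic. The figures below are invented for illustration (traffic, trace size, sampling rate, and retention are all assumptions), and `steady_state_storage_gb` is a hypothetical helper.

```python
def steady_state_storage_gb(
    traces_per_day: int,
    avg_trace_kb: float,
    sample_rate: float,
    retention_days: int,
) -> float:
    """Approximate storage held once the retention window is full, in decimal GB."""
    total_kb = traces_per_day * avg_trace_kb * sample_rate * retention_days
    return total_kb / 1_000_000


# Hypothetical workload: 2M traces per day at about 50 KB each.
full_gb = steady_state_storage_gb(2_000_000, 50, sample_rate=1.0, retention_days=365)
sampled_gb = steady_state_storage_gb(2_000_000, 50, sample_rate=0.05, retention_days=30)

print(f"keep everything for a year: {full_gb:,.0f} GB")
print(f"5% sample, 30-day window:   {sampled_gb:,.0f} GB")
```

The second configuration is far cheaper, and it also means most traces never exist and the rest are gone after a month. That is the partial case file in numbers.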

LLM-as-judge has known blind spots — and we still use it

Many teams use a stronger model to score a weaker model's outputs. It scales. It is also known to prefer longer answers, favour answers written in its own style, miss errors in domains where it is itself weak, hallucinate justifications, and disagree with human reviewers on edge cases.

We use it anyway. Why? Because human review at scale is unaffordable. Honest answer: "LLM-as-judge is our best automated verifier. It is not a good one. We sample human review on top." No one has a better solution yet.
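Sampling human review on top only helps if the disagreement is measured. A minimal sketch, assuming paired judge and human labels on the same traces: raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The labels below are made up.

```python
from collections import Counter


def agreement_and_kappa(judge: list[str], human: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label lists of equal length."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labelled at random with their own base rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Hypothetical verdicts on the same ten sampled traces.
judge = ["pass"] * 8 + ["fail"] * 2
human = ["pass"] * 6 + ["fail"] * 4

observed, kappa = agreement_and_kappa(judge, human)
print(f"raw agreement: {observed:.2f}, kappa: {kappa:.2f}")
```

Raw agreement looks flattering when most traces pass. Kappa is lower because much of that agreement would happen by chance, so it is the more honest number to track per slice.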

Postmortems write better stories than the real causal chains they describe

A good agent postmortem reads well. Timeline, five whys, action items, lock applied. Beautiful.

The worry? The real causal chain is often not a chain. It is a graph with weak edges. Prompt was 80% of the cause. Retrieval was 30%. Model rollout 20%. User phrasing 40%. The sum exceeds 100% because in non-deterministic systems, causes interact and compound. The postmortem flattens this into one narrative because humans need narratives.

what we write                  what really happened
┌──────────────────┐           ┌────────────────────────────┐
│ root cause:      │           │ five partial causes,        │
│ stale retrieval  │    vs     │ interacting, compounding,   │
│ index            │           │ no single one sufficient    │
└──────────────────┘           └────────────────────────────┘

This is not a reason to skip postmortems. It is a reason to acknowledge partial causes and resist the clean-story instinct. The lock covers the dominant contributor — the runner-up causes will surface again in different combinations.
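Why the partial causes can add up to more than 100% is easiest to see with a toy ablation. The failure rule below is invented for illustration: the failure needs a prompt bug plus at least one of two other factors. `ablation_share` is a hypothetical helper that asks how many failing configurations are fixed by removing one factor alone.

```python
from itertools import product

FACTORS = ["prompt_bug", "stale_index", "model_rollout"]


def fails(prompt_bug: bool, stale_index: bool, model_rollout: bool) -> bool:
    # Toy rule: the prompt bug only bites when another factor is also present.
    return prompt_bug and (stale_index or model_rollout)


def ablation_share(factor_index: int) -> float:
    """Share of failing configurations that stop failing when one factor is removed."""
    failing = [c for c in product([True, False], repeat=3) if fails(*c)]
    fixed = 0
    for config in failing:
        ablated = list(config)
        ablated[factor_index] = False
        if not fails(*ablated):
            fixed += 1
    return fixed / len(failing)


shares = [ablation_share(i) for i in range(len(FACTORS))]
for name, share in zip(FACTORS, shares):
    print(f"{name}: removing it fixes {share:.0%} of failing cases")
print(f"sum of shares: {sum(shares):.0%}")
```

Because the factors interact, the per-factor shares overlap and their sum exceeds 100%. A postmortem that names only the largest share is not wrong, but it leaves the overlapping causes in place.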

What a senior debugger sounds like

"We have good traces, good evals, good postmortems. We still cannot reproduce some bugs deterministically, cannot codify all forms of 'wrong,' cannot attribute emergent multi-agent failure cleanly, cannot root-cause every drift. The case board is sharper every quarter. It is not yet honest."

That is the answer. Maturity is naming the limits.


Public admissions of the limits of agent debugging

  • Anthropic platform team — documents how seed and temperature affect reproducibility for paying customers debugging Claude agents; the role is making non-determinism a customer-facing operational constraint.
  • Cursor agent reliability — publishes that multi-step coding sessions fail through interaction patterns no single-step eval catches; the role is naming emergent multi-step failure publicly.
  • OpenAI safety researcher posts — acknowledge LLM-as-judge has known biases and recommend sampling human review for high-stakes evaluation; the role is the canonical "judge has limits" admission.
  • Notion AI quality team — runs sampled human review on top of automated rubrics because operationally healthy traces still produce subtly wrong summaries; the role is showing the judgment-call gap as a working pattern, not a bug.
  • Klarna assistant incident reviewers — write postmortems noting "compounded cause" when prompt change, model rollout, and intent drift contributed in the same window; the role is making compound causality first-class in postmortem language.
  • Mata v. Avianca (2023) — lawyer-cited fake cases never caught pre-submission; the role is showing that missing an eval (no citation-existence check) produced real legal harm.
  • Air Canada chatbot incident (2024) — invented bereavement-fare policy; tribunal ruled airline liable; the role is exposing the coverage-is-unprovable gap with regulatory teeth.
  • Bing Chat early hallucinations — invented sources and argued with users; the role is showing that even a frontier-model deployment had no eval surface for "will the model argue with the user?".
  • Apple Intelligence summary failures (2024–25) — false news notification summaries severe enough to force a feature pause; the role is the cold-case-after-launch shape at platform scale.
  • Galactica (Meta, 2022) — pulled within three days for confidently fabricating scientific citations; the role is exposing how an eval that does not cover the deployment's actual workload fails publicly.
  • CNET / Red Ventures (2023) — quietly used AI for finance articles; more than half needed corrections; the role is the judgment-gap failure at media scale.
  • Bard launch demo error (Feb 2023) — confidently asserted James Webb took the first exoplanet image; the role is showing the demo-vs-distribution gap from chapter 01 at the highest visibility.
  • Anthropic safety eval limits posts — public reflections on what their evals do not catch; the role is normalising honest admission at frontier-lab scale.
  • OpenAI red-teaming admissions — published acknowledgments of coverage gaps in adversarial eval; the role is making "unprovable coverage" a public discipline.
  • Goodhart's law in LLM evals (academic literature) — published cases of eval-score-up / user-satisfaction-down; the role is exposing the Goodhart pattern that no calibration eliminates.
  • Multi-turn eval academic acknowledgments — published statements that trace-level evals over multi-step agent runs are an open research frontier; the role is making the emergent failure honest at the literature level.
  • Judge-calibration ceiling research — 80–90% agreement with human raters as a near-asymptote; the role is naming the LLM-judge ceiling formally.
  • Anthropic "What we learned" posts — public reflections on model-behaviour incidents; the role is the canonical industry honest-admission format.
  • Stack Overflow's ChatGPT-answer ban (2022) — banned because plausible-looking-and-wrong answers were too frequent; the role is the coverage-unprovable gap at community scale.
  • Casetext / CoCounsel post-Avianca review — citation-accuracy became a launch blocker only after public harm; the role is showing that some eval gaps only get named after an incident.
  • DuckDuckGo / Brave Search AI pull-backs — both reduced aggressive RAG synthesis after early hallucination incidents on news; the role is showing operationally-mature retreat from emergent failure modes.
  • Bloomberg GPT / JP Morgan DocLLM / Goldman internal tools — finance teams document that domain reasoning gaps persist even with strong domain-tuned retrieval; the role is making the no-eval-codifies-this gap explicit in regulated domains.

Recall — the five honest gaps

  • Why is the SRE rule "same input → same output" broken for LLM agents?
  • What kind of failure escapes the entire lineup (prompt, tool, loop, memory, model)?
  • Why is drift detection easier than drift attribution?
  • Why does the postmortem template tend to under-represent the real causal graph?

Interview Q&A

Q: Why can't you always reproduce an agent bug even when you have the exact prompt, model, and context? A: Non-determinism is structural. Sampling temperature, provider-side batching, and silent weight rotations all introduce variance. The bug is a distribution of behaviors, not a single behavior. You replay many times to characterize it, not once to reproduce it. Common wrong answer to avoid: "Set temperature to zero and the bug will reproduce" — temperature zero reduces variance but does not eliminate it; provider-side changes still apply.

Q: Why doesn't the layer-by-layer lineup work for multi-agent systems? A: The lineup tests each suspect — prompt, tool, loop, memory, model — in isolation. Multi-agent failures emerge from interaction between agents, where every individual agent passes its own eval but the graph produces a wrong final answer. The cause is the edge, not the node. Common wrong answer to avoid: "Just test each agent more carefully" — no amount of single-agent testing catches a failure that lives in the interaction pattern.

Q: Why is LLM-as-judge used in production despite known biases? A: Human review does not scale to millions of traces. LLM-as-judge is the only automated verifier we have. We accept its biases (length preference, style mirroring, weak-domain blind spots) and sample human review on top for high-stakes flows. The honest answer is "best available, not good." Common wrong answer to avoid: "LLM-as-judge is accurate enough" — published research shows clear, measurable biases; we use it pragmatically, not because it is accurate.

Q: Why do agent postmortems often understate the real cause? A: Agent failures are usually not single-cause. Prompt, retrieval, model version, and user phrasing each contribute partially and interact. Humans flatten this into a clean story because narratives are easier to act on. The runner-up causes are still real and will surface in different combinations later. Common wrong answer to avoid: "Pick the dominant cause and move on" — applying a lock only to the dominant cause leaves the system exposed to the same compound failure with different weights next quarter.


Apply now (10 min)

Step 1 — model the exercise. Here is the honest-limit triage table I would build for five real bugs from a refund-bot stack:

Bug | Honest limit | Mitigation
Refund-amount answer disagrees on two replays of the same trace | non-determinism | replay 20×, treat as distribution; track variance per slice
Bot summary is technically correct but tone violates brand guide | judgment-call no eval codifies | sample 5% of "passed" traces for human review weekly
Two agents in handoff produce a wrong answer; each agent passes its own eval | emergent multi-agent | add interaction-level eval (graph eval), not just node-level
Eval pass rate steady, CSAT down four points | Goodhart / coverage unprovable | refresh eval set with last-week-low-CSAT traces; check rubric drift
Citation references a clause that no longer exists; original citation worked at retrieval time | cold-case drift | freshness alert on retrieval index; postmortem flagged as cold case

Each honest limit has a mitigation, but none of them is resolution. The mitigations buy more case files; they do not close the gap.

Step 2 — your turn. Write three bugs from your current agent stack. For each, mark which honest limit applies — non-determinism, judgment call no eval codifies, emergent multi-agent, cold-case drift, redaction loss, judge blind spot, or compound causality. Then write one mitigation per bug — a sampling strategy, a human review trigger, or a multi-cause lock.

Step 3 — reproduce from memory. Draw the impossible triangle — completeness, privacy, cost. Mark which corner your team currently prioritises. Write one sentence on what the case board misses because of that choice, and one sentence on which cold cases you suspect are hiding in the gap.


What you should remember

This chapter explained what the rest of the module quietly assumed away. Five honest gaps remain after every suspect has been interrogated, every lock added, every case file redacted. Non-determinism makes some bugs distributions rather than reproductions. Some failures are judgment calls no eval codifies. Multi-agent interaction patterns produce failures no single-agent test catches. Drift is observable but rarely fully attributable. LLM-as-judge has known blind spots and we still ship it because human review does not scale.

You also learned that postmortems flatten compound causation into clean narratives because narratives are easier to act on — but the runner-up causes are still real and surface in different combinations later. The mature posture is not pretending the case board is complete; it is naming the cold cases out loud so the team knows which gaps are open.

Carry this diagnostic forward: when somebody claims the agent system is "fully observable", ask which of the five honest gaps they have closed. None of the five is closable today. The senior move is humility — running mitigations on each gap, sampling human review on the judgment cases, and writing postmortems that admit the runner-up cause rather than hide it.

Remember:

  • Five honest gaps remain after every lock is added. Name them when planning.
  • Non-determinism turns some bugs into distributions. Replay many times, not once.
  • LLM-as-judge has limits we accept. Sample human review on top for high-stakes flows.
  • Multi-agent failures live on the edges between agents, not inside any node.
  • Postmortems flatten compound causation. The runner-up cause is real and will surface again.

Bridge. Debugging tells us what did go wrong. Sometimes we cannot debug fast enough — the harm is already done. So the next discipline is preventing harmful behavior before the complaint slip ever gets written. Guardrails and safety controls — explicit limits that fire before the model speaks. → ../../03_ai_security_safety/00_safety_guardrail_design/00-eli5.md