
01. Failure taxonomy — name the bug in ten seconds

~14 min read. Eight failure types. Each with a signature. Each with a first move.

Built on the ELI5 in 00-eli5.md. The complaint slip — a user's raw report of pain — must first be tagged with a failure type before any case file is opened, before any lineup runs.


Why naming comes before fixing

A user writes, "Your bot did something weird." That sentence is emotionally clear and operationally useless. A retry will not help. A rollback may waste a week. The first step is naming the bug, because classical SRE taxonomy — transient, persistent, silent — was built for services that either return a 500 or do not. Agents fail along a different axis. An agent can return 200 OK with a wrong answer. It can loop forever and burn budget. It can hallucinate an order_id that never existed. None of these fit the old buckets.

What follows: eight agent-specific failure classes. Each leaves a fingerprint in the case file. Each points at a specific suspect layer. Name the bug in ten seconds and the rest of the investigation has direction; skip the naming step and every later move is a guess wearing a lab coat.

Decision tree — from complaint to failure type

Before any table, the picture. A complaint slip lands. We ask three questions in order.

complaint arrives
  Q1: did the agent finish?
      ├── no, ran forever ─────────────────→ STUCK-LOOP / RUNAWAY
      ├── no, refused to start ────────────→ REFUSAL CASCADE
      └── yes, it finished
  Q2: was the trajectory weird?
      ├── yes, tool arg fabricated ────────→ HALLUCINATED ARG
      ├── yes, tool 200 but no effect ─────→ SILENT SUCCESS
      ├── yes, feature gone after upgrade ─→ CAPABILITY REGRESSION
      └── no, trajectory sane, output wrong → WRONG-OUTPUT
  Q3: did the bill spike?
      ├── yes, output fine ────────────────→ COST BLOW-UP
      └── no, nothing looks wrong, metrics sliding → DRIFT (cold case)

That is the full lineup entry point — three questions, eight buckets.
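The three questions can be sketched as a small triage function. This is a minimal sketch under stated assumptions, not a production classifier: the `Triage` record and every field name on it are hypothetical, and in practice a human or an upstream detector fills them in.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Answers collected at the door. All field names are illustrative."""
    finished: bool
    ran_forever: bool = False            # only meaningful when finished is False
    tool_arg_fabricated: bool = False
    tool_ok_but_no_effect: bool = False
    broke_after_upgrade: bool = False
    output_wrong: bool = False
    bill_spiked: bool = False


def classify(t: Triage) -> str:
    """Walk the three questions in order and return a starting bucket."""
    # Question 1: did the agent finish?
    if not t.finished:
        return "stuck-loop" if t.ran_forever else "refusal-cascade"
    # Question 2: was the trajectory weird (or sane but with a wrong answer)?
    if t.tool_arg_fabricated:
        return "hallucinated-arg"
    if t.tool_ok_but_no_effect:
        return "silent-success"
    if t.broke_after_upgrade:
        return "capability-regression"
    if t.output_wrong:
        return "wrong-output"
    # Question 3: did the bill spike?
    if t.bill_spiked:
        return "cost-blow-up"
    # Nothing wrong on any single trace: treat as the cold case.
    return "drift"
```

The function returns the first bucket that matches, so it gives a starting point only; a bug that fits two buckets still needs both branches checked by hand.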

The eight failure types

1) Wrong-output (the silent majority)

Agent returned an answer. Spans look healthy. Latency normal. Cost normal. But the answer is factually wrong. Most common failure. Easiest to miss in dashboards.

  • Signature: Trace ends with status=ok. User thumbs-down.
  • Hides in: The final LLM span's output text.
  • Fastest detection: Complaint-linked traces plus offline eval on the answer.
  • Suspect layer: Prompt or model. Sometimes retrieved context.

2) Stuck-loop / runaway

Agent calls the same tool ten times. Or oscillates between two tools. Or refuses to call the stop tool. Token budget melts. User sees a spinner that never resolves.

  • Signature: Trace span count >> p99. Total tokens >> p99.
  • Hides in: Repeating sibling spans under one parent.
  • Fastest detection: Span-count alarm; "max steps reached" log.
  • Suspect layer: Loop controller, or model's stop-condition reasoning.
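Counting repeated siblings is simple enough to sketch. The version below assumes each span is a dict with `parent_id` and `name` keys, which is a hypothetical schema rather than any particular tracing library's format, and the threshold of five repeats is an arbitrary placeholder to tune against your own p99.

```python
from collections import Counter


def looks_like_stuck_loop(spans, max_repeats=5):
    """Flag a trace when one tool name repeats too often under one parent.

    spans: list of dicts with 'parent_id' and 'name' (assumed schema).
    """
    counts = Counter((s["parent_id"], s["name"]) for s in spans)
    return any(n > max_repeats for n in counts.values())
```

Because the count is per tool name under a parent, an agent that oscillates between two tools also trips the check once either tool passes the threshold.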

3) Hallucinated argument

Model invents a value — order_id="ORD-9999999" that does not exist, or customer_email="john@unknown.com" that the model made up. Tool returns 404 or empty result. Agent often barrels on as if the lookup succeeded.

  • Signature: Tool span input does not match any real entity.
  • Hides in: The tool call's input JSON.
  • Fastest detection: Cross-check tool input against known IDs at log time.
  • Suspect layer: Model, with prompt as accomplice.
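A log-time cross-check might look like the sketch below. It assumes the tool input is already parsed into a dict and that a set of known IDs is available in memory; the function name and the `id_fields` parameter are hypothetical, and a real system would more likely query a database or cache.

```python
def find_unknown_ids(tool_input, known_ids, id_fields=("order_id",)):
    """Return (field, value) pairs whose value is not a known entity.

    tool_input: parsed JSON arguments of one tool call.
    known_ids: set of identifiers that really exist (assumed to be available).
    """
    return [
        (field, tool_input[field])
        for field in id_fields
        if field in tool_input and tool_input[field] not in known_ids
    ]
```

An empty list means every checked field matched a real entity; anything else is worth attaching to the tool span as evidence.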

4) Refusal cascade

Model refuses a legitimate request. "I cannot help with that." The whole flow halts. Downstream agents wait for an output that never comes. One refusal can trigger retries that all refuse, draining budget.

  • Signature: Output text matches refusal patterns. Trace short.
  • Hides in: The first LLM span's completion.
  • Fastest detection: Refusal-phrase regex over completion text + spike alert.
  • Suspect layer: Model safety policy, or upstream prompt injection.
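The refusal-phrase regex can be as small as the sketch below. The phrase list is an illustrative assumption, not a complete catalogue, and it only matches straight apostrophes; a real deployment would grow the list from observed refusals.

```python
import re

# Illustrative phrases only; extend from refusals you actually observe.
REFUSAL_RE = re.compile(
    r"\b(i cannot help|i can't help|i am unable to|i'm unable to)\b",
    re.IGNORECASE,
)


def is_refusal(completion):
    """True when the completion text contains a known refusal phrase."""
    return bool(REFUSAL_RE.search(completion))


def refusal_rate(completions):
    """Share of completions that look like refusals (0.0 for an empty batch)."""
    if not completions:
        return 0.0
    return sum(is_refusal(c) for c in completions) / len(completions)
```

A spike alert then compares `refusal_rate` over a recent window against its usual level.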

5) Silent success

Tool returned 200 OK. Latency normal. But it did nothing useful. A "refund" tool that accepted the call and refunded zero dollars. A "send email" tool that returned a fake message ID because it was running in test mode. Agent reads success=true and moves on.

  • Signature: Tool span ok, but business state did not change.
  • Hides in: The gap between tool reply and real-world effect.
  • Fastest detection: Reconcile tool calls against side-effect logs (DB, webhook).
  • Suspect layer: Tool itself (often a stale staging endpoint), or prompt parsing.
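Reconciliation is a set difference at heart. The sketch below assumes each tool call record carries a `status` and a `ref`, and that the side-effect log (a payments ledger, a webhook table) can be reduced to a set of references; both shapes are hypothetical.

```python
def find_silent_successes(tool_calls, side_effect_refs):
    """Return refs of calls that reported ok but left no side effect.

    tool_calls: list of dicts with 'status' and 'ref' (assumed schema).
    side_effect_refs: set of refs seen in the out-of-band log.
    """
    return [
        call["ref"]
        for call in tool_calls
        if call["status"] == "ok" and call["ref"] not in side_effect_refs
    ]
```

Calls that reported an error are deliberately ignored here; those are classical tool failures and already visible in the trace.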

6) Capability regression

Last week agent could draft SQL. This week, after a model version bump, the same prompt fails. Spans look normal. Output is just worse — often on one type of input.

  • Signature: A test set that passed on model-2025-04-01 fails on model-2025-05-01.
  • Hides in: The model version evidence tag.
  • Fastest detection: Regression eval pinned to model deployment.
  • Suspect layer: Model. Almost always model.
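A pinned regression eval boils down to comparing per-case results across two model versions. In this sketch the results are plain dicts mapping a case ID to pass or fail; the case IDs and function names are hypothetical, and running the eval itself is out of scope.

```python
def pass_rate(results):
    """Fraction of cases that passed. results maps case_id -> bool."""
    return sum(results.values()) / len(results)


def regressions(baseline, candidate):
    """Cases that passed on the baseline model but not on the candidate.

    A case missing from the candidate run is treated as a failure.
    """
    return sorted(
        case
        for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
```

The list of regressed cases is usually more useful than the aggregate pass rate, because a capability regression often hits one type of input while the overall number barely moves.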

7) Cost blow-up

Output quality fine. But cost-per-conversation jumped 10x overnight. Maybe tokens-per-call doubled. Maybe retries are tripling. Maybe a tool description ballooned.

  • Signature: Cost percentile spike with stable success rate.
  • Hides in: Token counts on each LLM span, retry counts on tool spans.
  • Fastest detection: Per-trace cost histogram, broken by model and prompt version.
  • Suspect layer: Prompt size, loop controller, or retry policy.
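One way to slice cost by model and prompt version is sketched below. It assumes each trace record has `model`, `prompt_version`, and `cost_usd` fields (a hypothetical schema), uses a nearest-rank percentile, and picks p95 and a 2x factor purely as example thresholds.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p95_cost_by_version(traces):
    """Group per-trace cost by (model, prompt_version) and take p95 of each."""
    groups = defaultdict(list)
    for t in traces:
        groups[(t["model"], t["prompt_version"])].append(t["cost_usd"])
    return {key: percentile(costs, 95) for key, costs in groups.items()}


def cost_spike(today_costs, baseline_costs, pct=95, factor=2.0):
    """True when today's percentile exceeds the baseline's by the given factor."""
    return percentile(today_costs, pct) > factor * percentile(baseline_costs, pct)
```

Splitting by prompt version is what makes a newly shipped prompt stand out, since the old version's group stays flat while the new one jumps.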

8) Drift (the cold case)

No single trace looks wrong. But week-over-week thumbs-up rate slid from 92% to 78%. Inputs shifted. Or retrieved docs aged. Or users found new edge cases.

  • Signature: Aggregate metric trend. No single failing trace.
  • Hides in: The crime statistics, not any one case file.
  • Fastest detection: Rolling baseline + input distribution monitor.
  • Suspect layer: Memory, retrieval, or input population itself.

This is the cold case from the ELI5. Three suspects — memory, retrieval freshness, input drift — leave no fingerprint on one trace. They show up only when you zoom out.
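A rolling baseline is one way to zoom out. The sketch below takes a list of weekly thumbs-up rates and flags any week that falls more than a set amount below the mean of the preceding weeks; the four-week window and the five-point drop are assumed defaults, not recommendations, and the input distribution monitor is not covered here.

```python
def drift_alerts(weekly_rates, window=4, max_drop=0.05):
    """Return indexes of weeks that fall below the rolling baseline.

    weekly_rates: thumbs-up rates as fractions, oldest first.
    window: how many preceding weeks form the baseline.
    max_drop: allowed drop below the baseline before alerting.
    """
    alerts = []
    for i in range(window, len(weekly_rates)):
        baseline = sum(weekly_rates[i - window:i]) / window
        if baseline - weekly_rates[i] > max_drop:
            alerts.append(i)
    return alerts
```

A slow slide can drag the baseline down with it, so a longer window or a fixed reference period may be needed to catch gradual drift.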

Diagnostic table

| Failure | First signal | Where to look | Likely suspect |
| --- | --- | --- | --- |
| Wrong-output | Thumbs-down on a normal-looking trace | Final LLM span output | Prompt / model / retrieved context |
| Stuck-loop | Span count >> p99, cost >> p99 | Sibling-span repetition | Loop controller / model stop logic |
| Hallucinated arg | Tool 404 or empty result | Tool span input JSON | Model + prompt |
| Refusal cascade | Refusal regex in completion | First LLM span text | Model safety / prompt injection |
| Silent success | Business state unchanged | Tool reply vs DB / webhook | Tool endpoint / prompt parsing |
| Capability regression | Eval pass rate drop after deploy | Model version tag | Model |
| Cost blow-up | $/conversation spike | Token counts, retry counts | Prompt size / loop / retry policy |
| Drift | Aggregate metric slide | Crime statistics dashboard | Retrieval / memory / input shift |

Worked example — one support agent, three failures in one week

A customer-support agent at a SaaS company. Same agent, three different failure types in seven days.

Tuesday 9 AM — wrong-output. User: "Bot told me my annual plan is refundable. It is not." On-call opens the trace. All spans ok. Retrieval pulled refund_policy_v17.md — that doc was outdated last month. Confession in 4 minutes: stale retrieved context. Failure type — wrong-output. Fix: re-index; add a regression eval (a lock) that asks the same question.

Wednesday 11 AM — silent success. User: "I asked for a refund. Bot said done. I never got the money." The issue_refund tool returned {"status":"ok","ref":"test-mode"}. A staging URL had leaked into production config. Agent had no way to know. Confession in 8 minutes. Failure type — silent success. Fix: assert side-effect on tool reply; alert if test-mode appears in production.

Thursday 3 PM — cost blow-up. Finance pings on-call. Yesterday's spend was 11x baseline. No quality complaints. The on-call pulls the cost histogram. Median tokens-per-call doubled. Reason: someone shipped a new prompt with a 4,000-token few-shot block. Confession in 6 minutes. Failure type — cost blow-up. Fix: revert prompt; add token-budget assertion on PR.

Three different bugs, three different first moves. With one playbook ("look at logs") each one would have taken hours; with a taxonomy each one took minutes. The lineup is fast because the on-call walked in already knowing which suspect to interrogate first.
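The Wednesday and Thursday fixes can both be sketched as small guardrail checks. Everything here is an assumption for illustration: the function names are hypothetical, the test-mode marker is taken from the reply shown in the example, and the token estimate is a crude four-characters-per-token stand-in that should be swapped for the real tokenizer of whichever model is in use.

```python
def is_test_mode_leak(tool_reply, environment):
    """True when a production tool reply carries the test-mode marker."""
    return environment == "production" and tool_reply.get("ref") == "test-mode"


def estimate_tokens(text):
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return len(text) // 4


def within_token_budget(prompt, max_tokens):
    """Check a prompt against a budget, for use as a pre-merge assertion."""
    return estimate_tokens(prompt) <= max_tokens
```

The first check belongs at runtime next to the tool call; the second belongs in CI so that an oversized few-shot block fails the pull request instead of the budget.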

A note on overlap

Real bugs sometimes wear two hats. A hallucinated arg can cause a stuck-loop (model retries the bad call). A capability regression can look like wrong-output for one feature. A cost blow-up can hide a stuck-loop underneath.

Use the taxonomy as a starting bucket, not a final verdict. The bucket tells you which evidence to fetch first. The confession comes after the investigation, not before.


Failure taxonomy across shipped agent stacks

  • Anthropic Claude — incident response engineer: classifies a wave of "weird answers" after a model rollout as capability regression, not random noise, by pinning to the new model version tag.
  • OpenAI ChatGPT — reliability engineer: the August 2024 quality complaints traced to a tokenizer change manifested as wrong-output across many languages, not a refusal or outage.
  • GitHub Copilot — platform engineer: treats "completion never returns" alerts as stuck-loop in the agent layer, distinct from upstream model 5xx errors.
  • Notion AI — support engineer: tags every thumbs-down by failure type (wrong-output vs refusal vs silent-tool-fail) so engineering pulls the right traces instead of one undifferentiated pile.
  • Linear Asks — on-call: flags cost-blow-up separately from quality regressions; a Slack alert fires when p95 tokens-per-conversation exceeds the rolling baseline by 2x.
  • PagerDuty for AI services: routes incidents by failure class — stuck-loop alarms wake the agent-runtime team, refusal-cascade alarms wake the safety-prompt team, drift alarms wake the data team — so one tag decides who gets paged.
  • Air Canada chatbot (2024 tribunal): post-incident review classified the failure as wrong-output on the policy-violation slice; the taxonomy named what no demo had measured, which is what regulators required in the public record.
  • Sentry for LLM apps: groups exceptions by failure shape rather than stack trace, so 200 identical hallucinated-argument 404s collapse into one issue with a count.
  • Datadog LLM Observability: the "error type" facet in the trace explorer is the taxonomy made queryable; engineers filter failure_type:silent_success to find tool calls that lied.
  • LangSmith feedback tags: each thumbs-down is annotated with a failure-class tag, which feeds the dataset that becomes the next regression eval.
  • Arize Phoenix monitors: ships preset detectors for hallucination, refusal, runaway loops, and drift (four of the eight buckets out of the box), with the rest configurable.
  • Honeycomb LLM tracing: distinguishes "high span count" from "high latency" so a stuck-loop never gets misfiled as a slow-tool incident in the on-call inbox.
  • AWS Bedrock CloudWatch dashboards: ship a "guardrail blocked" panel separate from a "tool error" panel, because refusal-cascade and silent-success need different mitigations.
  • Microsoft Azure App Insights for AI agents: the "failure type" dimension lets a single dashboard show the eight-class breakdown over a release window.
  • Google Vertex AI Agent Builder: distinguishes "tool failed" from "tool unused" (a cousin of silent success, where the agent skipped a needed tool and nothing errored), because the two demand different fixes.
  • Hamel Husain's eval playbook (industry blog): the canonical "look at your data" loop teaches teams to label failure type first, because patches built without a label tend to regress something else.
  • Cursor postmortems (public commits): changelog entries explicitly name the failure class fixed ("tool argument hallucination on rename") so future engineers know which regression eval guards the fix.
  • Perplexity citation-fail dashboards: treats unsupported-citation as a distinct class from wrong-output, because the fix lives in retrieval, not in the answer prompt.
  • Slack AI summary incidents: distinguishes "summary too short" (capability regression after a model swap) from "summary fabricated" (hallucination) — different on-call rotations entirely.
  • LangFuse failure analytics: the dashboard pivots traces by user-tagged failure category, so a PM can ask "how many silent-success cases this week?" without filing an engineering ticket.
  • Comet Opik incident view: groups recurring failures by signature so a stuck-loop pattern bubbles up even when no single trace tripped the latency alarm.
  • Helicone alert routing: alerts on cost-blow-up percentiles per model version, which is the taxonomy slice that finance, not engineering, watches first.
  • Vercel AI SDK error traces: error envelopes carry a kind field aligned with the agent taxonomy, not the HTTP code, because the SDK's authors learned that 200 OK lies in agents.

Recall — name eight failure types cold

  1. Which failure type is the most common in production, and why is it the hardest to detect from dashboards alone?
  2. A user says "I never got my refund" but the trace shows the tool returned 200 OK. Which failure type is this?
  3. Why does drift not show up in any single case file, and where must you look instead?
  4. Name the three questions in the decision tree, in order.

Interview Q&A

Q: Why is "wrong-output" the most common agent failure and the hardest to monitor? A: The trajectory looks healthy. All spans return ok. Latency and cost are normal. The bug lives inside the final text — which dashboards do not read. You need user feedback or offline eval to catch it.

Common wrong answer to avoid: "Because models hallucinate." Hallucination is one cause, but the deeper issue is that the system reports success when the answer is wrong. Detection requires content evaluation, not infrastructure metrics.

Q: How do you distinguish a stuck-loop from a slow response in a single trace? A: A slow response has one or two long spans. A stuck-loop has many repeating sibling spans under the same parent — usually the same tool name called over and over, or two tools alternating. Span count, not span duration, is the tell.

Common wrong answer to avoid: "Look at total latency." High latency happens for many reasons (slow model, slow downstream tool). It does not separate loop from slowness. Span count and tool-name repetition do.

Q: A tool returns 200 OK but business state did not change. Why is this not a classical tool failure, and how do you detect it? A: Classical monitoring trusts the HTTP code. The tool itself reports success. The agent reads success and proceeds. Detection needs an out-of-band check — reconcile the tool's claimed effect against the actual side effect (DB row, webhook, email). The trace alone is not enough.

Common wrong answer to avoid: "Add a retry." Retrying a tool that lies produces more lies. The fix is verification of the side effect, not repetition of the call.

Q: Why is drift treated as a separate failure class instead of just "many wrong-outputs added up"? A: The response is different. Wrong-output has a fixable root cause in one trace. Drift has no single guilty trace — the input population shifted, retrieved corpus aged, or user behavior evolved. Fix is re-indexing or expanding the eval set, not patching a prompt.

Common wrong answer to avoid: "Just retrain the model." Drift often has nothing to do with the model. It is usually data — stale retrieval, shifted inputs, outdated tool descriptions. Retraining without finding the data cause wastes weeks.


Apply now (10 min)

Step 1 — model the exercise. Take the Tuesday wrong-output bug from the worked example. Here is the row I would write before opening any code:

| Field | Value |
| --- | --- |
| Failure type | wrong-output |
| First signal | thumbs-down on a trace where every span is ok |
| Where to look | final LLM span output + retrieved-doc evidence tag |
| Suspect layer | retrieved context (stale refund_policy_v17.md) |
| First fix move | reindex; add a regression eval (a lock) that asks "is the annual plan refundable?" |

Notice the row points at retrieval, not the model, because the trace's doc_version tag is the discriminating evidence. The taxonomy choice is what made that tag worth reading first.

Step 2 — your turn. Take three recent bugs from any system you work on. For each, write the failure type (from the eight), the first signal you would alert on, and the suspect layer you would interrogate first. If a bug does not fit, write why — that mismatch is real signal that your taxonomy needs a new bucket.

Step 3 — reproduce from memory. Draw the three-question decision tree from "complaint received" down to the eight failure buckets. No peeking. Then write one diagnostic-table row for each bucket: first signal, where to look, suspect. Connect at least one bucket back to the eval-discipline rule from module 24 — a quality claim covers only the population the measurement sampled — and name which failure class breaks that rule loudest.

What you should remember

This chapter explained why a complaint slip that says "your bot did something weird" cannot be fixed until it has a name. Agents fail along axes that classical SRE never had to model: a 200 OK with a wrong answer, a loop that burns budget without ever erroring, a tool that succeeds while changing nothing. The eight-bucket taxonomy — wrong-output, stuck-loop, hallucinated arg, refusal cascade, silent success, capability regression, cost blow-up, drift — gives every complaint a fingerprint, and every fingerprint points the lineup at a specific suspect layer before any code is read.

The worked example (the support-agent week of three different bugs) collapsed from "look at logs for hours" to "minutes per bug" because each one was tagged at the door. Tuesday's wrong-output sent the on-call to the retrieved-doc tag; Wednesday's silent success sent them to a side-effect reconciliation; Thursday's cost blow-up sent them to a token histogram. Same investigator, three different first moves, because the bucket decided which evidence to fetch.

Carry this diagnostic forward: when a complaint lands, run the three-question decision tree before opening the trace UI. Did it finish? Was the trajectory weird? Did the bill spike? The bucket you land in is not the confession — that comes after the lineup — but it is the only honest way to choose which case file to open and which witness note to read first. If you see a bug that fits two buckets, follow both branches in parallel rather than collapsing it prematurely; overlap is information, not noise.

Remember:

  • A 200 OK does not mean an agent succeeded. Trace status is necessary, never sufficient.
  • Drift leaves no fingerprint on a single case file. It lives in the crime statistics, and you must zoom out to see it.
  • Stuck-loop is a span-count problem, not a latency problem. Counting siblings beats reading durations.
  • Silent success is detected out-of-band — reconcile tool replies with the side effect (DB, webhook, email), not with each other.
  • Capability regression is almost always the model, and almost always caught by a regression eval pinned to deployment.
  • The taxonomy is a starting bucket, not a verdict. The confession still arrives after the investigation, not before.

Bridge. We can name the bug in ten seconds. But naming is not solving. The next move is fetching the case file — the trace — from a vague user complaint. How do we go from "your bot did something weird" to one exact trace ID? → 02-from-complaint-to-trace.md