01. Failure taxonomy — name the illness before prescribing treatment

~13 min read. Reliability starts when the team stops calling every bad event just "an error."

Built on the ELI5 in 00-eli5.md. The triage desk — fast classification of what kind of failure arrived — is the first job in any serious AI system.


1) First picture: four failure families

One red dashboard is not enough. We need categories. Otherwise the team picks the wrong fix.

incoming bad event
      ├── transient ──→ brief glitch, may recover soon
      ├── persistent ─→ stays broken until something changes
      ├── silent ─────→ answer looks fine but is wrong or unsafe
      └── cascading ──→ one broken step poisons later steps

The simple version: the triage desk exists for this. If you classify badly, you retry what needed a rollback, or you escalate what only needed patience.

The production problem: many AI teams classify only by HTTP code. That is too shallow. A 200 OK can still hide a silent failure. A 503 may be transient, or it may signal a provider-wide outage. The class determines the playbook.

That is why the vitals monitor feeds the triage desk. The monitor sees symptoms. The desk names the disease.
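A minimal sketch of a first-pass label that looks past the status code. The function name first_pass_label, the passed_acceptance_check flag, and the status-code groupings are all assumptions for illustration; a real desk would refine these labels with retry history and provider status.

```python
# Hypothetical first-pass triage: the status code is only one input.
# The code groupings below are illustrative, not a standard mapping.
TRANSIENT_CANDIDATES = {408, 429, 502, 503, 504}


def first_pass_label(status_code: int, passed_acceptance_check: bool) -> str:
    """Return a provisional failure label for one response."""
    if 200 <= status_code < 300:
        # A 2xx response is only healthy if the content also checks out.
        return "healthy" if passed_acceptance_check else "silent"
    if status_code in TRANSIENT_CANDIDATES:
        # Still only a candidate: a 503 may turn out to be a long outage.
        return "transient-candidate"
    return "persistent-candidate"


if __name__ == "__main__":
    print(first_pass_label(200, passed_acceptance_check=False))
    print(first_pass_label(503, passed_acceptance_check=False))
```

The labels end in "candidate" on purpose: the first guess should be revised once retries either succeed or keep failing.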

2) Transient versus persistent failures

A transient failure is short-lived. The world outside your code changed for a moment. Examples are rate limits, temporary network loss, or one overloaded model replica. These often improve with time. That is where the retry dose may help.

A persistent failure keeps failing until you change something. Examples are bad credentials, schema drift, broken prompt templates, or a tool endpoint removed yesterday. Retrying these only adds load.

same request repeated three times

attempt 1  rate limit 429  ── transient candidate
attempt 2  rate limit 429  ── maybe still transient
attempt 3  success         ── yes, retry helped

attempt 1  JSON schema mismatch
attempt 2  JSON schema mismatch
attempt 3  JSON schema mismatch  ── persistent, stop retrying

The difference is practical: both feel like failure, but only one deserves patience. Suppose a tool call to fetch_order_status fails.

Case A.

11:00:01  timeout
11:00:03  timeout
11:00:07  success

The provider was slow. This smells transient.

Case B.

11:00:01  response field order_state missing
11:00:03  response field order_state missing
11:00:07  response field order_state missing

Now the contract changed. This is persistent.

The triage desk should label Case A for backoff. It should label Case B for code or config repair.
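The two cases can be sketched as a retry wrapper that backs off on transient errors and stops immediately on persistent ones. The exception classes, the call_with_triage name, and the delay values are hypothetical; how a real system decides which exception to raise is left out.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may clear on its own (timeout, 429)."""


class PersistentError(Exception):
    """Hypothetical: a failure that needs a code or config change."""


def call_with_triage(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff only on TransientError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PersistentError:
            # Case B: the contract changed, so another attempt only adds load.
            raise
        except TransientError:
            # Case A: wait and try again, unless the retry budget is spent.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so the backoff schedule can be checked without real waiting.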

3) Loud failures versus silent failures

Loud failures announce themselves. Timeout. Crash. Malformed JSON. Tool exception. The user sees an obvious break.

Silent failures are more dangerous. The response looks polished, but the content is wrong, unsafe, or hallucinated, or the action hits the wrong entity.

That is why AI reliability is tricky. Your vitals monitor cannot stop at uptime. It must also watch quality symptoms.

loud failure                    silent failure
┌─────────────────────┐        ┌─────────────────────┐
│ request timed out   │        │ request succeeded   │
│ user sees error     │        │ user gets wrong SKU │
│ pager may fire      │        │ no pager by default │
└─────────────────────┘        └─────────────────────┘

The production problem: silent failures can be more expensive. A bad chatbot answer may create refunds. A wrong compliance summary may create legal risk. A duplicate tool action may charge money twice.

For example, a loan-assistant model returns: "Applicant income verified." The API call succeeded. The JSON parsed. Latency was normal. But the retrieval step fetched last year's file. The output is silently wrong.

The practical response: use acceptance checks, reference checks, entity validation, and human review on risky classes. The senior doctor often matters most for silent failures.
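A minimal sketch of an acceptance check for the loan example, combining entity validation with a freshness check. RetrievedDoc, its fields, and the 90-day threshold are assumptions made for illustration, not part of any real system described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetrievedDoc:
    """Hypothetical record of what the retrieval step returned."""
    applicant_id: str
    doc_date: date


def acceptance_check(doc, expected_applicant_id, today, max_age_days=90):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if doc.applicant_id != expected_applicant_id:
        problems.append("entity mismatch")
    if (today - doc.doc_date).days > max_age_days:
        problems.append("stale document")
    return problems


if __name__ == "__main__":
    last_year = RetrievedDoc("A-17", date(2023, 3, 1))
    print(acceptance_check(last_year, "A-17", today=date(2024, 3, 1)))
```

A non-empty result is the signal to hold the answer and route it to human review instead of returning "Applicant income verified."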

4) Cascading failures are system failures, not local failures

A cascading failure starts in one place, then spreads. This is common in agent pipelines. One bad step feeds the next step. The later steps trust poisoned context.

planner fails slightly
wrong tool chosen
wrong data fetched
final answer sounds confident

See the danger. The first failure may be tiny. The last failure may be huge. The sealed ward idea matters here. If one component looks infected, stop letting it contaminate others.

For example, a shopping assistant does four steps.

  1. Intent classifier labels the user as asking for a return.
  2. Agent opens the returns tool.
  3. Tool fetches return policy.
  4. Model answers with return instructions.

But the user actually asked for warranty repair. Step 1 failed silently. Steps 2 to 4 were perfectly healthy. The full pipeline still failed. That is cascading failure. The local health of later steps did not save the patient.
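One way to limit this kind of spread is a gate after the first step, sketched below. It assumes the intent classifier exposes a confidence score, which the example above does not state, and the run_pipeline name, step names, and 0.8 threshold are hypothetical. A gate like this cannot catch a classifier that is confidently wrong, so it complements end-to-end checks and does not replace them.

```python
def run_pipeline(intent: str, confidence: float, min_confidence: float = 0.8) -> dict:
    """Run the four shopping-assistant steps, halting early on a shaky intent."""
    if confidence < min_confidence:
        # Sealed ward: do not let an uncertain label feed later steps.
        return {
            "status": "halted",
            "action": "ask_user_to_confirm_intent",
            "steps_run": ["classify"],
        }
    steps = ["classify", "open_tool:" + intent, "fetch_policy", "answer"]
    return {"status": "completed", "action": "answer", "steps_run": steps}


if __name__ == "__main__":
    print(run_pipeline("return", confidence=0.55))
    print(run_pipeline("return", confidence=0.93))
```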

5) A practical triage matrix

So what should your team record for each failure? At least four labels.

  • Visibility: loud or silent.
  • Duration: transient or persistent.
  • Scope: single request or cascading.
  • Harm: low, medium, high.

These four labels make each failure far more actionable.

failure card
┌─────────────────────────────┐
│ class: silent + cascading   │
│ duration: persistent        │
│ harm: high                  │
│ first action: isolate route │
└─────────────────────────────┘

Now your triage desk can route smartly. Transient plus loud? Try the retry dose. Persistent plus loud? Open the sealed ward and repair. Silent plus high harm? Bring the senior doctor fast.

This is how reliability becomes operational. Not poetic. Concrete.
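The failure card and routing rules can be sketched as a small data structure plus a lookup. FailureCard, first_action, the action strings, and the rule ordering (cascading checked first, so the card shown earlier yields "isolate route") are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCard:
    visibility: str  # "loud" or "silent"
    duration: str    # "transient" or "persistent"
    scope: str       # "single" or "cascading"
    harm: str        # "low", "medium", or "high"


def first_action(card: FailureCard) -> str:
    """Pick a first response; the rule order is an illustrative choice."""
    if card.scope == "cascading":
        return "isolate route"              # sealed ward
    if card.visibility == "silent" and card.harm == "high":
        return "escalate to human review"   # senior doctor
    if card.visibility == "loud" and card.duration == "transient":
        return "retry with backoff"         # retry dose
    if card.visibility == "loud" and card.duration == "persistent":
        return "isolate and repair"         # sealed ward, then fix
    return "log and review"


if __name__ == "__main__":
    card = FailureCard("silent", "persistent", "cascading", "high")
    print(first_action(card))
```

Combinations not named in the text fall through to a default action here; a team would replace that with its own policy.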


Where this lives in the wild

  • GitHub Copilot Chat — platform SRE: classifies repeated 503 model failures as transient first, but escalates to persistent provider outage after the retry budget is exhausted.
  • Intercom Fin — support AI operations lead: treats fluent but policy-wrong support answers as silent failures and routes them into audit queues.
  • Perplexity — search reliability engineer: labels missing citation JSON as a loud parse failure, but labels stale citation sources as silent reliability debt.
  • Klarna assistant — payments workflow owner: marks duplicated refund actions as cascading failures because one mistaken agent step can trigger several downstream financial events.
  • Morgan Stanley internal knowledge assistant — risk engineer: records stale-document answers as silent persistent failures because the retrieval index stays wrong until rebuilt.

Pause and recall

  • Why is a 200 OK not enough to classify an AI request as healthy?
  • What makes a failure transient rather than persistent?
  • Why are silent failures often more dangerous than loud ones?
  • How can one early misclassification create a cascading pipeline failure?

Interview Q&A

Q: Why classify failures by transient versus persistent instead of only by status code?
A: Status codes show symptoms. Recovery policy depends on whether time can fix the issue or whether the system itself must change.
Common wrong answer to avoid: "Because persistent failures always return 500." Many persistent failures return 200 or 400 while still staying broken.

Q: Why are silent failures often harder operationally than loud crashes?
A: Loud crashes trigger obvious alarms. Silent failures pass basic health checks and damage trust, money, or safety before humans notice.
Common wrong answer to avoid: "Because silent failures are rare." They are often common in generative systems.

Q: Why should cascading failure be treated as a system property and not merely a bad component?
A: Later steps can amplify earlier mistakes, so local component uptime does not represent end-to-end reliability.
Common wrong answer to avoid: "Because cascading failure only happens in multi-region outages." It also happens inside one agent workflow.

Q: Why is retrying a persistent schema mismatch worse than doing nothing?
A: Retries consume budget, add load, and delay the real fix while the contract remains incompatible.
Common wrong answer to avoid: "Because retries are cheap compared with debugging." Cheap retries can still worsen incidents materially.


Apply now (5 min)

Exercise. List five failure events from one AI product you know. For each, label it as transient or persistent, loud or silent, and local or cascading. Then write the first response action.

Sketch from memory. Draw the four-branch triage desk picture. Add one note for where the retry dose helps, where the sealed ward helps, and where the senior doctor is required.


Bridge. Classification tells us what kind of trouble arrived. Next we need the vitals monitor that detects the trouble quickly enough to matter. → 02-error-detection.md