00. Reliability & Failure Management for AI Systems — The Five-Year-Old Version

You already know how to build AI features. Now learn how to keep them safe when things go wrong.

Imagine a busy hospital emergency department. Patients keep arriving. Some are truly critical. Some only look critical. Some get better after one more dose. Some need to be moved away from a contaminated ward. Some need another hospital. Some need the senior doctor immediately. And sometimes the best job is simple: keep the patient stable first.

An AI system has the same operating shape. A user request arrives like a patient. The system checks what kind of trouble it is; that is the triage desk. It watches health signals; that is the vitals monitor. If one model call fails for a temporary reason, it may try again carefully; that is the retry dose. If one model or tool starts harming many requests, traffic stops going there; that is the sealed ward.
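As a rough illustration, the retry dose and the sealed ward can be sketched together in a few lines. This is a minimal sketch, not a production design: `TransientError`, `SealedWard`, `retry_dose`, and every default value (three attempts, a 30-second cooldown) are hypothetical names and numbers chosen for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a brief timeout."""


class SealedWard:
    """Tiny circuit breaker: seals a path after `threshold` consecutive failures."""

    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means traffic flows normally

    def allows_traffic(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, let a probe request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def retry_dose(call, ward, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Retry `call` on TransientError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        if not ward.allows_traffic():
            raise RuntimeError("ward is sealed; use a fallback path")
        try:
            result = call()
        except TransientError:
            ward.record_failure()
            if attempt == max_attempts - 1:
                raise  # retry budget spent; surface the failure
            # Wait a random time up to base * 2^attempt so clients do not sync up.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
        else:
            ward.record_success()
            return result
```

Only errors marked as transient are retried here; anything else propagates immediately, which matches the idea that a retry is a careful dose rather than blind hammering.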

If the main path is unavailable, the system sends the work somewhere else; that is the backup ambulance. If the machine is uncertain, or the stakes are high, a human takes over; that is the senior doctor. And if nothing ideal is available, the product should still avoid panic. It gives the user the safest partial help possible; that is the stability kit.
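One way to picture the backup ambulance, the senior doctor, and the stability kit working together is a single handler that tries paths in order. The sketch below is illustrative only: `Answer`, `handle`, the path names, and the user-facing messages are assumptions made for this example, not part of any real library.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    source: str             # name of the path that produced the text
    degraded: bool = False  # True when the user got less than the main path


def handle(request, paths, high_stakes=False, escalate=None):
    """`paths` is an ordered list of (name, callable) pairs, main path first."""
    # Senior doctor: high-stakes work goes to a person before any model runs.
    if high_stakes and escalate is not None:
        escalate(request)
        return Answer("A person will review this request.", "senior_doctor", True)

    # Backup ambulance: try each path in order until one succeeds.
    for index, (name, run) in enumerate(paths):
        try:
            return Answer(run(request), name, degraded=index > 0)
        except Exception:
            continue  # a real system would log and classify the failure here

    # Stability kit: nothing worked, so say so plainly instead of guessing.
    return Answer("We cannot finish this right now. Please try again later.",
                  "stability_kit", True)
```

The `degraded` flag is the part that keeps the fallback honest: the caller can tell the user that a smaller model or a cached answer was used instead of presenting it as full-quality output.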

The whole point is that reliability is not only uptime. A model can reply with broken JSON. A tool can run twice and charge twice. A router can keep sending traffic to a dying provider. An agent can fail on step three, then poison steps four and five. The app may look alive, but the patient is not okay.
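Two of these failures, the broken JSON reply and the tool that charges twice, are easy to show in miniature. The following is a hedged sketch: `parse_model_json` and `ChargeTool` are hypothetical helpers invented here, and a real payment tool would persist its idempotency keys rather than hold them in memory.

```python
import json


def parse_model_json(raw, required_keys):
    """Return the parsed dict, or None if the reply is not usable JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Valid JSON is not enough: it must be an object with the expected fields.
    if not isinstance(data, dict) or not set(required_keys).issubset(data):
        return None
    return data


class ChargeTool:
    """Hypothetical payment tool that ignores repeated idempotency keys."""

    def __init__(self):
        self._receipts = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._receipts:
            # Same key seen before: replay the receipt, do not charge again.
            return self._receipts[idempotency_key]
        self.charges_made += 1
        receipt = {"key": idempotency_key, "amount_cents": amount_cents}
        self._receipts[idempotency_key] = receipt
        return receipt
```

With the key in place, a retried tool call returns the first receipt instead of creating a second charge, so the retry becomes safe.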

Reliable AI systems need detection, classification, retries, cutoffs, fallbacks, timeouts, deduplication, human escalation, chaos drills, and incident response. This module teaches those pieces as one pressure chain: first name the failure, then detect it, decide whether a retry is safe, isolate bad paths, fall back without lying, degrade honestly, respect time budgets, prevent duplicate side effects, escalate risky cases, stop cascades, drill bad days, and run incidents without group panic.
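The first link in that chain, naming the failure, can be sketched as a small classifier. The four classes below are the ones this module uses (transient, persistent, silent, cascading); everything else, including the `signal` dictionary fields and the error labels, is a hypothetical shape assumed for this example.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    SILENT = "silent"
    CASCADING = "cascading"


# Assumed error labels that usually clear up on their own.
RETRYABLE_ERRORS = {"timeout", "rate_limited"}


def triage(signal):
    """Map a failure signal (a plain dict) to a class, or None if healthy."""
    if signal.get("upstream_failed", False):
        return FailureClass.CASCADING   # an earlier step already went wrong
    error = signal.get("error")
    if error in RETRYABLE_ERRORS:
        return FailureClass.TRANSIENT
    if error is not None:
        return FailureClass.PERSISTENT  # e.g. bad credentials; retrying will not help
    if not signal.get("output_valid", True):
        return FailureClass.SILENT      # no error raised, but the output is unusable
    return None


def retry_is_safe(failure_class):
    """In this sketch, only transient failures earn a retry."""
    return failure_class is FailureClass.TRANSIENT
```

The ordering is a design choice: an upstream failure is checked first so that a timeout caused by an earlier broken step is treated as part of a cascade rather than retried in isolation.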

The triage desk is classification. The vitals monitor is health detection. The retry dose is a careful repeat attempt. The sealed ward is a circuit breaker. The backup ambulance is a fallback path. The senior doctor is human-in-the-loop review. The stability kit is graceful degradation.

If any later file feels abstract, come back here and picture the emergency room again.

The placeholders you will see called back

| Placeholder | Meaning |
| --- | --- |
| triage desk | Fast classification of what kind of failure just arrived. |
| vitals monitor | Health signals like timeouts, parse checks, refusal rates, and quality alarms. |
| retry dose | A careful repeat attempt, not blind hammering. |
| sealed ward | A circuit breaker that isolates a bad model or tool path. |
| backup ambulance | A fallback model, agent, cache, or alternate workflow. |
| senior doctor | Human escalation for high-risk or low-confidence cases. |
| stability kit | Reduced but safe service when full quality is unavailable. |

---

What's coming

01-failure-taxonomy.md — sort AI failures into transient, persistent, silent, and cascading classes.
02-error-detection.md — detect timeouts, malformed outputs, refusals, and hidden bad states.
03-retry-backoff.md — apply exponential backoff, jitter, and retry budgets without causing storms.
04-circuit-breakers.md — move breakers between closed, open, and half-open states so bad paths are cut off before they infect the whole system.
05-fallback-strategies.md — switch to smaller models, alternate agents, caches, or rule-based backups.
06-graceful-degradation.md — keep the user stable with partial answers and honest limits.
07-timeout-management.md — divide time budgets across steps and streaming stages.
08-idempotency-dedup.md — make retries safe and prevent duplicate tool execution.
09-human-escalation.md — decide when to route uncertain or high-stakes cases to people.
10-cascading-failure.md — stop one weak step from collapsing the full pipeline.
11-chaos-testing-ai.md — inject failures on purpose and test the bad-day playbook.
12-incident-response.md — runbooks, rollback, comms, and post-mortems for AI incidents.
13-honest-admission.md — face the gaps we still cannot measure or prevent cleanly.
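Of the topics listed above, dividing a time budget across steps is one that a few lines of code make concrete. The sketch below is only one possible approach, an even split of whatever time remains; the `Deadline` class and its method names are hypothetical and chosen for this example.

```python
import time


class Deadline:
    """Shared time budget for one request, handed down through every step."""

    def __init__(self, total_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def slice_for(self, steps_left):
        """Even share of the remaining time for the next step."""
        return self.remaining() / max(1, steps_left)
```

A pipeline with four steps and a 10-second budget would start by giving the first step `slice_for(4)` seconds as its timeout. If that step runs long, later steps automatically receive smaller slices, because each slice is computed from the time actually left rather than from the original plan.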

Bridge. Before fixing failures, classify them properly. A good emergency room starts at the triage desk. → 01-failure-taxonomy.md