11. Chaos testing for AI: practice the bad day before the bad day arrives
~14 min read. Reliability claims are weak until the team has deliberately broken the system and watched it recover.
Built on the ELI5 in 00-eli5.md. The vitals monitor, sealed ward, and stability kit should all be tested before a real emergency, because chaos drills tell us whether the hospital behaves well under stress.
1) First picture: inject failure on purpose, but with guardrails
Chaos testing is not random destruction. It is controlled fault injection. You break one thing intentionally, then observe whether detection, containment, fallback, and recovery behave correctly.
The simple version: If the answer to any of those four checks is no, your reliability story is incomplete. The goal is not chaos for entertainment. The goal is confidence in the playbook.
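A minimal sketch of what "controlled" can mean in code. The `FaultInjector` class, its fields, and the truncation fault are illustrative assumptions, not part of any specific chaos tool: the injector does nothing unless a drill switches it on, and it breaks only a fixed share of calls.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FaultInjector:
    """Hypothetical wrapper that corrupts a fixed share of responses while enabled."""
    mutate: Callable[[str], str]   # how to break a healthy response
    rate: float                    # share of calls to break, 0.0 to 1.0
    enabled: bool = False          # guardrail: off unless a drill is running
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    injected: int = 0              # count of faults actually injected

    def call(self, fn: Callable[[], str]) -> str:
        response = fn()
        if self.enabled and self.rng.random() < self.rate:
            self.injected += 1
            return self.mutate(response)
        return response


def healthy_model() -> str:
    # Stand-in for a real model call (assumption for this sketch).
    return '{"tool": "lookup_order", "order_id": "A17"}'


# The fault: cut the response in half so it no longer parses.
injector = FaultInjector(mutate=lambda r: r[: len(r) // 2], rate=0.1)

# Disabled: nothing is touched.
untouched = injector.call(healthy_model)

# Enabled: roughly 10% of calls come back truncated.
injector.enabled = True
results = [injector.call(healthy_model) for _ in range(1000)]
broken = sum(r != healthy_model() for r in results)
print(f"injected {broken} faults out of 1000 calls")
```

The `enabled` flag and the `rate` are the guardrails here. One fault, one known share of traffic, one switch to turn it off.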
2) AI systems need AI-specific chaos cases
Traditional chaos tests often focus on hosts, networks, and service crashes. Those still matter. But AI systems have extra failure shapes.
- malformed JSON from the model,
- false refusal bursts,
- retrieval returning empty results,
- stale context injection,
- tool returning semantically wrong but valid data,
- confidence estimator drift.
If you test only 500 errors, you miss silent AI-specific failure modes. The triage desk must be tested on semantic trouble too.
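A few of the cases above can be written as small mutators over a healthy response. This is only a sketch: the response fields (`model_text`, `retrieved_docs`, `context_age_days`, `tool_result`) and the `CHAOS_CASES` registry are hypothetical names, and false refusal bursts and estimator drift are left out because they need a traffic model rather than a single-response mutation.

```python
import json
from typing import Any, Callable

Response = dict[str, Any]


def malformed_json(resp: Response) -> Response:
    """Model output that no longer parses: drop the closing brace."""
    return {**resp, "model_text": resp["model_text"].rstrip("}")}


def empty_retrieval(resp: Response) -> Response:
    """Retrieval succeeds but returns nothing."""
    return {**resp, "retrieved_docs": []}


def stale_context(resp: Response) -> Response:
    """Injected context is far older than it should be."""
    return {**resp, "context_age_days": 400}


def wrong_but_valid_tool_data(resp: Response) -> Response:
    """Tool output has the right shape but a nonsensical value."""
    return {**resp, "tool_result": {"order_total": -1.0, "currency": "USD"}}


CHAOS_CASES: dict[str, Callable[[Response], Response]] = {
    "malformed_json": malformed_json,
    "empty_retrieval": empty_retrieval,
    "stale_context": stale_context,
    "wrong_but_valid_tool_data": wrong_but_valid_tool_data,
}

healthy: Response = {
    "model_text": '{"tool": "lookup_order", "order_id": "A17"}',
    "retrieved_docs": ["refund-policy.md"],
    "context_age_days": 1,
    "tool_result": {"order_total": 42.0, "currency": "USD"},
}


def parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


for name, mutate in CHAOS_CASES.items():
    faulty = mutate(healthy)
    changed = [key for key in healthy if faulty[key] != healthy[key]]
    print(name, "changes", changed)
```

Notice that none of these faults produces a 500 error. Every mutated response still arrives with a success status, which is exactly why a test suite built only on status codes would miss them.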
3) Start with containment questions, not heroics
Now what should a chaos drill prove? At minimum:
- did detectors fire,
- did breakers open when needed,
- did fallback activate,
- did degraded messaging stay honest,
- did side effects remain safe,
- did humans get paged only when appropriate.
The simple version: Notice what is missing. We are not asking, "Did the system avoid all visible impact?" Sometimes small controlled impact is acceptable. We ask whether the design behaved as intended.
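The six questions can be recorded as one observation per drill. The sketch below is one possible shape, with a hypothetical `DrillObservation` record; how each boolean gets filled in depends on your own telemetry.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DrillObservation:
    """One yes/no answer per containment question (field names are illustrative)."""
    detectors_fired: bool
    breakers_opened_when_needed: bool
    fallback_activated: bool
    degraded_messaging_honest: bool
    side_effects_safe: bool
    paging_appropriate: bool


def failed_checks(obs: DrillObservation) -> list[str]:
    """Return the names of the questions answered 'no'."""
    return [f.name for f in fields(obs) if not getattr(obs, f.name)]


# Example drill: everything held except the user-facing copy.
obs = DrillObservation(
    detectors_fired=True,
    breakers_opened_when_needed=True,
    fallback_activated=True,
    degraded_messaging_honest=False,
    side_effects_safe=True,
    paging_appropriate=True,
)
print(failed_checks(obs))  # ['degraded_messaging_honest']
```

There is deliberately no field for "users saw zero impact". The record asks only whether the design behaved as intended.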
4) Worked example: inject malformed tool arguments
Suppose an order-assistant agent depends on structured tool calls. Chaos experiment: For 15 minutes, 10% of model responses contain invalid tool arguments. What should happen?
invalid tool arguments injected
│
├── schema detector catches them
├── one retry with repaired prompt allowed
├── breaker may open if rate stays high
├── fallback to read-only FAQ mode
└── no write action executes
See the flow. Now write expected metrics.
- parse failure rate rises,
- write-tool executions do not rise,
- degraded-mode share rises modestly,
- complaint rate stays bounded,
- no duplicate order modifications occur.
If write actions still happen, your containment failed. If the UI claims normal operation, your stability kit messaging failed. If humans get paged instantly for low-risk cases, your senior doctor thresholds are noisy.
That is why drills matter.
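The flow in this worked example can be sketched as a small simulation. Everything named here is an assumption for illustration: the tool schema in `REQUIRED_ARGS`, the `Breaker` window and threshold, and the `handle` function. The breaker in this sketch never closes again, so it covers containment only, not recovery.

```python
import json
from collections import deque

# Hypothetical schema for an order-modification tool.
REQUIRED_ARGS = {"order_id": str, "quantity": int}


def valid_args(raw: str) -> bool:
    """Schema detector: the arguments must parse and match the expected types."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(args, dict) and all(
        isinstance(args.get(key), kind) for key, kind in REQUIRED_ARGS.items()
    )


class Breaker:
    """Opens when the failure rate over a full recent window reaches a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.3) -> None:
        self.recent: deque = deque(maxlen=window)
        self.threshold = threshold
        self.open = False

    def record(self, failed: bool) -> None:
        self.recent.append(failed)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold:
            self.open = True


def handle(attempts: list[str], breaker: Breaker, writes: list[str]) -> str:
    """attempts[0] is the first model output, attempts[1] the single allowed retry."""
    for raw in attempts[:2]:
        if breaker.open:
            break                   # contained: stop trying to write
        ok = valid_args(raw)
        breaker.record(not ok)
        if ok:
            writes.append(raw)      # stands in for the real write tool
            return "write_executed"
    return "faq_read_only"          # fallback: read-only FAQ mode


good = '{"order_id": "A17", "quantity": 2}'
bad = '{"order_id": "A17", "quantity": "two"'   # wrong type and unclosed brace

writes: list[str] = []
breaker = Breaker(window=4, threshold=0.5)

first = handle([good], breaker, writes)        # healthy request
second = handle([bad, good], breaker, writes)  # the one retry repaired it
third = handle([bad, bad], breaker, writes)    # retry failed too, breaker opens
fourth = handle([good], breaker, writes)       # breaker open, stays read-only
print(first, second, third, fourth, len(writes))
```

The key property to check is the last number: invalid arguments never reach `writes`, and once the breaker is open even a valid request is served from read-only mode.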
5) Pre-mortems turn vague fear into test cases
Now what is a pre-mortem? Before the incident, the team imagines, "Six months later this feature caused a serious failure. What likely happened?" Then convert answers into drills.
pre-mortem prompts
- what silent failure would embarrass us most?
- what dependency is single-point critical?
- what fallback have we never exercised?
- what action would be hardest to reverse?
The simple version: A good pre-mortem generates concrete chaos cases. For example, a team says, "Our biggest fear is a support bot issuing wrong refunds during provider instability." That becomes three tests.
- Model returns malformed refund arguments.
- Refund tool acknowledgment is delayed.
- Breaker opens on the main model during peak traffic.
Now you are not merely anxious. You are rehearsing.
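One way to keep the pre-mortem output from evaporating is to store each fear next to the drills it produced. The `ChaosCase` record below is a hypothetical shape, and the `must_hold` invariants are illustrative assumptions a team would replace with its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosCase:
    fear: str            # the pre-mortem answer this drill came from
    injected_fault: str  # what the drill breaks on purpose
    must_hold: str       # assumed invariant the drill should confirm


fear = "support bot issues wrong refunds during provider instability"

cases = [
    ChaosCase(fear, "model returns malformed refund arguments",
              "no refund executes from arguments that failed validation"),
    ChaosCase(fear, "refund tool acknowledgment is delayed",
              "a retry never issues the same refund twice"),
    ChaosCase(fear, "breaker opens on the main model during peak traffic",
              "refund requests are routed to fallback or human review"),
]

for case in cases:
    print(f"inject: {case.injected_fault} -> expect: {case.must_hold}")
```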
6) Run chaos progressively, not recklessly
Do not start with production-wide failure injection. Begin small.
See the discipline. Also define stop conditions.
- complaint spike above threshold,
- degraded mode above threshold,
- unexpected write action,
- on-call load exceeds safe level.
The sealed ward applies to chaos too. You must contain the experiment itself.
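Stop conditions work best when they are checked by code during the drill, not by someone watching a dashboard. A minimal sketch follows; the metric names, the `StopLimits` record, and every threshold value are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopLimits:
    """Hypothetical abort thresholds for one drill."""
    max_complaint_rate: float
    max_degraded_share: float
    max_pages_per_hour: int


def stop_reasons(metrics: dict, limits: StopLimits) -> list[str]:
    """Return every stop condition that is currently violated."""
    reasons = []
    if metrics["complaint_rate"] > limits.max_complaint_rate:
        reasons.append("complaint spike above threshold")
    if metrics["degraded_share"] > limits.max_degraded_share:
        reasons.append("degraded mode above threshold")
    if metrics["unexpected_writes"] > 0:
        reasons.append("unexpected write action")
    if metrics["pages_per_hour"] > limits.max_pages_per_hour:
        reasons.append("on-call load exceeds safe level")
    return reasons


limits = StopLimits(max_complaint_rate=0.02, max_degraded_share=0.25,
                    max_pages_per_hour=3)

calm = {"complaint_rate": 0.01, "degraded_share": 0.12,
        "unexpected_writes": 0, "pages_per_hour": 1}
alarming = {**calm, "degraded_share": 0.40, "unexpected_writes": 1}

print(stop_reasons(calm, limits))      # []
print(stop_reasons(alarming, limits))  # ['degraded mode above threshold', 'unexpected write action']
```

Any non-empty result should disable the fault injection immediately. Note that an unexpected write has no tolerance at all: a single one ends the drill.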
7) Measure recovery quality, not only failure handling
Many teams stop measuring after fallback triggers. But recovery matters. Did the breaker close properly? Did queues drain? Did degraded mode exit cleanly? Did humans get duplicate tickets after recovery?
recovery scorecard
┌──────────────────────────────┐
│ detection delay = 8 s │
│ breaker open time = 40 s │
│ degraded exit clean = yes │
│ duplicate actions = 0 │
│ human queue backlog = normal │
└──────────────────────────────┘
A messy recovery is still a reliability defect. The vitals monitor should stay active after the fault injection ends.
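Most of the scorecard can be derived from a timestamped event log. The sketch below assumes a hypothetical log of `(seconds, event_name)` pairs with made-up event names, chosen so the numbers match the scorecard above. Human queue backlog is left out because it would come from a separate ticketing source.

```python
Event = tuple[float, str]

# Hypothetical drill timeline, in seconds since the fault was injected.
events: list[Event] = [
    (0.0, "fault_injected"),
    (8.0, "detector_fired"),
    (10.0, "breaker_opened"),
    (12.0, "degraded_mode_entered"),
    (30.0, "degraded_response_served"),
    (50.0, "breaker_closed"),
    (55.0, "degraded_mode_exited"),
]


def first_time(log: list[Event], name: str) -> float:
    """Timestamp of the first event with this name (raises StopIteration if absent)."""
    return next(t for t, n in log if n == name)


def scorecard(log: list[Event], duplicate_actions: int) -> dict:
    exit_time = first_time(log, "degraded_mode_exited")
    # A clean exit means no degraded responses leak out after the mode has ended.
    late_degraded = [t for t, n in log
                     if n == "degraded_response_served" and t > exit_time]
    return {
        "detection_delay_s": first_time(log, "detector_fired")
        - first_time(log, "fault_injected"),
        "breaker_open_time_s": first_time(log, "breaker_closed")
        - first_time(log, "breaker_opened"),
        "degraded_exit_clean": not late_degraded,
        "duplicate_actions": duplicate_actions,
    }


card = scorecard(events, duplicate_actions=0)
for key, value in card.items():
    print(f"{key} = {value}")
```

Because the log keeps recording after the fault ends, a degraded response served at, say, 60 seconds would flip `degraded_exit_clean` to false. That is the reason to keep monitoring through recovery.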
8) Chaos testing should include humans too
Now a senior point. The sociotechnical system matters. Not only code. Can on-call find the runbook? Can support explain degraded mode correctly? Can reviewers clear escalations fast enough?
For example, run a drill where the breaker on the main model is forced open. Watch:
- whether on-call recognizes the pattern,
- whether the incident channel opens quickly,
- whether customer-facing copy matches reality,
- whether human review queues absorb the spillover.
That is real preparedness. Not dashboard theatre.
Where this lives in the wild
- GitHub Copilot, reliability engineering team: can simulate model endpoint 503 bursts and verify that per-model breakers, fallback completions, and telemetry all behave as designed.
- Intercom Fin, support automation owner: can inject malformed citation or tool-call outputs in pre-production to confirm that read-only degraded responses appear instead of unsafe account actions.
- Perplexity, answer quality engineer: can run drills where retrieval returns empty or stale sources to validate that synthesis does not pretend grounded certainty.
- Klarna assistant, payments risk operator: can test delayed acknowledgments on payment-adjacent tools to ensure duplicate actions do not occur under retry pressure.
- Cursor, agent platform lead: can exercise repository-tool failures and verification outages so autonomous edit mode can degrade safely into suggestion-only mode.
Pause and recall
- Why is chaos testing more than random failure injection?
- What AI-specific chaos cases matter beyond normal 5xx testing?
- How does a pre-mortem help build useful chaos experiments?
- Why should recovery quality be measured after the injected fault ends?
Interview Q&A
Q: Why should chaos tests validate fallback and degraded messaging, not just detector firing?
A: Detection alone does not protect users; the user-facing recovery path determines real reliability.
Common wrong answer to avoid: "Because detectors are usually already correct." Detectors are necessary but not sufficient.
Q: Why do AI systems need semantic chaos cases in addition to infrastructure chaos cases?
A: Many harmful failures in AI systems are fluent, structured, and semantically wrong rather than crashing outright.
Common wrong answer to avoid: "Because semantic faults are easier to generate." Ease is irrelevant; representativeness matters.
Q: Why roll out chaos tests progressively instead of jumping directly to broad production injection?
A: Progressive rollout limits blast radius while still testing realistic behavior under controlled exposure.
Common wrong answer to avoid: "Because production chaos is never acceptable." It can be acceptable with safeguards.
Q: Why include humans in AI chaos drills?
A: Reliability depends on on-call, support, and review operations as much as on automated code paths.
Common wrong answer to avoid: "Because humans are the fallback of last resort anyway." They are active components of the system design.
Apply now (5 min)
Exercise. Write one chaos test for your AI workflow. Name the injected fault, expected detector, expected containment, expected user-visible behavior, and stop condition.
Sketch from memory. Draw the chaos loop from planned fault to recovery scorecard. Label where the vitals monitor, sealed ward, stability kit, and senior doctor should each appear.
Bridge. Drills prepare us, but real incidents still happen. Next we learn how to run the actual emergency room during an AI outage. → 12-incident-response.md