
11. Chaos testing for AI — practice the bad day before the bad day arrives

~14 min read. Reliability claims are weak until the team has deliberately broken the system and watched it recover.

Built on the ELI5 in 00-eli5.md. The vitals monitor, sealed ward, and stability kit should all be tested before a real emergency, because chaos drills tell us whether the hospital behaves well under stress.


1) First picture: inject failure on purpose, but with guardrails

Chaos testing is not random destruction. It is controlled fault injection. You break one thing intentionally, then observe whether detection, containment, fallback, and recovery behave correctly.

planned fault → detect? → contain? → fallback or degrade? → recover and learn?

The simple version: If the answer to any box is no, your reliability story is incomplete. The goal is not chaos for entertainment. The goal is confidence in the playbook.
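A minimal sketch of "controlled" injection, assuming a hypothetical `call_model` function and a `FaultInjector` helper invented here for illustration. It shows two guardrails: the fault hits only a set fraction of calls, and a kill switch stops the drill at once.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FaultInjector:
    """Injects one named fault into a fraction of calls, with a kill switch."""
    fault_name: str
    rate: float                      # fraction of calls to break, 0.0 to 1.0
    enabled: bool = True             # kill switch: set to False to stop the drill
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    injected: int = 0                # calls that received the planned fault
    passed: int = 0                  # calls left untouched

    def should_inject(self) -> bool:
        if not self.enabled:
            self.passed += 1
            return False
        hit = self.rng.random() < self.rate
        if hit:
            self.injected += 1
        else:
            self.passed += 1
        return hit


def call_model(prompt: str, injector: FaultInjector) -> str:
    """Hypothetical model call; raises the planned fault on injected calls."""
    if injector.should_inject():
        raise TimeoutError(f"chaos: {injector.fault_name}")
    return f"answer to: {prompt}"
```

The counters matter as much as the fault. They let you compare what was injected against what the detectors actually reported.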

2) AI systems need AI-specific chaos cases

Traditional chaos tests often focus on hosts, networks, and service crashes. Those still matter. But AI systems have extra failure shapes.

  • malformed JSON from the model,
  • false refusal bursts,
  • retrieval returning empty results,
  • stale context injection,
  • tool returning semantically wrong but valid data,
  • confidence estimator drift.
    AI chaos menu
    ┌──────────────────────────────────┐
    │ add 3 s latency to model calls   │
    │ return invalid tool arguments    │
    │ drop 20% of citations            │
    │ force fallback path for 10 min   │
    │ replay duplicate tool events     │
    └──────────────────────────────────┘
    

If you test only 500 errors, you miss silent AI-specific failure modes. The triage desk must be tested on semantic trouble too.
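A few of these faults can be sketched as small transformers that corrupt an otherwise healthy response. The function names and the `citations` field are assumptions made for illustration, not part of any real framework.

```python
import json
import random


def truncate_json(raw: str) -> str:
    """Simulate malformed JSON by cutting the payload in half."""
    return raw[: len(raw) // 2]


def drop_citations(answer: dict, fraction: float, rng: random.Random) -> dict:
    """Simulate lost grounding by removing a share of citations."""
    kept = [c for c in answer["citations"] if rng.random() >= fraction]
    return {**answer, "citations": kept}


def empty_retrieval(_query: str) -> list:
    """Simulate a retriever that returns nothing without raising an error."""
    return []


def is_valid_json(raw: str) -> bool:
    """Detector side: does the payload parse at all?"""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False
```

Note that none of these raises an exception at injection time. That is the point: the failure stays silent unless a detector looks at the content.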

3) Start with containment questions, not heroics

Now what should a chaos drill prove? At minimum:

  • did detectors fire,
  • did breakers open when needed,
  • did fallback activate,
  • did degraded messaging stay honest,
  • did side effects remain safe,
  • did humans get paged only when appropriate.
    chaos checklist
    ┌──────────────────────────────┐
    │ detector fired?              │
    │ blast radius contained?      │
    │ user got safe response?      │
    │ runbook link used?           │
    │ recovery state validated?    │
    └──────────────────────────────┘
    

The simple version: Notice what is missing. We are not asking, "Did the system avoid all visible impact?" Sometimes small controlled impact is acceptable. We ask whether the design behaved as intended.
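One way to keep the checklist honest is to record it as data after every drill. This is a sketch only; the `DrillResult` class and its field names are hypothetical and simply mirror the checklist box.

```python
from dataclasses import dataclass, fields


@dataclass
class DrillResult:
    detector_fired: bool
    blast_radius_contained: bool
    user_got_safe_response: bool
    runbook_link_used: bool
    recovery_state_validated: bool

    def failures(self) -> list[str]:
        """Names of every checklist item that came back false."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self) -> bool:
        """The drill passes only when every checklist item is true."""
        return not self.failures()
```

A drill where the detector fired but nobody opened the runbook then shows up as a named failure, not as a vague "mostly fine".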

4) Worked example: inject malformed tool arguments

Suppose an order-assistant agent depends on structured tool calls. Chaos experiment: For 15 minutes, 10% of model responses contain invalid tool arguments. What should happen?

invalid tool arguments injected
        ├── schema detector catches them
        ├── one retry with repaired prompt allowed
        ├── breaker may open if rate stays high
        ├── fallback to read-only FAQ mode
        └── no write action executes

See the flow. Now write expected metrics.

  • parse failure rate rises,
  • write-tool executions do not rise,
  • degraded-mode share rises modestly,
  • complaint rate stays bounded,
  • no duplicate order modifications occur.

If write actions still happen, your containment failed. If the UI claims normal operation, your stability kit messaging failed. If humans get paged instantly for low-risk cases, your senior doctor thresholds are noisy.

That is why drills matter.
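The expected flow of this drill can be sketched in a few lines. Everything here is assumed for illustration: the `REQUIRED_ARGS` schema, the consecutive-failure `Breaker`, and the mode names. Breaker recovery (closing again) is left out to keep the sketch short.

```python
from dataclasses import dataclass

REQUIRED_ARGS = {"order_id": str, "quantity": int}  # hypothetical tool schema


def valid_args(args: dict) -> bool:
    """Schema detector: exact keys and expected types."""
    return (set(args) == set(REQUIRED_ARGS)
            and all(isinstance(args[k], t) for k, t in REQUIRED_ARGS.items()))


@dataclass
class Breaker:
    threshold: int          # consecutive failed requests before opening
    failures: int = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


def handle(attempts: list[dict], breaker: Breaker, writes: list[dict]) -> str:
    """attempts[0] is the first model output, attempts[1] the repaired retry."""
    if breaker.open:
        return "read_only_faq"           # fallback, no model call, no write
    for args in attempts[:2]:            # original call plus at most one retry
        if valid_args(args):
            breaker.record(True)
            writes.append(args)          # the only place a write can happen
            return "write_executed"
    breaker.record(False)
    return "read_only_faq"
```

The design choice to notice: the write sits behind the schema check, so invalid arguments can raise the parse failure rate but cannot raise write-tool executions.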

5) Pre-mortems turn vague fear into test cases

Now what is a pre-mortem? Before the incident, the team imagines, "Six months later this feature caused a serious failure. What likely happened?" Then convert answers into drills.

pre-mortem prompts
- what silent failure would embarrass us most?
- what dependency is single-point critical?
- what fallback have we never exercised?
- what action would be hardest to reverse?

The simple version: A good pre-mortem generates concrete chaos cases. For example, a team says, "Our biggest fear is a support bot issuing wrong refunds during provider instability." That becomes three tests.

  1. Model returns malformed refund arguments.
  2. Refund tool acknowledgment is delayed.
  3. Breaker opens on the main model during peak traffic.

Now you are not merely anxious. You are rehearsing.

6) Run chaos progressively, not recklessly

Do not start with production-wide failure injection. Begin small.

chaos rollout ladder
sandbox → staging → shadow traffic → tiny prod slice → wider prod slice

See the discipline. Also define stop conditions.

  • complaint spike above threshold,
  • degraded mode above threshold,
  • unexpected write action,
  • on-call load exceeds safe level.

The sealed ward applies to chaos too. You must contain the experiment itself.
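The ladder and the stop conditions can be combined into one gate. This is a sketch under assumptions: the threshold fields, the `Snapshot` metrics, and the `"abort"` result are hypothetical names, and real limits depend on your traffic.

```python
from dataclasses import dataclass

LADDER = ["sandbox", "staging", "shadow traffic",
          "tiny prod slice", "wider prod slice"]


@dataclass
class StopConditions:
    max_complaint_rate: float
    max_degraded_share: float
    max_pages_per_hour: int


@dataclass
class Snapshot:
    complaint_rate: float
    degraded_share: float
    unexpected_writes: int
    pages_per_hour: int


def stop_reasons(s: Snapshot, limits: StopConditions) -> list[str]:
    """Return every stop condition that has tripped; empty means keep going."""
    reasons = []
    if s.complaint_rate > limits.max_complaint_rate:
        reasons.append("complaint spike")
    if s.degraded_share > limits.max_degraded_share:
        reasons.append("degraded mode too high")
    if s.unexpected_writes > 0:          # any unexpected write stops the drill
        reasons.append("unexpected write action")
    if s.pages_per_hour > limits.max_pages_per_hour:
        reasons.append("on-call overload")
    return reasons


def next_stage(current: str, s: Snapshot, limits: StopConditions) -> str:
    """Climb one rung only if nothing tripped; otherwise abort the experiment."""
    if stop_reasons(s, limits):
        return "abort"
    i = LADDER.index(current)
    return LADDER[min(i + 1, len(LADDER) - 1)]
```

Unexpected writes have no threshold here. One is already too many.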

7) Measure recovery quality, not only failure handling

Many teams stop measuring after fallback triggers. But recovery matters. Did the breaker close properly? Did queues drain? Did degraded mode exit cleanly? Did humans get duplicate tickets after recovery?

recovery scorecard
┌──────────────────────────────┐
│ detection delay = 8 s        │
│ breaker open time = 40 s     │
│ degraded exit clean = yes    │
│ duplicate actions = 0        │
│ human queue backlog = normal │
└──────────────────────────────┘

A messy recovery is still a reliability defect. The vitals monitor should stay active after the fault injection ends.
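A scorecard like the one above can be derived from a handful of drill timestamps and counters. The `DrillTimeline` fields below are hypothetical; a real system would pull them from its own telemetry.

```python
from dataclasses import dataclass


@dataclass
class DrillTimeline:
    fault_start: float          # all times in seconds since drill start
    first_alert: float
    breaker_opened: float
    breaker_closed: float
    degraded_exit_errors: int   # errors seen while leaving degraded mode
    duplicate_actions: int      # side effects executed more than once
    review_queue_depth: int     # human review items after recovery
    normal_queue_depth: int     # what "normal" looks like for that queue


def scorecard(t: DrillTimeline) -> dict:
    """Summarize recovery quality, not only whether the fallback triggered."""
    return {
        "detection_delay_s": t.first_alert - t.fault_start,
        "breaker_open_time_s": t.breaker_closed - t.breaker_opened,
        "degraded_exit_clean": t.degraded_exit_errors == 0,
        "duplicate_actions": t.duplicate_actions,
        "human_queue_backlog_normal": t.review_queue_depth <= t.normal_queue_depth,
    }
```

Because the queue depth is sampled after the fault ends, the scorecard only works if monitoring keeps running past the end of the injection.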

8) Chaos testing should include humans too

Now a senior point. The sociotechnical system matters. Not only code. Can on-call find the runbook? Can support explain degraded mode correctly? Can reviewers clear escalations fast enough?

For example, run a drill where the breaker on the main model is forced open. Watch:

  • whether on-call recognizes the pattern,
  • whether the incident channel opens quickly,
  • whether customer-facing copy matches reality,
  • whether human review queues absorb the spillover.

That is real preparedness. Not dashboard theatre.

Where this lives in the wild

  • GitHub Copilot — reliability engineering team: can simulate model endpoint 503 bursts and verify that per-model breakers, fallback completions, and telemetry all behave as designed.
  • Intercom Fin — support automation owner: injects malformed citation or tool-call outputs in pre-production to confirm that read-only degraded responses appear instead of unsafe account actions.
  • Perplexity — answer quality engineer: runs drills where retrieval returns empty or stale sources to validate that synthesis does not pretend grounded certainty.
  • Klarna assistant — payments risk operator: tests delayed acknowledgments on payment-adjacent tools to ensure duplicate actions do not occur under retry pressure.
  • Cursor — agent platform lead: exercises repository-tool failures and verification outages so autonomous edit mode can degrade safely into suggestion-only mode.

Pause and recall

  • Why is chaos testing more than random failure injection?
  • What AI-specific chaos cases matter beyond normal 5xx testing?
  • How does a pre-mortem help build useful chaos experiments?
  • Why should recovery quality be measured after the injected fault ends?

Interview Q&A

Q: Why should chaos tests validate fallback and degraded messaging, not just detector firing?
A: Detection alone does not protect users; the user-facing recovery path determines real reliability.
Common wrong answer to avoid: "Because detectors are usually already correct." Detectors are necessary but not sufficient.

Q: Why do AI systems need semantic chaos cases in addition to infrastructure chaos cases?
A: Many harmful failures in AI systems are fluent, structured, and semantically wrong rather than crashing outright.
Common wrong answer to avoid: "Because semantic faults are easier to generate." Ease is irrelevant; representativeness matters.

Q: Why roll out chaos tests progressively instead of jumping directly to broad production injection?
A: Progressive rollout limits blast radius while still testing realistic behavior under controlled exposure.
Common wrong answer to avoid: "Because production chaos is never acceptable." It can be acceptable with safeguards.

Q: Why include humans in AI chaos drills?
A: Reliability depends on on-call, support, and review operations as much as on automated code paths.
Common wrong answer to avoid: "Because humans are the fallback of last resort anyway." They are active components of the system design.


Apply now (5 min)

Exercise. Write one chaos test for your AI workflow. Name the injected fault, expected detector, expected containment, expected user-visible behavior, and stop condition.

Sketch from memory. Draw the chaos loop from planned fault to recovery scorecard. Label where the vitals monitor, sealed ward, stability kit, and senior doctor should each appear.


Bridge. Drills prepare us, but real incidents still happen. Next we learn how to run the actual emergency room during an AI outage. → 12-incident-response.md