10. Incident drills and readiness — practice the firebreak before the fire

~11 min read. A runbook you have never practiced is a theory. A kill switch nobody has pulled is a rumor.

Continues from 09-postmortem-evals-and-locks.md. The after-action lock changes the system after one fire. Drills make sure the runbook wall works before the next one.

The previous chapter converted incidents into locks so the same class is harder to repeat. That solves recurrence after a real failure, but it does not prove the team can execute under pressure. This chapter moves from learning after incidents to practicing before them.


1) The wall — incident readiness is not a document

Many teams have an incident response doc. Fewer teams know whether the prompt rollback actually works, whether the model route can be changed safely, whether support knows what language to use with customers, or whether the on-call can find retrieval candidates at 3 AM.

Readiness is demonstrated behavior:

can page -> can snapshot -> can contain -> can communicate -> can restore

If the team has not practiced those moves, the first real customer incident becomes the test.
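One way to make the chain concrete is to record when each move was last exercised in a drill and report the first move that is unproven. The following is a minimal sketch, not a prescribed tool: the names `READINESS_CHAIN` and `first_unproven_step` are hypothetical, and the 90-day freshness window is an assumption chosen only for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# The five moves from the readiness chain, in order.
READINESS_CHAIN = ["page", "snapshot", "contain", "communicate", "restore"]


def first_unproven_step(
    last_drilled: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed freshness window
) -> Optional[str]:
    """Return the first move that was never drilled or was drilled too long ago."""
    for step in READINESS_CHAIN:
        drilled_on = last_drilled.get(step)
        if drilled_on is None or today - drilled_on > max_age:
            return step
    return None  # every move was exercised recently


# Example: paging and snapshotting were drilled a month ago, containment never was.
history = {
    "page": date(2024, 5, 1),
    "snapshot": date(2024, 5, 1),
}
gap = first_unproven_step(history, today=date(2024, 6, 1))
print(gap)
```

In this example the first gap is containment, so later moves are not even evaluated. That mirrors the chain: a team that cannot contain has not yet earned the right to claim it can communicate or restore.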


2) The AI incident drill menu

Run drills that match real AI failure shapes.

Drill                   | Scenario                                | Skill tested
Prompt rollback         | new prompt gives unsafe advice          | prompt versioning and restore
Retrieval stale-policy  | old policy outranks current             | index trace and freshness gate
Tool runaway            | agent loops on failed action            | tool kill switch and budget cap
Privacy leak            | cross-tenant candidate appears          | access-control snapshot and disable
Judge drift             | eval judge approves bad answer          | calibration and human review
Cost spike              | token use doubles after rollout         | budget alert and route change
Model route regression  | fallback model gives weaker answers     | model routing rollback
Soft failure            | plausible wrong answer in critical slice | human review and incident declaration

Each drill should produce a measured result: time to declare, time to snapshot, time to firebreak, time to status update, and quality of postmortem lock.
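The timing measurements can be captured with very little machinery. The sketch below assumes a hypothetical `DrillTimeline` record whose clock starts when the fault is injected; the field names are illustrative. Postmortem lock quality is a reviewer judgment rather than a timestamp, so it is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DrillTimeline:
    """Timestamps captured during one drill; None means the step never happened."""

    injected_at: datetime
    declared_at: Optional[datetime] = None
    snapshot_at: Optional[datetime] = None
    firebreak_at: Optional[datetime] = None
    status_update_at: Optional[datetime] = None

    def minutes_to(self, event: Optional[datetime]) -> Optional[float]:
        """Minutes from fault injection to the event, or None if it was skipped."""
        if event is None:
            return None
        return (event - self.injected_at).total_seconds() / 60

    def report(self) -> dict[str, Optional[float]]:
        """Collect the four timing results named in the text."""
        return {
            "time_to_declare": self.minutes_to(self.declared_at),
            "time_to_snapshot": self.minutes_to(self.snapshot_at),
            "time_to_firebreak": self.minutes_to(self.firebreak_at),
            "time_to_status_update": self.minutes_to(self.status_update_at),
        }


# Example drill: the status update was never written.
timeline = DrillTimeline(
    injected_at=datetime(2024, 6, 1, 10, 0),
    declared_at=datetime(2024, 6, 1, 10, 6),
    snapshot_at=datetime(2024, 6, 1, 10, 11),
    firebreak_at=datetime(2024, 6, 1, 10, 19),
)
report = timeline.report()
print(report)
```

A skipped step shows up as `None` rather than as a large number, which keeps "slow" and "never happened" from being averaged together.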


3) Worked example — refund drill

A quarterly refund drill could simulate this:

setup:
  stale enterprise refund policy chunk is inserted into staging index
  reranker is disabled for test traffic
  model fallback route is forced

expected:
  eval slice detects stale-policy answer
  on-call declares sev-2 drill
  snapshot captures retrieval candidates and feature flags
  firebreak disables refund recommendation mode
  status board update written within 15 minutes
  after-action lock proposed

The point is not to trick the team. The point is to prove the runbook wall is usable when the pattern is realistic.
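The expected signals above can be graded mechanically once the drill is over. This is a rough sketch under stated assumptions: the signal names in `EXPECTED` are hypothetical labels for the six expectations, and the only deadline encoded is the 15-minute status update, since that is the only one the drill description gives.

```python
from typing import Optional

# Expected signals from the refund drill, each with an optional deadline in minutes.
# Only the status update has a deadline in the drill description.
EXPECTED: dict[str, Optional[float]] = {
    "eval_slice_detects_stale_policy": None,
    "sev2_drill_declared": None,
    "snapshot_has_candidates_and_flags": None,
    "refund_recommendation_disabled": None,
    "status_update_written": 15.0,
    "after_action_lock_proposed": None,
}


def grade_drill(observed: dict[str, float]) -> dict[str, str]:
    """Map each expected signal to 'ok', 'late', or 'missing'.

    `observed` maps a signal name to the minutes after drill start at which
    it occurred. Signals absent from `observed` never happened.
    """
    result = {}
    for signal, deadline in EXPECTED.items():
        if signal not in observed:
            result[signal] = "missing"
        elif deadline is not None and observed[signal] > deadline:
            result[signal] = "late"
        else:
            result[signal] = "ok"
    return result


# Example run: the status update took 22 minutes and no lock was proposed.
grades = grade_drill(
    {
        "eval_slice_detects_stale_policy": 4.0,
        "sev2_drill_declared": 7.0,
        "snapshot_has_candidates_and_flags": 12.0,
        "refund_recommendation_disabled": 18.0,
        "status_update_written": 22.0,
    }
)
for signal, grade in grades.items():
    print(f"{signal}: {grade}")
```

Each "late" or "missing" entry is a readiness gap to track, not a verdict on the people in the room.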


4) Why not drill only infrastructure outages

The tempting alternative is to reuse ordinary SRE drills: service down, database failover, queue backlog.

Those are still useful, but they do not test the hardest AI incident muscles: semantic failure detection, prompt rollback, retrieval trace reconstruction, model route control, guardrail interpretation, human review, and soft customer communication.

AI incident drills should include green-dashboard failures. If every drill starts with a red CPU graph, the team will underreact to plausible wrong answers.
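A green-dashboard failure is easy to state in code: every infrastructure check passes while the answers on a critical slice are wrong. The sketch below is illustrative only. The `Signals` fields and all thresholds are assumptions, not values from any real monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    error_rate: float                # fraction of requests returning errors
    p95_latency_ms: float            # 95th percentile latency in milliseconds
    critical_slice_pass_rate: float  # fraction of critical eval cases answered correctly


def infra_is_green(s: Signals) -> bool:
    """Classic outage view: errors and latency only (thresholds are assumed)."""
    return s.error_rate < 0.01 and s.p95_latency_ms < 2000


def should_declare(s: Signals, min_pass_rate: float = 0.95) -> bool:
    """Declare on an infrastructure problem or on a semantic one."""
    return (not infra_is_green(s)) or s.critical_slice_pass_rate < min_pass_rate


# Drill scenario: the service is healthy, but the refund slice is answering badly.
drill = Signals(error_rate=0.001, p95_latency_ms=800, critical_slice_pass_rate=0.82)
print(infra_is_green(drill))
print(should_declare(drill))
```

Here the infrastructure view reports healthy while the combined trigger still fires. A drill built on this shape checks whether the team acts on the second signal when the first one looks fine.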


5) Production signals — readiness score

The first metric is drill pass rate by capability, not by "did the meeting happen."

Track:

  • time to assign a fire captain
  • time to snapshot package
  • time to firebreak
  • time to first status update
  • percentage of high-risk flows with tested rollback
  • percentage of serious incidents with completed locks

The misleading metric is number of runbooks. Ten runbooks nobody has used are weaker than one drilled runbook with real levers.
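Two of these metrics reduce to simple ratios. The sketch below assumes drill outcomes are logged as (capability, passed) pairs and that each high-risk flow carries a flag saying whether its rollback was exercised in a drill. The function names and the example flows are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_capability(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per capability; `results` holds one (capability, passed) pair per attempt."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        if passed:
            passes[capability] += 1
    return {cap: passes[cap] / totals[cap] for cap in totals}


def rollback_coverage(flows: dict[str, bool]) -> float:
    """Fraction of high-risk flows whose rollback has been exercised in a drill."""
    if not flows:
        return 0.0
    return sum(flows.values()) / len(flows)


# Example drill log and flow inventory (illustrative data).
rates = pass_rate_by_capability(
    [
        ("snapshot", True),
        ("snapshot", True),
        ("firebreak", True),
        ("firebreak", False),
        ("status_update", False),
    ]
)
coverage = rollback_coverage(
    {
        "refund_recommendation": True,
        "account_deletion": False,
        "tool_execution": True,
        "policy_answers": False,
    }
)
print(rates)
print(coverage)
```

Reporting per capability keeps a strong snapshot record from hiding a status update that has never once been written on time.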


6) Boundary — drills should not become theater

A drill is valuable when it exercises real tools, real dashboards, realistic traces, and real decision rights.

It becomes theater when everyone knows the answer, no risky lever is touched, no timing is measured, and no readiness gap is tracked.

The pathology is compliance rehearsal. The team performs the ritual and learns nothing about whether the system can be contained.
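The theater symptoms can be turned into a short review checklist that runs after each drill. This is a sketch with hypothetical field names, and it treats any single symptom as worth flagging, which is an assumption rather than a rule from this chapter.

```python
from dataclasses import dataclass


@dataclass
class DrillRecord:
    answer_known_in_advance: bool  # participants already knew the root cause
    risky_lever_touched: bool      # a real rollback, kill switch, or route change was used
    timings_measured: bool         # declare, snapshot, firebreak, update times recorded
    gaps_tracked: bool             # readiness gaps were filed as follow-up work


def theater_symptoms(record: DrillRecord) -> list[str]:
    """List the theater symptoms present in one drill record."""
    symptoms = []
    if record.answer_known_in_advance:
        symptoms.append("everyone knew the answer")
    if not record.risky_lever_touched:
        symptoms.append("no risky lever was touched")
    if not record.timings_measured:
        symptoms.append("no timing was measured")
    if not record.gaps_tracked:
        symptoms.append("no readiness gap was tracked")
    return symptoms


# Example: a tabletop walkthrough where timings were recorded but nothing real was pulled.
tabletop = DrillRecord(
    answer_known_in_advance=True,
    risky_lever_touched=False,
    timings_measured=True,
    gaps_tracked=False,
)
for symptom in theater_symptoms(tabletop):
    print(symptom)
```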


Recall checkpoint

  • Why is a runbook not enough?
  • Which AI-specific drill tests soft failure response?
  • What readiness metrics matter?
  • How do drills become theater?

Interview Q&A

Q: How do you know an AI team is incident-ready?
A: They have tested paging, snapshotting, firebreaks, status updates, restore criteria, and postmortem locks for realistic AI failure modes.

Common wrong answer to avoid: "They have a runbook." A runbook that has never been executed is unproven.

Q: What drills are unique to AI systems?
A: Prompt rollback, stale retrieval, tool runaway, cross-tenant candidate leak, judge drift, cost spike, model route regression, and plausible wrong-answer incidents.

Common wrong answer to avoid: "Use the same drills as backend services." AI adds semantic and evidence-chain failures.

Q: What should every drill measure?
A: Time to owner, snapshot, containment, and first update, plus whether restore criteria were defined and how strong the after-action lock is.

Common wrong answer to avoid: "Whether the team eventually solved it." Readiness is about controlled response, not heroic recovery.


Apply now (10 min)

Model the exercise. Design the refund stale-policy drill with setup, expected signals, firebreak, and success metrics.

Your turn. Write one drill for a tool runaway or privacy leak in your own AI system.

Reproduce from memory. Explain why an untested kill switch is only a rumor.


What you should remember

This chapter explained incident drills and readiness. The important idea is that incident response has to be practiced with AI-specific failure modes, not only documented.

Carry this diagnostic forward: every high-risk AI feature should have a tested firebreak, a reconstructable snapshot, and a drill that proves both.

Remember:

  • Readiness is demonstrated behavior.
  • Drills should test semantic failures, not only outages.
  • Measure time to owner, snapshot, firebreak, and update.
  • A kill switch is real only after someone has pulled it safely.

Bridge. Drills make the team stronger, but even mature incident response has limits. The final chapter names what AI incident response still cannot promise. → 11-honest-admission.md