10. Incident drills and readiness — practice the firebreak before the fire

~11 min read. A runbook you have never practiced is a theory. A kill switch nobody has pulled is a rumor.

Continues from 09-postmortem-evals-and-locks.md. The after-action lock changes the system after one fire. Drills make sure the runbook wall works before the next one.

The previous chapter converted incidents into locks so the same class is harder to repeat. That solves recurrence after a real failure, but it does not prove the team can execute under pressure. This chapter moves from learning after incidents to practicing before them.

1) The wall — incident readiness is not a document

Many teams have an incident response doc. Fewer teams know whether the prompt rollback actually works, whether the model route can be changed safely, whether support knows what to tell customers, or whether the on-call engineer can find the retrieval candidates at 3 AM.

Readiness is demonstrated behavior:

can page -> can snapshot -> can contain -> can communicate -> can restore

If the team has not practiced those moves, the first real customer incident becomes the test.
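One way to make the chain checkable is to record the earliest step a drill failed to complete. The following is a minimal sketch, not a prescribed tool: the step names mirror the chain above, and `first_broken_step` is a hypothetical helper introduced only for illustration.

```python
from typing import Optional, Set

# Step names mirror the readiness chain; order matters.
READINESS_CHAIN = ["page", "snapshot", "contain", "communicate", "restore"]


def first_broken_step(completed: Set[str]) -> Optional[str]:
    """Return the earliest chain step the team did not complete, or None."""
    for step in READINESS_CHAIN:
        if step not in completed:
            return step
    return None


# Hypothetical drill outcome: the team paged, snapshotted, and posted an
# update, but never actually contained the failure.
broken = first_broken_step({"page", "snapshot", "communicate"})
print(broken)
```

Reporting the earliest broken step keeps attention on the first move that failed, since later steps depend on it.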

Run drills that match real AI failure shapes.

Drill	Scenario	Skill tested
Prompt rollback	new prompt gives unsafe advice	prompt versioning and restore
Retrieval stale-policy	old policy outranks current	index trace and freshness gate
Tool runaway	agent loops on failed action	tool kill switch and budget cap
Privacy leak	cross-tenant candidate appears	access-control snapshot and disable
Judge drift	eval judge approves bad answer	calibration and human review
Cost spike	token use doubles after rollout	budget alert and route change
Model route regression	fallback model gives weaker answers	model routing rollback
Soft failure	plausible wrong answer in critical slice	human review and incident declaration

Each drill should produce a measured result: time to declare, time to snapshot, time to firebreak, time to status update, and quality of postmortem lock.
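The timed results can be captured as plain timestamps and reduced to minutes from drill start. This is a rough sketch under stated assumptions: `DrillTimeline` and its field names are hypothetical, the example date is arbitrary, and postmortem lock quality is left out because it is a human judgment rather than a timer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

MILESTONES = ("declared", "snapshot", "firebreak", "status_update")


@dataclass
class DrillTimeline:
    """Timestamps for one drill; a milestone stays None if never reached."""
    started: datetime
    declared: Optional[datetime] = None
    snapshot: Optional[datetime] = None
    firebreak: Optional[datetime] = None
    status_update: Optional[datetime] = None

    def minutes_to(self) -> Dict[str, Optional[float]]:
        """Minutes from drill start to each milestone, or None if missed."""
        out: Dict[str, Optional[float]] = {}
        for name in MILESTONES:
            ts = getattr(self, name)
            out[name] = None if ts is None else (ts - self.started).total_seconds() / 60
        return out


# Illustrative drill: the status update was never written.
t0 = datetime(2024, 1, 1, 10, 0)
timeline = DrillTimeline(
    started=t0,
    declared=t0 + timedelta(minutes=4),
    snapshot=t0 + timedelta(minutes=9),
    firebreak=t0 + timedelta(minutes=12),
)
result = timeline.minutes_to()
print(result)
```

Keeping a missed milestone as `None` rather than a large number makes the gap visible instead of hiding it inside an average.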

3) Worked example — refund drill

A quarterly refund drill could simulate this:

setup:
  stale enterprise refund policy chunk is inserted into staging index
  reranker is disabled for test traffic
  model fallback route is forced

expected:
  eval slice detects stale-policy answer
  on-call declares sev-2 drill
  snapshot captures retrieval candidates and feature flags
  firebreak disables refund recommendation mode
  status board update written within 15 minutes
  after-action lock proposed

The point is not to trick the team. The point is to prove the runbook wall is usable when the pattern is realistic.
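The expected list above can double as a grading checklist. Here is a minimal sketch, assuming observers record each step as minutes after drill start; the step identifiers and the `grade_refund_drill` helper are hypothetical names chosen for this example, while the 15-minute deadline comes from the drill description.

```python
from typing import Dict, List

# Hypothetical identifiers for the expected outcomes of the refund drill.
EXPECTED_STEPS = [
    "eval_slice_detects_stale_policy",
    "sev2_drill_declared",
    "snapshot_captured",
    "refund_recommendation_disabled",
    "status_update_written",
    "after_action_lock_proposed",
]
STATUS_UPDATE_DEADLINE_MIN = 15


def grade_refund_drill(observed: Dict[str, float]) -> List[str]:
    """Return readiness gaps; `observed` maps step -> minutes after start."""
    gaps = [f"missing: {step}" for step in EXPECTED_STEPS if step not in observed]
    update_at = observed.get("status_update_written")
    if update_at is not None and update_at > STATUS_UPDATE_DEADLINE_MIN:
        gaps.append(f"late: status_update_written at {update_at} min")
    return gaps


# Illustrative run: the update was late and no lock was proposed.
gaps = grade_refund_drill({
    "eval_slice_detects_stale_policy": 3,
    "sev2_drill_declared": 6,
    "snapshot_captured": 10,
    "refund_recommendation_disabled": 14,
    "status_update_written": 22,
})
print(gaps)
```

An empty list means the drill met every expectation; anything else is a gap to track, not a verdict on the people involved.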

4) Why not drill only infrastructure outages

The tempting alternative is to reuse ordinary SRE drills: service down, database failover, queue backlog.

Those are still useful, but they do not test the hardest AI incident muscles: semantic failure detection, prompt rollback, retrieval trace reconstruction, model route control, guardrail interpretation, human review, and customer communication about soft failures.

AI incident drills should include green-dashboard failures. If every drill starts with a red CPU graph, the team will underreact to plausible wrong answers.
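A green-dashboard drill only works if the declaration trigger can fire while infrastructure looks healthy. A minimal sketch of such a trigger, assuming a hypothetical `should_declare` helper and an illustrative 0.95 pass-rate floor on a critical eval slice (neither the name nor the threshold comes from this chapter):

```python
def should_declare(infra_healthy: bool,
                   slice_pass_rate: float,
                   slice_floor: float = 0.95) -> bool:
    """Declare when infra is unhealthy OR a critical eval slice drops.

    The second condition is what lets a plausible-wrong-answer incident
    be declared even though every infrastructure graph is green.
    """
    return (not infra_healthy) or slice_pass_rate < slice_floor


# Green dashboards, but the refund slice is failing: still an incident.
green_but_wrong = should_declare(infra_healthy=True, slice_pass_rate=0.80)
# Green dashboards and a healthy slice: no declaration.
all_clear = should_declare(infra_healthy=True, slice_pass_rate=0.99)
print(green_but_wrong, all_clear)
```

The design choice is the `or`: semantic quality is an independent trigger, not something that only matters after an infrastructure alarm.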

5) Production signals — readiness score

The first metric is drill pass rate by capability, not by "did the meeting happen."

Track:

time to assign a fire captain (incident owner)
time to snapshot package
time to firebreak
time to first status update
percentage of high-risk flows with tested rollback
percentage of serious incidents with completed locks

The misleading metric is number of runbooks. Ten runbooks nobody has used are weaker than one drilled runbook with real levers.
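Two of these signals reduce to simple ratios over drill records. The sketch below is illustrative only: the function names, the capability labels, and the flow names are assumptions, and a real team would read these records from wherever drill results are stored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_capability(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each capability to the fraction of drills in which it passed."""
    totals = defaultdict(lambda: [0, 0])  # capability -> [passed, attempted]
    for capability, passed in results:
        totals[capability][1] += 1
        if passed:
            totals[capability][0] += 1
    return {cap: p / n for cap, (p, n) in totals.items()}


def rollback_coverage(flows: Dict[str, bool]) -> float:
    """Percentage of high-risk flows whose rollback has been tested."""
    if not flows:
        return 0.0
    return 100.0 * sum(flows.values()) / len(flows)


# Hypothetical drill log: (capability, passed).
rates = pass_rate_by_capability([
    ("snapshot", True),
    ("snapshot", False),
    ("firebreak", True),
    ("status_update", False),
])
# Hypothetical high-risk flows and whether their rollback was exercised.
coverage = rollback_coverage({
    "refund": True,
    "account_deletion": False,
    "billing_dispute": True,
    "plan_change": False,
})
print(rates, coverage)
```

Scoring per capability shows which move is weak; a single overall pass rate would hide a team that snapshots well but never manages a status update.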

6) Boundary — drills should not become theater

A drill is valuable when it exercises real tools, real dashboards, realistic traces, and real decision rights.

It becomes theater when everyone knows the answer, no risky lever is touched, no timing is measured, and no readiness gap is tracked.

The pathology is compliance rehearsal. The team performs the ritual and learns nothing about whether the system can be contained.
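The four theater signs can be checked against each drill record after the fact. A minimal sketch, assuming a hypothetical `DrillRecord` with one boolean per sign; the field names are illustrative and not taken from any existing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrillRecord:
    scenario_announced: bool  # did participants know the answer in advance?
    real_lever_pulled: bool   # was a real rollback or kill switch exercised?
    timings_recorded: bool    # were time-to-X measurements captured?
    gaps_tracked: bool        # were readiness gaps filed and followed up?


def theater_signs(record: DrillRecord) -> List[str]:
    """Return the theater signs present in one drill record."""
    signs = []
    if record.scenario_announced:
        signs.append("everyone knew the answer")
    if not record.real_lever_pulled:
        signs.append("no risky lever touched")
    if not record.timings_recorded:
        signs.append("no timing measured")
    if not record.gaps_tracked:
        signs.append("no readiness gap tracked")
    return signs


# Illustrative record: an unannounced drill that measured timings but
# never touched a real lever.
signs = theater_signs(DrillRecord(
    scenario_announced=False,
    real_lever_pulled=False,
    timings_recorded=True,
    gaps_tracked=True,
))
print(signs)
```

A drill with any sign present is not worthless, but the sign should be recorded next to the result so the pass is not over-read.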

Recall checkpoint

Why is a runbook not enough?
Which AI-specific drill tests soft failure response?
What readiness metrics matter?
How do drills become theater?

Interview Q&A

Q: How do you know an AI team is incident-ready?
A: They have tested paging, snapshotting, firebreaks, status updates, restore criteria, and postmortem locks for realistic AI failure modes.

Common wrong answer to avoid: "They have a runbook." A runbook that has never been executed is unproven.

Q: What drills are unique to AI systems?
A: Prompt rollback, stale retrieval, tool runaway, cross-tenant candidate leak, judge drift, cost spike, model route regression, and plausible wrong-answer incidents.

Common wrong answer to avoid: "Use the same drills as backend services." AI adds semantic and evidence-chain failures.

Q: What should every drill measure?
A: Time to owner, snapshot, containment, first update, restore criteria, and after-action lock quality.

Common wrong answer to avoid: "Whether the team eventually solved it." Readiness is about controlled response, not heroic recovery.

Apply now (10 min)

Model the exercise. Design the refund stale-policy drill with setup, expected signals, firebreak, and success metrics.

Your turn. Write one drill for a tool runaway or privacy leak in your own AI system.

Reproduce from memory. Explain why an untested kill switch is only a rumor.

What you should remember

This chapter explained incident drills and readiness. The important idea is that incident response has to be practiced with AI-specific failure modes, not only documented.

Carry this diagnostic forward: every high-risk AI feature should have a tested firebreak, a reconstructable snapshot, and a drill that proves both.

Remember:

Readiness is demonstrated behavior.
Drills should test semantic failures, not only outages.
Measure time to owner, snapshot, firebreak, and update.
A kill switch is real only after someone has pulled it safely.

Bridge. Drills make the team stronger, but even mature incident response has limits. The final chapter names what AI incident response still cannot promise. → 11-honest-admission.md