10. Incident drills and readiness: practice the firebreak before the fire
~11 min read. A runbook you have never practiced is a theory. A kill switch nobody has pulled is a rumor.
Continues from 09-postmortem-evals-and-locks.md. The after-action lock changes the system after one fire. Drills make sure the runbook wall works before the next one.
The previous chapter converted incidents into locks so the same class is harder to repeat. That solves recurrence after a real failure, but it does not prove the team can execute under pressure. This chapter moves from learning after incidents to practicing before them.
1) The wall: incident readiness is not a document
Many teams have an incident response doc. Fewer teams know whether the prompt rollback actually works, whether the model route can be changed safely, whether support knows the customer language, or whether the on-call can find retrieval candidates at 3 AM.
Readiness is demonstrated behavior, not documented intent.
If the team has not practiced those moves, the first real customer incident becomes the test.
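One way to turn "does the prompt rollback actually work" into demonstrated behavior is to pull the lever in a drill and check what is served afterwards. The sketch below is a minimal illustration, not a real prompt store: `PromptRegistry` and `rollback_drill` are hypothetical names, and a production registry would sit behind a config service rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory prompt store used only to illustrate the drill."""
    versions: list[str] = field(default_factory=list)
    active: int = -1  # index of the prompt currently served

    def publish(self, text: str) -> int:
        self.versions.append(text)
        self.active = len(self.versions) - 1
        return self.active

    def rollback(self) -> int:
        if self.active <= 0:
            raise RuntimeError("no earlier prompt version to restore")
        self.active -= 1
        return self.active

    def current(self) -> str:
        return self.versions[self.active]


def rollback_drill(registry: PromptRegistry) -> bool:
    """Pull the lever, then confirm the previous prompt is the one now served."""
    before = registry.active
    registry.rollback()
    return (
        registry.active == before - 1
        and registry.current() == registry.versions[before - 1]
    )
```

The useful part is the failure path: a registry with only one version raises instead of silently doing nothing, which is the kind of gap a drill should surface before a real incident does.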
2) The AI incident drill menu
Run drills that match real AI failure shapes.
| Drill | Scenario | Skill tested |
|---|---|---|
| Prompt rollback | new prompt gives unsafe advice | prompt versioning and restore |
| Retrieval stale-policy | old policy outranks current | index trace and freshness gate |
| Tool runaway | agent loops on failed action | tool kill switch and budget cap |
| Privacy leak | cross-tenant candidate appears | access-control snapshot and disable |
| Judge drift | eval judge approves bad answer | calibration and human review |
| Cost spike | token use doubles after rollout | budget alert and route change |
| Model route regression | fallback model gives weaker answers | model routing rollback |
| Soft failure | plausible wrong answer in critical slice | human review and incident declaration |
Each drill should produce a measured result: time to declare, time to snapshot, time to firebreak, time to status update, and quality of postmortem lock.
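The timing measurements above can be derived from timestamps the team already records during a drill. This is a minimal sketch under assumptions: the milestone names in `MILESTONES` are hypothetical and should be mapped to whatever your incident tool logs. Postmortem lock quality is not a timing, so it is left to human review.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical milestone names; adapt to your own incident tooling.
MILESTONES = ["declared", "snapshot", "firebreak", "status_update"]


def drill_timings(
    started_at: datetime, events: dict[str, datetime]
) -> dict[str, Optional[timedelta]]:
    """Elapsed time from drill start to each milestone, or None if it never happened."""
    return {
        name: (events[name] - started_at) if name in events else None
        for name in MILESTONES
    }
```

A `None` entry is itself a result: it records that a step was skipped, which is easy to lose if only the completed steps are timed.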
3) Worked example: refund drill
A quarterly refund drill could simulate this:
setup:
- stale enterprise refund policy chunk is inserted into the staging index
- reranker is disabled for test traffic
- model fallback route is forced
expected:
- eval slice detects the stale-policy answer
- on-call declares a sev-2 drill
- snapshot captures retrieval candidates and feature flags
- firebreak disables refund recommendation mode
- status board update is written within 15 minutes
- after-action lock is proposed
The point is not to trick the team. The point is to prove the runbook wall is usable when the pattern is realistic.
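The refund drill can also be written down as data, so that the debrief compares observed behavior against the expected list instead of relying on memory. The sketch below assumes hypothetical step identifiers and a hypothetical `missed_expectations` helper; only the steps and the 15-minute status deadline come from the drill described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrillSpec:
    name: str
    setup: tuple[str, ...]
    expected: tuple[str, ...]
    status_update_deadline_min: int


# Step identifiers are illustrative labels, not names from a real tool.
REFUND_DRILL = DrillSpec(
    name="refund stale-policy",
    setup=(
        "stale_policy_chunk_in_staging_index",
        "reranker_disabled_for_test_traffic",
        "fallback_model_route_forced",
    ),
    expected=(
        "eval_slice_detected_stale_answer",
        "sev2_drill_declared",
        "snapshot_captured",
        "refund_recommendation_disabled",
        "status_update_written",
        "lock_proposed",
    ),
    status_update_deadline_min=15,
)


def missed_expectations(
    spec: DrillSpec, observed: set[str], status_update_min: Optional[float]
) -> list[str]:
    """List expected steps that did not happen, plus a late status update if any."""
    gaps = [step for step in spec.expected if step not in observed]
    if (
        "status_update_written" in observed
        and status_update_min is not None
        and status_update_min > spec.status_update_deadline_min
    ):
        gaps.append("status_update_late")
    return gaps
```

An empty list means the drill met its stated expectations. Anything else is a readiness gap to track.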
4) Why not drill only infrastructure outages
The tempting alternative is to reuse ordinary SRE drills: service down, database failover, queue backlog.
Those are still useful, but they do not test the hardest AI incident muscles: semantic failure detection, prompt rollback, retrieval trace reconstruction, model route control, guardrail interpretation, human review, and customer communication about soft failures.
AI incident drills should include green-dashboard failures. If every drill starts with a red CPU graph, the team will underreact to plausible wrong answers.
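A green-dashboard failure can be made explicit by classifying infrastructure health and semantic health separately. The following is a minimal sketch: `HealthSnapshot`, `classify`, and the 0.95 floor are assumptions for illustration, and a real threshold would be set per critical slice.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    infra_green: bool          # CPU, latency, and error rate all within limits
    slice_pass_rate: float     # fraction of eval cases passing in a critical slice
    slice_floor: float = 0.95  # hypothetical threshold; tune per slice


def classify(s: HealthSnapshot) -> str:
    """Separate semantic health from infrastructure health."""
    semantic_ok = s.slice_pass_rate >= s.slice_floor
    if s.infra_green and semantic_ok:
        return "healthy"
    if s.infra_green:
        return "soft_failure"  # dashboards are green, answers are wrong
    if semantic_ok:
        return "infra_outage"
    return "infra_and_semantic_failure"
```

A drill that injects a `soft_failure` state tests whether anyone declares an incident when no infrastructure alert fires.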
5) Production signals: readiness score
The first metric is drill pass rate by capability, not by "did the meeting happen."
Track:
- time to fire captain assigned (a named incident owner)
- time to snapshot package
- time to firebreak
- time to first status update
- percentage of high-risk flows with tested rollback
- percentage of serious incidents with completed locks
The misleading metric is number of runbooks. Ten runbooks nobody has used are weaker than one drilled runbook with real levers.
6) Boundary: drills should not become theater
A drill is valuable when it exercises real tools, real dashboards, realistic traces, and real decision rights.
It becomes theater when everyone knows the answer, no risky lever is touched, no timing is measured, and no readiness gap is tracked.
The pathology is compliance rehearsal. The team performs the ritual and learns nothing about whether the system can be contained.
Recall checkpoint
- Why is a runbook not enough?
- Which AI-specific drill tests soft failure response?
- What readiness metrics matter?
- How do drills become theater?
Interview Q&A
Q: How do you know an AI team is incident-ready? A: They have tested paging, snapshotting, firebreaks, status updates, restore criteria, and postmortem locks for realistic AI failure modes.
Common wrong answer to avoid: "They have a runbook." A runbook that has never been executed is unproven.
Q: What drills are unique to AI systems? A: Prompt rollback, stale retrieval, tool runaway, cross-tenant candidate leak, judge drift, cost spike, model route regression, and plausible wrong-answer incidents.
Common wrong answer to avoid: "Use the same drills as backend services." AI adds semantic and evidence-chain failures.
Q: What should every drill measure? A: Time to owner, snapshot, containment, first update, restore criteria, and after-action lock quality.
Common wrong answer to avoid: "Whether the team eventually solved it." Readiness is about controlled response, not heroic recovery.
Apply now (10 min)
Model the exercise. Design the refund stale-policy drill with setup, expected signals, firebreak, and success metrics.
Your turn. Write one drill for a tool runaway or privacy leak in your own AI system.
Reproduce from memory. Explain why an untested kill switch is only a rumor.
What you should remember
This chapter explained incident drills and readiness. The important idea is that incident response has to be practiced with AI-specific failure modes, not only documented.
Carry this diagnostic forward: every high-risk AI feature should have a tested firebreak, a reconstructable snapshot, and a drill that proves both.
Remember:
- Readiness is demonstrated behavior.
- Drills should test semantic failures, not only outages.
- Measure time to owner, snapshot, firebreak, and update.
- A kill switch is real only after someone has pulled it safely.
Bridge. Drills make the team stronger, but even mature incident response has limits. The final chapter names what AI incident response still cannot promise. → 11-honest-admission.md