
10. Drills and game days

~9 min read. Apparatus rusts between incidents. The drill plane is the gym that keeps the apparatus fit — scheduled exercises, named scenarios, observed outcomes, and a readiness score that feeds back into apparatus design.

Continues from 09-postmortem-capture.md. This chapter develops the drill plane. Recurring concepts in bold: drill calendar, scenario library, readiness score, chaos injection, dry-run vs. live drill, drill postmortem.

The previous chapter ensured incidents teach the apparatus. This chapter ensures the apparatus stays exercised between incidents.


What an AI drill is

An AI drill is a planned exercise that runs a scenario through the apparatus — page reaches on-call, runbook is followed, escalation occurs as needed — with observed outcomes recorded and a readiness score computed.

Three drill shapes:

  • Tabletop. A scenario is walked through verbally; the on-call narrates what they would do; the facilitator probes for gaps. Lowest cost; lowest realism.
  • Dry-run. The scenario is executed against a staging or canary environment; mitigations are real but the user impact is simulated. Medium cost; high realism for the apparatus's behaviour.
  • Live drill (game day). The scenario is injected into the production apparatus (with appropriate safety) — synthetic alerts, simulated provider outages, controlled cost spikes. Highest cost; highest realism.

A mature drill calendar uses all three: tabletops monthly per surface, dry-runs quarterly, live drills semi-annually.
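The cadence above can be sketched as a small calendar generator. This is a minimal illustration rather than a prescribed tool: `CADENCE_MONTHS`, `ScheduledDrill`, `drill_calendar`, and the surface names are all hypothetical, and the rotation of dry-runs and live drills across surfaces is an assumption the chapter does not specify.

```python
from dataclasses import dataclass

# Months between drills of each shape, taken from the cadence described above.
CADENCE_MONTHS = {"tabletop": 1, "dry-run": 3, "live": 6}


@dataclass(frozen=True)
class ScheduledDrill:
    month: int    # 1..12 within the planning year
    shape: str    # "tabletop", "dry-run", or "live"
    surface: str  # which AI surface the drill exercises


def drill_calendar(surfaces: list[str]) -> list[ScheduledDrill]:
    """Return one planning year of drills for the given surfaces."""
    drills = []
    for month in range(1, 13):
        for shape, every in CADENCE_MONTHS.items():
            if month % every != 0:
                continue
            if shape == "tabletop":
                # Tabletops run monthly for every surface.
                drills.extend(ScheduledDrill(month, shape, s) for s in surfaces)
            else:
                # Assumption: dry-runs and live drills rotate across surfaces.
                slot = month // every - 1
                surface = surfaces[slot % len(surfaces)]
                drills.append(ScheduledDrill(month, shape, surface))
    return drills


# Hypothetical surfaces, for illustration only.
calendar = drill_calendar(["support-chat", "fraud-triage"])
print(len(calendar))  # 24 tabletops + 4 dry-runs + 2 live drills = 30
```

Generating the calendar from a cadence table keeps the schedule reviewable: changing a cadence is a one-line change that shows up in the diff.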


The scenario library

A drill is only as good as its scenarios. The library should cover:

  • One scenario per failure family (quality, drift, cost, safety, long-tail).
  • One scenario per runbook in the catalogue.
  • One scenario for each new postmortem (drilled within a month of the incident closure).
  • One adversarial scenario per quarter (an unusual or unexpected shape, to test improvisation).

The library is versioned; each scenario has a description, expected apparatus behaviour, observation checklist, and a debrief template.

A useful scenario template:

- Title: e.g., "Provider refusal posture tightens overnight"
- Trigger: how the drill starts (synthetic alert, injected behaviour, narrated event)
- Expected apparatus behaviour: which alert fires, which on-call is paged,
  which runbook is opened, which escalations occur
- Observation checklist: signals to record at each step
- Time budget: minutes from trigger to expected resolution
- Debrief questions: what worked, what failed, what is the apparatus update
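
If the library is kept in version control, the template above could be captured as a typed record so that incomplete scenarios are caught at review time. The sketch below is one possible shape, not a required schema: the class name `DrillScenario`, its field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DrillScenario:
    """One entry in the scenario library, mirroring the template fields."""
    title: str
    trigger: str
    expected_behaviour: list[str]
    observation_checklist: list[str]
    time_budget_minutes: int
    debrief_questions: list[str] = field(default_factory=lambda: [
        "What worked?",
        "What failed?",
        "What is the apparatus update?",
    ])

    def missing_fields(self) -> list[str]:
        """Names of template fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only; a real scenario would come from the team's library.
scenario = DrillScenario(
    title="Provider refusal posture tightens overnight",
    trigger="Synthetic drift signal injected at the gateway",
    expected_behaviour=[
        "drift alert fires",
        "on-call is paged and acknowledges",
        "drift runbook is opened",
    ],
    observation_checklist=[
        "time the alert fired",
        "time the page was acknowledged",
        "runbook step where the on-call hesitated, if any",
    ],
    time_budget_minutes=45,
)

draft = DrillScenario("Cost spike", "Narrated event", ["cost alert fires"], [], 30)
print(draft.missing_fields())  # ['observation_checklist']
```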

The readiness score

The drill plane's output is a readiness score per apparatus instance. A useful score has 4-6 dimensions:

Dimension | What it measures
--- | ---
Alert recall | Did the expected alert fire?
Alert precision | Did the alert payload have what the on-call needed?
Rotation reach | Was the page acknowledged within SLO?
Runbook fitness | Did the runbook lead to correct action?
Escalation latency | Did each escalation hop respond within SLO?
Apparatus update | What change did the drill motivate?

The score is computed per drill; the per-apparatus rolling score is the apparatus's health metric.
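
The chapter does not fix how the dimensions are combined, so the following is only one plausible sketch. It assumes each dimension is scored between 0.0 and 1.0, that the per-drill score is the unweighted mean of the five numeric dimensions (the apparatus update is qualitative and is left out), and that the rolling score is the mean of the most recent drills. The names `SCORED_DIMENSIONS`, `drill_score`, and `rolling_score`, the window of four drills, and the example values are hypothetical.

```python
from statistics import mean

# The five numeric dimensions; "apparatus update" is recorded as text instead.
SCORED_DIMENSIONS = (
    "alert_recall",
    "alert_precision",
    "rotation_reach",
    "runbook_fitness",
    "escalation_latency",
)


def drill_score(dimensions: dict[str, float]) -> float:
    """Unweighted mean of the scored dimensions for a single drill."""
    missing = [d for d in SCORED_DIMENSIONS if d not in dimensions]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return mean(dimensions[d] for d in SCORED_DIMENSIONS)


def rolling_score(drill_scores: list[float], window: int = 4) -> float:
    """Mean of the last `window` drill scores for one apparatus instance."""
    if not drill_scores:
        raise ValueError("no drills recorded")
    return mean(drill_scores[-window:])


# Made-up scores for one drill.
example = {
    "alert_recall": 1.0,
    "alert_precision": 0.5,
    "rotation_reach": 1.0,
    "runbook_fitness": 0.5,
    "escalation_latency": 1.0,
}
print(round(drill_score(example), 2))
print(round(rolling_score([0.6, 0.7, 0.8, 0.9, 1.0]), 2))
```

A team that cares more about one dimension (for example, runbook fitness) could swap the mean for a weighted sum; the important property is that the same formula is used for every drill so the rolling score is comparable over time.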


A worked example — the provider refusal drift drill

The Hyderabad fintech runs a quarterly drill on the provider drift runbook. The scenario:

  • Trigger. At 14:00 Tuesday, the drill facilitator injects a synthetic drift signal at the gateway: refusal rate for the primary model rises from 1.2% to 8.4% over 20 minutes.
  • Expected behaviour. The provider-drift watch alert should fire by 14:25; the on-call should acknowledge by 14:30; the runbook should be opened; the on-call should pin the prior model version by 14:40; the verification should confirm refusal rate returns to baseline by 14:45.
  • What happened. Alert fired at 14:22 (good). On-call acknowledged at 14:24 (good). Runbook opened; on-call followed steps 1-3 cleanly. At step 4 (decide mitigation), the on-call paused — the runbook listed three options but the on-call was unsure which to pick. Escalated to the lead at 14:38 (later than ideal). Lead clarified the priority order; mitigation executed at 14:43. Verification at 14:48.

Readiness score.

  • Alert recall: 1.0 (fired).
  • Alert precision: 0.9 (payload was nearly complete; one minor field was missing).
  • Rotation reach: 1.0 (within SLO).
  • Runbook fitness: 0.7 (step 4 was ambiguous; the on-call needed escalation).
  • Escalation latency: 0.9 (lead responded within SLO; the escalation would have been unnecessary had the runbook been clearer).
  • Apparatus update: runbook step 4 to be rewritten with an explicit priority order.

Drill postmortem follow-ups. Runbook step 4 update; alert payload's missing field added; scenario added to the next quarter's drill set.

The drill cost roughly 4 person-hours; the apparatus update prevents a real incident from running 5-10 minutes longer than necessary at step 4. The return is favourable.
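
The expected and observed times in this drill can be compared mechanically, which is useful when the observation checklist is recorded as timestamps. The sketch below uses the times from the worked example; the helper names `minutes_late` and `late_steps` are hypothetical, and it assumes all times fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M"

# (step, expected deadline, observed time), taken from the worked example.
TIMELINE = [
    ("alert fired", "14:25", "14:22"),
    ("page acknowledged", "14:30", "14:24"),
    ("mitigation executed", "14:40", "14:43"),
    ("verification complete", "14:45", "14:48"),
]


def minutes_late(deadline: str, observed: str) -> int:
    """Positive when a step finished after its deadline, negative when early."""
    delta = datetime.strptime(observed, FMT) - datetime.strptime(deadline, FMT)
    return int(delta.total_seconds() // 60)


def late_steps(timeline) -> dict[str, int]:
    """Steps that missed their deadline, with minutes overdue."""
    lateness = {step: minutes_late(d, o) for step, d, o in timeline}
    return {step: mins for step, mins in lateness.items() if mins > 0}


print(late_steps(TIMELINE))
# {'mitigation executed': 3, 'verification complete': 3}
```

Both late steps sit downstream of the pause at runbook step 4, which is consistent with the drill's conclusion that the runbook, not the alerting or paging path, is where the update belongs.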


Chaos injection

For live drills, chaos injection is the technical mechanism for producing the trigger:

  • Synthetic alerts. The alert engine has a "test mode" that fires a real alert without a real signal; payload includes a drill: true flag the on-call sees.
  • Injected behaviour. The gateway has hooks to inject drift signals, latency spikes, error responses, or refusal rate climbs for a controlled duration on a controlled traffic share.
  • Cost simulation. The cost telemetry pipeline accepts simulated spikes for a controlled tenant or feature.
  • Safety classifier triggers. Synthetic classifier hits on test inputs.

The injection mechanisms need to be safe — no real user impact, no real provider calls if the simulated behaviour is destructive, clear drill: true markers so the on-call can confirm it is a drill but still respond as if real.
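
As a rough sketch of what an injection hook might compute, the code below models a refusal-rate ramp over a controlled duration and traffic share, plus a synthetic alert payload carrying the drill marker. Everything here is hypothetical: `RefusalRamp`, `should_inject`, `synthetic_alert`, and the payload field names are illustrative, not any real gateway's or alert engine's API. The sketch also ignores organic refusals and assumes an injected refusal replaces the provider call rather than following it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RefusalRamp:
    baseline: float       # refusal rate before the drill, e.g. 0.012
    target: float         # refusal rate at the end of the ramp, e.g. 0.084
    ramp_minutes: float   # how long the climb takes
    traffic_share: float  # fraction of requests eligible for injection

    def rate_at(self, minutes_elapsed: float) -> float:
        """Linearly interpolated refusal rate, held at target after the ramp."""
        progress = min(max(minutes_elapsed / self.ramp_minutes, 0.0), 1.0)
        return self.baseline + (self.target - self.baseline) * progress

    def should_inject(self, minutes_elapsed: float, rng: random.Random) -> bool:
        """Decide whether this request gets a synthetic refusal."""
        if rng.random() >= self.traffic_share:
            return False  # request is outside the drill's traffic share
        return rng.random() < self.rate_at(minutes_elapsed)


def synthetic_alert(name: str, scenario_id: str) -> dict:
    """Alert payload for test mode; the drill flag is always set."""
    return {"alert": name, "scenario_id": scenario_id, "drill": True}


# Values from the worked example; the 10% traffic share is an assumption.
ramp = RefusalRamp(baseline=0.012, target=0.084, ramp_minutes=20, traffic_share=0.10)
print(round(ramp.rate_at(10), 3))  # halfway up the ramp: 0.048
print(synthetic_alert("provider-drift-watch", "refusal-drift-drill"))
```

Keeping the traffic share and the ramp duration as explicit parameters makes the blast radius of a live drill something that can be reviewed before the drill starts.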


Tabletop drills

For lower-frequency or sensitive scenarios, tabletop drills are useful:

  • Facilitator narrates the scenario verbally.
  • On-call narrates their response in real-time.
  • Facilitator probes: "what dashboard do you open?", "what is the rollback command?", "who do you escalate to next?"
  • Gaps surface immediately; no real injection is needed.

Tabletops scale: each on-call can run one a week for about 30 minutes, so over a quarter a rotation can touch every scenario in the library.


Drill participation as a metric

The drill calendar's value depends on participation. Useful metrics:

  • Per-engineer participation. Each engineer in a rotation should participate in at least 4 drills per year.
  • Per-scenario coverage. Each scenario in the library should be drilled at least annually.
  • Drill calendar adherence. Drills scheduled vs. drills run; cancellations are tracked.

A team with high drill calendar adherence and high per-engineer participation has a healthy drill plane. Low adherence means drills are being deprioritised; the apparatus will degrade.
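
The three metrics above can be computed from a simple log of scheduled drills. This is a minimal sketch assuming the log covers one year; `DrillRecord`, `participation_report`, the engineer names, and the scenario names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillRecord:
    scenario: str
    participants: tuple[str, ...]
    ran: bool  # False when the scheduled drill was cancelled


def participation_report(records, rotation, library, min_per_engineer=4):
    """Summarise participation, scenario coverage, and calendar adherence."""
    ran = [r for r in records if r.ran]
    per_engineer = Counter(p for r in ran for p in r.participants)
    drilled = {r.scenario for r in ran}
    return {
        # Engineers below the yearly participation target.
        "under_participating": sorted(
            e for e in rotation if per_engineer[e] < min_per_engineer
        ),
        # Library scenarios that were never drilled in the period.
        "undrilled_scenarios": sorted(set(library) - drilled),
        # Drills run divided by drills scheduled.
        "adherence": len(ran) / len(records) if records else 0.0,
    }


# Made-up log for a two-person rotation.
records = [
    DrillRecord("provider-drift", ("asha", "ben"), True),
    DrillRecord("cost-spike", ("asha",), True),
    DrillRecord("safety-hit", ("ben",), False),
    DrillRecord("provider-drift", ("asha",), True),
    DrillRecord("cost-spike", ("asha", "ben"), True),
]
report = participation_report(
    records,
    rotation=["asha", "ben"],
    library=["provider-drift", "cost-spike", "safety-hit"],
)
print(report)
```

In this made-up log, one cancelled drill is enough to leave a scenario undrilled and an engineer short of the target, which is why cancellations are worth tracking individually rather than as an aggregate.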


Operational signals

Healthy. Drill calendar runs on cadence. Every scenario in the library is drilled at least annually. Each engineer participates in 4+ drills per year. Drill postmortems produce apparatus updates.

First degrading metric. Drill cancellation rate climbing. The calendar exists but exercises are being skipped; the apparatus is rusting silently.

Misleading metric. Drill count. A team can run many low-quality drills that produce no apparatus updates. The metric to watch is apparatus updates per drill.

Expert graph. Readiness score over time per apparatus instance, drill participation per engineer, drill scenario coverage per quarter. The combination shows whether the drill plane is investing where the apparatus needs it.
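
To make the contrast between drill count and apparatus updates per drill concrete, here is a small sketch with made-up numbers for two teams, plus one possible test for a climbing cancellation rate. The function names and the definition of "climbing" (strictly increasing over the last three quarters) are assumptions for illustration.

```python
def updates_per_drill(updates_by_drill: list[int]) -> float:
    """Average number of apparatus updates produced per drill run."""
    if not updates_by_drill:
        return 0.0
    return sum(updates_by_drill) / len(updates_by_drill)


def cancellation_rates(quarters: list[tuple[int, int]]) -> list[float]:
    """Per-quarter cancellation rate from (scheduled, cancelled) pairs."""
    return [cancelled / scheduled for scheduled, cancelled in quarters]


def is_climbing(rates: list[float], window: int = 3) -> bool:
    """True when the rate rose in each of the last `window` quarters."""
    recent = rates[-window:]
    return len(recent) == window and all(a < b for a, b in zip(recent, recent[1:]))


# Team A runs many drills that rarely change anything; team B runs few that do.
team_a = [0] * 11 + [1]   # 12 drills, 1 update
team_b = [2, 1, 1, 1]     # 4 drills, 5 updates
print(len(team_a), round(updates_per_drill(team_a), 2))
print(len(team_b), round(updates_per_drill(team_b), 2))

rates = cancellation_rates([(6, 0), (6, 1), (6, 2), (6, 3)])
print(is_climbing(rates))  # True
```

By drill count team A looks healthier; by updates per drill team B does, which is the reading the chapter recommends.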


Boundary of applicability

Strong fit. Teams with non-trivial apparatus and ongoing AI surface area. Drills justify their cost through prevented real incidents.

Pathology. A team treating drills as training only. Drills validate apparatus design; treating them as people-training misses half their value. The output of every drill should be at least one apparatus update or a confirmed-still-correct apparatus surface.

Scale limit. Very large platforms run many drills; the meta-problem is drill quality across the portfolio. The pattern is a quarterly platform-team drill review, with intervention on rotations whose readiness scores have stalled.


Failure-prone assumption

The seductive wrong belief: drills are nice-to-have when there is time. They are not nice-to-have; they are how the apparatus stays alive. Without drills, the apparatus's documented design and its real behaviour diverge silently; the first real incident is the first validation.

The correct belief: drills are operational engineering with their own budget and own SLOs. Treating them as overhead is treating the apparatus's maintenance cost as optional.


Where this appears in production

  • A fintech runs monthly tabletops per surface, quarterly dry-runs, semi-annual live game days; readiness scores trend up.
  • A telecom AI schedules drills as recurring calendar events; cancellations are rare.
  • A consumer chatbot ran drills for one quarter then stopped; the next real incident exposed multiple stale runbooks.
  • A healthtech AI has tested its chaos injection at the gateway; live game days are safe.
  • A coding assistant has tabletops monthly; every new runbook is tabletop-tested within a month of authoring.
  • A retail AI has a scenario library with 18 scenarios; rotates through them across the year.
  • A logistics AI does live game days quarterly; readiness score is reviewed in leadership.
  • A government AI treats drills as regulatory artefacts; they happen but produce few apparatus updates.
  • A B2B SaaS runs drills with engineers from adjacent teams as observers; cross-team learning improves.
  • A travel platform has the drill calendar on a public team dashboard; drill investment is visible.
  • A payments AI has readiness score as a quarterly engineering metric.
  • A legal AI has the postmortem drill loop — every postmortem produces a drill scenario.
  • A staffing AI has a "drill of the month" with the apparatus's hardest scenario.
  • A search-rerank service has chaos injection at the model gateway; drills are realistic.
  • A document AI uses synthetic alerts for tabletop variations; on-calls practice payload reading.
  • A media AI has drill postmortems with the same template as real postmortems; consistency is preserved.
  • An ad-tech AI runs drills as part of new engineer onboarding into the rotation.
  • A real-estate AI has drill scenarios for cross-team coordination (upstream changes); the scenarios are joint.
  • A medical AI has drills that include the regulator notification step; the step is exercised regularly.
  • A small SaaS runs one drill a year; readiness is unknown until the next real incident.

Recall / checkpoint

  1. Name the three drill shapes and what each is best for.
  2. What is in the scenario library?
  3. What are the dimensions of the readiness score?
  4. What is chaos injection and what does it enable?
  5. How do tabletop drills scale across a team's rotation?
  6. What metric distinguishes a healthy drill plane from a busy-but-stalled one?
  7. Why is "drills are training" a partial framing?

Interview Q&A

Q1. A team runs drills but the readiness score is not improving. What is the apparatus failure?

The drills are happening but not producing apparatus updates, or the updates are not landing. The diagnosis is to check the drill postmortems: are follow-ups being captured? Are they being closed? If drills produce updates but the updates do not close, the failure is enforcement (same as postmortem follow-ups). If drills do not produce updates, the drills are running on autopilot without observation: the facilitator is not probing, the on-call is performing, and the gaps go unrecorded.

Common wrong answer to avoid: "run more drills". Quantity does not fix observation quality.

Q2. Walk through running a tabletop drill.

Facilitator picks a scenario from the library; gathers the on-call engineer(s); narrates the scenario (one or two sentences); asks the on-call what they would do. The on-call narrates: what alert fires, what dashboard they open, what runbook they follow, what command they execute. Facilitator probes for gaps (what if the rollback fails? what if the lead doesn't respond?) and notes them. Total time 30-60 minutes. Output: a debrief document with what worked, what didn't, and apparatus updates.

Common wrong answer to avoid: "tabletops are too informal to be useful". They are the cheapest exercise of the apparatus and find real gaps.

Q3. The team wants to skip live drills because they are "too expensive." How do you respond?

Live drills are the only exercises that validate apparatus behaviour under real conditions: real paging system, real dashboards, real escalation channels. Their cost is bounded (a few engineer-hours, plus the engineering work to enable safe chaos injection); their value is in the apparatus gaps they find that tabletops and dry-runs miss. Semi-annual live drills are typically sufficient. Skipping them entirely means the apparatus's behaviour at real load is untested.

Common wrong answer to avoid: "tabletops are good enough". Tabletops miss tooling gaps, integration gaps, and timing issues that live drills surface.

Q4. How do you choose what to put in the scenario library?

One scenario per failure family; one per runbook; one per recent postmortem; one adversarial per quarter. The library is a living artefact: it grows with each postmortem and prunes scenarios that have been validated and have low recurrence risk. The aim is coverage of the apparatus's documented surface, not exhaustive enumeration.

Common wrong answer to avoid: "everything we can imagine". The library bloats; engagement degrades.

Q5. What is the difference between a drill postmortem and an incident postmortem?

Structure is the same; content differs. Drill postmortems focus on apparatus behaviour (did the alert fire? was the runbook clear? did the escalation reach the right person?) rather than incident impact (which the drill simulated). Follow-ups are apparatus updates exclusively. The consistency in structure is intentional: the team treats drill outcomes with the same rigour as real outcomes.

Common wrong answer to avoid: "drills don't need postmortems". The postmortem is what makes the drill produce apparatus updates.

Q6. The team's drill cancellation rate is climbing. What is the structural fix?

Treat drills as engineering work with budgets and SLOs, not as ceremonies that happen "when there's time." Drill calendar events are not renegotiated each quarter; cancellations require lead approval. Engineering capacity is allocated explicitly for drill work. If the schedule pressure is real (a major incident, a launch crunch), the drill is rescheduled, not skipped. The discipline is to make drill cost visible and to ensure it is treated as load-bearing.

Common wrong answer to avoid: "lower the drill quality so they're cheaper". That degrades the apparatus's validation; it is better to do fewer high-quality drills than many low-quality ones.


Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (the provider refusal drift drill). Identify the readiness score components and the apparatus updates the drill produced.

Your turn. Pick one runbook from your team's catalogue. Design a tabletop scenario for it: trigger, expected behaviour, observation checklist, debrief questions. Run the tabletop with a colleague and capture the readiness score.

Reproduce from memory. Write the readiness score dimensions and what each measures from memory. The signal of internalisation is that the dimensions land in under two minutes; the test is that you can score a hypothetical drill quickly.


Operational memory

This chapter explained the drill plane: scheduled exercises with named scenarios, observed outcomes, and a readiness score that feeds back into apparatus design. The important idea is that drills are operational engineering with their own budget and SLOs; treating them as overhead is treating apparatus maintenance as optional.

You learned to run tabletop, dry-run, and live drills; to score readiness across multiple dimensions; and to integrate the drill plane with the postmortem plane through scenario libraries and drill postmortems. That solves the opening failure because the apparatus's design and its real behaviour stay aligned through exercise.

Carry this diagnostic forward: when a team has runbooks but has not drilled them, the runbooks are speculative until proven by exercise. The first real incident is the first validation; that is too late.

Remember:

  • Three shapes: tabletop, dry-run, live game day.
  • Scenario library covers families, runbooks, postmortems, and adversarial cases.
  • Readiness score has 4-6 dimensions; rolling score is apparatus health.
  • Drill postmortems use the same template as incident postmortems.
  • Drills are engineering work; cancellations are tracked and remediated.

Bridge. The apparatus exists, learns, and stays exercised. The humans who run it are the next concern — on-call load, fairness, and burnout. The next chapter is the rotation plane's other half: the people who sustain the apparatus. → 11-oncall-health-and-burnout.md