05. Runbook authoring¶
~10 min read. A trained on-call engineer with no runbook is debugging. With a runbook, they are executing. The runbook plane is where the apparatus encodes what a competent first response looks like — as a versioned, executable, freshness-gated document scoped to a single failure shape.
Continues from 04-rotation-and-ownership.md. This chapter develops the runbook plane. Recurring concepts in bold: runbook card, executable step, freshness gate, versioned with the system, one-failure-one-card, kill path, provenance trail.
The previous chapter put trained engineers on rotations. This chapter is what those engineers have in their hands when the page arrives.
What an AI runbook is¶
An AI runbook is a versioned, scoped document that walks an on-call engineer through the first ten minutes of a specific failure shape — with executable steps, named kill paths, and the context required to escalate confidently.
Three properties matter:
- Scoped to one failure. A 30-page document covering everything is not a runbook; it is a wiki. The runbook is a card scoped to a specific paging condition or failure shape.
- Executable. Each step is something the engineer can run, click, or paste — not "investigate the issue" or "look at the logs." Steps reference specific commands, dashboard links, and rollback hooks.
- Versioned with the system. The runbook lives in a versioned repo, evolves with the system it documents, and has a "last validated" date that gates its use.
A common pathology: runbooks living in a wiki, written once, never updated, referenced by a stale URL. The team has nominal runbooks; the apparatus has none.
The runbook card shape¶
A runbook card has five sections:
1. Identification
- Card title (the failure shape)
- Paging conditions that trigger this card
- Last validated date and author
- Severity guidance
2. First ten minutes
- Numbered executable steps
- Each step: action, expected output, what to do if output differs
3. Diagnosis
- Likely causes ranked by frequency
- The signals that distinguish each
- Decision tree to the next action
4. Mitigation
- Kill paths: rollback command, feature flag off, traffic shed, tenant quota
- Each kill path: command, blast radius, who can authorise
5. Escalation
- When to escalate (criteria, not just "if not resolved")
- To whom (named role from the rotation roster)
- What context to hand over
The five sections are the spec. A card missing any section is a card that will fail in the moment.
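The "missing any section" check is simple enough to automate. The following is a minimal sketch, assuming a card is loaded as a plain dictionary keyed by section name; the key names and the `missing_sections` helper are illustrative, not part of any existing tool.

```python
# Section names follow the five-section card shape; the dict layout is an assumption.
REQUIRED_SECTIONS = (
    "identification",
    "first_ten_minutes",
    "diagnosis",
    "mitigation",
    "escalation",
)


def missing_sections(card: dict) -> list[str]:
    """Return the required sections that are absent or empty in a card."""
    return [name for name in REQUIRED_SECTIONS if not card.get(name)]


# Hypothetical card: mitigation is present but empty, escalation is absent.
card = {
    "identification": {"title": "Quality regression", "last_validated": "2026-05-14"},
    "first_ten_minutes": ["Acknowledge the page", "Open the alert link"],
    "diagnosis": ["Recent prompt or model deploy"],
    "mitigation": [],
}

print(missing_sections(card))  # ['mitigation', 'escalation']
```

A check like this could run in code review so that an incomplete card is caught before it reaches the rotation.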
The "executable step" discipline¶
The most common runbook flaw is steps that are descriptive rather than executable:
- Descriptive (bad): "Check the gateway dashboard for refusal rate changes."
- Executable (good): "Open [dashboard link]. The refusal rate panel is in the top-right. Compare the last 30 minutes to the 7-day baseline (shown as the dashed line). A deviation of more than 2σ is significant. If significant, proceed to step 4; otherwise, proceed to step 3."
The executable step tells the engineer exactly what to do, what to observe, and what decision to make. Under stress, descriptive steps degenerate into improvisation; executable steps stay executable.
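The 2σ rule in the executable example can be made concrete. The sketch below assumes the baseline is available as a list of refusal-rate samples and that σ means the sample standard deviation of those samples; both are assumptions, since the dashboard's own baseline calculation is not specified here. The function names are hypothetical.

```python
from statistics import mean, stdev


def is_significant(baseline: list[float], recent: list[float], sigmas: float = 2.0) -> bool:
    """True if the recent mean is more than `sigmas` standard deviations from the baseline mean."""
    centre = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation; needs at least two samples
    if spread == 0:
        # A perfectly flat baseline: any movement at all counts as a deviation.
        return mean(recent) != centre
    return abs(mean(recent) - centre) > sigmas * spread


def next_step(baseline: list[float], recent: list[float]) -> int:
    """Mirror the branch in the example step: significant goes to step 4, otherwise step 3."""
    return 4 if is_significant(baseline, recent) else 3


# Illustrative numbers only: seven daily baseline samples and two recent readings.
baseline = [0.04, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05]
print(next_step(baseline, [0.09, 0.10]))  # 4
print(next_step(baseline, [0.05, 0.06]))  # 3
```

Encoding the threshold this way keeps the decision out of the engineer's head: the card states the rule, and the dashboard or a helper script applies it.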
A second discipline: every step has an "expected output" and a "what if it differs." The runbook does not assume happy path.
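One way to hold a card to this discipline is a small lint over its steps. This is a sketch under the assumption that each step is stored with three fields; the `Step` class and `incomplete_steps` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    expected_output: str = ""
    if_differs: str = ""


def incomplete_steps(steps: list[Step]) -> list[int]:
    """Return 1-based numbers of steps missing an action, an expected output, or a fallback."""
    return [
        number
        for number, step in enumerate(steps, start=1)
        if not (step.action.strip() and step.expected_output.strip() and step.if_differs.strip())
    ]


steps = [
    Step(
        action="Run: release-mgr rollback --deploy-id <id>",
        expected_output="Confirmation with the previous version restored",
        if_differs="Proceed to step 10",
    ),
    # Descriptive step: no expected output and no fallback, so the lint flags it.
    Step(action="Check the gateway dashboard for refusal rate changes"),
]

print(incomplete_steps(steps))  # [2]
```

The lint cannot tell whether an action is truly executable, but it does catch the cheaper failure: a step that assumes the happy path.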
A worked example — the quality regression card¶
The Bengaluru insurance SaaS team writes a runbook for the quality alert. The card:
Identification.
- Title: Quality regression — sliced eval score below baseline.
- Triggers: Quality alert P2 or P1 for any intent slice.
- Last validated: 2026-05-14 by Arjun Rao.
- Severity: P2 unless the slice is claim_initiation or affects an enterprise tenant (then P1).
First ten minutes.
1. Acknowledge the page within 5 minutes. Open the incident channel. Post the alert payload.
2. Open the alert link. Identify the slice, the baseline-vs-current score, and the deploy ID if present.
3. If a deploy ID is present and within the deploy-anchored window, the deploy is the prime suspect. Proceed to step 6.
4. If no deploy is in the window, open the affected calls' traces (sample IDs in the payload). Look for: provider drift, retrieval staleness, an upstream system change. Note any pattern.
5. Decision: deploy-caused regression → step 6. Provider drift → switch to the provider-drift card. Retrieval staleness → switch to the retrieval-staleness card. Pattern unclear → escalate (step 11).
6. Confirm the deploy and its scope: which prompts or models, which features, which tenants.
7. Execute the rollback command from the payload. Command: release-mgr rollback --deploy-id <id>. Expected output: confirmation with the previous version restored. If the command fails, proceed to step 10.
8. Verify the rollback: wait 5 minutes, recheck the quality alert. The alert should clear or move toward baseline. If not, proceed to step 9.
9. If the alert does not clear, the regression may be unrelated to the deploy. Escalate to the lead-on-call (rotation roster); switch to step 11.
10. If the rollback command fails, escalate to the release management on-call. Hand over: deploy ID, error from the command, current alert state.
11. Escalation handover: include the alert payload, the actions taken, the current state, and the suspected cause.
Diagnosis.
- Most likely (60%+): a recent prompt or model deploy.
- Next likely: provider drift; check the gateway dashboard.
- Less likely: retrieval index staleness; check the index freshness signal.
Mitigation.
- Rollback command (above).
- Feature flag for the affected intent: flags set <feature>.<intent>.enabled=false. Blast radius: that intent only.
- Tenant-level disable (last resort): tenant-cli disable <tenant>. Blast radius: that tenant's AI features.
- Authorisation: the on-call can execute rollback and feature flag; tenant disable requires the lead.
Escalation.
- After step 5 if the cause is unclear.
- After step 9 if the rollback did not resolve.
- After step 10 if the rollback failed.
- Target: lead-on-call (from rotation roster); for a failed rollback command (step 10), the release management on-call. Hand-over context: alert payload, steps taken, current state.
The card is short. It is also complete — an engineer paged at 03:00 can act through it in 10-15 minutes. The chapter's claim is that this is the minimum the apparatus owes the rotation.
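The severity rule and the branch in steps 3 to 5 are mechanical enough to express as code. The sketch below is an illustration only: the function names, the pattern labels, and the `billing_query` slice are hypothetical, and the routing covers just the branches the card names.

```python
from typing import Optional


def severity(slice_name: str, enterprise_tenant: bool) -> str:
    """P2 by default; P1 for the claim_initiation slice or an affected enterprise tenant."""
    return "P1" if slice_name == "claim_initiation" or enterprise_tenant else "P2"


# Pattern labels are assumptions; the card only names the failure shapes in prose.
ROUTES = {
    "provider_drift": "switch to the provider-drift card",
    "retrieval_staleness": "switch to the retrieval-staleness card",
}


def triage(deploy_in_window: bool, trace_pattern: Optional[str]) -> str:
    """Steps 3-5: a deploy in the window is the prime suspect; otherwise route on the trace pattern."""
    if deploy_in_window:
        return "step 6"
    # Anything the card does not name, including no pattern at all, escalates.
    return ROUTES.get(trace_pattern, "escalate (step 11)")


print(severity("billing_query", enterprise_tenant=False))  # P2
print(triage(deploy_in_window=False, trace_pattern="provider_drift"))
print(triage(deploy_in_window=False, trace_pattern=None))
```

Note that step 4 asks the engineer to look for an upstream system change, but step 5 gives it no branch of its own; in this sketch it falls through to escalation, which is a gap a drill would be expected to surface.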
The freshness gate¶
Runbooks rot. Systems evolve faster than documents; the rollback command changes, the dashboard URL moves, the escalation target leaves the team. A runbook that has not been validated in six months is a probable failure under load.
The freshness gate is a policy that prevents stale runbooks from being trusted blindly:
- Every runbook has a last validated date.
- The validation is either a drill (the runbook was used in a drill recently) or a manual walk-through (an engineer walked through every step on the current system).
- A runbook past its freshness budget (commonly 90-180 days) is flagged as stale. The on-call's tooling shows the warning when opening the card.
- The team is expected to refresh stale runbooks before the next real incident, not after.
The gate is enforced by tooling, not by good intentions. Wiki pages with no enforcement mechanism are stale runbooks waiting to happen.
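A tooling-enforced gate can be very small. The following sketch assumes the card's last validated date is available as a `date` and uses a 90-day budget, the lower end of the range above; the `is_stale` and `banner` helpers are hypothetical.

```python
from datetime import date


def is_stale(last_validated: date, today: date, budget_days: int = 90) -> bool:
    """A card is stale once more than `budget_days` have passed since validation."""
    return (today - last_validated).days > budget_days


def banner(title: str, last_validated: date, today: date, budget_days: int = 90) -> str:
    """Warning text to show when the on-call opens a stale card; empty string if the card is fresh."""
    if not is_stale(last_validated, today, budget_days):
        return ""
    age = (today - last_validated).days
    return f"STALE: '{title}' was last validated {age} days ago (budget: {budget_days})."


validated = date(2026, 5, 14)
print(is_stale(validated, date(2026, 8, 12)))  # False: exactly 90 days
print(banner("Quality regression", validated, date(2026, 8, 13)))
```

The banner warns rather than blocks: a stale card may still be the best thing available at 03:00, but the engineer should know to distrust its commands.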
Versioning with the system¶
Runbooks live in a versioned repo, ideally in the same repo as the feature they document. Pull requests touching the feature should touch the runbook when the runbook is affected. Code review can include runbook review.
The pattern looks like:
    features/account_assistant/
      prompts/
        intent_router.txt
        intent_router.tests
      runbooks/
        quality_regression.md
        cost_spike.md
        provider_drift.md
        safety_violation.md
      README.md
The benefits are concrete:
- The runbook is co-located with the system; the runbook author sees the system change.
- Pull requests can include runbook updates; reviewers can flag missing updates.
- Runbook history is searchable and bisectable like code.
- The freshness gate can use the runbook's git history to flag staleness.
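Co-location also makes a simple review aid possible: flag pull requests that change a feature without touching its runbooks. The sketch below assumes the `features/<name>/runbooks/` layout shown above and takes the list of changed paths as input (how that list is obtained from the review tool is left out). The function name is hypothetical.

```python
from pathlib import PurePosixPath


def features_missing_runbook_update(changed_paths: list[str]) -> list[str]:
    """Return feature names where non-runbook files changed but no runbook file did."""
    touched: dict[str, dict[str, bool]] = {}
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only consider files of the form features/<feature>/<something...>.
        if len(parts) < 3 or parts[0] != "features":
            continue
        state = touched.setdefault(parts[1], {"system": False, "runbook": False})
        if parts[2] == "runbooks":
            state["runbook"] = True
        else:
            state["system"] = True
    return sorted(name for name, s in touched.items() if s["system"] and not s["runbook"])


print(features_missing_runbook_update(
    ["features/account_assistant/prompts/intent_router.txt"]
))  # ['account_assistant']
```

The result is a prompt for the reviewer, not a hard failure: many feature changes leave the runbook unaffected, and only a human can judge which ones do not.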
What a runbook is not¶
A runbook is not a postmortem document. The postmortem captures what happened; the runbook captures what to do next time. The postmortem may produce a runbook update, but it is not itself the runbook.
A runbook is not a debugging guide. Debugging a novel failure is a different activity, with no playbook. The runbook covers the cases the team has seen and decided in advance how to handle.
A runbook is not a wiki article. The wiki may have background, design notes, and architectural diagrams. The runbook is the executable card. Confusing the two is the most common pathology.
A runbook is not training material. The training gate may walk an engineer through runbooks; the runbook itself assumes the reader is trained.
Operational signals¶
Healthy. Every paging condition has a corresponding runbook card. Cards are within the freshness budget. Drills regularly exercise the cards. On-call engineers report that runbooks save time during real incidents.
First degrading metric. The share of cards past their freshness budget rises from one review to the next. The team is not refreshing on cadence.
Misleading metric. Number of runbook documents. A wiki with 80 pages can have zero runbooks by this chapter's definition. The metric to watch is freshness-gated card count per paging condition.
Expert graph. Per-card freshness, per-card drill participation, per-card on-call satisfaction. A card that exists but is stale, never drilled, and gets a "needed but unhelpful" rating from on-call is a card that will fail in the moment.
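The metric to watch, fresh cards per paging condition, can be computed from two inputs: the list of paging conditions and each card's last validated date. This is a minimal sketch with hypothetical condition names and a 90-day budget assumed.

```python
from datetime import date


def card_coverage(
    paging_conditions: list[str],
    last_validated: dict[str, date],
    today: date,
    budget_days: int = 90,
) -> dict[str, str]:
    """Label each paging condition as 'fresh', 'stale', or 'missing' a card."""
    status = {}
    for condition in paging_conditions:
        validated = last_validated.get(condition)
        if validated is None:
            status[condition] = "missing"
        elif (today - validated).days > budget_days:
            status[condition] = "stale"
        else:
            status[condition] = "fresh"
    return status


# Hypothetical inputs: three paging conditions, two of which have cards.
conditions = ["quality_regression", "cost_spike", "safety_violation"]
validated = {"quality_regression": date(2026, 5, 14), "cost_spike": date(2025, 11, 1)}

print(card_coverage(conditions, validated, today=date(2026, 6, 1)))
# {'quality_regression': 'fresh', 'cost_spike': 'stale', 'safety_violation': 'missing'}
```

Counting documents would report two runbooks here; counting fresh cards per paging condition reports one out of three.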
Boundary of applicability¶
Strong fit. Teams with named paging conditions and recurring failure shapes. The full runbook discipline pays off.
Pathology. A team writing runbooks for hypothetical failures rather than observed ones. The catalogue grows; the cards are speculative; engineers do not trust them. The discipline is to write runbooks for failures the team has observed or drilled.
Scale limit. Very large platforms have hundreds of paging conditions. The pattern is to share runbook templates across similar failure shapes, with feature-specific deviations called out.
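Template sharing with called-out deviations might look like the following sketch. It assumes cards are dictionaries keyed by section and that a feature overrides whole sections; the `render_card` helper and the `overridden_sections` key are hypothetical.

```python
import copy


def render_card(template: dict, deviations: dict) -> dict:
    """Build a feature card from a shared template, recording which sections were overridden."""
    unknown = set(deviations) - set(template)
    if unknown:
        raise KeyError(f"deviations for sections not in the template: {sorted(unknown)}")
    card = copy.deepcopy(template)  # never mutate the shared template
    card.update(copy.deepcopy(deviations))
    card["overridden_sections"] = sorted(deviations)
    return card


template = {
    "first_ten_minutes": ["Acknowledge the page", "Open the alert link"],
    "mitigation": ["release-mgr rollback --deploy-id <id>"],
    "escalation": ["lead-on-call"],
}
card = render_card(template, {"mitigation": ["flags set <feature>.<intent>.enabled=false"]})

print(card["overridden_sections"])  # ['mitigation']
```

Recording the overridden sections on the card keeps the deviation visible to the on-call, who would otherwise have to diff the card against the template under stress.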
Failure-prone assumption¶
The seductive wrong belief: a comprehensive wiki is sufficient. It is not. A wiki is a knowledge base; a runbook is an action card. Engineers under stress read action cards, not knowledge bases. The wiki has value (background, design context, architectural diagrams) but it does not substitute for a runbook.
The correct belief: runbooks are operational artefacts with their own engineering investment. Treating runbook authoring as a separate engineering practice — with code review, versioning, freshness gates, and drill validation — is how the apparatus stays alive.
Where this appears in production¶
- A growth-stage SaaS writes runbooks as wiki pages; the freshness gate is not enforced; six months in, half the cards are stale.
- A fintech co-locates runbooks with feature code; PR reviewers flag missing updates; runbook health stays high.
- A telecom AI has runbooks per paging condition; mean time to action drops from 22 to 7 minutes.
- A consumer chatbot writes only descriptive runbooks ("check the dashboard"); engineers improvise; mean time to action is high.
- A healthtech AI runs drills that exercise every runbook quarterly; staleness is caught and remediated.
- A coding assistant has a runbook for every paging condition; new engineers can act on a real page after their training gate.
- A retail AI has a wiki with 60 pages but no scoped runbooks; engineers cannot find the action they need under stress.
- A logistics AI has runbooks with rollback commands that no longer match the deploy tooling; the freshness gate catches this in a quarterly drill.
- A government AI treats runbooks as audit artefacts; they exist but are not exercised; they fail in the real incident.
- A B2B SaaS has runbook templates that share structure across features; new features get a runbook in a few hours.
- A payments AI has the rollback command in every runbook payload; engineers paste and execute; mean time to action is under 5 minutes.
- A travel platform has runbooks owned by the previous primary, transferred on handoff; ownership stays current.
- A legal AI writes runbooks for novel failures after each postmortem; the catalogue grows organically.
- A media AI has a runbook author named on every card; the author is the first escalation hop if the runbook is unclear.
- A staffing AI has cards with explicit "what if step fails" branches; under stress, the engineer never has to improvise.
- A document AI has a runbook generator that produces a skeleton from the paging condition spec; engineers fill in the steps.
- An ad-tech AI measures runbook satisfaction in the on-call survey; cards rated low are refreshed.
- A real-estate AI has cards in markdown in the same repo as the feature; PR diffs show the runbook change alongside the system change.
- A search-rerank service runs a drill on every newly written runbook before considering it "live."
- A medical AI has runbooks that include the regulator notification step; the step is exercised in drills.
Recall / checkpoint¶
- Name the five sections of a runbook card.
- What is an executable step, and how does it differ from a descriptive one?
- What is the freshness gate, and how is it enforced?
- Why are runbooks versioned with the feature they document?
- Distinguish a runbook from a wiki article, a postmortem, and training material.
- What is the most common pathology that produces stale runbooks?
- How does drill participation relate to runbook health?
Interview Q&A¶
Q1. A team's runbooks are wiki pages last edited a year ago. Walk through the apparatus failure and the remediation.
The runbooks have rotted. Under stress, engineers either follow stale steps (and act on outdated commands) or ignore them (and improvise). Either is a failure. The remediation is to move runbooks to a versioned repo co-located with features, enforce a freshness gate (90-180 days) via tooling, and run drills that validate cards. The pattern is treating runbooks as operational engineering, not wiki content.
Common wrong answer to avoid: "tell the team to update the wiki". Wiki edits without tooling enforcement decay; the structural problem returns.
Q2. Walk through the difference between an executable step and a descriptive one.
A descriptive step says "check the gateway dashboard for refusal rate changes" and leaves the engineer to find the dashboard, identify the panel, decide what counts as a change, and decide what to do. An executable step gives the link, names the panel, specifies the baseline comparison, defines the significance threshold, and names the next step. The executable step survives stress; the descriptive step degrades into improvisation.
Common wrong answer to avoid: "we trust engineers to figure it out". Under stress, structured action beats trusted improvisation.
Q3. Why is the freshness gate non-optional?
Because systems change faster than their runbooks are updated. The deploy command changes; the dashboard URL moves; the escalation target leaves the team. Without the gate, the team has no signal that a runbook is now wrong; the first signal is a failed action in a real incident. The gate, enforced by tooling, surfaces staleness before the incident does. The cost is bounded: each refresh is a focused engineering task, not an open-ended one.
Common wrong answer to avoid: "we'll keep them updated as we change things". Change-driven updates miss cases the change author did not realise touched the runbook.
Q4. The team wants to write runbooks for every conceivable failure. Walk through your pushback.
Runbooks for hypothetical failures degrade trust. Engineers cannot validate that a hypothetical runbook works because the failure has not been observed; the card becomes speculative. Over time, the catalogue grows and engineers do not know which cards are real and which are imagined. The discipline is to write runbooks for paging conditions that exist (drilled or real) and update them as failures are observed. Hypothetical scenarios are training material, not runbooks.
Common wrong answer to avoid: "more coverage is better". Past a point, coverage degrades engagement.
Q5. Where should runbooks live, and why does the location matter?
In a versioned repo co-located with the feature they document. Co-location ensures the runbook is visible during feature changes; PR reviewers can flag missing updates; the runbook history is bisectable; the freshness gate can use git history. The alternative, a separate wiki or document store, separates runbook evolution from system evolution, and the runbook drifts.
Common wrong answer to avoid: "wiki for accessibility". Accessibility is a tooling problem; the runbook can be published to a wiki for read access while authored in the repo.
Q6. How does the runbook plane interact with the drill plane?
Drills are the runbook plane's validation mechanism. A drill exercises a runbook on a synthetic but realistic scenario; gaps in the runbook surface as the drill participants struggle. The drill produces runbook updates (or new runbooks for newly observed scenarios). A runbook that has not been exercised in a drill is unvalidated; the freshness gate may flag this as a separate dimension of staleness.
Common wrong answer to avoid: "drills are separate from runbook authoring". They are the validation arm of authoring.
Design / debug exercise (10 minutes)¶
Modelled example. Walk through the worked example (the quality regression runbook). Verify the five sections are populated. Identify any step that is descriptive rather than executable. Identify any branch that has no "what if it fails" path.
Your turn. Take one paging condition from your team's apparatus. Draft a runbook card with all five sections. Specifically, write each first-ten-minutes step as executable (action, expected output, what if different). Estimate where the card would fail in a drill.
Reproduce from memory. Draw the runbook card structure (the five sections, with what each contains). The signal of internalisation is that the structure lands in under two minutes; the test of mastery is that you can author a runbook for a new paging condition without rereading this chapter.
Operational memory¶
This chapter explained the runbook plane: scoped, executable, versioned, freshness-gated cards that walk an on-call through a specific failure shape. The important idea is that runbooks are operational engineering artefacts, not wiki content; their value comes from discipline (executable steps, freshness gates, drill validation), not from volume.
You learned to author a runbook card with five sections, distinguish executable from descriptive steps, enforce freshness via tooling, and co-locate runbooks with feature code. That solves the opening failure because the trained on-call now has actionable cards, not a wiki to grep.
Carry this diagnostic forward: when a team says "we have runbooks," ask to see one card. The card answers whether the apparatus has a runbook plane or has runbook-shaped wiki pages.
Remember:
- Five sections: identification, first ten minutes, diagnosis, mitigation, escalation.
- Steps are executable: action, expected output, what if different.
- Freshness gate is enforced by tooling, not good intentions.
- Versioned with the system; co-located with feature code.
- Validated by drills; rotted runbooks fail real incidents.
Bridge. The runbook tells the on-call what to do alone. The escalation plane tells them whom to bring in when alone is not enough. The next chapter is the discipline of the escalation graph — from on-call to specialist to provider. → 06-escalation-paths.md