00. AI runbooks and on-call operations: First-principles overview
Module 05_ai_incident_operations taught you how to run a live AI incident. This module is the matching discipline for what has to exist before the incident — the alerts, the rotation, the runbooks, the escalation policy, the postmortem pipeline, and the drills that turn a chaotic page into a competent response.
A platform engineer at a Hyderabad fintech is paged at 02:14 on a Sunday because the AI underwriting agent is misclassifying high-risk applications as low-risk. The engineer is on a generic SRE rotation that was never re-scoped for AI features. The alert payload says error_rate=0.03, which is within SLO. The only runbook the engineer finds is a 2019 document titled "How to restart the recommender service"; it makes no mention of model version, prompt version, retrieval index, eval scores, or provider health. The engineer spends thirty minutes finding someone on the AI team who has any context. The AI team's lead is on holiday, and their backup has not been formally trained for the rotation. By 04:00 the incident is contained, but the postmortem reveals that the team has six AI services, one runbook, no AI-specific alerts on quality regression, no AI-specific rotation, no drill cadence, and no postmortem template that captures eval delta or prompt diff. The honest answer is not that the team failed the incident; it is that the team had never built the on-call apparatus AI requires.
That apparatus is the subject of this module. The previous module taught you to handle the incident. This one teaches you to design the system that makes handling possible.
What an AI on-call apparatus is, in one sentence
An AI on-call apparatus is the standing infrastructure of alerts, rotations, runbooks, escalation paths, postmortem capture, and drill cadence that makes a competent AI incident response possible — sized for the AI-specific failure modes that classic SRE on-call assumptions do not cover.
Read the sentence left to right.
- Standing infrastructure — not the live incident, not the postmortem template alone, but the long-lived apparatus that operates between incidents.
- AI-specific failure modes — silent quality regression, prompt drift, retrieval staleness, provider behaviour shifts, cost runaway, safety boundary violations. None of these match the classic uptime-driven model that most SRE rotations were built for.
- Classic SRE on-call assumptions — symptom alerts, runbook-as-stale-document, generic rotation, postmortem-as-blame-shield. Each of these breaks on AI systems and has to be re-designed.
If a company has shipped AI features without consciously re-designing its on-call apparatus, the question is not whether it will have an AI incident; it is whether the team will recognise what kind of incident it is in time to contain it.
The six surfaces of the apparatus
Every AI on-call apparatus has six load-bearing surfaces. Memorise them; the rest of the module is consequences.
| Surface | One-liner | Pressure it answers |
|---|---|---|
| The alert plane | Surface AI-specific signals — quality, drift, cost, safety — as paging conditions, not just dashboards | invisibility: classic uptime alerts miss the failures that matter most |
| The rotation plane | Who is on call for which AI surface, with the right context and backup | coverage: AI ownership is split across model, prompt, retrieval, data, and product |
| The runbook plane | Versioned, executable documents that match the live failure shape | freshness: stale runbooks are worse than no runbooks |
| The escalation plane | The path from front-line on-call to the engineer, lead, and external party who can resolve | depth: AI failures often need the prompt owner, model owner, or provider on the call |
| The postmortem plane | The capture that turns one incident into eval coverage, alert improvement, and runbook update | learning: incidents that do not change the system repeat |
| The drill plane | The exercises that test the apparatus before real incidents do | atrophy: untested apparatus degrades silently |
The module's twelve chapters (files 00–11) are these six surfaces explored. The final two files (12–13) synthesise them: the architect checklist and the honest admission.
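The alert plane's "invisibility" pressure can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the `WindowStats` shape, the 5% error-rate SLO, the 0.10 allowed drop in mean eval score, and the sample eval scores are hypothetical and do not come from this module. The sketch only shows how an error-rate alert can stay silent while a quality alert on the same window pages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WindowStats:
    """One monitoring window of production traffic (hypothetical shape)."""
    requests: int
    errors: int
    eval_scores: list[float]  # sampled eval scores in [0, 1], higher is better


def error_rate_pages(stats: WindowStats, slo_error_rate: float = 0.05) -> bool:
    """Classic symptom alert: page only when the error rate breaches the SLO."""
    if stats.requests == 0:
        return False
    return stats.errors / stats.requests > slo_error_rate


def quality_pages(stats: WindowStats, baseline_score: float,
                  max_drop: float = 0.10, min_samples: int = 50) -> bool:
    """Quality alert: page when the mean eval score falls too far below baseline."""
    if len(stats.eval_scores) < min_samples:
        return False  # too few samples to trust the signal
    return baseline_score - mean(stats.eval_scores) > max_drop


# Illustrative window: error rate of 0.03 (within the assumed SLO),
# but eval scores well below an assumed baseline of 0.9.
window = WindowStats(requests=1000, errors=30, eval_scores=[0.6] * 200)
print(error_rate_pages(window))                    # classic alert stays silent
print(quality_pages(window, baseline_score=0.9))   # quality alert pages
```

The `min_samples` guard is one possible design choice: it trades detection speed for fewer false pages when eval sampling is sparse.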
What this module is not about
- Running a live incident. That is 05_ai_incident_operations. This module is what has to exist for that one to be runnable.
- Service uptime monitoring. Generic SRE monitoring is assumed; the module focuses on the AI-specific layer that sits on top.
- Model evaluation design. Covered in 04_ai_product_evals. This module consumes eval signals as alert inputs; it does not design them.
- Vendor incident response. Provider outages are routed via the model gateway (01_model_gateway_provider_ops); this module handles the on-call response to gateway-surfaced incidents.
The recurring vocabulary
These terms appear in every chapter.
| Name | Surface | What it is |
|---|---|---|
| the quality alert | Alert | a paging signal driven by eval-on-production-traffic, not by error rate |
| the prompt-version page | Alert | a paging signal triggered by anomaly after a prompt change |
| the provider-drift watch | Alert | a paging signal triggered by provider behaviour shift detected at the gateway |
| the cost-spike page | Alert | a paging signal triggered by tenant- or feature-level cost anomaly |
| the rotation roster | Rotation | the named primary/backup pairing for an AI surface, with weekly cadence |
| the runbook card | Runbook | a single executable document scoped to one failure shape, versioned with the system |
| the escalation graph | Escalation | the directed graph from on-call to lead to specialist to provider |
| the postmortem template | Postmortem | the structured capture that names cause, blast radius, eval delta, and follow-ups |
| the drill calendar | Drill | the recurring exercise schedule with named scenarios and scoring |
| the readiness score | Drill | the apparatus's measured health: alert coverage, runbook freshness, rotation training |
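The escalation graph in the table above is described as a directed graph from on-call to lead to specialist to provider. A minimal sketch of that idea follows. The node names and edges are hypothetical examples, not a roster from this module, and the breadth-first search is just one reasonable way to find the shortest chain of people to pull in.

```python
from collections import deque

# Hypothetical escalation graph: each role lists who it may escalate to next.
ESCALATION_GRAPH = {
    "on-call": ["lead"],
    "lead": ["prompt-owner", "model-owner"],
    "prompt-owner": ["provider"],
    "model-owner": ["provider"],
    "provider": [],
}


def escalation_path(graph, start, target):
    """Return the shortest escalation path from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route exists: a gap in the escalation plane


path = escalation_path(ESCALATION_GRAPH, "on-call", "provider")
print(path)  # ['on-call', 'lead', 'prompt-owner', 'provider']
```

Keeping the graph as data rather than prose means a missing route shows up as `None` in a check, instead of being discovered during a page.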
The journey: build the apparatus, then keep it alive
This module has two acts.
Act 1 — Build the apparatus (files 01–06). Why classic on-call fails for AI; the apparatus anatomy; alert design; rotation and ownership; runbook authoring; escalation paths. By file 06 the apparatus exists as a defensible standing service.
Act 2 — Operate the apparatus (files 07–11). The specific runbook families; postmortem capture; drills; on-call health and burnout. The apparatus does not become more powerful; it becomes resilient to time, scale, and the people who staff it.
Synthesis (files 12–13). Architect checklist and honest admission.
Memory map
| # | File | Surface | Pressure answered | What it adds |
|---|---|---|---|---|
| 01 | why-classic-oncall-fails-for-ai | — | uptime is not the right frame | the case that forces the apparatus to exist |
| 02 | the-oncall-apparatus | All | what the apparatus actually contains | the six surfaces as one architecture |
| 03 | alert-design-for-ai-systems | Alert | AI failures hide from classic alerts | quality, drift, cost, safety paging conditions |
| 04 | rotation-and-ownership | Rotation | AI surfaces span teams | primary/backup, training, handoff discipline |
| 05 | runbook-authoring | Runbook | runbooks rot the moment they are written | versioning, executable steps, freshness gates |
| 06 | escalation-paths | Escalation | first-line on-call rarely has the fix | named graph from on-call to specialist to provider |
| | milestone: apparatus is defensible | | | |
| 07 | degraded-quality-runbooks | Runbook | silent quality regression is the hardest AI failure | specific runbooks for the quality family |
| 08 | provider-and-cost-runbooks | Runbook | provider and cost incidents have distinct shapes | specific runbooks for these families |
| 09 | postmortem-capture | Postmortem | incidents teach only when captured | template, eval-delta requirement, follow-up enforcement |
| 10 | drills-and-game-days | Drill | apparatus rusts between incidents | drill cadence, scenarios, scoring |
| 11 | oncall-health-and-burnout | Rotation | the apparatus runs on humans | load metrics, alert hygiene, fairness |
| | milestone: apparatus is operable | | | |
| 12 | architect-checklist | Synthesis | completeness | 20-item design and operate checklist |
| 13 | honest-admission | Boundaries | humility | what apparatus design cannot solve |
Three traversal paths use this map. Prerequisite path — top to bottom. Failure path — when a page wakes you, match the failure family to the runbook chapter and the apparatus failure to the build chapter. Synthesis path — pick a surface and a runbook family and ask how they compose (alert design + degraded quality runbooks = how the page payload pre-loads the runbook the on-call needs).
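The failure path can be sketched as a pair of lookups. The file names below come from the memory map; the dictionary keys (`"quality"`, `"alert"`, and so on) and the `failure_path` helper are hypothetical labels chosen for this illustration.

```python
from typing import List, Optional

# Failure family -> runbook chapter (file names from the memory map).
RUNBOOK_CHAPTERS = {
    "quality": "07-degraded-quality-runbooks.md",
    "provider": "08-provider-and-cost-runbooks.md",
    "cost": "08-provider-and-cost-runbooks.md",
}

# Apparatus surface that failed -> build chapter (file names from the memory map).
BUILD_CHAPTERS = {
    "alert": "03-alert-design-for-ai-systems.md",
    "rotation": "04-rotation-and-ownership.md",
    "runbook": "05-runbook-authoring.md",
    "escalation": "06-escalation-paths.md",
}


def failure_path(failure_family: str, apparatus_gap: Optional[str] = None) -> List[str]:
    """Return the chapters to read for a page: runbook chapter first, then build chapter."""
    chapters = []
    if failure_family in RUNBOOK_CHAPTERS:
        chapters.append(RUNBOOK_CHAPTERS[failure_family])
    if apparatus_gap in BUILD_CHAPTERS:
        chapters.append(BUILD_CHAPTERS[apparatus_gap])
    return chapters


# A silent quality regression that no alert caught.
print(failure_path("quality", "alert"))
```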
How this module relates to its neighbours
- 05_ai_incident_operations: runs the live incident this apparatus enables. Read both as a pair.
- 01_model_gateway_provider_ops: surfaces provider drift, cost runaway, and quota exhaustion that this module's alerts consume.
- 02_telemetry_feedback_loops: feeds eval-on-production-traffic signals that drive quality alerts.
- 13_prompt_lifecycle_operations: owns the prompt-version-change event that triggers prompt anomaly watches.
- 03_data_access_governance: defines the safety boundary whose violation is a paging condition.
- 03_ai_release_management: the release process whose canary and rollback hooks the apparatus depends on.
Top resources
- Google SRE Workbook — On-Call — https://sre.google/workbook/on-call/
- Atlassian Incident Management Handbook — https://www.atlassian.com/incident-management
- Resilience Engineering Association — Postmortems — https://resilience-engineering.org/
- PagerDuty — Incident Response Documentation — https://response.pagerduty.com/
These are the SRE baselines. The AI-specific surface on top is the contribution of this module.
What's coming
- 01-why-classic-oncall-fails-for-ai.md — uptime is not the right frame.
- 02-the-oncall-apparatus.md — the six surfaces as a service architecture.
- 03-alert-design-for-ai-systems.md — quality, drift, cost, safety as paging conditions.
- 04-rotation-and-ownership.md — who is on call for which AI surface.
- 05-runbook-authoring.md — versioned, executable, never stale.
- 06-escalation-paths.md — the graph from on-call to provider.
- 07-degraded-quality-runbooks.md — the silent-regression family.
- 08-provider-and-cost-runbooks.md — provider outage, cost spike, quota exhaustion.
- 09-postmortem-capture.md — incidents that change the system.
- 10-drills-and-game-days.md — testing the apparatus.
- 11-oncall-health-and-burnout.md — the humans behind the apparatus.
- 12-architect-checklist.md — twenty items.
- 13-honest-admission.md — limits.
Bridge. Before designing alerts, rotations, or runbooks, we feel why the apparatus is needed at all. Classic SRE on-call was built around uptime and error-rate symptoms; AI systems fail through quality, drift, cost, and safety in ways the classic apparatus does not catch. The first chapter is that diagnosis. → 01-why-classic-oncall-fails-for-ai.md