00. AI runbooks and on-call operations — First-principles overview

Module 05_ai_incident_operations taught you how to run a live AI incident. This module is the matching discipline for what has to exist before the incident — the alerts, the rotation, the runbooks, the escalation policy, the postmortem pipeline, and the drills that turn a chaotic page into a competent response.


A platform engineer at a Hyderabad fintech is paged at 02:14 on a Sunday because the AI underwriting agent is misclassifying high-risk applications as low-risk. The engineer is on a generic SRE rotation that was never re-scoped for AI features. The alert payload says error_rate=0.03, which is within SLO. The only runbook the engineer finds is a 2019 document titled "How to restart the recommender service", which does not mention model version, prompt version, retrieval index, eval scores, or provider health. The engineer spends thirty minutes finding a person on the AI team who has any context. The AI team's lead is on holiday; their backup has not been formally trained for the rotation. By 04:00 the incident is contained, but the postmortem reveals that the team has six AI services, one runbook, no AI-specific alerts on quality regression, no AI-specific rotation, no drill cadence, and no postmortem template that captures eval delta or prompt diff. The honest answer is not that the team failed the incident; it is that the team had never built the on-call apparatus AI requires.

That apparatus is the subject of this module. The previous module taught you to handle the incident. This one teaches you to design the system that makes handling possible.


What an AI on-call apparatus is, in one sentence

An AI on-call apparatus is the standing infrastructure of alerts, rotations, runbooks, escalation paths, postmortem capture, and drill cadence that makes a competent AI incident response possible — sized for the AI-specific failure modes that classic SRE on-call assumptions do not cover.

Read the sentence left to right.

  • Standing infrastructure — not the live incident, not the postmortem template alone, but the long-lived apparatus that operates between incidents.
  • AI-specific failure modes — silent quality regression, prompt drift, retrieval staleness, provider behaviour shifts, cost runaway, safety boundary violations. None of these match the classic uptime-driven model that most SRE rotations were built for.
  • Classic SRE on-call assumptions — symptom alerts, runbook-as-stale-document, generic rotation, postmortem-as-blame-shield. Each of these breaks on AI systems and has to be re-designed.

If a company has shipped AI features without consciously re-designing its on-call apparatus, the question is not whether it will have an AI incident; it is whether the team will recognise what kind of incident it is in time to contain it.


The six surfaces of the apparatus

Every AI on-call apparatus has six load-bearing surfaces. Memorise them; the rest of the module is consequences.

| Surface | One-liner | Pressure it answers |
|---|---|---|
| The alert plane | Surface AI-specific signals (quality, drift, cost, safety) as paging conditions, not just dashboards | invisibility: classic uptime alerts miss the failures that matter most |
| The rotation plane | Who is on call for which AI surface, with the right context and backup | coverage: AI ownership is split across model, prompt, retrieval, data, and product |
| The runbook plane | Versioned, executable documents that match the live failure shape | freshness: stale runbooks are worse than no runbooks |
| The escalation plane | The path from front-line on-call to the engineer, lead, and external party who can resolve | depth: AI failures often need the prompt owner, model owner, or provider on the call |
| The postmortem plane | The capture that turns one incident into eval coverage, alert improvement, and runbook update | learning: incidents that do not change the system repeat |
| The drill plane | The exercises that test the apparatus before real incidents do | atrophy: untested apparatus degrades silently |

The module's thirteen chapters explore these six surfaces and then synthesise them. The final two files are the architect checklist and the honest admission.
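To make the alert plane concrete, here is a minimal sketch of a paging check that treats eval results on sampled production traffic as a paging condition alongside error rate. The thresholds, field names, the service name, and the eval pass rate of 0.71 are all assumptions made up for illustration; only the error_rate=0.03 figure comes from the opening scenario.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
ERROR_RATE_SLO = 0.05        # assumed ceiling for the request error rate
EVAL_PASS_RATE_FLOOR = 0.90  # assumed floor for eval pass rate on sampled traffic


@dataclass
class ServiceSnapshot:
    service: str
    error_rate: float      # fraction of requests that failed outright
    eval_pass_rate: float  # fraction of sampled production outputs that passed evals


def paging_reasons(snapshot: ServiceSnapshot) -> list[str]:
    """Return the reasons this snapshot should page; an empty list means no page."""
    reasons = []
    if snapshot.error_rate > ERROR_RATE_SLO:
        reasons.append("error_rate above SLO")
    if snapshot.eval_pass_rate < EVAL_PASS_RATE_FLOOR:
        reasons.append("eval pass rate below floor")
    return reasons


# Error rate is within the assumed SLO, yet the (invented) eval pass rate is not.
snapshot = ServiceSnapshot("underwriting-agent", error_rate=0.03, eval_pass_rate=0.71)
print(paging_reasons(snapshot))  # ['eval pass rate below floor']
```

With an error-rate-only check this snapshot would never page. The second condition is what lets a silent quality regression reach the on-call engineer.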


What this module is not about

  • Running a live incident. That is 05_ai_incident_operations. This module is what has to exist for that one to be runnable.
  • Service uptime monitoring. Generic SRE monitoring is assumed; the module focuses on the AI-specific layer that sits on top.
  • Model evaluation design. Covered in 04_ai_product_evals. This module consumes eval signals as alert inputs; it does not design them.
  • Vendor incident response. Provider outages are routed via the model gateway (01_model_gateway_provider_ops); this module handles the on-call response to gateway-surfaced incidents.

The recurring vocabulary

These terms appear in every chapter.

| Name | Surface | What it is |
|---|---|---|
| the quality alert | Alert | a paging signal driven by eval-on-production-traffic, not by error rate |
| the prompt-version page | Alert | a paging signal triggered by anomaly after a prompt change |
| the provider-drift watch | Alert | a paging signal triggered by provider behaviour shift detected at the gateway |
| the cost-spike page | Alert | a paging signal triggered by tenant- or feature-level cost anomaly |
| the rotation roster | Rotation | the named primary/backup pairing for an AI surface, with weekly cadence |
| the runbook card | Runbook | a single executable document scoped to one failure shape, versioned with the system |
| the escalation graph | Escalation | the directed graph from on-call to lead to specialist to provider |
| the postmortem template | Postmortem | the structured capture that names cause, blast radius, eval delta, and follow-ups |
| the drill calendar | Drill | the recurring exercise schedule with named scenarios and scoring |
| the readiness score | Drill | the apparatus's measured health: alert coverage, runbook freshness, rotation training |
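As one illustration of this vocabulary, the escalation graph can be sketched as a plain adjacency map with a search for the shortest chain of handoffs. The role names and edges below are hypothetical; a real graph would name people or teams and would likely carry contact details and time limits.

```python
from collections import deque

# Hypothetical escalation graph: each role maps to the roles it may escalate to.
ESCALATION_GRAPH = {
    "on-call": ["lead"],
    "lead": ["prompt-owner", "model-owner"],
    "prompt-owner": ["provider"],
    "model-owner": ["provider"],
    "provider": [],
}


def escalation_path(graph, start, target):
    """Breadth-first search for the shortest chain of handoffs from start to target.

    Returns the list of roles along the path, or None if target is unreachable.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_role in graph.get(path[-1], []):
            if next_role not in seen:
                seen.add(next_role)
                queue.append(path + [next_role])
    return None


print(escalation_path(ESCALATION_GRAPH, "on-call", "provider"))
# ['on-call', 'lead', 'prompt-owner', 'provider']
```

Keeping the graph as data rather than tribal knowledge means a drill can check that every failure family has a reachable resolver.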

The journey: build the apparatus, then keep it alive

This module has two acts.

Act 1 — Build the apparatus (files 01–06). Why classic on-call fails for AI; the apparatus anatomy; alert design; rotation and ownership; runbook authoring; escalation paths. By file 06 the apparatus exists as a defensible standing service.

Act 2 — Operate the apparatus (files 07–11). The specific runbook families; postmortem capture; drills; on-call health and burnout. The apparatus does not become more powerful; it becomes resilient to time, scale, and the people who staff it.

Synthesis (files 12–13). Architect checklist and honest admission.


Memory map

| # | File | Surface | Pressure answered | What it adds |
|---|---|---|---|---|
| 01 | why-classic-oncall-fails-for-ai | | uptime is not the right frame | the case that forces the apparatus to exist |
| 02 | the-oncall-apparatus | All | what the apparatus actually contains | the six surfaces as one architecture |
| 03 | alert-design-for-ai-systems | Alert | AI failures hide from classic alerts | quality, drift, cost, safety paging conditions |
| 04 | rotation-and-ownership | Rotation | AI surfaces span teams | primary/backup, training, handoff discipline |
| 05 | runbook-authoring | Runbook | runbooks rot the moment they are written | versioning, executable steps, freshness gates |
| 06 | escalation-paths | Escalation | first-line on-call rarely has the fix | named graph from on-call to specialist to provider |
| | milestone: apparatus is defensible | | | |
| 07 | degraded-quality-runbooks | Runbook | silent quality regression is the hardest AI failure | specific runbooks for the quality family |
| 08 | provider-and-cost-runbooks | Runbook | provider and cost incidents have distinct shapes | specific runbooks for these families |
| 09 | postmortem-capture | Postmortem | incidents teach only when captured | template, eval-delta requirement, follow-up enforcement |
| 10 | drills-and-game-days | Drill | apparatus rusts between incidents | drill cadence, scenarios, scoring |
| 11 | oncall-health-and-burnout | Rotation | the apparatus runs on humans | load metrics, alert hygiene, fairness |
| | milestone: apparatus is operable | | | |
| 12 | architect-checklist | Synthesis | completeness | 20-item design and operate checklist |
| 13 | honest-admission | Boundaries | humility | what apparatus design cannot solve |

Three traversal paths use this map.

  • Prerequisite path: top to bottom.
  • Failure path: when a page wakes you, match the failure family to the runbook chapter and the apparatus failure to the build chapter.
  • Synthesis path: pick a surface and a runbook family and ask how they compose (alert design + degraded quality runbooks = how the page payload pre-loads the runbook the on-call needs).
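The failure path is essentially a two-key lookup into the memory map, and a small sketch may make that easier to see. The chapter names come from the map above; the family and surface keys ("quality", "alert", and so on) are labels assumed for this example, not an official taxonomy.

```python
# Failure family -> runbook chapter (Act 2). Keys are hypothetical labels.
RUNBOOK_CHAPTER = {
    "quality": "07-degraded-quality-runbooks",
    "provider": "08-provider-and-cost-runbooks",
    "cost": "08-provider-and-cost-runbooks",
}

# Apparatus surface that failed -> build chapter (Act 1). Keys are hypothetical labels.
BUILD_CHAPTER = {
    "alert": "03-alert-design-for-ai-systems",
    "rotation": "04-rotation-and-ownership",
    "runbook": "05-runbook-authoring",
    "escalation": "06-escalation-paths",
}


def failure_path(failure_family, apparatus_failure):
    """Return (runbook chapter, build chapter); either is None if the key is unknown."""
    return (
        RUNBOOK_CHAPTER.get(failure_family),
        BUILD_CHAPTER.get(apparatus_failure),
    )


# A quality regression that no alert caught points at chapters 07 and 03.
print(failure_path("quality", "alert"))
# ('07-degraded-quality-runbooks', '03-alert-design-for-ai-systems')
```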


How this module relates to its neighbours


Top resources

  • Google SRE Workbook — On-Call — https://sre.google/workbook/on-call/
  • Atlassian Incident Management Handbook — https://www.atlassian.com/incident-management
  • Resilience Engineering Association — Postmortems — https://resilience-engineering.org/
  • PagerDuty — Incident Response Documentation — https://response.pagerduty.com/

These are the SRE baselines. The AI-specific surface on top is the contribution of this module.


What's coming

  1. 01-why-classic-oncall-fails-for-ai.md — uptime is not the right frame.
  2. 02-the-oncall-apparatus.md — the six surfaces as a service architecture.
  3. 03-alert-design-for-ai-systems.md — quality, drift, cost, safety as paging conditions.
  4. 04-rotation-and-ownership.md — who is on call for which AI surface.
  5. 05-runbook-authoring.md — versioned, executable, never stale.
  6. 06-escalation-paths.md — the graph from on-call to provider.
  7. 07-degraded-quality-runbooks.md — the silent-regression family.
  8. 08-provider-and-cost-runbooks.md — provider outage, cost spike, quota exhaustion.
  9. 09-postmortem-capture.md — incidents that change the system.
  10. 10-drills-and-game-days.md — testing the apparatus.
  11. 11-oncall-health-and-burnout.md — the humans behind the apparatus.
  12. 12-architect-checklist.md — twenty items.
  13. 13-honest-admission.md — limits.

Bridge. Before designing alerts, rotations, or runbooks, we feel why the apparatus is needed at all. Classic SRE on-call was built around uptime and error-rate symptoms; AI systems fail through quality, drift, cost, and safety in ways the classic apparatus does not catch. The first chapter is that diagnosis. → 01-why-classic-oncall-fails-for-ai.md