00. AI runbooks and on-call operations: First-principles overview

Module 05_ai_incident_operations taught you how to run a live AI incident. This module is the matching discipline for what has to exist before the incident — the alerts, the rotation, the runbooks, the escalation policy, the postmortem pipeline, and the drills that turn a chaotic page into a competent response.

A platform engineer at a Hyderabad fintech is paged at 02:14 on a Sunday because the AI underwriting agent is misclassifying high-risk applications as low-risk. The engineer is on a generic SRE rotation that was never re-scoped for AI features. The alert payload says error_rate=0.03, which is within SLO. The runbook the engineer finds is a 2019 document titled "How to restart the recommender service", which does not mention model version, prompt version, retrieval index, eval scores, or provider health. The engineer spends thirty minutes finding someone on the AI team who has any context. The AI team's lead is on holiday; their backup has never been formally trained for the rotation.

By 04:00 the incident is contained, but the postmortem reveals that the team has six AI services, one runbook, no AI-specific alerts on quality regression, no rotation, no drill cadence, and no postmortem template that captures eval delta or prompt diff. The honest conclusion is not that the team failed the incident; it is that the team had never built the on-call apparatus that AI requires.

That apparatus is the subject of this module. The previous module taught you to handle the incident. This one teaches you to design the system that makes handling possible.

What an AI on-call apparatus is, in one sentence

An AI on-call apparatus is the standing infrastructure of alerts, rotations, runbooks, escalation paths, postmortem capture, and drill cadence that makes a competent AI incident response possible — sized for the AI-specific failure modes that classic SRE on-call assumptions do not cover.

Read the sentence left to right.

Standing infrastructure — not the live incident, not the postmortem template alone, but the long-lived apparatus that operates between incidents.
AI-specific failure modes — silent quality regression, prompt drift, retrieval staleness, provider behaviour shifts, cost runaway, safety boundary violations. None of these match the classic uptime-driven model that most SRE rotations were built for.
Classic SRE on-call assumptions — symptom alerts, runbook-as-stale-document, generic rotation, postmortem-as-blame-shield. Each of these breaks on AI systems and has to be re-designed.
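
The gap between a symptom alert and an AI-specific alert can be sketched in a few lines. This is a minimal illustration, not a production rule: the `Signals` type, both functions, the thresholds, and the eval score of 0.71 are all hypothetical. Only the `error_rate=0.03` value comes from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    error_rate: float  # fraction of failed requests in the window
    eval_score: float  # mean eval score on sampled production traffic, 0..1


# Hypothetical thresholds; real values come from the team's SLOs and eval baselines.
ERROR_RATE_SLO = 0.05
EVAL_BASELINE = 0.90
MAX_EVAL_DROP = 0.05


def classic_page(s: Signals) -> bool:
    """Classic symptom alert: page only when the error rate breaches the SLO."""
    return s.error_rate > ERROR_RATE_SLO


def quality_page(s: Signals) -> bool:
    """Quality alert: page when the eval score falls too far below its baseline."""
    return (EVAL_BASELINE - s.eval_score) > MAX_EVAL_DROP


# The 02:14 scenario: requests succeed, but the answers are wrong.
incident = Signals(error_rate=0.03, eval_score=0.71)
print(classic_page(incident), quality_page(incident))  # False True
```

Under these assumed numbers the classic rule stays silent while the quality rule pages, which is the failure shape the list above describes.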

If a company has shipped AI features without consciously re-designing its on-call apparatus, the question is not whether it will have an AI incident; it is whether the team will recognise what kind of incident it is in time to contain it.

The six surfaces of the apparatus

Every AI on-call apparatus has six load-bearing surfaces. Memorise them; the rest of the module is consequences.

Surface	One-liner	Pressure it answers
The alert plane	Surface AI-specific signals — quality, drift, cost, safety — as paging conditions, not just dashboards	invisibility: classic uptime alerts miss the failures that matter most
The rotation plane	Who is on call for which AI surface, with the right context and backup	coverage: AI ownership is split across model, prompt, retrieval, data, and product
The runbook plane	Versioned, executable documents that match the live failure shape	freshness: stale runbooks are worse than no runbooks
The escalation plane	The path from front-line on-call to the engineer, lead, and external party who can resolve	depth: AI failures often need the prompt owner, model owner, or provider on the call
The postmortem plane	The capture that turns one incident into eval coverage, alert improvement, and runbook update	learning: incidents that do not change the system repeat
The drill plane	The exercises that test the apparatus before real incidents do	atrophy: untested apparatus degrades silently
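
One way to make the runbook plane's freshness pressure concrete is to record which system versions a runbook was written against and compare them with what is live. The sketch below assumes hypothetical `SystemVersions` and `RunbookCard` types and made-up version strings; the three version fields are the ones the opening scenario found missing from the 2019 runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemVersions:
    model: str
    prompt: str
    retrieval_index: str


@dataclass(frozen=True)
class RunbookCard:
    title: str
    failure_shape: str
    written_against: SystemVersions  # versions the steps were last verified on


def stale_fields(card: RunbookCard, live: SystemVersions) -> list[str]:
    """Return the version fields where the card no longer matches the live system."""
    return [
        name
        for name in ("model", "prompt", "retrieval_index")
        if getattr(card.written_against, name) != getattr(live, name)
    ]


# Illustrative values only.
card = RunbookCard(
    title="Underwriting agent: risk misclassification",
    failure_shape="silent quality regression",
    written_against=SystemVersions("model-v3", "prompt-v12", "index-2024-05"),
)
live = SystemVersions("model-v3", "prompt-v14", "index-2024-05")
print(stale_fields(card, live))  # ['prompt']
```

A non-empty result could block a release or open a review task; which of those is appropriate is a policy choice this sketch does not make.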

Files 01 to 11 explore these six surfaces. The final two files, 12 and 13, synthesise them as the architect checklist and the honest admission.

What this module is not about

Running a live incident. That is 05_ai_incident_operations. This module is what has to exist for that one to be runnable.
Service uptime monitoring. Generic SRE monitoring is assumed; the module focuses on the AI-specific layer that sits on top.
Model evaluation design. Covered in 04_ai_product_evals. This module consumes eval signals as alert inputs; it does not design them.
Vendor incident response. Provider outages are routed via the model gateway (01_model_gateway_provider_ops); this module handles the on-call response to gateway-surfaced incidents.

The recurring vocabulary

These terms appear in every chapter.

Name	Surface	What it is
the quality alert	Alert	a paging signal driven by eval-on-production-traffic, not by error rate
the prompt-version page	Alert	a paging signal triggered by anomaly after a prompt change
the provider-drift watch	Alert	a paging signal triggered by provider behaviour shift detected at the gateway
the cost-spike page	Alert	a paging signal triggered by tenant- or feature-level cost anomaly
the rotation roster	Rotation	the named primary/backup pairing for an AI surface, with weekly cadence
the runbook card	Runbook	a single executable document scoped to one failure shape, versioned with the system
the escalation graph	Escalation	the directed graph from on-call to lead to specialist to provider
the postmortem template	Postmortem	the structured capture that names cause, blast radius, eval delta, and follow-ups
the drill calendar	Drill	the recurring exercise schedule with named scenarios and scoring
the readiness score	Drill	the apparatus's measured health: alert coverage, runbook freshness, rotation training
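
The escalation graph in the table above is an ordinary directed graph, so the path a page should follow can be computed rather than remembered. The sketch below is one possible representation: the role names and edges are hypothetical, and breadth-first search is simply a convenient way to find the shortest route.

```python
from collections import deque

# Hypothetical roles; each edge points from a role to whoever it escalates to.
ESCALATION_GRAPH = {
    "on-call": ["lead"],
    "lead": ["prompt-owner", "model-owner"],
    "prompt-owner": [],
    "model-owner": ["provider"],
    "provider": [],
}


def escalation_path(graph, start, target):
    """Breadth-first search for the shortest escalation path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(escalation_path(ESCALATION_GRAPH, "on-call", "provider"))
# ['on-call', 'lead', 'model-owner', 'provider']
```

A `None` result is useful in its own right: it flags a role that nobody on the rotation can reach, which is worth finding in a drill rather than during a page.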

The journey: build the apparatus, then keep it alive

This module has two acts.

Act 1 — Build the apparatus (files 01–06). Why classic on-call fails for AI; the apparatus anatomy; alert design; rotation and ownership; runbook authoring; escalation paths. By file 06 the apparatus exists as a defensible standing service.

Act 2 — Operate the apparatus (files 07–11). The specific runbook families; postmortem capture; drills; on-call health and burnout. The apparatus does not become more powerful; it becomes resilient to time, scale, and the people who staff it.

Synthesis (files 12–13). Architect checklist and honest admission.

Memory map

#	File	Surface	Pressure answered	What it adds
01	why-classic-oncall-fails-for-ai	—	uptime is not the right frame	the case that forces the apparatus to exist
02	the-oncall-apparatus	All	what the apparatus actually contains	the six surfaces as one architecture
03	alert-design-for-ai-systems	Alert	AI failures hide from classic alerts	quality, drift, cost, safety paging conditions
04	rotation-and-ownership	Rotation	AI surfaces span teams	primary/backup, training, handoff discipline
05	runbook-authoring	Runbook	runbooks rot the moment they are written	versioning, executable steps, freshness gates
06	escalation-paths	Escalation	first-line on-call rarely has the fix	named graph from on-call to specialist to provider
	— milestone: apparatus is defensible —
07	degraded-quality-runbooks	Runbook	silent quality regression is the hardest AI failure	specific runbooks for the quality family
08	provider-and-cost-runbooks	Runbook	provider and cost incidents have distinct shapes	specific runbooks for these families
09	postmortem-capture	Postmortem	incidents teach only when captured	template, eval-delta requirement, follow-up enforcement
10	drills-and-game-days	Drill	apparatus rusts between incidents	drill cadence, scenarios, scoring
11	oncall-health-and-burnout	Rotation	the apparatus runs on humans	load metrics, alert hygiene, fairness
	— milestone: apparatus is operable —
12	architect-checklist	Synthesis	completeness	20-item design and operate checklist
13	honest-admission	Boundaries	humility	what apparatus design cannot solve

Three traversal paths use this map.

Prerequisite path. Read top to bottom.
Failure path. When a page wakes you, match the failure family to its runbook chapter and the apparatus failure to its build chapter.
Synthesis path. Pick a surface and a runbook family and ask how they compose. For example, alert design plus degraded-quality runbooks shows how the page payload pre-loads the runbook the on-call engineer needs.
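
The failure path amounts to a lookup from failure family to runbook chapter. The sketch below takes the chapter names from the memory map; the family keys and the fallback to the escalation chapter are illustrative assumptions, not part of the module.

```python
# Failure families mapped to runbook chapters from the memory map.
# The family keys are illustrative labels, not names used by any paging tool.
RUNBOOK_CHAPTER = {
    "quality_regression": "07-degraded-quality-runbooks",
    "provider_outage": "08-provider-and-cost-runbooks",
    "cost_spike": "08-provider-and-cost-runbooks",
    "quota_exhaustion": "08-provider-and-cost-runbooks",
}

# Assumption: an unrecognised family goes to the escalation chapter.
FALLBACK_CHAPTER = "06-escalation-paths"


def chapter_for(failure_family: str) -> str:
    """Return the chapter to open for a failure family named in a page payload."""
    return RUNBOOK_CHAPTER.get(failure_family, FALLBACK_CHAPTER)


print(chapter_for("cost_spike"))  # 08-provider-and-cost-runbooks
```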

How this module relates to its neighbours

05_ai_incident_operations — runs the live incident this apparatus enables. Read both as a pair.
01_model_gateway_provider_ops — surfaces provider drift, cost runaway, and quota exhaustion that this module's alerts consume.
02_telemetry_feedback_loops — feeds eval-on-production-traffic signals that drive quality alerts.
13_prompt_lifecycle_operations — owns the prompt-version-change event that triggers prompt anomaly watches.
03_data_access_governance — defines the safety boundary whose violation is a paging condition.
03_ai_release_management — the release process whose canary and rollback hooks the apparatus depends on.

Top resources

Google SRE Workbook — On-Call — https://sre.google/workbook/on-call/
Atlassian Incident Management Handbook — https://www.atlassian.com/incident-management
Resilience Engineering Association — Postmortems — https://resilience-engineering.org/
PagerDuty — Incident Response Documentation — https://response.pagerduty.com/

These are the SRE baselines. The AI-specific surface on top is the contribution of this module.

What's coming

01-why-classic-oncall-fails-for-ai.md — uptime is not the right frame.
02-the-oncall-apparatus.md — the six surfaces as a service architecture.
03-alert-design-for-ai-systems.md — quality, drift, cost, safety as paging conditions.
04-rotation-and-ownership.md — who is on call for which AI surface.
05-runbook-authoring.md — versioned, executable, never stale.
06-escalation-paths.md — the graph from on-call to provider.
07-degraded-quality-runbooks.md — the silent-regression family.
08-provider-and-cost-runbooks.md — provider outage, cost spike, quota exhaustion.
09-postmortem-capture.md — incidents that change the system.
10-drills-and-game-days.md — testing the apparatus.
11-oncall-health-and-burnout.md — the humans behind the apparatus.
12-architect-checklist.md — twenty items.
13-honest-admission.md — limits.

Bridge. Before designing alerts, rotations, or runbooks, we need to see why the apparatus is needed at all. Classic SRE on-call was built around uptime and error-rate symptoms; AI systems fail through quality, drift, cost, and safety in ways the classic apparatus does not catch. The first chapter makes that diagnosis. → 01-why-classic-oncall-fails-for-ai.md