02. The on-call apparatus
~12 min read. The previous chapter showed why the classic apparatus fails on AI. This chapter is the prescription — the six surfaces of an AI on-call apparatus as a single service architecture, so the rest of the module can develop each surface in turn.
Continues from 01-why-classic-oncall-fails-for-ai.md. The alert plane, rotation plane, runbook plane, escalation plane, postmortem plane, and drill plane are the six surfaces that compose the apparatus. The module's later chapters expand each surface in detail; this chapter is the map.
The previous chapter named five failure families the classic apparatus misses. This chapter names the apparatus that catches them. The aim is the same shape the gateway module took in its anatomy chapter: a small number of named surfaces, each with a defined input, output, and ownership, that together compose the production discipline.
What the apparatus is, anatomically
The AI on-call apparatus is the named composition of six surfaces — alerts, rotations, runbooks, escalations, postmortems, drills — designed so that an AI-specific failure produces a competent response within the SLO the team has chosen.
Read the sentence right to left.
- Within the SLO — the apparatus's output is a time-to-contain. The SLO is chosen; the apparatus delivers.
- Competent response — the on-call has the alert payload, the runbook, the escalation graph, and the context to act in the first ten minutes.
- AI-specific failure — the apparatus's input is a paging condition from one of the five families.
- Six surfaces — the apparatus is not a single tool; it is the standing composition of six things.
If a team has fewer than six surfaces, the apparatus has a gap. The most common pattern is three: alerts (often classic only), rotation (often generic), and stale runbooks. Postmortem, drill, and escalation surfaces are usually under-invested.
The six surfaces, with input, output, owner
Surface 1: Alert plane
Input. Signals from production: eval-on-traffic scores, gateway drift detectors, cost telemetry, safety classifiers, latency and error rate. Also user feedback signals (thumbs-down, support ticket volume on AI surfaces).
Output. Paging conditions wired to a paging system (PagerDuty, Opsgenie, internal). Each paging condition has: severity, payload, runbook link, owning team.
Owner. The platform team typically owns the alert wiring; the feature team owns the thresholds for their feature.
The alert plane is the apparatus's eyes. Chapter 03 develops it in detail.
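As a minimal sketch of the alert plane's output, the four required parts of a paging condition can be held in one record and checked before the condition is wired. The class name, field names, severity labels, and the runbook URL are assumptions for illustration; this is not the schema of PagerDuty, Opsgenie, or any other paging system.

```python
from dataclasses import dataclass

# Hypothetical severity labels; substitute the scale your paging system uses.
SEVERITIES = {"sev1", "sev2", "sev3"}


@dataclass(frozen=True)
class PagingCondition:
    """One alert-plane output: severity, payload, runbook link, owning team."""
    name: str
    severity: str
    payload_fields: tuple  # names of the context fields sent with the page
    runbook_url: str
    owning_team: str


def missing_parts(cond):
    """List what still blocks this condition from being wired (empty if ready)."""
    problems = []
    if cond.severity not in SEVERITIES:
        problems.append("unknown severity")
    if not cond.payload_fields:
        problems.append("empty payload")
    if not cond.runbook_url:
        problems.append("no runbook link")
    if not cond.owning_team:
        problems.append("no owning team")
    return problems


eval_drop = PagingCondition(
    name="eval-on-traffic score drop",
    severity="sev2",
    payload_fields=("feature", "score_before", "score_after", "window"),
    runbook_url="https://runbooks.example.internal/eval-drop",  # placeholder
    owning_team="account-agent",
)
unowned = PagingCondition("tenant cost anomaly", "sev2", ("tenant",), "", "")

print(missing_parts(eval_drop))  # []
print(missing_parts(unowned))    # ['no runbook link', 'no owning team']
```

A check of this kind is cheap to run in code review, so a condition without a runbook link or an owner is rejected before it can page anyone.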
Surface 2: Rotation plane
Input. The set of AI surfaces in production; the engineers who can act on each; the shift cadence (weekly, daily); the holiday and time-zone constraints.
Output. A named primary/backup roster per AI surface, with on-call hours, handoff discipline, and training requirements.
Owner. The engineering lead per AI surface defines the rotation; the platform team enforces the policy (no single-engineer rotation, defined backup, training before rotation entry).
The rotation plane is the apparatus's body. Chapter 04 develops it.
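The three policy rules the platform team enforces (no single-engineer rotation, a defined backup, training before rotation entry) are mechanical enough to check in code. The sketch below assumes a roster of one primary and one backup and measures training as a count of completed drills; the `Engineer` type, the names, and the `min_drills` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    drills_completed: int


def rotation_violations(primary, backup, min_drills=2):
    """Check one roster against the three policy rules.

    min_drills is an illustrative training threshold, not a recommendation.
    """
    violations = []
    if backup is None:
        violations.append("no defined backup")
    elif backup.name == primary.name:
        violations.append("single-engineer rotation")
    # De-duplicate by name so one person is only reported once.
    roster = {e.name: e for e in (primary, backup) if e is not None}
    for eng in roster.values():
        if eng.drills_completed < min_drills:
            violations.append(f"{eng.name} has not completed training")
    return violations


asha = Engineer("asha", drills_completed=3)
ravi = Engineer("ravi", drills_completed=1)

print(rotation_violations(asha, ravi))  # ['ravi has not completed training']
print(rotation_violations(asha, None))  # ['no defined backup']
print(rotation_violations(asha, asha))  # ['single-engineer rotation']
```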
Surface 3: Runbook plane
Input. The catalogued failure families; the AI surface's specific systems (model, prompt, retrieval, tools, gateway); the rollback hooks.
Output. A set of versioned runbook documents, each scoped to one failure shape, each tested in a drill.
Owner. The runbook author is named in the document; the platform team enforces the freshness gate.
The runbook plane is the apparatus's hands. Chapter 05 develops it.
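A freshness gate can be as small as a date comparison. The sketch below assumes each runbook records a last-validated date and that the team has chosen a freshness budget in days; the 90-day default, the dictionary keys, and the sample entries are assumptions, not values from this module.

```python
from datetime import date, timedelta


def stale_runbooks(runbooks, today, budget_days=90):
    """Return names of runbooks last validated more than budget_days ago."""
    cutoff = today - timedelta(days=budget_days)
    return [rb["name"] for rb in runbooks if rb["last_validated"] < cutoff]


# Made-up entries; a real gate would read these from the versioned repo.
runbooks = [
    {"name": "eval-drop", "author": "asha", "last_validated": date(2025, 5, 10)},
    {"name": "cost-anomaly", "author": "ravi", "last_validated": date(2025, 1, 15)},
]

stale = stale_runbooks(runbooks, today=date(2025, 6, 1))
print(stale)  # ['cost-anomaly']
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the gate deterministic and easy to test.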
Surface 4: Escalation plane
Input. The on-call's reach (primary, backup); the lead's reach; the specialist's reach (model owner, prompt owner, retrieval owner, security on-call, legal on-call); the provider's reach.
Output. A directed graph from page received to specialist resolved, with response-time SLOs at each hop.
Owner. The platform team owns the graph; each named hop owner has signed off on their response-time SLO.
The escalation plane is the apparatus's reach. Chapter 06 develops it.
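To make the graph concrete, the sketch below stores each hop with its response-time SLO and sums the SLOs along a path, which gives the worst-case time to reach a given role. It simplifies the directed graph to a linear chain with one successor per hop, treats each number as a per-hop SLO, and uses illustrative hop names and minute values throughout.

```python
# Each hop: (from, to, response-time SLO in minutes). All values are illustrative.
HOPS = [
    ("on-call", "backup", 10),
    ("backup", "feature lead", 15),
    ("feature lead", "specialist", 20),
    ("specialist", "provider on-call", 30),
]


def worst_case_minutes(hops, start, target):
    """Sum the hop SLOs along the chain from start to target."""
    next_hop = {src: (dst, slo) for src, dst, slo in hops}
    total, node, seen = 0, start, {start}
    while node != target:
        if node not in next_hop:
            raise ValueError(f"no escalation path from {start!r} to {target!r}")
        node, slo = next_hop[node]
        if node in seen:
            raise ValueError("escalation graph contains a cycle")
        seen.add(node)
        total += slo
    return total


print(worst_case_minutes(HOPS, "on-call", "specialist"))        # 45
print(worst_case_minutes(HOPS, "on-call", "provider on-call"))  # 75
```

Comparing the summed figure with the time-to-contain SLO is a quick sanity check: if the path to the specialist already consumes the whole budget, the hop SLOs and the apparatus SLO cannot both hold.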
Surface 5: Postmortem plane
Input. Closed incidents from any surface.
Output. A populated postmortem document with five mandatory fields: cause, blast radius, eval delta, follow-up actions with owners and dates, alert-or-runbook-or-drill changes.
Owner. The incident commander writes the postmortem; the platform team enforces the template and the follow-up closure rate.
The postmortem plane is the apparatus's memory. Chapter 09 develops it.
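Enforcing the template can start as a check that the five mandatory fields are populated and that every follow-up has an owner and a date. In the sketch below the five fields come from the template described above, while the key names, the dictionary layout, and the sample incident are assumptions.

```python
# Key names are assumptions; the five fields themselves come from the template.
REQUIRED_FIELDS = (
    "cause",
    "blast_radius",
    "eval_delta",
    "follow_ups",
    "apparatus_changes",  # alert, runbook, or drill changes
)


def postmortem_gaps(doc):
    """Return the reasons a postmortem fails the template check (empty if none)."""
    gaps = [f"missing {name}" for name in REQUIRED_FIELDS if not doc.get(name)]
    for i, item in enumerate(doc.get("follow_ups") or []):
        if not item.get("owner") or not item.get("due"):
            gaps.append(f"follow_ups[{i}] needs an owner and a date")
    return gaps


# A made-up draft with two gaps: a blank eval delta and an undated follow-up.
draft = {
    "cause": "prompt deploy removed a refusal instruction",
    "blast_radius": "one tenant, 40 minutes",
    "eval_delta": "",
    "follow_ups": [{"action": "add regression eval", "owner": "asha", "due": None}],
    "apparatus_changes": ["new runbook step for prompt rollback"],
}

print(postmortem_gaps(draft))
# ['missing eval_delta', 'follow_ups[0] needs an owner and a date']
```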
Surface 6: Drill plane
Input. The drill calendar; the scenario library; the participants.
Output. Recurring drill exercises with named scenarios, observed outcomes, and a readiness score that feeds back into the apparatus design.
Owner. The platform team owns the calendar; the rotation teams participate.
The drill plane is the apparatus's gym. Chapter 10 develops it.
The apparatus as a service diagram
                +----------------------+
                |    Signals layer     |
                |   (evals, gateway,   |
                |   cost, safety, UX)  |
                +----------+-----------+
                           |
                           v
                +----------+-----------+
                |   Alert plane (1)    |
                |  paging conditions   |
                +----------+-----------+
                           |
                 page -----+----- ticket
                    |             |
                    v             v
    +---------------+------+ +----+----------+
    |  Rotation plane (2)  | |    Backlog    |
    |  primary + backup    | |   apparatus   |
    |  + training          | | improvements  |
    +---------+------------+ +---------------+
              |
              | uses
              v
    +---------+------------+  escalates if needed
    |  Runbook plane (3)   |--------------------+
    |  executable docs     |                    |
    +----------------------+                    v
                                    +-----------+----------+
                                    | Escalation plane (4) |
                                    | graph to specialists |
                                    +-----------+----------+
                                                |
                                                v
                                    +-----------+----------+
                                    | Postmortem plane (5) |
                                    | capture + follow-up  |
                                    +-----------+----------+
                                                |
                                                v
                                    +-----------+----------+
                                    |   Drill plane (6)    |
                                    |   exercise + score   |
                                    +----------------------+
The signals layer is upstream of the apparatus; the gateway, evals, telemetry, and product surfaces produce signals that the alert plane consumes. The flow from alert to drill is the live-incident path. The drill loop closes back into apparatus improvements.
A worked example: building the apparatus for one feature
A Pune SaaS company is shipping its first agent — an account-management assistant. The platform engineer is designing the apparatus for this single feature, end-to-end. Walking through each surface:
Alert plane. Five paging conditions wired: (a) eval-on-traffic score drops by 5% over a rolling 1-hour window, (b) provider drift detector at the gateway fires, (c) tenant cost anomaly at 3σ from rolling baseline, (d) safety classifier confirms a violation on production output, (e) feedback signal anomaly within 30 minutes of a prompt or model deploy. Each has severity, payload schema, runbook link, owning team.
Rotation plane. Primary and backup pair from the account-management agent team; one-week shifts; handoff every Monday at 10:00; training requirement of two drill participations before entering rotation.
Runbook plane. Five runbook cards — one per paging condition. Each lives in a versioned repo, has an "author," a "last validated" date, and steps that link to executable kill paths in the platform's tooling.
Escalation plane. Graph defined: on-call → backup (10 min) → feature lead (15 min) → relevant specialist (20 min) → provider on-call (30 min if needed). Each named hop has signed off on the SLO.
Postmortem plane. Template adopted. The first postmortem after launch is reviewed by the platform team to verify the eval-delta and follow-up fields are populated.
Drill plane. Monthly drill calendar with rotating scenarios. Across the first three months, the drills exercise one scenario from each failure family.
By month three the apparatus exists, has fired on real signals, has been exercised in drills, and is producing apparatus improvements via postmortem follow-ups. The agent's mean time to contain has been measured and trended.
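Paging condition (c) in the alert plane above, a tenant cost anomaly at 3σ from a rolling baseline, reduces to a few lines of arithmetic. The sketch below assumes the caller maintains the rolling window of recent cost samples, treats deviation in either direction as anomalous, and uses made-up cost figures; the function name and the fallback rules for short or flat baselines are illustrative choices.

```python
from statistics import mean, stdev


def is_cost_anomaly(baseline, latest, sigmas=3.0):
    """True if latest deviates from the baseline mean by more than sigmas * stdev."""
    if len(baseline) < 2:
        return False  # too little history to estimate spread
    mu, sd = mean(baseline), stdev(baseline)
    if sd == 0:
        return latest != mu  # flat baseline: any change counts
    return abs(latest - mu) > sigmas * sd


# Made-up hourly cost figures for one tenant (the rolling baseline window).
window = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]

print(is_cost_anomaly(window, 11.5))  # False
print(is_cost_anomaly(window, 12.5))  # True
```

Here the baseline mean is 10.0 and the sample standard deviation is about 0.71, so the paging threshold sits roughly 2.1 away from the mean on either side.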
Why all six surfaces are required
A common temptation is to ship the apparatus partially: "we'll get the alerts wired first, runbooks can come later." Each surface, shipped without the one it depends on, has a known failure mode:
- Alerts without runbooks — the on-call is paged with no procedure; the time to context is high.
- Runbooks without rotation training — the runbook exists; the on-call has never seen it before being paged.
- Rotation without escalation — the on-call is paged; the runbook says "escalate"; the path is undefined.
- Escalation without postmortem — the incident is resolved; nothing changes; the next incident is the same.
- Postmortem without drill — the follow-ups are written; never validated; the apparatus's behaviour on the next real incident is unknown.
- Drill without alerts wired correctly — the apparatus is exercised on synthetic scenarios that do not match what production produces.
The apparatus's value is multiplicative across surfaces. Three out of six delivers far less than half the value, because the chain breaks at the missing link.
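One way to see the difference between the additive and the chain view is to give each surface an effectiveness score between 0 and 1 and compare an average with a product. The scores below are invented purely for illustration (0.9 for a built surface, 0.3 for one the on-call has to improvise), and the product is only a toy model of the chain, not a measurement method from this module.

```python
from math import prod

SURFACES = ["alerts", "rotations", "runbooks", "escalations", "postmortems", "drills"]


def additive_value(scores):
    """Average of per-surface effectiveness: the intuitive but misleading model."""
    return sum(scores.values()) / len(scores)


def multiplicative_value(scores):
    """Product of per-surface effectiveness: a weak link drags the whole chain."""
    return prod(scores.values())


# Illustrative scores in [0, 1]: 0.9 for a built surface, 0.3 for an improvised one.
full = {s: 0.9 for s in SURFACES}
half = dict(full, escalations=0.3, postmortems=0.3, drills=0.3)

for label, scores in (("six of six", full), ("three of six", half)):
    print(label, round(additive_value(scores), 2), round(multiplicative_value(scores), 3))
```

Under these made-up scores the additive model rates three of six at two-thirds of the full apparatus (0.6 against 0.9), while the multiplicative model rates it at under 4% (about 0.02 against 0.53).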
Operational signals
Healthy. All six surfaces exist for each shipped AI feature. Alert coverage matches the failure-family matrix. Runbooks are within their freshness budget. Drills are on cadence. Postmortem follow-ups are closing within their SLO.
First degrading metric. A new AI feature ships and the surface checklist is incomplete. The platform team has not enforced the gate. The next incident on that feature exposes the gap.
Misleading metric. Apparatus document count. A team can have a wiki full of runbook drafts and still have an apparatus that does not fire. The metric to watch is whether the apparatus produces a competent response to a real or drilled signal.
Expert graph. The matrix of AI features × surfaces, with cell colour reflecting maturity (red, yellow, green). The aggregate maturity score over time trends apparatus health.
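The features × surfaces matrix is small enough to keep as plain data. The sketch below assumes a simple scoring of red = 0, yellow = 1, green = 2 and reports the aggregate as a fraction of the maximum; the scoring rule, the feature names, and the cell colours are all hypothetical.

```python
# Assumed scoring: red = 0, yellow = 1, green = 2.
POINTS = {"red": 0, "yellow": 1, "green": 2}

matrix = {
    "account-agent": {
        "alerts": "green", "rotations": "green", "runbooks": "green",
        "escalations": "yellow", "postmortems": "yellow", "drills": "red",
    },
    "search-summaries": {
        "alerts": "green", "rotations": "green", "runbooks": "green",
        "escalations": "green", "postmortems": "green", "drills": "green",
    },
}


def aggregate_maturity(matrix):
    """Fraction of the maximum possible score across all feature x surface cells."""
    cells = [colour for row in matrix.values() for colour in row.values()]
    return sum(POINTS[c] for c in cells) / (max(POINTS.values()) * len(cells))


def red_cells(matrix):
    """The (feature, surface) pairs that need attention first."""
    return [(f, s) for f, row in matrix.items() for s, c in row.items() if c == "red"]


print(round(aggregate_maturity(matrix), 2))  # 0.83
print(red_cells(matrix))                     # [('account-agent', 'drills')]
```

Recording the aggregate at each review gives the trend line; the red cells give the next piece of apparatus work.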
Boundary of applicability
Strong fit. Multi-feature, multi-team AI platforms. The full six-surface apparatus is justified per feature and shared across the platform.
Pathology. A small team treating apparatus design as a one-time build. The apparatus is a living system; surfaces decay (runbooks rot, drills lapse, rotations lose context). A team that builds the apparatus and walks away will find it gone in 18 months.
Scale limit. Very large platforms may federate the apparatus — central teams own infrastructure (alert wiring, drill calendar, postmortem template); feature teams own content (thresholds, runbook text, scenario participation). The six surfaces remain; their ownership distributes.
Failure-prone assumption
The seductive wrong belief: the apparatus is a one-time build that delivers ongoing value. It is not. The apparatus is more like a CI pipeline than a database — it works when actively maintained and degrades silently when not. Runbooks rot the day after they are written; rotations lose context after a quarter; drills lapse if not scheduled; postmortem follow-ups slip if not enforced.
The correct belief: the apparatus is a standing service with its own SLOs and its own engineering work. Treating it as platform code with maintenance cost is the only way it stays alive.
Where this appears in production
- A growth-stage SaaS treats apparatus as a Q1 OKR; ships all six surfaces; declares done; by Q4 the runbooks are stale and the drill calendar is empty.
- A large bank's AI platform federates: the central platform team owns wiring and templates; each line of business owns its rotation and runbook content.
- A telecom AI ships apparatus alongside the first feature; the same pattern is reused for the next four features; apparatus maturity is high across the portfolio.
- A media AI ships alerts first; runbooks lag by three months; the first real incident on a new feature has no procedure and takes 90 minutes longer than it should.
- A fintech writes the postmortem template before the first incident; follow-ups are tracked from day one; year-on-year apparatus maturity improves.
- An e-commerce platform federates poorly — feature teams own everything; central team owns nothing; apparatus consistency degrades across features.
- A payments AI treats apparatus as separate from feature work; apparatus is always "next sprint" and never ships.
- A government AI budgets apparatus engineering work alongside feature engineering; apparatus health is reported in the same review as feature health.
- A retail AI has the alert and runbook planes mature; rotation is generic SRE; first real AI incident is escalated through wrong channels.
- A logistics AI uses a vendor for paging (PagerDuty), in-house for runbooks (a versioned repo), and Notion for postmortems; the integration is brittle but functional.
- An ad-tech AI runs drills monthly; the readiness score is on the team's leadership dashboard; investment in apparatus is visible to leaders.
- A B2B platform has six AI features but only three apparatus instances; some features share apparatus with others poorly suited to their failure shapes.
- A travel AI ships an apparatus before launch; the first three drills surface enough gaps that launch is delayed by two weeks; in retrospect, the cheapest two weeks the team has bought.
- A staffing AI has all six surfaces nominally; the actual response to a real incident reveals that the runbook for the fired paging condition was missing.
- A medical AI treats apparatus as a regulatory artefact; the apparatus is audit-ready but operationally inert; real incidents take longer than they should.
- A coding assistant has apparatus integrated with its CI/CD; runbook updates ship through the same code review as feature changes.
- A real-estate AI has a strong alert plane and a weak rotation plane; pages go out, but to the wrong engineers; mean time to context is 25 minutes.
- An insurance AI runs quarterly apparatus reviews; gaps surface; remediation is tracked.
- A legal AI has no postmortem template; the same incident has happened three times; each is treated as new.
- A traffic AI ships apparatus as part of the production-readiness review (PRR); features without apparatus do not ship.
Recall / checkpoint
- Name the six surfaces of the AI on-call apparatus.
- For each surface, name the input and the output.
- What is the typical pattern when teams ship the apparatus partially, and what is the consequence?
- Why is alert-plane wiring without runbook authoring a recipe for slow incident response?
- What is the difference between treating the apparatus as a one-time build and as a standing service?
- What metric distinguishes apparatus document count from apparatus health?
- How does federation help large platforms, and what does it require of the central team?
Interview Q&A
Q1. A team has wired four AI-specific alerts and written three runbooks. The lead says "we have an on-call apparatus." Walk through your assessment.
Two of six surfaces are partially present. The team has alerts and runbooks; the rotation, escalation, postmortem, and drill planes are unaddressed. The likely failure: on the next real incident, the on-call has the alert and a runbook, but no defined backup, no escalation graph, no template to capture lessons, and no drill that has rehearsed the apparatus. The mean time to contain will be high; the incident will not produce apparatus improvements; the same shape will recur. The assessment is that apparatus value is multiplicative, not additive: two of six is worth much less than a third.
Common wrong answer to avoid: "they have a start." A start without the rest of the chain often performs worse than no apparatus at all, because the partial response masks the gap.
Q2. Walk through the input, output, and owner for the escalation plane.
Input is the set of named hops: primary on-call, backup, feature lead, model owner, prompt owner, retrieval owner, security on-call, legal on-call, provider on-call. Output is a directed graph from page received to specialist resolved, with response-time SLOs at each hop. Owner is the platform team for the graph itself; each named hop owner has signed off on their response-time SLO. The graph is published, versioned, and exercised in drills. The drill is the validation that the SLOs are real.
Common wrong answer to avoid: "escalation is informal; engineers know who to ask." Informal escalation works at small scale and breaks the moment the team grows or the apparatus is exercised under load.
Q3. The apparatus is built; six months later it has degraded. What metrics catch the degradation, and what is the remediation?
Metrics: runbook freshness (days since last validated), drill participation (drills run vs. drills scheduled), postmortem follow-up closure rate, alert coverage matrix change vs. feature shipment count, mean time to context on real incidents. Remediation: apparatus engineering work scheduled alongside feature work; runbook freshness as a release gate; drill calendar with named owners; postmortem follow-ups tracked to closure with the same rigour as product backlogs.
Common wrong answer to avoid: "ask the team to be more diligent." The apparatus's degradation is a structural problem; willpower does not fix it.
Q4. The platform team owns the apparatus, but each feature team has its own. How do you reconcile?
Federation. The central platform team owns the apparatus infrastructure: alert wiring patterns, runbook templates, drill calendar mechanics, postmortem templates. The feature teams own the content: their thresholds, their runbook text, their scenarios, their postmortems. The central team enforces the policy (no feature ships without surface coverage); the feature teams produce the surface content. This avoids both extremes: a central team that does not know the feature, and a feature team that reinvents apparatus per ship.
Common wrong answer to avoid: "central team owns everything" or "feature teams own everything." Both extremes have known failure modes.
Q5. Why is the drill plane non-optional, and what happens when teams skip it?
The drill plane is the only surface that exercises the rest of the apparatus before a real incident. Without drills, the apparatus's documented design and its real behaviour can diverge silently: a runbook step that does not work, an escalation hop whose owner has changed teams, an alert payload that misses critical context. Teams that skip drills are running unvalidated apparatus; the first real incident is the first validation. The cost of finding apparatus gaps in a real incident is much higher than finding them in a drill.
Common wrong answer to avoid: "drills are training, not validation." Drills are both, but skipping the validation is the more expensive omission.
Q6. A team is shipping a new feature and asks whether they can defer apparatus build for one quarter. What is your answer?
No, with explanation. Apparatus deferral is the most common failure mode: every feature ships before apparatus, and apparatus is "always next quarter." The discipline is to make apparatus a production-readiness review (PRR) gate. The feature ships when six surfaces are present, even minimally. Minimal apparatus is acceptable; absent apparatus is not. The cost of doing apparatus alongside the first ship is much lower than retro-fitting it after the first incident.
Common wrong answer to avoid: "they can defer if they accept the risk." The team accepting the risk usually does not own the risk; users and the next on-call do.
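The gate in Q6 can be sketched as a function that blocks a ship on any absent surface while letting minimal ones through. The status labels ("absent", "minimal", "mature") and the dictionary layout are assumptions for illustration; the rule itself, minimal passes and absent blocks, is the one stated in the answer.

```python
SURFACES = ("alerts", "rotations", "runbooks", "escalations", "postmortems", "drills")


def prr_gate(status):
    """status maps surface name -> 'absent', 'minimal', or 'mature' (assumed labels).

    Returns (may_ship, blocking_surfaces). Minimal passes; absent blocks.
    """
    blocking = [s for s in SURFACES if status.get(s, "absent") == "absent"]
    return (not blocking, blocking)


new_feature = {
    "alerts": "mature",
    "rotations": "minimal",
    "runbooks": "minimal",
    "escalations": "minimal",
    "postmortems": "minimal",
    # no entry for drills: treated as absent
}

print(prr_gate(new_feature))  # (False, ['drills'])
new_feature["drills"] = "minimal"
print(prr_gate(new_feature))  # (True, [])
```

Treating a missing entry as absent means a team cannot pass the gate by leaving a surface off the checklist.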
Design / debug exercise (10 minutes)
Modelled example. Take the worked example in the chapter (the account-management agent). Verify each of the six surfaces has input, output, owner, and at least one named artefact. Identify which surfaces would be hardest to validate without a drill, and design a drill that would validate them.
Your turn. Pick one AI feature you own or know well. Build the six-surface table for it. Fill in (a) what exists today, (b) what exists in partial form, (c) what is absent. Estimate the surface-area gap relative to the feature's user-facing risk.
Reproduce from memory. Draw the apparatus diagram from this chapter, with the six surfaces and the signal flow. The signal that you have internalised this chapter is that you can name the six surfaces, their owners, and the value of each in the chain — without rereading.
Operational memory
This chapter explained the AI on-call apparatus as a single architecture with six named surfaces — alerts, rotations, runbooks, escalations, postmortems, drills. The important idea is that the apparatus is multiplicative across surfaces, not additive; missing one surface degrades the whole chain.
You learned to name each surface, its input, its output, its owner, and the failure mode of skipping it. That solves the opening failure because the rest of the module develops each surface in turn; once you can name them, you can build them.
Carry this diagnostic forward: when someone says "we have an on-call apparatus," ask which of the six surfaces they have, and which they do not. The honest answer is usually fewer than they thought.
Remember:
- Six surfaces: alerts, rotations, runbooks, escalations, postmortems, drills.
- Each surface has an input, an output, and a named owner.
- Apparatus value is multiplicative, not additive.
- The apparatus is a standing service, not a one-time build.
- Federation works at scale; central owns infrastructure, features own content.
Bridge. The apparatus is the architecture. The alert plane is its eyes — and the first surface to develop in detail, because every other surface activates from a page. The next chapter is the discipline of designing AI-specific alerts. → 03-alert-design-for-ai-systems.md