04. Rotation and ownership
~10 min read.
Alerts produce pages. The rotation plane is where those pages land: the standing roster of who is on call for which AI surface, with the context, training, and backup to act in the first ten minutes.
Continues from 03-alert-design-for-ai-systems.md. This chapter develops the rotation plane. Recurring concepts in bold: rotation roster, primary, backup, handoff discipline, training gate, time-zone fairness, single-engineer rotation.
A page that fires perfectly into a void is a wasted alert. The rotation plane is the discipline that ensures every paging condition has a defined, trained, on-shift engineer to receive it — and a defined backup when the primary cannot answer.
What an AI rotation is
An AI rotation is the named primary/backup pairing for an AI surface, with shift cadence, handoff discipline, and a training gate that prevents under-prepared engineers from entering the rotation.
Three properties matter more than for a classic SRE rotation:
- Per-surface, not per-team. A team owning four AI features may need four rotations (or four well-defined sub-rotations), because the context, runbooks, and escalation graph differ per feature. A single rotation across heterogeneous features dilutes the on-call's knowledge.
- Training-gated. Engineers do not enter the rotation by tenure. They enter after passing a training gate — typically two drill participations and a runbook walk-through with the previous primary.
- Time-zone-fair. AI features serving global users page at all hours. A rotation that puts the same engineers on-call every weekend or every night will burn them out and degrade apparatus health.
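Time-zone fairness can be checked mechanically rather than by feel. The following is a minimal sketch, assuming a hypothetical schedule of `(engineer, shift start in the engineer's local time)` pairs. The definition of an "unsociable" shift (weekend, or starting between 22:00 and 06:00) and the `max_spread` threshold are illustrative assumptions, not values from this chapter.

```python
from collections import Counter
from datetime import datetime

def is_unsociable(local_start: datetime) -> bool:
    """Weekend or night start, in the engineer's local time (assumed definition)."""
    return local_start.weekday() >= 5 or local_start.hour >= 22 or local_start.hour < 6

def unsociable_load(shifts: list[tuple[str, datetime]]) -> Counter:
    """Count unsociable shifts per engineer, keeping engineers with zero."""
    load = Counter({engineer: 0 for engineer, _ in shifts})
    for engineer, local_start in shifts:
        if is_unsociable(local_start):
            load[engineer] += 1
    return load

def is_fair(load: Counter, max_spread: int = 1) -> bool:
    """Fair if the busiest and least busy engineers differ by at most max_spread."""
    return max(load.values()) - min(load.values()) <= max_spread

# Hypothetical schedule: one engineer takes every weekend shift.
shifts = [
    ("priya", datetime(2026, 5, 30, 9)),  # Saturday
    ("priya", datetime(2026, 5, 31, 9)),  # Sunday
    ("priya", datetime(2026, 6, 6, 9)),   # Saturday
    ("rahul", datetime(2026, 6, 2, 9)),   # Tuesday, daytime
]
load = unsociable_load(shifts)
print(load["priya"], load["rahul"], is_fair(load))  # 3 0 False
```

Running a check like this when the schedule is drafted surfaces the imbalance before anyone has to raise it in a survey.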
The roster shape
A minimal roster has:
- Primary — the engineer who receives the first page.
- Backup — the engineer paged if the primary does not acknowledge within the SLO (typically 5-10 minutes).
- Lead-on-call — the engineering lead reachable for severity decisions and resource unblocking.
- Subject-matter on-call — the prompt owner, model owner, or retrieval owner reachable as the next escalation hop.
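The primary-to-backup hop can be written as a small decision function. This is a sketch only: `ACK_SLO` is set to five minutes as an assumption (the text gives 5-10 minutes as typical), and `who_to_page` is a hypothetical helper, not the API of any particular paging system.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_SLO = timedelta(minutes=5)  # assumed value; tune per surface and severity

def who_to_page(fired_at: datetime, now: datetime,
                acked_at: Optional[datetime]) -> Optional[str]:
    """Return the role that should be holding the page, or None once it is acknowledged."""
    if acked_at is not None:
        return None
    if now - fired_at < ACK_SLO:
        return "primary"
    return "backup"

fired = datetime(2026, 5, 26, 3, 0)
print(who_to_page(fired, fired + timedelta(minutes=2), None))  # primary
print(who_to_page(fired, fired + timedelta(minutes=6), None))  # backup
```

The lead and the subject-matter on-call are left out on purpose: in the roster above they are reached for decisions and expertise, not by an acknowledgement timer.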
For a small team, the primary may also be the subject-matter expert; for a large platform, these are different people. The rotation explicitly names each role; "ask the team Slack channel" is not a role.
A common shape:
Week of 2026-05-26:
Primary: Priya Iyer (account-management agent)
Backup: Rahul Mehta (account-management agent)
Lead: Anjali Singh (lead for the agent team)
Prompt SME: Arjun Rao (prompt owner)
Model SME: Gateway team on-call (per gateway rotation)
Retrieval SME: Data platform on-call (per data rotation)
Provider escalation: Anthropic enterprise channel
The roster is published, versioned, and accessible from the alert payload. The on-call should not have to look it up under stress.
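One way to make the roster reachable from the alert payload is to embed it when the page is built. The sketch below assumes a hypothetical `Roster` record and a plain-dict alert payload; the field names and the `quality-regression` alert are illustrative, and a real paging system will have its own payload schema.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Roster:
    surface: str
    week_of: str
    primary: str
    backup: str
    lead: str
    smes: dict  # SME kind -> named person or named rotation

    def unnamed_roles(self) -> list:
        """Return roles left blank; publish the roster only when this is empty."""
        roles = {"primary": self.primary, "backup": self.backup, "lead": self.lead}
        roles.update({f"{kind} SME": who for kind, who in self.smes.items()})
        return [role for role, who in roles.items() if not who.strip()]

def attach_roster(alert: dict, roster: Roster) -> dict:
    """Return a copy of the alert payload with the current roster embedded."""
    return {**alert, "roster": asdict(roster)}

roster = Roster(
    surface="account-management agent",
    week_of="2026-05-26",
    primary="Priya Iyer",
    backup="Rahul Mehta",
    lead="Anjali Singh",
    smes={"prompt": "Arjun Rao",
          "model": "Gateway team on-call",
          "retrieval": "Data platform on-call"},
)
page = attach_roster({"alert": "quality-regression", "severity": "P1"}, roster)
print(roster.unnamed_roles())     # []
print(page["roster"]["backup"])   # Rahul Mehta
```

Keeping the roster as versioned data rather than a wiki page also means a blank role can fail a check before the week starts.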
The handoff discipline
Rotations change hands. The handoff is where context is lost or preserved.
Minimum handoff content:
- Open incidents. Anything fired in the last week that is not fully closed (alerts cleared but postmortems open, follow-ups in flight).
- Deploys in the window. Prompt or model deploys in the last 48-72 hours; the deploy-anchored windows may still be active.
- Known noisy alerts. Alerts that have been flagged for tuning but not yet retuned.
- Active drills. Any drill the new primary is expected to lead or participate in.
- Watch items. Soft signals that have not crossed paging thresholds but the previous primary noticed.
The handoff happens live (synchronous meeting) plus written (handoff document). Written-only handoffs lose context; live-only handoffs lose persistence.
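The minimum handoff content lends itself to a completeness check. The sketch below assumes the written handoff is stored as a dictionary with hypothetical section keys and a `live_meeting_held` flag; none of these names come from a real tool.

```python
REQUIRED_SECTIONS = (
    "open_incidents",
    "recent_deploys",
    "noisy_alerts",
    "active_drills",
    "watch_items",
)

def handoff_gaps(handoff: dict) -> list:
    """List what is missing from a handoff; an empty list means it is complete."""
    gaps = [f"missing section: {name}"
            for name in REQUIRED_SECTIONS if name not in handoff]
    if not handoff.get("live_meeting_held", False):
        gaps.append("no live handoff recorded")
    return gaps

# Hypothetical written-only handoff that also skipped the watch items.
handoff = {
    "open_incidents": ["postmortem open for last Tuesday's alert"],
    "recent_deploys": [],   # explicitly empty: nothing deployed in the window
    "noisy_alerts": [],
    "active_drills": [],
}
print(handoff_gaps(handoff))
# ['missing section: watch_items', 'no live handoff recorded']
```

An empty list counts as filled in, so "nothing to report" stays distinguishable from "nobody looked".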
The training gate
Engineers entering a rotation should have passed:
- Two drill participations. At least one as observer, one as primary or backup.
- A runbook walk-through. The new engineer reads each runbook for the surface and walks the previous primary through their understanding.
- A tooling check. The engineer has access to: paging system, gateway dashboards, deploy log, rollback commands, postmortem template, escalation directory.
This is not bureaucracy. Each gap (no drills, no runbook reading, no tooling access) maps directly to a class of incident the rotation will mishandle. The gate is the cheapest available insurance.
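The three gate conditions can be expressed as one check that returns the reasons a candidate is not yet ready. This is a sketch under assumptions: the `Candidate` record and its field names are hypothetical, and the tool names simply mirror the list above.

```python
from dataclasses import dataclass, field

REQUIRED_TOOLS = frozenset({
    "paging system", "gateway dashboards", "deploy log",
    "rollback commands", "postmortem template", "escalation directory",
})

@dataclass
class Candidate:
    name: str
    drill_roles: list = field(default_factory=list)  # "observer", "primary", "backup"
    runbook_walkthrough_done: bool = False
    tool_access: set = field(default_factory=set)

def gate_failures(c: Candidate) -> list:
    """Return the unmet gate conditions; an empty list means the gate is passed."""
    failures = []
    if "observer" not in c.drill_roles:
        failures.append("no drill as observer")
    if not {"primary", "backup"} & set(c.drill_roles):
        failures.append("no drill as primary or backup")
    if not c.runbook_walkthrough_done:
        failures.append("runbook walk-through not done")
    missing = sorted(REQUIRED_TOOLS - c.tool_access)
    if missing:
        failures.append("missing tool access: " + ", ".join(missing))
    return failures

candidate = Candidate(
    name="new engineer",
    drill_roles=["observer", "observer"],
    runbook_walkthrough_done=True,
    tool_access=set(REQUIRED_TOOLS) - {"rollback commands"},
)
print(gate_failures(candidate))
# ['no drill as primary or backup', 'missing tool access: rollback commands']
```

Two observer slots do not pass: the check requires one drill in each kind of role, which together imply the two participations.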
A worked example: the multi-surface team
A Bengaluru SaaS company has one platform team owning four AI features: an account assistant, a search reranker, a document extractor, and an internal coding helper. The naive design puts all four under one rotation; the platform team takes turns being on-call for everything.
The naive design fails in two ways. First, context is shallow: the primary on a Sunday is paged about the document extractor, has not touched its prompts in six months, and finds that the runbook references systems they have not used. Mean time to context is 35 minutes. Second, page volume is lopsided: the document extractor pages twice a quarter while the other three page twice a week, so the on-call engineer's mental model is dominated by the noisier features.
The redesign:
- Tier 1 rotation. Account assistant, search reranker, coding helper — high page volume, similar shape. One shared rotation across the engineers who know all three.
- Tier 2 rotation. Document extractor — low page volume, distinct shape. Two engineers who own it cover it; the lead is the backup.
- Lead-on-call. A single rotation across the team's leads, used as second-line escalation for both tiers.
Mean time to context drops to 8 minutes; engineers feel the rotation is fair; the document extractor's quality improves because its owners now see its alerts directly.
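The redesign amounts to a routing table from surface to rotation. A minimal sketch, assuming hypothetical surface identifiers and rotation names for the four features in this example:

```python
ROTATION_FOR_SURFACE = {
    "account-assistant": "tier-1",
    "search-reranker": "tier-1",
    "coding-helper": "tier-1",
    "document-extractor": "tier-2",
}

def rotation_for(surface: str) -> str:
    """Return the rotation that owns a surface; fail loudly if none does."""
    try:
        return ROTATION_FOR_SURFACE[surface]
    except KeyError:
        raise LookupError(f"no rotation owns surface {surface!r}") from None

print(rotation_for("document-extractor"))  # tier-2
```

Raising on an unknown surface, instead of falling back to a default rotation, keeps a newly launched feature from quietly paging engineers who have never seen it.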
Single-engineer rotation: the failure pattern to refuse
A rotation with one engineer (no backup) is not a rotation; it is a single point of failure. The engineer's holiday, illness, or unresponsive phone produces an immediate apparatus failure. The platform team must refuse single-engineer rotations even when the feature team protests it is "just a small feature."
Acceptable patterns when staffing is thin:
- Primary plus an explicit backup. The backup is a real human with the access and the runbook; they may need a brief training before the rotation starts.
- Shared rotation with a sibling team. Two small teams pool their engineers; each rotation has primary from one team, backup from the other.
- Platform-on-call as backup. The platform team's standing rotation is the backup for thin feature rotations.
The pattern to refuse is "Priya is on-call for this feature, full stop." It is fragile, and the fragility stays invisible until the first page goes unanswered.
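The refusal can be encoded as a launch check. The sketch below is illustrative: `coverage_verdict` is a hypothetical helper, and `external_backup` stands for either a sibling team or the platform on-call from the list of acceptable patterns.

```python
from typing import Optional

def coverage_verdict(engineers: set, external_backup: Optional[str] = None) -> str:
    """Accept a rotation only if someone other than the primary can take the page."""
    if len(engineers) >= 2:
        return "ok: primary and backup from the same team"
    if len(engineers) == 1 and external_backup:
        return f"ok: backup provided by {external_backup}"
    return "refuse: single point of failure"

print(coverage_verdict({"priya"}))                        # refuse: single point of failure
print(coverage_verdict({"priya"}, "platform on-call"))    # ok: backup provided by platform on-call
print(coverage_verdict({"priya", "rahul"}))               # ok: primary and backup from the same team
```

The check counts people only. Whether the backup has the access and the runbook is a training-gate question and needs to be verified separately.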
Operational signals
Healthy. Every AI surface has a primary, a backup, a lead, and named SME contacts. The training gate is enforced. Handoffs include a live and a written component. Page-acknowledgement times are within SLO.
First degrading metric. Page-acknowledgement time creeping up. The primary or backup is slow to respond; the apparatus is degrading.
Misleading metric. Number of rotations. A team can have many rotations with low quality; the metric to watch is whether each rotation has a backup, training-gated entry, and clean handoffs.
Expert graph. Per-rotation health: page-acknowledgement time, drill participation rate, engineer satisfaction (surveyed quarterly). The combination catches rotations that are nominally healthy but degrading.
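Acknowledgement time "creeping up" can be detected before the SLO is breached by comparing recent weeks with the weeks before them. This is a sketch with assumed parameters: a four-week window and a 25% rise threshold are illustrative choices, and the input is a hypothetical series of weekly median acknowledgement times in seconds.

```python
from statistics import median

def ack_time_creeping(weekly_median_ack_s: list, window: int = 4,
                      ratio: float = 1.25) -> bool:
    """True if the recent window's median exceeds the prior window's by `ratio`."""
    if len(weekly_median_ack_s) < 2 * window:
        return False  # not enough history to compare two windows
    recent = median(weekly_median_ack_s[-window:])
    baseline = median(weekly_median_ack_s[-2 * window:-window])
    return baseline > 0 and recent > ratio * baseline

# Every week is still inside a 5-minute (300 s) SLO, yet the trend is clear.
history = [60, 62, 58, 61, 80, 85, 90, 95]
print(ack_time_creeping(history))  # True
```

A flag here is a prompt to look at the other two signals, drill participation and satisfaction, rather than a verdict on its own.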
Boundary of applicability
Strong fit. Multi-feature, multi-engineer teams where AI surfaces have distinct shapes. The full rotation discipline pays for itself.
Pathology. A two-engineer team treating rotation as ceremony rather than load distribution. The discipline matters; the form may be lightweight. Document who is the primary today and who is the backup; that is enough.
Scale limit. Very large platforms (dozens of rotations) face the meta-problem of rotation health across the portfolio. The solution is a platform-team rotation health review monthly, with intervention on rotations that are degrading.
Failure-prone assumption
The seductive wrong belief: the on-call rotation can absorb new AI features without re-design. It cannot. Each new AI surface adds context, runbooks, deploys, and escalation paths that the existing rotation engineers may not have. The correct belief: rotation capacity scales with rotation training; adding a feature without training the rotation is adding load without adding capacity. Every new feature ships with a rotation training plan or the feature does not ship.
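The claim that capacity scales with training can be made concrete by counting only engineers trained on every surface the rotation covers. A minimal sketch, with hypothetical engineer and surface names:

```python
def effective_capacity(rotation: list, trained_on: dict, surfaces: set) -> int:
    """Count engineers in the rotation trained on every surface it covers."""
    return sum(1 for engineer in rotation
               if surfaces <= trained_on.get(engineer, set()))

rotation = ["priya", "rahul", "arjun"]
trained_on = {
    "priya": {"assistant", "reranker"},
    "rahul": {"assistant", "reranker"},
    "arjun": {"assistant"},
}

before = effective_capacity(rotation, trained_on, {"assistant", "reranker"})
after = effective_capacity(rotation, trained_on, {"assistant", "reranker", "extractor"})
print(before, after)  # 2 0
```

Headcount stays at three in both cases. Adding the third surface without training takes effective capacity from two to zero, which is the load-without-capacity situation described above.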
Where this appears in production
- A growth-stage SaaS runs one rotation across five AI features; mean time to context is 30+ minutes; splitting the rotation brings it down to 8 minutes.
- A bank uses a federated rotation: each line of business owns its primary; central platform on-call is the backup.
- A telecom AI enforces a training gate; engineers entering the rotation participate in two drills first; post-launch incidents are handled cleanly.
- A consumer chatbot had a single-engineer rotation; the engineer went on holiday; the next page rang into voicemail; the incident lasted four hours.
- A retail AI publishes the roster in the alert payload; on-call never has to look up the SME contact.
- A coding assistant team does live handoff every Monday plus a written summary; handoff document is referenced during the week's incidents.
- A travel platform has handoffs that are written-only; the new primary missed context on three open follow-ups; incidents took longer.
- A fintech has rotations per AI domain (lending, payments, fraud); engineers stay deep in one domain.
- A logistics AI lets the rotation roster drift out of sync with the team; new hires not added; departures not removed; pages reach the wrong people.
- A government AI ties rotation entry to a documented training checklist; the checklist is audited quarterly.
- A B2B SaaS had primaries paged at 03:00 every weekend; rotation rebalance distributed the load; engineer satisfaction recovered.
- A media AI uses a shared rotation across two small teams; load distribution is sustainable.
- A payments AI lead-on-call is a separate rotation from primary-on-call; the lead handles severity decisions and resource asks.
- A legal AI has a subject-matter on-call per legal domain (corporate, employment, IP); the SME hops are short.
- A healthtech AI has time-zone fairness as a stated rotation principle; engineers in different time zones rotate weekend slots fairly.
- A staffing AI has rotation health reported quarterly; degrading rotations are remediated by re-training or rebalancing.
- An ad-tech AI has page acknowledgement SLO at 5 minutes for P1; missed acknowledgements trigger backup; the apparatus enforces the SLO.
- A search-rerank service has the gateway team's rotation as backup for low-volume features; the cost is shared.
- A document AI rotation training includes a drill on a real archived incident; engineers see what the apparatus produces under load.
- A real-estate AI has rotation surveys quarterly; a rotation flagged as exhausting is rebalanced.
Recall / checkpoint
- Name the four roles in an AI rotation roster.
- What is the training gate, and what does it require?
- What is the handoff discipline, and why does it have both live and written components?
- Why is a single-engineer rotation a failure pattern?
- When does it make sense to split one rotation into two?
- What signal tells you a rotation is degrading?
- How does rotation capacity scale with new AI features?
Interview Q&A
Q1. A team has one rotation across four AI features; mean time to context is 30 minutes. Walk through the diagnosis and the fix.
The rotation is heterogeneous; the primary on any given day has shallow context on most features. The fix is to split into rotations that group features by shape and shared context. A high-page-volume tier shares one rotation across engineers who know all the features; a low-page-volume tier has its own rotation. Lead-on-call is a separate rotation that serves all tiers as second-line. After the split, the primary on-call has deep context for the features they cover, and mean time to context drops materially.
Common wrong answer to avoid: "everyone should know everything". That is possible at small scale and infeasible past a few features without quality degradation.
Q2. The team protests that they only have two engineers and cannot staff a rotation with a backup. What is your response?
A single-engineer rotation is not acceptable; the failure modes (holiday, illness, missed page) are silent until they happen. The patterns that work: a shared rotation with a sibling team, platform-team on-call as backup for the thin feature, or postponing the launch until staffing supports a rotation. The platform team should not allow the launch with a single-engineer rotation; the cost is silent and the recovery is expensive.
Common wrong answer to avoid: "we'll be careful". Carefulness is not a control the apparatus can enforce.
Q3. Walk through the training gate for entering a new AI rotation.
Two drill participations (one as observer, one as primary or backup). A runbook walk-through with the previous primary, in which the new engineer reads each runbook and explains their understanding. A tooling check confirming access to the paging system, dashboards, deploy log, rollback commands, postmortem template, and escalation directory. The gate is not bureaucracy; each gap maps directly to a class of incident the rotation will mishandle.
Common wrong answer to avoid: "we'll let them shadow for a week". Shadowing without structured exercises and tool access leaves real gaps.
Q4. How do you measure rotation health?
Page-acknowledgement time, drill participation rate, and engineer satisfaction (quarterly survey). The combination catches rotations that are nominally healthy but degrading. A rotation with rising acknowledgement times and a stable participation rate may indicate alert noise; one with stable times and falling satisfaction may indicate burnout. The metrics inform the apparatus design, not just the people.
Common wrong answer to avoid: "incident count". Incidents reflect the system being protected, not the rotation's health.
Q5. The handoff between primaries is written-only and incidents are slipping through the gap. What is the apparatus failure?
A written-only handoff loses the synchronous Q&A that catches edge cases: the new primary cannot ask "what about that alert from Tuesday that we kept open?" The fix is to add a live component (a 15-30 minute synchronous handoff at every rotation change) alongside the written summary. The written summary provides persistence; the live session provides comprehension. Both are needed.
Common wrong answer to avoid: "make the written summary longer". Comprehension does not scale linearly with prose volume.
Q6. Why is rotation capacity tied to training, not just to engineer count?
An untrained engineer in the rotation can answer the page but cannot act competently; they will spend the first 20 minutes asking questions, escalating, or guessing. The apparatus's effective capacity is the number of trained engineers, not the number of bodies. Each new AI feature is a training investment for the rotation; without that investment, the rotation has nominally more bodies but the same or lower effective capacity.
Common wrong answer to avoid: "we'll add an engineer to the rotation". Adding a body without training is adding load without capacity.
Design / debug exercise (10 minutes)
Modelled example. Walk through the worked example (the Bengaluru SaaS rotation split). Verify the redesigned rotation has primary, backup, lead-on-call, and SME contacts for each surface, with the training gate enforced and handoff discipline established.
Your turn. Pick a team or set of features. Draw the current rotation. Identify: single-engineer rotations, missing backups, ungated entries, written-only handoffs. The list is your next sprint of rotation work.
Reproduce from memory. Write the roster shape from this chapter (the four roles and what each does) without rereading. The signal of internalisation is that the four roles land in under two minutes with their owners and SLOs.
Operational memory
This chapter explained the rotation plane: named primary and backup per AI surface, training-gated entry, handoff discipline, and explicit refusal of single-engineer rotations. The important idea is that rotation capacity scales with training, not with body count, and that the rotation is the apparatus's body — the place where alerts become action.
You learned to design rotations per AI surface, enforce a training gate before entry, and run both live and written handoffs. That solves the opening failure because a page that lands in a competently staffed rotation produces action; a page that lands in an under-staffed rotation produces delay.
Carry this diagnostic forward: when a team adds an AI feature, ask "what is the training plan for the rotation?" If the answer is none, the feature is shipping with a coverage gap.
Remember:
- Per-surface rotations beat one-rotation-for-everything past a few features.
- Training-gated entry: two drills, a runbook walk-through, a tooling check.
- Live plus written handoff; never one without the other.
- Refuse single-engineer rotations; they are silent failure points.
- Rotation health is page-acknowledgement time plus drill participation plus engineer satisfaction.
Bridge. Pages land on trained engineers. Trained engineers act through runbooks. The next chapter is the runbook plane — how to author, version, and keep fresh the executable documents an on-call relies on. → 05-runbook-authoring.md