06. Escalation paths¶
~9 min read. The on-call has the alert, the rotation, and the runbook. When those are not enough, the escalation plane is what gets the right specialist on the call. The discipline is a named graph from on-call to provider, with SLOs at each hop.
Continues from 05-runbook-authoring.md. This chapter develops the escalation plane. Recurring concepts in bold: escalation graph, named hop, hop SLO, handover context, provider escalation, specialist on-call.
A runbook that ends with "escalate" is incomplete. The escalation plane is what makes "escalate" executable — a directed graph from the on-call to the engineer, lead, specialist, or provider who can resolve the incident.
What an escalation graph is¶
An escalation graph is a directed structure from primary on-call through backup, lead, named specialists, and provider contacts, with a response-time SLO at each hop and a handover-context contract.
Three properties matter:
- Named. Each hop is a role with a current owner. "The team Slack channel" is not a hop; "Prompt SME on-call: Arjun Rao, this week" is.
- SLO at each hop. Response time is contracted at each hop (for example, 5 min for the primary's acknowledgement, 15 min for the lead's response). Without SLOs, escalations stall silently.
- Handover context defined. Each hop knows what context to receive: alert payload, actions taken, current state, hypothesis.
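The three properties can be checked mechanically. The sketch below is a minimal lint, not part of any paging tool: the `Hop` class, its field names, and the 15-minute SLO on the Prompt SME hop are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hop:
    role: str                          # e.g. "Prompt SME on-call"
    owner: Optional[str]               # current named owner; None if unassigned
    ack_slo_min: Optional[int]         # response-time SLO in minutes
    handover_fields: List[str] = field(default_factory=list)


def lint_hop(hop: Hop) -> List[str]:
    """Return the properties this hop is missing (empty list means none)."""
    problems = []
    if not hop.owner:
        problems.append("not named: no current owner")
    if hop.ack_slo_min is None:
        problems.append("no SLO")
    if not hop.handover_fields:
        problems.append("no handover context defined")
    return problems


# Hypothetical hops: the SLO value here is an assumption for illustration.
good = Hop("Prompt SME on-call", "Arjun Rao", 15,
           ["alert payload", "actions taken", "current state", "hypothesis"])
bad = Hop("The team Slack channel", None, None)

print(lint_hop(good))  # []
print(lint_hop(bad))   # all three problems
```

Running a lint like this over the published graph on every rotation change is one way to catch a hop that has quietly lost its owner.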
The default graph¶
A common AI escalation graph:
Primary on-call
    | (5 min ack SLO)
    v
Backup on-call
    | (10 min if backup also fails)
    v
Lead-on-call (15 min response)
    |
    +---> Prompt SME (for prompt-suspect incidents)
    +---> Model gateway on-call (for provider drift)
    +---> Retrieval SME (for retrieval staleness)
    +---> Security on-call (for safety violations)
    +---> Legal/compliance (for regulated-content incidents)
    |
    v (if specialist insufficient)
Engineering director
    |
    v (for provider issues)
Provider enterprise support
The graph is not linear past the lead. After the lead-on-call, the graph branches based on the suspected cause. The runbook indicates which branch; the lead confirms or overrides.
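The branch step can be sketched as a lookup from suspected cause to specialist hop, with the lead's override taking precedence. The cause labels (`"prompt"`, `"provider_drift"`, and so on) and the function name are hypothetical; the hop names come from the default graph.

```python
from typing import Optional

# Suspected cause -> specialist hop, following the default graph.
# The cause keys are illustrative labels, not a standard taxonomy.
SPECIALIST_BRANCHES = {
    "prompt": "Prompt SME",
    "provider_drift": "Model gateway on-call",
    "retrieval_staleness": "Retrieval SME",
    "safety_violation": "Security on-call",
    "regulated_content": "Legal/compliance",
}


def specialist_for(runbook_cause: str,
                   lead_override: Optional[str] = None) -> Optional[str]:
    """Pick the branch: the runbook indicates, the lead confirms or overrides.

    Returns None when no branch matches, which leaves the decision
    with the lead rather than guessing a specialist.
    """
    cause = lead_override or runbook_cause
    return SPECIALIST_BRANCHES.get(cause)


print(specialist_for("prompt"))                                   # Prompt SME
print(specialist_for("prompt", lead_override="provider_drift"))   # Model gateway on-call
print(specialist_for("unknown"))                                  # None
```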
Each hop has a contract¶
Each named hop in the graph has a small contract:
- Reachability. How the hop is paged (PagerDuty service, dedicated phone, Slack channel with auto-page integration).
- Response SLO. Time within which the hop must acknowledge.
- Authority. What decisions the hop can make alone (rollback, feature flag, tenant disable) and what requires further escalation.
- Handover-context schema. What context the hop must receive: alert payload, runbook position, actions taken, current state, hypothesis, escalation reason.
Without the contract, escalation is improvisation. With it, escalation is mechanical and fast.
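One way to make the contract concrete is a small record per hop. This is a sketch under assumptions: the `HopContract` class is hypothetical, and the authority granted to the lead here is an example choice, not a recommendation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class HopContract:
    role: str
    reachability: str               # how the hop is paged
    response_slo_min: int           # minutes within which the hop must acknowledge
    authority: FrozenSet[str]       # decisions the hop can make alone
    handover_schema: Tuple[str, ...]

    def can_decide_alone(self, action: str) -> bool:
        """True if the hop may take this action without further escalation."""
        return action in self.authority


# Example values only; each team sets its own authority boundaries.
lead = HopContract(
    role="Lead-on-call",
    reachability="PagerDuty service: lead-on-call",
    response_slo_min=15,
    authority=frozenset({"rollback", "feature flag", "tenant disable"}),
    handover_schema=("alert payload", "runbook position", "actions taken",
                     "current state", "hypothesis", "escalation reason"),
)

print(lead.can_decide_alone("rollback"))             # True
print(lead.can_decide_alone("provider escalation"))  # False
```

Making the record immutable (`frozen=True`) reflects the idea that the contract changes through a deliberate update, not during an incident.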
A worked example — the runaway agent escalation¶
The Pune fintech's account agent enters a tool-call loop on a Sunday at 14:00. The quality alert fires; the on-call follows the runbook. Step 5 of the runbook indicates the issue is not deploy-caused; the on-call cannot identify the cause and escalates.
The escalation flow:
- 14:02. Primary acknowledges page; opens incident channel.
- 14:08. Primary completes runbook steps 1-5; no resolution. Escalates.
- 14:09. Lead-on-call is paged via the dedicated PagerDuty service. SLO: 15 minutes.
- 14:14. Lead joins channel. Reviews payload, actions taken, hypothesis ("loop in the new agent tool, but cannot confirm").
- 14:16. Lead invokes the runbook's specialist branch: "agent loop or tool misconfiguration → agent team specialist + gateway team on-call."
- 14:21. Agent team specialist (Rahul Mehta) joins. Identifies the tool retry config as the cause. Pushes a hotfix to the agent's retry policy.
- 14:38. Hotfix deployed; alert clears; incident contained.
Total time: 36 minutes from the primary's acknowledgement (14:02) to alert clear (14:38). Without a defined escalation graph, the primary would have spent the time after step 5 either improvising debugging or messaging the team channel hoping someone was awake. Mean time to contain would likely have been multiple hours.
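The arithmetic behind the timeline, as a small sketch. The timestamps are the ones in the flow above; the event labels and helper names are illustrative.

```python
from datetime import datetime

# Timestamps from the worked example; labels are shorthand for each step.
timeline = {
    "primary acknowledges": "14:02",
    "primary escalates": "14:08",
    "lead paged": "14:09",
    "lead joins": "14:14",
    "specialist branch invoked": "14:16",
    "specialist joins": "14:21",
    "alert clears": "14:38",
}

LEAD_SLO_MIN = 15  # lead-on-call response SLO from the example


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


lead_response = minutes_between(timeline["lead paged"], timeline["lead joins"])
specialist_join = minutes_between(timeline["specialist branch invoked"],
                                  timeline["specialist joins"])
ack_to_clear = minutes_between(timeline["primary acknowledges"],
                               timeline["alert clears"])

print(lead_response, lead_response <= LEAD_SLO_MIN)  # 5 True
print(specialist_join)                               # 5
print(ack_to_clear)                                  # 36
```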
Provider escalation¶
The hardest hop is the provider — Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI. The escalation has unique constraints:
- Reachability is contractual. Enterprise customers have named contacts and SLAs; smaller customers have a public status page and a support email.
- Response time is the provider's, not yours. The escalation can wait minutes or hours; the apparatus must have mitigation paths that do not depend on the provider responding.
- What you can ask for is limited. Providers can confirm an outage or behaviour change; they cannot usually re-deploy on your timeline.
- Documentation is your job. A solid provider escalation includes the incident payload, the gateway logs, the deviation from baseline, and the business impact statement. A vague "your model is broken" message gets a vague reply.
The runbook for provider-suspect incidents should include:
- The provider contact channel (with phone numbers, account IDs).
- The template for the provider message (with placeholder fields the on-call fills in).
- The gateway-side mitigation paths the apparatus controls (fallback chain, model pin, cache fall-through, refusal at the gateway).
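A provider message template can be as simple as a string with named placeholders. The sketch below uses Python's standard `string.Template`; the field names and all example values are hypothetical, and a real template would follow whatever the provider's support process asks for.

```python
from string import Template

# Placeholder fields the on-call fills in. Field names are illustrative.
PROVIDER_MESSAGE = Template(
    "Account: $account_id\n"
    "Incident start (UTC): $started_at\n"
    "Affected workload: $workload\n"
    "Deviation from baseline: $deviation\n"
    "Business impact: $business_impact\n"
    "Gateway logs: $gateway_log_ref\n"
)


def render_provider_message(fields: dict) -> str:
    """Fill the template.

    substitute() raises KeyError when a placeholder has no value,
    so an incomplete message cannot be sent by accident.
    """
    return PROVIDER_MESSAGE.substitute(fields)


# Made-up values for illustration only.
example_fields = {
    "account_id": "EXAMPLE-ACCOUNT-001",
    "started_at": "2024-06-02 08:30",
    "workload": "account agent, production traffic",
    "deviation": "refusal rate well above the 7-day baseline",
    "business_impact": "customer-facing agent degraded",
    "gateway_log_ref": "attached export, 08:00-09:00 UTC",
}

print(render_provider_message(example_fields))
```

Failing loudly on a missing field is the design choice here: a vague message is usually the result of a field nobody filled in.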
Handover-context handoff¶
When an incident moves from one hop to another, context is preserved through a structured handover:
- The alert payload. Original page contents.
- Runbook position. Which step the previous hop was on; what they completed.
- Actions taken. Rollback attempted? Feature flag flipped? Tenant disabled?
- Current state. Is the alert still firing? Has the user-facing impact changed?
- Hypothesis. What does the previous hop think is happening?
- Escalation reason. Why is this hop being brought in?
The handover is verbal (live in the incident channel) plus durable (a short summary post in the channel). The discipline is that the next hop should be able to act within 5-10 minutes of joining; if they need to re-derive context, the handover failed.
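The six-field schema lends itself to a small record with a completeness check and a formatter for the durable summary post. A minimal sketch: the class and function names are hypothetical, and the example values are paraphrased from the worked example, with the current-state line assumed.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class HandoverContext:
    alert_payload: str
    runbook_position: str
    actions_taken: str
    current_state: str
    hypothesis: str
    escalation_reason: str


def missing_fields(ctx: HandoverContext) -> List[str]:
    """Names of fields left blank; a non-empty result means the handover is incomplete."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name).strip()]


def summary_post(ctx: HandoverContext) -> str:
    """Render the durable summary for the incident channel, one field per line."""
    return "\n".join(
        f"{f.name.replace('_', ' ').capitalize()}: {getattr(ctx, f.name)}"
        for f in fields(ctx)
    )


ctx = HandoverContext(
    alert_payload="quality alert: account agent in a tool-call loop",
    runbook_position="steps 1-5 complete, no resolution",
    actions_taken="runbook steps 1-5; issue is not deploy-caused",
    current_state="alert still firing",  # assumed for illustration
    hypothesis="loop in the new agent tool, but cannot confirm",
    escalation_reason="primary cannot identify the cause",
)

print(missing_fields(ctx))  # []
print(summary_post(ctx))
```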
Operational signals¶
Healthy. Hop SLOs are met during real incidents. The escalation graph is published, current, and referenced from the runbook. Provider escalations have defined contacts and templates. Drills exercise the full graph.
First degrading metric. Hop SLO breaches creeping up. When a hop stops responding within its SLO, the apparatus is degrading.
Misleading metric. Number of escalations. High escalation volume can mean alerts are firing on novel cases (a sign the apparatus is still maturing) or that on-calls are escalating too early (a training gap). The metric does not distinguish.
Expert graph. Hop SLO compliance per hop, escalation reason distribution, mean time to context at each hop. The combination shows which hops are degrading and why.
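Hop SLO compliance per hop is a simple ratio once escalation records are kept. The sketch below assumes each record is a `(hop, response_minutes)` pair; the hop names, SLO values, and sample data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def slo_compliance(escalations: Iterable[Tuple[str, float]],
                   slo_min: Dict[str, float]) -> Dict[str, float]:
    """Fraction of escalations per hop that were answered within the hop's SLO."""
    met = defaultdict(int)
    total = defaultdict(int)
    for hop, response_min in escalations:
        total[hop] += 1
        if response_min <= slo_min[hop]:
            met[hop] += 1
    return {hop: met[hop] / total[hop] for hop in total}


# Hypothetical SLOs and response times, in minutes.
slo = {"backup": 10, "lead": 15, "prompt_sme": 30}
records = [
    ("backup", 4), ("backup", 12),
    ("lead", 9), ("lead", 14), ("lead", 22),
    ("prompt_sme", 40),
]

for hop, ratio in slo_compliance(records, slo).items():
    print(f"{hop}: {ratio:.0%}")
```

Reading the ratios alongside the escalation reason distribution is what separates a hop that is under-staffed from one that is simply paged for the wrong incidents.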
Boundary of applicability¶
Strong fit. Multi-team platforms with distinct specialists per AI surface. The full graph is justified.
Pathology. A small team has every specialist on the rotation; the graph degenerates to "the on-call escalates to themselves." The pattern is to keep the graph explicit even when the names overlap; the graph is the contract, not the headcount.
Scale limit. Very large platforms have multi-layered escalation with specialist-of-specialist hops. The pattern is to keep individual graphs to no more than 3-4 layers; deeper graphs lose context at each hop.
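Graph depth can be measured rather than eyeballed. A minimal sketch, assuming the graph is stored as a dict from hop to its next hops and that a layer is one hop on the longest path; the hop names in the deeper graph are hypothetical.

```python
from typing import Dict, List, Tuple

MAX_LAYERS = 4  # upper end of the 3-4 layer guideline


def depth(graph: Dict[str, List[str]], node: str,
          _path: Tuple[str, ...] = ()) -> int:
    """Number of hops on the longest path starting at `node`."""
    if node in _path:
        raise ValueError(f"cycle in escalation graph at {node!r}")
    children = graph.get(node, [])
    if not children:
        return 1
    return 1 + max(depth(graph, child, _path + (node,)) for child in children)


# A 3-layer graph: on-call -> lead -> director.
shallow = {"on-call": ["lead"], "lead": ["director"], "director": []}

# A hypothetical 5-layer graph with a specialist-of-specialist hop.
deep = {
    "on-call": ["lead"],
    "lead": ["specialist"],
    "specialist": ["senior specialist"],
    "senior specialist": ["director"],
}

print(depth(shallow, "on-call"), depth(shallow, "on-call") <= MAX_LAYERS)  # 3 True
print(depth(deep, "on-call"), depth(deep, "on-call") <= MAX_LAYERS)        # 5 False
```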
Failure-prone assumption¶
The seductive wrong belief: escalation can be informal — engineers know who to ask. This works at small scale and breaks under load. Informal escalation depends on individuals remembering who to ping; under stress, engineers ping the wrong person, ping multiple people who all defer to each other, or ping no one. The correct belief: escalation is a contract with SLOs and named owners. Even at small scale, naming the contract makes it survive turnover, holidays, and stress.
Where this appears in production¶
- A fintech has the lead-on-call as a separate rotation; escalations land cleanly.
- A telecom AI has provider escalation templates pre-written; on-call can send the right message in 2 minutes.
- A consumer chatbot has informal escalation; the next real incident takes 80 minutes longer because escalation goes to a vacationing engineer.
- A healthtech AI has named hops with phone-number fallbacks for off-hours; reachability is enforced.
- A coding assistant publishes the escalation graph in the alert payload; engineers do not have to look it up.
- A retail AI has a 3-layer graph (on-call → lead → director); deeper than this loses context.
- A logistics AI has specialist hops named by role, not person; when Rahul leaves, the role's new owner is updated centrally.
- A government AI escalates to the regulator as an explicit hop for safety violations; the regulator contact is in the graph.
- A B2B SaaS measures hop SLO compliance per quarter; failing hops are remediated by staffing or training.
- A travel platform has provider escalation templates per provider; the on-call fills in fields and sends.
- A payments AI has security on-call as a parallel hop on safety-violation incidents; both hops are paged simultaneously.
- A legal AI has legal-counsel on the escalation graph; legal-tier incidents reach counsel within 30 minutes.
- A staffing AI has multi-region escalation; the on-call for the affected region is the first hop.
- A real-estate AI has the engineering director as a final escalation; the hop is rarely needed but defined.
- A media AI had provider escalation through a generic support email; response time was 24 hours; upgraded to enterprise contract with 4-hour SLA.
- An ad-tech AI has a Slack auto-page integration; specific channel messages trigger the named hop.
- A document AI has the prompt SME hop with a 1-hour SLO during business hours and escalates to the lead off-hours.
- A search-rerank service escalates to the data platform team for retrieval issues; the hop's SLO is contracted between teams.
- A medical AI has a regulator notification step in the escalation flow; the step is exercised in drills.
- A small startup AI has every engineer in the escalation graph; the graph names them by role to survive growth.
Recall / checkpoint¶
- What are the three properties of a good escalation graph?
- What is in the contract for each named hop?
- What is unique about provider escalation, and what does it require?
- What is the handover-context schema?
- How does the escalation graph relate to the runbook?
- How do you detect that an escalation hop is degrading?
- Why is informal escalation a structural failure pattern even at small scale?
Interview Q&A¶
Q1. A team's escalation works fine in practice — engineers know who to ask. Why design a formal graph? Because "knowing who to ask" depends on individual memory, current relationships, and being awake. Under stress, the wrong person is pinged; under turnover, the right person is no longer on the team; under holidays, no one is on. The graph survives all three by making the contract explicit. Designing the graph at small scale also surfaces gaps before they fail in a real incident (no backup for the prompt SME, no provider contact for off-hours). Common wrong answer to avoid: "we'll add it when we grow" — escalation discipline does not retrofit easily once habits form.
Q2. The runbook says "escalate to the lead." What is missing, and how do you fix it? The hop has no channel, no SLO, and no handover-context schema. "Escalate to the lead" leaves the on-call to figure out how to reach the lead (paging system? Slack? phone?), how long to wait, and what to send. The fix is to name the channel (PagerDuty service: lead-on-call), the SLO (15-minute response), and the handover-context schema (alert payload + runbook position + actions taken + current state + hypothesis + escalation reason). Then the runbook step "escalate" is executable. Common wrong answer to avoid: "the lead will figure it out." The on-call cannot get the lead's attention without the contract; the lead cannot act without the context.
Q3. Walk through provider escalation for a suspected provider behaviour change. Open the gateway dashboards; confirm the drift signal (refusal rate, error class, latency distribution); collect the magnitude and traffic share affected. Open the provider escalation template; fill in the deviation from baseline, affected workload, business impact. Send to the provider contact (enterprise channel for enterprise customers; support email otherwise). While waiting, execute the gateway-side mitigations: pin the prior model version, route to the fallback provider for the affected workload, enable the cache fall-through. The provider response is asynchronous; the apparatus contains the incident independently. Common wrong answer to avoid: "wait for the provider to fix it." The apparatus must have mitigation paths that do not depend on the provider's timeline.
Q4. The hop SLO for the prompt SME is being missed during weekend incidents. What is the apparatus failure? Either the SME rotation is under-staffed (one engineer covering weekends), the page is not reaching them (channel misconfigured), or the SLO was set unrealistically (15 minutes when 30 is more honest). The diagnostic is the SLO breach data: which hops, what time of week, what response time was achieved. The fix is rotation rebalance, channel fix, or SLO renegotiation — chosen based on the data. Common wrong answer to avoid: "tighten the SLO" — does not improve compliance if the structural staffing or reach issue is unresolved.
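The diagnostic in Q4 can be sketched as a breakdown of one hop's pages by time of week. The record shape, the 15-minute SLO, and every timestamp and response time below are made up for illustration.

```python
from datetime import datetime
from statistics import median
from typing import Dict, Iterable, Tuple


def breach_profile(pages: Iterable[Tuple[str, float]],
                   slo_min: float) -> Dict[str, dict]:
    """Split (paged_at ISO timestamp, response minutes) pairs into weekday and weekend."""
    buckets = {"weekday": [], "weekend": []}
    for paged_at, response_min in pages:
        # weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = datetime.fromisoformat(paged_at).weekday() >= 5
        buckets["weekend" if is_weekend else "weekday"].append(response_min)
    return {
        period: {
            "pages": len(times),
            "breaches": sum(t > slo_min for t in times),
            "median_response_min": median(times),
        }
        for period, times in buckets.items() if times
    }


# Hypothetical prompt SME pages: (paged at, minutes to respond).
prompt_sme_pages = [
    ("2024-06-01T14:05", 42),  # Saturday
    ("2024-06-02T09:30", 35),  # Sunday
    ("2024-06-03T11:00", 8),   # Monday
    ("2024-06-05T16:20", 12),  # Wednesday
    ("2024-06-08T22:10", 28),  # Saturday
]

profile = breach_profile(prompt_sme_pages, slo_min=15)
print(profile["weekend"])
print(profile["weekday"])
```

In this invented data every weekend page breaches and no weekday page does, which would point at staffing or reachability rather than at the SLO itself.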
Q5. How does the handover-context schema reduce mean time to context? The next hop joins an incident with a structured summary instead of having to ask questions. The schema (alert payload, runbook position, actions taken, current state, hypothesis, escalation reason) covers the questions the new hop would ask first. The next hop can act within 5-10 minutes; without the schema, they spend 15-20 minutes re-deriving context. Across a multi-hop incident, the schema saves 30+ minutes. Common wrong answer to avoid: "the next hop can read the channel history" — channel history is unstructured; the schema is the digest.
Q6. The graph has 5 layers. Is this a problem? Often yes. Deeper graphs lose context at each hop; by the fifth hop, the original cause is summarised through 4 rephrasings. The pattern is to keep individual graphs to 3-4 layers; if more depth is needed, split the graph by failure family (one graph for safety incidents, one for cost incidents) rather than stacking hops. Each shorter graph preserves context better than one long graph. Common wrong answer to avoid: "more hops means more reach" — reach with degraded context is worse than reach with clean context.
Design / debug exercise (10 minutes)¶
Modelled example. Walk through the worked example (the Pune fintech agent incident). Identify each hop, its SLO, the handover content, and the time spent at each hop. Identify where the graph design saved time vs. where it cost time.
Your turn. Draw your team's escalation graph. For each hop, fill in: channel, SLO, authority, handover-context schema. Identify any hop that is not formally named or has no SLO.
Reproduce from memory. Write the default graph from this chapter, with hops and SLOs. The signal of internalisation is that the graph lands in under three minutes with the typical SLOs.
Operational memory¶
This chapter explained the escalation plane: a named graph from primary on-call through backup, lead, specialists, and provider contacts, with SLOs and handover contracts at each hop. The important idea is that escalation is a contract, not a habit; the contract is what makes escalation executable under stress.
You learned to design hops with channel, SLO, authority, and handover schema; to define provider escalation with templates and mitigation independence; and to keep graphs shallow enough to preserve context across hops. That solves the opening failure because the runbook's "escalate" step is now an action with a defined target and contract.
Carry this diagnostic forward: when a team says "escalation works," ask for the graph. The graph either exists, with hops and SLOs, or escalation is happening by habit and will fail under load.
Remember:
- Named hops with channel, SLO, authority, handover schema.
- Provider escalation has templates plus independent mitigations.
- Handover-context schema cuts mean time to context.
- Graphs stay shallow (3-4 layers); deeper graphs lose context.
- Informal escalation is a structural failure pattern, not a stage of maturity.
Bridge. The first six chapters built the apparatus — alerts, rotations, runbooks, escalations. The next three chapters develop the specific runbook families the apparatus needs. The first is the hardest: degraded quality. → 07-degraded-quality-runbooks.md