01. Why classic on-call fails for AI

~10 min read. Classic SRE on-call is uptime-first: alert on errors, page on saturation, restart what failed. AI systems fail in shapes those alerts cannot see. This chapter is the diagnosis that forces a new apparatus.

Continues from 00-first-principles.md. The alert plane, rotation plane, runbook plane, and escalation plane of the AI on-call apparatus all exist because the classic apparatus produces false negatives on AI-specific failures. Until we feel why the classic apparatus fails, we will treat its replacement as overhead.

The first-principles overview promised an apparatus. This chapter earns that promise by walking through five real failure shapes that pass through a classic on-call apparatus unnoticed. Each is something a senior SRE on-call would not have caught with their usual instruments, not because the SRE was wrong but because the instruments were calibrated for a different problem.

The five shapes a classic apparatus misses

A classic SRE on-call is calibrated for outage and saturation. Endpoint returning 5xx; queue depth climbing; pod OOM; cluster CPU at 95%. The alerts page on symptoms whose ground truth is "the service is up or it is not." The runbooks are about restoring up-ness: restart, scale, route around. The post-mortem asks why the service went down and what would have prevented the downtime.

None of that is wrong. It just does not see five families of AI-specific failure:

Family	What goes wrong	Why classic alerts miss it
Silent quality regression	The AI's answers degrade after a prompt or model change	API still returns 200; error rate unchanged
Provider behaviour drift	The provider's model now refuses or produces different shapes	API is responsive; the shift is in content, not status
Cost runaway	A misconfigured batch loops on the gateway	Latency dashboards do not show cost; bills lag
Safety boundary violation	The AI produces harmful content for a specific input class	Aggregate metrics hide the input slice; complaints lag
Long-tail user harm	A small fraction of users gets systematically bad answers	The 99% is healthy; the 1% is invisible to averages

Each one has a different shape; each one needs its own paging condition, its own runbook, and often its own escalation path. This chapter's claim is that any company shipping AI without a re-designed apparatus is missing at least three of these five.

A worked example: the silent regression that pages no one

The platform team at a Bengaluru insurance SaaS deploys a prompt change at 18:00 IST. The change improves average answer quality on the eval set by 3%; the team is pleased. At 18:42, a user-feedback signal in the product surface shows a small uptick in thumbs-down on policy-comparison questions. The on-call dashboard does not surface this signal — the dashboard is wired to API status, latency, and error rate. All three are green. The on-call engineer is not paged.

By 23:00 the user-feedback signal is loud enough that the product manager notices. By 09:00 the next morning, a postmortem is opened. Investigation reveals that the prompt change improved aggregate eval scores but degraded a specific intent — policy_comparison_with_riders — that constitutes 14% of traffic. The regression on that intent was 23%, large enough that every affected user noticed.

The classic apparatus had no failure to alert on. The API was healthy. The prompt deploy was logged. The eval scores were better. The thumbs-down signal was not paging because it had never been wired as a paging condition; it was a dashboard tile.

Three lessons from this:

Aggregate metrics hide slice regressions. Quality alerts must be sliced by intent, tenant, or user cohort, not just measured in aggregate.
Eval scores on a frozen set are necessary but not sufficient. Production traffic distributions shift faster than eval sets are refreshed; an eval win on yesterday's set can be a production loss today.
A signal that lives only on a dashboard is a signal that does not page. The dashboard is for the postmortem; the page is for the incident.
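The first lesson can be made concrete with a small slice check. The sketch below is illustrative only: the `SliceScore` type, the thresholds, and the score values are assumptions chosen to mirror the shape of the example (a roughly 3% aggregate gain hiding a 23% drop on one intent), not the team's actual eval data or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceScore:
    intent: str
    baseline: float   # mean graded quality before the prompt change (0..1)
    candidate: float  # mean graded quality after the prompt change (0..1)
    samples: int      # graded examples in this slice


def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative change; negative means the candidate is worse."""
    return (candidate - baseline) / baseline


def aggregate_change(scores: list[SliceScore]) -> float:
    """Sample-weighted change across all slices (what an aggregate eval reports)."""
    total = sum(s.samples for s in scores)
    base = sum(s.baseline * s.samples for s in scores) / total
    cand = sum(s.candidate * s.samples for s in scores) / total
    return relative_change(base, cand)


def regressed_slices(scores, max_drop=0.05, min_samples=50):
    """Return (intent, change) for slices whose relative drop exceeds max_drop."""
    flagged = []
    for s in scores:
        if s.samples < min_samples:
            continue  # too few graded examples to judge this slice
        change = relative_change(s.baseline, s.candidate)
        if change < -max_drop:
            flagged.append((s.intent, change))
    return flagged


# Hypothetical scores: one intent drops 23% while everything else improves.
scores = [
    SliceScore("policy_comparison_with_riders", 0.80, 0.616, 140),
    SliceScore("all_other_intents", 0.80, 0.858, 860),
]

print(f"aggregate change: {aggregate_change(scores):+.1%}")
for intent, change in regressed_slices(scores):
    print(f"slice regression: {intent} {change:+.1%}")
```

With these numbers the aggregate check is positive while the slice check flags `policy_comparison_with_riders`. A paging condition would be keyed on the slice check, whether it runs on the eval set or on sampled production traffic.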

The runbook the team needs at 18:42 — "quality regression after prompt change" — did not exist. The escalation path to the prompt owner was not defined. The drill scenario that would have rehearsed this had not been run.

That is one shape. The next four behave differently and need their own treatment.

Shape 2: provider behaviour drift

The provider rolls out a behavioural change to a model on Tuesday afternoon — perhaps a stricter refusal posture, perhaps a different formatting default for tool calls. The API is healthy. The model alias the gateway uses still resolves. The gateway returns 200s.

The downstream effect is that 8% of production calls now produce a refusal where they used to produce an answer. The product surface treats refusal as a soft failure; the user is shown a fallback message; the impact is visible to users but not to the SRE dashboard. Behind the scenes, the gateway's refusal rate metric (see 01_model_gateway_provider_ops chapter 09) drifts up. If the gateway team has not wired a refusal-rate alert with a sensitive threshold, the drift goes unnoticed.

The classic apparatus is blind here in a specific way: there is nothing to restart. The provider has not failed; they have changed. The runbook to write is not "restore the service" but "detect the drift, pin the model version, dual-run a candidate, decide whether to switch."
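The "detect the drift" step could look something like the following. This is a minimal sketch that compares a recent window's refusal rate against a baseline window with a two-proportion z-score; the function names, the z threshold, and the minimum absolute increase are all hypothetical tuning choices, and the counts are invented to resemble the 8-point jump described above.

```python
import math


def refusal_drift_z(base_refusals, base_calls, recent_refusals, recent_calls):
    """Two-proportion z-score for recent refusal rate versus baseline."""
    p_base = base_refusals / base_calls
    p_recent = recent_refusals / recent_calls
    pooled = (base_refusals + recent_refusals) / (base_calls + recent_calls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_calls + 1 / recent_calls))
    if se == 0:
        return 0.0
    return (p_recent - p_base) / se


def should_page_on_refusals(base_refusals, base_calls,
                            recent_refusals, recent_calls,
                            z_threshold=4.0, min_abs_increase=0.02):
    """Page only if the rise is both statistically clear and practically large."""
    increase = recent_refusals / recent_calls - base_refusals / base_calls
    if increase < min_abs_increase:
        return False
    z = refusal_drift_z(base_refusals, base_calls, recent_refusals, recent_calls)
    return z >= z_threshold


# Hypothetical counts: 2% refusals over the baseline week, 10% in the last hour.
drifted = should_page_on_refusals(1000, 50_000, 200, 2_000)
# Hypothetical quiet hour: 2.25% refusals, within normal wobble.
quiet = should_page_on_refusals(1000, 50_000, 45, 2_000)
print(drifted, quiet)
```

Requiring both conditions is a design choice: the z-score guards against paging on small samples, and the absolute floor guards against paging on a statistically real but operationally trivial shift.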

Shape 3: cost runaway

A misconfigured tool-call loop in a new agent feature retries on every soft failure. Each retry is a full model call. Latency does not spike, because each call completes within budget. Instead, the per-second call rate climbs steadily for 90 minutes before crossing the gateway's per-tenant quota and triggering its quota alert. By the time the quota alert fires, the tenant has spent $12,400 above its expected daily budget.

Classic SRE alerts: nothing fires. CPU and memory are stable; the gateway absorbs the calls. Cost dashboards exist but are refreshed daily, not in real time.

The on-call surface needs cost-anomaly alerts tied to per-tenant or per-feature baselines, not to absolute thresholds. A cost-spike runbook needs to enumerate the kill paths: feature flag off, tenant quota tightened, agent rate-limited, the offending tool disabled. The escalation needs to reach the agent owner, who has the context the SRE on-call does not.
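To illustrate why a baseline-relative alert fires earlier than an absolute one, here is a small sketch. It assumes spend is sampled per tenant in fixed intervals and uses a median plus scaled median-absolute-deviation threshold; the constants `k` and `floor_ratio`, the ceiling, and all spend figures are hypothetical.

```python
from statistics import median


def anomaly_threshold(history, k=6.0, floor_ratio=1.5):
    """Per-tenant threshold from recent spend: median + k * scaled MAD.

    floor_ratio keeps the threshold from collapsing onto the median when the
    tenant's spend is very steady (MAD close to zero).
    """
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return max(med + k * 1.4826 * mad, floor_ratio * med)


def first_breach(series, limit):
    """Index of the first interval whose spend exceeds limit, or None."""
    for i, spend in enumerate(series):
        if spend > limit:
            return i
    return None


# Hypothetical per-interval spend (USD) for one tenant on a normal day.
baseline = [9.5, 10.2, 10.0, 9.8, 10.4, 10.1, 9.9, 10.3, 9.7, 10.0, 10.2, 9.9]
# Hypothetical runaway: spend grows 40% per interval as the retry loop compounds.
runaway = [10.0 * 1.4 ** i for i in range(12)]
ABSOLUTE_CEILING = 200.0  # stand-in for a fixed quota-style alert level

anomaly_at = first_breach(runaway, anomaly_threshold(baseline))
ceiling_at = first_breach(runaway, ABSOLUTE_CEILING)
print(f"anomaly alert at interval {anomaly_at}, ceiling alert at interval {ceiling_at}")
```

In this toy series the baseline-relative alert trips seven intervals before the fixed ceiling. The sketch freezes the baseline during the incident; a real implementation would need to decide how to keep runaway spend from contaminating the rolling history.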

Shape 4: safety boundary violation

A user submits a prompt that successfully bypasses the safety layer for a specific class of input. The output causes user harm — a wrong medical claim, a wrong legal claim, a hateful response — that the aggregate safety eval did not catch because the input class is rare.

The classic apparatus has no signal. The aggregate safety eval is in the green zone. The complaint is filed in the support queue and routed through customer success.

The AI apparatus needs the safety-violation signal pre-defined as a paging condition with a low threshold — even a single confirmed violation should reach the safety on-call, not just the support queue. The runbook needs to scope the input class, freeze further generation on that class if possible, and kick off a rapid eval-set expansion.
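A low-threshold paging rule of this kind is simple to express. The sketch below assumes reports arrive already labelled with an input class and a confirmation flag; the `SafetyReport` type, the class names, and the sources are hypothetical, and confirmation itself (human review or a classifier) is out of scope.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyReport:
    input_class: str  # hypothetical label for the class of input involved
    confirmed: bool   # True once a reviewer has confirmed the violation
    source: str       # where the report came from, e.g. "support_queue"


def classes_to_page(reports, threshold=1):
    """Input classes with at least `threshold` confirmed violations.

    The default of 1 encodes the rule that a single confirmed violation
    should reach the safety on-call rather than wait in the support queue.
    """
    confirmed = Counter(r.input_class for r in reports if r.confirmed)
    return sorted(c for c, n in confirmed.items() if n >= threshold)


reports = [
    SafetyReport("rare_condition_dosing", confirmed=True, source="support_queue"),
    SafetyReport("legal_deadline_advice", confirmed=False, source="user_flag"),
    SafetyReport("legal_deadline_advice", confirmed=False, source="user_flag"),
]
print(classes_to_page(reports))
```

Grouping by input class matters for the runbook: the page tells the safety on-call which class to scope and, where possible, freeze.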

Shape 5: long-tail user harm

A new model is rolled out at 50% canary. Aggregate metrics — eval score, latency, error rate, cost — all look healthy or better. A specific user cohort, say enterprise tenants with a particular jargon vocabulary, sees a 30% answer-quality regression because the new model's training distribution under-represented that vocabulary.

In a classic apparatus, the canary "passes" — the aggregate metrics support promotion. The regression on the cohort is invisible until that cohort's customer success manager surfaces it in a quarterly review.

The AI apparatus needs canary metrics sliced by cohort and tenant tier, with cohort-level regression alerts that block promotion. The runbook needs to specify the rollback path and the eval expansion that prevents recurrence.
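A cohort-sliced promotion gate might be sketched as follows. The cohort names, scores, and thresholds are assumptions for illustration. One deliberate choice is that under-sampled cohorts are reported separately instead of being silently counted as passing.

```python
def canary_decision(cohort_metrics, max_cohort_drop=0.05, min_samples=200):
    """Decide promotion from per-cohort quality scores.

    cohort_metrics maps cohort name -> (control_score, canary_score, samples).
    Any cohort whose relative drop exceeds max_cohort_drop blocks promotion.
    """
    blockers, undersampled = [], []
    for cohort, (control, canary, samples) in sorted(cohort_metrics.items()):
        if samples < min_samples:
            undersampled.append(cohort)  # not enough data to clear or block
            continue
        change = (canary - control) / control
        if change < -max_cohort_drop:
            blockers.append(cohort)
    return {
        "promote": not blockers,
        "blockers": blockers,
        "undersampled": undersampled,
    }


# Hypothetical canary readout: aggregate looks fine, one cohort is down 30%.
decision = canary_decision({
    "smb_general": (0.82, 0.85, 5_000),
    "enterprise_jargon": (0.80, 0.56, 400),
    "new_region_pilot": (0.78, 0.79, 40),
})
print(decision)
```

Whether an under-sampled cohort should hold the promotion until more traffic arrives is a policy question the sketch leaves open; it only makes the gap visible.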

The "not X, Y" diagnosis

The diagnosis is uncomfortable for an SRE-fluent organisation. The real problem is not that on-call engineers are missing alerts; it is that the alerts were never designed for the failure modes the system is most likely to produce. A team can have a flawless SRE practice and still ship AI with a fundamentally inadequate on-call apparatus.

The natural question: so what does the AI-specific apparatus need to add? Three additions that the next chapters develop:

New paging conditions — quality, drift, cost, safety — wired with the same rigour as latency and error rate.
New runbook families — one per AI-specific failure shape, with executable steps and the right owners on the escalation graph.
New rotation discipline — on-call engineers trained for the AI surfaces they are paged on, with explicit handoff of the prompt-version, model-version, retrieval-index, and eval-state context.

The apparatus is the sum of these additions plus the SRE baseline. Neither alone is enough.

Operational signals

Healthy. The team's AI-specific paging conditions (quality, drift, cost, safety) have fired and been resolved within their SLOs. The runbooks they reference are within their freshness budget. Drills in the last quarter exercised at least one scenario per AI failure family.

First degrading metric. A new AI feature is shipped without a new alert; a new runbook is missing; a drill is skipped. The apparatus does not page; the apparatus has not noticed.

Misleading metric. Aggregate alert count. A team with no AI-specific alerts wired will have low alert volume and feel healthy. The metric to watch is alert coverage per failure family, not alert volume.

Expert graph. The matrix of failure families × surfaces, with green for "alert exists and has fired in a drill," yellow for "alert exists, never fired," red for "no alert wired." A red cell is a guaranteed future incident.
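The coverage matrix can be computed from a very small inventory. In the sketch below, the family identifiers, the surface names, and the shape of the `alerts` inventory are all hypothetical; the colour rule follows the description above.

```python
FAMILIES = (
    "silent_quality_regression",
    "provider_behaviour_drift",
    "cost_runaway",
    "safety_boundary_violation",
    "long_tail_user_harm",
)


def cell_status(alert_wired: bool, fired_in_drill: bool) -> str:
    """Green: wired and exercised. Yellow: wired, never fired. Red: not wired."""
    if not alert_wired:
        return "red"
    return "green" if fired_in_drill else "yellow"


def coverage_matrix(surfaces, alerts):
    """Build {(family, surface): colour}.

    alerts maps (family, surface) -> whether the alert has fired in a drill;
    a missing key means no alert is wired for that cell.
    """
    return {
        (family, surface): cell_status(
            (family, surface) in alerts, alerts.get((family, surface), False)
        )
        for family in FAMILIES
        for surface in surfaces
    }


def red_cells(matrix):
    return sorted(cell for cell, colour in matrix.items() if colour == "red")


# Hypothetical inventory for two product surfaces.
surfaces = ["claims_assistant", "policy_search"]
alerts = {
    ("silent_quality_regression", "claims_assistant"): True,
    ("cost_runaway", "claims_assistant"): False,
    ("cost_runaway", "policy_search"): False,
}
matrix = coverage_matrix(surfaces, alerts)
print(f"{len(red_cells(matrix))} of {len(matrix)} cells are red")
```

Counting colours per family, not in total, is what turns this into the "alert coverage per failure family" metric named under Misleading metric.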

Boundary of applicability

Strong fit. Companies that ship user-facing AI features at scale, with multiple intents, multiple tenants, and provider dependencies. The full apparatus is justified.

Pathology. Tiny prototypes with no users. The apparatus is overkill; a single eval-on-feedback dashboard is enough. The pathology is to ship the prototype to users without then building the apparatus.

Scale limit. Very large AI platforms (hundreds of features, dozens of teams) may need to specialise the apparatus per-domain, with shared infrastructure for the underlying signals. The six surfaces of the apparatus (introduced in the next chapter) still apply; their implementation is distributed.

Failure-prone assumption

The seductive wrong belief: good SRE practice transfers cleanly to AI. It does not. SRE practice transfers as a foundation; on top of it, AI requires a re-designed alert plane, a re-designed runbook plane, and a re-designed escalation plane because the failures live in shapes the SRE foundation was not built to see. A team that treats AI on-call as "just SRE on-call applied to AI services" will miss at least three of the five failure families.

The correct belief: the classic apparatus is a prerequisite, not a substitute. The AI apparatus extends it; it does not replace it.

Where this appears in production

A travel platform ships a prompt change at 22:00; thumbs-down rate on flight-rebooking queries climbs 18% with no API alert; the team learns from the product manager in the morning.
A fintech sees a provider's safety posture tighten overnight; 6% of compliance-document queries now refuse; the gateway's refusal rate alert was not wired with a sensitive threshold; the team learns from a regulatory complaint.
A customer support SaaS has a misconfigured agent loop that costs $40,000 over a weekend; no alert paged because cost dashboards refresh daily.
A healthtech assistant generates a wrong dosing suggestion for a rare condition class; the safety eval missed the input; the violation reaches a clinician before any paging condition fires.
An enterprise AI assistant promotes a model to 100% on green aggregate metrics; a specific tenant cohort sees a 30% quality regression invisible to aggregate.
An e-commerce platform has six AI services and one stale runbook; an outage in the retrieval index pages the on-call who has no context; resolution takes 90 minutes longer than it should.
A legal-tech tool rolls out a new prompt for clause extraction; eval score improves by 4%; one specific clause type regresses by 25%; the team learns from a customer.
A media company runs an agent that calls four tool providers; a provider's API contract change breaks the agent; the gateway returns 200 with a degraded shape; users notice before the team does.
A government AI service has no postmortem template that captures eval delta; recurring incidents do not produce eval coverage improvements; the same shape repeats.
A consumer chatbot has no drill calendar; the team's apparatus has not been exercised in 14 months; the next real incident is the first test.
A payments AI has a single global on-call rotation; an incident at 03:00 wakes an engineer with no context on the affected feature; resolution waits for the lead.
A search-rerank service rolls out a candidate model; canary metrics aggregate-pass; the high-margin tenant sees a regression; the canary promotion proceeds.
A document AI is hit by a provider-side refusal-posture change; the affected document type's downstream pipeline silently halves throughput; no alert fires.
A B2B SaaS has a postmortem template inherited from web SRE; AI-specific causes (prompt, retrieval, eval) are recorded as "other"; lessons do not accumulate.
An internal tooling AI at a large bank has no escalation path to the model team; on-call engineers escalate to "the AI Slack channel" hoping someone is awake; mean time to context is 40 minutes.
A regional ride-hailing app ships AI dispatch optimisation; the agent has no kill switch; a bad model push requires a full feature deploy to roll back.
A staffing platform runs eval-on-production-traffic at a 0.5% sample; at that rate only large regressions are detectable, so smaller ones pass unseen; the team accepts the trade-off without measuring it.
A retail AI classifies customer messages; a provider's tone change causes a politeness regression; the brand surface degrades; the team learns from a Twitter screenshot.
A telecom AI has alerts wired for every failure family but no drill calendar; the on-call has not handled a real page in four months; the next page goes poorly.
A coding assistant has a quality-regression alert that fires three times in two weeks; the on-call mutes it because they cannot tell which fires are real; the apparatus degrades by use.

Recall / checkpoint

Name the five AI-specific failure shapes that classic SRE alerts miss.
Why is an eval-score improvement on an aggregate set not sufficient to clear a prompt deploy?
What is the difference between an alert that lives on a dashboard and a paging condition?
Why is per-tenant cost anomaly a different shape from absolute cost ceiling?
What are the three AI-specific additions the apparatus must make on top of the SRE baseline?
Which failure family will a tiny prototype most likely be exposed to first, and why?
What signal would tell you that your apparatus is degrading by use rather than by design?

Interview Q&A

Q1. A team has a strong SRE practice and is shipping its first AI feature. The on-call lead says the existing apparatus is sufficient. Walk through the pushback.

The existing apparatus is calibrated for outage and saturation. AI features fail through silent quality regression, provider behaviour drift, cost runaway, safety violation, and long-tail user harm. None of these shows up in the existing apparatus. The pushback is not to throw away SRE; it is to layer AI-specific paging conditions, runbooks, and escalation paths on top. Show the lead one concrete scenario per family with the question "which alert would fire?" In four of five cases the honest answer is "none." That answer makes the case.

Common wrong answer to avoid: "we'll add alerts as we hit incidents." By the time the incidents happen, the apparatus is reactive, the postmortems are repetitive, and the lessons cost users.

Q2. The team's eval scores improved after a prompt change, and the deploy passed. Forty minutes later, a specific intent has regressed. Diagnose the apparatus failure.

Two apparatus failures. First, the eval set was scored only in aggregate; per-intent slice regressions were invisible. Second, the production-traffic feedback signal was on a dashboard but not wired as a paging condition. The fix is per-slice eval scoring (or at minimum per-intent regression alerts on production traffic) and a paging condition tied to feedback-signal anomaly after any prompt deploy. The runbook for "quality regression after prompt change" should pre-exist and name the rollback path.

Common wrong answer to avoid: "the eval set needs to be bigger." Bigger evals help, but the apparatus gap is in slicing and paging, not in eval volume.

Q3. How is cost runaway different from latency or error-rate anomaly, and what does the apparatus need to add?

Cost runaway can happen without latency or error-rate movement; each individual call is healthy, the volume is wrong. The apparatus needs cost-anomaly alerts based on per-tenant or per-feature baselines (deviation from rolling average) rather than absolute thresholds. Real-time or near-real-time cost telemetry is required; daily-refreshed cost dashboards are too slow. The cost-spike runbook needs the kill paths enumerated: feature flag, tenant quota, agent rate limit, offending tool disable.

Common wrong answer to avoid: "the gateway's quota alert is enough." Quota alerts fire at a fixed level; anomaly alerts catch the climb before it reaches that level.

Q4. The team's canary promoted a model on green aggregate metrics; a tenant cohort regressed. How should the apparatus prevent recurrence?

Two changes. First, canary metrics must be sliced by tenant tier and cohort, with cohort-level regression as a promotion blocker. Second, the rollback runbook must include the model-version revert path and the eval-set expansion to cover the regressed cohort. The drill calendar should include a "tenant-cohort regression during canary" scenario so the apparatus has been exercised before a real one.

Common wrong answer to avoid: "we'll detect it from customer success feedback." That loop is days long; the canary loop should be minutes to hours.

Q5. The on-call engineer mutes a quality-regression alert because they cannot tell which fires are real. What is the apparatus failure?

Alert quality, not alert volume. The alert as designed has poor precision; the on-call rationally treats it as noise; the apparatus has degraded by use. The fix is to tighten the alert's threshold, add confirmatory signals (slice-level regression rather than aggregate), or split into severity tiers: a tier that pages only on high-confidence regressions and a tier that creates tickets for review. The drill calendar should rehearse alert-tuning as a regular exercise, not only incident response.

Common wrong answer to avoid: "tell the on-call to take alerts seriously." The on-call's behaviour is a signal about the alert, not a discipline problem.

Q6. What is the single piece of evidence that an AI on-call apparatus is degrading silently?

The alert coverage matrix has stable green-cell counts while the AI feature surface area is growing. New features are being shipped without new alerts, new runbooks, and new drill scenarios. The apparatus's coverage is shrinking relative to the system it protects, but no metric in the apparatus itself flags it. The fix is to make alert coverage growth a release-process gate, paired with feature ship counts.

Common wrong answer to avoid: "low alert volume." Low volume can mean either healthy or blind; the matrix distinguishes them.
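The check described in Q6 can be reduced to a comparison of two snapshots. The sketch below is a rough illustration under a simplifying assumption that each feature contributes one matrix cell per failure family; the snapshot format and function names are hypothetical.

```python
FAMILY_COUNT = 5  # the five failure families from this chapter


def coverage_ratio(features: int, green_cells: int) -> float:
    """Green cells as a fraction of all cells (one cell per feature per family)."""
    return green_cells / (features * FAMILY_COUNT)


def coverage_is_eroding(snapshots) -> bool:
    """True when features grew but the green-cell ratio fell.

    snapshots is a list of (features, green_cells) tuples, oldest first.
    """
    (first_features, first_green) = snapshots[0]
    (last_features, last_green) = snapshots[-1]
    grew = last_features > first_features
    fell = (coverage_ratio(last_features, last_green)
            < coverage_ratio(first_features, first_green))
    return grew and fell


# Hypothetical quarters: green count is flat at 12 while features go from 4 to 7.
stable_green_growing_surface = [(4, 12), (5, 12), (7, 12)]
# Hypothetical quarters: green cells keep pace with new features.
keeping_pace = [(4, 12), (5, 15), (7, 21)]

print(coverage_is_eroding(stable_green_growing_surface))
print(coverage_is_eroding(keeping_pace))
```

Used as a release gate, a check like this would fail a feature launch that adds cells to the matrix without turning any of them green.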

Design / debug exercise (10 minutes)

Modelled example. Take the five failure shapes from this chapter. For each, write down: (a) what classic alert would fire (almost always "none"), (b) what AI-specific alert should fire, (c) which runbook the on-call needs in hand, (d) who is the escalation target.

Your turn. Pick one AI feature you have shipped or are about to ship. For each of the five shapes, walk through (a)-(d). Mark which cells you have today, which you have in dashboard-only form, and which are blind. The blind cells are your next two sprints of apparatus work.

Reproduce from memory. Draw the table from the "five shapes a classic apparatus misses" section without looking. You have internalised the chapter when you can name the family, what goes wrong, and why classic alerts miss it, for all five, in under three minutes.

Operational memory

This chapter explained why a competent SRE on-call apparatus, applied unchanged to AI systems, will miss the failure families AI is most likely to produce. The important idea is not that the SRE team lacks discipline; it is that AI failures live in shapes (silent quality regression, drift, cost runaway, safety violation, long-tail harm) that the symptoms-of-outage frame does not see.

You learned to enumerate the five shapes, to recognise each one against your team's current alert coverage, and to identify the cells where the apparatus is blind today. That solves the opening failure because the apparatus you build in the rest of the module is exactly the set of additions that turn red cells green.

Carry this diagnostic forward: when a team says "our on-call covers our AI," ask them to name a paging condition for each of the five families. If they cannot, you have found the gap before the gap finds you.

Remember:

Classic on-call sees outage and saturation; AI fails through quality, drift, cost, safety, and long-tail harm.
A dashboard tile is not a paging condition.
Aggregate metrics hide slice regressions; alerts must be sliced.
Cost runaway requires anomaly detection, not threshold breach.
Apparatus blindness grows silently as the system grows; track alert coverage relative to feature surface area.

Bridge. Knowing why the classic apparatus fails is the diagnosis. The next chapter is the prescription — the six surfaces of an AI on-call apparatus as a service architecture, so the rest of the module can develop each surface in turn. → 02-the-oncall-apparatus.md