07. Degraded quality runbooks

~9 min read. Silent quality regression is the hardest AI failure to handle because the system is "up" — APIs are healthy, errors are absent, latency is fine. The runbooks for this family must work without those classic signals.

Continues from 06-escalation-paths.md. This chapter develops the runbook family for the quality failure shape. Recurring concepts in bold: quality regression card, deploy-suspect path, slice isolation, rollback-then-investigate, eval-set expansion.

The apparatus is built; this chapter is the first of three that develop the specific runbook content per failure family. Degraded quality is first because it is the family the classic apparatus most often misses, and the family with the highest user impact when missed.


What "degraded quality" covers

Three shapes within the family:

Shape                    | What happens                                                        | First-line signal
Deploy-caused regression | Prompt or model change degrades some slice                          | Quality alert in the deploy-anchored window
Distribution shift       | Production traffic changed (new feature surface, new tenant cohort) | Quality alert outside the deploy window
Upstream change          | Retrieval, tools, or context source changed shape                   | Quality alert correlated with upstream system change

Each has a different runbook. The on-call's first job after the page is to identify which shape they are in.
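
The triage can be sketched as a small function that maps the first-line signal to a runbook. This is a minimal illustration, not the chapter's tooling: the one-hour window, the function names, and the idea that deploy and upstream change times arrive as plain timestamps are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Sequence

# Assumption: how far back a change still counts as "in window".
DEPLOY_WINDOW = timedelta(hours=1)


def in_window(alert_at: datetime, changed_at: Sequence[datetime],
              window: timedelta = DEPLOY_WINDOW) -> bool:
    """True if any change landed within `window` before the alert."""
    return any(timedelta(0) <= alert_at - t <= window for t in changed_at)


def pick_runbook(alert_at: datetime,
                 deploys: Sequence[datetime],
                 upstream_changes: Sequence[datetime]) -> str:
    """Map the first-line signal to one of the three shapes."""
    if in_window(alert_at, deploys):
        return "runbook 1: deploy-suspect"
    if in_window(alert_at, upstream_changes):
        return "runbook 3: upstream-change"
    # No deploy and no correlated upstream change: suspect the traffic.
    return "runbook 2: distribution-shift"
```

Checking the deploy window first mirrors the chapter's ordering, where the deploy-suspect path is the most common case.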


The deploy-suspect runbook (most common case)

The deploy-anchored window catches 60-80% of quality regressions. The runbook:

1. Acknowledge page (5 min SLO). Open incident channel; post payload.
2. Confirm the deploy in window: open deploy log, identify the deploy ID
   matching the payload. If absent, switch to runbook 2 (distribution shift)
   or runbook 3 (upstream change) per signal.
3. Confirm the affected slice: open the eval-on-traffic dashboard,
   identify the regressed slice(s) and magnitude.
4. Execute the rollback command from the payload:
       release-mgr rollback --deploy-id <id>
   Expected: confirmation with prior version restored.
   If it fails: escalate to release management on-call.
5. Wait 5 minutes; recheck the quality alert. Expect: clear or trend toward baseline.
   If not clearing: the regression may be independent of the deploy;
   continue with runbook 2 or runbook 3 per signal.
6. Confirm clear; close the immediate incident.
7. File postmortem with: deploy ID, affected slices, root cause hypothesis,
   eval-set expansion action.

The runbook is short by design. The on-call's job in the first 15 minutes is not to understand the regression; it is to contain it. Investigation is the postmortem's job.
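
The branching in the steps above can be sketched as code. The sketch below is hypothetical: `run_deploy_suspect`, `Outcome`, and the injected `rollback` and `alert_cleared` callables are illustrative names, and only the command shape `release-mgr rollback --deploy-id <id>` comes from the runbook itself. The command is built but never executed here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    contained: bool
    next_step: str


def rollback_command(deploy_id: str) -> List[str]:
    # Command shape taken from step 4 of the runbook.
    return ["release-mgr", "rollback", "--deploy-id", deploy_id]


def run_deploy_suspect(deploy_id: Optional[str],
                       rollback: Callable[[List[str]], bool],
                       alert_cleared: Callable[[], bool]) -> Outcome:
    """Walk steps 2-7; every branch ends in a named next step."""
    if deploy_id is None:  # step 2: no deploy in window
        return Outcome(False, "switch to runbook 2 or 3 per signal")
    if not rollback(rollback_command(deploy_id)):  # step 4
        return Outcome(False, "escalate to release management on-call")
    if not alert_cleared():  # step 5: recheck after the wait
        return Outcome(False, "regression may be independent of the deploy")
    return Outcome(True, "close incident; file postmortem")  # steps 6-7
```

Passing the rollback and recheck in as callables keeps the flow testable in a drill without touching a real release system.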


The distribution-shift runbook

When the quality alert fires outside the deploy window, the cause is usually a shift in production traffic. The runbook:

1. Acknowledge page. Confirm no deploy in window.
2. Open the slice dashboard. Identify which slice has shifted and how:
   - Volume change: is a slice much larger than baseline?
   - New input pattern: do traces show unfamiliar formats?
   - New tenant: is a recently onboarded tenant dominating the regression?
3. Decision: if volume change, the eval set may be missing coverage;
   if new pattern, the prompt may not handle the format;
   if new tenant, the cohort may have specific needs.
4. Mitigation options:
   - Tighten the eval set's slice coverage; rerun production eval scoring.
   - Disable the feature for the affected cohort if user impact is severe.
   - Escalate to the feature lead for product-level decision.
5. Postmortem actions: eval-set expansion to cover the new pattern,
   prompt updates if needed, cohort-level monitoring going forward.

This is the runbook the on-call hopes to never need; distribution shifts are slower to contain and rarely have a single-command kill path.
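
Two of the step 2 questions (volume change and new tenant) lend themselves to a quick check. The sketch below assumes slice traffic is available as per-slice request counts and that regressed traces carry a tenant ID; the thresholds and function names are illustrative, not tuned values.

```python
from collections import Counter
from typing import Dict, List, Optional

# Assumptions: illustrative thresholds only.
VOLUME_RATIO = 2.0   # slice share of traffic at least doubled vs. baseline
TENANT_SHARE = 0.5   # one tenant owns half or more of the regressed traces


def shifted_slices(baseline: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Slices whose share of traffic grew by VOLUME_RATIO or more."""
    base_total = sum(baseline.values())
    cur_total = sum(current.values())
    if base_total == 0 or cur_total == 0:
        return []
    flagged = []
    for name, count in current.items():
        if count == 0:
            continue
        cur_share = count / cur_total
        base_share = baseline.get(name, 0) / base_total
        # A slice absent from the baseline counts as shifted.
        if base_share == 0 or cur_share / base_share >= VOLUME_RATIO:
            flagged.append(name)
    return sorted(flagged)


def dominant_tenant(regressed_trace_tenants: List[str]) -> Optional[str]:
    """Tenant that accounts for TENANT_SHARE or more of regressed traces."""
    if not regressed_trace_tenants:
        return None
    tenant, n = Counter(regressed_trace_tenants).most_common(1)[0]
    return tenant if n / len(regressed_trace_tenants) >= TENANT_SHARE else None
```

Comparing shares rather than raw counts keeps an overall traffic increase from flagging every slice at once.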


The upstream-change runbook

When the quality alert correlates with a change in retrieval, tool execution, or context source, the runbook:

1. Acknowledge page. Note correlation: which upstream system changed,
   how recently.
2. Open the upstream's recent deploy or change log. Identify candidates.
3. Confirm impact: trace samples from the alert. Are they all hitting the
   affected upstream path?
4. Mitigation:
   - If the upstream has a rollback path, escalate to its on-call.
   - If not, the gateway/feature may have a fallback (cached retrieval,
     simpler tool, prior context source).
5. Decision: contain via fallback, or wait for upstream rollback.
6. Coordinate with upstream on-call; share alert payload and trace samples.
7. Postmortem actions: cross-system change coordination, fallback validation.

The hardest part is the cross-team coordination; the runbook's job is to make the coordination start in the first 5 minutes.
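
Step 3 (confirm impact) and the step 4 branch can be sketched as follows. The trace representation (each sample as a set of upstream component names) and the function names are assumptions made for illustration.

```python
from typing import Iterable, Set


def upstream_hit_rate(trace_paths: Iterable[Set[str]], suspect: str) -> float:
    """Fraction of alert trace samples that touched the suspect upstream."""
    paths = list(trace_paths)
    if not paths:
        return 0.0
    return sum(suspect in p for p in paths) / len(paths)


def choose_mitigation(has_upstream_rollback: bool, has_fallback: bool) -> str:
    """Step 4: prefer the upstream's own rollback path, else a local fallback."""
    if has_upstream_rollback:
        return "escalate to upstream on-call for rollback"
    if has_fallback:
        return "contain via fallback"
    # Neither path exists: coordination (step 6) is all that is left.
    return "coordinate with upstream on-call"
```

A hit rate near 1.0 supports the upstream-change hypothesis; a low rate suggests the correlation is coincidental and another runbook may apply.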


Worked example — the policy_comparison regression

The Bengaluru insurance SaaS team's quality alert fires at 18:42. The on-call follows runbook 1 (deploy-suspect):

  • 18:42 — page acknowledged; channel opened.
  • 18:44 — deploy log shows prompt deploy at 18:00; deploy ID is in the payload.
  • 18:46 — eval-on-traffic dashboard confirms policy_comparison slice regressed by 19%; other slices flat or improved.
  • 18:48 — rollback command executed; success.
  • 18:53 — alert clears; user-feedback signal returns to baseline.
  • 19:05 — postmortem ticket opened with action items: eval-set expansion for policy_comparison, slice-level alert sensitivity tuning, prompt-author review process.

Time to contain: 11 minutes. Time to root-cause identified: 25 minutes (during postmortem). The runbook contained the impact; the postmortem captured the cause. The two are separate concerns and the runbook respects that separation.
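
The durations in the timeline can be checked with simple clock arithmetic. The timestamps below are the ones from the worked example; the dictionary keys and helper name are illustrative.

```python
from datetime import datetime

# Timestamps copied from the worked example (same day, 24-hour clock).
timeline = {
    "deploy": "18:00",
    "alert_fired": "18:42",
    "rollback": "18:48",
    "alert_cleared": "18:53",
    "postmortem_opened": "19:05",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_rollback = minutes_between(timeline["alert_fired"], timeline["rollback"])      # 6
time_to_contain = minutes_between(timeline["alert_fired"], timeline["alert_cleared"])  # 11
deploy_to_clear = minutes_between(timeline["deploy"], timeline["alert_cleared"])       # 53
```

The 53 minutes from deploy to clear is worth noting alongside the 11 minutes to contain: most of the exposure in this example falls before the alert fired, which is why the postmortem lists alert sensitivity tuning as an action item.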


Slice isolation as a discipline

The aim of the quality runbooks is to isolate the regression to specific slices and act narrowly. Three principles:

  • Identify the slice first, then act. Acting on aggregate without knowing the slice often over-broadens the mitigation.
  • Prefer slice-scoped mitigations. Disable a feature for one cohort rather than for all users; a rollback should affect only the changed surface.
  • Verify the slice is contained. After mitigation, the slice's score should recover; other slices should not be affected by the mitigation itself.

Slice isolation is what distinguishes a 12-minute incident from a 12-hour incident — the smaller the affected slice, the less recovery work, and the cleaner the postmortem.
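
The third principle (verify the slice is contained) can be sketched as a comparison of per-slice scores before the regression and after the mitigation. The score dictionaries, the tolerance, and the function name are assumptions for illustration.

```python
from typing import Dict, List

# Assumption: how far below baseline a score may sit and still count as recovered.
TOLERANCE = 0.02


def containment_check(baseline: Dict[str, float],
                      after: Dict[str, float],
                      mitigated_slice: str,
                      tolerance: float = TOLERANCE) -> Dict[str, List[str]]:
    """Report slices still below baseline after the mitigation.

    `not_recovered` holds the mitigated slice if it has not come back;
    `collateral` holds other slices that dropped, possibly due to the mitigation.
    """
    not_recovered: List[str] = []
    collateral: List[str] = []
    for name, base in baseline.items():
        score = after.get(name)
        if score is None:
            continue  # no post-mitigation score for this slice yet
        if score < base - tolerance:
            if name == mitigated_slice:
                not_recovered.append(name)
            else:
                collateral.append(name)
    return {"not_recovered": not_recovered, "collateral": collateral}
```

An empty result on both keys is the signal that the mitigation was narrow and effective.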


Operational signals

Healthy. Quality runbooks fire and resolve within the containment SLO (typically 15-30 minutes). Drills exercise all three runbooks at least quarterly. Postmortem follow-ups produce eval-set expansions that cover the observed regressions.

First degrading metric. Mean time to contain creeping up. The on-call is taking longer to act; the runbooks may be drifting from the current system.

Misleading metric. Number of quality incidents. A platform with sensitive alerts catches more incidents than one whose alerts are blind to quality, so a high count can reflect healthy detection as easily as a heavy incident load. The metric to watch is mean time to contain per incident.

Expert graph. Per-runbook time-to-contain distribution, per-slice incident count, eval-set expansion rate (postmortem follow-ups producing eval coverage). The combination shows runbook health and learning velocity.
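
Two of the expert-graph series can be computed from a simple incident log. The `Incident` record and its field names are hypothetical; a real incident tracker will have its own schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class Incident(NamedTuple):
    runbook: str                # "deploy-suspect", "distribution-shift", "upstream-change"
    minutes_to_contain: float
    eval_set_expanded: bool     # did the postmortem produce eval coverage?


def mttc_by_runbook(incidents: List[Incident]) -> Dict[str, float]:
    """Mean time to contain, grouped by the runbook that handled the incident."""
    buckets = defaultdict(list)
    for inc in incidents:
        buckets[inc.runbook].append(inc.minutes_to_contain)
    return {runbook: mean(values) for runbook, values in buckets.items()}


def expansion_rate(incidents: List[Incident]) -> float:
    """Share of incidents whose postmortem led to an eval-set expansion."""
    if not incidents:
        return 0.0
    return sum(inc.eval_set_expanded for inc in incidents) / len(incidents)
```

Grouping by runbook matters because a healthy deploy-suspect number can hide a slow distribution-shift path when the two are averaged together.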


Boundary of applicability

Strong fit. Production AI features with measurable slice-level eval signal. The full three-runbook family is justified.

Pathology. A team running aggregate-only evals cannot isolate slices; the runbooks degrade into wide rollbacks. The fix is upstream — slice the evals before relying on the runbooks.
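
The arithmetic behind this pathology is worth seeing once. In the sketch below, all scores and traffic weights are hypothetical; only the 19% slice regression echoes the worked example. A slice carrying 10% of traffic drops by 19%, yet the traffic-weighted aggregate moves by under 2%.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Traffic-weighted mean score across slices."""
    total = sum(weights[s] for s in scores)
    return sum(scores[s] * weights[s] for s in scores) / total


# Hypothetical traffic mix (percent of requests) and eval scores.
weights = {"policy_comparison": 10, "claims_faq": 50, "quote": 40}
before = {"policy_comparison": 0.90, "claims_faq": 0.90, "quote": 0.90}
after = dict(before, policy_comparison=0.90 * (1 - 0.19))  # 19% slice regression

agg_before = aggregate(before, weights)
agg_after = aggregate(after, weights)
relative_drop = (agg_before - agg_after) / agg_before  # about 0.019
```

An aggregate-only alert tuned to, say, a 5% drop would stay silent here, while a slice-level alert would fire immediately.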

Scale limit. Very large platforms have many slices; the runbooks reference per-slice escalation paths rather than a single one. The pattern is the same; the per-slice owner list is longer.


Failure-prone assumption

The seductive wrong belief: the on-call's job is to find the root cause. In the first 15 minutes it is not. The on-call's job is to contain the impact and preserve evidence for the postmortem. Treating the runbook as a debugging guide extends the incident window while users stay exposed, and it invites improvised fixes built on a half-formed diagnosis.

The correct belief: contain first, investigate after. The rollback-then-investigate pattern is structurally faster and structurally safer than investigate-then-act.


Where this appears in production

  • A fintech rolls back a prompt deploy 11 minutes after the quality alert; postmortem captures the slice regression and eval-set expansion.
  • A telecom AI has a distribution-shift runbook that catches a new tenant cohort regression; mitigation is per-cohort feature disable.
  • A consumer chatbot treats the runbook as a debugging guide; mean time to contain is 90 minutes.
  • A healthtech AI rolls back immediately on a regulator-affecting slice regression; postmortem catches the upstream change.
  • A coding assistant has all three runbooks drilled quarterly; mean time to contain is consistently under 20 minutes.
  • A retail AI has aggregate-only evals; runbook 1 produces wide rollbacks on narrow regressions; user impact extends to unaffected cohorts.
  • A logistics AI correlates a quality regression with a retrieval index change; runbook 3 coordinates the upstream rollback.
  • A government AI has the postmortem capture eval-set expansion as a mandatory follow-up; eval coverage grows over time.
  • A B2B SaaS isolates a regression to one cohort within 8 minutes; mitigation is cohort-scoped.
  • A travel platform has runbook 2 (distribution shift) drilled; the team identifies the new pattern in 12 minutes.
  • A payments AI has slice-level alerts that distinguish enterprise from consumer slices; runbooks act per-tier.
  • A legal AI rolls back a model upgrade after slice regression on a specific clause type; postmortem expands the eval.
  • A staffing AI has cross-team coordination drilled for upstream-change incidents; team on-call SLOs are mutually agreed.
  • A search-rerank service rolls back a candidate model within the canary; the regression is caught before promotion.
  • A document AI isolates the regression to one document class; the mitigation disables that class only.
  • A media AI has a runbook for the distribution-shift case where a new feature surface drives traffic to an unprepared intent.
  • An ad-tech AI has eval-set expansion as a recurring postmortem action; the eval set grows monotonically with the system.
  • A real-estate AI has a runbook 3 case where a retrieval source change degrades address matching; a cross-team rollback contains it.
  • A small SaaS has only runbook 1 written; runbooks 2 and 3 are improvised when needed; mean time to contain is bimodal.
  • A medical AI has slice-isolation drilled for every regulatory-affecting slice; mitigations are always scoped to the affected slice.

Recall / checkpoint

  1. Name the three shapes within the degraded-quality family.
  2. What is the first-line signal that distinguishes deploy-caused regression from distribution shift?
  3. Why is "contain first, investigate after" the structural rule?
  4. What does slice isolation mean and why does it matter?
  5. What is the discipline the postmortem owes after a quality incident?
  6. How does the upstream-change runbook differ from the deploy-suspect runbook?
  7. What metric tells you the runbook plane is degrading on quality incidents?

Interview Q&A

Q1. The quality alert fires; the on-call cannot decide whether to roll back. Walk through the right framing. The on-call's job is to contain, not to be certain. If the deploy is in the window and the slice is regressed, the rollback is the structurally faster path even if not 100% confirmed. The rollback either resolves the alert (confirming the deploy was the cause) or does not (in which case investigation continues). Waiting to be certain often costs 30-60 minutes of user impact; rolling back costs minutes of system churn. The asymmetry argues for rollback-first. Common wrong answer to avoid: "wait for certainty" — certainty in 15 minutes is rare; rollback gives certainty in 5.
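
The asymmetry in Q1 can be made concrete with a toy expected-value model. Everything here is an assumption for illustration: the model treats investigation and rollback as fixed durations, assumes that a rollback of the wrong cause does no extra harm, and uses hypothetical inputs.

```python
from typing import Tuple


def expected_impact_minutes(p_deploy_is_cause: float,
                            investigate_minutes: float,
                            rollback_minutes: float = 5.0) -> Tuple[float, float]:
    """Expected minutes of user impact under the two strategies.

    Rollback-first: pay the rollback; with probability (1 - p) it did not
    help, so investigation and a second mitigation of equal length follow.
    Wait-for-certainty: always pay the investigation, then the mitigation.
    """
    miss = 1.0 - p_deploy_is_cause
    rollback_first = rollback_minutes + miss * (investigate_minutes + rollback_minutes)
    wait_for_certainty = investigate_minutes + rollback_minutes
    return rollback_first, wait_for_certainty


# Hypothetical inputs: 70% chance the deploy is the cause, 45 minutes to investigate.
rollback_first, wait_for_certainty = expected_impact_minutes(0.7, 45.0)
```

With these inputs the model gives roughly 20 expected minutes for rollback-first against 50 for waiting. Under these assumptions rollback-first is never worse, and the gap widens as the deploy becomes the likelier cause.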

Q2. The quality alert fires outside the deploy window. The on-call defaults to runbook 1 (deploy-suspect). What is the failure? Runbook 1 has no deploy to roll back; the on-call wastes the first ten minutes looking. The correct path is to switch to runbook 2 (distribution shift) or runbook 3 (upstream change) per signal. The runbook should fork at step 2: "if no deploy in window, switch to runbook 2/3 per signal." The branching is explicit, not implicit. Common wrong answer to avoid: "the on-call should know which runbook" — under stress, runbook design carries the load, not on-call judgment.

Q3. The team's evals are aggregate-only. Walk through how this degrades the quality runbooks. The runbooks rely on slice isolation to act narrowly. Without slice-level eval scores, the on-call cannot identify which slice is regressed; the rollback is wide (the entire deploy) when it could be narrow (one intent's prompt). The mitigation impacts users not affected by the original regression. The fix is upstream — slice the evals — before the runbooks can act precisely. Common wrong answer to avoid: "we'll just roll back everything" — wide rollbacks impose collateral damage on healthy slices.

Q4. What is the postmortem's discipline after a quality incident? Capture the affected slice, the root cause, the mitigation taken, and at least one eval-set expansion action. The eval-set expansion is the mechanism by which the next instance of this regression is caught earlier. Without that action, the same shape can recur. Other follow-ups: alert sensitivity tuning if the alert was late or noisy, runbook update if the runbook missed a step, prompt-author review process if the regression escaped review. Common wrong answer to avoid: "the rollback is the fix" — the rollback contains; eval expansion prevents recurrence.

Q5. How does cross-team coordination work in an upstream-change incident? The runbook explicitly names the upstream team's on-call as the hop to engage. The on-call from the affected feature pages the upstream on-call with: the quality alert payload, trace samples showing the upstream path, the correlation with the upstream's change. Both on-calls join the same incident channel; both rollback decisions are coordinated. The discipline is to bring the upstream in early, not after the affected team has spent 30 minutes guessing. Common wrong answer to avoid: "we'll diagnose first then escalate" — for upstream-suspect incidents, early cross-team coordination is the faster path.

Q6. Why is "mean time to contain" the right metric and not "mean time to root cause"? Because containment ends user impact; root cause comes after. Optimising for root cause time can pull the on-call into investigation before containment, extending user impact. The two metrics are tracked separately: mean time to contain for the runbook plane's health; mean time to root cause for the investigation discipline. Both are important; conflating them produces wrong incentives. Common wrong answer to avoid: "they're the same thing" — they are different stages with different optimisations.


Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (the policy_comparison regression). Identify each runbook step taken and the time at each step. Verify the runbook's branching at step 2 (deploy in window vs. not).

Your turn. Take your team's quality alert. Walk through what runbook the on-call would follow today; identify any branch the runbook does not handle. Estimate the mean time to contain for a deploy-suspect incident, a distribution shift, and an upstream change.

Reproduce from memory. Write the deploy-suspect runbook's seven steps from memory. The signal of internalisation is that the steps land in under three minutes with the right order and the right branching.


Operational memory

This chapter explained the degraded-quality runbook family: deploy-suspect, distribution-shift, upstream-change, each scoped to one shape with explicit branching and contain-first discipline. The important idea is that the on-call's job is to contain, not to debug; investigation belongs to the postmortem.

You learned to identify which sub-runbook applies, to execute slice-isolated mitigations, and to capture the eval-set expansion that prevents recurrence. That solves the opening failure because the quality regression — the hardest AI failure to detect — now has a tested response.

Carry this diagnostic forward: when a quality alert fires, ask which of the three shapes it is, and act through the matching runbook. The branching is the apparatus's discipline against improvisation.

Remember:

  • Three shapes: deploy-suspect, distribution-shift, upstream-change.
  • Contain first, investigate after.
  • Slice isolation is what makes the mitigation narrow.
  • Mean time to contain is the metric, not mean time to root cause.
  • Eval-set expansion is the postmortem's required follow-up.

Bridge. Quality is one runbook family. Provider issues and cost runaway are two more, with their own shapes and kill paths. The next chapter is those runbooks. → 08-provider-and-cost-runbooks.md