11. Incident Response — runbooks beat brave Slack threads

~15 min read. Alerts are useful only if the next moves are already decided.

Built on the ELI5 in 00-eli5.md. The production monitor — the factory siren board — now hands control to a written response plan.


1) First picture: alerts need a railway track

Look.

  alert fires
       ▼
┌──────────────┐
│ severity map │
└──────┬───────┘
       ▼
┌──────────────┐
│ owner on-call│
└──────┬───────┘
       ▼
┌──────────────┐
│ triage list  │
└──────┬───────┘
       ▼
┌──────────────┐
│ mitigate     │
│ rollback     │
└──────┬───────┘
       ▼
┌──────────────┐
│ verify heal  │
└──────┬───────┘
       ▼
┌──────────────┐
│ postmortem   │
└──────────────┘

Many teams handle AI incidents through heroic chat messages. That feels fast for ten minutes. Then it becomes chaos.

One person asks for logs. Another person changes prompts. A third person restarts pods. Nobody knows the owner or severity.

So what to do? Write a runbook before the fire. Simple, no?

A good runbook is a railway track. Once the alert fires, the next station is already known. The production monitor raises the hand. The runbook tells the team what to do next.

The upgrade without downtime matters here too. If you can shift traffic safely, you buy time. If you cannot, every decision becomes more stressful.


2) What every AI incident runbook must contain

A runbook should be boring. That is a compliment. Boring means repeatable under pressure.

Minimum contents are these.

Trigger conditions

State exactly what starts the incident. Examples: p95 latency stays above 5 seconds for 10 minutes, refusal rate doubles on paid traffic, or revenue drops beyond the agreed threshold.
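A trigger like the latency example can be written down as a small, testable rule. The sketch below is one minimal way to do that. The class name `LatencyTrigger`, its fields, and the assumption of one p95 sample per minute are illustrative, not taken from any specific monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTrigger:
    """Hypothetical trigger: p95 latency above a limit for a sustained window."""
    threshold_seconds: float = 5.0  # limit from the example in the text
    sustained_minutes: int = 10     # breach must last this long

    def fires(self, p95_per_minute: list[float]) -> bool:
        # Assumes one p95 sample per minute, oldest first.
        window = p95_per_minute[-self.sustained_minutes:]
        if len(window) < self.sustained_minutes:
            return False  # not enough data to call the breach sustained
        return all(sample > self.threshold_seconds for sample in window)


trigger = LatencyTrigger()
steady_breach = [5.4] * 10           # ten bad minutes in a row
brief_spike = [1.2] * 9 + [9.0]      # one bad minute only

print(trigger.fires(steady_breach))  # sustained breach
print(trigger.fires(brief_spike))    # single spike
```

Requiring the whole window to breach is a design choice: it trades a slower alert for fewer false alarms from single spikes.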

Severity rubric

Decide severity before emotions arrive. For example:

  • Sev 1: broad user harm or regulatory risk.
  • Sev 2: major degradation with workaround.
  • Sev 3: partial degradation or internal-only issue.
  • Sev 4: minor anomaly, watch and investigate.

This avoids endless debate in the first fifteen minutes. Yes?
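The rubric can also live as code so that the same inputs always give the same severity. This is a sketch under assumptions: the boolean inputs are a hypothetical simplification of the four levels listed above, and a real rubric would likely use measured thresholds instead of flags.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # broad user harm or regulatory risk
    SEV2 = 2  # major degradation with workaround
    SEV3 = 3  # partial degradation or internal-only issue
    SEV4 = 4  # minor anomaly, watch and investigate


def classify(broad_user_harm: bool = False,
             regulatory_risk: bool = False,
             major_degradation: bool = False,
             partial_or_internal: bool = False) -> Severity:
    """Return the most severe level whose condition holds (hypothetical flags)."""
    if broad_user_harm or regulatory_risk:
        return Severity.SEV1
    if major_degradation:
        return Severity.SEV2
    if partial_or_internal:
        return Severity.SEV3
    return Severity.SEV4


print(classify(major_degradation=True).name)
print(classify().name)
```

Checking the most severe condition first means an incident that matches several rows is never under-classified.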

Owner on-call

Name the first responder role. Not a loose group. Use one accountable owner. That may be the ML platform engineer, search engineer, or product SRE.

Triage checklist

The first responder should not invent a checklist live. Have one written already. Typical checks are:

  • confirm the alert is real,
  • inspect blast radius,
  • check recent deployments,
  • check provider status,
  • compare control versus treatment traffic,
  • decide whether user harm is active now.

Rollback steps

Rollback is not one button in AI systems. You may need to roll back weights, prompt, routing policy, thresholds, or retrieval index. Sometimes the assembly line shipped code safely, but the live failure sits in prompt text or external retrieval data.
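One way to make "rollback is not one button" concrete is to keep a manifest of every rollbackable artifact and diff the live manifest against the last known-good one. The sketch below assumes such manifests exist as plain dictionaries. The version labels (`p-1`, `idx-2`, and so on) and the function name are hypothetical.

```python
# Artifact kinds named in the text; a real system may have more.
ROLLBACKABLE = ("weights", "prompt", "routing_policy", "thresholds", "retrieval_index")


def artifacts_to_roll_back(live: dict, last_good: dict) -> dict:
    """Return {artifact: (live_version, last_good_version)} for each artifact that differs."""
    return {
        kind: (live[kind], last_good[kind])
        for kind in ROLLBACKABLE
        if live[kind] != last_good[kind]
    }


# Hypothetical manifests for illustration only.
last_good = {
    "weights": "w-1",
    "prompt": "p-1",
    "routing_policy": "r-1",
    "thresholds": "t-1",
    "retrieval_index": "idx-1",
}
live = dict(last_good, prompt="p-2", retrieval_index="idx-2")

plan = artifacts_to_roll_back(live, last_good)
for kind, (current, target) in plan.items():
    print(f"{kind}: {current} -> {target}")
```

Note that application code is deliberately absent from the diff here: the point is that the manifest covers the non-code pieces a code rollback would miss.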

Communication template

You need a fixed message shape. Write one for internal stakeholders and one for customer-facing teams. State impact, current action, next update time, and owner.

Recovery verification

Do not close an incident because a deploy succeeded. Close it because user impact stopped. Verification should name the exact metrics to watch after mitigation.

Postmortem

The runbook should force a postmortem. Capture timeline, root cause, detection gap, and prevention action. Without this, the same incident comes back.


3) LLM incidents have extra failure modes

Classic software already has outages and latency spikes. LLM systems add some strange failures on top. That means the runbook needs extra branches.

Provider outage

The hosted model provider may slow down or fail. So what to do? Have fallback routes, cached answers, reduced features, or a smaller local path. The warehouse can help you know which fallback model is approved.
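The fallback idea can be sketched as an ordered list of routes that are tried in turn. Everything named here is an assumption for illustration: the `RouteFailed` exception, the three stub routes, and the cache contents. A real hosted-model call would replace the stub that simulates the outage.

```python
class RouteFailed(Exception):
    """Raised by a route that cannot answer (hypothetical error type)."""


def hosted_model(prompt: str) -> str:
    raise RouteFailed("provider timeout")  # simulate the provider outage


CACHE = {"reset password": "Use the 'Forgot password' link on the sign-in page."}


def cached_answer(prompt: str) -> str:
    try:
        return CACHE[prompt]
    except KeyError:
        raise RouteFailed("cache miss") from None


def small_local_model(prompt: str) -> str:
    return f"[reduced mode] short answer for: {prompt}"


# Order matters: best quality first, most degraded last.
ROUTES = [("hosted", hosted_model), ("cache", cached_answer), ("local", small_local_model)]


def answer_with_fallback(prompt: str, routes=ROUTES) -> tuple[str, str]:
    """Try each route in order; return (route_name, answer) from the first that works."""
    failures = []
    for name, call in routes:
        try:
            return name, call(prompt)
        except RouteFailed as exc:
            failures.append(f"{name}: {exc}")
    raise RouteFailed("all routes failed: " + "; ".join(failures))


print(answer_with_fallback("reset password"))
print(answer_with_fallback("refund status"))
```

Returning the route name alongside the answer lets the incident owner see how much traffic is running in degraded mode.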

Prompt regression

A tiny prompt edit can change behavior a lot. Latency can rise. Output length can blow up. Safety wording can over-trigger. Reasoning style can become inconsistent.

Treat prompt versions like production artifacts. They deserve rollback steps too.

Retrieval freshness failure

The LLM may be fine. The retrieval index may be stale. Then answers become confidently outdated. This is especially dangerous because system metrics may stay green.
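Because latency and error metrics can stay green while the index ages, freshness needs its own signal. A minimal sketch, assuming the build date of the index is recorded somewhere and that seven days is the agreed limit (both are assumptions, as is the function name):

```python
from datetime import date


def index_is_stale(built_on: date, today: date, max_age_days: int = 7) -> bool:
    """True if the retrieval index is older than the agreed freshness limit."""
    return (today - built_on).days > max_age_days


# Hypothetical dates for illustration.
print(index_is_stale(date(2025, 3, 1), today=date(2025, 3, 5)))   # 4 days old
print(index_is_stale(date(2025, 3, 1), today=date(2025, 3, 20)))  # 19 days old
```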

Safety false positives

Sometimes the safety layer blocks too much. Refusal rate rises. Good users get denied. Revenue drops while the team thinks the guardrail is helping.

Cost spike

An incident is not only downtime. A silent output-token explosion is also an incident. GPU saturation and API spend explosions deserve runbooks. Otherwise, finance discovers the problem before engineering does.
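A silent output-token explosion can be caught by comparing recent output length against a baseline. The sketch below assumes per-request output token counts are already logged. The function name, the sample numbers, and the 2x ratio limit are illustrative assumptions.

```python
from statistics import mean


def token_spike(baseline_tokens, recent_tokens, max_ratio=2.0):
    """Return (is_spike, ratio) comparing mean output tokens per request."""
    ratio = mean(recent_tokens) / mean(baseline_tokens)
    return ratio > max_ratio, ratio


# Hypothetical samples: output tokens per request.
baseline = [200, 220, 180]
recent = [600, 640, 560]

is_spike, ratio = token_spike(baseline, recent)
print(is_spike, ratio)
```

A ratio is used instead of an absolute token limit so the same rule works for features with very different normal output lengths.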

Look. LLM incidents often cross layers. Model, prompt, routing, retrieval, and provider may all interact. That is why a brave Slack thread fails. It cannot hold stable structure under pressure.


4) Worked example: rollback is wider than old code

Suppose your support bot has these live parts.

  • application code version api-v18
  • prompt version prompt-v7
  • model weights support-3.2
  • retrieval index kb-2025-02-15
  • router policy small-first-then-large

At 2:10 PM, the production monitor shows trouble. Refusal rate jumps from 4% to 19%. Ticket deflection falls from 42% to 25%. Latency barely changes.

Recent changes are:

  • 1:30 PM: new prompt prompt-v7
  • 1:40 PM: new retrieval index kb-2025-02-15
  • no code deploy,
  • no model weight deploy.

A weak team says, "Roll back code." That does nothing. There was no code change.
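The "check recent deployments" step can be sketched as a filter over a change log. The times and versions below come from the example above. The change-log structure, the two-hour lookback window, and the function name are assumptions.

```python
from datetime import timedelta

# Times of day expressed as offsets from midnight.
ALERT_AT = timedelta(hours=14, minutes=10)  # 2:10 PM

# (artifact kind, new version, time of change), taken from the example.
CHANGE_LOG = [
    ("prompt", "prompt-v7", timedelta(hours=13, minutes=30)),
    ("retrieval_index", "kb-2025-02-15", timedelta(hours=13, minutes=40)),
]


def rollback_candidates(change_log, alert_at, lookback=timedelta(hours=2)):
    """Return (kind, version) for every artifact changed shortly before the alert."""
    return [
        (kind, version)
        for kind, version, changed_at in change_log
        if alert_at - lookback <= changed_at <= alert_at
    ]


candidates = rollback_candidates(CHANGE_LOG, ALERT_AT)
changed_kinds = {kind for kind, _ in candidates}

print(candidates)
print("code" in changed_kinds)  # code never changed, so rolling it back cannot help
```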

A better team uses the runbook.

  1. Confirm the alert on paid English traffic.
  2. Compare prompt versions on shadow requests.
  3. Inspect retrieval freshness and blocked intents.
  4. Roll back the prompt first.

Suppose the refusal rate drops from 19% to 8% and deflection rises from 25% to 36%. Better, but still not normal. Now inspect the retrieval index and roll back to kb-2025-02-01. Refusal rate settles at 5% and deflection recovers to 41%, close to the original 4% and 42%.
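The numbers in this example can be turned into a rough "share of the regression recovered" figure after each rollback. This metric is an assumption added for illustration, not something the runbook itself defines: it is the distance moved back from the worst value, divided by the full gap between the worst value and the pre-incident baseline.

```python
def recovered_fraction(baseline: float, worst: float, current: float) -> float:
    """Share of the regression undone; works whether the metric rose or fell."""
    return (current - worst) / (baseline - worst)


# Refusal rate (%): baseline 4, worst 19, then 8 and 5 after each rollback.
refusal_after_prompt = recovered_fraction(4, 19, 8)
refusal_after_index = recovered_fraction(4, 19, 5)

# Ticket deflection (%): baseline 42, worst 25, then 36 and 41.
deflection_after_prompt = recovered_fraction(42, 25, 36)
deflection_after_index = recovered_fraction(42, 25, 41)

print(round(refusal_after_prompt, 2), round(refusal_after_index, 2))
print(round(deflection_after_prompt, 2), round(deflection_after_index, 2))
```

On these numbers the prompt rollback alone undoes roughly two thirds to three quarters of the damage, which is why the team keeps investigating instead of closing the incident.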

See the lesson. Rollback may require two artifacts, not one. The upgrade without downtime lets you shift traffic safely while testing. That is why runbooks must name every rollbackable piece.


5) Communication, recovery, and learning

Incidents are technical and social together. If stakeholders are confused, the incident feels larger. If updates are crisp, panic stays lower.

Use a simple template.

  • what users are seeing,
  • when it started,
  • what scope is affected,
  • what mitigation is running,
  • when the next update will arrive,
  • who owns the call.
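The template above can be fixed in code so that every update has the same shape. This is a minimal sketch: the class and field names are hypothetical, and the sample values are loosely based on the support bot example, with the next update time invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """One incident update with the six fields from the template (hypothetical names)."""
    user_impact: str
    started_at: str
    scope: str
    mitigation: str
    next_update_at: str
    owner: str

    def render(self) -> str:
        return "\n".join([
            f"Impact: {self.user_impact}",
            f"Started: {self.started_at}",
            f"Scope: {self.scope}",
            f"Mitigation: {self.mitigation}",
            f"Next update: {self.next_update_at}",
            f"Owner: {self.owner}",
        ])


update = StatusUpdate(
    user_impact="Support bot is refusing about 19% of requests",
    started_at="2:10 PM",
    scope="Paid English traffic",
    mitigation="Rolling back the prompt change",
    next_update_at="2:40 PM",  # illustrative value
    owner="ML platform engineer on-call",
)
message = update.render()
print(message)
```

Because every field is required, an update cannot be sent without a next update time or an owner.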

After mitigation, do not close early. Recovery verification should check the production monitor for stability. Watch the key metric, the safety metric, and the business metric. Hold the incident open until they settle.
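"Hold the incident open until they settle" can be expressed as a check that every watched metric stayed inside its bounds for several consecutive samples. The metric names, bounds, and the three-sample hold below are assumptions chosen for illustration.

```python
def safe_to_close(history: dict, bounds: dict, hold: int = 3) -> bool:
    """True only if every watched metric stayed in bounds for the last `hold` samples."""
    for metric, (low, high) in bounds.items():
        recent = history.get(metric, [])[-hold:]
        if len(recent) < hold:
            return False  # not enough evidence yet
        if any(not (low <= value <= high) for value in recent):
            return False
    return True


# Hypothetical bounds: key metric, safety metric, business metric.
BOUNDS = {
    "p95_latency_s": (0.0, 5.0),
    "refusal_rate": (0.0, 0.06),
    "ticket_deflection": (0.38, 1.0),
}

still_settling = {
    "p95_latency_s": [1.1, 1.0, 1.2],
    "refusal_rate": [0.08, 0.05, 0.05],  # first sample is still too high
    "ticket_deflection": [0.36, 0.40, 0.41],
}
settled = {
    "p95_latency_s": [1.1, 1.0, 1.2],
    "refusal_rate": [0.05, 0.05, 0.05],
    "ticket_deflection": [0.40, 0.41, 0.41],
}

print(safe_to_close(still_settling, BOUNDS))
print(safe_to_close(settled, BOUNDS))
```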

Then write the postmortem. Ask these questions. Why was detection late? Why was triage slow? Which rollback path was missing? How should the assembly line or quality gate change now?

A good postmortem improves the next week. A bad postmortem only assigns blame. Choose the first one. Simple, no?


Where this lives in the wild

  • OpenAI platform operations — an incident commander coordinates provider fallback, quota controls, and customer communication during inference degradation.
  • GitHub Copilot reliability team — a staff engineer uses runbooks for model routing issues, prompt regressions, and retrieval freshness failures.
  • Swiggy support automation — an ML platform engineer rolls back prompt templates and knowledge snapshots when deflection suddenly drops.
  • Cloudflare AI Gateway operations — an SRE watches provider health, rate limits, and failover paths across multiple model vendors.
  • Stripe risk infrastructure — an on-call engineer uses severity rules and blast-radius checks before changing live fraud thresholds.

Pause and recall

  1. Why is a runbook better than handling incidents through ad hoc chat?
  2. Which eight parts should every AI incident runbook contain?
  3. Why is rollback wider than just code deployment in AI systems?
  4. Which LLM-specific failure modes deserve dedicated runbook branches?

Interview Q&A

Q: Why should severity be defined before the incident starts? A: Severity decisions made during the incident become emotional and inconsistent. A pre-written rubric speeds escalation, staffing, and customer communication.

Common wrong answer to avoid: "The incident commander can decide on the fly." That creates debate exactly when speed matters most.

Q: Why is rollback in AI systems wider than standard web rollback? A: Because what users experience depends on multiple live artifacts. Weights, prompts, routing policies, retrieval indexes, and thresholds can all change outcomes even when code stays fixed.

Common wrong answer to avoid: "Just redeploy the previous app version." That ignores non-code artifacts that often caused the failure.

Q: Why can a safety false-positive surge be treated as an incident? A: Because it creates active user harm, revenue loss, and trust damage. Excess blocking is still production degradation, even if the system looks conservative.

Common wrong answer to avoid: "Blocking more is always safer." Overblocking can be as damaging as underblocking in real products.

Q: What makes a postmortem useful after an AI incident? A: It should connect the timeline, root cause, detection gap, and prevention action. The goal is stronger systems, not theatrical blame.

Common wrong answer to avoid: "Postmortems are mainly for accountability." Accountability matters, but prevention is the real output.


Apply now (5 min)

Pick one AI feature you know. Write one trigger condition, one severity rule, and one owner role. Then list four rollbackable artifacts for that feature. Now sketch from memory:

  • the alert-to-postmortem railway,
  • the eight runbook sections,
  • the rollback stack beyond code.

Say aloud which part the production monitor handles, and which part the runbook handles.


Bridge. Incidents are expensive when they explode suddenly. But silent waste also hurts every day. Next we study how serving architecture, routing, and GPU utilization control cost before bills become their own incident. → 12-cost-optimization-serving.md