12. Incident response — run the emergency room, not a group panic

~15 min read. When an AI incident begins, good teams follow a playbook before they follow their feelings.

Built on the ELI5 in 00-eli5.md. The whole hospital comes together here: the triage desk classifies, the sealed ward isolates, the senior doctor makes hard calls, and the stability kit keeps users safe meanwhile.


1) First picture: incidents need roles immediately

An incident is not only a technical problem. It is a coordination problem. Without roles, everyone debugs, nobody decides, and user harm continues.

incident starts
    ├── incident lead       → owns decisions and timeline
    ├── technical lead      → drives diagnosis and mitigation
    ├── communications lead → updates stakeholders and users
    └── support liaison     → feeds real user impact back in

The simple version: AI incidents especially need user-impact clarity. A fluent wrong answer may hurt users before any dashboard spikes, so the support liaison matters a lot. The triage desk should quickly classify both the failure type and the harm scope.
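A minimal Python sketch of the role box above. The class name `Incident`, the role keys, and the people's names are illustrative assumptions, not part of any real tool. It shows one idea only: make unfilled roles visible the moment an incident opens.

```python
from dataclasses import dataclass, field

# The four roles from the diagram above (key names are hypothetical).
REQUIRED_ROLES = (
    "incident_lead",
    "technical_lead",
    "communications_lead",
    "support_liaison",
)


@dataclass
class Incident:
    title: str
    roles: dict[str, str] = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person

    def unfilled_roles(self) -> list[str]:
        # Roles nobody owns yet, in the order of the diagram.
        return [r for r in REQUIRED_ROLES if r not in self.roles]


incident = Incident("wrong refund guidance")
incident.assign("incident_lead", "ana")
incident.assign("technical_lead", "ben")
print(incident.unfilled_roles())  # ['communications_lead', 'support_liaison']
```

An empty list from `unfilled_roles()` is a reasonable gate before the team moves on to diagnosis.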

2) First 15 minutes: stabilize before root cause purity

Now what is the common mistake? Teams chase perfect diagnosis first. That wastes time. The first phase is stabilization. Typical actions are:

  • stop rollout,
  • open breaker,
  • enable fallback,
  • disable risky tools,
  • reduce traffic,
  • page domain owners,
  • open incident channel.

first 15 minutes
┌──────────────────────────────┐
│ 1. confirm user harm         │
│ 2. contain blast radius      │
│ 3. switch to safe mode       │
│ 4. collect evidence          │
└──────────────────────────────┘

Rollback or degrade first. Root cause can come after bleeding stops.

Worked example. A support assistant starts issuing wrong refund guidance after a prompt rollout. Good first actions:

  • disable refund action path,
  • revert prompt version,
  • send users to human support for refund cases,
  • capture affected traces.

Bad first action: fifteen engineers debating tokenization edge cases. The stability kit should activate before elegant analysis.
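The refund example can be sketched in Python as one containment function. Everything here is an assumption for illustration: the shape of the `state` dict, the flag name `refund_action_enabled`, and the prompt labels `v41` and `v42`. A real system would call its own flag service and router.

```python
from copy import deepcopy


def contain_refund_incident(state: dict) -> tuple[dict, list[str]]:
    """Return a contained copy of the state plus a log of actions taken."""
    new = deepcopy(state)  # leave the original untouched for later comparison
    log = []

    # 1. Disable the refund action path.
    new["flags"]["refund_action_enabled"] = False
    log.append("disabled refund action path")

    # 2. Revert to the previous prompt version, if there is one.
    if len(new["prompt_history"]) >= 2:
        new["active_prompt"] = new["prompt_history"][-2]
        log.append(f"reverted prompt to {new['active_prompt']}")

    # 3. Send refund cases to human support.
    new["routes"]["refund"] = "human_support"
    log.append("routed refund cases to human support")

    # 4. Capture the affected traces as evidence.
    new["evidence"] = [t for t in new["traces"] if t["intent"] == "refund"]
    log.append(f"captured {len(new['evidence'])} affected traces")
    return new, log


state = {
    "flags": {"refund_action_enabled": True},
    "prompt_history": ["v41", "v42"],  # v42 is the rollout under suspicion
    "active_prompt": "v42",
    "routes": {"refund": "assistant"},
    "traces": [
        {"id": 1, "intent": "refund"},
        {"id": 2, "intent": "shipping"},
        {"id": 3, "intent": "refund"},
    ],
}
safe_state, actions = contain_refund_incident(state)
for line in actions:
    print(line)
```

Note what the function does not do: it makes no attempt to explain why the prompt misbehaved. That is the point of the first 15 minutes.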

3) AI runbooks must include semantic symptoms, not only uptime symptoms

Traditional runbooks often say, "If 5xx > X, do Y." AI runbooks need more. They need semantic indicators.

  • false refusal spike,
  • malformed tool arguments,
  • citation-grounding failure,
  • wrong-entity actions,
  • complaint clusters for one workflow.

AI incident trigger examples
┌─────────────────────────────────────┐
│ answer success normal, but          │
│ wrong-account actions rising        │
│ → treat as incident                 │
└─────────────────────────────────────┘

A semantic incident may not look like downtime. Still, it is an incident. The vitals monitor should feed these signals into on-call.
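A small Python sketch of a semantic trigger check. The metric names and threshold values are made-up assumptions for illustration; real limits depend on the product and its risk level. The sketch shows the key property: the trigger fires even though uptime-style metrics look healthy.

```python
# Hypothetical semantic thresholds (rate above the limit fires the trigger).
SEMANTIC_THRESHOLDS = {
    "false_refusal_rate": 0.05,
    "malformed_tool_args_rate": 0.02,
    "ungrounded_citation_rate": 0.10,
    "wrong_entity_action_rate": 0.001,
}


def semantic_triggers(metrics: dict[str, float],
                      thresholds: dict[str, float] = SEMANTIC_THRESHOLDS) -> list[str]:
    """Return the names of semantic indicators that exceed their limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit  # a missing metric counts as 0.0
    )


# Uptime looks fine, answer success looks fine, wrong-account actions are rising.
metrics = {
    "http_5xx_rate": 0.0,
    "answer_success_rate": 0.99,
    "wrong_entity_action_rate": 0.004,
}
fired = semantic_triggers(metrics)
print(fired)  # ['wrong_entity_action_rate']
```

Treating a missing metric as 0.0 is a simplification. In practice a missing semantic metric may itself deserve an alert.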

4) Rollback is often the fastest repair

Now what helps fastest under uncertainty? Rollback, especially for recent changes: a prompt change, routing change, feature flag, schema change, model version change, or tool contract change.

possible rollback targets
- last prompt version
- last routing policy
- last model alias
- last retrieval index snapshot
- last policy template

The simple version: Rollback is not defeat. Rollback is reliability. For example, at 11:20, a new router starts sending finance queries to a cheaper general model. At 11:24, wrong-answer complaints rise. At 11:27, incident opens. At 11:29, router flag rolls back. Complaints flatten. Root cause analysis can continue later.

The triage desk should always ask, "What changed recently and can we reverse it now?"
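The triage question "what changed recently and can we reverse it now?" can be sketched in Python as a filter over a change log. The `Change` record, the one-hour lookback, the calendar date, and the two extra changes besides the router are assumptions added for illustration. Only the 11:20 and 11:24 times come from the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str              # e.g. "prompt version", "routing policy"
    deployed_at: datetime
    reversible: bool = True


def rollback_candidates(changes, complaints_began, lookback=timedelta(hours=1)):
    """Reversible changes deployed shortly before complaints began, newest first."""
    window_start = complaints_began - lookback
    recent = [
        c for c in changes
        if c.reversible and window_start <= c.deployed_at <= complaints_began
    ]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)


day = datetime(2025, 1, 6)  # hypothetical date
changes = [
    Change("prompt version", day.replace(hour=9, minute=0)),
    Change("retrieval index snapshot", day.replace(hour=10, minute=50)),
    Change("routing policy", day.replace(hour=11, minute=20)),
]
complaints_began = day.replace(hour=11, minute=24)

candidates = rollback_candidates(changes, complaints_began)
print([c.kind for c in candidates])
# ['routing policy', 'retrieval index snapshot']
```

Newest first is a heuristic, not a proof of cause. It only orders what to try reversing while diagnosis continues.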

5) Communication should be factual, timed, and role-specific

Now what is another failure mode? Panic communication. One Slack thread says outage. Another says safe. Support tells customers something else. That damages trust.

The practical response: Use a single incident timeline. Use scheduled updates. Separate internal diagnosis from external promises.

communication cadence
internal updates → every 10 min
status page      → when user impact confirmed
support brief    → after mitigation path is chosen
executive brief  → when scope and ETA are known
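The cadence table can be read as four simple conditions. A minimal Python sketch, assuming a hypothetical helper named `due_updates` and boolean inputs that the incident lead would maintain by hand:

```python
def due_updates(minutes_since_internal_update: int, *,
                impact_confirmed: bool,
                mitigation_chosen: bool,
                scope_and_eta_known: bool) -> list[str]:
    """Return the channels whose trigger condition is currently met."""
    due = []
    if minutes_since_internal_update >= 10:
        due.append("internal update")   # every 10 min
    if impact_confirmed:
        due.append("status page")       # when user impact confirmed
    if mitigation_chosen:
        due.append("support brief")     # after mitigation path is chosen
    if scope_and_eta_known:
        due.append("executive brief")   # when scope and ETA are known
    return due


early = due_updates(12, impact_confirmed=True,
                    mitigation_chosen=False, scope_and_eta_known=False)
print(early)  # ['internal update', 'status page']
```

The sketch does not track whether a brief was already sent. It only answers "which channels are allowed to speak right now?"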

Notice the discipline. The senior doctor may make difficult policy calls, but the communications lead turns them into stable messaging. For example, a code assistant is degrading on repository-wide edits. The user-facing message should say:

  • what feature is affected,
  • what safe alternatives remain,
  • what the team is doing,
  • where to check status.

Not vague drama.
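The four-part user message can be enforced with a tiny Python template. This is a sketch under assumptions: the function name `user_message`, the wording of each line, the listed alternatives, and the `status.example.com` URL are all placeholders.

```python
def user_message(feature: str, alternatives: str,
                 team_action: str, status_url: str) -> str:
    """Build a user-facing update; refuse to build one with a missing part."""
    fields = {
        "feature": feature,
        "alternatives": alternatives,
        "team_action": team_action,
        "status_url": status_url,
    }
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"incomplete user message, missing: {', '.join(missing)}")
    return "\n".join([
        f"Affected: {feature}",
        f"Still safe to use: {alternatives}",
        f"What we are doing: {team_action}",
        f"Status: {status_url}",
    ])


message = user_message(
    feature="repository-wide edits",
    alternatives="single-file edits and inline completions",  # hypothetical
    team_action="rolling back a recent change",                # hypothetical
    status_url="https://status.example.com",                   # placeholder URL
)
print(message)
```

Raising on a blank field is the design choice that matters. It blocks the vague message before it reaches users.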

6) Post-mortems should produce system changes, not only blame maps

After mitigation, do the real learning work. A strong post-mortem covers:

  • timeline,
  • impact,
  • trigger,
  • missed signals,
  • containment quality,
  • what worked,
  • corrective actions,
  • owners and dates.

post-mortem skeleton
┌──────────────────────────────┐
│ what happened?               │
│ why did it reach users?      │
│ what slowed detection?       │
│ what reduced harm?           │
│ what changes now?            │
└──────────────────────────────┘

The simple version: Do not stop at, "Model was bad." Ask why the vitals monitor missed it, why the sealed ward opened late, or why the backup ambulance was too weak.

For example, an incident caused duplicate refund emails. Technical root cause: delayed acknowledgment plus missing dedup. Operational root cause: runbook did not specify disabling email side effects during uncertain retries.

Corrective actions must fix both.
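A short Python sketch of a post-mortem check for the duplicate-refund-email case. The record type `CorrectiveAction`, the helper `review_gaps`, the team name, and the due date are illustrative assumptions. The check covers two rules from this section: every action needs an owner and a date, and both the technical and the operational cause need a fix.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CorrectiveAction:
    description: str
    kind: str                      # "technical" or "operational"
    owner: str = ""
    due: Optional[date] = None


def review_gaps(actions: list[CorrectiveAction]) -> list[str]:
    """List what is still missing before the post-mortem can close."""
    gaps = []
    for a in actions:
        if not a.owner:
            gaps.append(f"no owner: {a.description}")
        if a.due is None:
            gaps.append(f"no due date: {a.description}")
    kinds = {a.kind for a in actions}
    for required in ("technical", "operational"):
        if required not in kinds:
            gaps.append(f"no {required} corrective action")
    return gaps


actions = [
    CorrectiveAction("add dedup to refund emails", "technical",
                     owner="payments-team", due=date(2025, 1, 20)),  # hypothetical
]
gaps = review_gaps(actions)
print(gaps)  # ['no operational corrective action']
```

Here the technical fix is fully owned, yet the review still fails because the runbook gap has no action at all.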

7) AI on-call needs domain experts, not only infra experts

Now a senior organizational point. Not every AI incident is an infrastructure incident. Sometimes model-serving is fine. Policy behavior is wrong. Retrieval freshness is wrong. Human escalation queues are jammed.

So AI on-call should involve:

  • platform or infra engineer,
  • feature owner,
  • domain-risk owner when relevant,
  • support or operations contact.

AI on-call map
infra     → can the system run?
feature   → is behavior correct?
risk      → is action allowed?
ops       → what are users experiencing?

The difference is practical: A classic SRE alone may not detect semantic harm quickly. The senior doctor role sometimes belongs to a domain owner, not the best systems debugger.
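The on-call map can be sketched in Python as a lookup from symptom to owners. The symptom names and which owner each one maps to are assumptions made up for this example, as is the fallback of paging the feature owner for an unknown symptom. A real team would agree on its own mapping.

```python
# Hypothetical symptom -> owner mapping, using the four owners from the map above.
SYMPTOM_OWNERS = {
    "model_serving_down": ["infra"],
    "wrong_policy_behavior": ["feature", "risk"],
    "stale_retrieval": ["feature"],
    "escalation_queue_jammed": ["ops"],
}


def who_to_page(symptoms: list[str]) -> list[str]:
    """Return the owners to page, without duplicates, in first-seen order."""
    owners = []
    for symptom in symptoms:
        # Assumption: an unrecognized symptom goes to the feature owner.
        owners.extend(SYMPTOM_OWNERS.get(symptom, ["feature"]))
    return list(dict.fromkeys(owners))  # dedupe while keeping order


paged = who_to_page(["wrong_policy_behavior", "escalation_queue_jammed"])
print(paged)  # ['feature', 'risk', 'ops']
```

Notice that infra is not paged in this example. Serving is fine, so the page goes to the people who can judge behavior and risk.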

8) Worked incident mini-runbook

Scenario. A financial assistant starts issuing inconsistent repayment advice. Runbook:

  1. Confirm with sampled traces and complaints.
  2. Disable repayment-action path.
  3. Roll back latest prompt and routing changes.
  4. Route repayment questions to human queue.
  5. Publish user message for affected feature.
  6. Audit already-served answers for highest-risk cases.
  7. Reopen carefully after verification.

Look how practical this is. The stability kit keeps service safe. The sealed ward limits spread. The senior doctor covers risky decisions. That is incident response.
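The seven steps above can be sketched in Python as an ordered runbook that refuses to skip ahead. The class `RunbookRun` and its methods are hypothetical names for illustration. The step text is taken from the runbook.

```python
REPAYMENT_RUNBOOK = [
    "confirm with sampled traces and complaints",
    "disable repayment-action path",
    "roll back latest prompt and routing changes",
    "route repayment questions to human queue",
    "publish user message for affected feature",
    "audit already-served answers for highest-risk cases",
    "reopen carefully after verification",
]


class RunbookRun:
    """Tracks progress through a runbook and enforces step order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []

    def next_step(self):
        i = len(self.completed)
        return self.steps[i] if i < len(self.steps) else None

    def complete(self, step: str) -> None:
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)


run = RunbookRun(REPAYMENT_RUNBOOK)
run.complete("confirm with sampled traces and complaints")
run.complete("disable repayment-action path")
print(run.next_step())  # roll back latest prompt and routing changes
```

Strict ordering is a simplification. Some steps could run in parallel in practice, but the sketch keeps one guarantee: nobody can mark the feature reopened before the audit is done.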

Where this lives in the wild

  • GitHub Copilot — incident commander: can disable autonomous workspace actions, keep inline completions alive, and roll back recent routing changes during a live reliability incident.
  • Intercom Fin — support AI operations lead: uses runbooks that separate account-action incidents from answer-quality incidents so containment is faster and safer.
  • Perplexity — search response owner: can roll back a retrieval-ranking change while keeping core answer generation online when stale or wrong citations spike.
  • Klarna assistant — payments risk manager: joins incident response when repayment or refund behaviors look semantically wrong even if the serving stack remains up.
  • Healthcare AI triage teams — clinical operations lead: shift risky symptom cases to nurse queues immediately when verifier or moderation paths degrade.

Pause and recall

  • Why should incident response assign roles immediately instead of letting everyone debug together?
  • What should the first 15 minutes prioritize?
  • Why do AI runbooks need semantic symptoms as incident triggers?
  • Why is rollback often the best first mitigation after a recent change?

Interview Q&A

Q: Why should incident response prioritize containment before perfect root-cause analysis?
A: User harm accumulates while the team debates, so fast stabilization usually creates more value than early diagnostic elegance.
Common wrong answer to avoid: "Because root cause never matters during incidents." It matters, just not before containment.

Q: Why do AI incidents require domain or product owners on-call in addition to infrastructure engineers?
A: Semantic failures may present as business or safety errors even when compute and network systems look healthy.
Common wrong answer to avoid: "Because infra engineers cannot read dashboards." The issue is domain judgment, not dashboard literacy.

Q: Why is rollback often better than tuning under live pressure?
A: Rollback quickly returns the system to a known safer state while diagnosis continues off the critical path.
Common wrong answer to avoid: "Because rollbacks always restore full service immediately." They often help, but not always fully.

Q: Why should post-mortems include containment quality and missed signals, not only the trigger event?
A: Reliability depends on the whole detection-and-response chain, so learning only the trigger leaves the bigger system weakness untouched.
Common wrong answer to avoid: "Because post-mortems are mostly documentation exercises." They should drive concrete change.


Apply now (5 min)

Exercise. Write a mini runbook for one AI failure scenario. Include trigger, first three containment steps, rollback option, user communication, and post-mortem owner.

Sketch from memory. Draw the incident-role box. Label where the triage desk, sealed ward, stability kit, and senior doctor each act during the first 15 minutes.


Bridge. We can build many controls and still face hard unknowns. Next we end honestly with the open problems that make AI reliability difficult even for strong teams. → 13-honest-admission.md