
02. The first fifteen minutes — contain before you theorize

~12 min read. The first mistake in an AI incident is trying to be clever. The first job is to preserve evidence, name an owner, and stop the spread.

Continues from 01-what-counts-as-ai-incident.md. The alarm bell has rung; now the fire captain needs the runbook wall before the team starts guessing.

The previous chapter widened incident thinking beyond outages: a green API can still be a red product. That solved declaration, but it exposed the next failure: once the team agrees the alarm is real, everyone wants to debug at once. This chapter gives the first fifteen minutes enough structure that evidence survives and harm stops spreading.


1) The wall — root cause can wait

The refund assistant complaint arrives at 03:07. The on-call engineer opens the trace, sees a suspicious retrieval result, and starts tuning query rewriting in production.

That is the wrong first move.

The first fifteen minutes are not for fixing. They are for coordination, evidence, and containment. Fixing too early can erase the scene: prompt versions change, cache entries expire, model routing moves, retrieval indexes rebuild, and tool logs roll off.

The first rule is simple:

declare -> assign -> snapshot -> contain -> communicate -> investigate

If you reverse that order, the postmortem becomes fiction.
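
The ordering rule can be made mechanical. The following is a minimal sketch, assuming a small in-process helper; `Step`, `FirstResponse`, and `OutOfOrderError` are hypothetical names invented for illustration, not part of any real incident tool.

```python
from enum import IntEnum


class Step(IntEnum):
    """First-response steps, numbered in the order the chapter gives."""
    DECLARE = 1
    ASSIGN = 2
    SNAPSHOT = 3
    CONTAIN = 4
    COMMUNICATE = 5
    INVESTIGATE = 6


class OutOfOrderError(RuntimeError):
    """Raised when a step is attempted before its predecessors are done."""


class FirstResponse:
    # Hypothetical helper: tracks which steps are done and rejects skips.
    def __init__(self) -> None:
        self.completed = []

    def complete(self, step: Step) -> None:
        if len(self.completed) == len(Step):
            raise OutOfOrderError("all first-response steps are already done")
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise OutOfOrderError(
                f"cannot {step.name.lower()} yet; next step is {expected.name.lower()}"
            )
        self.completed.append(step)
```

With this guard, calling `complete(Step.INVESTIGATE)` right after `complete(Step.DECLARE)` raises instead of silently letting the team skip the snapshot.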


2) The incident runbook wall

The runbook wall should be boring enough to follow under stress.

Minute | Action                                      | Artifact
0-2    | Acknowledge alarm and open incident channel | channel link + incident ID
2-4    | Assign fire captain and scribe              | owner names
4-7    | Write initial incident statement            | severity guess + unknowns
7-10   | Capture snapshot room package               | trace, prompt, model, retrieval, tools, flags
10-12  | Choose temporary firebreak                  | rollback, disable, rate limit, or monitor
12-15  | Send first status update                    | impact, action, next update time

This sequence is not bureaucracy. It keeps ten smart people from creating ten different realities.
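
One way to keep the wall honest is to encode it as data and ask which artifacts are overdue. This is a sketch under assumptions: the minute windows and artifact names come from the runbook wall, while `RunbookStep`, `current_step`, and `overdue_artifacts` are hypothetical names.

```python
from typing import NamedTuple, Optional


class RunbookStep(NamedTuple):
    start_min: int
    end_min: int
    action: str
    artifact: str


RUNBOOK = [
    RunbookStep(0, 2, "Acknowledge alarm and open incident channel", "channel link + incident ID"),
    RunbookStep(2, 4, "Assign fire captain and scribe", "owner names"),
    RunbookStep(4, 7, "Write initial incident statement", "severity guess + unknowns"),
    RunbookStep(7, 10, "Capture snapshot room package", "trace, prompt, model, retrieval, tools, flags"),
    RunbookStep(10, 12, "Choose temporary firebreak", "rollback, disable, rate limit, or monitor"),
    RunbookStep(12, 15, "Send first status update", "impact, action, next update time"),
]


def current_step(elapsed_min: float) -> Optional[RunbookStep]:
    """Return the step whose window contains elapsed_min, or None after minute 15."""
    for step in RUNBOOK:
        if step.start_min <= elapsed_min < step.end_min:
            return step
    return None


def overdue_artifacts(elapsed_min: float, produced: set) -> list:
    """List artifacts whose window has closed but which nobody has produced."""
    return [
        step.artifact
        for step in RUNBOOK
        if step.end_min <= elapsed_min and step.artifact not in produced
    ]
```

A scribe could call `overdue_artifacts` every few minutes and post the result, so a missing owner name or snapshot is visible before minute fifteen.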

Teacher voice. In the first fifteen minutes, the best engineer is not the person with the best theory. It is the person who keeps the system observable and safe long enough for the right theory to emerge.


3) Worked example — applying the runbook

At 03:07, support posts the refund complaint. At 03:09, on-call declares:

INC-1047: Possible AI correctness incident in enterprise refund assistant.
Known: one customer-visible answer recommended refund outside 30-day policy.
Unknown: whether tool executed, whether broader tenants affected, whether recent prompt/model/retrieval change involved.
Initial owner: Priya as fire captain, Mateo as scribe.
Next update: 03:24.
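
A declaration like this is easier to write under stress when it comes from a template. The sketch below assumes a 15-minute update interval (inferred from 03:09 to 03:24); `IncidentStatement` and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentStatement:
    incident_id: str
    summary: str
    known: list
    unknown: list
    captain: str
    scribe: str
    declared_at: datetime
    # Assumption: the next update is promised 15 minutes after declaration.
    update_interval: timedelta = timedelta(minutes=15)

    @property
    def next_update(self) -> datetime:
        return self.declared_at + self.update_interval

    def render(self) -> str:
        """Format the statement in the same shape as the example above."""
        lines = [
            f"{self.incident_id}: {self.summary}",
            "Known: " + "; ".join(self.known),
            "Unknown: " + "; ".join(self.unknown),
            f"Initial owner: {self.captain} as fire captain, {self.scribe} as scribe.",
            f"Next update: {self.next_update:%H:%M}.",
        ]
        return "\n".join(lines)
```

The template forces the unknowns to be written down. An empty `unknown` list in the first two minutes is usually a sign of overconfidence, not of clarity.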

At 03:12, the team snapshots:

  • request ID and user-visible answer
  • full prompt after template rendering
  • model provider, model name, routing tier, temperature
  • top retrieval candidates with scores and metadata filters
  • tool-call plan and actual tool execution status
  • prompt version, feature flags, deployment SHA
  • guardrail and policy classifier outputs
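
The snapshot list maps naturally onto a single frozen record. This is a minimal sketch, assuming the fields can be serialized as JSON; `Snapshot`, `freeze`, and every field name are hypothetical, chosen to mirror the bullets above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Snapshot:
    request_id: str
    user_visible_answer: str
    rendered_prompt: str          # full prompt after template rendering
    model: dict                   # provider, name, routing tier, temperature
    retrieval_candidates: list    # each with score and metadata filters
    tool_plan: list               # what the assistant planned to call
    tool_execution_status: str    # what actually ran
    versions: dict                # prompt version, feature flags, deployment SHA
    guardrail_outputs: dict       # guardrail and policy classifier outputs


def freeze(snapshot: Snapshot) -> tuple:
    """Serialize deterministically and return (payload, sha256 digest)."""
    payload = json.dumps(asdict(snapshot), sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest
```

Posting the digest in the incident channel gives the team a cheap way to confirm later that the evidence they are reading is the evidence captured at 03:12.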

At 03:15, they choose a firebreak: disable the "refund recommendation" tool action and degrade to "show policy excerpt only" for enterprise refund flows.

Notice the discipline. They did not yet decide whether retrieval, prompt, or model caused the failure. They stopped automatic harm while preserving the trail.
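
A firebreak of this kind is often just a scoped switch plus a degraded response path. The sketch below assumes an in-memory flag store; `Firebreaks`, `answer`, and the flow and action names are hypothetical, and a real system would use its own feature-flag service.

```python
from dataclasses import dataclass, field


@dataclass
class Firebreaks:
    # (flow, action) pairs that are currently switched off.
    disabled: set = field(default_factory=set)
    # Human-readable record of every firebreak decision, for the scribe.
    audit_log: list = field(default_factory=list)

    def disable(self, flow: str, action: str, reason: str, at: str) -> None:
        self.disabled.add((flow, action))
        self.audit_log.append(f"{at} disabled {action} for {flow}: {reason}")

    def is_allowed(self, flow: str, action: str) -> bool:
        return (flow, action) not in self.disabled


def answer(flow: str, firebreaks: Firebreaks, recommendation: str, policy_excerpt: str) -> str:
    """Return the recommendation only if the firebreak allows it."""
    if firebreaks.is_allowed(flow, "refund_recommendation"):
        return recommendation
    return policy_excerpt  # degraded mode: show policy excerpt only
```

The switch is scoped to one flow and one action, so other flows keep working, and nothing about the prompt, index, or routing is mutated while the evidence is still being examined.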


4) Why not debug in the incident channel immediately

The tempting alternative is to let everyone investigate freely in the incident channel. It feels fast because many people are thinking at once.

It fails because the channel becomes noisy, decisions get buried, and nobody can reconstruct what happened. AI incidents already have too many moving parts. The channel must separate facts, hypotheses, decisions, and customer-facing updates.

Use this format:

[FACT] trace 8f1 shows retrieval returned policy_2024_old before policy_2025_current
[HYPOTHESIS] reranker may prefer older policy due to a higher historical click score
[DECISION] disable refund action tool for enterprise flows at 03:15
[NEXT] compare affected traces from 02:00-03:15 by tenant and prompt version

With every message tagged, the channel works as a status board rather than a raw chat transcript.
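
Tagged messages are also easy to collect automatically. This is a rough sketch, assuming messages arrive as plain strings; `build_status_board` and the `UNTAGGED` bucket are hypothetical additions for illustration.

```python
import re
from collections import defaultdict

# Matches the four tags used in the channel format.
TAG_PATTERN = re.compile(r"^\[(FACT|HYPOTHESIS|DECISION|NEXT)\]\s+(.+)$")


def build_status_board(messages: list) -> dict:
    """Group channel messages by tag; anything untagged is kept separately."""
    board = defaultdict(list)
    for message in messages:
        text = message.strip()
        match = TAG_PATTERN.match(text)
        if match:
            tag, body = match.groups()
            board[tag].append(body)
        else:
            board["UNTAGGED"].append(text)
    return dict(board)
```

A large `UNTAGGED` bucket is itself a signal: the channel is drifting back toward free-form debate, and the fire captain may need to restate the format.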


5) Production signals in the first fifteen minutes

The first metric is spread rate: how many users, tenants, workflows, or tool actions could still be affected while the team talks.
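
Spread rate can be approximated from recent traffic on the exposed flow. The sketch below is one possible reading of the metric, assuming each event is a dict with hypothetical keys `at`, `tenant`, and `tool_executed`; the five-minute window is also an assumption.

```python
from datetime import datetime, timedelta


def spread_rate(events: list, now: datetime, window: timedelta = timedelta(minutes=5)) -> dict:
    """Estimate ongoing exposure from events seen in the last `window`."""
    recent = [e for e in events if now - window <= e["at"] <= now]
    minutes = window.total_seconds() / 60
    return {
        # How fast new requests still reach the suspect flow.
        "requests_per_min": len(recent) / minutes,
        # How many distinct tenants were exposed in the window.
        "tenants_exposed": len({e["tenant"] for e in recent}),
        # How fast tool actions (the costly part) are still executing.
        "tool_actions_per_min": sum(1 for e in recent if e["tool_executed"]) / minutes,
    }
```

If these numbers are near zero, the team has time to snapshot carefully. If they are climbing, the firebreak decision should move earlier.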

The misleading metric is confidence in the first hypothesis. Fast confident theories are cheap. Preserved evidence is expensive and valuable.

The expert signal is whether the team can answer these questions by minute fifteen:

  • Who is the fire captain?
  • What harm class is suspected?
  • What evidence is frozen?
  • What firebreak is active or explicitly deferred?
  • When is the next update?

If those are missing, the incident is running the team instead of the team running the incident.
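
The five questions work as a literal checklist. This is a minimal sketch, assuming the incident state is a plain dict; the keys and the `unanswered` function are hypothetical.

```python
# Maps a hypothetical state key to the question it answers.
MINUTE_FIFTEEN_QUESTIONS = {
    "fire_captain": "Who is the fire captain?",
    "harm_class": "What harm class is suspected?",
    "frozen_evidence": "What evidence is frozen?",
    "firebreak": "What firebreak is active or explicitly deferred?",
    "next_update": "When is the next update?",
}


def unanswered(state: dict) -> list:
    """Return the questions the team still cannot answer."""
    return [
        question
        for key, question in MINUTE_FIFTEEN_QUESTIONS.items()
        if not state.get(key)  # missing, None, or empty all count as unanswered
    ]
```

A deferred firebreak still counts as an answer, as long as the deferral is written down, for example `"firebreak": "deferred: single trace, no tool execution"`.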


6) Boundary — when the runbook can be lighter

For a low-risk internal feature, the first fifteen minutes can be lighter: one owner, one snapshot, one note, one fix path.

For customer-facing flows, tools that take action, private data, regulated domains, money movement, or safety policy, do not improvise. Use the full runbook.
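
The boundary can be written as a simple rule. The sketch below assumes each feature is described by a set of trait labels; the label strings and `runbook_level` are hypothetical, derived from the list in the paragraph above.

```python
# Any one of these traits is enough to require the full runbook.
HIGH_RISK_TRAITS = frozenset({
    "customer_facing",
    "takes_action",
    "private_data",
    "regulated_domain",
    "money_movement",
    "safety_policy",
})


def runbook_level(traits: set) -> str:
    """Return "full" if any high-risk trait is present, otherwise "light"."""
    return "full" if traits & HIGH_RISK_TRAITS else "light"
```

Deciding this per feature ahead of time, rather than at 03:07, removes one more judgment call from the first fifteen minutes.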

The pathology is premature repair. A well-meaning engineer patches the prompt, the symptom disappears, and the team loses the only evidence that could have shown a broader model-routing or retrieval-index issue.


Recall checkpoint

  • Why does snapshot come before fix?
  • What does the fire captain own?
  • What is the difference between fact and hypothesis in the incident channel?
  • Which first-fifteen-minute questions prove the team is in control?

Interview Q&A

Q: In an AI incident, what do you do before debugging root cause? A: Declare the incident, assign a fire captain and scribe, snapshot the system, choose containment, and send the first status update.

Common wrong answer to avoid: "Start by tuning the prompt." That may erase evidence and introduce unreviewed changes during the incident.

Q: Why is snapshotting more urgent in AI systems than in many backend systems? A: Prompts, retrieval results, model routing, caches, indexes, guardrails, and provider behavior can change quickly. Without the original trace package, the team may never reproduce the visible failure.

Common wrong answer to avoid: "Logs will have it." Logs often miss rendered prompts, retrieval candidates, model routing, or guardrail decisions unless designed for incidents.

Q: What should an incident channel distinguish explicitly? A: Facts, hypotheses, decisions, next actions, and customer-facing updates.

Common wrong answer to avoid: "Let everyone brainstorm in the channel." Brainstorming without structure destroys the status board.


Apply now (10 min)

Model the exercise. Write a first incident update for the refund failure with knowns, unknowns, owner, mitigation, and next update time.

Your turn. Take one AI feature and write the first fifteen-minute checklist your team would follow.

Reproduce from memory. Explain why "contain before you theorize" is not anti-debugging; it is what makes debugging possible.


What you should remember

This chapter explained the first fifteen minutes of an AI incident. The important idea is that early response protects evidence and limits blast radius before root cause is known.

Carry this diagnostic forward: if the team cannot name the captain, snapshot, firebreak, and next update, the response is not yet under control.

Remember:

  • The runbook order is declare, assign, snapshot, contain, communicate, investigate.
  • Snapshot before mutating prompts, indexes, routing, or tool behavior.
  • The incident channel separates facts, hypotheses, decisions, and next actions.
  • A good firebreak buys time without pretending to fix root cause.

Bridge. The first fifteen minutes give us control, but the containment choice depends on severity and blast radius. Next we learn how to classify the fire before choosing how hard to pull the firebreak. → 03-severity-and-blast-radius.md