
05. War-room roles and communication — one timeline, one captain, no folklore

~11 min read. AI incidents go sideways when everyone has a theory and nobody owns the timeline.

Continues from 04-snapshot-the-system.md. The snapshot room preserved the evidence. Now the status board must preserve decisions.

The previous chapter protected the technical record: prompt, model, retrieval, tools, guardrails, flags, and timestamps. That solved the evidence problem, but incidents still fail socially when facts, hypotheses, customer questions, and executive pressure collide. This chapter makes the human control plane explicit.


1) The wall — smart people create noise under pressure

In the refund incident, retrieval engineers inspect candidates, prompt owners inspect templates, product asks about customer impact, legal asks whether money moved, support asks what to tell the account team, and leadership asks for an ETA.

All of those questions are legitimate. Together they can drown the response.

A war room exists to create one decision surface:

fire captain   -> owns severity, containment, decisions
scribe         -> owns timeline and status board
investigators  -> own hypotheses and evidence
comms owner    -> owns customer/internal language
approvers      -> approve risky rollback or disclosure decisions

The goal is not hierarchy theater. It is reducing coordination cost while uncertainty is high.


2) The roles that matter

Fire captain. One person decides severity, containment, escalation, and closure criteria. They do not need to be the deepest technical expert.

Scribe. One person writes facts, hypotheses, decisions, timestamps, and next updates. Without a scribe, the postmortem starts with archaeology.

Technical investigators. A small set of owners, one per layer: prompt, retrieval, model route, tool execution, guardrails, data/index, infra.

Comms owner. One person translates uncertainty into careful internal and customer language.

Business or policy approver. For refunds, privacy, legal, safety, or regulated flows, some decisions are not purely engineering decisions.

The fire captain should protect investigators from stakeholder pings. The comms owner should protect customers from raw speculation.
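One way to make the role list checkable is to record who has claimed each role and flag gaps or duplicates. The sketch below is a minimal illustration, not a prescribed tool: the `WarRoom` class, the role keys, and the second claimant "Sam" are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field

# Hypothetical role keys mirroring the roles described above.
# Each of these should have exactly one owner at any moment.
SINGLE_OWNER_ROLES = ("captain", "scribe", "comms")


@dataclass
class WarRoom:
    # role name -> people currently claiming that role
    assignments: dict[str, list[str]] = field(default_factory=dict)

    def claim(self, role: str, person: str) -> None:
        """Record that a person has taken a role."""
        self.assignments.setdefault(role, []).append(person)

    def role_problems(self) -> list[str]:
        """Return readable problems; an empty list means roles are unambiguous."""
        problems = []
        for role in SINGLE_OWNER_ROLES:
            owners = self.assignments.get(role, [])
            if not owners:
                problems.append(f"no {role} assigned")
            elif len(owners) > 1:
                problems.append(f"{role} claimed by {', '.join(owners)}")
        return problems


room = WarRoom()
room.claim("captain", "Priya")
room.claim("captain", "Sam")  # hypothetical second claimant
room.claim("comms", "Lena")
print(room.role_problems())
```

Here the check reports a doubled captain and a missing scribe. Investigators and approvers are left out of the single-owner list on purpose, since several people can hold those roles at once.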


3) Worked example — status board for the refund incident

A good status board looks like this:

INC-1047 enterprise refund assistant
Severity: sev-2, not yet public, one confirmed customer-visible case
Captain: Priya
Scribe: Mateo
Comms: Lena
Current firebreak: refund recommendation disabled for enterprise flows
Customer impact: text recommendation only; no refund tool execution found so far
Facts:
  - bad trace at 03:04 returned old policy chunk above current policy
  - reranker disabled for 10% cost experiment in affected window
Hypotheses:
  - reranker disable increased stale-policy ranking
  - fallback model followed stale chunk more aggressively
Next:
  - compare 02:00-03:15 traces by prompt version and reranker flag
  - support update due 03:45

This is concise enough for leadership and specific enough for engineers. It does not pretend root cause is settled.
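If the board lives in a tool rather than a free-form doc, the same structure can be sketched as a small record type that keeps facts and hypotheses in separate lists. This is an illustrative sketch only: the `StatusBoard` class, its field names, and the `confirm` rule (a hypothesis moves to facts only when evidence is cited) are assumptions, not part of any incident tool.

```python
from dataclasses import dataclass, field


@dataclass
class StatusBoard:
    incident: str
    severity: str
    captain: str
    scribe: str
    comms: str
    firebreak: str
    customer_impact: str
    facts: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def confirm(self, hypothesis: str, evidence: str) -> None:
        """Promote a hypothesis to a fact only when evidence is cited."""
        if not evidence.strip():
            raise ValueError("a hypothesis needs cited evidence to become a fact")
        self.hypotheses.remove(hypothesis)  # raises ValueError if it was never listed
        self.facts.append(f"{hypothesis} (evidence: {evidence})")

    def render(self) -> str:
        """Produce the plain-text board in the layout shown above."""
        lines = [
            self.incident,
            f"Severity: {self.severity}",
            f"Captain: {self.captain}",
            f"Scribe: {self.scribe}",
            f"Comms: {self.comms}",
            f"Current firebreak: {self.firebreak}",
            f"Customer impact: {self.customer_impact}",
        ]
        sections = (("Facts", self.facts), ("Hypotheses", self.hypotheses), ("Next", self.next_steps))
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


board = StatusBoard(
    incident="INC-1047 enterprise refund assistant",
    severity="sev-2, not yet public, one confirmed customer-visible case",
    captain="Priya", scribe="Mateo", comms="Lena",
    firebreak="refund recommendation disabled for enterprise flows",
    customer_impact="text recommendation only; no refund tool execution found so far",
    facts=["bad trace at 03:04 returned old policy chunk above current policy"],
    hypotheses=["reranker disable increased stale-policy ranking"],
    next_steps=["support update due 03:45"],
)
print(board.render())
```

Keeping hypotheses in their own list means the rendered board cannot quietly present a theory as settled; someone has to call `confirm` with evidence first.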


4) Communication rules under uncertainty

AI incidents often involve semantic uncertainty. The team may know the answer was bad before knowing how many users saw similar behavior.

Use language that separates knowns from unknowns:

  • "We have confirmed one customer-visible incorrect answer."
  • "We have not found evidence of automatic refund execution."
  • "We are disabling recommendations for the affected flow while we measure blast radius."
  • "Next update in 30 minutes."

Avoid language that overclaims:

  • "The model hallucinated."
  • "Only one customer affected."
  • "Fixed."
  • "No data issue."

Those may become false within an hour.
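A comms owner could run drafts through a crude phrase check before sending. The sketch below assumes a hypothetical pattern list seeded from the overclaiming examples above; `lint_update` and `OVERCLAIM_PATTERNS` are illustrative names, and a match is a prompt for human review, not a verdict (it would also flag a careful phrase such as "not yet fixed").

```python
import re

# Hypothetical patterns derived from the overclaiming examples above.
OVERCLAIM_PATTERNS = {
    r"\bhallucinat\w*": "names a speculative root cause",
    r"\bonly (one|\d+) customers?\b": "asserts a blast radius that is still being measured",
    r"\bfixed\b": "claims resolution before verification",
    r"\bno data issue\b": "rules out a layer that may not have been checked",
}


def lint_update(draft: str) -> list[str]:
    """Return one warning per overclaiming pattern found in the draft."""
    warnings = []
    for pattern, reason in OVERCLAIM_PATTERNS.items():
        match = re.search(pattern, draft, flags=re.IGNORECASE)
        if match:
            warnings.append(f"'{match.group(0)}': {reason}")
    return warnings


careful = "We have confirmed one customer-visible incorrect answer. Next update in 30 minutes."
risky = "The model hallucinated. Only one customer affected. Fixed."
print(lint_update(careful))
for warning in lint_update(risky):
    print(warning)
```

The careful draft passes with no warnings, while the risky draft triggers three.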


5) Production signals — war-room health

The first signal is decision latency: the time between credible evidence and the containment decision.

The misleading signal is channel activity. A busy channel can mean progress or panic.

The expert signal is timeline quality. Can an uninvolved engineer read the status board and understand severity, current firebreak, facts, hypotheses, decisions, and next update time?

If not, the incident has two problems: the AI failure and the response failure.
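Both signals can be approximated from the scribe's records. The sketch below is a minimal illustration under stated assumptions: timeline entries are `(timestamp, kind, text)` tuples, the kind labels `"evidence"` and `"containment"` are hypothetical, and the required-field names are invented to match the list in the paragraph above. Only the 03:04 trace time comes from the worked example; the date and the other timestamps are made up.

```python
from datetime import datetime, timedelta
from typing import Optional

Entry = tuple[datetime, str, str]  # (timestamp, kind, text); kinds are hypothetical labels

REQUIRED_BOARD_FIELDS = ("severity", "firebreak", "facts", "hypotheses", "decisions", "next_update")


def decision_latency(timeline: list[Entry]) -> Optional[timedelta]:
    """Time from the first credible evidence to the first containment decision after it."""
    ordered = sorted(timeline, key=lambda entry: entry[0])
    evidence_at = next((ts for ts, kind, _ in ordered if kind == "evidence"), None)
    if evidence_at is None:
        return None
    contained_at = next(
        (ts for ts, kind, _ in ordered if kind == "containment" and ts >= evidence_at), None
    )
    return None if contained_at is None else contained_at - evidence_at


def missing_board_fields(board: dict) -> list[str]:
    """Fields an uninvolved engineer would look for but cannot find (absent or empty)."""
    return [name for name in REQUIRED_BOARD_FIELDS if not board.get(name)]


# Illustrative entries: the date and the 03:09 and 03:16 times are invented for the example.
timeline = [
    (datetime(2024, 1, 1, 3, 4), "evidence", "bad trace returned old policy chunk"),
    (datetime(2024, 1, 1, 3, 9), "hypothesis", "reranker disable increased stale-policy ranking"),
    (datetime(2024, 1, 1, 3, 16), "containment", "refund recommendation disabled for enterprise flows"),
]
print(decision_latency(timeline))
print(missing_board_fields({"severity": "sev-2", "facts": ["bad trace at 03:04"]}))
```

With these invented timestamps the latency is twelve minutes, and the partial board is reported as missing its firebreak, hypotheses, decisions, and next update time. Message counts never enter the calculation, which is the point: channel activity is not an input to either measure.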


6) Boundary — when a lightweight room is enough

Small internal incidents can use a lightweight room: one owner, one doc, one update. Customer-facing, safety, privacy, money, public, or executive-visible incidents need explicit roles.

The pathology is role ambiguity. Two people think they are captain, nobody is scribe, support uses an old claim, and the final postmortem cannot explain why a firebreak was delayed.
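The boundary can be written down as a simple rule so nobody debates room size mid-incident. This is a sketch under assumptions: the flag names are hypothetical labels for the categories listed above, and `room_type` is an illustrative helper, not an established policy.

```python
# Hypothetical flag names for the categories that call for explicit roles.
EXPLICIT_ROLE_TRIGGERS = frozenset(
    {"customer_facing", "safety", "privacy", "money", "public", "executive_visible"}
)


def room_type(incident_flags: set[str]) -> str:
    """Pick the room style: any trigger flag means explicit roles are needed."""
    return "explicit-roles" if incident_flags & EXPLICIT_ROLE_TRIGGERS else "lightweight"


print(room_type({"internal"}))
print(room_type({"internal", "money"}))
```

A single trigger is enough to escalate. The rule errs toward explicit roles because role ambiguity costs more than an over-staffed room.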


Recall checkpoint

  • Why does a war room need a fire captain?
  • What does the scribe preserve?
  • Why should customer comms avoid raw root-cause speculation?
  • What is the difference between channel activity and timeline quality?

Interview Q&A

Q: Who should be incident commander in an AI incident: the model expert or the engineering lead?
A: The best fire captain is the person who can coordinate severity, containment, decisions, and communication. Deep experts investigate; the captain controls the response.

Common wrong answer to avoid: "The smartest model person should run everything." That overloads the expert and leaves coordination unmanaged.

Q: What should a customer update say before root cause is known?
A: It should state confirmed impact, current containment, what is still unknown, and next update time without naming speculative root cause.

Common wrong answer to avoid: "Say the model hallucinated." That may be inaccurate and can create legal or trust problems.

Q: How do you know the war room is healthy?
A: Decisions are timestamped, facts are separated from hypotheses, containment is explicit, and the next update time is visible.

Common wrong answer to avoid: "Lots of people are active in Slack." Activity is not coordination.


Apply now (10 min)

Model the exercise. Write a status board entry for the refund incident at minute thirty.

Your turn. Define war-room roles for one AI feature at your company or project.

Reproduce from memory. Explain why the incident timeline is a production artifact, not meeting notes.


What you should remember

This chapter explained war-room roles and communication. The important idea is that AI incident response needs one decision timeline while many layers are investigated in parallel.

Carry this diagnostic forward: if facts, hypotheses, decisions, owners, and next updates are mixed together, the status board is failing.

Remember:

  • The fire captain owns decisions, not every investigation.
  • The scribe turns chaos into an incident timeline.
  • Customer language should state known impact and containment, not speculation.
  • Timeline quality is the real war-room metric.

Bridge. Communication keeps the response coherent, but users are still exposed until a firebreak changes system behavior. Next we design rollback and kill switches. → 06-rollback-and-kill-switches.md