
09. Postmortems, evals, and locks — the incident must change the system

~12 min read. A postmortem that produces only a document is a memorial. A postmortem that produces a lock changes the future.

Continues from 08-ai-specific-incident-patterns.md. We recognized the incident pattern and pulled a firebreak. Now the after-action lock must stop the class from returning.

The previous chapter made incidents recognizable: prompt regression, stale evidence, tool runaway, judge drift, cost explosion, and related patterns. That helps containment, but a contained fire can still return next month. This chapter turns incident learning into release gates, eval slices, guardrails, and runbook changes.


1) The wall — "root cause: model hallucinated" teaches nothing

After the refund incident, a weak postmortem says:

Root cause: model hallucinated old refund policy.
Action item: improve prompt.

That postmortem is almost useless. It hides retrieval, reranking, model fallback, prompt authority, eval coverage, and release process.

A strong AI postmortem names the failure chain:

cost experiment disabled reranker for 10% traffic
  -> stale policy chunk outranked current policy
  -> fallback model used due to timeout
  -> prompt allowed directive refund advice
  -> no critical-slice eval covered enterprise 90-day renewal case
  -> support caught it by customer report

Now the system can change.
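One way to keep the chain from collapsing back into a single sentence is to record it as data. The sketch below is a minimal illustration, not a prescribed schema: the `ChainLink` type, the layer labels, and both helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainLink:
    layer: str  # hypothetical layer label, e.g. "retrieval"
    event: str  # what happened at this step of the chain


# The refund failure chain from above, one link per step.
# The layer labels are assumptions made for this illustration.
REFUND_CHAIN = [
    ChainLink("release", "cost experiment disabled reranker for 10% traffic"),
    ChainLink("retrieval", "stale policy chunk outranked current policy"),
    ChainLink("model", "fallback model used due to timeout"),
    ChainLink("prompt", "prompt allowed directive refund advice"),
    ChainLink("eval", "no critical-slice eval covered enterprise 90-day renewal case"),
    ChainLink("detection", "support caught it by customer report"),
]


def layers_touched(chain):
    """Return the distinct layers in the order they first appear."""
    seen = []
    for link in chain:
        if link.layer not in seen:
            seen.append(link.layer)
    return seen


def is_weak_root_cause(chain):
    """A chain that names fewer than two layers hides the system interaction."""
    return len(layers_touched(chain)) < 2
```

Under this sketch, "model hallucinated old refund policy" becomes a one-link chain and is flagged as weak, while the refund chain names six layers that can each receive a lock.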


2) The AI postmortem sections that matter

Use normal incident fields: timeline, impact, detection, response, root cause, what went well, what went poorly.

Add AI-specific fields:

Section           | Question
Product contract  | What promise did the AI violate?
Evidence path     | What prompt, retrieval, tool, memory, model, and guardrail artifacts were involved?
Eval gap          | What test should have caught this before release?
Firebreak quality | Did containment stop harm without excessive availability loss?
Human review gap  | Should this slice have had sampling or approval?
Recurrence lock   | What eval, guardrail, flag, runbook, or architecture change prevents this class?

The postmortem should produce at least one after-action lock. Otherwise the organization paid incident cost and bought only a story.
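The "at least one lock" rule can be checked mechanically if the postmortem is stored as a structured record. The following is a rough sketch under that assumption; the `AIPostmortem` class, its field names, and the two helpers are hypothetical and only mirror the AI-specific sections listed above.

```python
from dataclasses import dataclass, field


@dataclass
class AIPostmortem:
    # One field per AI-specific section; the names are illustrative only.
    product_contract: str
    evidence_path: list
    eval_gap: str
    firebreak_quality: str
    human_review_gap: str
    recurrence_locks: list = field(default_factory=list)


def missing_sections(pm):
    """Return the names of sections that are empty, in declaration order."""
    return [name for name, value in vars(pm).items() if not value]


def is_complete(pm):
    """A postmortem is complete only when every section is filled,
    which includes at least one recurrence lock."""
    return not missing_sections(pm)


refund_pm = AIPostmortem(
    product_contract="Refund advice must reflect the current policy",
    evidence_path=["prompt", "retrieval", "model route"],
    eval_gap="No eval covered the enterprise 90-day renewal case",
    firebreak_quality="Contained without a full outage",
    human_review_gap="Financial-policy answers were not sampled",
)
```

Here `refund_pm` is reported as incomplete until an entry is appended to `recurrence_locks`, which matches the rule that a document without a lock is only a story.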


3) Worked example — refund after-action locks

The refund incident creates multiple locks:

  1. Eval lock. Add enterprise refund cases at 30, 60, 90, and renewal-edge windows.
  2. Retrieval lock. Current-policy chunks must outrank stale chunks for active policy flows.
  3. Prompt lock. Refund assistant cannot use directive approval language unless the eligibility tool confirms eligibility.
  4. Feature flag lock. Reranker cannot be disabled for financial-policy flows without a critical-slice eval pass.
  5. Runbook lock. Refund incidents require disabling recommendation mode until tool execution is ruled out.

These locks live in different places because the incident chain crossed layers. A lead engineer does not force one universal fix.
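Two of these locks reduce to small predicates that a release gate or response layer could call. The sketch below assumes a hypothetical set of flow names and hypothetical function names; it illustrates the feature flag lock and the prompt lock, not any real flagging system.

```python
# Hypothetical flow names treated as financial-policy flows.
FINANCIAL_POLICY_FLOWS = {"refund", "billing_dispute"}


def can_disable_reranker(flow, critical_slice_passed):
    """Feature flag lock: financial-policy flows may only run without the
    reranker when the critical-slice eval has passed."""
    if flow not in FINANCIAL_POLICY_FLOWS:
        return True
    return critical_slice_passed


def allowed_tone(eligibility_confirmed):
    """Prompt lock: directive approval language is allowed only after the
    eligibility tool confirms; otherwise the assistant stays informational."""
    return "directive" if eligibility_confirmed else "informational"
```

Keeping each lock as its own predicate reflects the point above: the checks run in different layers, so they are not merged into one universal gate.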


4) Why not stop at "add more tests"

The tempting alternative is to add one regression test for the exact complaint. That is better than nothing.

It fails when the same incident class returns with a different policy, tenant, model route, or tool. The lock should target the class, not only the example.

Turn one example into a slice:

exact case:
  enterprise customer, 90 days after renewal

slice:
  enterprise renewal refund eligibility across 0/29/30/31/60/90 days,
  current and stale policy versions,
  with and without tool eligibility confirmation

That is the difference between a unit test and an incident lock.
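The slice above is a cross product of three dimensions, so it can be generated rather than hand-written. This is a minimal sketch; the dictionary keys and the `build_slice` name are assumptions, and a real eval harness would attach expected answers to each case.

```python
from itertools import product

# Dimensions taken from the slice description above.
DAYS_AFTER_RENEWAL = [0, 29, 30, 31, 60, 90]
POLICY_VERSIONS = ["current", "stale"]
TOOL_CONFIRMED = [True, False]


def build_slice():
    """Expand the three dimensions into one eval case per combination."""
    return [
        {
            "segment": "enterprise",
            "days_after_renewal": days,
            "policy_version": version,
            "tool_confirmed": confirmed,
        }
        for days, version, confirmed in product(
            DAYS_AFTER_RENEWAL, POLICY_VERSIONS, TOOL_CONFIRMED
        )
    ]


cases = build_slice()
```

The expansion yields 6 x 2 x 2 = 24 cases. The original complaint (enterprise, 90 days, stale policy, no tool confirmation) is one of them, and the other 23 cover the nearby cases through which the class could return.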


5) Production signals — postmortem quality

The first metric is lock completion rate: the share of postmortem action items that become merged evals, guardrails, runbook changes, dashboards, or architecture changes.

The misleading metric is postmortem length. Long documents can still produce weak locks.

The expert signal is recurrence: if the same class returns, ask whether the lock was missing, weak, bypassed, or not connected to release gates.
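Lock completion rate is simple arithmetic once action items are tracked with a status. The sketch below assumes each action item is a dictionary with a boolean `merged` key; that shape and the sample items are hypothetical.

```python
def lock_completion_rate(action_items):
    """Fraction of action items that were merged as real system changes.

    Returns 0.0 when there are no action items, since a postmortem
    without action items produced no locks.
    """
    if not action_items:
        return 0.0
    merged = sum(1 for item in action_items if item["merged"])
    return merged / len(action_items)


# Illustrative data only, not results from a real incident.
refund_action_items = [
    {"lock": "eval", "merged": True},
    {"lock": "retrieval", "merged": True},
    {"lock": "prompt", "merged": True},
    {"lock": "runbook", "merged": False},
]
```

With the sample items, three of four locks are merged, so the rate is 0.75. The length of the postmortem document does not enter the calculation.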


6) Boundary — postmortems cannot remove all ambiguity

AI postmortems often have partial causes. A prompt change, model route, stale document, and user phrasing may all contribute.

Do not force a single clean root cause when the truth is a causal graph. The action items should lock the most dangerous edges in that graph.

The pathology is blame-shaped causality. The team picks the layer owned by the least powerful group, writes a tidy story, and misses the system interaction.
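A causal graph can be kept as a plain edge list, with each edge carrying a risk estimate the team agrees on. The sketch below is an assumption-heavy illustration: the node names, the risk scores, and the 0.6 threshold are invented for this example and would come from the team's own judgment in practice.

```python
# Each edge is (cause, effect, risk). Risk scores are made-up values
# between 0 and 1, used only to illustrate ranking.
CAUSAL_EDGES = [
    ("reranker_disabled", "stale_chunk_ranked_first", 0.9),
    ("stale_chunk_ranked_first", "wrong_refund_advice", 0.8),
    ("fallback_model_on_timeout", "wrong_refund_advice", 0.4),
    ("directive_prompt", "wrong_refund_advice", 0.7),
    ("user_phrasing", "wrong_refund_advice", 0.2),
]


def edges_to_lock(edges, threshold=0.6):
    """Return edges at or above the risk threshold, most dangerous first."""
    risky = [edge for edge in edges if edge[2] >= threshold]
    return sorted(risky, key=lambda edge: edge[2], reverse=True)


locks_needed = edges_to_lock(CAUSAL_EDGES)
```

Ranking edges instead of owners keeps the discussion on the interaction between layers. In this example the top edges span release, retrieval, and prompt, so no single team absorbs the whole story.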


Recall checkpoint

  • Why is "model hallucinated" a weak root cause?
  • What is an after-action lock?
  • How do you turn one incident into an eval slice?
  • Why can AI postmortems have multiple partial causes?

Interview Q&A

Q: What is different about an AI postmortem? A: It must include product contract, evidence path, eval gap, firebreak quality, human-review gap, and recurrence lock, not only service timeline.

Common wrong answer to avoid: "Use the same postmortem template as backend incidents." The normal template misses prompt, retrieval, model, tool, guardrail, and eval gaps.

Q: How do you prevent a bad AI answer from recurring? A: Convert the incident into a slice-level eval, retrieval check, guardrail, prompt/tool constraint, release gate, or runbook lock depending on the failure chain.

Common wrong answer to avoid: "Add one test for the bad example." The class returns through nearby cases.

Q: What if root cause is multi-factor? A: Write the causal graph and lock the dangerous edges. Do not compress the incident into one false cause for narrative comfort.

Common wrong answer to avoid: "Pick the main owner and assign the fix." Ownership is necessary; oversimplified causality is dangerous.


Apply now (10 min)

Model the exercise. Convert the refund incident into five locks: eval, retrieval, prompt, feature flag, and runbook.

Your turn. Take one previous AI bug and rewrite its root cause as a failure chain instead of one sentence.

Reproduce from memory. Explain why a postmortem is not done until it changes a release gate, runbook, guardrail, or architecture.


What you should remember

This chapter explained postmortems, evals, and locks. The important idea is that incident learning must become a system constraint, not a document.

Carry this diagnostic forward: every serious AI incident should answer, "What eval, guardrail, runbook, or architecture change would catch or contain this class next time?"

Remember:

  • "Model hallucinated" is not a useful root cause.
  • Locks target incident classes, not only exact examples.
  • AI postmortems should include the eval gap.
  • Recurrence means the lock was missing, weak, bypassed, or not connected to release gates.

Bridge. A postmortem lock helps after the fire. Strong teams also practice before the fire. Next we turn incident response into a drillable skill. → 10-incident-drills-and-readiness.md