09. Postmortem capture

~9 min read. Runbooks contain incidents. Postmortems make incidents change the system. The apparatus's memory lives here — the structured capture that turns a single incident into eval coverage, alert improvement, runbook update, and drill scenarios.

Continues from 08-provider-and-cost-runbooks.md. This chapter develops the postmortem plane. Recurring concepts in bold: postmortem template, eval delta, follow-up enforcement, blameless framing, AI-specific cause taxonomy, closure SLO.

The previous three chapters described how runbooks contain incidents. This chapter is what happens after — the structured capture that prevents the same shape from recurring.

What an AI postmortem is

An AI postmortem is a structured incident record with five mandatory fields — cause, blast radius, eval delta, follow-up actions, and apparatus updates — written within a defined SLO of incident closure and tracked to completion.

The classic SRE postmortem has cause and follow-ups. AI postmortems add three specifics:

Eval delta. What did the incident reveal about the eval set? What expansion or refresh is now required?
Apparatus updates. What alert, runbook, escalation, or drill change does the incident motivate?
AI-specific cause taxonomy. Was the cause in the prompt, model, retrieval, data, tooling, gateway, an upstream system, or the on-call apparatus itself? The classic "infrastructure / application / human" taxonomy is too coarse.

The template

A useful template:

1. Identification
   - Incident ID, date, severity, duration, on-call(s)
2. Summary (3-5 sentences)
   - What happened, what was the user-visible impact, what was done
3. Timeline
   - Bullet log of key events: page, action, escalation, resolution
4. Cause
   - Primary cause classified by AI taxonomy:
     prompt | model | retrieval | data | tooling | gateway | upstream | apparatus
   - Contributing factors (sub-causes)
5. Blast radius
   - Affected slice / cohort / tenant; user-impact count or proxy;
     business impact estimate
6. Eval delta
   - Did the incident reveal an eval coverage gap? What expansion is needed?
   - If no, why not?
7. Follow-up actions
   - Each item: action, owner, due date, current status
   - Required: at least one apparatus update item
8. Apparatus updates
   - Alert change, runbook update, escalation update, drill scenario added
9. What went well
   - Components of the apparatus that worked as designed
10. What did not go well
    - Components that failed or were missing

The template is the contract. Every postmortem fills every field. Empty fields are visible gaps, not skipped fields.
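
One way to make "every postmortem fills every field" checkable is to hold the template as a typed record and list whatever is still empty. The sketch below is a minimal illustration, not a prescribed schema: the class names `Postmortem` and `FollowUp`, the field names, and the helper `empty_fields` are all assumptions chosen to mirror the ten template sections, with identification reduced to an incident ID for brevity.

```python
from dataclasses import dataclass, field, fields


@dataclass
class FollowUp:
    action: str
    owner: str
    due: str      # ISO date string, e.g. "2026-05-19"
    status: str


@dataclass
class Postmortem:
    # One attribute per template section (names are hypothetical).
    incident_id: str = ""
    summary: str = ""
    timeline: list[str] = field(default_factory=list)
    cause: str = ""
    blast_radius: str = ""
    eval_delta: str = ""
    follow_ups: list[FollowUp] = field(default_factory=list)
    apparatus_updates: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_badly: list[str] = field(default_factory=list)


def empty_fields(pm: Postmortem) -> list[str]:
    """Return the names of template sections that are still empty."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]


draft = Postmortem(incident_id="INC-2026-0512", cause="prompt")
print(empty_fields(draft))  # every section other than incident_id and cause
```

A review step could refuse to mark a postmortem complete while `empty_fields` returns anything, which keeps empty sections visible rather than silently skipped.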

The five mandatory fields

Cause. Classified by the AI taxonomy. The taxonomy is non-negotiable — not because the categories are perfect, but because consistent classification lets the team analyse incident trends. Over time, the distribution reveals which surface needs investment.

Blast radius. Sized in user terms when possible. "12 enterprise tenants affected, 4,200 user-facing calls degraded over 38 minutes" is more useful than "an incident occurred."

Eval delta. The single most important AI-specific field. Most AI incidents reveal that the eval set did not catch the regression. The postmortem captures the missing coverage and commits to filling it.

Follow-up actions. Each with owner and due date. Tracked to closure with the same rigour as product backlog items. The platform team enforces closure rates; teams below 80% closure are flagged.

Apparatus updates. What changes about the apparatus itself. An incident with no apparatus update is either a perfect apparatus response (rare) or a missed learning opportunity (common).
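
For the eval delta field, the gap is often a slice that the aggregate score hid. As a hedged illustration, the sketch below compares per-slice eval scores between a baseline and a candidate and reports any slice that dropped by more than a tolerance. The function name `slice_regressions`, the `max_drop` default, and every score are hypothetical. A check like this only helps once the slice has enough examples to be measured at all.

```python
def slice_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return slices whose score fell by more than max_drop (absolute)."""
    drops = {}
    for slice_name, base_score in baseline.items():
        # A slice missing from the candidate run is treated as a score of 0.
        drop = base_score - candidate.get(slice_name, 0.0)
        if drop > max_drop:
            drops[slice_name] = round(drop, 4)
    return drops


# Hypothetical scores: the aggregate improves while one intent regresses.
baseline = {"aggregate": 0.80, "policy_comparison": 0.78, "claims_status": 0.82}
candidate = {"aggregate": 0.84, "policy_comparison": 0.59, "claims_status": 0.86}

flagged = slice_regressions(baseline, candidate)
print(flagged)  # only policy_comparison is flagged
```

Gating on `flagged` being empty, rather than on the aggregate alone, is one possible way to turn an eval delta into a pre-deploy check.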

A worked example: the policy_comparison postmortem

The Bengaluru insurance SaaS team writes the postmortem for the policy_comparison incident from chapter 07:

Identification. INC-2026-0512, severity P2, duration 11 minutes, on-call Priya Iyer.

Summary. A prompt deploy at 18:00 improved aggregate eval scores by 4% but regressed the policy_comparison intent by 19%. The quality alert fired at 18:42 within the deploy-anchored window. On-call rolled back via runbook step 7 at 18:48; alert cleared at 18:53. Approximately 38 affected user sessions over 42 minutes.

Timeline. Bullet log of the page, runbook steps, rollback, verification, channel close.

Cause. Primary: prompt. Contributing: eval set did not include sufficient policy_comparison coverage to catch the regression in pre-deploy eval.

Blast radius. 38 user sessions across 14 tenants over 42 minutes; one of the 14 was an enterprise tenant; user-feedback signal showed a 22% thumbs-down rate on policy_comparison during the window.

Eval delta. The eval set has 8 policy_comparison examples; the regression case (riders + age-banding) was not represented. Expand the eval set with 30 examples covering the rider-comparison cases. Validate against the regressed prompt to confirm the new eval would have caught the change.

Follow-up actions.
(a) Expand policy_comparison eval coverage; owner: Arjun Rao; due: 2026-05-19; status: in progress.
(b) Validate new eval against the regressed prompt; owner: Arjun Rao; due: 2026-05-21; status: pending.
(c) Add slice-level alert sensitivity for high-traffic intents; owner: Priya Iyer; due: 2026-05-26; status: pending.
(d) Drill scenario: prompt regression with slice isolation; owner: drill lead; due: 2026-06-15; status: scheduled.

Apparatus updates. Slice-level alert sensitivity tightened (action c); drill scenario added (action d). Runbook is unchanged (the existing runbook handled this incident cleanly).

What went well. Quality alert fired with deploy ID in payload; on-call executed runbook without escalation; time to contain, measured from page to alert clear, was 11 minutes.

What did not go well. Pre-deploy eval did not catch the regression because the eval set's coverage of policy_comparison was thin. The deploy passed despite the regression being detectable in principle.
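
The intervals quoted in the worked example can be derived from the timestamps in its summary. The sketch below is a small illustration that assumes all four events fall on the same day; the helper `minutes_between` and the event labels are hypothetical names.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the summary of the worked example.
events = {
    "deploy": "18:00",
    "alert": "18:42",
    "rollback": "18:48",
    "alert_cleared": "18:53",
}

intervals = {
    "deploy_to_alert": minutes_between(events["deploy"], events["alert"]),
    "alert_to_rollback": minutes_between(events["alert"], events["rollback"]),
    "alert_to_clear": minutes_between(events["alert"], events["alert_cleared"]),
}
print(intervals)
# {'deploy_to_alert': 42, 'alert_to_rollback': 6, 'alert_to_clear': 11}
```

Read this way, the 11-minute duration appears to span page to alert clear, and the 42-minute window appears to span deploy to alert. Stating which interval each number measures keeps the record comparable across incidents.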

Blameless framing

The postmortem is about the system, not the engineer. The blameless principle:

Names appear in the timeline as role-holders, not as defendants.
"What did not go well" focuses on apparatus and design gaps, not on individual decisions.
The eval delta and apparatus updates focus on prevention, not on punishment.

The principle is not soft. A team that names individuals as causes will see fewer postmortems and worse incidents. A team that names systems as causes will see more honest postmortems and faster apparatus maturity.

Follow-up enforcement

Postmortems are valuable only if follow-ups close. The enforcement mechanism:

Each follow-up has an owner and a due date.
A follow-up's status is tracked in a system the team checks (ticketing, project tool).
A team's follow-up closure rate (closed within 30 days of postmortem) is reported quarterly.
Teams below 80% closure receive remediation: backlog grooming sessions, priority recalibration, or apparatus review.

Without enforcement, follow-ups become a wishlist. With enforcement, the apparatus actually changes after each incident.
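
The closure-rate rule above can be computed directly from follow-up records. The sketch below is one possible reading, assuming each record carries a team, the postmortem date, and an optional close date; the names `FollowUp`, `closure_rate`, and `teams_below_threshold` are hypothetical. As a simplification, items that are still open count against the rate even if their 30-day window has not yet elapsed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class FollowUp:
    team: str
    postmortem_date: date
    closed_on: Optional[date] = None  # None while the item is still open


def closure_rate(items: list[FollowUp], window_days: int = 30) -> float:
    """Fraction of follow-ups closed within window_days of their postmortem."""
    window = timedelta(days=window_days)
    on_time = sum(
        1
        for item in items
        if item.closed_on is not None
        and item.closed_on - item.postmortem_date <= window
    )
    return on_time / len(items)


def teams_below_threshold(
    items: list[FollowUp], threshold: float = 0.80
) -> dict[str, float]:
    """Group follow-ups by team and return teams whose rate is below threshold."""
    by_team: dict[str, list[FollowUp]] = {}
    for item in items:
        by_team.setdefault(item.team, []).append(item)
    flagged = {}
    for team, group in by_team.items():
        rate = closure_rate(group)
        if rate < threshold:
            flagged[team] = rate
    return flagged


# Hypothetical data: "search" closes 4 of 5 on time, "chat" closes 1 of 3.
pm_day = date(2026, 1, 10)
items = (
    [FollowUp("search", pm_day, pm_day + timedelta(days=12)) for _ in range(4)]
    + [FollowUp("search", pm_day)]
    + [
        FollowUp("chat", pm_day, pm_day + timedelta(days=9)),
        FollowUp("chat", pm_day, pm_day + timedelta(days=45)),  # closed late
        FollowUp("chat", pm_day),                               # still open
    ]
)
print(teams_below_threshold(items))  # only "chat" falls below 80%
```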

The AI-specific cause taxonomy

The taxonomy is:

Category	What it means
prompt	The prompt or prompt template was the cause
model	The model itself produced the regression (drift, capability change)
retrieval	Retrieval returned wrong, stale, or insufficient context
data	The underlying data (training, retrieval source, golden set) had an issue
tooling	The agent's tool calls or tool responses were the cause
gateway	The model gateway's routing, quota, or transform was the cause
upstream	An upstream system (auth, search, database) was the cause
apparatus	The on-call apparatus itself failed (alert missing, runbook stale)

A single incident can have a primary cause and contributing factors. The taxonomy is consistent across the team; teams may add domain-specific sub-categories.
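
Consistent classification is easier to enforce when the taxonomy is a closed set in code. The sketch below encodes the eight categories from the table as an enum and rejects labels outside it. The names `Cause`, `CauseClassification`, and `parse_cause` are hypothetical, and contributing factors are kept as free text here because the chapter does not prescribe how they are classified.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(str, Enum):
    PROMPT = "prompt"
    MODEL = "model"
    RETRIEVAL = "retrieval"
    DATA = "data"
    TOOLING = "tooling"
    GATEWAY = "gateway"
    UPSTREAM = "upstream"
    APPARATUS = "apparatus"


@dataclass(frozen=True)
class CauseClassification:
    primary: Cause
    contributing: tuple[str, ...] = ()  # free-text contributing factors
    sub_category: str = ""              # optional domain-specific refinement


def parse_cause(label: str) -> Cause:
    """Map a label to a taxonomy category; reject anything outside the set."""
    try:
        return Cause(label.strip().lower())
    except ValueError:
        allowed = [c.value for c in Cause]
        raise ValueError(f"{label!r} is not one of {allowed}") from None


classification = CauseClassification(
    primary=parse_cause("prompt"),
    contributing=("thin policy_comparison eval coverage",),
)
print(classification.primary.value)  # prompt
```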

Operational signals

Healthy. Every incident produces a postmortem within the SLO (commonly 5 business days). Follow-up closure rate is above 80%. Apparatus updates flow from postmortems back into the alert plane, runbook plane, and drill plane.

First degrading metric. Postmortem backlog growing. Incidents are happening; postmortems are not being written; the apparatus is not learning.

Misleading metric. Number of postmortems. A team with thorough incident response writes more postmortems than a team with shallow response; a high count signals health, not pathology. The metric to watch is follow-up closure rate.

Expert graph. Cause taxonomy distribution over time, follow-up closure rate per team, apparatus updates per quarter sourced from postmortems. The combination reveals where the apparatus is maturing and where it is stalling.
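
The postmortem SLO and the backlog signal can both be read from two dates per incident. The sketch below is a rough illustration that counts weekdays only and ignores public holidays; the function names, the incident IDs, and the dates are hypothetical, and the 5-day default follows the "commonly 5 business days" figure above.

```python
from datetime import date, timedelta
from typing import Optional


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after start, up to and including end (no holidays)."""
    count = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            count += 1
    return count


def slo_breaches(
    incidents: dict[str, tuple[date, Optional[date]]],
    today: date,
    slo_days: int = 5,
) -> list[str]:
    """Return incident IDs whose postmortem missed, or is missing, the SLO."""
    breaches = []
    for incident_id, (closed_on, published_on) in incidents.items():
        # For unpublished postmortems, measure the age as of today.
        measured_to = published_on or today
        if business_days_between(closed_on, measured_to) > slo_days:
            breaches.append(incident_id)
    return breaches


# incident ID -> (incident closed, postmortem published or None)
incidents = {
    "INC-A": (date(2026, 5, 12), date(2026, 5, 18)),  # published in time
    "INC-B": (date(2026, 5, 12), None),               # still unwritten
    "INC-C": (date(2026, 5, 15), None),               # unwritten but recent
}
print(slo_breaches(incidents, today=date(2026, 5, 21)))  # ['INC-B']
```

A growing list from a check like this would be an early version of the "postmortem backlog growing" signal.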

Boundary of applicability

Strong fit. Teams running real AI incidents at non-trivial rate. The postmortem discipline pays back through prevented recurrence.

Pathology. A team treating postmortems as audit artefacts. The template is filled; follow-ups are not tracked; the apparatus does not learn. The fix is to make closure rate visible and to remediate teams below threshold.

Scale limit. Very large platforms produce many postmortems; the analysis layer (cause taxonomy distribution, apparatus update aggregation) becomes its own activity. The pattern is to have a platform team that reviews postmortem trends across the portfolio quarterly.
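
A quarterly trend review needs little more than a count of primary causes per quarter. The sketch below shows one way to compute that distribution from (date, primary cause) pairs; `quarter`, `cause_distribution`, and the sample records are hypothetical.

```python
from collections import Counter
from datetime import date


def quarter(d: date) -> str:
    """Label a date with its calendar quarter, e.g. 2026-Q2."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def cause_distribution(
    postmortems: list[tuple[date, str]],
) -> dict[str, dict[str, float]]:
    """Share of each primary cause per quarter."""
    counts: dict[str, Counter] = {}
    for closed_on, primary_cause in postmortems:
        counts.setdefault(quarter(closed_on), Counter())[primary_cause] += 1
    return {
        q: {cause: n / sum(c.values()) for cause, n in c.items()}
        for q, c in counts.items()
    }


# Hypothetical records: (postmortem date, primary cause)
records = [
    (date(2026, 1, 14), "prompt"),
    (date(2026, 2, 3), "retrieval"),
    (date(2026, 2, 20), "prompt"),
    (date(2026, 3, 9), "data"),
    (date(2026, 4, 22), "prompt"),
    (date(2026, 5, 12), "gateway"),
]
dist = cause_distribution(records)
print(dist["2026-Q1"])  # {'prompt': 0.5, 'retrieval': 0.25, 'data': 0.25}
```

Comparing the shares quarter over quarter is what lets a platform team see which surface is absorbing the most incidents.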

Failure-prone assumption

The seductive wrong belief: the postmortem is the deliverable. It is not. The deliverable is the apparatus change. A team that produces well-written postmortems with zero apparatus changes is wasting the discipline. The correct belief: postmortems exist to produce apparatus updates; their value is in the follow-up closure, not in the document.

Where this appears in production

A fintech has an AI cause taxonomy enforced across postmortems; quarterly trends drive eval-set investment.
A telecom AI has follow-up closure rate as a leadership metric; teams below 80% are remediated.
A consumer chatbot writes postmortems as audit artefacts; closure rate is 30%; the apparatus does not mature.
A healthtech AI treats eval delta as a mandatory field; eval coverage grows monotonically.
A coding assistant has the apparatus-update field non-empty for every postmortem; the apparatus matures quickly.
A retail AI treats postmortems as blameless from day one; engineers are willing to write honestly.
A logistics AI writes one postmortem per incident; large incidents have multiple sub-postmortems for clarity.
A government AI has postmortems as regulatory artefacts; the apparatus discipline is layered on top.
A B2B SaaS measures cause distribution: 40% prompt, 25% retrieval, 15% data, 10% gateway, 10% other. Investment prioritises prompt and retrieval.
A travel platform had a quarter where postmortems were not written; the next quarter's incident rate doubled.
A payments AI has follow-ups tracked in the same system as product backlog; closure rates are visible to leadership.
A legal AI has postmortems reviewed by a senior engineer for quality; weak postmortems are returned for revision.
A staffing AI has a quarterly postmortem trends review; cross-team patterns surface.
A search-rerank service has eval-set expansion as a mandatory follow-up; eval coverage tracks incident learning.
A document AI has an apparatus-only postmortem when the apparatus itself failed (alert missed, runbook stale).
A media AI treats postmortems as the input to the drill calendar; observed failures become drill scenarios.
An ad-tech AI has the postmortem SLO at 3 business days; backlog is rare.
A real-estate AI has the cause taxonomy customised for their domain; the platform pattern adapts.
A medical AI has postmortems with regulator-notification follow-ups; the apparatus integrates with compliance.
A small SaaS does not write postmortems; the same incident has happened three times.

Recall / checkpoint

Name the five mandatory fields of an AI postmortem.
What is the AI-specific cause taxonomy, and why does consistency matter?
What is the eval delta field, and why is it the most important AI-specific field?
Why is blameless framing not a soft principle?
What is the follow-up enforcement mechanism?
What signals tell you the postmortem plane is degrading?
Why is "postmortems are the deliverable" a failure-prone assumption?

Interview Q&A

Q1. A team writes thorough postmortems but rarely closes follow-ups. What is the apparatus failure, and what is the remediation?
The discipline has stopped at the document; the apparatus does not change after incidents. The remediation is to make closure visible: track each follow-up in the team's normal work-management system, surface the closure rate as a leadership metric, set a threshold (80% within 30 days), and remediate teams below it (backlog grooming, priority recalibration). The postmortem is the input; the apparatus update is the output.
Common wrong answer to avoid: "write better postmortems". Closure is the problem, not document quality.

Q2. The cause taxonomy classifies an incident as "data." Walk through what that means and how it shapes the follow-up.
The cause was in the data: training data, retrieval source content, golden set examples, or grounding documents. The follow-up usually involves the data team (or the team owning the data source), not just the AI team. The eval delta may include data-quality checks rather than test cases. The apparatus update may include a new data freshness alert or a data validation step in the deploy. The taxonomy points the follow-up to the right surface.
Common wrong answer to avoid: "the AI team owns it". Ownership depends on where the data originates and who owns the source.

Q3. Why is the eval delta field the most important AI-specific field?
Because most AI incidents reveal that the eval set did not catch the regression. The eval delta captures what would have caught it (new test cases, broader coverage, sliced evaluation) and commits to filling the gap. Over time, the eval set grows in lockstep with the apparatus's observed failures, raising the bar on what can ship without detection. Without the eval delta field, the eval set stays static and the same incidents recur.
Common wrong answer to avoid: "we'll update the eval set when we get to it". Without the field, the eval update is not tied to incident learning.

Q4. The team's postmortems are blameless but the engineer responsible feels punished anyway. What is the leadership failure?
Blameless framing is a system design; if leadership conversations treat the engineer as responsible, the document's neutrality does not offset that. The remediation is for leadership to model the framing in real time: name the system, and ask about the apparatus, not about the engineer's choices. Engineers learn by what is said in the postmortem review, not by what is written in the template.
Common wrong answer to avoid: "the template is blameless, that's enough". Culture follows behaviour, not artefacts.

Q5. How does the postmortem plane interact with the drill plane?
Postmortems are the primary input to the drill calendar. An observed failure becomes a drill scenario the team can rehearse before recurrence. The drill scenario validates the apparatus updates the postmortem produced: did the new alert fire? Did the updated runbook work? Did the new eval coverage catch the regression in eval scoring? The drill closes the loop the postmortem opened.
Common wrong answer to avoid: "drills are separate". Drills validate postmortem follow-ups; they're tightly coupled.

Q6. What is the right SLO for postmortem completion?
3-5 business days for most incidents; longer for complex incidents that require investigation. The SLO is a contract: the apparatus learns from incidents within a week, not within a quarter. Beyond a week, the team's context fades and the postmortem quality degrades. Within 3-5 days, the apparatus update is captured while the incident is still vivid.
Common wrong answer to avoid: "when the team has time". Without the SLO, postmortems pile up and the apparatus stops learning.

Design / debug exercise (10 minutes)

Modelled example. Walk through the worked example (the policy_comparison postmortem). Verify all five mandatory fields are populated, the cause is classified, the eval delta has actionable expansion, the follow-ups have owners and dates.

Your turn. Take your team's most recent incident. Write the postmortem against the template in this chapter. Identify which mandatory fields you would struggle to fill in — those gaps are your apparatus's blind spots.

Reproduce from memory. Write the AI-specific cause taxonomy from memory. The signal of internalisation is that the eight categories land in under two minutes and that you can classify a hypothetical incident into the right primary category quickly.

Operational memory

This chapter explained the postmortem plane: structured incident capture with five mandatory fields, AI-specific cause taxonomy, blameless framing, and enforced follow-up closure. The important idea is that postmortems exist to produce apparatus updates; their value is in the changes that follow, not in the document itself.

You learned to write postmortems with the five mandatory fields, classify causes consistently, enforce follow-up closure, and feed the drill plane from observed failures. That solves the opening failure because the apparatus now learns from every incident, not just from the major ones.

Carry this diagnostic forward: when a team says "we do postmortems," ask for the follow-up closure rate over the last quarter. The closure rate is the truth; the document count is the appearance.

Remember:

Five mandatory fields: cause, blast radius, eval delta, follow-ups, apparatus updates.
AI-specific cause taxonomy (8 categories) lets you analyse trends.
Eval delta is the most important AI-specific field.
Blameless framing is a system design, not a soft principle.
Follow-up closure rate is the metric, not postmortem count.

Bridge. Postmortems capture learning. Drills exercise the apparatus before learning is needed. The next chapter is the drill plane — how to schedule, run, and score exercises that keep the apparatus from rusting. → 10-drills-and-game-days.md