11. Release postmortem

Emergencies are handled. The final operational discipline is the postmortem: the structured investigation, run whenever a release goes wrong (emergency or routine), that captures lessons and produces systemic improvement.

A platform engineer at a Mumbai SaaS company writes the postmortem for a prompt change that regressed in production. The timeline: shipped at 10:00, regression detected at 11:30, rolled back at 12:00. The root cause: the prompt change shifted tone in a way the eval did not measure. The contributing factors: the eval rubric did not include a "tone" criterion; the canary was at 25% (a longer hold at 5% would have caught it earlier); the feedback gate threshold was set at 2σ (a tighter threshold would have flagged sooner). The action items: extend the rubric to include tone (with specific cases for the missing criterion); review canary step sizes (consider 5% → 25% rather than 5% → 50%); investigate calibration of the feedback gate threshold. Each item has an owner and a date; the postmortem is closed when all items ship.
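The timeline in this example reduces to three durations that most postmortems report. A minimal sketch of that arithmetic, assuming the three timestamps fall on the same day (the calendar date below is an arbitrary placeholder, and the variable names are illustrative):

```python
from datetime import datetime

# Timestamps from the example timeline; the date itself is a placeholder.
FMT = "%Y-%m-%d %H:%M"
shipped = datetime.strptime("2024-01-01 10:00", FMT)
detected = datetime.strptime("2024-01-01 11:30", FMT)
rolled_back = datetime.strptime("2024-01-01 12:00", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(shipped, detected)        # 90
time_to_rollback = minutes_between(detected, rolled_back)  # 30
total_exposure = minutes_between(shipped, rolled_back)     # 120

print(time_to_detect, time_to_rollback, total_exposure)
```

Here detection took three times as long as rollback, which is consistent with the contributing factors above all pointing at the detection side (eval, canary, feedback gate).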

This chapter covers the postmortem discipline as applied to releases.

What the release postmortem covers

A blameless investigation of a release that went wrong. The structure parallels module 05_ai_incident_operations (general AI incidents) with release-specific elements:

Timeline. Release, detection, response, rollback, resolution.
Root cause. What specifically caused the regression.
Contributing factors. Disciplines that did not catch the cause (gates, canary, monitoring).
Blast radius. Number of affected users; estimated impact; customer escalations.
Response assessment. What worked; what was slow; what should improve.
Action items. Specific improvements with owners and dates.
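The six sections above can be captured as a simple record so that an incomplete draft is easy to spot. This is a sketch only: the class name, field names, and `release_id` value are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass, field, fields
from typing import List, Tuple


@dataclass
class ReleasePostmortem:
    release_id: str  # hypothetical identifier for the release
    timeline: List[Tuple[str, str]] = field(default_factory=list)  # (time, event)
    root_cause: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    blast_radius: str = ""
    response_assessment: str = ""
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty in this draft."""
        return [
            f.name
            for f in fields(self)
            if f.name != "release_id" and not getattr(self, f.name)
        ]


draft = ReleasePostmortem(
    release_id="prompt-change-example",
    timeline=[("10:00", "shipped"), ("11:30", "regression detected"), ("12:00", "rolled back")],
    root_cause="Prompt change shifted tone in a way the eval did not measure",
    contributing_factors=["eval rubric had no tone criterion"],
    action_items=["extend the rubric to include tone"],
)
print(draft.missing_sections())  # ['blast_radius', 'response_assessment']
```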

The release-specific contributing factors to examine

Beyond general incident factors, release postmortems examine:

Eval gate. Did the eval catch the cause? If not, the eval set has a coverage gap (chapter 11 of 01_dataset_golden_set_operations).
Canary. Did the canary catch the cause? If not, the canary step sizes or duration may be wrong; the monitoring during canary may be insufficient.
Feedback gate. Did the feedback signal flag the issue? If not, the thresholds may be too loose; the feedback bias may have hidden the signal.
Rollback. Was rollback fast and clean? If not, the rollback infrastructure or testing has gaps.
Communication. Were stakeholders informed appropriately? If not, the templates or the process need adjustment.
Decision authority. Was the right person making the call? If not, the authority paths need clarification.

Each factor produces specific action items.
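One way to make sure no factor is skipped is to encode the questions as a checklist and derive the gaps from the answers. The sketch below assumes a yes/no answer per factor; the dictionary keys, the function name, and the example answers are illustrative.

```python
from typing import Dict, List, Tuple

# factor -> (question to ask, what a "no" suggests), paraphrased from the list above
RELEASE_FACTORS: Dict[str, Tuple[str, str]] = {
    "eval_gate": ("Did the eval catch the cause?", "eval set has a coverage gap"),
    "canary": ("Did the canary catch the cause?",
               "canary step sizes, duration, or monitoring may be inadequate"),
    "feedback_gate": ("Did the feedback signal flag the issue?",
                      "thresholds may be too loose, or feedback bias hid the signal"),
    "rollback": ("Was rollback fast and clean?",
                 "rollback infrastructure or testing has gaps"),
    "communication": ("Were stakeholders informed appropriately?",
                      "templates or process need adjustment"),
    "decision_authority": ("Was the right person making the call?",
                           "authority paths need clarification"),
}


def gaps_to_examine(answers: Dict[str, bool]) -> List[str]:
    """Return one line per factor answered 'no'; refuse partial checklists."""
    unanswered = [f for f in RELEASE_FACTORS if f not in answers]
    if unanswered:
        raise ValueError(f"unanswered factors: {unanswered}")
    return [
        f"{factor}: {RELEASE_FACTORS[factor][1]}"
        for factor in RELEASE_FACTORS
        if not answers[factor]
    ]


# Illustrative answers only; a real postmortem fills these in from evidence.
example_answers = {
    "eval_gate": False, "canary": False, "feedback_gate": False,
    "rollback": True, "communication": True, "decision_authority": True,
}
for line in gaps_to_examine(example_answers):
    print(line)
```

Raising on unanswered factors is a deliberate choice in this sketch: a factor nobody examined should not silently count as a pass.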

Blameless discipline

The postmortem names systems and processes, not individuals. "The engineer shipped without running the eval" becomes "the CI did not enforce the eval gate." The latter is fixable; the former only assigns shame.

The discipline:

The postmortem investigates the system that allowed the failure.
Individual engineers are named for their actions in the timeline, not blamed for outcomes.
Action items address systemic improvements, not "be more careful."

Blameless culture produces honest postmortems; blame-driven culture produces defensive ones.

Action items that actually close

The postmortem produces action items. Closing them is the work.

For each item:

A specific owner.
A specific date.
A measurable outcome.
A tracker (ticket, project plan entry).

The postmortem is not closed until all action items ship. A list of "things to do someday" is not the discipline; tracked items are.
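The closure rule can be made mechanical. The following is a minimal sketch under two assumptions: an item counts only if all four fields are filled in, and a postmortem with no action items at all is treated as not closed. The class, field names, and the sample owner and ticket values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    measurable_outcome: Optional[str] = None
    tracker_ref: Optional[str] = None  # hypothetical ticket or plan entry id
    shipped: bool = False

    def missing_fields(self) -> List[str]:
        """Required fields that have not been filled in yet."""
        required = {
            "owner": self.owner,
            "due": self.due,
            "measurable_outcome": self.measurable_outcome,
            "tracker_ref": self.tracker_ref,
        }
        return [name for name, value in required.items() if not value]


def postmortem_closed(items: List[ActionItem]) -> bool:
    """Closed only when there are items, all are fully specified, and all shipped."""
    return bool(items) and all(i.shipped and not i.missing_fields() for i in items)


tone_rubric = ActionItem(
    title="Extend the eval rubric to include tone",
    owner="eval-owner",                      # placeholder owner
    due=date(2024, 2, 1),                    # placeholder date
    measurable_outcome="tone cases added to the regression set",
    tracker_ref="TICKET-1",                  # placeholder tracker id
    shipped=True,
)
canary_review = ActionItem(title="Review canary step sizes", owner="release-owner")

print(canary_review.missing_fields())
print(postmortem_closed([tone_rubric, canary_review]))
```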

A reasonable platform tracks 5-10 action items per major release postmortem; the action items close within 2-8 weeks.

Pattern recognition across postmortems

Periodically (quarterly), review the recent postmortems for patterns.

Same eval-set gap appearing in multiple postmortems? The eval expansion is a systemic priority.
Same canary-too-fast pattern? The canary discipline needs adjustment.
Same rollback-slowness? The rollback infrastructure needs investment.
Same communication gap? The templates or process need rework.

The patterns surface what the individual postmortems do not. The cross-postmortem review is its own discipline.
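If each postmortem is tagged with its contributing factors, the quarterly review can start from a simple count. This is a sketch assuming postmortems are stored as dictionaries with a `factors` list; the tag names, the postmortem ids, and the threshold of two occurrences are illustrative choices.

```python
from collections import Counter
from typing import Dict, List


def recurring_factors(postmortems: List[dict], min_count: int = 2) -> Dict[str, int]:
    """Factors that appear in at least min_count postmortems."""
    counts: Counter = Counter()
    for pm in postmortems:
        # set() so a factor listed twice in one postmortem is counted once
        counts.update(set(pm["factors"]))
    return {factor: n for factor, n in counts.items() if n >= min_count}


# Hypothetical quarter of postmortems, for illustration only.
quarter = [
    {"id": "pm-1", "factors": ["canary_too_fast", "eval_gap"]},
    {"id": "pm-2", "factors": ["canary_too_fast"]},
    {"id": "pm-3", "factors": ["canary_too_fast", "rollback_slow"]},
]
print(recurring_factors(quarter))  # {'canary_too_fast': 3}
```

The count only flags candidates; deciding whether a recurring factor warrants a platform-level change is still a judgment made in the review.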

Distinguishing release postmortems from other incident postmortems

A release postmortem is for incidents caused by a release. Other incidents (provider outages, data pipeline failures, security incidents not tied to a release) get their own postmortems per the broader incident discipline.

The distinction matters because the action items differ. Release postmortems focus on release discipline; provider-outage postmortems on gateway and fallback discipline; security postmortems on the security discipline.

When the release did not regress

Most releases ship fine. The postmortem discipline is reserved for those that go wrong. Periodic reviews of routine releases (e.g., a quarterly retrospective on release patterns) are useful but lighter.

What the postmortem does not solve

The original incident. That was handled by rollback and remediation. The postmortem prevents recurrence.
The team's emotional response. A bad release is stressful; the postmortem is the structural response, not the emotional one.
External pressure. A regulator asking why the incident happened wants the postmortem. The postmortem is your answer to that question; it does not remove the pressure itself.

Common mistakes

Postmortem without action items. A summary without commitments; nothing changes.

Action items without dates. "Someday" becomes never.

Blame-driven postmortem. Engineers defensive; honest analysis impossible.

No pattern review. Repeated systemic issues fixed individually but never structurally.

Postmortem skipped because "it was small." Small incidents are the practice ground for the discipline; skipping decays the muscle.

Interview Q&A

Q1. Walk through the structure of a release postmortem. Timeline (release, detection, response, rollback, resolution). Root cause (what specifically caused the regression). Contributing factors (eval gate, canary, feedback gate, rollback, communication, decision authority: the release-specific elements beyond general incident factors). Blast radius (users affected; impact). Response assessment (what worked; what was slow; what should improve). Action items (specific owners and dates). The structure parallels general incident postmortems with release-specific elements. Wrong-answer notes: missing the contributing factors loses the systemic analysis.

Q2. The postmortem identifies that the eval did not catch the regression. What is the action item? Add cases to the eval set representing the failure mode the regression exhibited (per chapter 11 of 01_dataset_golden_set_operations). The action item is specific: "Add 5-10 cases covering tone variations to the regression set; verify the new cases would have caught the original regression; ship by [date]." The action item is measurable (cases added) and verifiable (the regression would be caught now). Wrong-answer notes: "improve the eval" is too vague; specific cases make it actionable.

Q3. Why blameless postmortem? Because blame produces defensiveness; defensive analysis is incomplete; and an incomplete analysis leaves the systemic cause in place for the next incident. Blameless analysis names systems and processes, which can be improved. The engineer who happened to be the proximate cause is part of the timeline; responsibility for the outcome is assigned to the system that allowed the action. Blameless culture is the precondition for honest analysis. Wrong-answer notes: "accountability matters" misframes the question; accountability for systems differs from blame on individuals.

Q4. The team has had three postmortems this quarter all identifying the canary as too fast. What is the cross-postmortem response? The canary discipline needs adjustment. The pattern across postmortems is the signal; the per-postmortem fix (longer canary for that specific release) does not address the systemic issue (canary defaults are too aggressive). The action item is platform-level: review and tighten the canary defaults; communicate the new defaults; verify the next several releases follow the tightened discipline. The pattern review converts individual incidents into platform improvement. Wrong-answer notes: "fix each incident's canary individually" misses the pattern.

What to do differently after reading this

Write the postmortem for every release that goes wrong; do not skip "small" ones.
Include release-specific contributing factors (eval, canary, feedback gate, rollback, communication, decision authority).
Blameless culture; name systems, not individuals.
Action items with owners and dates; track to closure.
Quarterly pattern review across postmortems.

Bridge. Eleven chapters. The last two synthesise. The next chapter is the architect's checklist. → 12-architect-checklist.md