18. Postmortem for agents: the detective's case-closed report
~15 min read. Classical SRE postmortems were written for deterministic servers. Agents break differently. The template must change.
Built on the ELI5 in 00-eli5.md. The confession — the verified root cause from the lineup — must be written down so the next engineer does not re-walk the same case. That document is the postmortem.
The case-closed report: what a good agent post-mortem contains
A classical SRE postmortem reads like one config flip — "Region us-east-1 lost capacity. Quota hit. 502s for 18 minutes." One cause, one mitigation, deterministic. Agent postmortems are different in three structural ways: causation often involves two things in combination, the bug may not reproduce the same way twice, and the confession has no value if the next engineer cannot replay it.
| classical SRE postmortem | agent postmortem |
|---|---|
| what broke (one thing) | what broke (often two) |
| when it broke | which trace shows it |
| how we mitigated | which suspect confessed |
| action items | which lock got added |
| | prompt + tool + model diff |
The bug may live in a prompt example, not in code. It may not reproduce the same way twice. The confession is no good if the next engineer cannot reproduce it. The lock is no good if nobody can find which eval to run. Structure is what makes the report durable.
The five sections every agent postmortem must have
1. The complaint slip: what the user actually said
Verbatim user words. Not a summary. Exact text from chat, ticket, or thumbs-down.
Why? The words decide the bug class. "It refunded me too much" — tool or model layer. "It forgot what I said earlier" — memory layer. "It keeps trying the same thing" — loop layer. The complaint slip is the entry point to the lineup. Also record: timestamp, user ID, trace ID, model version.
2. The case file: the trace ID and reproduction recipe
One sentence: "Trace abc-123 at 2026-05-11 14:22 UTC." Then a recipe.
Exact input. Seed if any. Model version. Prompt version. Tool registry version. Retrieved documents.
Skip one and the next engineer cannot re-run the case file.
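The recipe above can be made machine-checkable. What follows is a minimal sketch, assuming the recipe is stored as a plain record; the class name `ReproRecipe`, its field names, and the `missing_fields` helper are all hypothetical and not part of any existing tool. It only shows the idea of refusing a case file that skipped an item.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReproRecipe:
    """One replayable case file. Field names are illustrative, not a standard."""
    trace_id: str
    exact_input: str
    model_version: str
    prompt_version: str
    tool_registry_version: str
    # None means "not recorded"; an empty list means "nothing was retrieved".
    retrieved_documents: Optional[List[str]] = None
    seed: Optional[int] = None  # "seed if any": the only optional item

    def missing_fields(self) -> List[str]:
        """Names of required fields that were left unrecorded."""
        missing = []
        for f in fields(self):
            if f.name == "seed":
                continue
            value = getattr(self, f.name)
            if value is None or value == "":
                missing.append(f.name)
        return missing


recipe = ReproRecipe(
    trace_id="abc-123",
    exact_input="Please refund me $100 for the cancelled order #771.",
    model_version="",  # left blank on purpose to show the check firing
    prompt_version="refund-agent-v3.2",
    tool_registry_version="tools-v9",
    retrieved_documents=["order-771.json"],
)
print(recipe.missing_fields())  # ['model_version']
```

A postmortem template could run a check like this before the document is accepted, so an incomplete case file is caught at write time rather than months later.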
3. The lineup result: which suspect confessed
State each suspect. Mark each one ruled out or guilty.
| suspect | verdict | evidence |
|---|---|---|
| prompt | innocent | no diff vs working version |
| tool | GUILTY | schema accepted "1000" as cents not dollars |
| loop | innocent | single iteration, no retry |
| memory | innocent | fresh session |
| model | innocent | same version as working day |
If two suspects co-caused it, mark both guilty. Do not hide ambiguity.
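A lineup is easy to leave half-finished, so it can help to store it as data. The sketch below assumes a hypothetical `Finding` record and a `guilty_suspects` helper (neither comes from an existing library). It rejects a lineup that skipped a suspect and reports every guilty one, so co-causation is never collapsed into a single name.

```python
from typing import Dict, List, NamedTuple

# The five suspects used throughout this chapter.
SUSPECTS = ("prompt", "tool", "loop", "memory", "model")


class Finding(NamedTuple):
    guilty: bool
    evidence: str


def guilty_suspects(lineup: Dict[str, Finding]) -> List[str]:
    """Return every guilty suspect; refuse a lineup that skipped anyone."""
    skipped = [s for s in SUSPECTS if s not in lineup]
    if skipped:
        raise ValueError(f"no verdict recorded for: {skipped}")
    return [s for s in SUSPECTS if lineup[s].guilty]


lineup = {
    "prompt": Finding(False, "no diff vs working version"),
    "tool": Finding(True, 'schema accepted "1000" as cents not dollars'),
    "loop": Finding(False, "single iteration, no retry"),
    "memory": Finding(False, "fresh session"),
    "model": Finding(False, "same version as working day"),
}
guilty = guilty_suspects(lineup)
print(guilty, "co-causation" if len(guilty) > 1 else "single cause")
# ['tool'] single cause
```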
4. The lock: the regression eval added
Name the eval. Test case. Expected output. Gate (CI, pre-deploy, canary).
"Added refund_unit_cents_v1. Input: 'refund $10'. Expected: refund(amount=1000, unit='cents'). Gate: pre-deploy blocker."
The lock prevents recurrence. Without it, the confession is just a story.
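One way to picture the lock is as a small record plus a pass/fail check. This is a sketch under assumptions: the `RegressionEval` class, the dict shape used for a tool call, and the two stand-in agents are hypothetical. A real harness would call the deployed agent and compare its actual tool call.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed representation of a tool call; real agents will differ.
ToolCall = Dict[str, object]


@dataclass
class RegressionEval:
    name: str
    user_input: str
    expected_call: ToolCall
    gate: str  # one of "ci", "pre-deploy", "canary"

    def passes(self, agent: Callable[[str], ToolCall]) -> bool:
        """True when the agent emits exactly the expected tool call."""
        return agent(self.user_input) == self.expected_call


REFUND_UNIT_CENTS_V1 = RegressionEval(
    name="refund_unit_cents_v1",
    user_input="refund $10",
    expected_call={"tool": "refund", "amount": 1000, "unit": "cents"},
    gate="pre-deploy",
)


def buggy_agent(text: str) -> ToolCall:
    # Stand-in for the failing agent: passes dollars where cents are expected.
    return {"tool": "refund", "amount": 10, "unit": "cents"}


def fixed_agent(text: str) -> ToolCall:
    # Stand-in for the agent after the fix.
    return {"tool": "refund", "amount": 1000, "unit": "cents"}


print(REFUND_UNIT_CENTS_V1.passes(buggy_agent))  # False
print(REFUND_UNIT_CENTS_V1.passes(fixed_agent))  # True
```

Keeping the gate on the record itself means anyone reading the postmortem can tell where the eval blocks a release, not just that it exists.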
5. The agent-version delta: prompt diff, tool diff, model diff
| prompt diff | tool diff | model diff |
|---|---|---|
| + 2 examples | + new endpoint | claude-3.5 → 3.7 |
| - 1 rule | unchanged sig | temp: 0.0 → 0.2 |
SRE postmortems do not have this. Agents have three independent surfaces that drift. All three must be diffed.
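The three-lane comparison can be sketched as a tiny function over two version snapshots. The function name `three_lane_delta` and the snapshot dicts are assumptions for illustration; the version labels are borrowed from this chapter's worked example. Every lane is reported, including unchanged ones, and a snapshot that omits a lane fails loudly with a `KeyError`.

```python
from typing import Dict

LANES = ("prompt", "tool", "model")


def three_lane_delta(working: Dict[str, str], failing: Dict[str, str]) -> Dict[str, str]:
    """Compare version labels lane by lane; every lane appears in the result."""
    delta = {}
    for lane in LANES:
        before, after = working[lane], failing[lane]  # KeyError if a lane is missing
        delta[lane] = "no change" if before == after else f"{before} -> {after}"
    return delta


working_day = {"prompt": "refund-agent-v3.1", "tool": "tools-v8", "model": "claude-sonnet-4-7"}
failing_day = {"prompt": "refund-agent-v3.2", "tool": "tools-v9", "model": "claude-sonnet-4-7"}

for lane, change in three_lane_delta(working_day, failing_day).items():
    print(f"{lane:6} {change}")
```

This only compares labels. The postmortem still needs the actual diff behind each changed label, which is why prompts, examples, and tool schemas have to be versioned in the first place.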
What classical SRE postmortems miss for agents
Non-determinism. Same prompt, same model, same seed: the answer can still differ. Tool-call ordering, server affinity, and sampling temperature can all shift the output. So the postmortem must answer: did we reproduce the failure more than once? If only once, leave a cold case flag.
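The "more than once" question can be answered mechanically by replaying the case file several times and counting. The sketch below assumes a hypothetical `replay` callable that returns `True` when the failure recurs; `fake_replay` is a deterministic stand-in, not a real agent. Treating zero reproductions as a cold case too is an assumption of this sketch.

```python
from typing import Callable, Dict


def reproduction_report(replay: Callable[[int], bool], attempts: int = 10) -> Dict[str, object]:
    """Replay the case file `attempts` times and count how often the failure recurs."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    reproduced = sum(1 for i in range(attempts) if replay(i))
    return {
        "attempts": attempts,
        "reproduced": reproduced,
        # Reproduced once or never: flag as a cold case instead of closing it.
        "cold_case": reproduced <= 1,
    }


def fake_replay(attempt: int) -> bool:
    # Stand-in for a staging replay: fails on every attempt except one.
    return attempt != 3


report = reproduction_report(fake_replay)
print(report)  # {'attempts': 10, 'reproduced': 9, 'cold_case': False}
```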
Prompt/data attribution. The bug may live in a few-shot example, a retrieved doc, or a tool description. git blame will not find it. So version the prompts, examples, and corpus. The agent-version delta must point at the exact diff.
Eval-set blind spots. For agents, the eval set is the deploy contract. Every postmortem must answer: why did the eval miss this? Three usual reasons. Slice not represented. Judge too lenient. Threshold too low.
Multi-suspect ambiguity. Sometimes two suspects co-caused it. Fix only the tool and a future prompt rewrite reopens the case. Fix only the prompt and a future tool rewrite reopens it. Both need locks. SRE rarely deals with this. Agents must.
The 5 Whys for agents
Classical 5 Whys asks "why?" five times. For agents, each "why?" maps to a layer.
why 10x refund? ──► tool called with amount=10000 (tool)
why amount=10000? ──► model passed dollars, tool wants cents (interface)
why dollars passed? ──► tool description did not state unit (tool-doc)
why prompt missed? ──► no few-shot showed unit conversion (prompt gap)
why eval missed? ──► no "refund $X" variants in eval set (eval gap)
Five whys, five distinct layers named. The bug lived in three. Classical 5 Whys would stop at "tool was called wrong." Agent 5 Whys keeps going until the eval gap.
Worked postmortem: billing agent over-refunded by 10x
TITLE: Refund agent issued $1,000 instead of $100 to user 8842
SEVERITY: SEV-2 (1 user, reversed in 4h)
DATE: 2026-05-11
Section 1 — Complaint slip.
"Your bot just refunded me one thousand dollars. I only asked for a hundred. I am not complaining for myself but you should know." — user 8842, ticket #44912, 09:12 IST
Trace: tr_2026-05-11_09-11_user8842. Model: claude-sonnet-4-7. Prompt: refund-agent-v3.2.
Section 2 — Case file.
input: "Please refund me $100 for the cancelled order #771."
prompt version: refund-agent-v3.2
tool registry: tools-v9 (refund sig changed in v9)
model: claude-sonnet-4-7, temp=0.0
retrieved: order-771.json (amount: $250, cancelled)
Replayed on staging. Reproduced 9 of 10 times. Once the agent asked a clarifying question and avoided the bug. Non-determinism noted.
Section 3 — Lineup.
prompt GUILTY no example covered "refund $X"
tool GUILTY unit field optional and undocumented
loop innocent single tool call
memory innocent fresh session
model innocent same model worked on tools-v8
Two suspects confessed. Co-causation.
Section 4 — Lock. Three evals added.
1. refund_unit_dollars_v1
in: "refund $100"
expect: refund(amount=10000,unit='cents') OR (100,'dollars')
2. refund_unit_cents_v1
in: "refund 100 cents"
expect: refund(amount=100, unit='cents')
3. refund_clarification_v1
in: "refund me for order 771" (ambiguous)
expect: clarifying question, no tool call
Tool schema: unit now required, enum ["cents","dollars"]. Prompt: two new few-shot examples. Both locks committed before re-deploy. All gates: pre-deploy blocker.
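The first eval accepts two spellings of the same refund, which is simplest to check by normalising to one unit. The sketch below is an assumption about how that could look: `to_cents`, `CENTS_PER_UNIT`, and the dict shape of the call are hypothetical names, not the team's actual harness. It also enforces the tightened schema, where `unit` is required and must be one of the two enum values.

```python
from typing import Dict

# Mirrors the tightened schema: unit is required, enum ["cents", "dollars"].
CENTS_PER_UNIT = {"cents": 1, "dollars": 100}


def to_cents(call: Dict[str, object]) -> int:
    """Validate a refund call and return its amount in cents."""
    unit = call.get("unit")
    if unit not in CENTS_PER_UNIT:
        raise ValueError(f"unit is required, one of {sorted(CENTS_PER_UNIT)}; got {unit!r}")
    amount = call.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError(f"amount must be a positive integer; got {amount!r}")
    return amount * CENTS_PER_UNIT[unit]


def refund_unit_dollars_v1(call: Dict[str, object]) -> bool:
    """Input "refund $100": passes for (10000, 'cents') or (100, 'dollars')."""
    return to_cents(call) == 10_000


print(refund_unit_dollars_v1({"amount": 10000, "unit": "cents"}))   # True
print(refund_unit_dollars_v1({"amount": 100, "unit": "dollars"}))   # True
print(refund_unit_dollars_v1({"amount": 1000, "unit": "dollars"}))  # False
```

Comparing in a single unit keeps the eval from rejecting a correct call just because the model chose the other valid spelling.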
Section 5 — Agent-version delta.
prompt diff (v3.1 → v3.2)
- removed two "verbose" examples to save tokens ← suspect change
no change to refund-handling section
tool diff (tools-v8 → tools-v9)
+ added unit field (optional) ← root-cause change
- removed legacy amount_cents parameter
model diff
no change. claude-sonnet-4-7 in both windows.
Tool diff is where the bug entered. Prompt diff made it worse by removing the examples that would have warned the model. Two changes, two days apart. Classic agent-failure pattern.
The postmortem lifecycle
incident → complaint slip → trace (case file) → lineup (suspects) → confession (root cause) → lock added → eval set growth
Every closed case grows the eval set by one entry. Over months, the eval set becomes the team's institutional memory. Open cold cases sit in a backlog with trace IDs preserved.
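The bookkeeping described here fits in a few lines. This is a sketch only: the `CaseBook` class and its method names are hypothetical, and a real team would likely keep this in its eval repository or issue tracker. It shows the two paths: a closed case adds one eval tied to its trace ID, and an unsolved case stays in a backlog with its trace ID preserved.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaseBook:
    eval_set: Dict[str, str] = field(default_factory=dict)  # eval name -> trace ID
    cold_cases: List[str] = field(default_factory=list)     # trace IDs still unsolved

    def mark_cold(self, trace_id: str) -> None:
        """Park an unreproduced incident without losing its trace ID."""
        if trace_id not in self.cold_cases:
            self.cold_cases.append(trace_id)

    def close_case(self, trace_id: str, eval_name: str) -> None:
        """Closing a case adds one eval entry tied to the offending trace."""
        if eval_name in self.eval_set:
            raise ValueError(f"eval {eval_name!r} already exists")
        self.eval_set[eval_name] = trace_id
        if trace_id in self.cold_cases:
            self.cold_cases.remove(trace_id)


book = CaseBook()
book.mark_cold("tr_example_unsolved")  # hypothetical trace ID
book.close_case("tr_2026-05-11_09-11_user8842", "refund_unit_dollars_v1")
print(len(book.eval_set), book.cold_cases)  # 1 ['tr_example_unsolved']
```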
Post-mortem patterns for agent incidents
- Anthropic incident retrospectives — internal postmortems track prompt diff, tool diff, and model diff as three separate lanes alongside infra cause; the role is forcing three-lane inspection instead of a single root cause.
- OpenAI status page — public incident writeups for ChatGPT and the API increasingly cite model-version and prompt-version deltas, not just infra outages; the role is normalising the agent-postmortem structure publicly.
- Notion AI quality team — thumbs-down complaints get reproduced into traces; each closed case ships a new eval row tied to the offending trace ID; the role is encoding the lock workflow into product culture.
- Mintlify agent outage writeups — tool-schema regressions are first-class incident categories rather than "just bugs"; the role is making tool-layer suspects publicly named.
- Google SRE Book (adapted for LLMs) — the blameless-postmortem pattern survives; the cause column shifts from "one config change" to "any of five suspects — possibly two at once".
- PagerDuty incident retros — structured retrospective templates; the role is providing the cultural substrate agent postmortems extend.
- Atlassian Statuspage — public-facing incident communication; the role is exposing how incident severity maps to user-visible status.
- GitHub incident reports — public retros with timeline, root cause, lessons; the role is the canonical postmortem-as-public-artifact pattern.
- Datadog notebooks-as-postmortem — investigation traces saved as living docs; the role is keeping the case file and the postmortem in one artifact.
- Honeycomb's BubbleUp in retros — anomaly attribution during postmortem write-up; the role is making the lineup walk reconstructable from data.
- Linear / Notion postmortem docs — templated incident write-ups with action items linked to tickets; the role is making the lock an issue that ships.
- Sentry crash-report-into-retro flow — error events linked to incident reports; the role is closing the loop from crash to postmortem.
- Anthropic Claude API changelog — public delta tracking per model snapshot; the role is making the three-lane delta column auditable across releases.
- OpenAI changelog — versioned API changes; the role is providing the prompt/tool/model version source-of-truth for postmortems.
- AWS Bedrock model deprecation notices — explicit sunset events; the role is forcing postmortems on planned changes, not just unplanned outages.
- Inkeep / Mendable customer-complaint workflows — every complaint becomes a regression case; the role is making the complaint-to-lock path a product workflow.
- Google Cloud incident reports — public root-cause analyses for managed AI services; the role is exposing how cloud-vendor incidents propagate into customer agent postmortems.
- Anthropic's "What we learned" posts — public reflections on model-behaviour incidents; the role is normalising the "why did our eval not catch this?" discipline at frontier scale.
- Slack channel-pinned postmortems: incident docs pinned in #incidents; the role is keeping the cultural artifact accessible to future debuggers.
- CrewAI multi-agent incident logs: per-role failure traces; the role is the multi-agent variant of three-lane delta inspection.
- Cursor's internal repo of agent regressions — every shipped fix carries the failing case; the role is making the lock a code artifact, not a wiki page.
- Klarna's bot-failure post-disclosure — public framing of why and what changed; the role is showing how customer-facing AI failures map to internal postmortem structure.
Recall: the five sections and the three-lane delta
- What are the five mandatory sections in an agent postmortem, and which one most distinguishes it from an SRE postmortem?
- Why must the postmortem record whether the failure was reproduced more than once?
- In the billing agent scenario, which two suspects co-caused the bug, and why was fixing only one not enough?
- What question must every agent postmortem answer that classical SRE postmortems do not bother to ask?
Interview Q&A
Q: Why is the agent-version delta — prompt, tool, model — a section on its own, rather than folded into the timeline? A: Because each surface drifts independently and on different release cadences. A single agent can have a prompt change, a tool change, and a model change in the same week. A timeline hides the structural fact that any one, or any combination, can produce the failure. A separate three-lane diff forces inspection of each surface. Common wrong answer to avoid: "Because diffs are easier to read in a table" — that is formatting, not structure. The real reason is three independent change sources that classical systems do not have.
Q: A complaint reproduces 6 times out of 10 in staging. Do you ship the fix or keep investigating? A: Ship only if the confession explains both the 6 reproductions and the 4 non-reproductions. If the 4 used a different code path (e.g., the model asked a clarifying question), that is consistent. If the non-reproductions are unexplained, a cold case suspect still lurks. Mark it in the postmortem and add a monitoring alarm. Common wrong answer to avoid: "60 percent reproduction is good enough, ship it" — non-determinism signals that the root cause may be incomplete. Document the gap.
Q: Why is "why did the eval not catch this?" mandatory in an agent postmortem? A: The eval set is the agent team's contract with itself. If a real bug slipped through, the contract has a hole. Naming the hole — slice gap, judge leniency, or low threshold — is the only way the lock gets sized correctly. Common wrong answer to avoid: "Evals are nice-to-have, not part of incident response" — for agents, evals are the deploy gate. Skipping this question is skipping the lock.
Q: When two suspects co-cause a failure, which one do you fix first? A: Both. Fixing only one leaves the other as a latent fault that reopens the case after the next adjacent change. In the billing example, fixing only the tool means a future prompt edit can reintroduce the bug, and vice versa. The postmortem must commit to both fixes with separate locks. Common wrong answer to avoid: "Fix the cheaper one and monitor the other" — monitoring is not a lock. The regression will eventually happen, and you will write the same postmortem twice.
Apply now (10 min)
Step 1 — model the exercise. Here is the skeleton 5-section postmortem I would write for the chapter's billing-agent over-refund incident:
| Section | Content |
|---|---|
| Complaint slip (verbatim) | "The bot refunded the customer 10x the order total." — customer id 482a, trace id 7f3a... |
| Reproduction recipe | seed=42; load case file from trace 7f3a...; replay tool fixture lookup_order_v3.json; expect 10x refund |
| Lineup verdicts | prompt: clean (interrogated) / tool: returned wrong currency unit (confessed) / loop: clean / memory: clean / model: clean |
| Lock added | regression case BILL-482a-001 added to eval set with assertion refund_amount <= order_total * 1.5 |
| Three-lane delta | prompt: unchanged / tool schema: changed (added currency_minor_units flag, default behavior reversed) / model: unchanged |
Notice what the three-lane delta reveals: the tool schema changed while the prompt stayed the same, so the prompt's assumption about the schema silently went stale. Even though only the tool confessed in the lineup, fixing only the tool would leave that assumption in place, and a future prompt edit could reintroduce the bug.
Step 2 — your turn. Pick a recent failure in your agent (or any thumbs-down trace). Write a 5-section postmortem using the template above. Fill in the complaint slip verbatim, the trace ID and reproduction recipe, the lineup verdicts, the lock you would add, and the three-lane version delta. Do not skip the delta even if you think nothing changed.
Step 3 — reproduce from memory. Draw the postmortem lifecycle. Start at "incident" and end at "eval set growth." Label each arrow with what the engineer is doing. Add one branch for the cold case path — what happens when reproduction fails.
What you should remember
This chapter explained why an agent postmortem is not an SRE postmortem with the word "agent" added. Three structural facts force a different template: causation often involves two suspects in combination, the bug may not reproduce the same way twice, and the prompt/tool/model surfaces drift on independent release cadences. The five-section template — complaint slip, reproduction recipe, lineup verdicts, lock, three-lane delta — encodes those facts so the next engineer can pick up the case file months later.
You also learned that "why did the eval not catch this?" is a mandatory question, not a nicety. The eval set is the team's contract with itself; every miss is a hole in the contract. Naming the hole (slice gap, judge leniency, low threshold) is the only way the lock gets sized correctly.
Carry this diagnostic forward: when a postmortem feels suspiciously short, suspect a missing suspect. Two-suspect causation is the norm, not the exception. A one-line root cause is usually a one-line first-suspect; the second suspect is hiding in the same week's release notes.
Remember:
- An agent postmortem has five sections, not three. The complaint slip and the lock are first-class.
- Two-suspect causation is the norm. Both must be locked, not one and monitored.
- The three-lane delta — prompt, tool, model — must appear even when nothing changed. The absence of a delta is also evidence.
- "Why did the eval not catch this?" names the hole in the team's contract with itself.
- The cold case path is part of the template. Mark unreproducible incidents explicitly so future debuggers know what is unsolved.
Bridge. A postmortem is only as good as the trace it points at. Traces are only as good as the data they capture. But raw data carries user PII, billing details, medical notes. We must keep enough to debug, not enough to betray users. → 19-data-privacy-retention.md