09. Crash recovery and failure shapes — Checkpointing, idempotency, and topology-specific cracks

~18 min read. The pod died at step 4 of 7. The user is staring at a spinner. And the topology you picked decides whether this is a five-second resume or a five-dollar restart.


The 2 AM page that taught us durability

Friday. Travel-booking agent. Seven-step trajectory — search flights, search hotels, compare, book flight, book hotel, charge card, send confirmation. The agent reached step 4 (book flight) and the worker pod was OOM-killed. Kubernetes restarted it. The orchestration layer had no checkpoints. The agent restarted from step 1 — repeated every search, burned another $0.17 in tokens, and booked a second flight because the airline's API had no dedup key. Customer charged twice. Incident cost: $847 in refunds, one angry enterprise account, and a postmortem that said "add checkpointing" in bold.

That postmortem was incomplete. Checkpointing would have saved the single-agent case. But the team was also running an orchestrator topology where the boss crashed — losing the handoff memo to three workers who kept running, producing results nobody would ever collect. Different shape, different crack, different recovery strategy entirely.

Budgets constrain cost. Tenancy constrains access. But neither saves you when the process crashes mid-trajectory. This file is about what does.


What we know so far

From blast radius: every tool call has a damage footprint. From budgets: tokens and wall-clock are finite. From tenancy: data stays partitioned. None of these mechanisms address a simpler question — what happens when the compute disappears mid-step?

We need two things: (1) a way to recover a single trajectory after a crash, and (2) a way to predict which crash will hit first based on the topology we chose. The first is checkpointing. The second is failure-shape analysis. Together they form the durability layer.


What this file solves

  1. How to checkpoint agent state so crashes don't lose work
  2. When to resume vs restart after a failure
  3. How idempotency keys make resume safe
  4. What compensating transactions (sagas) buy for non-idempotent steps
  5. Which failure mode appears first in each topology
  6. How the failure shape determines the right recovery strategy

Every agent is a state machine — and state machines crash

People draw agents as a friendly loop. Reality is harsher. Every step is a transition between states. Every state can fail.

   ┌──────────┐  tool ok  ┌─────────┐  check ok  ┌──────────┐
   │ thinking │──────────→│ acting  │───────────→│ checking │
   └─────┬────┘           └────┬────┘            └─────┬────┘
         │                     │ tool fail              │ done
         │ done                ▼                        ▼
         │                ┌─────────┐             ┌─────────┐
         └───────────────→│ failed  │             │ success │
                          └─────────┘             └─────────┘

The lethal moment: between acting and checking — after the tool call fires but before the agent records the result. If the pod dies here, nobody knows whether the action succeeded. The user sees a spinner. The conversation is gone. The half-finished trajectory is lost.

The fix: make the state machine explicit. Write each state to durable storage. That is a checkpoint.
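
A minimal sketch of what "explicit and durable" could look like. The edges are read off the diagram above; the `checking -> thinking` loop-back, the `CheckpointedAgent` class, and the table layout are assumptions for illustration, and SQLite stands in for whatever durable store is actually used.

```python
import json
import sqlite3
from enum import Enum
from typing import Optional


class State(str, Enum):
    THINKING = "thinking"
    ACTING = "acting"
    CHECKING = "checking"
    FAILED = "failed"
    SUCCESS = "success"


# Edges from the diagram. The checking -> thinking loop-back is an assumption.
TRANSITIONS = {
    State.THINKING: {State.ACTING, State.FAILED},
    State.ACTING: {State.CHECKING, State.FAILED},
    State.CHECKING: {State.SUCCESS, State.THINKING},
    State.FAILED: set(),
    State.SUCCESS: set(),
}


class CheckpointedAgent:
    def __init__(self, conn: sqlite3.Connection, thread_id: str):
        self.conn = conn
        self.thread_id = thread_id
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS checkpoints ("
                "thread_id TEXT, seq INTEGER, state TEXT, payload TEXT, "
                "PRIMARY KEY (thread_id, seq))"
            )
        row = conn.execute(
            "SELECT state, seq FROM checkpoints WHERE thread_id = ? "
            "ORDER BY seq DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        # Resume from the last durable state, or start fresh.
        self.state, self.seq = (State(row[0]), row[1]) if row else (State.THINKING, 0)

    def transition(self, new_state: State, payload: Optional[dict] = None) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        next_seq = self.seq + 1
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
                (self.thread_id, next_seq, new_state.value, json.dumps(payload or {})),
            )
        # Advance in memory only after the write is durable.
        self.state, self.seq = new_state, next_seq
```

A new process that constructs `CheckpointedAgent` with the same `thread_id` picks up the last recorded state instead of starting at `thinking`.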


Checkpoint design — what, where, when

A checkpoint is a snapshot of agent state at a known-good point. Three design questions.

What to checkpoint?

Strategy                               Write cost  Recovery cost             Best for
─────────────────────────────────────  ──────────  ────────────────────────  ─────────────────────────────────
Full state every step                  High        Zero replay               Short trajectories (<10 steps)
Delta only                             Low         Replay from base          Long trajectories, high step rate
Hybrid (full every N, deltas between)  Medium      Replay from nearest full  Production default
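
The hybrid row can be sketched as follows. This is an illustration under assumptions: `FULL_EVERY`, `make_checkpoint`, and `restore` are hypothetical names, state is a flat dict, and deltas record only added or changed keys (key deletion is not handled).

```python
FULL_EVERY = 5  # hypothetical N: one full snapshot every 5 steps


def make_checkpoint(step: int, state: dict, prev_state: dict) -> dict:
    """Full snapshot on every Nth step, otherwise only the keys that changed."""
    if step % FULL_EVERY == 0:
        return {"step": step, "kind": "full", "data": dict(state)}
    delta = {k: v for k, v in state.items() if k not in prev_state or prev_state[k] != v}
    return {"step": step, "kind": "delta", "data": delta}


def restore(checkpoints: list) -> dict:
    """Rebuild state: nearest full snapshot, then replay the deltas after it."""
    last_full = max(i for i, c in enumerate(checkpoints) if c["kind"] == "full")
    state = dict(checkpoints[last_full]["data"])
    for cp in checkpoints[last_full + 1:]:
        state.update(cp["data"])
    return state
```

Recovery replays at most `FULL_EVERY - 1` deltas, which is the "replay from nearest full" cost in the table.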

Where to checkpoint?

Store             Latency  Use when
────────────────  ───────  ──────────────────────────────────
Redis             <2 ms    Hot trajectories, in-flight state
Postgres (JSONB)  ~10 ms   Durable, queryable, default choice
S3 / blob         ~50 ms   Cold storage of finished runs

When to checkpoint?

After every state transition — think → act → observe. Before every tool call with non-trivial blast radius. After every budget check. The rule: any place a crash would lose information you cannot rebuild — checkpoint there.

Here's the tension. Checkpointing every micro-step is safe but adds latency. Checkpointing nothing is fast but fragile. The right frequency depends on two things: how expensive a restart would be (tokens + wall-clock) and how likely a crash is at that point. Expensive steps with flaky tools get checkpoints. Cheap pure-reasoning steps can batch.


Resume vs restart — the core trade-off

The pod is back. State is in Postgres. Now what?

        crash at step 4
     ┌────────┴────────┐
     ▼                 ▼
 ┌────────┐       ┌─────────┐
 │ resume │       │ restart │
 │ from 4 │       │ from 1  │
 └────┬───┘       └────┬────┘
      │                │
   risk: side          cost: $$$
   effects             repeat steps 1-3
   replayed            user waits longer

Resume. Load last checkpoint. Continue from step 4. Cheap. Fast. But dangerous if the step that was in flight, or any step run after the last checkpoint, had side effects that might replay.

Restart. Discard everything. Rerun from step 1. Safe. Expensive. Blows the budget if early steps cost real tokens.

Resume looks better — but it has a sharp edge. If step 3 sent an email and you re-run it "to be safe," the user gets two emails. Resume works only when steps are idempotent — running them twice produces the same outcome as running them once.


Idempotency keys — making resume safe

Recall from file 07: idempotency keys derive from intent (e.g., trip_42_charge_v1), not from attempt (a random UUID per retry). The same principle is now load-bearing in the checkpoint context.

Every tool call should carry an idempotency key. The agent saves the key in the checkpoint before the call. After crash and resume, the same key replays. The downstream service deduplicates.

agent checkpoint        tool side
       │                    │
   write key K              │
       │──── call(K) ───────→
       │                    │  check K seen?
       │                    │    yes → return cached
       │                    │    no  → execute, store K
       │←─── response ──────│
   write result

Without idempotency, resume is roulette. Restart is the only safe option. But restart blows your budget if early steps cost $0.40 in tokens. That is why blast-radius discipline (file 07) and checkpointing are joined at the hip.
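
The sequence in the diagram can be sketched like this. Everything here is a stand-in: `intent_key` follows the `trip_42_charge_v1` naming used above, `DedupingTool` is a hypothetical in-memory service that deduplicates on the key, and the checkpoint is a plain dict.

```python
def intent_key(trip_id: int, action: str, version: int = 1) -> str:
    """Key derived from intent, so every retry of the same intent reuses it."""
    return f"trip_{trip_id}_{action}_v{version}"


class DedupingTool:
    """Hypothetical downstream service that remembers keys it has seen."""

    def __init__(self):
        self._seen = {}
        self.executions = 0

    def call(self, key: str, payload: dict) -> dict:
        if key in self._seen:          # seen before: return the cached result
            return self._seen[key]
        self.executions += 1           # first time: execute and store under K
        result = {"confirmation": f"CONF-{self.executions}", "payload": payload}
        self._seen[key] = result
        return result


def run_step(checkpoint: dict, tool: DedupingTool, key: str, payload: dict) -> dict:
    checkpoint["pending_key"] = key    # write K before the call
    result = tool.call(key, payload)
    checkpoint["result"] = result      # write the result after the call
    del checkpoint["pending_key"]
    return result
```

If the process dies between the call and the result write, the checkpoint still holds `pending_key`, and re-running `run_step` with that key returns the cached result instead of executing twice.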


At-least-once is the practical target

Distributed-systems vocabulary that applies directly to agents:

  • At-least-once. The step runs one or more times. Resume guarantees this. Safe only with idempotent tools.
  • Exactly-once. The step runs exactly once. Requires coordinated commit between agent state and tool state. Most APIs don't offer two-phase commit.

Production target: at-least-once delivery + idempotent tools = functionally exactly-once. Chase true exactly-once only when coordination cost is justified (e.g., financial settlement systems).


Compensating transactions — the saga pattern for agents

Sometimes a step is neither idempotent nor undoable through replay. You booked a flight in step 2. Step 4 crashed. You cannot restart (flight already booked). You cannot resume blindly (step 3 was a payment that cleared).

The fix: run a compensating transaction — a second action that semantically undoes the first.

step 2: book flight    → confirmation ABC
step 3: charge card    → payment XYZ
step 4: crash
compensation (reverse order):
   refund payment XYZ
   cancel flight ABC
   then restart cleanly from step 1

This is the Saga pattern in agent land. Each forward action has a paired backward action. The agent registers compensations as it goes. On unrecoverable failure, compensations run in reverse order.
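
A small sketch of registering compensations as the agent goes. The `Saga` class and its method names are hypothetical; a real implementation would also need to persist the compensation list in the checkpoint so it survives the crash.

```python
from typing import Any, Callable, List, Tuple


class Saga:
    def __init__(self):
        # Stack of (step name, undo callable), pushed in forward order.
        self._compensations: List[Tuple[str, Callable[[], None]]] = []

    def run(self, name: str, action: Callable[[], Any], compensation: Callable[[Any], None]) -> Any:
        """Run the forward action, then register its paired backward action."""
        result = action()
        self._compensations.append((name, lambda: compensation(result)))
        return result

    def compensate(self) -> List[str]:
        """Undo every registered step in reverse order; return the names undone."""
        undone = []
        while self._compensations:
            name, undo = self._compensations.pop()
            undo()
            undone.append(name)
        return undone
```

With the example above, registering "book flight" then "charge card" and calling `compensate()` refunds the payment first and cancels the flight second.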

When to use each strategy:

Step property                      Recovery strategy
─────────────────────────────────  ──────────────────────────────────────
Idempotent (read-only, dedup key)  Resume directly
Has idempotency key on tool side   Resume with same key
Non-idempotent, has compensation   Compensate then restart
Non-idempotent, no compensation    Restart from scratch (accept the cost)
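
The table reads naturally as a decision function. The sketch below is one way to encode it; `StepProps` and the returned strategy labels are hypothetical names, and the row order doubles as the precedence order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProps:
    idempotent: bool = False        # read-only or otherwise safe to repeat
    tool_dedup_key: bool = False    # tool deduplicates on an idempotency key
    has_compensation: bool = False  # a paired undo action exists


def recovery_strategy(step: StepProps) -> str:
    """Pick a recovery strategy, checking the table rows top to bottom."""
    if step.idempotent:
        return "resume"
    if step.tool_dedup_key:
        return "resume_same_key"
    if step.has_compensation:
        return "compensate_then_restart"
    return "restart"
```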

Worked example — 7-step travel booking, crash at step 4

Budget = $2.00, 60s wall-clock.

step 1: search_flights         $0.05   2s   [idempotent read]
step 2: search_hotels          $0.04   2s   [idempotent read]
step 3: think_compare          $0.08   1s   [pure reasoning]
step 4: book_flight            CRASH — pod OOM at 18s
step 5: book_hotel             —
step 6: charge_card            —
step 7: send_confirmation      —

Checkpoint after step 3 in Postgres. Cost so far = $0.17.

Restart path. Discard. Rerun 1-7. Steps 1-3 are paid for twice, so spend before step 4 even begins is ≈ $0.34 ($0.17 sunk + $0.17 repeated). Time ≈ 25s extra. User waits. Risk of double-booking if step 4 partially completed before crash.

Resume path. Load checkpoint c3. Run step 4 with key trip_42_book_flight_v1. If the airline booked before the crash, its API returns the original confirmation. No double-booking. Continue to step 5. Saves $0.17 and ~5s.

  step 1   step 2   step 3   step 4   step 5   step 6   step 7
   ●────────●────────●────────✗────────○────────○────────○
   │        │        │        ▲
   ▼        ▼        ▼        │
  [c1]    [c2]    [c3]      crash
   │        │        │
   └────────┴────────┘
        Postgres checkpoint table
        thread_id = user_42_session_99
        last_good = c3

Resume works because every write tool carries an idempotency key. Without that discipline, restart is mandatory and the user pays twice — in time and tokens.
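
The arithmetic behind the two paths, using the per-step figures from the listing above. Costs are kept in integer cents to avoid floating-point noise; the variable names are illustrative only.

```python
# (name, cost in cents, seconds) for the steps that finished before the crash
COMPLETED = [
    ("search_flights", 5, 2),
    ("search_hotels", 4, 2),
    ("think_compare", 8, 1),
]

sunk_cents = sum(cost for _, cost, _ in COMPLETED)
sunk_seconds = sum(secs for _, _, secs in COMPLETED)

# Restart pays for steps 1-3 a second time; resume pays nothing extra for them.
restart_cents = 2 * sunk_cents
resume_cents = sunk_cents

saved_cents = restart_cents - resume_cents
saved_seconds = sunk_seconds

print(f"sunk before crash: ${sunk_cents / 100:.2f} over {sunk_seconds}s")
print(f"restart, through step 3: ${restart_cents / 100:.2f}")
print(f"resume saves: ${saved_cents / 100:.2f} and {saved_seconds}s")
```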


Timeout ambiguity — the hardest crash to diagnose

A tool call timed out at 30 seconds. Did it succeed or fail?

You don't know. That's the point. The agent must treat timeouts as "uncertain" — not as failures. The correct response:

  1. Query the tool's status endpoint: GET /bookings?idempotency_key=trip_42_book_flight_v1
  2. If found → record success, continue
  3. If not found → retry with the same idempotency key
  4. Never retry with a fresh key (creates duplicates)

This is the single most common mistake in crash recovery: treating timeout as failure and retrying with a new key. A write that took 29 seconds may have completed — retrying with a different key produces two effects.
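
The four-step response above, as a sketch. `resolve_timeout` and `FakeBookings` are hypothetical; the fake stands in for a booking API that exposes a status lookup by idempotency key and deduplicates writes on that key.

```python
from typing import Callable, Optional


def resolve_timeout(
    key: str,
    lookup: Callable[[str], Optional[str]],
    retry: Callable[[str], str],
) -> str:
    """Treat a timeout as 'uncertain': check status first, retry with the SAME key."""
    existing = lookup(key)
    if existing is not None:
        return existing        # the write landed; record success and continue
    return retry(key)          # not found; retry, never with a fresh key


class FakeBookings:
    """Hypothetical in-memory booking service keyed by idempotency key."""

    def __init__(self):
        self.by_key = {}
        self.executions = 0

    def book(self, key: str) -> str:
        if key not in self.by_key:
            self.executions += 1
            self.by_key[key] = f"CONF-{self.executions}"
        return self.by_key[key]

    def status(self, key: str) -> Optional[str]:
        return self.by_key.get(key)
```

In both outcomes (write landed before the timeout, or never arrived) the booking executes exactly once.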


Now: which crash hits first depends on your topology

Everything above assumes a single agent on a single trajectory. Checkpoint, resume, done. But most production systems are not a single loop. They are orchestrators, pipelines, hierarchies. Each topology has a signature first crack — the failure that shows up before any other, predictable from shape alone.

The recovery strategy that works for one shape may be useless — or harmful — for another.


Six topologies, six first cracks

        TOPOLOGY              FIRST CRACK                FIRST SIGNAL
        ────────              ───────────                ────────────

  1.  ReAct loop          → runaway iteration         → step count > 80% of cap
       T→A→O→T→A→O→...

  2.  Orchestrator        → lost handoff              → worker silent past timeout
       boss → workers        (memo malformed or lost)

  3.  Pipeline            → cascading poison          → E2E quality drops,
       A→B→C→D              (stage A wrong, all          stages individually green
                             downstream amplifies)

  4.  Plan-Execute        → stale plan                → step k fails on
       plan → exec[k]       (world moved since plan)     "object not found"

  5.  Debate / critique   → deadlock or capitulation  → rounds hit cap,
       A ⇄ B ⇄ A ⇄ B...                                 no convergence delta

  6.  Hierarchical        → root bottleneck           → root queue depth grows,
       root → mid → leaf                                 leaves idle

The architect's job: pick a shape, then pre-wire recovery and alarms for that shape's signature crack before shipping. Not after the first outage.


ReAct crash profile — runaway, oscillation, premature stop

Three failure shapes in the basic loop:

Runaway. Same tool, similar args, no progress. Twenty iterations. Forty. Sixty. Recovery: hard step cap (the give-up rule), no-progress detection, budget trip.

Oscillation. Tool A says X. Tool B says ¬X. Agent thrashes. Recovery: surface the conflict, escalate to a human or fallback model.

Premature stop. Agent declares "done" after one call, leaving obligations unfinished. Recovery: explicit completion checklist before exit allowed.

Checkpoint strategy for ReAct: checkpoint every N steps (not every step — too expensive for fast loops). On crash, resume from last checkpoint. The give-up rule fires on total steps including pre-crash steps, so a resumed agent doesn't get a fresh budget.
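
One possible shape for a give-up rule whose step count survives a crash. `GiveUpRule` is a hypothetical class; as a simplification it detects no-progress only when the last `window` calls are identical (the text says "similar args", which would need a fuzzier comparison), and the 80% warning mirrors the first-signal column in the topology table.

```python
from collections import deque


class GiveUpRule:
    def __init__(self, step_cap: int, window: int = 3, steps_taken: int = 0, recent=()):
        self.step_cap = step_cap
        self.window = window
        self.steps_taken = steps_taken          # includes pre-crash steps on resume
        self.recent = deque(recent, maxlen=window)

    def record(self, tool: str, args: str) -> str:
        self.steps_taken += 1
        self.recent.append((tool, args))
        if self.steps_taken >= self.step_cap:
            return "stop: step cap"
        if len(self.recent) == self.window and len(set(self.recent)) == 1:
            return "stop: no progress"
        if self.steps_taken > 0.8 * self.step_cap:
            return "warn: above 80% of cap"
        return "continue"

    def to_checkpoint(self) -> dict:
        return {"steps_taken": self.steps_taken, "recent": [list(r) for r in self.recent]}

    @classmethod
    def from_checkpoint(cls, data: dict, step_cap: int, window: int = 3) -> "GiveUpRule":
        return cls(step_cap, window, data["steps_taken"], [tuple(r) for r in data["recent"]])
```

Because `steps_taken` is restored from the checkpoint, a resumed agent continues counting from where it crashed rather than from zero.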


Orchestrator crash profile — lost handoff, boss death, worker silence

The CEO-and-departments shape. Boss delegates, reads memos, decides. Failure shapes:

Lost handoff. Worker never replies, replies in wrong schema, or echoes the prompt. Recovery: schema-validated handoffs, hard worker timeout, automatic re-dispatch.

Boss death. The orchestrator itself crashes. Workers keep running, producing results nobody collects. Recovery: workers write results to durable queue (not just return to caller). New orchestrator instance reads the queue and resumes coordination.

Worker silence. Worker hangs on a flaky tool. Recovery: wall-clock deadline per worker, default to "failed handoff" on timeout.

Checkpoint strategy for orchestrator: checkpoint the dispatch state — which workers were assigned what, which have reported back. On orchestrator crash, the new instance knows what's outstanding without re-dispatching everything.
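
A sketch of dispatch-state recovery after boss death. `DispatchState`, `recover`, and the message shape (`worker_id`, `result`) are assumptions; the durable queue is modelled as a plain list of messages the workers wrote while the orchestrator was down.

```python
from dataclasses import dataclass, field


@dataclass
class DispatchState:
    assigned: dict = field(default_factory=dict)   # worker_id -> task
    reported: dict = field(default_factory=dict)   # worker_id -> result

    def dispatch(self, worker_id: str, task: str) -> None:
        self.assigned[worker_id] = task

    def outstanding(self) -> list:
        return sorted(w for w in self.assigned if w not in self.reported)


def recover(state: DispatchState, durable_queue: list) -> list:
    """Drain results written during the outage; return workers still owed a reply."""
    for msg in durable_queue:
        if msg["worker_id"] in state.assigned:
            # setdefault keeps the first result if a worker reported twice.
            state.reported.setdefault(msg["worker_id"], msg["result"])
    return state.outstanding()
```

Only the workers returned by `recover` need a timeout check or re-dispatch; the rest already delivered.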


Pipeline crash profile — cascading poison, silent decay

A → B → C → D. No backward arrows. Signature failure: stage A emits subtly wrong data. B trusts it. C amplifies. By D, the output is confidently wrong.

clean input → [A]   →   [B]    →    [C]    →   [D]
              ✓     hallucinated   built on    confident
                    citation       fake fact   garbage out

The insidious part: every stage's local eval is green. Schema valid. No errors. But quality decayed. Per-stage monitoring misses it entirely.

Recovery: E2E validation gates between stages. If gate fails, roll back to the last stage that passed its gate — don't restart from scratch (too expensive) unless the corruption started at stage A.

Checkpoint strategy for pipelines: checkpoint between every stage. A crash at stage C resumes from the output of stage B. But if the input to B was poisoned, resuming from B's output perpetuates the poison. E2E eval decides whether to resume or roll further back.
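
The rollback rule can be sketched as "find the first stage whose gate did not pass and re-run from there". `first_stage_to_rerun` is a hypothetical helper, and a stage with no recorded gate result (for example, one that crashed mid-run) is treated as not passed.

```python
def first_stage_to_rerun(stages: list, gate_passed: dict) -> int:
    """Index of the first stage whose gate is missing or failed.

    Everything before that index has a clean checkpointed output to resume from.
    Returns len(stages) when every gate passed.
    """
    for i, stage in enumerate(stages):
        if not gate_passed.get(stage, False):
            return i
    return len(stages)
```

A result of 0 means the corruption started at stage A and the run restarts from scratch; any later index resumes from the previous stage's output.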


Plan-execute crash profile — stale plan, partial execution

Agent writes a plan, then executes step by step. Failure shapes:

Stale plan. Plan written at T=0. Step 4 runs at T=60s. The world moved — file deleted, API deprecated, another agent modified the resource. Plan acts on ghost state.

Partial execution. Step 3 fails. Stop? Skip? Replan? Without explicit policy, behavior is undefined.

Recovery: plan TTL (discard if execution lags planning by > N seconds), replan trigger on any step failure, world-state validation before each step.

Checkpoint strategy: checkpoint the plan itself alongside execution state. On crash, check plan freshness before resuming. If stale, replan from current world state rather than resuming an outdated plan.
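
A sketch of the freshness check on resume. The `Plan` dataclass, the 60-second default TTL, and the `world_ok` callback are all assumptions for illustration; `now` is passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    steps: List[str]
    created_at: float           # seconds, same clock as `now`
    ttl_seconds: float = 60.0   # hypothetical default

    def is_stale(self, now: float) -> bool:
        return now - self.created_at > self.ttl_seconds


def on_resume(plan: Plan, next_step: int, now: float, world_ok: Callable[[str], bool]) -> str:
    """Decide whether a checkpointed plan can be resumed or must be rebuilt."""
    if plan.is_stale(now):
        return "replan"                     # plan TTL exceeded
    if not world_ok(plan.steps[next_step]):
        return "replan"                     # world-state validation failed
    return "resume"
```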


Debate crash profile — deadlock, capitulation

A drafts, B critiques, A revises. Failure shapes:

Deadlock. B says "too long." A shortens. B says "too short." No convergence ever.

Capitulation cascade. A is too agreeable. B's critiques accumulate, each breaking a previous fix. By round 5, quality is worse than the first draft.

Recovery: hard round cap (2-3 rounds), convergence delta threshold (stop if change < ε), independent judge for tie-breaks.

Checkpoint strategy: checkpoint after each round. On crash, resume from last complete round — never mid-revision (partial revisions corrupt the draft).
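
A sketch of the round cap and convergence-delta check. The change metric here (one minus `difflib.SequenceMatcher` similarity) and the threshold values are assumptions; any distance between consecutive drafts would do. `drafts` holds only completed rounds, matching the per-round checkpoint.

```python
import difflib
from typing import List, Optional


def change_ratio(prev: str, curr: str) -> float:
    """0.0 means identical drafts, 1.0 means nothing in common."""
    return 1.0 - difflib.SequenceMatcher(None, prev, curr).ratio()


def should_stop(drafts: List[str], max_rounds: int = 3, epsilon: float = 0.02) -> Optional[str]:
    """Return a stop reason, or None to allow another critique round."""
    rounds = len(drafts) - 1               # the first draft is round 0
    if rounds >= max_rounds:
        return "round cap"
    if rounds >= 1 and change_ratio(drafts[-2], drafts[-1]) < epsilon:
        return "converged"
    return None
```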


The diagnostic table — topology to recovery mapping

Topology      Signature crack            Recovery strategy                       Checkpoint granularity
────────────  ─────────────────────────  ──────────────────────────────────────  ──────────────────────
ReAct         Runaway iteration          Resume + step budget carries over       Every N steps
Orchestrator  Lost handoff / boss death  Durable result queue + dispatch replay  Per-dispatch event
Pipeline      Cascading poison           Roll back to last clean gate            Between every stage
Plan-execute  Stale plan                 Replan from current world state         Plan + execution state
Debate        Deadlock / capitulation    Ship last good draft                    Per complete round
Hierarchical  Root bottleneck            Shard root + summarization layer        Per-level aggregation

The failure mode determines the recovery strategy. A pipeline crash doesn't need idempotency keys (stages are usually pure transforms). An orchestrator crash doesn't need sagas (workers are independent). Match the medicine to the disease.


The performance-vs-recoverability dial

Every checkpoint adds latency:

checkpoint frequency     recovery cost          runtime overhead
─────────────────────    ─────────────────────  ────────────────
every micro-step         zero (instant resume)  +15-40% latency
every tool call          replay 1 step max      +5-10% latency
every N steps            replay up to N steps   +1-3% latency
never                    full restart           zero overhead

The right setting depends on:

  • Crash probability. Flaky infra → checkpoint more. Stable infra → checkpoint less.
  • Step cost. Expensive steps (large context, slow tools) → checkpoint before them. Cheap steps → batch.
  • Topology. Fast ReAct loops (100ms/step) can't afford per-step checkpoints. Slow pipelines (10s/stage) can.
  • Blast radius. Steps that write to external systems → always checkpoint before. Pure reasoning → skip.

The dial is not global. It's per-step. Checkpoint before the expensive, dangerous, non-idempotent steps. Skip the cheap, safe, replayable ones.
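
One way to express the per-step dial as a predicate. `Step`, `checkpoint_before`, and the cost threshold are hypothetical; crash probability and topology are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    cost_usd: float
    writes_external: bool
    idempotent: bool


def checkpoint_before(step: Step, cost_threshold_usd: float = 0.10) -> bool:
    """Checkpoint ahead of expensive, dangerous, or non-idempotent steps."""
    return (
        step.writes_external
        or not step.idempotent
        or step.cost_usd >= cost_threshold_usd
    )
```

Under these assumptions a cheap idempotent read is skipped, while any step that writes to an external system gets a checkpoint regardless of cost.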


Real-world recognition

You are debugging a production agent. How do you know which failure shape you're in?

You observe...                                                   You're likely seeing...    First response
───────────────────────────────────────────────────────────────  ─────────────────────────  ───────────────────────────────────────────────
Same tool called 20+ times, no progress                          ReAct runaway              Check give-up rule, budget carry-over on resume
Worker produced result but orchestrator has no record            Boss death / lost handoff  Check dispatch queue durability
Final output is wrong but every stage logged success             Pipeline cascading poison  Add E2E gate, check stage A output
Step fails with "not found" on object that existed at plan time  Plan-execute stale plan    Check plan TTL, add pre-step validation
Round count at max, output oscillating                           Debate deadlock            Ship last good draft, add convergence delta
Agent resumes but user gets duplicate email/charge               Missing idempotency key    Add key before call, check tool dedup support

Common wrong model — "just add retries"

The instinct is: crash → retry. But retries without structure create new failures:

  • Retry without idempotency key → duplicate side effects
  • Retry from step 1 when only step 4 failed → wasted budget
  • Retry the plan when the plan is stale → same failure repeats
  • Retry in a debate loop → resets round count, enables infinite loops

Retries are a mechanism. Recovery is a strategy. The strategy must be topology-aware: which step failed, what was its blast radius, was it idempotent, and does the surrounding topology need its own recovery (orchestrator re-dispatch, pipeline rollback, plan refresh).


Interview Q&A

Q: Why is "exactly-once" rarely achievable in agent steps?

A: It requires coordinated commit between agent state and every external tool's state. Most APIs don't offer two-phase commit. So we target at-least-once delivery plus idempotency keys on the tool side. The tool deduplicates replays. Functionally indistinguishable from exactly-once for the user.

Wrong answer to avoid: "Use a retry counter to stop at one." A counter does nothing if the crash happens after the call succeeded but before the agent recorded success. The agent thinks it never ran. It retries. Tool called twice. Idempotency on the tool side is the only real fix.

Q: When would you choose restart over resume?

A: When any step has side effects you cannot guarantee are idempotent or compensable. Example: a tool sends a transactional Slack message with no dedup key. Restart + compensations for committed actions is safer than resuming into potential duplicates.

Wrong answer to avoid: "Always restart, it's safer." Restart blows token budgets, doubles latency. Default to resume; restart is the fallback.

Q: A team says "let's use ReAct everywhere — it's flexible." What's the risk?

A: Runaway iteration becomes the dominant failure on any task with weak grounding or flaky tools. ReAct without a tight give-up rule burns the budget on 40 iterations of the same wrong path. Flexibility is a feature only when paired with hard step caps and no-progress detection.

Wrong answer to avoid: "The model will figure out when to stop." Models drift; budgets do not.

Q: E2E quality drops in a pipeline agent but every stage's eval is green. What's happening?

A: Cascading poison. Stage A emits subtly wrong data that passes its narrow eval. B amplifies. C builds on it. Per-stage evals miss this because they test stages in isolation. Fix: E2E eval plus cross-stage validation gates.

Wrong answer to avoid: "One stage must have a bug." Often there is no per-stage bug — each stage does its job correctly on subtly bad input.

Q: A tool call timed out at 30s. Did it succeed or fail?

A: Unknown — that's the point. Query the tool's status endpoint with the original idempotency key. If found, record success. If not, retry with the same key. Never retry with a fresh key.

Wrong answer to avoid: "Treat timeout as failure and retry." A write that took 29s may have completed. Retry with new key = duplicate effect.


Apply-now exercise (10 min)

Part 1 — Recovery audit. Pick one agent you've built or studied. For each tool it calls, classify:

Tool  Idempotent?  Has dedup key?  Has compensation?  Recovery on crash
────  ───────────  ──────────────  ─────────────────  ─────────────────────────────
...   Y/N          Y/N             Y/N                resume / compensate / restart

Any tool with N/N/N in columns 2-4 is your resume-vs-restart decision point.

Part 2 — Topology diagnosis. Identify your agent's topology (ReAct, orchestrator, pipeline, plan-execute, debate, hierarchical, or hybrid). Using the diagnostic table, write: 1. The signature first crack for your topology 2. The first signal you'd see in production 3. Whether your current alarms would catch it

Part 3 — Sketch from memory. Draw the 7-step trajectory with checkpoints [c1] [c2] [c3] and crash at step 4. Label the idempotency key on resume. Then draw the six-topology table with first crack and first signal for each.


Operational memory

This chapter joined two ideas: (1) how to recover a single trajectory after a crash (checkpointing, resume, idempotency, sagas), and (2) which crash shows up first in each topology (the signature first crack). They belong together because the failure mode determines the recovery strategy: a pipeline of pure transforms doesn't need idempotency keys; an orchestrator with independent workers doesn't need sagas; a plan-execute topology needs plan freshness checks, not just state replay.

The core tension is performance vs recoverability. Checkpointing every step is safe but adds 15-40% latency. Checkpointing nothing is fast but means full restart on any failure. The right frequency is per-step: checkpoint before expensive, dangerous, non-idempotent operations; skip cheap, safe, replayable ones. The topology tells you which operations those are.

Remember:

  • Every state transition is a checkpoint candidate; checkpoint where loss would be unrecoverable.
  • Resume requires idempotency; restart is the fallback when idempotency is absent.
  • Idempotency keys derive from intent, not attempt — a new UUID per retry defeats the pattern.
  • At-least-once + idempotent tools ≈ exactly-once in practice.
  • Compensating transactions (sagas) handle steps that are neither idempotent nor replayable.
  • Every topology has a signature first crack — predictable from shape, visible before code ships.
  • Topology and recovery strategy are coupled: match the medicine to the disease.
  • Cascading poison hides behind green per-stage evals; only E2E validation catches it.
  • Timeout ≠ failure. Query status with the original key before retrying.

Bridge. Recovery saves the agent after a crash. But how do you know it crashed in the first place? The checkpoint didn't write itself — something had to detect the silence and trigger the resume. Next: wiring the alarm panel — the signals that tell you something is wrong before the user complains. → 10-observability-eval-gates.md