09. Checkpoint and recovery — making workflows survive what they can't predict
~20 min read. File 06 showed checkpointing as a LangGraph feature. This file unpacks the engineering: where to place checkpoints, what to persist, how to make recovery safe through idempotency, and how to resume into a world that may have changed. The focus is workflow-level durability — multi-step crash survival, not single-agent memory (which module 01 covers).
Built on the first-principles overview in 00-first-principles.md. Durability vs latency — the pressure that persisting every intermediate result is safe but expensive, while persisting nothing makes crash recovery impossible — is the central tension. The durable checkpoint is the mechanism: a serialised snapshot of workflow state at a recovery-safe boundary.
What file 08 established and what remains
File 08 introduced the approval gate: a designed pause where the workflow intentionally waits for human input, then resumes cleanly. But not all workflow interruptions are designed. Processes crash. Containers get evicted. Databases become unreachable. Deployments restart workers mid-execution. These unplanned interruptions demand the same resume capability as designed pauses — but without advance warning. This file builds the checkpoint-and-recovery mechanism that makes unplanned interruption survivable.
The loan workflow that charged the bureau twice after a pod restart
The loan-approval workflow runs in a Kubernetes pod. Steps completed: verify_identity (✓), pull_credit (✓, cost: $2.50 per bureau call). The pod gets evicted during compliance_check (OOMKilled — a parallel workflow consumed too much memory).
Without checkpoints: the orchestrator restarts the workflow from the beginning. verify_identity re-runs (wasted time, but safe — idempotent). pull_credit re-runs: another $2.50 bureau API call. The bureau has no idempotency — it counts this as a second hard credit inquiry on the applicant's record. The applicant now has two hard pulls instead of one. Repeated across 200 concurrent workflows during a memory pressure event: 200 duplicate bureau calls, $500 wasted, 200 applicants with spurious credit inquiries.
With checkpoints: the orchestrator loads the last checkpoint (written after pull_credit completed). State contains {credit_score: 720, identity_verified: true}. Resume skips verify_identity and pull_credit entirely. Execution continues from compliance_check. Zero duplicate calls. Zero wasted cost. Zero applicant impact.
| Without checkpoint | With checkpoint |
|---|---|
| restart → verify (repeat) | restart → load checkpoint 2 |
| → pull_credit (repeat!) | → compliance_check (continue) |
| → duplicate bureau call | → no duplicate |
| → $2.50 wasted | → $0 wasted |
| → hard pull on record | → clean resume |
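The two recovery paths above can be replayed in a few lines. This is a minimal sketch, not the orchestrator's real code: `Bureau`, `Run`, `execute`, and `simulate` are hypothetical names, and the list of completed steps stands in for a durable checkpoint.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STEPS = ["verify_identity", "pull_credit", "compliance_check"]


@dataclass
class Bureau:
    """Hypothetical credit bureau stub: every call counts as a new hard pull."""
    calls: int = 0

    def pull(self) -> int:
        self.calls += 1
        return 720


@dataclass
class Run:
    """Hypothetical persisted record of one workflow run."""
    state: dict = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)  # stands in for the checkpoint


def execute(run: Run, bureau: Bureau, crash_during: Optional[str] = None) -> bool:
    """Run the remaining steps; return False if the simulated pod is evicted."""
    for step in STEPS:
        if step in run.completed:
            continue  # the checkpoint says this step is done, so skip it
        if step == crash_during:
            return False  # evicted mid-step; nothing from this step is persisted
        if step == "verify_identity":
            run.state["identity_verified"] = True
        elif step == "pull_credit":
            run.state["credit_score"] = bureau.pull()
        elif step == "compliance_check":
            run.state["compliance_flag"] = "review"
        run.completed.append(step)  # checkpoint after confirmed progress
    return True


def simulate(use_checkpoints: bool) -> Bureau:
    """Crash during compliance_check, restart, and report bureau usage."""
    bureau = Bureau()
    run = Run()
    execute(run, bureau, crash_during="compliance_check")
    if not use_checkpoints:
        run = Run()  # nothing was persisted, so the restart begins from zero
    execute(run, bureau)
    return bureau
```

Calling `simulate(use_checkpoints=False).calls` gives two bureau calls for one applicant, while `simulate(use_checkpoints=True).calls` gives one. At $2.50 per call, that difference is the whole cost of the incident.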
Teacher voice. Checkpoints aren't about elegance. They're about preventing the business-logic damage that happens when non-idempotent steps re-execute. The value is measured in avoided duplicate side effects, not in architectural beauty.
The invariant: a checkpoint is a resume-safe boundary, not a save point
Not every point in a workflow is safe to resume from. A checkpoint must satisfy: "If I restart from here, all prior work remains valid and no unsafe action will re-execute." This means:
- All side effects before this point are either confirmed complete or protected by idempotency keys
- The state captured at this point is sufficient to continue without prior context
- No in-flight operation is left in an ambiguous state (started but not confirmed)
A "save point" says "I was here." A checkpoint says "I can safely restart from here." The distinction determines whether recovery is correct or merely fast.
Checkpoint placement strategy: the side-effect boundary rule
The core rule: checkpoint after confirmed safe progress, before the next non-idempotent side effect.
Checkpoint placement decision tree:
Is the next step idempotent?
├── YES → checkpoint is optional (retry is safe anyway)
│ but still useful for avoiding recomputation cost
└── NO → checkpoint is REQUIRED before that step
to prevent duplicate side effects on retry
Has the current step committed a side effect?
├── YES → checkpoint immediately after
│ (confirms the effect is done, don't repeat it)
└── NO → checkpoint after validation/computation
(avoids re-doing expensive work)
Applied to the loan-approval workflow:
| Step | Idempotent? | Checkpoint placement |
|---|---|---|
| verify_identity | yes (read) | checkpoint after (avoid re-doing 4s of work) |
| pull_credit | NO (hard pull) | checkpoint AFTER (confirm it's done, never repeat) |
| compliance_check | yes (compute) | checkpoint after (preserve flag result) |
| human_review | n/a (pause) | checkpoint BEFORE (file 08's rule) |
| issue_decision | NO (DB write) | checkpoint AFTER with idempotency key |
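The decision tree can be written as a small function over step boundaries. The sketch below is one possible encoding, assuming each step carries an `idempotent` flag and an `is_pause` flag; `Step`, `boundary_required`, and `plan_checkpoints` are hypothetical names, and a non-idempotent step is treated as one that has committed a side effect once it finishes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    idempotent: bool
    is_pause: bool = False  # designed pause such as human_review


def boundary_required(current: Step, following: Optional[Step]) -> bool:
    """Is a checkpoint required at the boundary right after `current`?"""
    if not current.idempotent:
        return True  # a side effect was committed: record it, never repeat it
    if following is not None and (following.is_pause or not following.idempotent):
        return True  # protect the boundary before a pause or a non-idempotent step
    return False  # both sides are safe to retry; a checkpoint only saves recomputation


def plan_checkpoints(steps: List[Step]) -> Dict[str, str]:
    """Map each step name to 'required' or 'optional' for the checkpoint after it."""
    plan = {}
    for i, step in enumerate(steps):
        following = steps[i + 1] if i + 1 < len(steps) else None
        plan[step.name] = "required" if boundary_required(step, following) else "optional"
    return plan


LOAN_STEPS = [
    Step("verify_identity", idempotent=True),
    Step("pull_credit", idempotent=False),
    Step("compliance_check", idempotent=True),
    Step("human_review", idempotent=True, is_pause=True),
    Step("issue_decision", idempotent=False),
]
```

For the loan workflow every boundary comes out as required, because each one sits next to a non-idempotent step or a pause. A chain of two pure computations would come out as optional on both boundaries.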
What a checkpoint must contain
The checkpoint state must be sufficient for any step after this point to execute correctly without access to anything before this point. The minimum viable checkpoint:
checkpoint_record:
├── workflow_id: "loan-7291"
├── checkpoint_id: "ckpt-3" (monotonically increasing)
├── timestamp: "2024-03-15T14:22:03Z"
├── position: "after compliance_check, before human_review"
├── state_snapshot:
│ ├── applicant_id: "A-7291"
│ ├── identity_verified: true
│ ├── credit_score: 720
│ ├── compliance_flag: "review"
│ └── plan_version: 1
├── execution_metadata:
│ ├── completed_steps: ["verify_identity", "pull_credit", "compliance_check"]
│ ├── branch_decisions: {"compliance_routing": "human_review"}
│ ├── retries_used: {"pull_credit": 1}
│ └── budget_spent: $0.08
├── side_effect_confirmations:
│ ├── bureau_call_id: "bureau-tx-9918" (confirms pull_credit committed)
│ └── (no other side effects yet)
└── schema_version: "v2.1"
The side-effect confirmations are critical. They tell resume logic: "this external action already happened — do not repeat it." Without them, resume logic can't distinguish "step completed" from "step started but didn't finish."
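A checkpoint record like the one above can be modelled as a typed structure with a checksum, so that resume logic can detect a corrupted snapshot before trusting it. This is a sketch under assumptions: the field names mirror the record shown, but `CheckpointRecord`, `serialise`, `deserialise`, and the JSON-plus-SHA-256 envelope are illustrative choices, not the format of any particular checkpointer.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CheckpointRecord:
    workflow_id: str
    checkpoint_id: str
    timestamp: str
    position: str
    state_snapshot: dict
    execution_metadata: dict = field(default_factory=dict)
    side_effect_confirmations: dict = field(default_factory=dict)
    schema_version: str = "v2.1"


def serialise(record: CheckpointRecord) -> str:
    """Encode the record as JSON and wrap it with a SHA-256 checksum."""
    body = json.dumps(asdict(record), sort_keys=True)
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"body": body, "checksum": checksum})


def deserialise(raw: str) -> CheckpointRecord:
    """Verify the checksum, then rebuild the record; raise on corruption."""
    envelope = json.loads(raw)
    expected = hashlib.sha256(envelope["body"].encode("utf-8")).hexdigest()
    if expected != envelope["checksum"]:
        raise ValueError("checkpoint checksum mismatch")
    return CheckpointRecord(**json.loads(envelope["body"]))
```

Keeping `side_effect_confirmations` as its own field, rather than folding it into `state_snapshot`, makes it easy for resume logic to iterate over exactly the external effects it must verify.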
Idempotency: the checkpoint's essential partner
A checkpoint prevents re-execution of steps before the checkpoint boundary. Idempotency keys prevent damage when steps after the boundary re-execute despite protection. They're complementary:
Defence layers:
Layer 1: CHECKPOINT (prevents re-execution entirely)
"Don't run this step again — it already completed"
Layer 2: IDEMPOTENCY KEY (makes re-execution safe if it happens anyway)
"If you do run this step again, it won't create a duplicate"
Layer 3: WRITE-ONCE GUARD (detects re-execution at the destination)
"The receiving system rejects duplicate operations"
For the loan workflow's issue_decision step (writes to loan database):
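def issue_decision_node(state: LoanState) -> dict:
    """Write the loan decision, guarded by an idempotency key."""
    # LoanState and loan_db are defined elsewhere in the workflow module.
    # The key uses only stable identifiers, so a re-run produces the same key.
    idempotency_key = f"loan-decision-{state['applicant_id']}-{state['workflow_id']}"
    # Database uses idempotency_key to reject duplicate writes
    result = loan_db.write_decision(
        applicant_id=state["applicant_id"],
        decision=state["decision"],
        idempotency_key=idempotency_key,
    )
    return {"decision_committed": True, "decision_record_id": result.id}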
Even if a bug in checkpoint logic causes this node to re-execute, the database rejects the duplicate write. The idempotency key is the safety net beneath the checkpoint.
| Protection mechanism | Prevents | Catches |
|---|---|---|
| Checkpoint | Re-execution of completed steps | Normal crash recovery |
| Idempotency key | Duplicate writes if step re-runs | Bug in checkpoint logic, race conditions |
| Write-once guard | Duplicate at destination system | Failures of both layers above; the external system enforces correctness |
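To make the idempotency-key layer concrete, here is a minimal in-memory sketch of a store that honours the key. `InMemoryLoanDb` and `DecisionRecord` are hypothetical stand-ins for the loan database; a real system would typically enforce this with a unique constraint on the key. This sketch returns the original record on a duplicate rather than raising, which is one common way to "reject" the second write.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class DecisionRecord:
    id: str
    applicant_id: str
    decision: str


@dataclass
class InMemoryLoanDb:
    """Hypothetical stand-in for the loan database."""
    _by_key: Dict[str, DecisionRecord] = field(default_factory=dict)

    def write_decision(
        self, applicant_id: str, decision: str, idempotency_key: str
    ) -> DecisionRecord:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # Duplicate request: write nothing, hand back the original record.
            return existing
        record = DecisionRecord(
            id=f"dec-{len(self._by_key) + 1}",
            applicant_id=applicant_id,
            decision=decision,
        )
        self._by_key[idempotency_key] = record
        return record

    def count(self) -> int:
        return len(self._by_key)
```

Calling `write_decision` twice with the same key leaves one stored record and returns the same `id` both times, so a re-executed node still reports the original `decision_record_id`.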
Resume logic: deterministic, not improvised
When a crashed workflow restarts, resume logic must be deterministic. The algorithm:
Resume procedure:
1. Load latest checkpoint for workflow_id
2. Verify checkpoint integrity (schema version, checksum)
3. Read position → determine next step
4. For each side_effect_confirmation:
- verify external system confirms the effect exists
- if confirmation fails → flag for investigation (don't silently skip)
5. Run freshness checks on time-sensitive state:
- credit_score age < 24h? → still valid
- policy_version unchanged? → still valid
- if stale → mark specific fields for re-fetch (don't restart from zero)
6. Resume from next step with loaded state
The critical principle: resume logic lives in deterministic control code, not in LLM reasoning. A model that "figures out where to restart" can hallucinate progress, skip necessary steps, or re-execute dangerous ones. Resume is a mechanical operation.
Correct:
load checkpoint → read position → advance to next node → execute
Dangerous:
ask LLM "where were we?" → model says "I think we finished step 3"
→ maybe we didn't → side effect repeated
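The resume procedure can be expressed as plain control code with no model in the loop. The sketch below assumes the checkpoint has already been loaded and checksum-verified (steps 1 and part of 2), and covers the remaining steps. `plan_resume`, `ResumeError`, the `effect_exists` callback, and the `credit_pulled_at` field are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

STEP_ORDER = [
    "verify_identity", "pull_credit", "compliance_check",
    "human_review", "issue_decision",
]
MAX_CREDIT_AGE = timedelta(hours=24)


class ResumeError(Exception):
    """Raised when a checkpoint cannot be resumed without investigation."""


def next_step(completed: List[str]) -> Optional[str]:
    """Return the first step in the fixed order that has not completed."""
    for step in STEP_ORDER:
        if step not in completed:
            return step
    return None


def plan_resume(
    checkpoint: dict,
    effect_exists: Callable[[str, str], bool],
    now: datetime,
    supported_schema: str = "v2.1",
) -> dict:
    # Step 2: refuse checkpoints written under an unknown schema.
    if checkpoint["schema_version"] != supported_schema:
        raise ResumeError("unsupported schema version")
    # Step 4: every recorded side effect must be confirmed by its external system.
    unconfirmed = [
        name
        for name, ref in checkpoint["side_effect_confirmations"].items()
        if not effect_exists(name, ref)
    ]
    if unconfirmed:
        raise ResumeError(f"needs investigation: {unconfirmed}")
    # Step 5: mark stale fields for re-fetch instead of restarting from zero.
    stale = []
    pulled_at = checkpoint["state_snapshot"].get("credit_pulled_at")
    if pulled_at is not None:
        if now - datetime.fromisoformat(pulled_at) > MAX_CREDIT_AGE:
            stale.append("credit_score")
    # Steps 3 and 6: the next step comes from recorded progress, nothing else.
    return {
        "next_step": next_step(checkpoint["execution_metadata"]["completed_steps"]),
        "refetch": stale,
        "state": checkpoint["state_snapshot"],
    }
```

Because the function is pure apart from the `effect_exists` callback, it can be unit-tested against recorded checkpoints, which is the property that model-driven resume lacks.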
Threaded example: cascading crash and recovery in loan workflow
Scenario: The loan-approval orchestrator runs 500 concurrent workflows. A database failover causes 3 seconds of unavailability. During those 3 seconds, 47 workflows are mid-execution.
Workflow states at crash time:
├── 12 workflows: between verify and pull_credit (checkpoint 1 valid)
├── 8 workflows: between pull_credit and compliance (checkpoint 2 valid)
├── 15 workflows: in compliance_check computation (checkpoint 2 valid)
├── 7 workflows: waiting for human review (checkpoint 3 valid, paused)
└── 5 workflows: in issue_decision (checkpoint 3 valid, need idempotency)
Recovery:
- The 12 in early stages: resume from checkpoint 1. verify_identity is skipped (its result is already in the checkpoint) and pull_credit runs for the first time. No duplicate cost.
- The 8 after pull_credit: resume from checkpoint 2. Skip bureau call entirely. Proceed to compliance. Zero duplicate pulls.
- The 15 in compliance: resume from checkpoint 2. Re-run compliance_check (idempotent computation). Minor recomputation cost.
- The 7 waiting for humans: already paused. Checkpoint 3 is valid. When humans respond, resume normally.
- The 5 in issue_decision: resume from checkpoint 3. Idempotency keys prevent duplicate DB writes. Safe.
Total recovery time: ~5 seconds after database returns. Zero duplicate side effects. Zero applicant impact. Zero manual intervention.
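Without checkpoints: 47 workflows restart from scratch. Every workflow that had already completed pull_credit (8 + 15 + 7 + 5 = 35) calls the bureau again: 35 duplicate bureau calls ($87.50 wasted, 35 applicants with extra hard pulls). 5 potential duplicate loan decisions. Manual reconciliation needed.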
Checkpoint storage and the state size problem
Every checkpoint serialises the full workflow state. State size directly impacts:
| Factor | Impact of bloated state |
|---|---|
| Write latency | Each checkpoint write takes longer (illustrative figures for PostgresSaver: roughly 5ms at 10KB → 50ms at 100KB; measure on your own backend) |
| Storage cost | Checkpoints × concurrent workflows × retention period |
| Recovery speed | Loading large checkpoints slows resume |
| Schema migration | Larger state → more fields to evolve → more migration complexity |
The state compression strategies from file 05 directly reduce checkpoint cost. A workflow that stores raw 20KB documents in state writes roughly 20KB × nodes_count of checkpoint data per run. The same workflow storing 200-token structured summaries (about 1KB each, assuming roughly 4 bytes per token) writes on the order of 20× less checkpoint data.
State size impact at scale:
500 concurrent workflows × 5 checkpoints each × 10KB state = 25MB active checkpoints
500 concurrent workflows × 5 checkpoints each × 100KB state = 250MB active checkpoints
With 7-day retention:
25MB × 7 days × ~100 workflow-runs/day = 17.5GB retained
250MB × 7 days × ~100 workflow-runs/day = 175GB retained
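The arithmetic above can be reproduced with two small helpers, which makes it easy to plug in your own numbers. The function names are hypothetical, decimal units are used (1MB = 1000KB), and the "~100 per day" factor is interpreted here as the number of times the concurrent pool turns over each day, which is an assumption about what the figure means.

```python
def active_checkpoint_mb(
    concurrent_workflows: int, checkpoints_per_workflow: int, state_kb: float
) -> float:
    """Checkpoint data held for the workflows currently in flight, in MB."""
    return concurrent_workflows * checkpoints_per_workflow * state_kb / 1000


def retained_gb(active_mb: float, retention_days: int, turnovers_per_day: int) -> float:
    """Checkpoint data retained over the whole retention window, in GB."""
    return active_mb * retention_days * turnovers_per_day / 1000


small = active_checkpoint_mb(500, 5, 10)    # 10KB state
large = active_checkpoint_mb(500, 5, 100)   # 100KB state
print(small, retained_gb(small, 7, 100))
print(large, retained_gb(large, 7, 100))
```

State size enters the formula linearly, so a 10× reduction in state size is a 10× reduction in retained storage with no other change to the workflow.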
Teacher voice. Checkpoint design is inseparable from state design. If file 05's compression discipline breaks down, checkpoint storage becomes the first symptom. Teams that discover "our Postgres is running out of space" often trace it back to bloated workflow state persisted at every step.
Schema evolution: old checkpoints in a new world
Workflows evolve. Fields get added, renamed, or removed. What happens to checkpoints written under schema v1 when the code expects schema v2?
Schema evolution scenarios:
1. New field added (credit_limit not in old checkpoints)
→ Resume fills with default value or triggers re-fetch
2. Field renamed (score → credit_score)
→ Migration function converts on load
3. Field removed (legacy_flag no longer used)
→ Resume ignores the extra field
4. Field type changed (credit_score: string → int)
→ Migration function converts on load, may fail → manual intervention
Strategies:
- Schema version in checkpoint — every checkpoint records its schema version. Resume logic checks version and applies migrations before continuing.
- Additive-only schema changes — new fields always have defaults. Old checkpoints are compatible by construction.
- Checkpoint TTL — old checkpoints expire after a retention window. No need to migrate ancient checkpoints if workflows finish or timeout within the window.
- Forward-compatible state — use a flexible schema (map/dict with known keys) rather than a rigid struct, so extra fields pass through without breaking.
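Versioned checkpoints with migrate-on-load can be sketched as a chain of small functions, one per schema step. Everything here is illustrative: the version labels `v1` and `v2`, the `MIGRATIONS` registry, and `load_state` are hypothetical names, and the field changes are the four scenarios listed earlier.

```python
from typing import Callable, Dict, Tuple

CURRENT_VERSION = "v2"


def v1_to_v2(state: dict) -> dict:
    """Apply the v1 -> v2 changes without mutating the stored snapshot."""
    migrated = dict(state)
    if "score" in migrated:
        # Rename plus type change; int() raises ValueError on bad data,
        # which should route the workflow to manual intervention.
        migrated["credit_score"] = int(migrated.pop("score"))
    migrated.setdefault("credit_limit", None)  # new field gets a default
    migrated.pop("legacy_flag", None)          # removed field is dropped
    return migrated


# Each entry maps a version to (next version, function that gets there).
MIGRATIONS: Dict[str, Tuple[str, Callable[[dict], dict]]] = {
    "v1": ("v2", v1_to_v2),
}


def load_state(checkpoint: dict) -> dict:
    """Return the state snapshot upgraded to CURRENT_VERSION."""
    version = checkpoint["schema_version"]
    state = checkpoint["state_snapshot"]
    while version != CURRENT_VERSION:
        if version not in MIGRATIONS:
            raise ValueError(f"no migration path from schema {version}")
        version, migrate = MIGRATIONS[version]
        state = migrate(state)
    return state
```

Chaining single-step migrations means a checkpoint several versions behind is upgraded by the same code paths that were tested when each version shipped, rather than by a one-off converter.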
The rule of thumb: if your average workflow runs for under 1 hour, a 7-day checkpoint retention window means you rarely have checkpoints older than 1 schema change. If workflows run for days (human-in-the-loop), schema evolution becomes a first-class engineering concern.
Operational signals — healthy recovery, degrading recovery, broken recovery
Healthy behaviour:
- Checkpoint write success rate > 99.99%
- Resume success rate > 99% (crashed workflows recover cleanly)
- Zero duplicate side effects detected per day
- Checkpoint write latency < 20ms (p99)
- State size stable or slowly decreasing as compression improves
First degrading signal:
- Resume failures increasing → schema evolution without migration, or stale external state
- Checkpoint write latency climbing → state bloat or backend pressure
- Duplicate side-effect alerts → idempotency gaps or checkpoint placement errors
- Checkpoint storage growing faster than workflow volume → state compression regression
Misleading metric:
- "Checkpoint count per workflow" — more checkpoints isn't inherently better or worse. The question is whether every checkpoint is at a recovery-safe boundary.
- "Recovery time" alone — fast recovery that resumes into stale state is worse than slightly slower recovery that validates freshness.
Expert signal:
- Side-effect confirmation hit rate: when resume logic verifies external confirmations, how often do they validate? Low hit rate → confirmation storage failing.
- Schema migration success rate: what percentage of loaded checkpoints need migration, and how many fail?
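The healthy-behaviour thresholds can be turned into a simple check that an alerting job might run. This is a sketch only: the metric names and the `degrading_signals` function are hypothetical, and the thresholds are the ones listed under healthy behaviour.

```python
from typing import Dict, List


def degrading_signals(metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics that fall outside the healthy range."""
    signals = []
    if metrics["checkpoint_write_success_rate"] <= 0.9999:
        signals.append("checkpoint_write_success_rate")
    if metrics["resume_success_rate"] <= 0.99:
        signals.append("resume_success_rate")
    if metrics["duplicate_side_effects_per_day"] > 0:
        signals.append("duplicate_side_effects_per_day")
    if metrics["checkpoint_write_p99_ms"] >= 20:
        signals.append("checkpoint_write_p99_ms")
    return signals
```

An empty list means every threshold is met. It does not cover the trend-based signals (latency climbing, storage outpacing volume), which need a time series rather than a single sample.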
Boundary of applicability
Works unusually well:
- Workflows with expensive non-idempotent side effects (API calls with cost, database writes with business logic)
- Long-running workflows that span human decision times (hours to days)
- Environments with frequent infrastructure interruptions (spot instances, container eviction, deployment restarts)
Becomes pathological:
- Ultra-low-latency pipelines where checkpoint write adds unacceptable delay (consider batched/async checkpointing)
- Workflows that are fully idempotent end-to-end (restart from scratch is safe and cheap — checkpoints add complexity without recovery value)
- Rapidly changing state where checkpoints become stale within seconds (streaming scenarios)
Scale that invalidates naive intuition:
- At 10,000+ concurrent workflows, checkpoint storage becomes a significant infrastructure cost — tiered storage (recent in fast store, old in cold store) becomes necessary
- At sub-second workflow completion times, per-node checkpointing adds more latency than it saves in recovery value — consider workflow-level (not node-level) checkpoints
Failure-prone assumption: "checkpoints make recovery automatic"
The seductive wrong idea: "Once we have checkpoints, recovery is solved — the system just resumes from the last checkpoint and everything works."
The correction: Checkpoints make recovery possible, not correct. Correct recovery also requires: (1) side-effect confirmations to prevent duplicates, (2) idempotency keys for operations after the checkpoint, (3) freshness checks for time-sensitive state, (4) schema migration for evolved workflows, and (5) deterministic resume logic that doesn't improvise. A checkpoint without these is a snapshot that may resume into corruption.
Real-world implementations
- Temporal.io — durable execution framework that records each workflow step in an event history; replay reconstructs workflow state from that history; side effects run as "activities" with built-in retries and recorded results (activities can execute more than once under retry, so they should still be written to be idempotent)
- LangGraph with PostgresSaver — per-node checkpointing with thread-based isolation; resume loads last state and re-enters graph at the interrupted node
- Azure Durable Functions — orchestrator functions checkpoint after each await; replay skips completed actions; external events (human input) are durable
- AWS Step Functions — state machine execution history provides checkpoint semantics; each state transition is recorded and resumable
- Restate — invocation journal records every side effect with exactly-once semantics; crashes replay from journal without executing effects twice
- Netflix Conductor — workflow execution state persisted after each task; failed tasks resume from last saved state with configurable retry policies
- Uber Cadence — decision task history provides replay; activities execute at-most-once with result recording; timeouts and retries are first-class
- Prefect — data pipeline orchestration with task-level state persistence; failed flows resume from last successful task without re-running upstream
Recall checkpoint
- What makes a checkpoint a "resume-safe boundary" rather than just a save point?
- Why should checkpoints be placed after non-idempotent side effects (not before)?
- What's the relationship between checkpoints and idempotency keys (complementary, not redundant)?
- Why must resume logic be deterministic rather than model-driven?
- How does state size affect checkpoint cost at scale?
- What schema evolution strategies keep old checkpoints usable?
- When are checkpoints more cost than value?
Interview Q&A
Q: Why is "checkpoint after the side effect" the correct default rather than "checkpoint before"?
A: After-checkpoint confirms the effect is done — resume skips it. Before-checkpoint means the effect may or may not have happened — resume must detect and handle the ambiguity. After is cleaner because it records confirmed progress.
Common wrong answer to avoid: "Because after is faster." Speed isn't the reason — it's about what the checkpoint semantically asserts.
Q: Why are idempotency keys necessary if checkpoints prevent re-execution?
A: Checkpoints are the first defence, but they can fail (corrupted checkpoint, bug in resume logic, race condition). Idempotency keys are the safety net — even if re-execution happens, the side effect isn't duplicated. Defence in depth.
Common wrong answer to avoid: "Because checkpoints are unreliable." Checkpoints are generally reliable, but idempotency provides the second layer that production systems need.
Q: Why should resume logic live in deterministic code rather than LLM reasoning?
A: Resume is a safety-critical operation — getting the restart point wrong can duplicate side effects or skip necessary steps. Deterministic code is testable, auditable, and produces the same result every time. LLM reasoning introduces non-determinism exactly where it's most dangerous.
Common wrong answer to avoid: "Because LLMs are slow for this." Speed isn't the issue — correctness and predictability are.
Q: Why does state size matter for checkpoint-based recovery?
A: Every checkpoint serialises full state. Bloated state multiplied by nodes multiplied by concurrent workflows creates storage pressure and write latency. The compounding effect can make checkpoint writes the performance bottleneck before any model call becomes slow.
Common wrong answer to avoid: "Because storage costs money." Cost is one factor, but write latency impacting workflow throughput is often the binding constraint.
Q: How do you handle schema evolution for long-lived checkpoints?
A: Version every checkpoint. Apply migration functions on load (convert old schema to new). Use additive-only changes when possible (new fields get defaults). Set TTL so checkpoints older than the expected workflow duration are cleaned up rather than migrated.
Common wrong answer to avoid: "Just delete old checkpoints." Deleting abandons in-flight workflows. Migration is the correct approach for active workflows.
Q: When is restarting from scratch better than resuming from a checkpoint?
A: When the entire workflow is cheap, fast, and fully idempotent. If re-running from the beginning takes 2 seconds and has no side-effect risk, the complexity of checkpoint-resume (storage, migration, freshness checks) exceeds its value.
Common wrong answer to avoid: "When checkpoints are corrupted." Corruption is one case, but the broader design question is whether the recovery value justifies the checkpoint overhead.
Design/debug exercise (10 min)
Modeled: The loan workflow crashes after pull_credit (checkpoint 2 exists). Resume loads checkpoint 2: {identity_verified: true, credit_score: 720}. Checks side-effect confirmation: bureau_call_id "bureau-tx-9918" exists in bureau system → confirmed done. Freshness: credit score pulled 3 minutes ago → valid. Advances to compliance_check.
Your turn: The same workflow crashes during issue_decision — the database write may or may not have committed. Write: (1) what the checkpoint before this step contains, (2) how resume logic determines whether the write happened, (3) what the idempotency key looks like, (4) what happens if the check is ambiguous (can't confirm or deny).
From memory: Close this file and sketch: the three-layer defence (checkpoint, idempotency key, write-once guard), the checkpoint placement decision tree, and the resume procedure (6 steps from load to execute).
Operational memory
Checkpoints make workflow interruption survivable by recording resume-safe state at meaningful boundaries. "Resume-safe" means: all prior side effects are confirmed complete, the state is sufficient to continue without prior context, and no in-flight operation is left ambiguous. The core placement rule — checkpoint after confirmed progress, before the next non-idempotent action — protects against the most common recovery failure: duplicate side effects.
Checkpoints are necessary but not sufficient. They work in partnership with idempotency keys (safety net if re-execution happens despite the checkpoint), freshness checks (verify time-sensitive state hasn't decayed), and deterministic resume logic (never let a model improvise the restart point). State size is a first-order production concern: every byte in your TypedDict is serialised at every checkpoint, multiplied by nodes and concurrent workflows. File 05's compression discipline pays dividends directly in checkpoint storage cost.
Remember:
- Checkpoint = "safe to restart here." Save point = "I was here." The difference is side-effect safety.
- Place checkpoints after non-idempotent steps (confirm they're done) not before (ambiguous whether they ran)
- Idempotency keys are the safety net beneath checkpoints — defence in depth against duplicate effects
- Resume logic is deterministic code: load → verify → advance. Never "ask the model where we were."
- State size × nodes × workflows = checkpoint storage. Compression from file 05 shrinks the first multiplier, which is the one you control most directly.
- Schema evolution needs versioned checkpoints + migration functions — additive changes are safest
- Side-effect confirmations in the checkpoint let resume logic skip without ambiguity
Bridge. Checkpoints handle the case where execution stops unexpectedly. But sometimes the deeper problem isn't that execution stopped — it's that the plan itself is wrong and needs to change while work is in flight. That's dynamic replanning: revising the route without losing completed progress. → 10-dynamic-replanning.md