
00. Durable Agent Workflows — First-Principles Overview

Module 01 taught you to build one agent that acts safely. This module teaches you to build the control plane that coordinates many agents across time, failure, and human boundaries — and survives the crash that kills the pod mid-step.


A fintech ships a loan-approval workflow. Five agents in sequence: document parser, eligibility checker, risk scorer, compliance reviewer, notification sender. Demo day: flawless. Week two in production: the risk-scoring service returns a 503 at step 3. The orchestrator has no checkpoint. It restarts from step 1. The document parser re-extracts 40 pages of KYC. The eligibility checker re-calls the bureau API, which charges per hit. By the time step 3 succeeds on retry, the company has paid for two bureau calls instead of one, the user has waited 90 seconds instead of 12, and the ops team discovers that "restart from the top" is their only recovery strategy for every failure in the pipeline.

Two floors up, a second team runs a similar workflow on a durable execution engine. Same five agents. Same 503 at step 3. The engine loads the checkpoint written after step 2, retries only the risk-scoring call with its original idempotency key, and continues. Wall-clock cost of the failure: 4 seconds. Repeated token cost: zero, because steps 1 and 2 never re-ran. The difference is not the agents. The agents are identical. The difference is the layer above them: a workflow engine that treats agent coordination as a durable, recoverable, observable state machine rather than a script that runs top-to-bottom and prays.
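The resume behaviour described above can be sketched in a few lines. This is a minimal illustration, not any particular engine's API: `Checkpoint`, `run_workflow`, `ServiceUnavailable`, and the three step functions are hypothetical names, and an in-memory object stands in for durable storage.

```python
import uuid
from dataclasses import dataclass, field


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 from a downstream service."""


@dataclass
class Checkpoint:
    """In-memory stand-in for a durable store (hypothetical)."""
    results: dict = field(default_factory=dict)  # step name -> output
    keys: dict = field(default_factory=dict)     # step name -> idempotency key


def run_workflow(steps, checkpoint):
    """Run steps in order, skipping any step already recorded in the checkpoint."""
    for name, fn in steps:
        if name in checkpoint.results:
            continue  # finished before the failure: never re-run
        # Mint the key once and persist it so every retry reuses the same key.
        key = checkpoint.keys.setdefault(name, str(uuid.uuid4()))
        checkpoint.results[name] = fn(checkpoint.results, key)
    return checkpoint.results


calls = {"parse": 0, "bureau": 0, "risk": 0}
seen_keys = []

def parse(state, key):
    calls["parse"] += 1
    return "parsed-docs"

def check_eligibility(state, key):
    calls["bureau"] += 1  # the per-hit bureau call
    return "eligible"

def score_risk(state, key):
    calls["risk"] += 1
    seen_keys.append(key)
    if calls["risk"] == 1:
        raise ServiceUnavailable("503 from risk service")
    return "low-risk"

steps = [("parse", parse), ("eligibility", check_eligibility), ("risk", score_risk)]
store = Checkpoint()

try:
    run_workflow(steps, store)          # first run dies at step 3
except ServiceUnavailable:
    pass
result = run_workflow(steps, store)     # resume from the checkpoint after step 2
```

After the second call, `calls` shows one parse, one bureau hit, and two risk-scoring attempts, and both risk attempts carried the same idempotency key. A real engine would write the checkpoint to durable storage and survive a process restart; the sketch only shows the skip-completed-steps logic.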

This module builds that layer. Module 01 gave you the agent. This module gives you the control plane that dispatches agents, tracks their progress, survives crashes, pauses for humans, adapts plans when the world moves, and isolates tenants sharing the same infrastructure.

┌──────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                         │
│                                                              │
│  ┌────────────┐  ┌──────────────┐  ┌───────────────────┐     │
│  │  workflow  │  │  dispatch    │  │  checkpoint       │     │
│  │  graph     │  │  loop        │  │  store            │     │
│  └─────┬──────┘  └──────┬───────┘  └─────────┬─────────┘     │
│        │                │                    │               │
│        └────────────────┼────────────────────┘               │
│                         │                                    │
│    ┌────────────────────┼────────────────────┐               │
│    │                    │                    │               │
│    ▼                    ▼                    ▼               │
│ ┌──────┐          ┌──────────┐         ┌──────────┐          │
│ │agent │          │  agent   │         │  human   │          │
│ │  A   │          │    B     │         │ approval │          │
│ └──────┘          └──────────┘         └──────────┘          │
└──────────────────────────────────────────────────────────────┘

The control plane is not an agent. It does not reason about user intent. It reasons about execution: which step is next, what state exists, whether the previous step succeeded, whether a human must approve, and what to do when the infrastructure fails mid-transition. The agents below it are interchangeable workers. The control plane is the durable skeleton that makes their work composable, recoverable, and observable.


The recurring pressures

Every durable workflow system is pulled apart by the same tensions. These names recur in every chapter.

| Pressure | What it asks |
| --- | --- |
| coordination cost | How much overhead does the orchestration layer add per step? |
| durability vs latency | Checkpointing saves progress but adds write latency to every transition. |
| plan freshness | A plan written at T=0 decays as the world changes during execution. |
| handoff fidelity | Each agent-to-agent boundary is a serialization boundary: context leaks or is lost. |
| human-time asymmetry | Agents act in milliseconds; humans approve in hours. The workflow must bridge both clocks. |
| tenant isolation | Shared infrastructure must never leak one customer's workflow state into another's execution. |
| testability | Multi-agent workflows are hard to reproduce, hard to mock, and hard to assert against. |

The recurring vocabulary

| Name | What it is |
| --- | --- |
| the control plane | the layer that dispatches agents, tracks state, and decides what runs next |
| the workflow graph | the declared structure of steps and edges: sequential, parallel, DAG, or conditional |
| the durable checkpoint | a snapshot of workflow state written to storage before dangerous or expensive transitions |
| the handoff contract | the typed interface between one step's output and the next step's input |
| the dispatch loop | the engine cycle: read state → pick next step → execute → write result → repeat |
| the replan trigger | the condition under which the original plan is abandoned and a new plan is generated |
| the approval gate | a pause point where execution suspends until a human or policy engine authorises continuation |
| the tenant boundary | the isolation surface that prevents one user's workflow from reading another's state |
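Several of these terms fit together in one small loop. The sketch below is an assumption-laden illustration, not a real engine: `Step`, `WorkflowState`, `dispatch`, and `approve` are hypothetical names, the graph is purely sequential, a plain dict stands in for the checkpoint store, and the handoff contract is reduced to "each step receives the outputs of the steps before it".

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], Any]     # receives prior outputs, returns this step's output
    needs_approval: bool = False   # True puts an approval gate in front of the step


@dataclass
class WorkflowState:
    tenant_id: str
    workflow_id: str
    outputs: dict = field(default_factory=dict)  # step name -> output
    approved: set = field(default_factory=set)   # gated steps already approved
    status: str = "running"                      # running | waiting_approval | done


def next_step(graph: list, state: WorkflowState) -> Optional[Step]:
    """Pick the first step with no recorded output (sequential graph only)."""
    return next((s for s in graph if s.name not in state.outputs), None)


def dispatch(graph: list, state: WorkflowState, store: dict) -> WorkflowState:
    """Dispatch loop: read state, pick next step, execute, write result, repeat."""
    while True:
        step = next_step(graph, state)
        if step is None:
            state.status = "done"
        elif step.needs_approval and step.name not in state.approved:
            state.status = "waiting_approval"   # suspend until approve() is called
        else:
            state.status = "running"
            state.outputs[step.name] = step.run(state.outputs)
        # Checkpoint under a tenant-scoped key so tenants never share a record.
        store[(state.tenant_id, state.workflow_id)] = copy.deepcopy(state)
        if state.status != "running":
            return state


def approve(step_name: str, state: WorkflowState) -> None:
    """Record a human (or policy engine) approval for a gated step."""
    state.approved.add(step_name)


graph = [
    Step("risk", lambda out: {"score": 0.2}),
    Step("compliance", lambda out: out["risk"]["score"] < 0.5, needs_approval=True),
    Step("notify", lambda out: "sent" if out["compliance"] else "rejected"),
]
store = {}
state = dispatch(graph, WorkflowState("tenant-a", "wf-1"), store)
paused_at = state.status                 # the loop suspended at the gate
approve("compliance", state)
state = dispatch(graph, state, store)    # resume after the approval arrives
```

The loop returns as soon as it reaches an unapproved gate, so nothing holds a thread or a process while a human decides. Keying the store by `(tenant_id, workflow_id)` is only the simplest possible tenant boundary; real isolation also covers quotas and execution, which the sketch ignores.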

Memory map

| # | File | Pressure answered | What it adds |
| --- | --- | --- | --- |
| 01 | why-orchestration | coordination cost vs agent autonomy | when a control plane becomes necessary |
| 02 | task-decomposition | plan quality vs execution flexibility | turning user intent into a workflow graph |
| 03 | agent-selection-routing | generality vs specialisation | choosing the right agent per step |
| 04 | workflow-patterns | latency vs dependency safety | sequential, parallel, DAG, conditional shapes |
| 05 | state-context-management | context richness vs token cost | shared state across steps without drowning |
| 06 | langgraph-deep-dive | framework leverage vs lock-in | LangGraph nodes, edges, state, checkpointing |
| 07 | plan-execution-manager | plan rigidity vs adaptiveness | tracking execution and triggering replans |
| 08 | human-in-the-loop | speed vs safety | approval gates and pause-resume mechanics |
| 09 | checkpoint-recovery | durability vs latency | surviving crashes without restarting |
| 10 | dynamic-replanning | plan freshness vs execution cost | when and how to abandon and rebuild a plan |
| 11 | multi-tenant-orchestration | sharing vs isolation | per-tenant state, quotas, and noisy-neighbor control |
| 12 | testing-orchestration | confidence vs cost | E2E tests, mocked agents, chaos drills |
| 13 | honest-admission | humility | what orchestration still cannot solve cleanly |

Three traversal paths use this map.

  • Prerequisite path: read top to bottom; each file assumes the previous.
  • Failure path: when a production workflow breaks, find the pressure that's under-designed (checkpoint missing? handoff lossy? plan stale? tenant leaked?).
  • Synthesis path: combine two rows from different pressures (e.g., human-time asymmetry + plan freshness = what happens when a human takes 6 hours to approve and the data underlying step 4 has changed?).
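One possible shape for the synthesis example above: record a fingerprint of a step's inputs when the plan is written, then compare it against a fresh read when the approval finally arrives. This is a rough sketch under stated assumptions. `fingerprint`, `should_replan`, and the field names `credit_score` and `open_loans` are hypothetical, and later chapters may handle the replan trigger differently.

```python
import hashlib
import json


def fingerprint(data: dict) -> str:
    """Stable hash of the inputs a step depends on (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()


def should_replan(planned_fp: str, current_inputs: dict) -> bool:
    """Replan trigger: the inputs changed while the workflow was waiting."""
    return fingerprint(current_inputs) != planned_fp


# At plan time (T=0), record what step 4 depends on. Values are made up.
inputs_at_plan = {"credit_score": 712, "open_loans": 1}
planned_fp = fingerprint(inputs_at_plan)

# Hours later the approval arrives; re-read the inputs before resuming.
inputs_at_resume = {"credit_score": 655, "open_loans": 2}
stale = should_replan(planned_fp, inputs_at_resume)        # inputs drifted
unchanged = should_replan(planned_fp, dict(inputs_at_plan))  # same inputs
```

Storing only the hash in the checkpoint keeps the record small, at the cost of not knowing which field changed. A workflow that needs to explain the replan to the approver would have to keep the inputs themselves.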


Top resources

  • Temporal docs — durable execution concepts — https://docs.temporal.io/concepts
  • LangGraph documentation — https://langchain-ai.github.io/langgraph/
  • OpenAI Agents SDK — handoffs and orchestration — https://openai.github.io/openai-agents-python/
  • Anthropic — Building effective agents (orchestration patterns) — https://www.anthropic.com/engineering/building-effective-agents
  • Inngest — durable workflow engine for AI — https://www.inngest.com/docs
  • Microsoft Semantic Kernel — process orchestration — https://learn.microsoft.com/en-us/semantic-kernel/

What's coming

  1. 01-why-orchestration.md — Why a single agent breaks under production coordination pressure, and when the control plane becomes necessary.
  2. 02-task-decomposition.md — Turning messy user intent into a structured workflow graph with typed step boundaries.
  3. 03-agent-selection-routing.md — Choosing the right agent, model, or tool path for each step in the graph.
  4. 04-workflow-patterns.md — Sequential, parallel, DAG, and conditional workflow shapes — when each fits and what each costs.
  5. 05-state-context-management.md — Shared state across steps without context explosion or stale reads.
  6. 06-langgraph-deep-dive.md — LangGraph nodes, edges, reducers, state schemas, and built-in checkpointing.
  7. 07-plan-execution-manager.md — Plan creation, execution tracking, and controlled re-planning under drift.
  8. 08-human-in-the-loop.md — Approval gates, escalation policies, and workflows that pause for hours then resume cleanly.
  9. 09-checkpoint-recovery.md — Durable checkpoints, idempotent steps, and resume-vs-restart decisions at the workflow level.
  10. 10-dynamic-replanning.md — When the original plan decays and the workflow must adapt without losing completed work.
  11. 11-multi-tenant-orchestration.md — Per-tenant isolation, concurrency limits, quotas, and noisy-neighbor defence.
  12. 12-testing-orchestration.md — E2E workflow tests, mocked agent steps, deterministic replay, and chaos injection.
  13. 13-honest-admission.md — What remains brittle, surprising, or unsolved in production workflow orchestration.

Bridge. Module 01 gave the agent a loop, tools, memory, and a leash. But a single agent running a single trajectory is not a production system — it is one worker with no manager, no schedule, and no recovery plan. The moment two agents must coordinate, or one agent must survive a crash, or a human must approve before the next step fires, you need something above the agent. That something is the control plane. Next: why orchestration becomes necessary — and what specific failure makes it obvious. → 01-why-orchestration.md