00. Durable Agent Workflows: First-Principles Overview

Module 01 taught you to build one agent that acts safely. This module teaches you to build the control plane that coordinates many agents across time, failure, and human boundaries — and survives the crash that kills the pod mid-step.

A fintech ships a loan-approval workflow with five agents in sequence: document parser, eligibility checker, risk scorer, compliance reviewer, notification sender. Demo day: flawless. Week two in production: the risk-scoring service returns a 503 at step 3. The orchestrator has no checkpoint, so it restarts from step 1. The document parser re-extracts 40 pages of KYC. The eligibility checker re-calls the bureau API, which charges per hit. By the time step 3 succeeds on retry, the company has paid for two bureau calls instead of one, the user has waited 90 seconds instead of 12, and the ops team has discovered that "restart from the top" is their only recovery strategy for every failure in the pipeline.

Two floors up, a second team runs a similar workflow on a durable execution engine. Same five agents. Same 503 at step 3. The engine loads the checkpoint written after step 2, retries only the risk-scoring call with its original idempotency key, and continues. Wall-clock cost of the failure: 4 seconds. Wasted token cost: zero, because steps 1 and 2 never re-ran. The difference is not the agents. The agents are identical. The difference is the layer above them: a workflow engine that treats agent coordination as a durable, recoverable, observable state machine rather than a script that runs top-to-bottom and prays.
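The second team's recovery behaviour can be sketched in a few lines. This is a minimal illustration, not any particular engine's API: `run_workflow`, the in-memory `checkpoints` dict, and the `workflow_id:step` idempotency key format are all assumptions made for this example, and a real engine would write checkpoints to durable storage rather than process memory.

```python
from typing import Callable

# Hypothetical in-memory checkpoint store; a real engine would use durable storage.
checkpoints: dict[str, dict] = {}

Step = tuple[str, Callable[[dict, str], dict]]


def run_workflow(workflow_id: str, steps: list[Step], state: dict) -> dict:
    """Run steps in order, skipping any that a previous attempt completed."""
    saved = checkpoints.get(workflow_id, {"done": 0, "state": state})
    state, done = dict(saved["state"]), saved["done"]
    for index in range(done, len(steps)):
        name, fn = steps[index]
        # Same key on every attempt, so a downstream service can deduplicate.
        idempotency_key = f"{workflow_id}:{name}"
        state = fn(state, idempotency_key)
        # Checkpoint only after the step has succeeded.
        checkpoints[workflow_id] = {"done": index + 1, "state": dict(state)}
    return state


calls = {"parse": 0, "eligibility": 0, "risk": 0}


def parse(state: dict, key: str) -> dict:
    calls["parse"] += 1
    return {**state, "parsed": True}


def eligibility(state: dict, key: str) -> dict:
    calls["eligibility"] += 1  # stands in for the per-hit bureau call
    return {**state, "eligible": True}


def risk(state: dict, key: str) -> dict:
    calls["risk"] += 1
    if calls["risk"] == 1:
        raise RuntimeError("503 from risk-scoring service")
    return {**state, "risk_key": key}


steps = [("parse", parse), ("eligibility", eligibility), ("risk", risk)]

try:
    run_workflow("loan-42", steps, {})
except RuntimeError:
    pass  # the first attempt dies at step 3

result = run_workflow("loan-42", steps, {})  # resumes after step 2
```

After the second call, `calls` shows that the parser and the eligibility check each ran once while the risk step ran twice, which is the whole point: the failure is paid for only at the step that failed.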

This module builds that layer. Module 01 gave you the agent. This module gives you the control plane that dispatches agents, tracks their progress, survives crashes, pauses for humans, adapts plans when the world moves, and isolates tenants sharing the same infrastructure.

┌──────────────────────────────────────────────────────────────┐
│                     CONTROL PLANE                             │
│                                                              │
│  ┌────────────┐  ┌──────────────┐  ┌───────────────────┐    │
│  │  workflow   │  │  dispatch    │  │  checkpoint       │    │
│  │  graph      │  │  loop        │  │  store            │    │
│  └─────┬──────┘  └──────┬───────┘  └────────┬──────────┘    │
│        │                │                    │               │
│        └────────────────┼────────────────────┘               │
│                         │                                    │
│    ┌────────────────────┼────────────────────┐               │
│    │                    │                    │               │
│    ▼                    ▼                    ▼               │
│ ┌──────┐          ┌──────────┐         ┌──────────┐         │
│ │agent │          │  agent   │         │  human   │         │
│ │  A   │          │    B     │         │ approval │         │
│ └──────┘          └──────────┘         └──────────┘         │
└──────────────────────────────────────────────────────────────┘

The control plane is not an agent. It does not reason about user intent. It reasons about execution: which step is next, what state exists, whether the previous step succeeded, whether a human must approve, and what to do when the infrastructure fails mid-transition. The agents below it are interchangeable workers. The control plane is the durable skeleton that makes their work composable, recoverable, and observable.
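One way to make that separation concrete is a pure decision function that looks only at execution state and never at user intent. The sketch below is illustrative: `WorkflowState`, `Action`, and the `max_attempts` limit are hypothetical names chosen for this example, not part of any framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RUN_STEP = "run_step"
    RETRY_STEP = "retry_step"
    WAIT_FOR_APPROVAL = "wait_for_approval"
    FAIL = "fail"
    DONE = "done"


@dataclass
class WorkflowState:
    steps: list[str]
    completed: int = 0              # number of steps finished so far
    last_step_failed: bool = False  # did the most recent attempt fail?
    attempts: int = 0               # attempts made on the current step
    needs_approval: set[str] = field(default_factory=set)
    approved: set[str] = field(default_factory=set)


def next_action(state: WorkflowState, max_attempts: int = 3) -> Action:
    """Decide what the control plane does next, using execution state only."""
    if state.completed >= len(state.steps):
        return Action.DONE
    step = state.steps[state.completed]
    if state.last_step_failed:
        # Retry until the (assumed) attempt budget is spent, then give up.
        return Action.RETRY_STEP if state.attempts < max_attempts else Action.FAIL
    if step in state.needs_approval and step not in state.approved:
        return Action.WAIT_FOR_APPROVAL
    return Action.RUN_STEP


gated = WorkflowState(
    steps=["parse", "risk", "notify"],
    completed=2,
    needs_approval={"notify"},
)
decision = next_action(gated)  # the next step is gated and not yet approved
```

Because the function takes state in and returns a decision without side effects, it can be tested without running any agent at all.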

The recurring pressures

Every durable workflow system is pulled apart by the same tensions. These names recur in every chapter.

Pressure	What it asks
coordination cost	How much overhead does the orchestration layer add per step?
durability vs latency	Checkpointing saves progress but adds write latency to every transition.
plan freshness	A plan written at T=0 decays as the world changes during execution.
handoff fidelity	Each agent-to-agent boundary is a serialization boundary — context leaks or is lost.
human-time asymmetry	Agents act in milliseconds; humans approve in hours. The workflow must bridge both clocks.
tenant isolation	Shared infrastructure must never leak one customer's workflow state into another's execution.
testability	Multi-agent workflows are hard to reproduce, hard to mock, and hard to assert against.

The recurring vocabulary

Name	What it is
the control plane	the layer that dispatches agents, tracks state, and decides what runs next
the workflow graph	the declared structure of steps and edges — sequential, parallel, DAG, or conditional
the durable checkpoint	a snapshot of workflow state written to storage before dangerous or expensive transitions
the handoff contract	the typed interface between one step's output and the next step's input
the dispatch loop	the engine cycle: read state → pick next step → execute → write result → repeat
the replan trigger	the condition under which the original plan is abandoned and a new plan is generated
the approval gate	a pause point where execution suspends until a human or policy engine authorises continuation
the tenant boundary	the isolation surface that prevents one user's workflow from reading another's state
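Two of these terms, the handoff contract and the tenant boundary, are easy to show in miniature. The sketch below uses hypothetical names throughout (`EligibilityResult`, `TenantScopedStore`, and the `(tenant_id, workflow_id)` key layout are assumptions for illustration); it shows the shape of the ideas, not a production design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EligibilityResult:
    """Handoff contract: what the eligibility step promises the next step."""
    applicant_id: str
    eligible: bool
    bureau_score: int

    def __post_init__(self) -> None:
        # Validate at the boundary so a malformed handoff fails here,
        # not several steps later.
        if not self.applicant_id:
            raise ValueError("applicant_id must be non-empty")
        if self.bureau_score < 0:
            raise ValueError("bureau_score must be non-negative")


class TenantScopedStore:
    """Tenant boundary: every read and write is namespaced by tenant_id."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], dict] = {}

    def write(self, tenant_id: str, workflow_id: str, state: dict) -> None:
        self._data[(tenant_id, workflow_id)] = dict(state)

    def read(self, tenant_id: str, workflow_id: str) -> Optional[dict]:
        # A lookup with the wrong tenant_id finds nothing, even when the
        # workflow_id matches another tenant's workflow.
        return self._data.get((tenant_id, workflow_id))


store = TenantScopedStore()
handoff = EligibilityResult(applicant_id="app-7", eligible=True, bureau_score=712)
store.write("tenant-a", "wf-1", {"eligibility": handoff})

own_view = store.read("tenant-a", "wf-1")     # the owning tenant sees its state
other_view = store.read("tenant-b", "wf-1")   # another tenant sees nothing
```

Making the tenant identifier part of the key, rather than a filter applied after the read, means a forgotten check cannot return another tenant's state by accident.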

Memory map

#	File	Pressure answered	What it adds
01	why-orchestration	coordination cost vs agent autonomy	when a control plane becomes necessary
02	task-decomposition	plan quality vs execution flexibility	turning user intent into a workflow graph
03	agent-selection-routing	generality vs specialisation	choosing the right agent per step
04	workflow-patterns	latency vs dependency safety	sequential, parallel, DAG, conditional shapes
05	state-context-management	context richness vs token cost	shared state across steps without drowning
06	langgraph-deep-dive	framework leverage vs lock-in	LangGraph nodes, edges, state, checkpointing
07	plan-execution-manager	plan rigidity vs adaptiveness	tracking execution and triggering replans
08	human-in-the-loop	speed vs safety	approval gates and pause-resume mechanics
09	checkpoint-recovery	durability vs latency	surviving crashes without restarting
10	dynamic-replanning	plan freshness vs execution cost	when and how to abandon and rebuild a plan
11	multi-tenant-orchestration	sharing vs isolation	per-tenant state, quotas, and noisy-neighbor control
12	testing-orchestration	confidence vs cost	E2E tests, mocked agents, chaos drills
13	honest-admission	humility	what orchestration still cannot solve cleanly

Three traversal paths use this map:

Prerequisite path: read top to bottom; each file assumes the previous.
Failure path: when a production workflow breaks, find the pressure that is under-designed (checkpoint missing? handoff lossy? plan stale? tenant leaked?).
Synthesis path: combine two rows from different pressures (e.g., human-time asymmetry + plan freshness: what happens when a human takes 6 hours to approve and the data underlying step 4 has changed?).
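The synthesis example can be sketched as a check that runs when the approval finally arrives. The approach assumed here, purely for illustration, is to fingerprint the inputs a later step depends on at the moment the workflow pauses and compare them on resume; `fingerprint`, `PausedWorkflow`, and `resume_after_approval` are hypothetical names, and the sample values are made up.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(data: dict) -> str:
    """Stable hash of a JSON-serialisable dict (key order does not matter)."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class PausedWorkflow:
    workflow_id: str
    pending_step: str
    inputs_fingerprint: str  # captured when the workflow paused for approval


def resume_after_approval(paused: PausedWorkflow, current_inputs: dict) -> str:
    """Return 'continue' if the inputs are unchanged, otherwise 'replan'."""
    if fingerprint(current_inputs) == paused.inputs_fingerprint:
        return "continue"
    return "replan"


at_pause = {"credit_limit": 5000, "rate": 7.1}
paused = PausedWorkflow("loan-42", "step-4", fingerprint(at_pause))

# Six hours later the approval arrives, but one input has moved.
six_hours_later = {"credit_limit": 5000, "rate": 7.4}
decision_unchanged = resume_after_approval(paused, dict(at_pause))
decision_changed = resume_after_approval(paused, six_hours_later)
```

A hash comparison only detects that something changed, not whether the change matters; deciding that is the job of the replan trigger.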

Top resources

Temporal docs — durable execution concepts — https://docs.temporal.io/concepts
LangGraph documentation — https://langchain-ai.github.io/langgraph/
OpenAI Agents SDK — handoffs and orchestration — https://openai.github.io/openai-agents-python/
Anthropic — Building effective agents (orchestration patterns) — https://www.anthropic.com/engineering/building-effective-agents
Inngest — durable workflow engine for AI — https://www.inngest.com/docs
Microsoft Semantic Kernel — process orchestration — https://learn.microsoft.com/en-us/semantic-kernel/

What's coming

01-why-orchestration.md — Why a single agent breaks under production coordination pressure, and when the control plane becomes necessary.
02-task-decomposition.md — Turning messy user intent into a structured workflow graph with typed step boundaries.
03-agent-selection-routing.md — Choosing the right agent, model, or tool path for each step in the graph.
04-workflow-patterns.md — Sequential, parallel, DAG, and conditional workflow shapes — when each fits and what each costs.
05-state-context-management.md — Shared state across steps without context explosion or stale reads.
06-langgraph-deep-dive.md — LangGraph nodes, edges, reducers, state schemas, and built-in checkpointing.
07-plan-execution-manager.md — Plan creation, execution tracking, and controlled re-planning under drift.
08-human-in-the-loop.md — Approval gates, escalation policies, and workflows that pause for hours then resume cleanly.
09-checkpoint-recovery.md — Durable checkpoints, idempotent steps, and resume-vs-restart decisions at the workflow level.
10-dynamic-replanning.md — When the original plan decays and the workflow must adapt without losing completed work.
11-multi-tenant-orchestration.md — Per-tenant isolation, concurrency limits, quotas, and noisy-neighbor defence.
12-testing-orchestration.md — E2E workflow tests, mocked agent steps, deterministic replay, and chaos injection.
13-honest-admission.md — What remains brittle, surprising, or unsolved in production workflow orchestration.

Bridge. Module 01 gave the agent a loop, tools, memory, and a leash. But a single agent running a single trajectory is not a production system — it is one worker with no manager, no schedule, and no recovery plan. The moment two agents must coordinate, or one agent must survive a crash, or a human must approve before the next step fires, you need something above the agent. That something is the control plane. Next: why orchestration becomes necessary — and what specific failure makes it obvious. → 01-why-orchestration.md