01. Why orchestration — when the control plane becomes necessary¶

~18 min read. A single agent handles one task beautifully. Production work is never one task — it is a sequence of tasks with failures, approvals, and shared state between them. The moment that sequence matters, a control plane appears whether you design it or not.

Builds on the first-principles overview in 00-first-principles.md. The control plane, the layer that dispatches agents, tracks state, and decides what runs next, exists because coordination cost grows faster than agent capability. This file establishes when that cost forces the control plane into existence.

What module 01 left open¶

Module 01 built an agent that reasons, acts, remembers, and stops. That agent is a single worker. It handles a single trajectory — one user request, one tool sequence, one result. But production systems rarely need a single trajectory. They need five agents collaborating on a loan approval. They need a human approving step 4 before step 5 fires. They need recovery when step 3 crashes at 2 AM. They need per-tenant isolation so one customer's workflow never touches another's data.

None of that is the agent's job. The agent's job is to think and act within its step. The coordination above — which step runs next, what state passes between steps, what happens on failure, when to pause for a human — is the control plane's job. This file answers: when does that control plane become necessary, and what breaks without it?

The support team that shipped a monolith and regretted it¶

A fintech support team builds a complaint-resolution agent. V1: one ReAct agent with access to account lookup, policy search, refund API, and email sender. The prompt says "resolve the customer's complaint end-to-end." Demo: beautiful. The agent finds the account, checks the charge, retrieves the refund policy, processes the refund, and emails the customer. Five tools, one loop, eight seconds.

Week three. A compliance officer asks: "Which step decided the refund was valid? Can I see the policy version it used?" The team has no answer — the agent's reasoning is a blob in a single trace span. The decision and the action happened in the same loop iteration.

Week five. A refund of ₹1,40,000 goes through without human approval. The model decided it was "clearly valid" based on a retrieval chunk from an outdated policy document. There was no gate between "decide" and "execute" because there was no explicit boundary between those steps.

Week eight. The agent crashes mid-loop after calling the refund API but before sending the email. On restart, it re-runs the entire sequence. With no orchestration layer, nothing attaches an idempotency key to the refund call, so the API treats the retry as a new request. The customer receives two refunds. Incident cost: ₹1,40,000 and a week of engineering time.

Three failures. Same root cause: the team embedded coordination logic — sequencing, approval, recovery — inside the agent's reasoning loop instead of making it explicit infrastructure above the agent.
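The week-eight double refund is the kind of failure an idempotency key prevents. A minimal sketch, assuming a refund service that deduplicates on a caller-supplied key: the key is derived from the workflow ID and step name, so a retried step presents the same key. `FakeRefundAPI`, `idempotency_key`, and the IDs below are hypothetical stand-ins, not the team's actual API.

```python
import hashlib


def idempotency_key(workflow_id: str, step_name: str) -> str:
    """Derive a stable key: the same workflow step always yields the same key."""
    raw = f"{workflow_id}:{step_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeRefundAPI:
    """Hypothetical stand-in for a refund service that honours idempotency keys."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}   # idempotency key -> refund id
        self.refunds_issued = 0

    def refund(self, account: str, amount: int, key: str) -> str:
        if key in self._seen:             # replayed request: return the original result
            return self._seen[key]
        self.refunds_issued += 1
        refund_id = f"rf-{self.refunds_issued}"
        self._seen[key] = refund_id
        return refund_id


api = FakeRefundAPI()
key = idempotency_key("complaint-123", "execute_refund")
first = api.refund("acct-9", 140_000, key)
second = api.refund("acct-9", 140_000, key)   # simulated retry after a crash
```

Deriving the key from workflow and step identity, rather than generating a random one per call, is what lets the service recognise the retry after a restart.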

Teacher voice. The control plane is not an optimisation. It is the answer to a structural question: who decides what runs next, what state crosses each boundary, and what happens when something breaks mid-sequence? If that answer is "the agent figures it out from its prompt," you have a control plane — it's just hidden inside a prompt and unobservable.

The invariant: separate coordination from cognition¶

The chapter protects one rule: coordination logic belongs in explicit infrastructure, not inside agent reasoning.

An agent should think about the content of its step — evaluating a policy, writing code, summarising evidence. It should not think about whether this is the right moment to run, what happened two steps ago, or whether a human needs to approve before the next action. Those are workflow decisions. They belong in the control plane where they are observable, testable, and recoverable.

When coordination lives inside the agent's prompt, three things become impossible:

1. You cannot checkpoint between steps (the agent's context window is the only state).
2. You cannot insert an approval gate without rewriting the prompt.
3. You cannot test one step in isolation because all steps are fused into a single inference chain.
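To make the isolation point concrete, here is a hedged sketch of one step pulled out of the loop as a plain function with explicit inputs and a typed output. In a real system the body would likely call a model; a simple rule stands in so the example runs. `PolicyDecision`, `evaluate_policy`, the 30-day window, and the policy version string are all illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PolicyDecision:
    refund_valid: bool
    policy_version: str


def evaluate_policy(charge_amount: int, days_since_charge: int,
                    policy_version: str, window_days: int) -> PolicyDecision:
    """One step with explicit inputs and a typed output; it knows nothing about other steps."""
    valid = charge_amount > 0 and days_since_charge <= window_days
    return PolicyDecision(refund_valid=valid, policy_version=policy_version)


# The step can be exercised on its own, with no agent loop around it.
decision = evaluate_policy(charge_amount=140_000, days_since_charge=45,
                           policy_version="2024-03", window_days=30)

# Its output serialises cleanly, so a control plane could persist it between steps.
checkpoint = json.dumps(asdict(decision))
```

Because the step's output is a small serialisable value rather than a slice of a context window, the same boundary that makes it testable also makes it checkpointable.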

What a single agent actually does vs what production requires¶

what a single agent does well         what production workflows require
──────────────────────────────        ────────────────────────────────
one user, one task, one trajectory    many users, many tasks, concurrent
fast: 5-30 seconds                    long: minutes to days (human approvals)
stateless between requests            durable state across crashes
all-or-nothing                        partial progress preserved
self-judges completion                external validation gates
no pause capability                   pause for hours, resume cleanly
one context window                    state shared across specialists

The gap between these two columns is the control plane. Every row where the single agent falls short maps to a specific mechanism this module builds:

Gap                           Control plane mechanism                   File
──────────────────────────    ─────────────────────────────────────     ────
Concurrent execution          Workflow graph with parallel branches     04
Long duration / human wait    Approval gates with durable pause         08
Crash recovery                Durable checkpoints                       09
Partial progress              Step-level state persistence              05
External validation           Typed handoff contracts                   02
Multi-specialist              Agent routing per step                    03

Threaded example — loan approval workflow¶

This example threads through the module. Five steps, three agents, one human gate.

user submits loan application
         │
         ▼
┌──────────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                             │
│                                                                  │
│  step 1         step 2         step 3         step 4    step 5   │
│  ┌──────┐       ┌──────┐       ┌──────┐       ┌──────┐  ┌─────┐  │
│  │parse │──────→│check │──────→│score │──────→│human │─→│send │  │
│  │docs  │       │elig. │       │risk  │       │review│  │offer│  │
│  └──────┘       └──────┘       └──────┘       └──────┘  └─────┘  │
│     ▲              ▲              ▲              ▲         ▲     │
│     │              │              │              │         │     │
│  checkpoint     checkpoint     checkpoint     approval checkpoint│
│     c1             c2             c3            gate       c4    │
└──────────────────────────────────────────────────────────────────┘

Without a control plane: one agent receives "process this loan application" and internally reasons through all five steps. If step 3 crashes, everything restarts. If step 4 needs a human, the agent... waits? Times out? Hallucinates approval? There is no clean mechanism.

With a control plane: each step is a node in a workflow graph. Each transition writes a durable checkpoint. Step 4 is an approval gate that suspends execution until a human acts — hours later, days later. The control plane resumes from the checkpoint. Steps 1–3 never re-run. The human sees exactly what step 3 produced. Auditability is structural, not retrofitted.
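A minimal sketch of the with-control-plane behaviour described above, assuming an in-memory dict as a stand-in for a durable store and trivial handler functions in place of agents. The step names, `run`, and the state layout are hypothetical. The point is only that the gate suspends, the checkpoint survives, and steps 1 to 3 do not re-run on resume.

```python
import json

STEPS = ["parse_docs", "check_eligibility", "score_risk", "human_review", "send_offer"]
GATES = {"human_review"}          # steps that wait for a human decision


def run(state: dict, handlers: dict, store: dict) -> dict:
    """Advance the workflow until it finishes or reaches an unapproved gate."""
    while state["next"] < len(STEPS):
        step = STEPS[state["next"]]
        if step in GATES and not state.get("approved", False):
            state["status"] = "waiting_for_human"
            store[state["id"]] = json.dumps(state)    # durable pause
            return state
        if step not in GATES:
            state["outputs"][step] = handlers[step](state["outputs"])
        state["next"] += 1
        store[state["id"]] = json.dumps(state)        # checkpoint after each transition
    state["status"] = "done"
    store[state["id"]] = json.dumps(state)
    return state


calls = []                        # records which handlers actually executed


def make_handler(name):
    def handler(outputs):
        calls.append(name)
        return f"{name}: ok"
    return handler


handlers = {s: make_handler(s) for s in STEPS if s not in GATES}
store = {}                        # in-memory stand-in for a durable store

state = run({"id": "loan-1", "next": 0, "outputs": {}}, handlers, store)

# Simulate a restart hours later: reload from the store, record the approval, resume.
resumed = json.loads(store["loan-1"])
resumed["approved"] = True
final = run(resumed, handlers, store)
```

After the resume, `calls` contains each automated step exactly once. The human reviewer would read `resumed["outputs"]["score_risk"]`, which is exactly what step 3 produced.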

Five signals that you need a control plane¶

Not every agent system needs orchestration. A single-turn Q&A agent, a code-completion endpoint, and a classification call all work fine as standalone agents. The case for a control plane begins when you observe any of these signals:

Steps have different risk profiles. A retrieval step is harmless; a payment step is irreversible. You need a boundary between them with different policies.
Execution spans human time. A workflow pauses for approval, review, or input that takes minutes to days. The agent cannot hold its context window open that long.
Crash recovery must preserve progress. If a five-step workflow fails at step 4, restarting from step 1 wastes tokens, money, and time. You need checkpoints.
Multiple specialists own different steps. A coding agent, a review agent, and a testing agent each need isolated context and typed inputs. Sharing one agent's context window creates noise.
Auditability is a requirement. Compliance, debugging, or customer support need to know which step produced which output, what version of what policy was consulted, and where the decision was made. A single agent's chain-of-thought is not an audit trail.

If none of these apply, keep the single agent. If two or more apply, the control plane is not optional — it's either explicit infrastructure or hidden prompt complexity that will eventually fail.
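The auditability signal is the easiest to picture in code. A hedged sketch, assuming each step writes one record when it finishes: the compliance question from the fintech story ("which step decided the refund was valid, and with which policy version?") becomes a lookup. `AuditRecord`, `AuditLog`, the step names, and the version string are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    workflow_id: str
    step: str
    output: str
    policy_version: Optional[str] = None   # set only by steps that consult a policy


class AuditLog:
    """In-memory stand-in for an append-only audit store."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def record(self, rec: AuditRecord) -> None:
        self._records.append(rec)

    def produced_by(self, workflow_id: str, output: str) -> Optional[AuditRecord]:
        """Return the first record in this workflow that produced the given output."""
        for rec in self._records:
            if rec.workflow_id == workflow_id and rec.output == output:
                return rec
        return None


log = AuditLog()
log.record(AuditRecord("complaint-123", "lookup_account", "account_found"))
log.record(AuditRecord("complaint-123", "evaluate_policy", "refund_valid",
                       policy_version="2023-11"))
log.record(AuditRecord("complaint-123", "execute_refund", "refund_issued"))

answer = log.produced_by("complaint-123", "refund_valid")
```

Here `answer.step` and `answer.policy_version` name the deciding step and the policy it consulted, which a single trace span of fused reasoning cannot do.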

The wrong model: "just use a better prompt"¶

The most common mistake teams make when monolithic agent workflows start failing is to improve the prompt. Add more instructions. Add role-playing. Add "you are a careful planner who considers failures." This works briefly — then fails at scale for a structural reason: prompts are stateless, but coordination is stateful.

A prompt cannot:

- Persist state across pod restarts
- Pause for 6 hours waiting for human approval
- Route different steps to different cost tiers
- Guarantee idempotency on retry
- Provide per-step observability spans
- Enforce tenant isolation

These are infrastructure concerns. Solving them in a prompt is like solving database consistency in application code — it works until it doesn't, and when it doesn't, the failure mode is invisible.

Not a prompt problem. Not a model-intelligence problem. A missing-layer problem. The control plane is the missing layer.
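Tenant isolation shows why this is a layer rather than an instruction. A minimal sketch, assuming workflow state lives in a store whose keys always include the tenant: a lookup with the wrong tenant simply finds nothing, regardless of what any prompt says. `TenantScopedStore` and the IDs are hypothetical, and the in-memory dict stands in for a real database.

```python
class TenantScopedStore:
    """In-memory stand-in for a state store whose keys always include the tenant."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], dict] = {}

    def put(self, tenant_id: str, workflow_id: str, state: dict) -> None:
        self._data[(tenant_id, workflow_id)] = dict(state)   # store a copy

    def get(self, tenant_id: str, workflow_id: str) -> dict:
        try:
            return dict(self._data[(tenant_id, workflow_id)])
        except KeyError:
            raise LookupError(
                f"no workflow {workflow_id!r} for tenant {tenant_id!r}"
            ) from None


store = TenantScopedStore()
store.put("tenant-a", "wf-1", {"step": "score_risk"})

try:
    store.get("tenant-b", "wf-1")       # another tenant guessing the workflow ID
    leaked = True
except LookupError:
    leaked = False
```

The enforcement sits in the key structure, so it holds even if an agent's reasoning goes wrong.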

When to stay with a single agent¶

The control plane adds coordination cost — state writes, dispatch latency, handoff serialisation. That cost is not free. For some workloads, a single agent remains the right choice:

Single-turn tasks. Classification, summarisation, extraction with no downstream action.
Low-stakes, fast tasks. A chatbot that answers questions with no side effects.
Prototype phase. When you're still discovering what the workflow should be, a monolithic agent is faster to iterate on.
No human gate needed. If every step is automated and low-risk, the overhead of durable orchestration may exceed its value.

The boundary is not binary. Many production systems start as single agents, accumulate hidden coordination logic in their prompts, and eventually extract that logic into an explicit control plane when failures become too expensive to debug.

Where this lives in the wild¶

Devin (Cognition) — decomposes "build a feature" into inspect → plan → code → test → debug steps, routing each to specialised sub-agents with shared file-system state and checkpoint-based recovery.
OpenAI Deep Research — a control plane turns a broad research question into browse → read → quote → synthesise steps rather than asking one model call to do everything.
GitHub Copilot coding agent — sequences search, edit, test, and retry actions through a workflow layer so the agent operates safely over repositories.
Intercom Fin — support orchestration combines retrieval, answer drafting, confidence scoring, and human handoff as explicit workflow stages.
Temporal + LLM workflows — teams wrap agent calls in Temporal activities to get durable execution, retry policies, and crash recovery without reimplementing workflow infrastructure.

Recall¶

What structural problem does a control plane solve that a better prompt cannot?
In the fintech support example, which three production failures traced to missing orchestration?
What is the invariant this chapter protects?
Name three of the five signals that indicate a control plane is necessary.
When should you deliberately stay with a single agent and avoid orchestration overhead?
In the loan approval example, what happens at step 4 without a control plane vs with one?

Interview Q&A¶

Q: A team says "we'll just use a better model instead of adding orchestration." What's the structural argument against this?

A: A better model improves reasoning within a step but cannot add durable state persistence, crash recovery, human-time pause-and-resume, per-step observability, or tenant isolation. Those are infrastructure concerns that exist regardless of model capability. The gap is not intelligence — it is coordination statefulness.

Common wrong answer to avoid: "Because bigger models are expensive." Cost is real but secondary. The structural argument is that prompts are stateless and coordination is stateful — no model upgrade closes that gap.

Q: When would you argue against adding a control plane to an agent system?

A: When the workflow is single-turn, low-stakes, has no human gates, no crash-recovery requirement, and no audit obligation. The coordination cost of a workflow engine (state serialisation, dispatch latency, operational complexity) exceeds its value for simple, fast, stateless tasks.

Common wrong answer to avoid: "Never — always add orchestration." Over-engineering simple tasks with workflow infrastructure adds latency and operational surface without proportional benefit.

Q: Why does hidden coordination logic in a prompt become dangerous at scale?

A: Because it is unobservable (no per-step spans), untestable (can't test steps in isolation), unrecoverable (no checkpoints between steps), and unreviewable (compliance cannot audit which step made which decision). These are tolerable at demo scale and catastrophic at production scale.

Common wrong answer to avoid: "Because prompts get too long." Length is a symptom, not the cause. The cause is structural: coordination state cannot survive prompt boundaries.

Q: What's the difference between "the agent decides what to do next" and "the control plane decides what to do next"?

A: The agent uses reasoning (token prediction) to decide the next action — this is non-deterministic, unobservable mid-step, and lost on crash. The control plane uses explicit workflow logic (a graph, a state machine, a DAG) — this is deterministic, observable per-transition, and durable across failures. One is cognition; the other is coordination.

Common wrong answer to avoid: "The control plane is just a wrapper around the agent." It is not a wrapper — it owns a fundamentally different concern: execution order, state persistence, and recovery.
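One way to picture "explicit workflow logic" is a transition table. This is a sketch under stated assumptions: the step names follow the loan-approval example, while the outcome labels and the `reject` state are invented for illustration. The same (step, outcome) pair always yields the same next step, and an undefined transition fails loudly instead of being improvised.

```python
TRANSITIONS = {
    ("parse_docs", "ok"): "check_eligibility",
    ("check_eligibility", "ok"): "score_risk",
    ("check_eligibility", "ineligible"): "reject",
    ("score_risk", "ok"): "human_review",
    ("human_review", "approved"): "send_offer",
    ("human_review", "rejected"): "reject",
}


def next_step(current: str, outcome: str) -> str:
    """Look up the next step; refuse transitions the workflow does not define."""
    try:
        return TRANSITIONS[(current, outcome)]
    except KeyError:
        raise ValueError(f"no transition from {current!r} on {outcome!r}") from None


after_scoring = next_step("score_risk", "ok")
after_review = next_step("human_review", "approved")
```

The table can be read, diffed, and tested without running a model, which is the practical meaning of "observable per-transition".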

Design/debug exercise (10 min)¶

Modeled example. Take the fintech support agent from the opening. The monolithic version mixes retrieval, policy evaluation, refund execution, and notification in one loop. Identify: (a) which steps have different risk classes, (b) where a human gate would prevent the ₹1,40,000 incident, (c) where a checkpoint would prevent the double-refund.

Your turn. Pick one agent system you've built or studied. List every implicit step hidden in the prompt. For each, classify: pure-reasoning / external-read / side-effect / human-gate. Draw the workflow graph with checkpoints between side-effect steps.

From memory. Sketch the five signals that indicate a control plane is needed. Then draw the loan-approval workflow with its four checkpoints and one approval gate.

Operational memory¶

This chapter established the structural argument for orchestration: coordination logic — sequencing, state persistence, approval gates, crash recovery, tenant isolation — belongs in explicit infrastructure above the agent, not inside the agent's reasoning. The invariant is simple: separate coordination from cognition. When they are fused, you lose observability (can't see which step decided what), testability (can't test steps in isolation), and recoverability (can't checkpoint between steps).

The trigger for introducing a control plane is not "the agent is struggling." It is the presence of structural requirements that prompts cannot satisfy: steps with different risk profiles, human-time pauses, crash recovery, multi-specialist routing, or per-step audit. If none apply, keep the single agent, because orchestration has real coordination cost. If two or more apply, the control plane is not optional.

Remember:

Coordination is stateful; prompts are stateless. No model upgrade closes that gap.
A control plane does not replace agents — it gives them boundaries, state, and recovery.
Five signals force the control plane: different risk profiles, human-time spans, crash recovery, multi-specialist steps, audit requirements.
Hidden coordination logic in prompts is unobservable, untestable, and unrecoverable at scale.
Start with a single agent. Extract the control plane when structural failures — not reasoning failures — appear.
The loan-approval workflow threads through this module: five steps, three agents, one human gate, four checkpoints.

Bridge. The control plane exists. It dispatches agents and tracks state. But the first thing it needs is a plan — a structured decomposition of user intent into executable steps with typed boundaries. Without that, the dispatch loop has nothing to dispatch. Next: turning messy requests into workflow graphs. → 02-task-decomposition.md