07. Plan-execution manager — the layer that writes, tracks, and revises the route¶
~20 min read. A workflow graph defines what steps exist and how they connect. The plan-execution manager is the runtime layer that decides which steps to instantiate, monitors their progress, classifies their outcomes, and triggers recovery or replanning when execution deviates from intent. Without it, a static graph can't adapt to the reality of failing tools, changing goals, or exhausted budgets.
Built on the first-principles overview in 00-first-principles.md. Plan freshness — the pressure that any plan decays as execution reveals new information — is the central tension. The dispatch loop enforces timing, but the plan-execution manager decides what to dispatch next and whether the current plan still fits reality.
What file 06 established and what remains¶
File 06 mapped our vocabulary to LangGraph: StateGraph for the workflow graph, TypedDict for the handoff contract, checkpointers for durability. The compiled graph can execute a fixed sequence of nodes with conditional branching. The gap: a compiled graph is static. It doesn't know whether the plan it's executing is still valid — whether a precondition has broken, whether a budget is exhausted, whether the user has changed scope mid-run. The plan-execution manager sits above the graph engine and provides that runtime intelligence.
The coding agent that kept editing the wrong file for 40 minutes¶
A coding agent receives: "Fix the authentication timeout bug in the user service and add a regression test." The initial plan:
Plan v1:
step 1: search for timeout-related code in user-service/
step 2: identify root cause
step 3: implement fix
step 4: write regression test
step 5: run test suite
step 6: commit if green
Step 1 finds three files mentioning timeout. Step 2 picks session_handler.py as the likely root cause. Step 3 edits it. Step 4 writes a test. Step 5 runs the suite — the original bug still reproduces. The fix was wrong.
A system without a plan-execution manager retries step 3 with the same assumption ("the bug is in session_handler.py"). It edits again. Tests fail again. It edits again. Forty minutes and 12 attempts later, the budget is exhausted. The actual bug was in token_refresh.py.
A system with a plan-execution manager classifies the step 5 failure: "regression test still fails after fix — root cause assumption may be wrong." It doesn't retry step 3. It invalidates step 2's conclusion and triggers a scoped replan: "re-execute step 2 with broader search, explicitly excluding the file already attempted." The second pass finds token_refresh.py. The fix lands on the third attempt (not the twelfth).
Without plan manager:               With plan manager:
step 3 → fail → retry step 3        step 3 → fail → classify → invalidate step 2
       → fail → retry step 3               → replan step 2 with new constraint
       → fail → retry step 3               → correct root cause found
       → ... (12 attempts)                 → fix lands (3 attempts total)
The difference isn't smarter code generation. It's smarter execution governance.
Teacher voice. The plan-execution manager's value is not in making the plan. It's in detecting when the plan is wrong and choosing the minimal corrective action — retry, fallback, replan, or escalate — rather than mindlessly repeating failed steps.
The invariant: a plan is a falsifiable hypothesis, not a fixed script¶
Every plan is an assertion: "Given what we know now, these steps in this order should achieve the goal." Execution either confirms or refutes that hypothesis. The plan-execution manager's job is to update the hypothesis as evidence arrives — not to defend the original plan against contradicting reality.
This makes the manager fundamentally different from a simple task queue. A task queue dispatches work in order. A plan-execution manager reasons about whether the order is still correct.
Three responsibilities of the plan-execution manager¶
┌─────────────────────────────────────────────────────────────────┐
│ PLAN-EXECUTION MANAGER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. PLAN GENERATION │
│ user goal → decomposed steps with contracts │
│ │
│ 2. EXECUTION TRACKING │
│ per-step: status, retries, elapsed time, outputs, cost │
│ │
│ 3. DEVIATION RESPONSE │
│ failure classification → retry / fallback / replan / abort │
│ │
└─────────────────────────────────────────────────────────────────┘
These responsibilities can be (and often should be) implemented by different subsystems. The planner may be an LLM that generates structured step records. The tracker may be a deterministic state machine. The deviation responder may be a policy engine with rules. Separating them prevents the "one agent does everything" fragility.
Plan representation: machine-readable, not prose¶
A plan must be inspectable, trackable, and diffable. Prose ("look at the code, fix it, test it") is useless for execution governance. Each step needs typed fields:
step_record:
├── id: "pull_credit"
├── goal: "Retrieve applicant credit score from bureau API"
├── inputs: [applicant_id, identity_verification_result]
├── preconditions: ["identity_verified == true"]
├── owner: "credit_agent"
├── success_signal: "credit_score is integer between 300-850"
├── failure_class: "retrieval_failure"
├── retry_policy: {max_attempts: 3, backoff: exponential}
├── fallback: "use cached score if < 24h old, else escalate"
├── side_effects: ["external API call"]
└── idempotent: true (bureau API is read-only)
The critical fields most teams omit:
| Field | Why it matters |
|---|---|
| success_signal | Without it, the tracker can't distinguish "done" from "ran without error" |
| failure_class | Without it, the deviation responder treats all failures identically |
| preconditions | Without them, steps execute against stale or invalid state |
| side_effects | Without them, retry safety is unknown |
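One way to make the step_record fields above concrete is a typed record. The following is a minimal sketch, not a prescribed schema: the class names `StepRecord` and `RetryPolicy`, the `safe_to_retry` helper, and the `issue_decision` goal, owner, and failure class are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 1
    backoff: str = "none"  # assumed vocabulary: "none", "fixed", "exponential"


@dataclass(frozen=True)
class StepRecord:
    id: str
    goal: str
    owner: str
    success_signal: str   # what "done" means, not just "ran without error"
    failure_class: str    # default classification hint for the deviation responder
    inputs: Tuple[str, ...] = ()
    preconditions: Tuple[str, ...] = ()
    retry_policy: RetryPolicy = RetryPolicy()
    fallback: Optional[str] = None
    side_effects: Tuple[str, ...] = ()
    idempotent: bool = False

    def safe_to_retry(self) -> bool:
        # Retry is known to be safe only when the step is idempotent
        # or declares no side effects at all.
        return self.idempotent or not self.side_effects


pull_credit = StepRecord(
    id="pull_credit",
    goal="Retrieve applicant credit score from bureau API",
    owner="credit_agent",
    success_signal="credit_score is integer between 300-850",
    failure_class="retrieval_failure",
    inputs=("applicant_id", "identity_verification_result"),
    preconditions=("identity_verified == true",),
    retry_policy=RetryPolicy(max_attempts=3, backoff="exponential"),
    fallback="use cached score if < 24h old, else escalate",
    side_effects=("external API call",),
    idempotent=True,
)

# Hypothetical record: goal, owner, and failure_class are illustrative values.
issue_decision = StepRecord(
    id="issue_decision",
    goal="Record and communicate the loan decision",
    owner="decision_agent",
    success_signal='decision in ["approved", "denied"]',
    failure_class="write_failure",
    side_effects=("write to loan DB", "trigger notification"),
    idempotent=False,
)
```

Freezing the records keeps a plan version immutable once issued, so a replan produces a new version that can be diffed against the old one.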
Mini-FAQ. "Does the LLM planner actually produce all these fields?" Often not on the first pass. A common pattern: the LLM generates a rough plan (goals + order), then a deterministic validator enriches it with default retry policies, precondition checks from the state schema, and side-effect annotations from a tool registry. The enrichment step is what makes the plan executable rather than aspirational.
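The enrichment pattern described in the Mini-FAQ can be sketched as a deterministic pass over the planner's rough output. Everything named here (`TOOL_REGISTRY`, `SCHEMA_PRECONDITIONS`, `enrich`, `missing_fields`, and the default retry policy) is a hypothetical stand-in; a real system would load these from its own tool registry and state schema.

```python
# Hypothetical tool registry and state-schema preconditions; names are illustrative.
TOOL_REGISTRY = {
    "pull_credit": {"side_effects": ["external API call"], "idempotent": True},
    "issue_decision": {
        "side_effects": ["write to loan DB", "trigger notification"],
        "idempotent": False,
    },
}
SCHEMA_PRECONDITIONS = {
    "pull_credit": ["identity_verified == true"],
    "issue_decision": ["compliance resolved"],
}
DEFAULT_RETRY = {"max_attempts": 1, "backoff": "none"}  # assumed conservative default
REQUIRED_FIELDS = ("goal", "success_signal", "preconditions", "side_effects", "retry_policy")


def enrich(rough_steps):
    """Fill in fields the planner left out, without overwriting what it provided."""
    enriched = []
    for step in rough_steps:
        record = dict(step)
        tool = TOOL_REGISTRY.get(record["id"], {})
        record.setdefault("retry_policy", dict(DEFAULT_RETRY))
        record.setdefault("preconditions", list(SCHEMA_PRECONDITIONS.get(record["id"], [])))
        record.setdefault("side_effects", list(tool.get("side_effects", [])))
        record.setdefault("idempotent", tool.get("idempotent", False))
        enriched.append(record)
    return enriched


def missing_fields(record):
    """Fields that enrichment cannot default, so the planner must supply them."""
    return [name for name in REQUIRED_FIELDS if name not in record]


rough_plan = [
    {
        "id": "pull_credit",
        "goal": "Retrieve applicant credit score from bureau API",
        "success_signal": "credit_score is integer between 300-850",
    },
    {"id": "issue_decision", "goal": "Issue the loan decision"},  # no success_signal given
]
plan = enrich(rough_plan)
gaps = {record["id"]: missing_fields(record) for record in plan}
```

In this sketch a success signal is deliberately never defaulted: a step that lacks one is reported in `gaps` and can be sent back to the planner instead of being executed.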
Threaded example: loan-approval plan tracked through execution¶
The loan-approval workflow from previous files. The plan-execution manager produces:
Plan v1 (generated from goal: "Process loan application for applicant_id=A-7291"):
step 1: verify_identity
success: identity_verified == true
precondition: applicant_id exists
owner: identity_agent
retry: 2 attempts, 5s backoff
step 2: pull_credit
success: credit_score in [300, 850]
precondition: identity_verified == true
owner: credit_agent
retry: 3 attempts, exponential backoff
fallback: cached_score if fresh
step 3: compliance_check
success: compliance_flag in ["pass", "review"]
precondition: credit_score exists
owner: compliance_agent
retry: 1 attempt
step 4: human_review (conditional)
trigger: compliance_flag == "review"
success: human_override in ["approved", "denied"]
timeout: 48h → escalate to manager
step 5: issue_decision
success: decision in ["approved", "denied"]
precondition: compliance resolved
side_effects: [write to loan DB, trigger notification]
idempotent: no (idempotency key required)
Execution begins. The tracker maintains:
execution_state:
├── current_step: "pull_credit"
├── completed: ["verify_identity"]
├── pending: ["compliance_check", "human_review?", "issue_decision"]
├── retries_used: {pull_credit: 1}
├── elapsed: 4.2s
├── budget_remaining: $0.18 of $0.25
├── side_effects_committed: []
└── plan_version: 1
Step 2 (pull_credit) returns HTTP 503 on first attempt. The tracker increments retries. Second attempt succeeds with credit_score: 720. The tracker advances to step 3.
Step 3 (compliance_check) returns compliance_flag: "review". The conditional branch activates step 4. The tracker records the branch decision and pauses for human input.
This is not the graph executing. This is the manager watching the graph execute and maintaining a runtime view of plan health.
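The tracker's bookkeeping in this walkthrough can be approximated with a small deterministic state holder. This is a sketch under assumptions: `ExecutionState`, `record_outcome`, and the three return values are illustrative names, and `retries_used` counts failed attempts, which matches the `{pull_credit: 1}` snapshot above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExecutionState:
    pending: List[str]
    completed: List[str] = field(default_factory=list)
    retries_used: Dict[str, int] = field(default_factory=dict)
    plan_version: int = 1

    @property
    def current_step(self) -> Optional[str]:
        return self.pending[0] if self.pending else None


def record_outcome(state: ExecutionState, step_id: str, succeeded: bool, max_attempts: int) -> str:
    """Record one attempt and return 'advance', 'retry', or 'deviation'."""
    if step_id != state.current_step:
        raise ValueError(f"{step_id!r} is not the current step ({state.current_step!r})")
    if succeeded:
        state.pending.pop(0)
        state.completed.append(step_id)
        return "advance"
    failed = state.retries_used.get(step_id, 0) + 1
    state.retries_used[step_id] = failed
    # Attempts remain: stay inside the plan. Otherwise hand over to deviation response.
    return "retry" if failed < max_attempts else "deviation"


state = ExecutionState(
    pending=["verify_identity", "pull_credit", "compliance_check", "issue_decision"]
)
signals = [
    record_outcome(state, "verify_identity", succeeded=True, max_attempts=2),
    record_outcome(state, "pull_credit", succeeded=False, max_attempts=3),  # HTTP 503
    record_outcome(state, "pull_credit", succeeded=True, max_attempts=3),   # score 720
]
```

The tracker only records and signals. Deciding what a "deviation" means is left to a separate component.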
Failure classification: the manager's core intellectual work¶
When a step fails, the naive response is "retry." The manager's job is classification: what kind of failure is this, and what's the minimal corrective action?
Failure classification tree:
step_failed
├── transient? (network timeout, rate limit, 503)
│ └── action: retry with backoff
├── deterministic? (bad input, missing precondition)
│ └── action: check preconditions, maybe replan earlier steps
├── assumption_broken? (root cause was wrong, goal changed)
│ └── action: replan affected branch
├── budget_exhausted? (retries used up, cost ceiling hit)
│ └── action: fallback or escalate
└── unknown?
└── action: log full context, escalate to human
The coding agent example from earlier failed because it lacked this classification. It treated "test still fails" as transient (just retry the edit) when it was actually "assumption broken" (wrong root cause). That classification error turned a 3-attempt fix into a 12-attempt waste.
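The classification tree can be expressed as an ordered rule list. The sketch below is one possible ordering, not the only reasonable one: the failure payload keys (`kind`, `http_status`), the set of transient HTTP codes, and the rule that an unmet success signal always means a broken assumption are simplifying assumptions.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"
    DETERMINISTIC = "deterministic"
    ASSUMPTION_BROKEN = "assumption_broken"
    BUDGET_EXHAUSTED = "budget_exhausted"
    UNKNOWN = "unknown"


ACTION = {
    FailureClass.TRANSIENT: "retry with backoff",
    FailureClass.DETERMINISTIC: "check preconditions, maybe replan earlier steps",
    FailureClass.ASSUMPTION_BROKEN: "replan affected branch",
    FailureClass.BUDGET_EXHAUSTED: "fallback or escalate",
    FailureClass.UNKNOWN: "log full context, escalate to human",
}

TRANSIENT_HTTP = {429, 503, 504}  # assumed set of retryable status codes


def classify(failure, attempts_used, max_attempts, budget_remaining):
    """Map a failure payload to a class. Rules that make retrying pointless run first."""
    kind = failure.get("kind")
    if kind == "success_signal_unmet":
        # The step ran cleanly but did not achieve its goal (simplified rule).
        return FailureClass.ASSUMPTION_BROKEN
    if kind in {"bad_input", "missing_precondition"}:
        return FailureClass.DETERMINISTIC
    if budget_remaining <= 0 or attempts_used >= max_attempts:
        return FailureClass.BUDGET_EXHAUSTED
    if kind == "timeout" or failure.get("http_status") in TRANSIENT_HTTP:
        return FailureClass.TRANSIENT
    return FailureClass.UNKNOWN


# The coding agent's step 5: tests still fail after the edit was applied.
tests_still_fail = classify({"kind": "success_signal_unmet"}, 1, 3, 0.10)
# The loan workflow's first pull_credit attempt: HTTP 503 with attempts left.
bureau_503 = classify({"http_status": 503}, 1, 3, 0.18)
```

Checking for a broken assumption before checking for transience is what keeps "test still fails" from being retried as if it were a network blip.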
Teacher voice. Failure classification is the single highest-leverage piece of a plan-execution manager. Teams that build sophisticated planners but naive failure handlers (retry everything) will underperform teams with simple planners but thoughtful failure classification.
Progress visibility: what the manager exposes¶
The execution tracker serves multiple audiences:
User-facing view:
├── "Verifying your identity... ✓"
├── "Checking credit history... ✓"
├── "Compliance review in progress..."
└── (hides: retries, fallbacks, internal routing)
Operator-facing view:
├── step: compliance_check
├── status: running (attempt 1/1)
├── elapsed: 2.1s
├── budget: $0.12 remaining
├── last_failure: null
└── plan_version: 1, no replans triggered
Debug view (post-mortem):
├── full step-by-step trace with timestamps
├── every retry with error payloads
├── failure classifications and chosen actions
├── state diffs at each transition
├── replan diffs if any
└── cost breakdown per step
These aren't three separate systems. They're three projections of the same execution state, filtered by audience. The dispatch loop timestamps everything; the manager annotates with classification and decision metadata.
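The "three projections of one state" idea can be sketched as plain functions over a shared record. The state layout, the `label` field, and the function names are assumptions made for this example; the values mirror the views above.

```python
# Hypothetical shared execution state; field names are illustrative.
STATE = {
    "plan_version": 1,
    "budget_remaining": 0.12,
    "steps": [
        {"id": "verify_identity", "label": "Verifying your identity",
         "status": "done", "attempts": 1, "max_attempts": 2},
        {"id": "pull_credit", "label": "Checking credit history",
         "status": "done", "attempts": 2, "max_attempts": 3},
        {"id": "compliance_check", "label": "Compliance review",
         "status": "running", "attempts": 1, "max_attempts": 1},
    ],
}


def user_view(state):
    """Progress lines only; retries, fallbacks, and routing stay hidden."""
    lines = []
    for step in state["steps"]:
        if step["status"] == "done":
            lines.append(f"{step['label']}... ✓")
        elif step["status"] == "running":
            lines.append(f"{step['label']} in progress...")
    return lines


def operator_view(state):
    """Details of the step currently running, or an empty dict if none is."""
    current = next((s for s in state["steps"] if s["status"] == "running"), None)
    if current is None:
        return {}
    return {
        "step": current["id"],
        "status": f"running (attempt {current['attempts']}/{current['max_attempts']})",
        "budget_remaining": state["budget_remaining"],
        "plan_version": state["plan_version"],
    }
```

Because both views read the same record, the user can never be told a step is complete while the operator still sees it retrying.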
When the plan must change: replan triggers¶
Not every failure requires replanning. Most failures are handled by retry or fallback within the current plan. Replanning is expensive (generates new steps, may invalidate downstream state) and should have explicit triggers:
| Trigger | Example | Replan scope |
|---|---|---|
| Precondition permanently false | Identity verification reveals applicant is a minor (ineligible) | Abort workflow |
| Root cause assumption broken | Fix applied to wrong file, tests still fail | Replan from diagnosis step |
| User scope change | "Also check business credit, not just personal" | Add branch, preserve completed work |
| Budget exhaustion with work remaining | $0.25 spent, decision step not reached | Simplify remaining steps or escalate |
| New high-confidence evidence | Fraud signal detected mid-workflow | Replan to investigation branch |
The manager records: why the plan changed, which steps were invalidated, and what state from the old plan remains trusted in the new plan. Without this audit trail, replanning becomes invisible improvisation.
replan_record:
├── trigger: "step 2 retry exhausted, no cached score available"
├── old_plan_version: 1
├── new_plan_version: 2
├── invalidated_steps: ["pull_credit"]
├── preserved_steps: ["verify_identity" (completed)]
├── new_steps: ["manual_credit_entry" → human provides score]
└── trusted_state: {applicant_id, identity_verified}
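A scoped replan that produces this kind of record might look like the sketch below. The function name `scoped_replan` and its signature are hypothetical; the point is only that invalidated steps are swapped in place while every other step, and the audit trail, is preserved.

```python
def scoped_replan(plan, completed, replacements, trigger, old_version):
    """Swap invalidated steps for their replacements; leave all other steps untouched."""
    new_plan, new_steps = [], []
    for step in plan:
        if step in replacements:
            new_plan.extend(replacements[step])
            new_steps.extend(replacements[step])
        else:
            new_plan.append(step)
    record = {
        "trigger": trigger,
        "old_plan_version": old_version,
        "new_plan_version": old_version + 1,
        "invalidated_steps": [s for s in plan if s in replacements],
        "preserved_steps": [s for s in plan if s in completed and s not in replacements],
        "new_steps": new_steps,
    }
    return new_plan, record


plan_v1 = ["verify_identity", "pull_credit", "compliance_check", "issue_decision"]
plan_v2, replan_record = scoped_replan(
    plan=plan_v1,
    completed={"verify_identity"},
    replacements={"pull_credit": ["manual_credit_entry"]},
    trigger="step 2 retry exhausted, no cached score available",
    old_version=1,
)
```

Downstream steps such as compliance_check stay in the plan unchanged, which is the practical difference between a scoped replan and regenerating everything.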
Separation of concerns: planner vs tracker vs policy¶
A common mistake: one LLM agent does planning, tracking, and deviation response in its context window. This creates a single point of fragility — if the model hallucinates progress or misclassifies a failure, nothing catches the error.
Better architecture:
┌──────────────────────────────────────────────────────────────┐
│ Planner (LLM) │
│ Generates/revises plan structure │
│ Input: goal + constraints + current state │
│ Output: list[step_record] │
├──────────────────────────────────────────────────────────────┤
│ Tracker (deterministic) │
│ Maintains execution state, enforces preconditions │
│ Input: step outcomes + timestamps │
│ Output: progress view + deviation signals │
├──────────────────────────────────────────────────────────────┤
│ Policy engine (rules + optional LLM) │
│ Classifies failures, chooses response action │
│ Input: deviation signal + step metadata + budget state │
│ Output: retry / fallback / replan / escalate / abort │
└──────────────────────────────────────────────────────────────┘
The tracker is the source of truth. It never hallucinates progress — it records what actually happened. The policy engine can use an LLM for complex classification, but its decisions are logged and auditable. The planner is called only when the policy engine determines replanning is needed.
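To illustrate how the three layers interact, here is a deliberately small loop in which the planner, executor, and policy are injected as plain functions. It is a sketch under assumptions: `run`, the callback signatures, and the stub behaviours (a bureau outage, a three-attempt policy) are all invented for this example, and the in-memory `trace` list stands in for a real tracker.

```python
def run(goal, planner, execute, decide, max_replans=2):
    """Drive a plan; the planner is consulted again only when the policy asks for a replan."""
    plan = planner(goal, [], None)
    trace = []  # tracker: append-only record of what actually happened
    index, attempts, replans = 0, 0, 0
    while index < len(plan):
        step = plan[index]
        attempts += 1
        succeeded = execute(step)
        trace.append({"step": step, "attempt": attempts, "ok": succeeded})
        if succeeded:
            index, attempts = index + 1, 0
            continue
        action = decide(step, attempts)  # policy engine: logged, therefore auditable
        trace.append({"step": step, "action": action})
        if action == "retry":
            continue
        if action == "replan" and replans < max_replans:
            completed = plan[:index]  # trusted work survives the replan
            plan = completed + planner(goal, completed, step)
            attempts, replans = 0, replans + 1
            continue
        return "escalated", plan, trace
    return "done", plan, trace


# Stubs standing in for an LLM planner, the graph engine, and a rule-based policy.
def planner(goal, completed, failed_step):
    if failed_step == "pull_credit":
        return ["manual_credit_entry", "compliance_check", "issue_decision"]
    return ["verify_identity", "pull_credit", "compliance_check", "issue_decision"]


def execute(step):
    return step != "pull_credit"  # simulate a bureau API outage


def decide(step, attempts):
    return "retry" if attempts < 3 else "replan"


status, final_plan, trace = run("Process loan application", planner, execute, decide)
actions = [entry["action"] for entry in trace if "action" in entry]
```

In this run the planner is called exactly twice: once for the initial plan and once after the policy gives up on retries.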
Operational signals — healthy manager, degrading manager, broken manager¶
Healthy behaviour:
- Plan generation takes < 2s for standard workflows
- Failure classification resolves within one step (no cascading misclassification)
- Replan frequency < 1 per 5 workflow runs
- Budget utilisation stays within 80% of allocation
- Progress view updates are available within 500ms of step completion
First degrading signal:
- Replan frequency increasing → either environment instability or weak initial plans
- Retry exhaustion rate climbing → transient classification being applied to deterministic failures
- Budget overruns appearing → plan complexity exceeding allocation, or steps costing more than estimated
- Step completion times drifting beyond 2× estimate → execution stalling without detection
Misleading metric:
- "Plan accuracy" (% of steps that execute without change): rewards conservative plans that avoid hard problems rather than plans that adapt intelligently
- "Steps completed per minute": fast completion of wrong steps is worse than slow completion of right ones
Expert signal:
- Classification accuracy: did the manager correctly identify failure type on first try?
- Replan precision: did replans fix the problem, or did they create new problems?
- State trust ratio: what fraction of state from old plans survived into new plans unchanged?
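The three expert signals reduce to simple ratios once the relevant events are logged. The sketch below assumes a hypothetical log format (`first_class`, `true_class`, `resolved`, `new_failures`) in which the "true" class comes from a post-mortem review; the sample numbers are made up to show the arithmetic.

```python
def classification_accuracy(failures):
    """Share of failures whose first classification matched the post-mortem label."""
    if not failures:
        return None
    hits = sum(1 for f in failures if f["first_class"] == f["true_class"])
    return hits / len(failures)


def replan_precision(replans):
    """Share of replans that resolved the problem without creating new failures."""
    if not replans:
        return None
    good = sum(1 for r in replans if r["resolved"] and not r["new_failures"])
    return good / len(replans)


def state_trust_ratio(old_state_keys, trusted_keys):
    """Fraction of old-plan state that was carried into the new plan as trusted."""
    old = set(old_state_keys)
    if not old:
        return None
    return len(old & set(trusted_keys)) / len(old)


# Illustrative data only.
failures = [
    {"first_class": "transient", "true_class": "transient"},
    {"first_class": "transient", "true_class": "assumption_broken"},
    {"first_class": "deterministic", "true_class": "deterministic"},
    {"first_class": "unknown", "true_class": "unknown"},
]
replans = [
    {"resolved": True, "new_failures": 0},
    {"resolved": True, "new_failures": 2},
]
accuracy = classification_accuracy(failures)   # 3 of 4 first-try labels were right
precision = replan_precision(replans)          # 1 of 2 replans was clean
trust = state_trust_ratio(
    ["applicant_id", "identity_verified", "credit_score", "compliance_flag"],
    ["applicant_id", "identity_verified"],
)
```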
Boundary of applicability¶
Works unusually well:
- Multi-step agent workflows (coding, research, support) where steps have explicit success signals and failure modes
- Workflows with budget constraints where the manager must make tradeoffs (cheaper model vs retry vs escalate)
- Long-running workflows (minutes to hours) where the environment may change mid-execution
Becomes pathological:
- Single-shot generation tasks with no sequential dependencies: the overhead of plan management exceeds the benefit
- Real-time systems requiring sub-second responses: plan generation adds latency
- Workflows where every step is independent (no shared state, no preconditions): a simple task queue suffices
Scale that invalidates naive intuition:
- At 100+ concurrent workflows, the plan-execution manager itself becomes a bottleneck if it's a single LLM call per decision; batch classification or a rule-based fast path becomes necessary
- At plan depth > 20 steps, initial plan generation quality degrades rapidly; iterative plan extension (plan 5 steps ahead, replan after 5 complete) outperforms one-shot full plans
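Iterative plan extension can be sketched as a rolling-horizon loop: plan a short window, execute it, then ask the planner again with everything completed so far. The names `run_rolling` and `plan_next`, the `max_steps` guard, and the twelve-step toy workload are assumptions for illustration.

```python
def run_rolling(goal, plan_next, execute, horizon=5, max_steps=100):
    """Plan `horizon` steps at a time; re-plan after each window completes."""
    completed = []
    while len(completed) < max_steps:
        window = plan_next(goal, completed, horizon)[:horizon]
        if not window:  # planner reports nothing left to do
            break
        for step in window:
            if len(completed) >= max_steps:
                break
            execute(step)
            completed.append(step)
    return completed


# Toy planner: reveals a fixed 12-step workload one window at a time.
ALL_STEPS = [f"step_{n}" for n in range(1, 13)]
planner_calls = []


def plan_next(goal, completed, horizon):
    planner_calls.append(len(completed))
    return ALL_STEPS[len(completed):len(completed) + horizon]


executed = run_rolling("demo goal", plan_next, execute=lambda step: None)
```

Each planner call sees the real outcome of the previous window, so later steps are planned against evidence instead of a prediction made at the start.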
Failure-prone assumption: "a good planner doesn't need execution management"¶
The seductive wrong idea: "If the LLM generates a perfect plan upfront, we just need to execute it sequentially — no tracking, no classification, no replanning needed."
The correction: Plans are hypotheses. Execution produces evidence. No planner — not GPT-5, not a human PM — can perfectly predict tool failures, changing requirements, ambiguous evidence, or budget exhaustion before they happen. The manager exists because reality diverges from prediction, always.
A coding agent that plans "search → fix → test" without execution management is exactly the agent that retries the wrong fix twelve times. The plan was reasonable. Reality disagreed. Without a manager, there's no mechanism to notice or respond.
Real-world implementations¶
- Devin by Cognition — maintains an explicit plan view with step tracking, allows plan revision when test failures indicate wrong assumptions, surfaces plan state to users for course-correction
- GitHub Copilot coding agent — tracks file searches, edits, test runs, and build results as execution state; classifies test failures to decide between re-editing and broadening the search
- OpenAI Deep Research — monitors research progress across browsing sessions, decides when enough evidence exists to synthesize vs when more searching is needed
- Claude computer use — maintains step-level execution tracking for multi-step desktop automation, with failure detection when UI state doesn't match expected post-conditions
- Cursor Agent — tracks plan progress across file edits, distinguishes between "edit didn't compile" (retry with different approach) and "wrong file targeted" (replan search)
- Amazon CodeWhisperer Agent — execution tracking for multi-file refactoring ensures that dependent edits aren't attempted when earlier steps failed
- Adept AI — workflow execution in enterprise software tracks preconditions (is the right page loaded? is the form in the right state?) before attempting actions
- Sierra customer support — tracks conversation state and plan progress for multi-step issue resolution, escalates when plan assumptions (e.g., "customer has order ID") prove false
Recall checkpoint¶
- What three responsibilities does a plan-execution manager own?
- Why does failure classification matter more than retry count?
- What fields in a step_record are most often omitted, and what breaks without them?
- When should the manager trigger a replan vs retry the current step?
- Why separate the planner, tracker, and policy engine into different components?
- What makes "plan accuracy" a misleading metric?
- How does the execution tracker serve different audiences (user, operator, debugger)?
Interview Q&A¶
Q: Why separate plan generation from execution tracking?
A: Generation produces a hypothesis about what to do. Tracking produces ground truth about what actually happened. Conflating them in one agent loop means the system can hallucinate progress or ignore failures, because there's no independent source of truth.
Common wrong answer to avoid: "Because LLMs forget previous steps." Memory is one factor, but the fundamental issue is separating hypothesis from evidence.
Q: Why must step records contain success signals, not just step names?
A: The execution tracker needs to know when a step is actually done vs merely ran. "Run tests" isn't a success signal. "All targeted tests pass and no new failures introduced" is. Without explicit signals, the manager advances blindly.
Common wrong answer to avoid: "For better logging." Logging benefits, but the core issue is execution governance: knowing when to advance, retry, or escalate.
Q: Why is failure classification the highest-leverage piece of the manager?
A: Because the correct response to "step failed" depends entirely on why it failed. Retrying a wrong assumption wastes budget. Replanning a transient network error wastes time. Classification determines which lever to pull.
Common wrong answer to avoid: "Because different errors need different retry counts." Retry count is one downstream decision, but classification governs the entire response: retry, fallback, replan, or abort.
Q: Why limit replan scope rather than regenerating the entire plan?
A: Full regeneration discards trusted state from completed steps, increases cost, and creates unnecessary instability. Scoped replanning (change only the affected branch) preserves prior work and maintains audit continuity.
Common wrong answer to avoid: "Because replanning is expensive." Cost is one reason, but state preservation and explainability are equally important.
Q: Why should the deviation policy engine be auditable rather than a black-box LLM call?
A: When a workflow takes an unexpected path (escalation instead of retry, abort instead of continue), operators need to understand why. An unauditable decision makes the system untrustworthy even if the decision was correct.
Common wrong answer to avoid: "Because LLMs are unreliable." Reliability varies, but the deeper issue is that production systems require explainable governance decisions.
Q: When does a plan-execution manager add more overhead than value?
A: For single-shot tasks with no dependencies, no failure modes, and no budget constraints. Also for real-time systems where the latency of plan generation and classification exceeds the SLA. The manager is for multi-step workflows where adaptation matters.
Common wrong answer to avoid: "When the planner is already good enough." Even perfect planners face runtime surprises. The real question is whether the workflow is complex enough to justify the management layer.
Design/debug exercise (10 min)¶
Modeled: The loan-approval workflow's pull_credit step fails with HTTP 503. The manager classifies this as transient (network issue, retry policy allows 3 attempts). The second attempt succeeds. No replan needed. The manager records: {step: pull_credit, failure_class: transient, action: retry, attempts_used: 2, result: success}.
Your turn: The same workflow's compliance_check step returns an unexpected value: compliance_flag: "system_error" (not in the success set ["pass", "review"] defined in Plan v1). The success signal isn't met. Write: (1) the failure classification, (2) the chosen action, (3) the replan record if applicable.
From memory: Close this file and sketch: the three-layer architecture (planner, tracker, policy), the step_record fields for one step of a workflow you know, and the failure classification tree with five branches.
Operational memory¶
A plan-execution manager transforms a static workflow graph into an adaptive system. The graph defines what can execute; the manager decides what should execute, monitors what actually happened, and responds when reality contradicts the plan. Its three responsibilities — plan generation, execution tracking, and deviation response — can and should be implemented by separate subsystems: an LLM planner for creativity, a deterministic tracker for ground truth, and a policy engine for auditable decisions.
The single highest-leverage piece is failure classification. Teams that retry every failure identically (the "twelve retries on the wrong file" pattern) waste enormous budget and time. Teams that classify failures into transient/deterministic/assumption-broken and respond with the appropriate lever — retry, fallback, replan, or escalate — converge on solutions faster and cheaper.
Remember:
- Plans are hypotheses; execution produces evidence that confirms or refutes them
- Step records need success signals, failure classes, and preconditions, not just names
- Failure classification (not retry count) determines the correct response action
- Separate planner, tracker, and policy engine to prevent hallucinated progress
- Scoped replanning preserves trusted state; full regeneration discards it unnecessarily
- Budget, SLA, and retry exhaustion are first-class inputs to the policy engine
- The execution tracker serves three audiences (user, operator, debugger) with different projections of the same state
Bridge. The plan-execution manager decides retry, fallback, replan, or escalate. But "escalate" means a human must enter the loop — and that creates a completely different kind of pause. The workflow must package context, wait across human-time gaps, and resume cleanly after a decision that may take hours. That's human-in-the-loop design. → 08-human-in-the-loop.md