13. Honest admission — what orchestration still can't guarantee

~15 min read. Files 01–12 built a complete orchestration system: graphs, typed state, routing, patterns, checkpoints, planning, human gates, recovery, replanning, tenancy, and testing. This file is for intellectual honesty. What still breaks? What remains unsolved? What would a thoughtful engineer admit they don't yet know how to guarantee — even with every mechanism in this module deployed correctly?

Built on the first-principles overview in 00-first-principles.md. Every pressure defined there — coordination cost, durability vs latency, plan freshness, handoff fidelity, human-time asymmetry, tenant isolation, testability — has been addressed with mechanisms. This file asks: where do those mechanisms still fall short?


What this module accomplished and what it cannot claim

This module taught: how to decompose tasks into executable steps, assign agents to those steps, shape the workflow graph, manage state across boundaries, implement durable execution with LangGraph, govern plan execution with tracking and classification, pause for human judgment, survive crashes with checkpoints, adapt with dynamic replanning, serve multiple tenants fairly, and test the whole system. That's real capability.

What it cannot claim: that these mechanisms make workflow outcomes correct. They make outcomes governable — visible, auditable, recoverable, testable. The difference matters. A well-orchestrated system can still produce a wrong answer. It just fails in a way you can inspect, explain, and fix. An un-orchestrated system fails silently.


Open problem 1: initial plan quality

The plan-execution manager (file 07) can track plans, classify failures, and trigger replanning. But the quality of the initial plan still depends on the LLM's ability to decompose ambiguous goals into executable steps. Orchestration does not solve this.

User goal: "Fix the flaky tests in the payment service"

Good decomposition (requires deep understanding):
├── identify which tests are flaky (not just failing)
├── classify flakiness cause (timing, ordering, state leak, external dependency)
├── fix the root cause per category
├── verify the fix reduces flake rate without masking real failures
└── add deterministic replay protection

Naive decomposition (surface-level):
├── find failing tests
├── fix them
└── run tests

The naive plan looks reasonable. It will waste enormous budget because "fix them" hides the real complexity. No amount of execution tracking fixes a plan that starts from a shallow understanding of the problem.

Current state of the art: initial plan quality correlates strongly with the model's domain knowledge and the specificity of the goal. Vague goals produce vague plans. Complex goals produce incomplete plans. This improves with better models — but it's not solved by orchestration.


Open problem 2: emergent failures across steps

Files 01–12 can test individual nodes, conditional edges, state contracts, and checkpoint-resume sequences. But some failures only emerge from the interaction of steps that are individually correct:

Individually correct:
├── credit check returns 720 (accurate)
├── compliance check returns "pass" (valid given score)
├── state propagation works perfectly
└── decision issues "approved" (correct given inputs)

Emergent failure:
└── the credit score is 6 months stale (cached from prior application)
    and the applicant's financial situation deteriorated
    → every step was correct given its inputs
    → the overall decision was wrong because the system
       trusted upstream data without freshness verification

No individual step was broken. The orchestration worked perfectly. The failure was in an implicit assumption (data freshness) that no explicit mechanism checked. These emergent failures are the hardest class to prevent because they require reasoning about what's missing from the workflow, not what's present and broken.
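One way to turn an implicit assumption such as data freshness into an explicit check is to carry a retrieval timestamp with the value and verify it where the value is used. The sketch below is illustrative only: `CreditScore`, `StaleInputError`, `require_fresh`, and the 30-day limit are hypothetical names and values, not part of this module's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CreditScore:
    value: int
    fetched_at: datetime  # when the score was actually retrieved upstream


class StaleInputError(Exception):
    """Raised when an input is older than the step is willing to trust."""


def require_fresh(score: CreditScore, max_age: timedelta, now: datetime) -> CreditScore:
    """Make the 'this data is current' assumption an explicit, failing check."""
    age = now - score.fetched_at
    if age > max_age:
        raise StaleInputError(
            f"credit score is {age.days} days old (limit {max_age.days})"
        )
    return score


# Hypothetical scenario: a score cached roughly six months ago.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
cached = CreditScore(value=720, fetched_at=now - timedelta(days=180))

try:
    require_fresh(cached, max_age=timedelta(days=30), now=now)
    outcome = "trusted"
except StaleInputError:
    outcome = "refetch"  # route to a step that pulls a new score
```

A guard like this only covers assumptions someone thought to name. The open problem is the assumptions nobody listed.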


Open problem 3: evaluation of multi-step outcomes

Testing orchestration (file 12) verifies that control flow is correct. But "correct control flow producing correct outcomes" requires evaluating the quality of multi-step results — and this remains hard.

How do you score:

  • Whether a research workflow searched enough before synthesizing?
  • Whether a coding workflow's fix addresses the right root cause?
  • Whether a compliance decision weighted the right factors?
  • Whether a replan was wise vs merely triggered?

Single-step evaluation (is this model output good?) is tractable. Multi-step evaluation (did this sequence of decisions collectively produce a good outcome?) is research-frontier territory. Most teams fall back to "did the user accept the result?" — which is a lagging signal, biased by presentation quality, and unavailable for fully automated workflows.


Open problem 4: confidence calibration for escalation

File 08 established that approval gates should trigger based on explicit conditions (amount > threshold, compliance flag, confidence below threshold). But setting those thresholds correctly is an empirical challenge that remains partially unsolved:

  • Too sensitive → everything escalates → human bottleneck → automation value destroyed
  • Too permissive → risky decisions automated → failures in production → trust destroyed
  • Correct threshold → varies by tenant, time, task type, and organisational risk appetite

There's no universal formula. Teams calibrate through production experience (which means some incorrect decisions happen first) or through expensive human-in-the-loop evaluation (which doesn't scale). The threshold is always a tradeoff between false positives (unnecessary escalation) and false negatives (missed escalation), and the optimal point depends on domain-specific error costs.
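The tradeoff can be written as arithmetic: for a labelled sample of past cases, count unnecessary escalations and missed escalations at each candidate threshold and weight them by their costs. This is a minimal sketch under invented assumptions. The function names, the six sample cases, and the cost figures are all hypothetical and chosen only to show that the best threshold moves when the error costs change.

```python
def expected_cost(threshold, cases, cost_fp, cost_fn):
    """Average cost per case when escalating every case with confidence < threshold.

    cases: list of (confidence, was_actually_risky) pairs.
    cost_fp: cost of escalating a case that was fine (unnecessary review).
    cost_fn: cost of automating a case that was risky (missed escalation).
    """
    total = 0.0
    for confidence, risky in cases:
        escalated = confidence < threshold
        if escalated and not risky:
            total += cost_fp
        elif not escalated and risky:
            total += cost_fn
    return total / len(cases)


def best_threshold(candidates, cases, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest expected cost."""
    return min(candidates, key=lambda t: expected_cost(t, cases, cost_fp, cost_fn))


# Hypothetical labelled history: (model confidence, was the case actually risky?)
cases = [(0.95, False), (0.90, False), (0.85, True),
         (0.70, False), (0.60, True), (0.40, True)]
candidates = [0.5, 0.65, 0.75, 0.9]

# Missed escalations are expensive (e.g. regulated decisions): escalate more.
high_stakes = best_threshold(candidates, cases, cost_fp=1, cost_fn=50)

# Reviews are expensive and misses are cheap: escalate less.
low_stakes = best_threshold(candidates, cases, cost_fp=10, cost_fn=5)
```

The same history yields different thresholds for the two cost settings, which is why the labelled cases and the cost estimates both have to come from the domain.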


Open problem 5: state compression without information loss

File 05 established that state should be compressed — each step receives only what it declared as input. But deciding what to compress requires judgment about what future steps might need. This is a prediction problem:

State after research phase:
├── 15,000 tokens of browsing notes
├── 8 source URLs
├── 3 key findings
├── 2 uncertainty flags
└── 1 contradicting source

Compressed state for synthesis:
├── 3 key findings
├── 2 uncertainty flags
└── 1 source summary

Lost in compression:
├── the contradicting source's specific wording
├── context that would help the synthesis agent
│   weigh the uncertainty correctly
└── details that a human reviewer would need
    to verify the conclusion

The compression was reasonable. The lost information turned out to be critical, but that was visible only in hindsight. State compression that is safe in most cases can silently degrade quality in specific cases. No current mechanism perfectly balances compression (for performance and focus) against preservation (for correctness and auditability).
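One partial mitigation is to compress what the next step reads while keeping a pointer to the uncompressed material, so a reviewer or a later step can recover it. The sketch below assumes an in-memory dictionary as a stand-in for durable storage. `ARTIFACTS`, `archive`, `compress_for_synthesis`, and the field names are hypothetical.

```python
import hashlib
from typing import Dict

ARTIFACTS: Dict[str, str] = {}  # stand-in for durable blob storage (assumption)


def archive(text: str) -> str:
    """Store the raw text and return a short content-derived key."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    ARTIFACTS[key] = text
    return key


def compress_for_synthesis(research: dict) -> dict:
    """Pass on the summary fields plus a reference to the full notes."""
    return {
        "key_findings": research["key_findings"],
        "uncertainty_flags": research["uncertainty_flags"],
        "source_summary": research["source_summary"],
        # The bulky notes are not copied forward, only referenced.
        "raw_notes_ref": archive(research["browsing_notes"]),
    }


research_state = {
    "browsing_notes": "long notes, including the contradicting source's exact wording",
    "key_findings": ["finding A", "finding B", "finding C"],
    "uncertainty_flags": ["flag 1", "flag 2"],
    "source_summary": "summary of the main source",
}
compressed = compress_for_synthesis(research_state)
recovered = ARTIFACTS[compressed["raw_notes_ref"]]
```

This keeps the detail recoverable, but it does not decide when the synthesis step should go and read it. That judgment is still the open part.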


Open problem 6: checkpoint migration at velocity

File 09 addressed schema evolution for checkpoints. In practice, fast-moving teams update workflow schemas weekly. Long-running workflows (human-in-the-loop with multi-day pauses) may span multiple schema versions. The migration challenge compounds:

  • Version 1 → 2: add field (easy, use default)
  • Version 2 → 3: rename field (migration function)
  • Version 3 → 4: change type (conversion, may fail)
  • Version 1 → 4: compose all three migrations (complex, error-prone)

At high velocity, the "checkpoint migration" path becomes its own source of bugs. Teams sometimes choose "fail and restart" over "migrate and resume" when the migration chain is too long — which defeats the purpose of checkpoints for long-lived workflows.
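The version chain above can be sketched as a table of single-step migrations applied in order until the checkpoint reaches the current version. This is an illustration under assumptions, not LangGraph's checkpoint API: the checkpoint layout (`version` plus `state`), the field names, and every function here are hypothetical.

```python
def v1_to_v2(state: dict) -> dict:
    # Add field: safe, because a default exists.
    return {**state, "priority": state.get("priority", "normal")}


def v2_to_v3(state: dict) -> dict:
    # Rename field: user_id becomes applicant_id.
    state = dict(state)
    state["applicant_id"] = state.pop("user_id")
    return state


def v3_to_v4(state: dict) -> dict:
    # Change type: amount moves from str to float and may raise ValueError.
    return {**state, "amount": float(state["amount"])}


MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3, 3: v3_to_v4}
LATEST_VERSION = 4


def migrate(checkpoint: dict) -> dict:
    """Apply each single-step migration in order until the latest version."""
    state = dict(checkpoint["state"])
    version = checkpoint["version"]
    while version < LATEST_VERSION:
        state = MIGRATIONS[version](state)
        version += 1
    return {"version": version, "state": state}


old = {"version": 1, "state": {"user_id": "u-7", "amount": "1200.50"}}
migrated = migrate(old)

# A checkpoint whose data cannot be converted fails partway through the chain.
bad = {"version": 1, "state": {"user_id": "u-8", "amount": "n/a"}}
try:
    migrate(bad)
    bad_result = "migrated"
except ValueError:
    bad_result = "failed at v3 -> v4"
```

Each link is simple on its own. The risk grows with the length of the chain and with how many old checkpoints contain data that a later conversion never anticipated.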


Open problem 7: routing policy learning

File 03 established that routing should be based on step requirements, not defaults. But optimal routing (which model for which step? which tool? which fallback?) depends on production experience. The dream: learn routing policies from traces of successful and failed workflows. The reality: this is a reinforcement learning problem with sparse rewards, high variance, and safety constraints. An incorrect routing policy learned from production can degrade service across all tenants.

Current practice: manual routing rules tuned by engineers reviewing traces. Better than nothing, but doesn't scale with the number of routing decisions required as workflows grow in complexity.


What a senior engineer would say in an interview

"We built a control plane that makes multi-step AI workflows governable: explicit graphs, typed state contracts, durable checkpoints, failure classification, scoped replanning, tenant isolation, and comprehensive testing. The system is operationally mature — you can inspect why any decision was made, recover from crashes without side-effect duplication, and adapt when plans break.

What remains hard: initial planning quality still depends on model capability. Emergent failures from step interactions are difficult to predict. Multi-step evaluation lacks good metrics. Confidence calibration for human gates is empirical and domain-specific. State compression can silently lose critical context. Checkpoint migration becomes error-prone when schemas change quickly. And routing policy optimisation is still mostly manual.

The honest position: orchestration makes complex agent behaviour controllable, not correct. Correctness still requires good decomposition, domain understanding, careful evaluation, and human oversight at high-consequence decision points. The control plane ensures that when things go wrong, you can see why, fix the mechanism, and prevent recurrence. That's valuable. It's not magic."


The maturity spectrum

Level 1: Ad-hoc
  Agent loop with no explicit control plane
  Failures are invisible, recovery is restart from scratch

Level 2: Structured
  Explicit graph, typed state, fixed plan
  Failures visible in traces, but no adaptive response

Level 3: Governed
  Plan-execution manager, failure classification, checkpoints
  System adapts to transient failures, recovers from crashes

Level 4: Adaptive
  Dynamic replanning, scoped revision, backtracking
  System adapts to structural assumption breaks

Level 5: Platform
  Multi-tenant, fair scheduling, policy-aware routing, comprehensive testing
  System serves many users reliably with per-tenant guarantees

Level ???: Self-improving
  Routing learned from traces, plans improve from outcomes,
  thresholds calibrate from production data
  → still mostly aspirational for production systems

Most teams operate between Level 2 and Level 3. This module teaches up to Level 5. Level "self-improving" remains a research problem with production safety concerns.


Operational signals — symptoms of hitting these open problems

Planning quality weakness:

  • High replan frequency (> 30% of runs) → initial plans are shallow
  • Replans don't fix the problem → the real issue is in goal understanding, not plan structure

Emergent failure:

  • Workflow passes all tests but users report wrong outcomes → interaction assumptions not tested
  • Post-mortem reveals "every step was correct, but the answer was wrong" → implicit assumption gap

Evaluation gap:

  • Can't explain why a workflow's output was good or bad except by human judgment → lacking multi-step metrics
  • Quality varies across task types with no signal until user feedback → lagging evaluation

Escalation miscalibration:

  • Human reviewers approve 99% → threshold too sensitive, gates provide false security
  • Production incidents on un-escalated cases → threshold too permissive, real risks automated
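Two of these signals are simple ratios that can be computed from run records. The sketch below assumes each run logs whether it replanned, whether it escalated, and whether the human approved. `RunRecord` and the function names are hypothetical, and the 30% and 99% limits are taken from the symptoms listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunRecord:
    replanned: bool
    escalated: bool
    human_approved: bool  # only meaningful when escalated is True


def replan_rate(runs: List[RunRecord]) -> float:
    """Share of runs that replanned at least once."""
    return sum(r.replanned for r in runs) / len(runs)


def approval_rate(runs: List[RunRecord]) -> Optional[float]:
    """Share of escalated runs the reviewer approved, or None if none escalated."""
    escalated = [r for r in runs if r.escalated]
    if not escalated:
        return None
    return sum(r.human_approved for r in escalated) / len(escalated)


def signals(runs: List[RunRecord], replan_limit=0.30, approval_limit=0.99) -> List[str]:
    """Return the names of the symptoms present in this batch of runs."""
    found = []
    if replan_rate(runs) > replan_limit:
        found.append("shallow initial plans")
    rate = approval_rate(runs)
    if rate is not None and rate >= approval_limit:
        found.append("over-sensitive escalation threshold")
    return found


# Hypothetical batch: 4 of 10 runs replanned, 5 escalated, all 5 approved.
runs = (
    [RunRecord(replanned=True, escalated=True, human_approved=True)] * 4
    + [RunRecord(replanned=False, escalated=True, human_approved=True)]
    + [RunRecord(replanned=False, escalated=False, human_approved=False)] * 5
)
detected = signals(runs)
```

The emergent-failure and evaluation-gap symptoms have no equivalent ratio, which is part of why they are harder to catch.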


Real-world examples of these open problems

  • OpenAI Deep Research — planning quality for open-ended queries varies widely; some queries produce excellent multi-source synthesis while others miss obvious sources due to shallow initial decomposition
  • Devin by Cognition — coding workflows sometimes "solve" the wrong problem when the initial diagnosis is incorrect; the system iterates on a wrong assumption despite technically correct execution
  • GitHub Copilot coding agent — success depends heavily on whether the initial search identifies the right files; wrong starting point leads to correct-looking but wrong-target fixes
  • Claude computer use — multi-step desktop automation reveals emergent failures when UI state changes between steps in ways the plan didn't anticipate
  • Intercom Fin — confidence threshold tuning for escalation requires months of production data per domain; new topic areas start with conservative thresholds that over-escalate
  • Harvey (legal AI) — state compression for legal document analysis faces the information-loss problem directly; compressing case details for summary sometimes drops the clause that determines the legal outcome
  • Stripe Radar — fraud detection routing evolved from manual rules toward learned policies, but safety constraints make deployment of learned routing changes conservative and slow
  • Microsoft Security Copilot — investigation workflows reveal the evaluation problem: "was the investigation thorough enough?" has no easy metric, and the cost of false negatives (missed threats) is extreme

Recall checkpoint

  1. What does "governable but not guaranteed correct" mean for orchestration?
  2. Why are emergent failures harder than single-step failures?
  3. Why is multi-step evaluation still an open research problem?
  4. What makes escalation threshold calibration inherently empirical?
  5. How can state compression silently degrade decision quality?
  6. Why does schema evolution become harder at development velocity?
  7. What distinguishes Level 3 (governed) from Level 5 (platform) maturity?

Interview Q&A

Q: Why is orchestration necessary even though it doesn't guarantee correct outcomes?
A: Because without orchestration, complex workflows are ungovernable: invisible, unrecoverable, untestable. Orchestration makes behaviour explicit so you can inspect, debug, adapt, and improve. Correctness is a higher bar that also requires domain quality, evaluation, and sometimes human judgment.
Common wrong answer to avoid: "Because orchestration makes systems reliable." It improves reliability but doesn't guarantee correctness; these are different properties.

Q: Why are emergent multi-step failures harder to prevent than single-step bugs?
A: Each step can be individually correct while the combined sequence produces a wrong outcome, typically because implicit assumptions (data freshness, scope completeness, context adequacy) aren't explicitly checked anywhere. Testing individual steps doesn't reveal interaction problems.
Common wrong answer to avoid: "Because there are more steps to go wrong." Count isn't the issue; it's that correct components can compose into incorrect systems.

Q: What honest limitations would you acknowledge about a well-orchestrated agent system?
A: Planning quality depends on model capability. State compression can lose critical context. Escalation thresholds are empirical, not provably correct. Multi-step evaluation lacks reliable metrics. Routing optimisation is mostly manual. And the system is as good as its implicit assumptions, which are hard to enumerate exhaustively.
Common wrong answer to avoid: "It's just a matter of better models." Better models help planning, but the structural challenges (evaluation, calibration, compression, emergent failure) persist regardless of model quality.

Q: Why might a team still ship a system with these open problems?
A: Because "governable with known gaps" is far better than "ungovernable with unknown gaps." Explicit orchestration makes problems visible, debuggable, and fixable through iteration. Waiting for perfection means shipping nothing while ad-hoc systems accumulate invisible failures.
Common wrong answer to avoid: "Because deadlines force shipping imperfect systems." Deadlines matter, but the engineering reason is that explicit gaps with mitigation strategies are genuinely preferable to implicit gaps with no visibility.

Q: How should teams prioritise which open problems to address?
A: By consequence severity. If wrong decisions cause high financial or safety cost → invest in escalation calibration and human oversight. If replanning frequency is high → invest in plan quality. If state compression causes quality degradation → invest in selective full-context preservation. Domain-specific error costs determine priority.
Common wrong answer to avoid: "Fix the most technically interesting problem first." Interest isn't the criterion; business and safety impact is.

Q: What's the relationship between better models and these open problems?
A: Better models directly improve planning quality and may improve confidence calibration. They don't solve: emergent interaction failures (architectural problem), multi-step evaluation (measurement problem), schema migration (engineering problem), or tenant isolation (systems problem). Models are one input to orchestration quality, not the whole answer.
Common wrong answer to avoid: "Better models solve everything eventually." Some problems are structural, not capability-limited.


Design/debug exercise (10 min)

Modeled: The loan-approval workflow consistently approves applications from a specific employer. Investigation reveals: the compliance agent's training data over-represents that employer as low-risk, the credit check returns legitimate scores, and the workflow executes perfectly. The failure is not in orchestration — it's in the implicit assumption that model outputs are free of systematic bias. A monitoring system that tracks approval rates by employer would catch this; the workflow's test suite (file 12) cannot.
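The monitoring idea in the modeled answer can be sketched as a per-employer approval rate compared against the overall rate. Everything here is hypothetical: the function names, the `employer_a` and `employer_b` labels, the minimum sample size, and the 20-point gap are illustrative choices, not recommended values.

```python
from collections import defaultdict


def approval_stats(decisions):
    """Return {employer: (approval_rate, sample_count)} from (employer, approved) pairs."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for employer, ok in decisions:
        total[employer] += 1
        approved[employer] += int(ok)
    return {e: (approved[e] / total[e], total[e]) for e in total}


def flag_outliers(decisions, min_samples=20, max_gap=0.20):
    """List employers whose approval rate differs from the overall rate by more than max_gap."""
    overall = sum(int(ok) for _, ok in decisions) / len(decisions)
    return sorted(
        employer
        for employer, (rate, n) in approval_stats(decisions).items()
        if n >= min_samples and abs(rate - overall) > max_gap
    )


# Hypothetical decision log: employer_a is always approved, employer_b half the time.
decisions = (
    [("employer_a", True)] * 30
    + [("employer_b", True)] * 35
    + [("employer_b", False)] * 35
)
flagged = flag_outliers(decisions)
```

A flag is a prompt for human investigation, not proof of bias. A group can legitimately differ from the overall rate.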

Your turn: Identify one open problem in a workflow system you've built or studied. Write: (1) which level on the maturity spectrum it operates at, (2) which open problem from this file it faces, (3) what symptom would signal the problem in production, (4) what mitigation (not solution) you'd deploy while the problem remains open.

From memory: Close this file and sketch: the seven open problems (one line each), the maturity spectrum (5 levels), and the senior interview answer (what we built + what remains hard + the honest position).


Operational memory

This module built orchestration as a control plane: graphs, state, checkpoints, planning, gates, recovery, replanning, tenancy, testing. The system makes multi-step AI workflows governable — visible, recoverable, testable, auditable. What it cannot guarantee is correctness — that outcomes are right, that plans decompose well, that compression preserves what matters, that escalation thresholds are calibrated, that step interactions don't produce emergent failures.

The honest engineering position: these are real open problems, not indicators that orchestration is useless. A system with known gaps and explicit mitigation strategies is far more trustworthy than a system with invisible gaps and no governance. The maturity path is clear: from ad-hoc (Level 1) through governed (Level 3) to platform (Level 5). Self-improving systems that learn routing and calibrate thresholds from production remain largely aspirational and an area of active research.

Remember:

  • Orchestration makes behaviour governable, not guaranteed correct; these are different claims
  • Planning quality is the weakest link: vague goals produce vague plans regardless of execution sophistication
  • Emergent failures arise from step interactions, not from individual step bugs
  • Multi-step evaluation lacks good metrics; most teams rely on lagging user feedback
  • Escalation thresholds are empirical: no formula, only domain-specific calibration
  • State compression trades performance for potential information loss, and the loss is only visible in hindsight
  • "Governable with known gaps" beats "ungovernable with unknown gaps": ship with honesty, iterate with data

Bridge. This module covered the control plane above agents: how workflows coordinate multiple steps, survive failures, adapt to change, and serve many tenants fairly. The next module shifts from coordination to memory: how do agents retain useful experience across sessions, conversations, and time? → ../../11_long_term_memory_state/00-first-principles.md