12. Testing orchestration: proving the control plane works before production teaches you through pain
~19 min read. Testing a single prompt is straightforward: does the output look right? Testing orchestration is fundamentally harder: did the right plan form, did the right branch execute, did state transition correctly, did recovery resume without duplicate side effects, did tenant isolation hold? This file builds a testing strategy for workflow systems that exercises control logic, not just language quality.
Built on the first-principles overview in 00-first-principles.md. Testability — the pressure that every other property (durability, routing, fairness, recovery) is meaningless if you can't verify it works — culminates here. The control plane's explicit structure (typed state, named edges, checkpoints) is testable by construction — if you write the tests.
What file 11 established and what remains
File 11 added multi-tenant isolation: scheduling fairness, budget enforcement, policy routing, data residency. Every file from 01 to 11 introduced mechanisms — decomposition, routing, patterns, state, graphs, planning, human gates, checkpoints, replanning, tenancy. The question across all of them: how do you know these mechanisms actually work under realistic conditions? Not "does the final output look reasonable?" but "did the control plane execute the right logic?"
The workflow that looked correct because nobody tested the branches
A loan-approval platform passes UAT: happy-path workflow completes, decision is issued, customer notified. Shipped to production. Three weeks later: a compliance check returns "review" for the first time (UAT data never triggered that branch). The conditional edge routes to human_review. The human_review node crashes — it references a state field (reviewer_pool_id) that was never populated because the compliance agent doesn't write it. The field exists in the TypedDict schema but has no source in the "review" branch.
The bug was invisible for three weeks because no test exercised the conditional branch. The state contract between compliance_check and human_review was never validated. Every happy-path test passed. The first real conditional-branch execution failed.
Test coverage gap:
Tested:
└── verify → credit → compliance(pass) → decide (happy path)
Not tested:
├── compliance → flag:"review" → human_review (conditional branch never triggered)
├── human_review state contract
├── timeout/escalation path
├── crash during compliance → resume
└── concurrent workflows competing for reviewer
Teacher voice. Orchestration bugs don't live in individual nodes. They live in transitions: the edge between nodes, the conditional routing logic, the state that crosses boundaries, the checkpoint-resume sequence, the timeout escalation path. If you only test nodes in isolation and the final output end-to-end, you miss the entire middle layer where orchestration failures occur.
The invariant: test the control plane independently from model quality
A workflow test should be able to say: "Given these node outputs (mocked), did the control plane make the correct routing, state management, checkpoint, and recovery decisions?" This separates orchestration correctness from model quality — both matter, but they require different testing strategies.
Separation of testing concerns:
MODEL QUALITY (tested separately):
├── Does the credit agent produce accurate scores?
├── Does the compliance agent cite correct policies?
└── Does the synthesis agent produce coherent decisions?
ORCHESTRATION CORRECTNESS (tested here):
├── Does the graph route correctly given specific state?
├── Does state propagate correctly across step boundaries?
├── Does checkpoint-resume produce identical behaviour?
├── Does the timeout policy escalate at the right time?
├── Does tenant isolation prevent cross-tenant reads?
└── Does budget enforcement halt execution at the right moment?
The testing pyramid for workflow systems
┌─────────────────────┐
│   E2E smoke tests   │  (few, expensive, realistic)
├─────────────────────┤
│  Graph integration  │  (moderate, mocked models,
│        tests        │   real control logic)
├─────────────────────┤
│   Node unit tests   │  (many, cheap, deterministic)
└─────────────────────┘
Layer 1: Node unit tests — test individual nodes with fixed inputs and assert correct outputs. Mock external APIs. Verify that each node reads only declared inputs and writes only declared outputs.
Layer 2: Graph integration tests — test the assembled graph with mocked node implementations. Verify routing, state transitions, conditional edges, checkpoint contents, and timeout behaviour. This is where most orchestration bugs live.
Layer 3: E2E smoke tests — test complete workflows with real (or near-real) model calls and tool integrations. Verify that the full stack works in a staging environment. Expensive, flaky, but necessary for catching integration mismatches.
The critical middle layer (graph integration) is what most teams skip. They write node tests (easy) and E2E tests (feel comprehensive), but skip the layer that tests orchestration logic specifically.
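The Layer 1 requirement that each node reads only declared inputs and writes only declared outputs can be checked mechanically. The following is a minimal sketch, not a framework feature: `RecordingState`, `check_node_contract`, and both example nodes are hypothetical names introduced here for illustration.

```python
from collections.abc import Mapping


class RecordingState(Mapping):
    """Read-only view of workflow state that records which keys a node reads."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = set()

    def __getitem__(self, key):
        self.reads.add(key)
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def check_node_contract(node, state, declared_inputs, declared_outputs):
    """Run a node once and return (undeclared_reads, undeclared_writes)."""
    view = RecordingState(state)
    update = node(view)
    undeclared_reads = view.reads - set(declared_inputs)
    undeclared_writes = set(update) - set(declared_outputs)
    return undeclared_reads, undeclared_writes


# Hypothetical node: scores credit from the applicant's income.
def pull_credit(state):
    score = 720 if state["income"] > 50_000 else 640
    return {"credit_score": score}


# Hypothetical buggy node: depends on and emits fields it never declared.
def leaky_compliance(state):
    flag = "review" if state["credit_score"] < 700 else "pass"
    return {"compliance_flag": flag, "debug_blob": state["applicant_id"]}
```

A node unit test then asserts that both returned sets are empty. For `leaky_compliance`, declared with input `credit_score` and output `compliance_flag`, the check reports `applicant_id` as an undeclared read and `debug_blob` as an undeclared write.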
Graph integration tests: testing control flow, not language
A graph integration test mocks all nodes with deterministic functions, then asserts that the graph's control flow behaves correctly:
# Mock node that returns a predetermined output
def mock_compliance_check(state):
    return {"compliance_flag": "review"}  # force the conditional branch


# Graph integration test
def test_compliance_review_triggers_human_gate():
    graph = build_loan_graph(
        verify_node=mock_verify_success,
        credit_node=mock_credit_720,
        compliance_node=mock_compliance_check,  # returns "review"
        human_node=mock_human_approve,
        decision_node=mock_decision_approve,
    )
    result = graph.invoke({"applicant_id": "test-123"})

    # Assert the conditional edge fired correctly
    assert result["compliance_flag"] == "review"
    assert "human_override" in result  # human node executed
    assert result["human_override"] == "approved"
What this tests: the conditional edge logic, the state propagation between compliance and human review, and the graph's routing behaviour. What it does NOT test: whether the compliance model actually produces accurate flags. That's a model test, not an orchestration test.
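Asserting on final state alone can miss a wrong route that happens to produce similar fields, so it can help to record the path as well. The sketch below assumes a deliberately tiny, framework-free graph runner (`run_graph`, `build_nodes`, and `EDGES` are hypothetical names, not LangGraph APIs) so that the routing function is the only real logic under test.

```python
def run_graph(nodes, edges, entry, state):
    """Tiny graph runner: `edges` maps a node name to a router function
    that inspects state and returns the next node name (or None to stop)."""
    state = dict(state)
    path = []
    current = entry
    while current is not None:
        path.append(current)
        state.update(nodes[current](state))
        current = edges[current](state)
    return state, path


def route_after_compliance(state):
    # The conditional edge under test.
    routes = {"pass": "decide", "review": "human_review", "fail": "decide"}
    return routes[state["compliance_flag"]]


def build_nodes(compliance_flag):
    """Mocked nodes; only the compliance output varies between tests."""
    return {
        "verify": lambda s: {"identity_ok": True},
        "credit": lambda s: {"credit_score": 720},
        "compliance": lambda s: {"compliance_flag": compliance_flag},
        "human_review": lambda s: {"human_override": "approved"},
        "decide": lambda s: {
            "decision": "denied" if s["compliance_flag"] == "fail" else "approved"
        },
    }


EDGES = {
    "verify": lambda s: "credit",
    "credit": lambda s: "compliance",
    "compliance": route_after_compliance,
    "human_review": lambda s: "decide",
    "decide": lambda s: None,
}


def test_review_branch_visits_human_review():
    state, path = run_graph(build_nodes("review"), EDGES, "verify", {"applicant_id": "test-123"})
    assert path == ["verify", "credit", "compliance", "human_review", "decide"]
    assert state["human_override"] == "approved"


def test_pass_branch_skips_human_review():
    state, path = run_graph(build_nodes("pass"), EDGES, "verify", {"applicant_id": "test-123"})
    assert "human_review" not in path
    assert "human_override" not in state
```

The negative test matters as much as the positive one: it shows the human gate does not fire when the flag is "pass".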
State contract tests: verifying cross-node data flow
The most common orchestration bug: node A writes a field, node B expects to read it, but the field name, type, or presence assumption is wrong. State contract tests verify these assumptions explicitly:
def test_compliance_node_produces_fields_human_review_needs():
    """The human_review node expects compliance_flag, flag_reason, and
    reviewer_pool_id in state. Verify compliance_check produces all three."""
    compliance_output = compliance_check_node(mock_state_after_credit)

    # compliance_flag must be one of the values the conditional edge routes on
    assert "compliance_flag" in compliance_output
    assert compliance_output["compliance_flag"] in ["pass", "review", "fail"]

    # This catches the UAT bug: human_review also needs reviewer_pool_id.
    # If compliance doesn't produce it, this test fails BEFORE production.
    required_for_human_review = ["compliance_flag", "flag_reason", "reviewer_pool_id"]
    for field in required_for_human_review:
        assert field in compliance_output, f"Missing field: {field}"
State contract tests are cheap, fast, and catch the exact class of bug described in the opening example. They should exist for every edge in the graph.
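One way to get a contract check for every edge without hand-writing each test is to declare what each node writes and requires, then walk each path. This is a sketch under assumed declarations: the `NODE_WRITES` and `NODE_REQUIRES` tables are illustrative and would in practice come from the workflow's own schema.

```python
# Hypothetical declarations: what each node writes and what it requires.
NODE_WRITES = {
    "verify_identity": {"identity_ok"},
    "pull_credit": {"credit_score"},
    "compliance_check": {"compliance_flag", "flag_reason"},
    "human_review": {"human_override"},
}
NODE_REQUIRES = {
    "pull_credit": {"identity_ok"},
    "compliance_check": {"credit_score"},
    "human_review": {"compliance_flag", "flag_reason", "reviewer_pool_id"},
    "issue_decision": {"compliance_flag"},
}
INITIAL_FIELDS = {"applicant_id"}


def missing_fields_along(path):
    """Walk one path through the graph and report, per node, the required
    fields that no upstream node (or the initial input) has written."""
    available = set(INITIAL_FIELDS)
    problems = {}
    for node in path:
        missing = NODE_REQUIRES.get(node, set()) - available
        if missing:
            problems[node] = missing
        available |= NODE_WRITES.get(node, set())
    return problems
```

With these declarations the happy path reports no problems, while the review path reports that `human_review` is missing `reviewer_pool_id`, which is the gap from the opening example.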
Checkpoint-resume regression tests
The highest-value orchestration test: crash the workflow at a specific point, resume from checkpoint, and verify the result is identical to an uninterrupted run (minus timing).
def test_resume_after_credit_check_produces_same_result():
    """Crash after pull_credit, resume from checkpoint.
    Final decision should be identical to uninterrupted run."""
    # Run 1: uninterrupted
    uninterrupted_result = run_loan_workflow(applicant_id="test-456")

    # Reset call counters so the assertions below count only run 2
    # (assumes the service mocks are unittest.mock objects)
    mock_identity_service.reset_mock()
    mock_credit_bureau.reset_mock()

    # Run 2: crash after step 2, resume from checkpoint
    checkpoint = run_until_step(graph, "pull_credit", {"applicant_id": "test-456"})
    resumed_result = resume_from_checkpoint(graph, checkpoint)

    # Core assertion: same final decision
    assert resumed_result["decision"] == uninterrupted_result["decision"]
    # Bonus assertion: verify_identity NOT re-executed on resume (check call count)
    assert mock_identity_service.call_count == 1  # not 2
    # Bonus assertion: pull_credit NOT re-executed on resume
    assert mock_credit_bureau.call_count == 1  # not 2
These tests verify that checkpoints actually prevent re-execution of completed steps and that resume produces equivalent outcomes. Without them, checkpoint bugs are invisible until a production crash reveals them.
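To make the mechanics concrete, here is a self-contained sketch of the same idea with an in-memory checkpoint store. Everything in it (`run_workflow`, `Crash`, `make_steps`, the dict-based store) is a hypothetical stand-in for a real engine's checkpointer, chosen so the test runs without external dependencies.

```python
class Crash(Exception):
    """Raised by the test harness to simulate a process dying."""


def run_workflow(steps, state, checkpoints, crash_after=None):
    """Run steps in order, skipping any that a checkpoint marks complete.

    `checkpoints` is a dict acting as the durable store: it holds the list
    of completed step names and the state saved after the last one.
    """
    done = checkpoints.setdefault("done", [])
    state = dict(checkpoints.get("state", state))
    for name, fn in steps:
        if name in done:
            continue  # completed before the crash; do not re-execute
        state.update(fn(state))
        done.append(name)
        checkpoints["state"] = dict(state)
        if name == crash_after:
            raise Crash(name)
    return state


def make_steps(calls):
    """Mocked steps that count their own invocations in `calls`."""
    def step(name, update):
        def fn(state):
            calls[name] = calls.get(name, 0) + 1
            return update
        return name, fn
    return [
        step("verify_identity", {"identity_ok": True}),
        step("pull_credit", {"credit_score": 720}),
        step("issue_decision", {"decision": "approved"}),
    ]


def test_resume_skips_completed_steps():
    baseline = run_workflow(make_steps({}), {"applicant_id": "test-456"}, {})

    calls, store = {}, {}
    steps = make_steps(calls)
    try:
        run_workflow(steps, {"applicant_id": "test-456"}, store, crash_after="pull_credit")
    except Crash:
        pass
    resumed = run_workflow(steps, {"applicant_id": "test-456"}, store)

    assert resumed == baseline
    assert calls == {"verify_identity": 1, "pull_credit": 1, "issue_decision": 1}
```

The crashed run and the resumed run share one `calls` dict and one `store`, while the baseline uses fresh ones, so the per-step counts cover exactly the crash-plus-resume sequence.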
Threaded example: comprehensive test suite for loan-approval
The loan-approval workflow needs this minimum test matrix:
Node unit tests (fast, many):
├── verify_identity: valid ID → true, invalid → false, timeout → retry
├── pull_credit: success → score, 503 → retry, invalid applicant → error
├── compliance_check: clean → "pass", flagged → "review", invalid → "fail"
├── human_review: approve → "approved", deny → "denied", timeout → escalate
└── issue_decision: approved → write loan DB, denied → write denial
Graph integration tests (medium, core):
├── happy_path: all pass → decision issued
├── compliance_review_branch: flag="review" → human gate fires
├── compliance_fail_branch: flag="fail" → denial without human gate
├── human_timeout: 48h expires → escalation triggers
├── budget_exhaustion: mid-workflow budget exceeded → graceful degradation
├── replan_trigger: employment inconsistency → verification branch added
└── concurrent_branches: parallel credit + sanctions → merge correctly
Checkpoint-resume tests (medium, critical):
├── crash_after_verify: resume skips verify, runs credit
├── crash_after_credit: resume skips both, continues at compliance
├── crash_during_human_wait: resume returns to paused state correctly
├── crash_during_decision: idempotency key prevents duplicate DB write
└── schema_v2_resume: old checkpoint loads under new schema with migration
Multi-tenant tests (medium, platform):
├── tenant_isolation: workflow A can't read workflow B's state
├── concurrency_cap: tenant limited to N concurrent workflows
├── budget_enforcement: over-budget workflow halts gracefully
├── routing_policy: EU tenant → EU model endpoint
└── reviewer_queue: enterprise tenant → dedicated pool
Chaos tests (slow, few):
├── random_node_failure: recovery handles any single-node crash
├── checkpoint_write_failure: system detects and retries safely
├── reviewer_timeout_during_burst: escalation handles queue pressure
├── concurrent_resume: two instances resume same workflow → only one wins
└── rate_limit_during_parallel: parallel branches respect per-tenant limits
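The crash_during_decision case in the matrix above relies on an idempotency key. A minimal sketch of what that test exercises follows; `LoanDB` is an in-memory stand-in and the key format is an assumption made for illustration.

```python
class LoanDB:
    """In-memory stand-in for the loan database (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def write_decision(self, idempotency_key, decision):
        # setdefault keeps the first write; a retry with the same key is a no-op.
        return self.rows.setdefault(idempotency_key, decision)


def issue_decision(state, db):
    """Write the decision once per workflow, however often the step re-runs."""
    key = f"{state['workflow_id']}:issue_decision"
    stored = db.write_decision(key, state["decision"])
    return {"decision_written": stored}


def test_rerun_after_crash_does_not_duplicate_write():
    db = LoanDB()
    state = {"workflow_id": "wf-1", "decision": "approved"}
    issue_decision(state, db)   # first attempt: write lands, then the process dies
    issue_decision(state, db)   # resume re-executes the step
    assert len(db.rows) == 1
```

The key is derived from the workflow and step identity rather than from the payload, so a re-executed step maps to the same row even if its inputs were recomputed.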
Chaos testing: revealing interaction failures
Orchestration failures are often combinatorial — they emerge from the interaction of two or more conditions that are individually fine:
Single conditions (individually safe):
├── one node retries once → fine
├── checkpoint write takes 50ms → fine
├── human reviewer is slow → fine
Combined conditions (failure emerges):
├── node retries + checkpoint write slow + concurrent resume
│ → two instances both resume, one commits duplicate side effect
├── human reviewer slow + timeout fires + reviewer responds late
│ → both timeout-escalation AND original approval arrive
├── parallel branches + rate limit hit + one branch slower
│ → merge node waits indefinitely for slow branch
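The last combination, a merge node waiting indefinitely for a slow branch, is usually addressed by giving the merge a deadline. The sketch below uses the standard library's `concurrent.futures.wait` with a timeout; `merge_branches` and its return shape are assumptions for illustration, not a specific engine's merge semantics.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def merge_branches(branches, timeout_s):
    """Run branches in parallel and merge whatever finishes within timeout_s.

    Returns (merged_state, timed_out_branch_names) so the caller can decide
    whether to escalate, retry, or continue with partial results.
    """
    pool = ThreadPoolExecutor(max_workers=len(branches))
    futures = {pool.submit(fn): name for name, fn in branches.items()}
    done, not_done = wait(list(futures), timeout=timeout_s)
    merged = {}
    for future in done:
        merged.update(future.result())
    pool.shutdown(wait=False)  # do not block the merge on stragglers
    return merged, sorted(futures[f] for f in not_done)


def test_merge_reports_slow_branch_instead_of_hanging():
    release = threading.Event()

    def slow_sanctions():
        release.wait(timeout=5)  # simulated slow branch, released by the test
        return {"sanctions_ok": True}

    branches = {"credit": lambda: {"credit_score": 720}, "sanctions": slow_sanctions}
    try:
        merged, timed_out = merge_branches(branches, timeout_s=0.2)
    finally:
        release.set()  # let the worker thread finish so the process can exit
    assert merged == {"credit_score": 720}
    assert timed_out == ["sanctions"]
```

The test controls the slow branch with an event rather than a long sleep, which keeps the run short and avoids a timing-dependent assertion.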
Chaos test patterns for orchestration:
| Chaos injection | What it reveals |
|---|---|
| Terminate process mid-node-execution | Checkpoint-resume correctness |
| Delay checkpoint writes by 500ms | Race conditions in concurrent resume |
| Return malformed output from one node | Error handling and state contract validation |
| Fire human timeout while approval is in-flight | Race between escalation and original response |
| Exhaust rate limit mid-parallel-branch | Merge-node timeout handling, partial completion |
| Inject stale checkpoint (old schema) | Schema migration correctness |
| Submit same workflow twice simultaneously | Idempotency at workflow level |
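As one example from the table, the race between a timeout escalation and a late approval can be made safe by letting the gate resolve exactly once. This is a sketch with hypothetical names (`HumanGate`, `race`); a real system would typically enforce the same rule with a conditional write in its state store rather than an in-process lock.

```python
import threading


class HumanGate:
    """A pending human decision that can be resolved exactly once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.outcome = None

    def resolve(self, outcome):
        """Record the outcome if the gate is still open; report whether this call won."""
        with self._lock:
            if self.outcome is not None:
                return False
            self.outcome = outcome
            return True


def race(gate, outcomes):
    """Fire all resolutions at once from separate threads; return the winners."""
    barrier = threading.Barrier(len(outcomes))
    winners = []

    def attempt(outcome):
        barrier.wait()  # line the threads up so they contend for the gate
        if gate.resolve(outcome):
            winners.append(outcome)

    threads = [threading.Thread(target=attempt, args=(o,)) for o in outcomes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


def test_timeout_and_late_approval_resolve_once():
    for _ in range(50):  # repeat to give the race many chances to misbehave
        gate = HumanGate()
        winners = race(gate, ["escalated", "approved"])
        assert len(winners) == 1
        assert gate.outcome == winners[0]
```

The test does not assert which side wins, only that exactly one does and that the gate's stored outcome matches it.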
Deterministic replay: using checkpoints for test reproducibility
Checkpointed workflows are inherently replayable — you can load any intermediate state and re-execute from that point. This makes orchestration testing more tractable than testing arbitrary distributed systems:
def test_specific_failure_scenario():
    """Reproduce the exact scenario from production incident INC-4421:
    compliance returned 'review', human approved, but resume failed
    because policy_version changed during the 6-hour wait."""
    # Load the exact checkpoint from the incident
    checkpoint = load_checkpoint("prod-backup/incident-4421/ckpt-3")

    # Simulate the resume with the state that existed at the time
    result = resume_from_checkpoint(graph, checkpoint)

    # Before fix: this crashed with KeyError on policy_version_v2
    # After fix: migration handles the schema change gracefully
    assert result["decision"] in ["approved", "denied"]
This pattern turns production incidents into regression tests — load the exact checkpoint that caused the failure, verify the fix handles it correctly.
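The fix that such a regression test protects is often a small migration applied when an old checkpoint is loaded. A minimal sketch follows. The rename from `policy_version` to `policy_version_v2` and the `schema_version` field are assumptions made for illustration; the incident description above only says that the resume hit a KeyError on `policy_version_v2`.

```python
CURRENT_SCHEMA = 2


def migrate_checkpoint(checkpoint):
    """Upgrade an old checkpoint dict to the current schema without mutating it.

    Assumed change for illustration: v1 stored `policy_version`; v2 renamed
    it to `policy_version_v2` and added `schema_version`.
    """
    migrated = dict(checkpoint)
    version = migrated.get("schema_version", 1)
    if version == 1:
        migrated["policy_version_v2"] = migrated.pop("policy_version", "unknown")
        version = 2
    if version != CURRENT_SCHEMA:
        raise ValueError(f"unsupported checkpoint schema: {version}")
    migrated["schema_version"] = version
    return migrated


def test_v1_checkpoint_loads_under_v2():
    old = {"compliance_flag": "review", "policy_version": "2024-01"}
    new = migrate_checkpoint(old)
    assert new["policy_version_v2"] == "2024-01"
    assert new["schema_version"] == 2
    assert "policy_version" not in new
    assert migrate_checkpoint(new) == new  # migrating twice changes nothing
```

Rejecting unknown versions explicitly is a design choice: a checkpoint from a newer schema fails loudly at load time instead of crashing deep inside a node.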
Operational signals: test health indicators
Healthy test suite:
- Graph integration tests run in < 30s (mocked nodes are fast)
- Checkpoint-resume tests run in < 60s
- Chaos tests run in < 5 min (even with injected delays)
- Zero orchestration bugs escape to production that a graph test would have caught
- New conditional branches always added with corresponding graph tests
First degrading signal:
- Test suite time growing beyond 5 min → likely running real models in integration tests (should be mocked)
- Flaky tests appearing in graph integration → usually race conditions in async node execution or timing-dependent assertions
- Production incidents not reproducible from checkpoints → checkpoint data incomplete or test environment diverges
Misleading metric:
- "Code coverage %": high node coverage doesn't mean edge coverage. You can cover 100% of node code without testing a single conditional branch.
- "All tests pass": if the test matrix doesn't include branch tests, resume tests, and failure injection, green doesn't mean correct.
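A more honest number than line coverage is edge coverage: which graph edges did the test suite actually traverse? The sketch below assumes each graph test records the list of node names it visited; `edge_coverage` is a hypothetical helper, not part of any coverage tool.

```python
def edge_coverage(graph_edges, recorded_paths):
    """Return (fraction_covered, uncovered_edges) given the node paths
    that the test suite actually exercised."""
    graph_edges = set(graph_edges)
    exercised = set()
    for path in recorded_paths:
        exercised.update(zip(path, path[1:]))  # consecutive pairs are edges
    uncovered = graph_edges - exercised
    covered = len(graph_edges) - len(uncovered)
    return covered / len(graph_edges), uncovered


LOAN_EDGES = {
    ("verify", "credit"),
    ("credit", "compliance"),
    ("compliance", "decide"),
    ("compliance", "human_review"),
    ("human_review", "decide"),
}

# A suite containing only the happy path.
fraction, uncovered = edge_coverage(
    LOAN_EDGES, [["verify", "credit", "compliance", "decide"]]
)
```

Here the happy path alone executes every node except `human_review`, yet covers only three of the five edges. Both uncovered edges belong to the review branch.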
Boundary of applicability
Works unusually well:
- Graph- and workflow-engine orchestration (LangGraph, Temporal, Step Functions) where control flow is explicit and deterministic
- Workflows with typed state schemas that can be validated programmatically
- Systems with checkpointing where replay is a native capability
Becomes pathological:
- Purely stochastic workflows where "correct" depends entirely on model judgment (no deterministic control logic to test)
- Workflows that change structure every run (dynamically generated graphs), because test fixtures can't keep up
- Trivial workflows (single node, no branching), where orchestration testing adds overhead without value
Scale that invalidates naive intuition:
- At 50+ conditional branches, one test per branch is still tractable, but covering combinations of branches becomes a combinatorial challenge. Prioritise by risk (highest-consequence branches first).
- At frequent graph changes (weekly schema updates), test maintenance cost rivals development cost. Invest in auto-generated contract tests from schema definitions.
Failure-prone assumption: "E2E tests are sufficient for orchestration"
The seductive wrong idea: "If the final output looks correct, the orchestration must be working."
The correction: A correct final output can mask many orchestration bugs: wrong branch taken (but output happened to be similar), state not properly propagated (but the model compensated), checkpoint not actually saved (but the workflow didn't crash this time), duplicate side effect executed (but the system tolerated it). E2E tests check the output. Graph integration tests check the path. You need both.
Real-world implementations
- LangGraph test utilities — the LangGraph framework provides state inspection, step-by-step execution, and checkpoint loading for deterministic replay in test environments
- Temporal test framework — workflow-level mocking, activity stubbing, and time manipulation for testing long-running workflow logic without real delays
- AWS Step Functions Local — local execution environment for testing state machine logic, branch conditions, and error handling without deploying to AWS
- Uber Cadence test framework — deterministic workflow replay from recorded history, activity mocking, and clock manipulation for timeout testing
- Netflix Conductor — test mode that records workflow execution and replays from any task for regression testing
- Prefect — task-level testing utilities with mocked dependencies and flow-level integration testing with local execution
- Dagster — asset-level unit testing with mocked I/O, graph-level integration testing, and materialisation replay for debugging
- Restate — journal-based replay enables deterministic re-execution from any point for both testing and debugging
Recall checkpoint
- Why do most orchestration bugs live in transitions (edges) rather than in nodes?
- What does the graph integration test layer test that E2E tests miss?
- Why are state contract tests high-value and low-cost?
- What does a checkpoint-resume regression test verify?
- Why are chaos tests necessary in addition to deterministic tests?
- How does checkpoint-based replay enable incident reproduction?
- Why is "code coverage" misleading for workflow testing?
Interview Q&A
Q: Why separate orchestration testing from model quality testing? A: They test different failure modes. A model can produce perfect outputs while the orchestration routes them incorrectly, loses state, or creates duplicate side effects. A model can produce mediocre outputs while the orchestration handles routing, recovery, and state flawlessly. Both need testing, but with different tools and assertions. Common wrong answer to avoid: "Because model testing is harder." Difficulty isn't the reason — it's that they're testing different properties of the system.
Q: Why is the graph integration test layer the most commonly skipped? A: Because it requires mocking nodes to isolate control logic — more setup than node unit tests. And because E2E tests feel more comprehensive (they exercise the full stack). But graph tests catch routing bugs, state contract violations, and conditional edge errors that neither node tests nor E2E tests reliably surface. Common wrong answer to avoid: "Because teams are lazy." It's not laziness — it's that the value of the middle layer is unintuitive until you've been burned by an orchestration bug that unit tests and E2E tests both missed.
Q: What makes checkpoint-resume tests uniquely valuable? A: They verify the core promise of durable execution: that a crashed workflow resumes correctly without duplicate side effects. Without these tests, you're trusting that checkpoint logic works without evidence — and checkpoint bugs are invisible until a production crash reveals them. Common wrong answer to avoid: "For compliance auditing." Auditing benefits, but the primary value is verifying that recovery actually works before you need it.
Q: Why can't E2E tests alone verify orchestration correctness? A: E2E tests check the final output. A correct output can be produced via the wrong path (model compensated for routing error), with duplicate side effects (happened to be idempotent this time), or without proper state propagation (model had enough context anyway). Graph tests verify the path, not just the destination. Common wrong answer to avoid: "Because E2E tests are slow." Speed is one issue, but the deeper problem is that they test the wrong property (output) rather than the right property (control flow).
Q: How do chaos tests differ from fuzz testing for orchestration? A: Fuzz testing generates random inputs to individual components. Chaos testing injects realistic infrastructure failures (delays, crashes, partial writes) into the execution environment to reveal interaction bugs between components. The difference is system-level failure injection vs component-level input variation. Common wrong answer to avoid: "They're basically the same thing." They share randomness, but chaos testing targets the infrastructure layer while fuzz testing targets the input layer.
Q: When does orchestration testing become more expensive than the bugs it prevents? A: For trivial workflows (single path, no branching, no state, no checkpoints) where the only failure mode is "model gave bad output" — which orchestration tests can't catch anyway. Also for one-off workflows that won't be reused (the test outlives the code). For any workflow that's deployed to production with branching or state, the testing cost is almost always justified. Common wrong answer to avoid: "When you have good monitoring instead." Monitoring detects bugs in production. Testing prevents them from reaching production. They're complementary, not substitutes.
Design/debug exercise (10 min)
Modeled: The loan-approval workflow has a bug: when compliance_flag == "fail", the graph routes to issue_decision with decision = "denied" — but forgets to write denial_reason to state. The issue_decision node expects denial_reason and crashes with KeyError. Write a state contract test: assert that the compliance-fail path produces all fields that issue_decision requires.
Your turn: Write three tests for the loan-approval workflow: (1) a graph integration test that verifies the "review" branch triggers human_review, (2) a checkpoint-resume test that verifies crash-after-credit doesn't re-call the bureau, (3) a chaos test that verifies the workflow handles simultaneous reviewer timeout + reviewer response correctly.
From memory: Close this file and sketch: the testing pyramid (3 layers with examples), the test matrix for a 5-step workflow (categories and cases), and one chaos test scenario with the specific failure it reveals.
Operational memory
Orchestration bugs live in transitions, not in nodes. A node can produce correct output while the graph routes it incorrectly, drops state fields at boundaries, fails to checkpoint, or resumes with duplicate side effects. The testing strategy must verify control-plane behaviour independently from model quality — which means mocking nodes to isolate routing logic, validating state contracts at every edge, and verifying that checkpoint-resume produces equivalent outcomes to uninterrupted execution.
The three-layer pyramid (node tests → graph integration tests → E2E smoke tests) is the structure, but the critical insight is that most teams skip the middle layer. Graph integration tests are where orchestration bugs are caught cheapest: they use mocked nodes (fast, deterministic) but exercise real routing, real state propagation, and real conditional logic. Chaos tests add the failure-interaction dimension that deterministic tests can't cover.
Remember:
- Test the control plane separately from model quality: mock nodes to isolate routing and state logic
- State contract tests verify that every edge carries the fields the downstream node expects
- Checkpoint-resume tests verify the core durability promise: crash → resume → same result, no duplicates
- Most orchestration bugs live in conditional branches that are never tested (only happy path)
- Chaos tests reveal interaction failures: conditions that are individually fine but combined create corruption
- Checkpoint replay turns production incidents into regression tests: load the exact failure state
- "Code coverage" is misleading: you can cover 100% of nodes without testing a single conditional edge
Bridge. We've built a complete orchestration system: graphs, plans, state, gates, checkpoints, recovery, tenancy, testing. The final file is for honesty. What still breaks? What remains unsolved? What would a thoughtful engineer admit they don't yet know how to guarantee? → 13-honest-admission.md