04. When tools chain and race — Parallel independence and serial handoff¶
~16 min read. Tool composition has exactly two patterns. Parallel: independent tools fire together, wall-clock collapses to the slowest branch. Chaining: one tool's output feeds the next tool's input, latency accumulates honestly because each step discovers what the next step needs. Every production agent mixes both — fan out the reads, gather, chain the writes. The discipline is knowing which pattern applies at each boundary, because parallel failures are harder to diagnose and chained failures cascade.
The 1,110 ms that should have been 540¶
A commute-assistant agent receives: "Can I make the 9 am meeting on time from home?"
It needs three facts: weather (rain affects the route), calendar (confirm the 9 am slot), and live traffic. Each tool call has its own latency, independent of the others.
The first engineer ships sequential calls: 350 + 220 + 500 + 40 = 1,110 ms. Nothing forces this order — none of the calls depend on each other's output — but the loop ran them serially because that is what loops do by default.
A second engineer notices the independence and fires all three concurrently. The slowest branch (traffic, 500 ms) dominates. Total: 500 + 40 = 540 ms. Same cost in tokens, half the wall-clock. The model didn't get smarter; the schedule did.
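The two schedules can be sketched with `asyncio.sleep` standing in for the real tools. This is a minimal illustration, not the agent's actual loop: the tool names, the assignment of 350 ms and 220 ms to weather and calendar, and the 1/10 time scale are assumptions; only the 500 ms traffic figure is stated explicitly above.

```python
import asyncio
import time

# Simulated latencies in seconds, scaled to 1/10 of the figures in the text.
# Which of 350/220 belongs to weather vs calendar is an assumption.
LATENCY = {"weather": 0.035, "calendar": 0.022, "traffic": 0.050}
FINAL_STEP = 0.004  # the trailing 40 ms term from the text's sums, scaled


async def call_tool(name: str) -> str:
    """Stand-in for a real tool call: sleeps, then returns a label."""
    await asyncio.sleep(LATENCY[name])
    return f"{name}: ok"


async def run_sequential() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = [await call_tool(name) for name in LATENCY]  # one at a time
    await asyncio.sleep(FINAL_STEP)
    return results, time.perf_counter() - start


async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather runs the calls concurrently and preserves argument order in its result
    results = await asyncio.gather(*(call_tool(name) for name in LATENCY))
    await asyncio.sleep(FINAL_STEP)
    return list(results), time.perf_counter() - start
```

Both functions return the same results; only the elapsed time differs, which is the point of the paragraph above.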
Now consider a different task: "Refund Priya's delayed order and email her the confirmation." The agent must find Priya's customer ID before it can look up orders, must confirm eligibility before issuing the refund, must have a refund ID before sending the email. Five tools, four handoffs — each step discovers what the next step needs. No parallelism is possible because the dependency is real.
These are the two modes of tool composition. One saves latency by removing artificial waits. The other accepts latency as the cost of discovery. Every agent combines both — and the tension is the same: more composition means more power, but parallel failures are harder to diagnose and chained failures cascade.
What file 03 left open¶
A single tool contract is clear: typed inputs, typed outputs, a schema that declares what the tool does. But real tasks require multiple tools working together. File 03 gave us the individual contract. This file gives us the two ways contracts compose — side by side (parallel) and end to end (chaining) — and the failure modes unique to each.
The independence test — when parallel is safe¶
Three properties must hold simultaneously for a set of tool calls to fan out safely:
Same starting context. Each branch operates on the state that existed at fan-out time. If branch A's call modifies a shared cache that branch B reads, the order of completion matters and "parallel" stops being well-defined. Diagnostic: if these calls ran in either order, would the results be the same?
No shared writes. Two calls writing the same resource — same row, same file, same external state — create a lost-update problem. The later completion wins silently. Diagnostic: does any branch mutate state that any other branch could observe?
Outputs joined later. The agent's next move must wait for all branches and synthesise from the union. If only one branch's result matters, the others are wasted tokens. Diagnostic: would the next think step actually use every branch's result?
When all three hold, fan-out is safe. When any one fails, serialise — or parallelise the safe subset and chain the rest.
parallel-safe set              parallel-unsafe set
├── weather lookup (read)      ├── read order
├── calendar lookup (read)     ├── update order (writes same row)
├── traffic lookup (read)      ├── recompute total (depends on update)
└── combine into advice        └── notify customer (depends on recompute)
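The first two properties can be checked mechanically if each branch declares what it reads and writes. The sketch below assumes such declarations exist; the `Branch` type and the resource labels are hypothetical, and the check cannot see coupling that a branch does not declare.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    name: str
    reads: set[str] = field(default_factory=set)
    writes: set[str] = field(default_factory=set)


def fan_out_conflicts(branches: list[Branch]) -> list[str]:
    """Return readable conflicts; an empty list means no declared overlap."""
    conflicts = []
    for i, a in enumerate(branches):
        for b in branches[i + 1:]:
            # A write in either branch that the other reads or writes breaks independence.
            shared = (a.writes & (b.reads | b.writes)) | (b.writes & a.reads)
            if shared:
                conflicts.append(f"{a.name} <-> {b.name}: {sorted(shared)}")
    return conflicts


safe_set = [
    Branch("weather", reads={"weather_api"}),
    Branch("calendar", reads={"calendar"}),
    Branch("traffic", reads={"traffic_api"}),
]
unsafe_set = [
    Branch("read_order", reads={"order:4481"}),
    Branch("update_order", writes={"order:4481"}),
]
```

An empty result for `safe_set` says only that the declared resources do not overlap. The third property (every output is actually used at the join) still needs a human or the planner to confirm.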
Hidden coupling — four shapes that fake independence¶
"Independent" lies often enough to be dangerous. Four patterns pass the naive independence check but fail in production:
Shared rate-limit bucket. Branches A and B both call the same vendor API. Each passes the agent's per-call check, but they share the vendor's per-account quota. Concurrent burst → 429 → retry storm → the agent DDoS's itself.
Shared cache lock. Both branches write to the same in-memory cache. One acquires the lock, the other blocks. "Parallel" is actually serial in disguise, plus context-switching overhead.
Same downstream write target. Two calls that look like independent reads fan out to the same database write at the backend (a "lookup or create" tool). Both find the row missing, both insert, one hits a unique-constraint violation.
Order-sensitive side effects. Both branches append to a shared list — audit log, notification queue, span tree. Parallel completion makes the order non-deterministic; the log becomes unreliable even though every individual entry is correct.
The unifying diagnostic: what does each branch touch, and what else touches that? If the answer involves shared state at any layer the agent does not control, serialise.
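For the shared rate-limit shape, one common mitigation is to cap in-flight calls per vendor account rather than serialising everything. A minimal sketch using `asyncio.Semaphore`; the `VendorLimiter` class, the limit of 2, and `fake_vendor_lookup` are hypothetical stand-ins.

```python
import asyncio


class VendorLimiter:
    """Caps concurrent calls to one vendor account, however many branches fan out."""

    def __init__(self, max_in_flight: int) -> None:
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, kept for inspection

    async def call(self, tool, *args):
        async with self._sem:  # waits here when the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await tool(*args)
            finally:
                self.in_flight -= 1


async def fake_vendor_lookup(key: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return key.upper()


async def demo() -> tuple[list[str], int]:
    limiter = VendorLimiter(max_in_flight=2)
    keys = ["a", "b", "c", "d", "e"]
    results = await asyncio.gather(*(limiter.call(fake_vendor_lookup, k) for k in keys))
    return list(results), limiter.peak
```

Five branches are requested but at most two reach the vendor at once. This addresses only the quota shape; shared locks, hidden writes, and ordering side effects still call for serialising.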
The gather step — what turns racing into parallelism¶
An agent that fires calls and acts on the first one back has not implemented parallelism — it has implemented racing. The gather step is what makes fan-out reliable.
Partial-failure tolerance. One branch might fail — the traffic API times out, the calendar returns empty instead of an error. Acting on the fast branches without waiting for the slow ones means deciding on an incomplete snapshot. The right design: wait for all branches (within a per-branch timeout), then handle missing data explicitly.
Cross-branch consistency. Even when every branch succeeds individually, the union might be inconsistent — weather says "clear" but calendar says "blocked." The agent must reason about all results together, not branch-by-branch. Gathering before the next think step makes cross-branch reasoning possible.
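A gather step with a per-branch timeout might look like the sketch below. It waits for every branch, converts failures into data, and leaves the decision to the caller. The helper names and the two demo tools are hypothetical.

```python
import asyncio


async def with_timeout(name: str, coro, timeout_s: float):
    """Run one branch; return (name, value, error) instead of raising."""
    try:
        return name, await asyncio.wait_for(coro, timeout_s), None
    except asyncio.TimeoutError:
        return name, None, "timeout"
    except Exception as exc:  # any other branch failure becomes data, not a crash
        return name, None, repr(exc)


async def gather_all(branches: dict, timeout_s: float) -> dict:
    """Wait for every branch, then return {name: {"value": ..., "error": ...}}."""
    outcomes = await asyncio.gather(
        *(with_timeout(name, coro, timeout_s) for name, coro in branches.items())
    )
    return {name: {"value": value, "error": error} for name, value, error in outcomes}


async def fast_weather() -> str:
    return "clear"


async def slow_traffic() -> str:
    await asyncio.sleep(1.0)  # slower than the timeout used in demo()
    return "light"


async def demo() -> dict:
    return await gather_all(
        {"weather": fast_weather(), "traffic": slow_traffic()}, timeout_s=0.05
    )
```

The next think step sees both branches in one structure, including the explicit gap for traffic, so cross-branch reasoning happens on the full union rather than on whichever result arrived first.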
Join policies — what happens when a branch fails¶
The gather step needs an explicit policy:
| Policy | Behaviour | When to use |
|---|---|---|
| All-or-nothing | One failure aborts the batch | Partial data is dangerous (booking without calendar check) |
| Best-effort | Use whatever succeeds; mark gaps | Partial data is useful (summary with one source missing) |
| Required-plus-optional | Mandatory branches must succeed; optional ones enrich | One fact is load-bearing, the rest are nice-to-have |
| Cancel-on-first-failure | One failure cancels remaining in-flight branches | Answer is meaningless without all branches anyway |
The policy is an architectural decision, not a runtime accident. Naming it at design time forces the team to decide what "partial success" means in their domain — before discovering the answer empirically when a branch first fails in production.
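Three of the four policies can be expressed as a function over already-gathered results. This is a sketch under assumptions: `None` marks a failed branch, and the names `JoinPolicy`, `JoinError`, and `apply_join_policy` are illustrative. Cancel-on-first-failure is not modelled because it acts on in-flight tasks rather than on finished results.

```python
from enum import Enum


class JoinPolicy(Enum):
    ALL_OR_NOTHING = "all_or_nothing"
    BEST_EFFORT = "best_effort"
    REQUIRED_PLUS_OPTIONAL = "required_plus_optional"


class JoinError(Exception):
    """Raised when the gathered results do not satisfy the chosen policy."""


def apply_join_policy(
    results: dict, policy: JoinPolicy, required: frozenset = frozenset()
) -> dict:
    """results maps branch name -> value, with None meaning the branch failed.

    Returns {"values": successful branches, "gaps": sorted failed branch names}.
    """
    gaps = sorted(name for name, value in results.items() if value is None)
    values = {name: value for name, value in results.items() if value is not None}

    if policy is JoinPolicy.ALL_OR_NOTHING and gaps:
        raise JoinError(f"branches failed: {gaps}")
    if policy is JoinPolicy.REQUIRED_PLUS_OPTIONAL:
        missing = sorted(required & set(gaps))
        if missing:
            raise JoinError(f"required branches failed: {missing}")
    return {"values": values, "gaps": gaps}
```

Returning the gaps alongside the values keeps "mark gaps" explicit: the model is told which source is missing instead of silently reasoning over a smaller set.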
Parallel anti-patterns¶
Parallelising writes to the same state. Two update_ticket(ticket_id=t_1) calls fan out. Both touch the same status field. The later one wins, the earlier one is silently lost. Fix: never parallelise writes to the same state.
Parallelising the action and its notification. issue_refund(...) and send_customer_email(...) fan out concurrently. The email lands first because it is faster. Then the refund fails. The customer has a confirmation for a refund that didn't happen. Fix: action before notification. Chain, don't fan out.
Parallelising over a shared connection. The database client uses a single persistent connection. Two queries fan out; they queue at the connection layer; nothing is actually parallel, but the agent paid coordination overhead. Fix: use a connection pool, or serialise honestly.
The lesson: parallelism is a scheduling improvement, not a correctness pattern. It saves wall-clock when independence is real. It costs correctness when independence is fake.
The latency arithmetic of fan-out¶
Parallelism's saving is dominated by the slowest branch. When branch latencies are similar, the win is large. When one branch dwarfs the others, the right move is to make the slowest branch faster (caching, smaller query, fallback) — not to parallelise around it.
The breakeven: parallelism is worth the implementation cost (gather logic, join policy, per-branch error handling) when the saving is large enough that the user notices. Roughly: under 100 ms saving is invisible; 100–300 ms feels "snappier"; over 300 ms feels "much faster." For a three-branch fan-out with similar latencies, the saving (sum minus max) approaches two-thirds of the sum; it drops towards half when one branch is noticeably slower, as in the commute example (570 ms saved out of 1,110). So when individual branches exceed 200 ms, parallelism usually earns its wiring.
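The arithmetic is small enough to write down. A sketch using the three branch latencies from the commute example and the rough perception bands above; the function names are illustrative.

```python
def fan_out_saving(latencies_ms: list[int]) -> int:
    """Wall-clock saved by running branches concurrently: sum minus max."""
    return sum(latencies_ms) - max(latencies_ms)


def perceived(saving_ms: int) -> str:
    """Rough perception bands from the text above."""
    if saving_ms < 100:
        return "invisible"
    if saving_ms <= 300:
        return "snappier"
    return "much faster"


commute = [350, 220, 500]   # branch latencies from the opening example
lopsided = [40, 30, 900]    # one branch dwarfs the others
```

For `commute` the saving is 570 ms, well into the "much faster" band. For `lopsided` it is 70 ms: the fan-out is correct but not worth its wiring, and the effort belongs in the 900 ms branch.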
Past five to six branches per turn, three pressures appear: provider concurrency caps (providers often limit concurrent calls per request, commonly somewhere around five to ten) silently serialise extras; every branch costs tokens and API dollars without proportional value; and the gathered context grows fast: eight branches at 500 tokens each is 4,000 tokens of observation on top of whatever the model already held. The practical sweet spot is two to five branches. Anything wider should usually become a single richer batch tool that combines what the multiple branches were fetching.
Crossing the boundary — from fan-out to chain¶
Parallelism handles the case where tools are independent. But most production workflows have steps that genuinely depend on prior results. A customer email becomes a customer ID, which becomes an order list, which becomes an eligibility check, which becomes a refund. Each link produces the handle the next link needs. The latency is the cost of discovery — unavoidable because the information doesn't exist until the previous step runs.
This is tool chaining — the schedule for everything parallelism can't handle. The shape is not slower by accident; it is slower because information is being uncovered step by step.
The danger in chaining is rarely the calls themselves — each individual tool usually works fine in isolation. The danger lives in the spaces between calls: the transfer points where one tool's output becomes another tool's input. A five-link chain has four such boundaries; each is a chance for a small drift to cascade into a confidently wrong answer at the end.
Typed handoffs — why boundaries need validation¶
The naive approach passes tool A's raw output directly into tool B. This works until it doesn't, and when it doesn't, the failure is silent.
tool A result
│
├── normalise (reshape into tool B's expected format)
├── validate (assert fields tool B needs are present and typed)
└── feed tool B
Three recurring drift modes at transfer points:
Field-name drift. Tool A returns customerId (camelCase); tool B expects customer_id (snake_case). The framework silently coerces or ignores the unknown field; the next call goes in with null.
Format drift. Tool A returns "2026-05-22T14:02:00Z"; tool B expects a Unix epoch integer. The string is interpreted as zero by a lenient parser; the query returns empty; the agent concludes the customer has no orders.
Envelope drift. Tool A wraps results in {"data": [...], "meta": {...}}; tool B expects the bare list. The agent passes the wrapped envelope; tool B treats the object as a single-element list; the chain runs on the metadata blob.
Each is invisible until something downstream behaves unexpectedly. The fix: every transfer point gets an explicit normalise + validate step. The cost is tiny (a few lines per boundary); the benefit is that the chain fails fast and loud on a drift, instead of producing a confidently wrong answer three links later.
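One way to write the normalise + validate step for the three drift modes is sketched below. The function names and `HandoffError` are hypothetical, and the timestamp parser assumes the exact `YYYY-MM-DDTHH:MM:SSZ` shape shown above.

```python
from datetime import datetime, timezone


class HandoffError(ValueError):
    """Raised when a tool result cannot be safely passed to the next tool."""


def unwrap_envelope(result):
    """Envelope drift: accept {"data": [...], "meta": {...}} or a bare list."""
    if isinstance(result, dict) and "data" in result:
        result = result["data"]
    if not isinstance(result, list):
        raise HandoffError(f"expected a list, got {type(result).__name__}")
    return result


def normalise_customer(record: dict) -> dict:
    """Field-name drift: map customerId to customer_id, then validate it."""
    customer_id = record.get("customer_id", record.get("customerId"))
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(customer_id, bool) or not isinstance(customer_id, int) or customer_id <= 0:
        raise HandoffError(f"customer_id must be a positive int, got {customer_id!r}")
    return {"customer_id": customer_id}


def iso_to_epoch(timestamp: str) -> int:
    """Format drift: convert an ISO 8601 UTC string to Unix epoch seconds."""
    try:
        parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError as exc:
        raise HandoffError(f"bad timestamp {timestamp!r}") from exc
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())
```

Each function either returns a value in the shape the next tool expects or raises at the boundary, so a drift stops the chain at the link where it happened.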
The refund chain — five links, four handoffs¶
Link 1: find_customer_by_email("priya@example.com")
→ customer_id = 882
Link 2: list_orders(customer_id=882)
→ latest_order = 4481, days_late = 9
Link 3: get_refund_policy(order_id=4481)
→ eligible = true, reason = "delay"
Link 4: issue_refund(order_id=4481, reason="delay")
→ refund_id = rf_77, amount = ₹1,250
Link 5: send_customer_email(refund_id="rf_77")
→ status = sent
Total wall clock: 120 + 180 + 90 + 250 + 110 = 750 ms. The agent cannot reduce this by parallelising — every link depends on the previous link's output. The 750 ms is the honest cost of discovery.
Transfer-point validation at each boundary:
Link 1 → 2: customer_id must be int, > 0
Link 2 → 3: pick latest order by created_at desc; status in {delivered, shipped}
Link 3 → 4: reason from policy must match issue_refund's enum; eligible == true
Link 4 → 5: refund_id format ^rf_\d+$; refund record exists (read-after-write)
These four checks cost almost nothing at runtime and catch the drift modes above at the boundary where they occur, before they can cascade.
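The four boundary checks can be sketched as plain functions. Several details are assumptions for illustration: the refund reasons other than "delay", the dictionary field names, ISO-string `created_at` values that sort chronologically, and the `refund_exists` callable standing in for the read-after-write lookup.

```python
import re

REFUND_ID_PATTERN = re.compile(r"^rf_\d+$")
REFUND_REASONS = {"delay", "damaged", "wrong_item"}  # hypothetical enum values
ALLOWED_STATUSES = {"delivered", "shipped"}


def check_1_to_2(customer_id) -> int:
    if type(customer_id) is not int or customer_id <= 0:
        raise ValueError(f"bad customer_id: {customer_id!r}")
    return customer_id


def check_2_to_3(orders: list[dict]) -> dict:
    """Pick the latest order by created_at and require an allowed status."""
    if not orders:
        raise ValueError("no orders to hand off")
    latest = max(orders, key=lambda order: order["created_at"])
    if latest["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"latest order has status {latest['status']!r}")
    return latest


def check_3_to_4(policy: dict) -> str:
    if policy.get("eligible") is not True:
        raise ValueError("order is not refund-eligible")
    if policy.get("reason") not in REFUND_REASONS:
        raise ValueError(f"unknown refund reason: {policy.get('reason')!r}")
    return policy["reason"]


def check_4_to_5(refund_id, refund_exists) -> str:
    """refund_exists is a callable performing the read-after-write lookup."""
    if not isinstance(refund_id, str) or not REFUND_ID_PATTERN.match(refund_id):
        raise ValueError(f"bad refund_id: {refund_id!r}")
    if not refund_exists(refund_id):
        raise ValueError(f"refund {refund_id} not found after write")
    return refund_id
```

Raising on an empty order list is a design choice in this sketch: it forces the caller to decide whether "no orders" is a real answer or a symptom of an earlier bad handoff.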
Chain corruption — how small drifts amplify¶
bad link 1 output → wrong customer ID
wrong customer ID → empty order list
empty order list → "no orders found" fallback
"no orders found" → refund denied
refund denied (wrong) → customer told "no order to refund"
(three perfectly refundable orders exist)
The agent at link 5 has no way to know link 1 was the root cause. From its perspective, "no orders found" was an observation, not a hypothesis. Every link inherited the corruption silently. The agent's confidence grows as the chain progresses — each successful call feels like progress — even though the underlying state was wrong from step one.
Four risks that compound specifically in chains:
- Stale output. A customer ID looked up an hour ago; the customer was since merged into a different account.
- Wrong field mapping. Tool A returns id; the agent passes it to a field that expects account_id. Both are integers; the SDK doesn't complain.
- Partial failure in the middle. Link 3 returns success but with missing fields. The agent passes a partially-populated handoff to the write step.
- Side-effecting write before final validation. The refund commits, the email sends, then a consistency check fails — but both are already out.
The safeguard is always at the transfer points: validate eagerly, fail loud, never let a partial result become input to a state-changing step.
Scratchpad state — auditing the chain¶
Chains longer than three links benefit from explicit intermediate state — a typed object holding the facts each link produced:
after link 1: { customer_id: 882, email: "priya@example.com" }
after link 2: { ..., latest_order_id: 4481, days_late: 9 }
after link 3: { ..., refund_eligible: true, reason: "delay" }
after link 4: { ..., refund_id: "rf_77", amount: 1250 }
after link 5: { ..., email_status: "sent", sent_at: "2026-05-22T14:04:12Z" }
Three benefits: Auditability — two months later, reconstruct which fact came from which link. Retry safety — when link 4 fails, the retry continues from the established state, not from link 1. Cross-turn continuity — if the chain spans an approval gate or overnight pause, the scratchpad is what the next turn picks up from.
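A typed scratchpad for the refund chain could be sketched as a dataclass. The class name, the `provenance` map, and the `next_link` helper are assumptions added for illustration; the field names follow the listing above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class RefundScratchpad:
    email: str
    customer_id: Optional[int] = None
    latest_order_id: Optional[int] = None
    days_late: Optional[int] = None
    refund_eligible: Optional[bool] = None
    reason: Optional[str] = None
    refund_id: Optional[str] = None
    amount: Optional[int] = None
    email_status: Optional[str] = None
    provenance: dict = field(default_factory=dict)  # fact name -> link that produced it

    def record(self, link: int, **facts) -> None:
        """Store facts produced by one link and remember where they came from."""
        allowed = {f.name for f in fields(self)} - {"provenance"}
        for name, value in facts.items():
            if name not in allowed:
                raise AttributeError(f"unknown scratchpad field: {name}")
            setattr(self, name, value)
            self.provenance[name] = link

    def next_link(self) -> int:
        """First link whose output is still missing; this is where a retry resumes."""
        markers = ["customer_id", "latest_order_id", "refund_eligible", "refund_id", "email_status"]
        for link, marker in enumerate(markers, start=1):
            if getattr(self, marker) is None:
                return link
        return len(markers) + 1  # chain complete
```

The `provenance` map serves the audit case, and `next_link` serves the retry case: after a failure at link 4, the chain resumes from 4 with the first three facts intact.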
Checkpoints and idempotency — making chains replay-safe¶
When a chain has an irreversible step (a write, a charge, a message) followed by more work, two reliability rules earn their keep:
Checkpoint after the last read, before the first irreversible write.
find customer → find order → check policy → CHECKPOINT → issue refund → send email
                                            (durable     (idempotency
                                             scratchpad)  key required)
If the process crashes between the refund and the email, the recovery path is "send the email": the checkpoint preserves the gathered facts, and the refund's idempotency key ensures a replay does not issue it twice. The placement is load-bearing: steps before the checkpoint are read-only and can be safely re-run on resume. Steps after are state-changing and need deduplication to replay safely. Place the checkpoint elsewhere and you either over-checkpoint (every step writes to durable storage, expensive) or under-checkpoint (a crash after the refund loses the fact, and the resume re-issues it).
Idempotency keys on every state-changing link. The refund step carries idempotency_key = f"refund_{order_id}_{reason}". On replay, the backend recognises the duplicate and returns the original result instead of issuing a second refund. Without idempotency keys, replay is a coin flip between lost work and doubled work.
The pattern extends: emails carry message IDs, ticket updates carry version IDs (optimistic concurrency), payments carry transaction IDs. Side effects need replay-safe identifiers. The discipline that ties chaining to reliability is the same discipline as the rest of any production system's safety layers — every irreversible operation needs a handle that makes "did this already run?" answerable.
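The replay behaviour can be demonstrated with an in-memory stand-in for the backend. `FakeRefundBackend` is hypothetical; a real payments API would persist the key-to-result mapping durably. The key format is the one given above.

```python
class FakeRefundBackend:
    """In-memory stand-in for a payments backend that honours idempotency keys."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}
        self.refunds_issued = 0

    def issue_refund(self, order_id: int, reason: str, idempotency_key: str) -> dict:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]  # replay: return the original result
        self.refunds_issued += 1
        result = {
            "refund_id": f"rf_{self.refunds_issued}",
            "order_id": order_id,
            "reason": reason,
        }
        self._by_key[idempotency_key] = result
        return result


def refund_key(order_id: int, reason: str) -> str:
    """Key format taken from the text above."""
    return f"refund_{order_id}_{reason}"


backend = FakeRefundBackend()
first = backend.issue_refund(4481, "delay", refund_key(4481, "delay"))
replay = backend.issue_refund(4481, "delay", refund_key(4481, "delay"))
```

After the replay, `backend.refunds_issued` is still 1 and `replay` is the same record as `first`. The key is derived from stable facts in the scratchpad, so a resumed process regenerates the same key without having to remember it.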
The standard composition — read-fanout-write-chain¶
Real agents rarely run pure chains or pure fan-outs. They mix. The workhorse schedule:
                   ┌── get_customer (read) ──┐
                   │                         │
fan-out reads ────►┤── get_order (read) ─────┼── gather ──→ decide ──→
                   │                         │                │
                   └── get_policy (read) ────┘                │
                                                              ▼
                                                       chain writes:
                                              checkpoint → issue_refund → email
The reads fan out because they are mutually independent and read-only. The gather assembles the union. The decision step reads the gathered state and routes to either issue refund (with checkpoint + idempotency) or escalate to human. The writes chain because each depends on the previous and because side effects need ordering and durability.
A travel-advisory agent follows the same shape: four reads fan out (weather, traffic, calendar, hotel), gather, decide, then one controlled write (book the hotel). The booking must not fan out with the reads, because it would commit on a stale snapshot. The reads fan out because they are read-only; the write chains because it depends on the gathered state.
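The read-fanout-write-chain schedule for the refund case can be sketched end to end. Every tool here is a hypothetical stub returning canned data, the checkpoint is an in-memory dict where production code would use durable storage, and the `log` list exists only to make the order of side effects visible.

```python
import asyncio


# Hypothetical read tools; each returns canned data after a short delay.
async def get_customer(email: str) -> dict:
    await asyncio.sleep(0.01)
    return {"customer_id": 882}


async def get_order(order_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"order_id": order_id, "days_late": 9}


async def get_policy(order_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"eligible": True, "reason": "delay"}


# Hypothetical write tools; the log records the order of side effects.
async def issue_refund(order_id: int, reason: str, idempotency_key: str, log: list) -> dict:
    log.append(f"refund:{idempotency_key}")
    return {"refund_id": "rf_77"}


async def send_customer_email(refund_id: str, log: list) -> dict:
    log.append(f"email:{refund_id}")
    return {"status": "sent"}


async def handle_refund(email: str, order_id: int, log: list) -> str:
    # 1. Fan out the independent reads and gather all of them.
    customer, order, policy = await asyncio.gather(
        get_customer(email), get_order(order_id), get_policy(order_id)
    )
    # 2. Decide on the gathered union before any write fires.
    if not policy["eligible"]:
        log.append("escalate")
        return "escalated"
    # 3. Checkpoint the gathered facts (in memory here; durable in practice).
    checkpoint = {"customer": customer, "order": order, "policy": policy}
    # 4. Chain the writes: refund first, notification second.
    reason = checkpoint["policy"]["reason"]
    key = f"refund_{order_id}_{reason}"
    refund = await issue_refund(order_id, reason, key, log)
    await send_customer_email(refund["refund_id"], log)
    return "refunded"
```

The writes appear in the log strictly after the gather and in a fixed order, which is the property the schedule exists to guarantee.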
Two other compositions appear in production:
Replan-on-failure chaining. When link 3 of a chain fails in a way suggesting the plan was wrong, the agent doesn't just retry — it returns to think and re-plans. The chain becomes a DAG with a feedback edge back to the planning step.
Human-gated chains. Certain links (typically irreversible writes) pause for an approval gate before firing. The chain resumes after the gate clears. The scratchpad persists across the pause; the idempotency key ensures the write is safe even if approval comes hours later.
When the pattern is wrong — anti-patterns in both modes¶
Chained reads that could have fanned out. Two reads that don't depend on each other, run sequentially anyway. Diagnostic: did link N use any output of link N-1? If not, fan them out.
Chains that should have been one richer tool. A five-link chain where every link is in the same backend — CRM lookup, record fetch, related-record fetch, status check, update. The chain ferries data through the agent for no reason; a single backend procedure could produce the final state. Diagnostic: does the agent's behaviour change based on any intermediate's value? If not, fold the chain into one tool.
Fan-outs that should have been one batch tool. Eight per-record lookups fan out when a single batch-lookup tool would return all records in one call. The fan-out costs eight times the token overhead. Diagnostic: is the fan-out wider than five branches with similar schemas? If yes, batch it.
Chains without handoff validation. The dependencies are real but the boundaries have no normalise + validate step. Works on the happy path, silently corrupts on edge cases.
Where this lives in production¶
The pattern across production systems: read fan-out is normal, write fan-out is forbidden, gather always precedes decision, chains carry idempotency keys.
| System | Parallel pattern | Chain pattern |
|---|---|---|
| GitHub Copilot agent | Multi-file search + diagnostics fan out | search → read → edit → test → commit |
| Cursor agent mode | Semantic search + lint + file reads fan out | search → read → propose edit → run test → apply |
| Claude Code | Grep + Read fan out per turn | discover → read → reason → edit → verify |
| Cognition Devin | File inspection + test status + log fetch fan out | plan → execute → verify → iterate |
| Intercom Fin / Zendesk AI | Profile + order + policy lookups fan out | lookup → check → refund or escalate → notify |
| Salesforce Agentforce | Multi-record reads (account, contact, opp) fan out | retrieve → check → update → notify |
| Harvey (legal) | Multi-corpus retrieval (case law, statutes, memos) fans out | lookup → precedent search → draft → citation check |
| LangGraph orchestrators | Explicit fan-out/fan-in nodes | Dependent nodes wired as chain |
| Temporal / Step Functions | Workflow-level fan-out for batch ops | Chained activities with checkpoints |
| Inngest / Mastra | Parallel step.run() calls with gather | Sequential step.run() with built-in idempotency |
The diagram doesn't change across domains. The tools change. The discipline at boundaries is the same.
Recognition — spotting the pattern in unfamiliar code¶
You're reviewing an agent and see:
results = await asyncio.gather(
    get_weather(city),
    get_traffic(route),
    get_calendar(user_id)
)
# immediately call book_hotel(...)
The red flag: the write follows the gather with no validation of the gathered state. What if calendar returned "blocked"? The agent would book a hotel for a conflicted day. The fix: insert a decision step between gather and write that inspects the union and aborts if any mandatory branch contradicts the action.
You see another agent:
customer = await find_customer(email)
orders = await list_orders(customer["customerId"]) # camelCase from tool A
The downstream tool expects customer_id (snake_case). If the framework doesn't auto-map, orders comes back empty and the agent concludes "no orders found." The fix: normalise at the boundary — one line that maps the field name before the next call.
A third pattern to watch for:
# Three reads that look chained but aren't dependent
weather = await get_weather(city)
traffic = await get_traffic(route) # doesn't use weather
calendar = await get_calendar(user_id) # doesn't use traffic
These should be fanned out. The diagnostic: trace which outputs feed which inputs. If there's no data dependency between adjacent calls, the chain is artificial and can be parallelised for free latency.
The wrong model — what people get wrong¶
"Parallel is always faster." Not when one branch dominates latency. Not when branches share a rate-limit bucket. Not when the fan-out is so wide that context bloats the next turn. Parallelism's saving is sum - max; when one branch dwarfs the rest, sum - max approaches zero.
"Chains are just slow code." The latency is the cost of discovery, not laziness. If the agent cannot know step N's input without step N-1's output, the chain is unavoidable. Optimisations go into making individual links faster (caching, batched backends, faster networks), not pretending the chain is parallelisable.
"If individual tools work, the composition works." The composition fails at boundaries, not inside tools. Field-name drift, format drift, envelope drift — all invisible when testing tools in isolation, all catastrophic when tools are wired together in production.
"More tools means more capability." More tools means more boundaries. Each boundary is a chance for drift. Composition's power scales with the number of clean handoffs, not the raw number of tools. Ten tools with validated boundaries beat fifty tools with implicit assumptions.
Interview Q&A¶
Q1. Why is parallelism an independence question, not a speed trick? A. The prerequisite for safe parallel execution is the absence of shared state across branches. Independence guarantees that completion order doesn't change results, making the schedule swappable. When independence is real, parallelism is free latency; when faked, it trades correctness for speed. The framing "is this independent?" is the design question; "can we parallelise?" is downstream. Avoid: "Because GPUs like parallel work." Hardware parallelism is unrelated; the issue is workflow structure.
Q2. What goes wrong if the travel agent fans out the booking alongside reads? A. Two failures. Decision-time corruption — the book fires before reads gather, committing on a stale snapshot (calendar might conflict). State-time corruption — the booking changes the world; if reads then reveal a conflict, the only recovery is a compensating action (cancel + refund). The right design is read-fanout-write-chain. Avoid: "It might be slow but would work." It wouldn't — booking on incomplete information is wrong by definition.
Q3. Your chain "sometimes" produces wrong answers but every individual tool works. Where do you look? A. The transfer points. Chain corruption almost always lives at boundaries — field-name drift, format drift, envelope drift, stale cached outputs. Audit each handoff's normalise + validate step. The tool that "works" in isolation is producing output that doesn't match the next tool's expectations under some edge case. Avoid: "Add more retries." Retries don't fix format mismatches; they amplify them.
Q4. How do you choose between all-or-nothing and best-effort join policy? A. By the cost of acting on incomplete data. If partial data is dangerous (booking without calendar check), use all-or-nothing. If partial data is useful (weather summary with one source missing), use best-effort and mark the gap explicitly. Most agents use both: best-effort on read fan-outs feeding advisory replies, all-or-nothing before writes. Avoid: "Use whichever returns faster." Speed is not the design axis; stakes are.
Q5. Where should checkpoints sit in a chain, and why? A. After expensive read work, before the first irreversible write. Steps before the checkpoint are read-only and safe to replay. Steps after need idempotency keys for safe replay. The checkpoint is the boundary between "safe to re-run" and "needs deduplication to re-run." Place it wrong and you either over-checkpoint (expensive) or under-checkpoint (duplicate side effects on crash). Avoid: "Checkpoint everywhere." Over-checkpointing has real overhead; discipline is per-boundary.
Q6. Your team has a five-link chain where every link is in the same backend. What's the smell? A. Agent-as-router. The agent isn't reasoning over intermediates; it's just plumbing data between links of the same system. The alternative: one richer backend procedure that takes the original input and produces the final state directly. Five round trips collapse to one. The exception: if the agent branches on an intermediate (escalates instead of proceeding), the chain is justified. Avoid: "Always prefer many small tools." That's right for routing flexibility, wrong when the backend can produce the final state directly.
Apply-now exercise (15 min)¶
Part 1 — Parallel audit. Pick a workflow in your system where you fan out (or could). Fill this table for every branch:
| Branch | Reads or writes? | Shared state? | Failure mode if absent | Join requirement |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Mark any row with shared state — those branches must serialise. Compute the latency saving: sum-of-latencies minus max-of-latencies. Is the saving above 100 ms? If not, the fan-out may not be worth the gather complexity.
Part 2 — Chain audit. Pick a chain from your own work (order fulfilment, ticket triage, code change). Fill this table for every transfer point:
| From → To | Field passed | Normalise needed? | Validate needed? | Failure if absent |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Mark any boundary where you can't name a validation — those are your silent-corruption points. Write the normalise + validate step for each.
Part 3 — Composition sketch. Draw the read-fanout-write-chain for one real workflow. Label which branches fan out, where the gather sits, where the checkpoint goes, and which links chain with idempotency keys. Verify that no write appears before its prerequisite gather completes.
Operational memory¶
Tool composition has two patterns: parallel (independent tools fire together for free latency) and chaining (one tool's output feeds another's input, latency accumulates as the cost of discovery). The tension is power vs blast radius — composition makes agents more capable, but parallel failures are harder to diagnose and chained failures cascade through every downstream link.
The unifying discipline is at the boundaries: for parallel, the independence test before fan-out and the gather step before any decision; for chaining, the normalise + validate step at every transfer point. The standard composition — read-fanout-write-chain — combines both: fan out reads where independent, gather, decide, then chain writes where dependent with checkpoints and idempotency keys.
Carry these diagnostics forward:
- Before fanning out: "What does each branch touch, and what else touches that?" If shared state exists at any layer, serialise.
- Before passing a handoff: "Does the next tool's input match the previous tool's output in field name, format, and envelope?" If you can't prove it, validate.
- Before committing a write in a chain: "Is there a checkpoint before this point and an idempotency key on this call?" If not, replay is unsafe.
Remember:
- Three independence properties for parallel safety: same starting context, no shared writes, outputs joined later. Any one false → serialise.
- The gather step is non-optional — it turns racing into parallelism. Pick a join policy at design time.
- Past five branches, fan-out usually wants to become one richer batch tool.
- Chain boundaries — not tools — are where corruption lives. Every transfer point gets normalise + validate.
- Maintain a scratchpad through chains for auditability, retry safety, and cross-turn continuity.
- Checkpoint between the last read and the first irreversible write. Idempotency keys on every state-changing link.
- Read-fanout-write-chain is the workhorse schedule of any agent that consults multiple sources before acting.
Bridge. Tools compose beautifully when they live inside one codebase: you control the schemas, the naming conventions, the envelope formats. But real agents reach across organisational boundaries: different teams, different repos, different languages. The field-name drift that breaks a chain within one service becomes a certainty when spanning many. Next: the protocol that standardises tool communication across that divide. → 05-mcp-protocol.md