12. Multi-agent handoff bugs — the seams are the crime scene¶

~12 min read. Two agents talk. The bug lives between them. Not inside either one.

Built on the ELI5 in 00-eli5.md. The case file holds the trace. But when many agents touch one task, the witness notes live across multiple traces. The lineup must now check the seams, not just the rooms.

The picture before the bugs¶

One agent is a room. Two agents is a corridor between two rooms, and most bugs live in the corridor.

   HEALTHY HANDOFF                       BROKEN HANDOFF
   ┌──────────┐                          ┌──────────┐
   │ Agent A  │── envelope ───→          │ Agent A  │── "do this" ──→
   │ (planner)│  state + intent          │          │  (last msg only)
   │          │  + success crit          └──────────┘
   │          │  + return path                │
   └──────────┘                                ▼
        │                                ┌──────────┐
        │       ┌──────────┐             │ Agent B  │
        └──←─── │ Agent B  │             │ who am I?│
       result   │(executor)│             │ planner? │
       + status └──────────┘             │ executor?│
                                         └──────────┘
                                             │  ?
                                          deadlock,
                                          silence

The corridor lost everything except a sentence, so the second agent guesses role, scope, and success criteria all at once. That is where the suspects hide in multi-agent systems. For full multi-agent debugging depth, see module 17, file 13.

The seven handoff bugs — each with trace signature, elimination test, fix¶

1) Lost context¶

Symptom. A had full user history. B got only the last message. Nuance gone. B answers a different question than user asked. Trace. B's input span has messages: [last_user_msg] only. A's had [m1..mN]. B's token count suspiciously low. Test. Replay A→B with full history forced into B's input. If B now answers correctly — confirmed. Fix. Envelope carries the relevant history slice. Older turns summarized into a session_state field.

2) Role confusion¶

Symptom. B oscillates. Sometimes plans, sometimes executes. Same input, different behavior across runs. Trace. B's system prompt is generic. Envelope has no intent field. B's first tokens vary — "Let me think..." vs "Calling tool X...". Test. Force B's prompt to specify role explicitly ("You are executor only"). If behavior stabilizes — confirmed. Fix. Envelope carries intent: plan | execute | review. Receiver branches on this field, not guesswork.

3) Deadlock¶

Symptom. A waits for B. B waits for A. Two agents stuck running. No progress. Trace. Two parallel spans, no end timestamp. No tool calls. No completion tokens. Test. Add hard timeout on both sides. If both hit timeout — deadlock confirmed. Fix. Explicit return_path in envelope. Exactly one agent owns the next step. Timeouts as safety net, not logic.

4) Capitulation cascade¶

Symptom. A is uncertain. B sees A's hedging. B mirrors it. Both agree on a weak answer. Confidence drops every hop. Trace. Each agent's output starts with hedges. Downstream agents quote them back. Final confidence lower than any input. Test. Strip uncertainty markers from A's output before passing to B. If B returns a stronger answer — confirmed. Fix. Separate answer and uncertainty fields. Receiver judges the answer on its merits. Uncertainty informs routing only.

5) Message-bus failure¶

Symptom. Handoff vanishes. A says "sent". B says "never got it". No error. Silent loss. Trace. A's outgoing-message span has status: success. B has no incoming span. Queue depth spiked briefly. Test. Inject 100 handoffs. Count send vs receive. Drop rate > 0 — bus is lossy. Fix. At-least-once delivery with idempotency keys. Each handoff has a handoff_id. Receiver acks. Sender retries on missing ack.

6) Token-budget partition error¶

Symptom. Total budget 100K. Orchestrator gave A 90K, B 10K. B starves — truncates output, drops fields. Trace. B's completion span shows finish_reason: length. Output suspiciously short. Per-agent token histogram is skewed. Test. Re-run with 50/50 split. If B now finishes cleanly — partition was wrong. Fix. Budget per agent matches work per agent. Profile typical usage, allocate proportionally. Envelope carries a budget field.

7) Ambiguous ownership¶

Symptom. Both agents think the OTHER will produce the final answer. User sees nothing. Or both produce — user sees two contradictory answers. Trace. Both spans end with "passing to peer". Neither emits a final_answer span. Or both do, with different content. Test. Inspect envelope's return_path. Missing or ambiguous? Confirmed. Fix. Exactly one agent owns the final answer. Envelope encodes final_responder: A | B. Orchestrator validates only one final_answer span exists.

The handoff envelope pattern¶

Every handoff must carry exactly four things — not more, not less.

┌────────────────────────────────────────────────┐
│           THE HANDOFF ENVELOPE                  │
├────────────────────────────────────────────────┤
│  state           : what is known so far         │
│  intent          : what the receiver must do    │
│  success_criteria: how we know it is done       │
│  return_path     : who gets the result next     │
└────────────────────────────────────────────────┘

Drop any of these and you invite one of the seven bugs above.

No state → lost context.
No intent → role confusion.
No success_criteria → capitulation cascade, ambiguous done.
No return_path → deadlock, ambiguous ownership.

The envelope is the witness note of the corridor. It is the single artifact the lineup inspects to find the first bad seam. For the full field-level recipe, see 10-handoff-design.md in module 17.

Worked example — the travel booking that dropped the dates¶

Picture an orchestrator with three workers — flight, hotel, car.

User: "Book me Mumbai to Bangalore, May 13 to May 16, one adult."

┌──────────────┐
│ Orchestrator │
└──────┬───────┘
   ┌───┼───┬────────┐
   ▼   ▼   ▼        ▼
 flight hotel  car  reviewer

Flight booking worked. Hotel returned: "Booked Grand Park Bangalore for May 1 to May 2." User filed a complaint slip — "wrong dates".

We open the case file. The hotel agent's input span shows the envelope:

{
  "intent": "book hotel",
  "city": "Bangalore",
  "guests": 1
}

No dates field. The hotel agent hallucinated nearby plausible defaults. The orchestrator constructed the envelope per-agent and forgot to forward dates to the hotel worker. The flight worker's envelope had dates, so flight booking looked fine. Each witness note alone looked plausible.

The suspects are walked in order: - Prompt? Hotel prompt was fine. - Tool? Tool accepted dates, none were sent. - Loop? Single shot, no loop bug. - Memory? No state was carried. - Model? Output was internally consistent. - Multi-agent handoff? The envelope dropped dates. Confession.

The fix is the lock — make the envelope a strict schema.

class TravelEnvelope(BaseModel):
    intent: Literal["book_flight", "book_hotel", "book_car"]
    city: str
    dates: tuple[date, date]   # required, not optional
    guests: int

# orchestrator now cannot construct an envelope without dates
# schema validation fails before the worker ever runs

Add a regression eval: "user gives date range → every worker envelope must carry it". The bug cannot return without breaking the schema or the eval.

The first bad artifact was the envelope, not the hotel agent. Walk backward, always — from the visible failure to the first witness note that already carried the poison.

Handoff-bug patterns across multi-agent stacks¶

CrewAI multi-agent patterns — researcher/writer/editor handoffs use a Task.context field; bugs hit when context is omitted, mirroring the lost-context pattern in this file.
CrewAI process types — Process.sequential vs Process.hierarchical change who controls the envelope; hierarchical mode adds a manager agent whose missing return_path is a frequent deadlock source.
AutoGen GroupChat — the GroupChatManager can route a message to the wrong speaker if next_speaker_selection_method is auto; teams debug with full chat history dumps and explicit speaker selection.
Microsoft AutoGen Studio — visual workflow editor exposes handoff edges as first-class objects; ambiguous ownership shows up as two edges pointing to the same "final" node with no merge rule.
LangGraph multi-agent supervisor — the supervisor node uses structured output ({next: worker_name}) to assign work; ambiguous-ownership bugs appear when supervisor returns end while a worker is still mid-task.
LangGraph swarm pattern — agents hand control to each other via Command(goto=...); missing state fields in Command.update reproduce the lost-context bug at framework level.
MetaGPT role handoff — its software-company pipeline (PM, architect, engineer, QA) hit early bugs where the engineer agent started planning instead of coding; fix was tightening role-specific system prompts.
Anthropic Computer Use multi-tool sequencing — when a planner hands a browser-control task to an executor, missing screenshots in the envelope cause the executor to act on stale UI state.
OpenAI Agents SDK handoff API — Agent.handoffs=[...] exposes the corridor as a typed object with input schema; schema validation catches dropped-field bugs before the receiver runs.
Swarms framework (Kye Gomez) — supports concurrent and hierarchical swarms; deadlocks observed when two worker swarms wait on a shared resource without a tie-breaker policy.
AgentOps multi-agent trace view — visualizes inter-agent messages as a sequence diagram; lost-context bugs show up as a single arrow with a tiny payload between two large agent boxes.
BAML multi-agent workflows — typed function signatures between agents make the envelope a compile-time contract; capitulation cascade still leaks through if uncertainty is not a separate field.
MotleyCrew — orchestrates LangChain, LlamaIndex, and CrewAI agents in one DAG; mismatched envelope conventions between frameworks force a translation layer where most seam bugs land.
LlamaIndex AgentWorkflow — uses typed events between agents; ambiguous-ownership bugs surface when two agents both emit a StopEvent for the same task.
Microsoft Semantic Kernel agent groups — AgentGroupChat with selection and termination strategies; weak termination strategies produce role-confusion loops where one agent keeps re-entering.

Recall — handoff bugs and the envelope discipline¶

Why is the corridor between two agents bug-richer than either agent alone?
What four fields must every handoff envelope carry, and which bug shows up when each is missing?
In the travel booking example, what was the first bad artifact — the hotel agent or the envelope? Why does that distinction matter?
How does role confusion differ in trace signature from a capitulation cascade?

Interview Q&A¶

Q: Why does adding more agents often make a system less reliable, not more? A: Because every new agent adds two corridors — incoming and outgoing handoffs. Each corridor is a place where state, intent, success criteria, or return path can be dropped. Reliability decays multiplicatively across handoffs, not additively. Three agents with weak envelopes can be worse than one agent doing everything.

Common wrong answer to avoid: "Because models are unreliable" — model errors are roughly the same per agent. The compounding failure mode is at the seams, not inside the agents.

Q: A multi-agent trace shows two agents in running state with no progress for 90 seconds. What is your first diagnostic move? A: Inspect the envelope each agent received and the return_path they emit. If both believe the other owes the next move, it is a deadlock or ambiguous-ownership bug. Add a forced timeout and explicit final_responder to disambiguate. Then re-run.

Common wrong answer to avoid: "Increase the model timeout" — that just delays the symptom. The bug is missing return_path semantics, not a slow model.

Q: Your reviewer agent agrees with everything the writer says, even when the writer hedges. Quality is dropping. Why? A: Capitulation cascade. The reviewer reads the writer's uncertainty markers as signal and mirrors them. Separate answer and uncertainty into distinct envelope fields so the reviewer evaluates the answer on its merits, with uncertainty informing routing, not the verdict.

Common wrong answer to avoid: "The reviewer prompt is too weak" — making the prompt sterner without separating fields just produces louder agreement.

Q: When debugging a multi-agent failure, should you fix the last agent that produced the visible bad output? A: Almost never. Walk backward through the lineup to the first bad witness note. The visible failure is usually faithful forwarding of an earlier broken envelope. Fix the upstream seam — that removes the poison earlier and improves every downstream path.

Common wrong answer to avoid: "Yes, that is where the bug is visible" — visibility is symptom, not cause. The first bad artifact is what to repair.

Apply now (5 min)¶

Step 1 — model the exercise. Walk the travel-booking case from this chapter. The orchestrator built three envelopes — flight, hotel, car. The flight envelope carried dates. The hotel envelope did not. The hotel worker received {intent, city, guests} and invented May 1–2 as plausible defaults. The first bad witness note is the hotel envelope, four spans upstream of the visible failure. The lineup walks prompt, tool, loop, memory, model, multi-agent — and only the last suspect confesses, because every earlier layer was internally consistent on bad input. The lock is a strict TravelEnvelope schema with dates required, plus a regression eval that asserts every worker envelope carries the date range. The bug cannot return without breaking either.

Step 2 — your turn. Take one multi-agent workflow you have built or seen. Sketch the agents as boxes and the handoffs as labeled arrows. For each arrow, write the envelope fields it carries. Mark any missing among state, intent, success_criteria, return_path. For each missing field, predict which of the seven bugs is most likely.

Step 3 — reproduce from memory. Without scrolling, draw the handoff envelope with its four fields and write one sentence per field on what bug fires when it is dropped. Then redraw the travel-booking pipeline showing the broken envelope versus the fixed schema. If you can do this cold, the corridor model is yours.

What you should remember¶

This chapter moved the lineup from a single room to a corridor. When one agent does the work, the suspects are the five layers inside that agent — prompt, tool, loop, memory, model. When two or more agents share a task, a sixth suspect joins the row: the seam between them. Everything that crosses the corridor is the handoff envelope, and every bug in this chapter is the same bug worn seven ways — the envelope dropped one of its four load-bearing fields.

The seven patterns each have a trace signature that points at the seam, not at either agent. Lost context shows a suspiciously short input span on the receiver. Role confusion shows a generic system prompt and an oscillating first token. Deadlock shows two parallel spans with no end timestamps. Capitulation cascade shows hedges quoted forward and confidence dropping per hop. Message-bus failure shows a success send span with no matching receive. Token-budget partition shows a finish_reason: length on the starved agent. Ambiguous ownership shows either two final_answer spans or zero.

The fix is structural, not behavioural. A strict envelope schema with state, intent, success_criteria, and return_path makes most seam bugs unrepresentable. The lock is the schema plus a regression eval that asserts every envelope carries its required fields. The travel-booking case showed why: the hotel agent did not invent dates because it was a bad model; it invented dates because the orchestrator handed it an envelope with a missing field, and the model did what models do when fields go missing.

Carry the diagnostic move forward — when a multi-agent system fails, do not fix the agent that produced the visible bad output. Walk backward through the corridor to the first witness note with a missing field, and lock that field into the envelope schema.

Remember:

The suspects in a multi-agent system include the corridor itself. The seam is a suspect, not just a transport.
Every complaint slip about a multi-agent failure resolves through the same four-field envelope test — state, intent, success_criteria, return_path.
The first bad witness note is upstream of the visible failure. Faithful forwarding of a broken envelope looks like an agent bug and is not.
A lineup that ends at multi-agent must check the envelope before blaming the receiving agent's prompt.
The lock for seam bugs is a typed envelope schema plus a regression eval, not a stricter receiver prompt.
Reliability decays multiplicatively across handoffs. Three agents with weak envelopes can be worse than one agent doing everything.

Bridge. The lineup can solve single-trace cases. Even multi-agent handoffs leave fingerprints in the trace. But some bugs leave no fingerprint in any single trace — they only show up in aggregate. Yesterday's runs looked fine. Today's runs also look fine. But the distribution shifted. That is the cold case. → 13-drift-detection.md