07. Handoff Design — structuring what passes between agents

~10 min read. The handoff is where most multi-agent systems actually break. Design it first.

Built on the ELI5 in 00-eli5.md. The handoff — what one department passes to the next — deserves as much design attention as the prompt itself.


1) Why handoffs break systems

Picture first.

CLEAN HANDOFF
Research
  -> task, inputs, constraints, risks
Writer
  -> knows scope, writes within limits

MESSY HANDOFF
Research
  -> essay, mixed links, hidden assumptions
Writer
  -> guesses facts, scope, and next step

Look. Most multi-agent failures happen at boundaries, not inside one agent. The sender may do decent work, and the receiver may also be capable. But the handoff between them is weak.

The handoff is the contract between agents. If the sender sends a vague essay, the receiver guesses. If the receiver expects a checklist but gets paragraphs, parsing fails. If constraints are missing, the next step confidently does the wrong thing. Simple, no?

Teams often obsess over prompts and ignore boundary design. That is backwards. A sharp worker with a sloppy handoff still creates drift. A modest worker with a clear handoff often performs well. See.

Think in the ELI5 picture. Each worker is a department handing work to another department. If the note is messy, the second desk starts with confusion. If the memo format is stable, the second desk starts with clarity.

So what to do? Design the payload before tuning the prompt. Name the fields, the checks, and the stop condition. Then let agents speak.

Quick boundary test:

  • Can the receiver act without rereading the full chat?
  • Are constraints explicit?
  • Is the done condition checkable?
  • Are open risks named?
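
The boundary test above can be run mechanically on the receiving side. What follows is a minimal sketch, assuming the handoff arrives as a JSON string. The function name `boundary_test` and the `REQUIRED_FIELDS` tuple are hypothetical; the field names are borrowed from the template in the next section.

```python
import json

# Assumed field names, borrowed from the handoff template in this lesson.
REQUIRED_FIELDS = ("task", "inputs", "constraints", "done_definition", "open_risks")

NOT_STRUCTURED = "payload is not structured (expected a JSON object)"


def boundary_test(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the receiver can act."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # A prose essay fails here, before the receiver starts guessing.
        return [NOT_STRUCTURED]
    if not isinstance(payload, dict):
        return [NOT_STRUCTURED]
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]


messy = "I found several articles. Growth is happening. See the links."
clean = json.dumps({
    "task": "Draft the executive summary",
    "inputs": ["validated_claims.json"],
    "constraints": ["Use only approved claims"],
    "done_definition": ["Exactly 3 paragraphs"],
    "open_risks": [],  # an explicit empty list still counts as "named"
})

print(boundary_test(messy))
print(boundary_test(clean))
```

The point of the sketch is where the failure lands: a messy payload is rejected at the boundary with a named reason, instead of being silently interpreted downstream.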


2) The handoff template

Picture the shape before the explanation.

task: Draft the executive summary
goal: Convert validated claims into concise prose
inputs:
  - validated_claims.json
constraints:
  - Use only approved claims
  - Maximum 180 words
done_definition:
  - Exactly 3 paragraphs
  - Every claim traceable to a citation
open_risks:
  - Claim 4 confidence is medium
budget:
  tokens: 3000
  latency_ms: 5000

This handoff is boring. That is why it works. In the ELI5 picture, this reusable template becomes the memo format. It should feel dull, predictable, and easy to scan. The next agent should not do archaeology. It should just execute.

Now walk field by field. task states the immediate job, not the whole project dream. goal explains why this step exists and guides local choices. inputs lists the allowed artifacts, so the worker does not go hunting. constraints says what must never be violated.

done_definition is the stop signal. Without it, agents ramble or stop too early. open_risks tells the next worker where to be careful. That is honest coordination. budget makes cost and latency explicit. Now the receiver knows how much thinking room exists.

Each field earns its place because it reduces ambiguity. Remove a field only if nobody uses it. Add a field only if it repeatedly prevents failure. That is how the memo format stays lean.

The handoff should be small, explicit, and testable. Not poetic. Not chatty. Not heroic.

A good template also creates auditability. When something fails, you can inspect the exact field that was missing, wrong, or ignored. That makes debugging far easier than reading long conversational traces.
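
One way to get that auditability is to give the template a type and have it report problems per field. This is a sketch under assumptions, not a prescribed implementation: the class names `Handoff` and `Budget`, the method `problems`, and the specific checks are all hypothetical; only the field names come from the template above.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    tokens: int
    latency_ms: int


@dataclass
class Handoff:
    task: str
    goal: str
    inputs: list[str]
    constraints: list[str]
    done_definition: list[str]
    budget: Budget
    open_risks: list[str] = field(default_factory=list)

    def problems(self) -> dict[str, str]:
        """Map each faulty field to a reason, so a failure points at one field."""
        found: dict[str, str] = {}
        if not self.task.strip():
            found["task"] = "empty"
        if not self.goal.strip():
            found["goal"] = "empty"
        if not self.inputs:
            found["inputs"] = "no artifacts listed"
        if not self.done_definition:
            found["done_definition"] = "no stop signal"
        if self.budget.tokens <= 0 or self.budget.latency_ms <= 0:
            found["budget"] = "must be positive"
        return found


memo = Handoff(
    task="Draft the executive summary",
    goal="Convert validated claims into concise prose",
    inputs=["validated_claims.json"],
    constraints=["Use only approved claims", "Maximum 180 words"],
    done_definition=["Exactly 3 paragraphs", "Every claim traceable to a citation"],
    budget=Budget(tokens=3000, latency_ms=5000),
    open_risks=["Claim 4 confidence is medium"],
)
print(memo.problems())
```

An empty result means the memo is well formed. A non-empty result names the field to inspect, which is usually quicker than rereading a conversational trace.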


3) The four-field worker output pattern

So what should every worker return? Keep it structural. Ask for four fields every time.

  • answer — the actual output
  • evidence — supporting data
  • uncertainty — confidence and known gaps
  • recommended_next_action — what should happen next

Picture why this matters. The orchestrator should compare shapes, not interpret essays. Free-form prose sounds smart and breaks automation. Structured returns look plain and scale better. Simple, no?

answer is what downstream actually needs. Maybe it is a summary, a patch, a label, or a query. evidence shows why the answer deserves trust. This can be source IDs, extracted fields, or tool results.

uncertainty prevents fake certainty and surfaces weak spots early. recommended_next_action helps the orchestrator move without rereading everything. Now the coordinator can do structural checks instead of literary interpretation.

Did two workers disagree? Compare evidence. Did one worker admit gaps? Check uncertainty. Should the system escalate, retry, or proceed? Read the next action field. No mind-reading needed.

This pattern standardizes the handoff on the output side. Every worker may differ inside, but each still exits through the same small door. That makes routing, ranking, and retries much easier.
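
To make "compare shapes, not essays" concrete, here is a small sketch of an orchestrator routing on the four fields alone. Several things are assumptions for illustration: modelling `uncertainty` as a confidence number plus a list of known gaps, the action vocabulary `proceed` / `retry` / `escalate`, the `0.7` confidence threshold, and the function name `route`.

```python
from dataclasses import dataclass, field


@dataclass
class Uncertainty:
    confidence: float                       # assumed scale: 0.0 to 1.0
    known_gaps: list[str] = field(default_factory=list)


@dataclass
class WorkerOutput:
    answer: str
    evidence: list[str]
    uncertainty: Uncertainty
    recommended_next_action: str            # assumed: "proceed", "retry", "escalate"


def route(outputs: list[WorkerOutput], min_confidence: float = 0.7) -> str:
    """Decide the next step from structure alone, without reading answers as prose."""
    if any(not o.evidence for o in outputs):
        return "retry"       # an answer with no evidence is not trusted
    if any(o.recommended_next_action == "escalate" for o in outputs):
        return "escalate"    # a worker asked for help
    if any(o.uncertainty.confidence < min_confidence for o in outputs):
        return "escalate"    # a worker admitted weak confidence
    if len({o.answer for o in outputs}) > 1:
        return "escalate"    # workers disagree, so someone must compare evidence
    return "proceed"


a = WorkerOutput("label: spam", ["msg-17"], Uncertainty(0.9), "proceed")
b = WorkerOutput("label: spam", ["msg-17", "rule-4"], Uncertainty(0.8), "proceed")
c = WorkerOutput("label: ham", ["msg-17"], Uncertainty(0.9), "proceed")

print(route([a, b]))
print(route([a, c]))
```

Each branch reads exactly one field. That is the design choice being illustrated: the coordinator never has to interpret the wording of an answer to decide what happens next.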

Look. The worker can think freely. The boundary still stays disciplined.


4) Worked example — bad handoff vs good handoff

Task: Research agent passes findings to Writer agent. Picture the bad version first.

Bad handoff: "I found several articles about UPI. Growth is happening. Market is big. See the links."

Now what happens? The Writer guesses which claims are solid. The Writer guesses which links matter. The Writer guesses what is missing. The Writer may write smooth nonsense. Quality drops even if the research step was decent. That is the danger.

Now see the good handoff.

{
  "claims": [
    {"text": "UPI processed 12B txns in Dec 2023", "source": "NPCI official", "confidence": 0.95},
    {"text": "YoY growth is 42%", "source": "RBI annual report", "confidence": 0.88}
  ],
  "coverage_gaps": ["competitor analysis missing"],
  "recommended_action": "proceed to draft, flag gap in risk section"
}

Look at the difference. The writer now has claim objects, named sources, and confidence values. The missing area is called out directly. The next action is already suggested. This is boring in a good way.

Writer before good handoff:

  • guess the facts
  • guess the scope
  • guess the missing area

Writer after good handoff:

  • draft supported claims
  • flag the coverage gap
  • ask for one targeted follow-up if needed

Now the Writer's job becomes narrow and clear. Write only supported points. Do not invent competitor analysis. Mention the gap in the risk section. If needed, ask research for one more pass. That is a usable handoff. Not a pile of vibes.
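
As a rough illustration of how narrow the Writer's job becomes, the good handoff can be turned into a drafting brief with a few lines of code. The function name `writer_brief`, the output keys, and the `0.8` confidence cutoff are assumptions made for this sketch; the payload is the one shown above.

```python
import json

handoff = json.loads("""
{
  "claims": [
    {"text": "UPI processed 12B txns in Dec 2023", "source": "NPCI official", "confidence": 0.95},
    {"text": "YoY growth is 42%", "source": "RBI annual report", "confidence": 0.88}
  ],
  "coverage_gaps": ["competitor analysis missing"],
  "recommended_action": "proceed to draft, flag gap in risk section"
}
""")


def writer_brief(payload: dict, min_confidence: float = 0.8) -> dict:
    """Split the handoff into what the Writer may state and what it must flag."""
    usable = [c for c in payload["claims"] if c["confidence"] >= min_confidence]
    held_back = [c for c in payload["claims"] if c["confidence"] < min_confidence]
    return {
        # Supported points, each carrying its source so the draft stays traceable.
        "draft_from": [f'{c["text"]} [{c["source"]}]' for c in usable],
        # Gaps go to the risk section instead of being papered over.
        "risk_section": list(payload["coverage_gaps"]),
        # Weak claims become one targeted follow-up for research.
        "follow_up": [c["text"] for c in held_back],
    }


brief = writer_brief(handoff)
print(json.dumps(brief, indent=2))
```

With the default cutoff both claims are usable and the follow-up list is empty. Raising the cutoff moves weaker claims into `follow_up` rather than into the draft.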

A good handoff turns downstream work from interpretation into execution. That is why output quality jumps. The writer spends time composing, not reconstructing the research process. See.


5) Handoff compression — summarization is memory management

Picture a pipeline full of raw notes. Search results. Tool logs. Scratch reasoning. Half-formed ideas. If all of that moves forward, the next agent drowns. So compression is not decoration. It is memory management.

Raw context from one agent is often too large for the next. Compression means summarization at the boundary. Keep facts, risks, constraints, and next actions. Drop chain-of-thought, irrelevant detail, and raw search dumps. The handoff is a filter, not a pipe.

This is where many systems quietly waste tokens. They forward everything because forwarding feels safe. But giant payloads create new failures. Important facts get buried. Latency rises. Costs rise. The receiver misses the one thing that mattered. Look. Sharper is better.

Boundary compression recipe:

  • keep only decision-relevant facts
  • keep evidence that supports those facts
  • keep known gaps and limits
  • keep the next recommended move

So what to do at the boundary? Summarize into stable fields. Preserve traceable evidence. Retain known gaps and constraints. Name the next decision. That gives the next worker enough memory to act, and no extra noise. That is disciplined handoff design.
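
The filter idea can be sketched as a function that builds the outgoing payload from an allow-list instead of forwarding the sender's whole context. Everything named here is hypothetical: the function `compress`, the raw context keys, and the `decision_relevant` flag, which this sketch assumes the sending agent sets on each fact. In a real system the summarization step would likely be a model call; this shows only the shape of what survives.

```python
def compress(raw_context: dict) -> dict:
    """Build the handoff from an allow-list; anything not named here is dropped."""
    kept_facts = [f for f in raw_context.get("facts", []) if f.get("decision_relevant")]
    return {
        # Each kept fact travels with the evidence that supports it.
        "facts": [{"text": f["text"], "evidence": f["evidence"]} for f in kept_facts],
        "known_gaps": raw_context.get("known_gaps", []),
        "constraints": raw_context.get("constraints", []),
        "next_action": raw_context.get("next_action", ""),
    }


raw = {
    "facts": [
        {"text": "YoY growth is 42%", "evidence": "RBI annual report", "decision_relevant": True},
        {"text": "One article had a paywall", "evidence": "search log", "decision_relevant": False},
    ],
    "scratch_reasoning": "maybe compare with card volumes first",
    "search_dump": ["result 1", "result 2"],
    "known_gaps": ["competitor analysis missing"],
    "constraints": ["Use only approved claims"],
    "next_action": "proceed to draft",
}

payload = compress(raw)
print(sorted(payload))
```

Using an allow-list rather than a deny-list is deliberate: new kinds of noise added upstream are dropped by default, so the payload does not grow unless someone decides a field belongs in it.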


Where this lives in the wild

  • GitHub Copilot agent handoffs: a software engineer sees file search results compressed into structured context before the code generation agent receives them.
  • Customer support escalation: a support operations lead wants the tier-1 agent to pass a structured case summary, not the full transcript, to the tier-2 specialist.
  • Medical AI pipelines: a clinical AI product manager needs a diagnosis agent to pass a structured differential, not raw notes, to the treatment recommendation agent.
  • Legal review systems: a legal ops engineer wants an extraction agent to pass clause-level findings with confidence scores to the risk assessment agent.
  • Content workflows: a content strategist needs a research agent to pass claim objects with source links and confidence, not raw article dumps.

Pause and recall

  1. Why do most multi-agent failures happen at the boundary, not inside one worker?
  2. Which fields in the template remove the most guessing for the next agent?
  3. Why does the four-field output pattern help an orchestrator compare workers?
  4. Why is boundary compression really a memory management problem?

Interview Q&A

Q: Why design the handoff first, not just keep improving prompts?
A: Because prompt quality cannot fully rescue a broken boundary. If inputs, constraints, and done conditions arrive vaguely, the next agent still guesses and drifts.
Common wrong answer to avoid: "Because prompts do not matter." Prompts matter, but boundary design decides whether downstream work is interpretable and testable.

Q: Why use a structured handoff instead of passing the full conversation?
A: Because downstream agents need the right state, not every state. Structured payloads reduce token waste, cut noise, and make automation reliable.
Common wrong answer to avoid: "Because shorter context is always better." Shorter but incomplete handoffs also fail; the goal is compressed sufficiency.

Q: Why prefer four fixed output fields over open-ended worker narratives?
A: Because the orchestrator can compare outputs structurally across many workers. That makes retries, ranking, and escalation much easier than reading essays.
Common wrong answer to avoid: "Because prose is bad." Prose is useful inside the agent; structure matters most at the boundary.

Q: Why compress the handoff instead of forwarding everything for safety?
A: Because unfiltered payloads bury critical facts, increase latency, and force the next agent to parse noise. Boundary summarization preserves what matters and drops what does not.
Common wrong answer to avoid: "Because token limits are the only issue." Cost matters, but clarity and failure isolation matter too.


Apply now (5 min)

Exercise: Take one workflow you know. Write a handoff from Agent A to Agent B using the template in this lesson. Then underline the fields that prevent guessing.

Sketch from memory: Draw one bad handoff and one good handoff. Mark where facts, risks, constraints, and next action appear. Then rewrite the bad one into the four-field worker output pattern.


Bridge. The handoff structure is clear. But how does the whole system share and manage state? Two fundamentally different approaches: a shared store that everyone reads and writes, or explicit message payloads that flow between agents. That choice shapes everything. → 08-shared-state-vs-messages.md