10. Debugging Multi-Agent: finding the broken handoff
~10 min read. The output is wrong. Four agents touched it. Where did failure start? That is the debugging problem.
Built on the ELI5 in 00-eli5.md. The handoff — what passes between departments — is where most bugs hide. The CEO sees the final result. The detective work is tracing backward.
1) Why multi-agent debugging is hard
Look. Picture first.
Agent 1         Agent 2         Agent 3         Agent 4
   │               │               │               │
   │ ✓ looks ok    │ ✗ subtle      │ ✓ passes      │ ✗ bad output
   │               │   error       │   error along │   visible here
   ▼               ▼               ▼               ▼
output 1 ──→ handoff ──→ output 3 ──→ handoff ──→ final result
                ▲
          bug hides here
2) The debugging checklist
The move is always the same: reduce the blur. Do not read the whole run as one blob. Break the chain into visible checkpoints. Use this checklist every time.

1. Reproduce the same task with tracing enabled. Freeze the prompt, inputs, and major settings. If the run is not reproducible, comparison becomes theatre.
2. Inspect each handoff output in isolation: is it well-formed? Check schema, required fields, source links, status flags, and obvious omissions. A pretty paragraph can still be a malformed payload.
3. Identify the FIRST bad intermediate artifact. This is the key move. Do not stop at the first visible symptom. Stop at the first visible cause.
4. Check whether the artifact is under-specified or overlong. Under-specified payloads invite guessing. Overlong payloads bury the important field. Both create the same downstream pain.
5. Check whether tool use matched the task charter. If the agent was supposed to verify sources, did it verify? If it was supposed to call a calculator, did it call one? Many failures are charter drift, not model stupidity.
6. Check whether the next agent misread the handoff. The payload may be correct but badly consumed. A reviewer may miss a flag. A publisher may treat notes as facts. That is a reader bug.
7. Check whether the orchestrator used a weak aggregation rule. Maybe the CEO picked the longest answer. Maybe it merged outputs without validation. Maybe low-confidence results were treated as final. That is an orchestration bug.

You can remember the checklist in one line. Reproduce. Inspect. Find first failure. Judge payload size. Check tool fit. Check reader fit. Check aggregation. Simple, no?

A debugging checklist is not bureaucracy. It is a way to stop guessing.
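Steps 2 and 3 of the checklist can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `Handoff` shape, the `require_fields` helper, and the field names in the sample payloads are all assumptions made up for the example. The idea is only that you walk the chain in execution order and stop at the earliest artifact that fails its check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Handoff:
    """One intermediate artifact passed to the next agent (hypothetical shape)."""
    agent_name: str
    payload: dict


# A check returns a list of problems; an empty list means "well-formed".
Check = Callable[[dict], list[str]]


def require_fields(*fields: str) -> Check:
    """Build a check that reports required fields that are missing or empty."""
    def check(payload: dict) -> list[str]:
        return [f"missing field: {name}" for name in fields if not payload.get(name)]
    return check


def first_bad_handoff(
    chain: list[Handoff], checks: dict[str, Check]
) -> Optional[tuple[Handoff, list[str]]]:
    """Walk the chain in execution order and return the earliest failing handoff."""
    for handoff in chain:
        check = checks.get(handoff.agent_name)
        problems = check(handoff.payload) if check else []
        if problems:
            return handoff, problems
    return None


# Illustrative payloads only; the field names are assumptions.
chain = [
    Handoff("research", {"sources": ["S1", "S2"], "claims": []}),
    Handoff("writer", {"draft": "Draft text", "citations": ["S1"]}),
    Handoff("reviewer", {"verdict": "pass"}),
]
checks = {
    "research": require_fields("sources", "claims"),
    "writer": require_fields("draft", "citations"),
    "reviewer": require_fields("verdict"),
}

result = first_bad_handoff(chain, checks)
if result is not None:
    handoff, problems = result
    print(f"first bad artifact: {handoff.agent_name} -> {problems}")
```

Here the reviewer passed and the writer looks fine, yet the walk stops at research, because its claims list is empty. Real checks would be richer than field presence, but the loop stays the same.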
3) What to log at every handoff
Now make the picture concrete. If you do not log the handoff, you cannot debug the handoff. Vague memory is not observability. Structured trace is observability.

Minimum trace fields:

- agent name
- task id
- input summary (not the full prompt, which is too expensive to store)
- output summary
- tool calls made
- latency (ms)
- token usage (input + output)
- error type (if any)
- confidence score (if available)

A concrete trace entry may look like this.
{
  "agent_name": "research-agent",
  "task_id": "brief-2041",
  "input_summary": "Find 5 recent sources on Indian fintech funding trends",
  "output_summary": "Returned 6 sources and 4 claims; claim C3 linked to source S3",
  "tool_calls": ["web_search", "fetch_url", "extract_quotes"],
  "latency_ms": 8420,
  "token_usage": {
    "input": 1880,
    "output": 640
  },
  "error_type": null,
  "confidence_score": 0.71
}
agent_name tells you which department owned the step. Without ownership, failure review becomes gossip.
task_id groups all spans from one run. When retries happen, this saves your sanity.
input_summary gives intent without storing the whole prompt. Cheap enough to keep. Rich enough to compare across runs.
output_summary lets you spot drift quickly. If the writer claims six citations but the reviewer saw four, you already have a clue.
tool_calls tells you whether behavior matched charter. No search call during research is a smell. No validator call during review is another smell.
latency_ms catches silent operational problems. A timeout often creates partial output, and partial output often becomes hidden downstream damage.
token_usage shows prompt bloat. If a handoff suddenly becomes huge, the next agent may ignore the important lines. That is not magic. That is context crowding.
error_type should be typed, not poetic. "Something went wrong" is not a useful state. A typed error tells downstream code what to do next.
confidence_score is not truth. But it is routing signal. Low confidence should change what the CEO does next.
Look.
When teams skip trace design, they force every postmortem into memory and vibes. That does not scale.
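One way to keep the trace honest is to give it a fixed shape in code. The sketch below mirrors the minimum trace fields as a Python dataclass and serializes one entry per line. The class name `HandoffTrace`, the helper `missing_required_tools`, and the tool name `verify_claims` are assumptions for illustration, not part of any particular tracing library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HandoffTrace:
    """Minimum trace fields for one handoff (hypothetical schema)."""
    agent_name: str
    task_id: str
    input_summary: str
    output_summary: str
    tool_calls: list[str] = field(default_factory=list)
    latency_ms: int = 0
    token_usage: dict[str, int] = field(
        default_factory=lambda: {"input": 0, "output": 0}
    )
    error_type: Optional[str] = None
    confidence_score: Optional[float] = None

    def to_json_line(self) -> str:
        """Serialize as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


def missing_required_tools(trace: HandoffTrace, charter_tools: set[str]) -> set[str]:
    """Return charter tools the agent never called (a charter-drift smell)."""
    return charter_tools - set(trace.tool_calls)


trace = HandoffTrace(
    agent_name="research-agent",
    task_id="brief-2041",
    input_summary="Find 5 recent sources on Indian fintech funding trends",
    output_summary="Returned 6 sources and 4 claims; claim C3 linked to source S3",
    tool_calls=["web_search", "fetch_url", "extract_quotes"],
    latency_ms=8420,
    token_usage={"input": 1880, "output": 640},
    confidence_score=0.71,
)

line = trace.to_json_line()
# "verify_claims" is a made-up charter requirement for this example.
skipped = missing_required_tools(trace, {"web_search", "verify_claims"})
print(line)
print("charter tools never called:", skipped)
```

The charter check is deliberately dumb: it only compares tool names. That is usually enough to turn "no validator call during review" from a hunch into a flag.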
4) Worked example: tracing a content brief failure
Task: generate a market brief on digital lending. Final output contains one unsupported claim: "Tier-2 digital lending default rates fell 35% in 2025." There is no valid source behind it.

Step 1: Reproduce with tracing. All four agents log cleanly: research, writer, reviewer, and publisher. Good. Now we walk backward.

Step 2: Check publisher output. The claim is present in the final brief, and there is no source link beside it. Publisher is the visible failure point. But publisher may only be forwarding, so do not stop here.

Step 3: Check reviewer output. Reviewer passed the brief. No unsupported-claim warning appears. That means reviewer missed the gap. Important, yes. But still maybe not root cause. A checker can miss a bad input created earlier.

Step 4: Check writer output. Writer included the claim and cited it as "source 3." Now the path tightens. The writer believed support existed. So either the writer misread source 3, or research mislabeled source 3.

Step 5: Check research output. Source 3 exists. It is a market note about loan growth. But it does not support the default-rate claim. Research agent hallucinated relevance. The source was real. The alignment was false. That is why the bug travelled so far. A real source with fake relevance is a dangerous failure mode because downstream agents relax.

So what is the root cause? The research agent's done condition only checked source count and freshness. It did not verify claim-source alignment. That missing test poisoned the handoff. Every downstream department trusted the pack. The claim looked sourced. It was only source-shaped.

Now write the fix in operational language. Add claim-source verification to the research agent's charter. Require each claim to include a supporting quote or extracted evidence span. Reject any source pack where a claim lacks explicit support. Then let reviewer validate claim-source alignment again as a second line of defence.

Notice the lesson. Reviewer failed too, yes. But the first bad artifact came from research. That is why the repair starts upstream.

A short backward trace would read like this.
final brief -> unsupported claim visible
publisher payload -> claim present, no source link
review verdict -> passed, missed unsupported claim
writer draft -> claim cites source 3
research pack -> source 3 real, but irrelevant
root cause -> no claim-source verification rule
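The fix from the worked example can be sketched as a gate on the research pack. This is a rough sketch under stated assumptions: the `Claim` shape, the function names, and the sample source texts are invented for illustration. It also checks only that the supporting quote appears verbatim in the cited source. Whether that quote truly supports the claim still needs the reviewer, or a stronger check, as the second line of defence.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """A claim in the research pack (hypothetical shape)."""
    claim_id: str
    text: str
    source_id: str
    supporting_quote: str  # evidence span the research agent says it extracted


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def unsupported_claims(claims: list[Claim], sources: dict[str, str]) -> list[str]:
    """Return ids of claims whose quote is empty or absent from the cited source."""
    rejected = []
    for claim in claims:
        source_text = normalize(sources.get(claim.source_id, ""))
        quote = normalize(claim.supporting_quote)
        if not quote or quote not in source_text:
            rejected.append(claim.claim_id)
    return rejected


# Placeholder source texts, invented for this example.
sources = {
    "S1": "Fintech lenders raised fresh equity rounds across several quarters.",
    "S3": "Loan disbursals by digital lenders grew strongly in tier-2 cities.",
}
claims = [
    Claim("C1", "Fintech lenders raised fresh equity.", "S1",
          "raised fresh equity rounds"),
    Claim("C3", "Tier-2 digital lending default rates fell 35% in 2025.", "S3",
          "default rates fell 35%"),
]

rejected = unsupported_claims(claims, sources)
print("reject source pack, unsupported claims:", rejected)
```

Source S3 is real and the claim cites it, but the quote is nowhere in the text, so the pack is rejected before the writer ever sees it.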
5) Error categories and response strategies
Errors should not vanish into vague text. They should propagate with types so downstream agents respond appropriately.

| Error type | Example | Response |
|---|---|---|
| retryable_tool_error | API timeout | Retry same agent |
| validation_error | Output missing required field | Re-run with clearer constraints |
| missing_dependency | Upstream agent didn't provide needed data | Fix upstream handoff |
| insufficient_confidence | Agent reports low confidence | Escalate to human or stronger model |
| human_review_required | High-stakes action | Route to approval gate |

Look at the pattern. The type should decide the next move. If the problem is retryable, retry. If the problem is structural, fix the payload. If confidence is low, escalate. Do not let downstream agents guess the policy. That guessing creates hidden inconsistency.

Good systems treat error types as routing instructions. Bad systems bury them inside a polite paragraph. Simple, no?
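The table above maps naturally onto a lookup. The sketch below is one possible shape, not a prescribed design: the `Action` names, the `route` function, and the `min_confidence` default of 0.6 are assumptions chosen for the example. Unknown error strings raise instead of being guessed at, which keeps the policy in one place.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    RETRYABLE_TOOL_ERROR = "retryable_tool_error"
    VALIDATION_ERROR = "validation_error"
    MISSING_DEPENDENCY = "missing_dependency"
    INSUFFICIENT_CONFIDENCE = "insufficient_confidence"
    HUMAN_REVIEW_REQUIRED = "human_review_required"


class Action(Enum):
    RETRY_SAME_AGENT = "retry_same_agent"
    RERUN_WITH_CLEARER_CONSTRAINTS = "rerun_with_clearer_constraints"
    FIX_UPSTREAM_HANDOFF = "fix_upstream_handoff"
    ESCALATE = "escalate_to_human_or_stronger_model"
    APPROVAL_GATE = "route_to_approval_gate"
    CONTINUE = "continue"


# One row per line of the error table.
ROUTES = {
    ErrorType.RETRYABLE_TOOL_ERROR: Action.RETRY_SAME_AGENT,
    ErrorType.VALIDATION_ERROR: Action.RERUN_WITH_CLEARER_CONSTRAINTS,
    ErrorType.MISSING_DEPENDENCY: Action.FIX_UPSTREAM_HANDOFF,
    ErrorType.INSUFFICIENT_CONFIDENCE: Action.ESCALATE,
    ErrorType.HUMAN_REVIEW_REQUIRED: Action.APPROVAL_GATE,
}


def route(
    error_type: Optional[str],
    confidence: Optional[float],
    min_confidence: float = 0.6,  # assumed threshold, tune per workflow
) -> Action:
    """Pick the orchestrator's next move from typed state, not from prose."""
    if error_type is not None:
        # ErrorType(...) raises ValueError for unknown strings, on purpose.
        return ROUTES[ErrorType(error_type)]
    if confidence is not None and confidence < min_confidence:
        return Action.ESCALATE
    return Action.CONTINUE


print(route("retryable_tool_error", None))
print(route(None, 0.4))
print(route(None, 0.71))
```

Note that a low confidence score with no error is still routed. That matches the earlier point: confidence is not truth, but it is a routing signal.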
Where this lives in the wild
- LangSmith — LLM platform engineer: traces multi-agent LangGraph runs and shows each node's input, output, latency, and token counts.
- Arize Phoenix — evaluation engineer: visualizes agent traces as spans and highlights where quality degrades between handoffs.
- Datadog LLM Observability — production ML engineer: monitors multi-agent systems with per-agent latency, failure, and cost dashboards.
- OpenTelemetry for LLMs — platform engineer: provides an emerging tracing standard for agent workflows across services and teams.
- Internal debugging at Anthropic — applied AI engineer: uses structured logging of multi-turn agent interactions for failure analysis.
Pause and recall
- Why can the first visible bad output be different from the first bad artifact?
- What is the fastest way to reduce blur in a multi-agent failure?
- Why should error types propagate as typed states instead of vague text?
- In the worked example, what exact missing done condition caused the bad claim to survive?
Interview Q&A
Q: Why trace every handoff instead of only logging the final output and the failing agent?
A: Because the failing agent may only expose upstream damage. Handoff traces let you find the first broken artifact rather than the last visible symptom.
Common wrong answer to avoid: "Because more logs are always better." Quantity alone is not the point; causal visibility is.

Q: Why typed error propagation instead of free-form failure messages?
A: Typed errors support deterministic routing. Retries, escalations, and upstream fixes should depend on machine-readable state, not prose interpretation.
Common wrong answer to avoid: "Because typed errors look cleaner." The real value is reliable downstream behavior under failure.

Q: Why fix the research charter first, not the reviewer, in the content-brief example?
A: Because research created the first bad intermediate artifact. Reviewer missed it, but upstream correction removes the poison earlier and improves every downstream path.
Common wrong answer to avoid: "Because upstream agents are more important." Importance is not the criterion; first-cause location is.

Q: Why inspect tool-use traces, not only text outputs, when debugging multi-agent runs?
A: Text may look reasonable while the actual tool plan was wrong. Tool traces reveal charter drift, skipped validation, and missing evidence collection.
Common wrong answer to avoid: "Because tools sometimes fail." True, but the deeper reason is that tool choice explains behaviour, not just outages.
Apply now (5 min)
Exercise: Take one four-agent workflow you know. Write the four handoffs as short boxes. Then mark where a bad fact could hide for two steps before becoming visible.

Sketch from memory: Redraw the backward-debugging path. Start at final output. Trace to the first bad artifact. Add the minimum trace fields beside each step.
Bridge. We can now build, budget, and debug multi-agent systems. But mature engineers also know what these systems still cannot do well. Not failures to fix — genuine open problems. That honest view is next. → 11-honest-admission.md