11. Honest Admission — what multi-agent still gets wrong

~12 min read. Multi-agent demos look smooth. Real systems still surprise us in ordinary ways.

Built on the ELI5 in 00-eli5.md. The org chart — who talks to whom — can be drawn neatly on a whiteboard. The real execution is messier. Here is what we honestly admit.

1) Planning depth is still shallow

See. Agents plan well for one or two steps. After that, confidence often outruns real foresight. A smooth paragraph can still hide a brittle plan. Fluency is not depth.

The hard part is hidden dependency discovery. An agent sees the obvious next move, then misses the quiet prerequisite behind it. That is why the CEO can sound wise and still steer badly.

Look at a common software flow.

goal: ship the fix
      │
      ├── search the files
      ├── edit the function
      ├── run the tests
      └── remember the hidden migration step  ← often missed

Each visible step looks reasonable. The missing migration ruins the whole result. That is the problem.
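One way to make the missed step visible is to check a proposed plan against a written-down prerequisite map. The sketch below is only an illustration: the step names and the `PREREQUISITES` table are hypothetical, and the check can only catch dependencies that someone has already recorded. Truly hidden ones stay hidden until they are added to the map.

```python
# Hypothetical prerequisite map: each step lists what must run before it.
PREREQUISITES = {
    "search_files": [],
    "edit_function": ["search_files"],
    "run_migration": ["edit_function"],
    "run_tests": ["edit_function", "run_migration"],
    "ship_fix": ["run_tests"],
}


def missing_prerequisites(plan, prerequisites):
    """Return (step, prerequisite) pairs where the prerequisite did not run earlier in the plan."""
    done = set()
    problems = []
    for step in plan:
        for prereq in prerequisites.get(step, []):
            if prereq not in done:
                problems.append((step, prereq))
        done.add(step)
    return problems


# The plan from the diagram above, minus the migration the agent forgot.
agent_plan = ["search_files", "edit_function", "run_tests", "ship_fix"]
print(missing_prerequisites(agent_plan, PREREQUISITES))
# [('run_tests', 'run_migration')]
```

Every step in `agent_plan` looks reasonable on its own. Only the comparison against the map shows that the tests would run before the migration.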

Longer horizons worsen the issue. Local choices can be sensible, while the full path stays poor. One agent picks the fastest route. Another agent inherits stale context. Then the handoff carries a weak plan forward.

This is why demos can mislead teams. The output sounds organized. The strategy underneath may be shallow. The org chart looks neat. The actual execution is thin. Simple, no?

2) Compound error still multiplies quietly

Now what is the structural problem? Small misses do not stay small in a chain. They multiply. That is arithmetic.

Take a four-link workflow. Assume each step is 90% accurate and that the steps fail independently of each other. That sounds strong in isolation. Now calculate the full chain.

Step 1 succeeds with probability 0.9.
After step 2: 0.9 × 0.9 = 0.81.
After step 3: 0.81 × 0.9 = 0.729, so success is 72.9%.
After step 4: 0.729 × 0.9 = 0.6561, so end-to-end success is 65.61%, or about 65.6%.

See how quietly the chain degrades. No single step looked alarming. Together, the system misses roughly one time in three. Mostly correct is not reliably correct.
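The same arithmetic fits in a few lines of code. This is a minimal sketch that assumes every step succeeds or fails independently; the function name `chain_success` is made up for illustration.

```python
import math


def chain_success(step_accuracies):
    """End-to-end success probability, assuming independent steps."""
    return math.prod(step_accuracies)


four_links = [0.9] * 4
print(f"{chain_success(four_links):.4f}")  # 0.6561

# Watch the chain degrade as more 90% steps are added.
for n in (1, 2, 3, 4, 8):
    print(n, f"{chain_success([0.9] * n):.3f}")
```

If real steps fail together rather than independently, the true number will differ, so treat the product as a rough estimate and not a guarantee.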

In practice, the links are not only model outputs. They include tool choice, argument formation, tool reliability, state quality, and stopping policy. All of them stack.

tool choice
   ↓
argument formation
   ↓
tool response
   ↓
state update
   ↓
stop or continue decision

Prompt harder if you want. The structural math still remains. This is a chained probabilistic system. Better wording may help some links. It does not erase compounding. That is why the department can look competent while the full workflow disappoints.
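To see why a better prompt helps one link without erasing compounding, compare the chain before and after a single improvement. The reliabilities below are invented for illustration, not measurements, and the links are assumed to be independent.

```python
import math

# Hypothetical per-link reliabilities; illustrative numbers only.
baseline = {
    "tool_choice": 0.95,
    "argument_formation": 0.90,
    "tool_response": 0.97,
    "state_update": 0.95,
    "stop_decision": 0.92,
}


def end_to_end(links):
    """Product of all link reliabilities, assuming independence."""
    return math.prod(links.values())


def with_improvement(links, name, new_value):
    """Return a copy of the chain with one link improved."""
    improved = dict(links)
    improved[name] = new_value
    return improved


# Suppose better wording lifts argument formation from 0.90 to 0.99.
better_prompt = with_improvement(baseline, "argument_formation", 0.99)

print(f"baseline:      {end_to_end(baseline):.3f}")
print(f"better prompt: {end_to_end(better_prompt):.3f}")
```

Under these made-up numbers the improved chain does better than the baseline, yet it still lands well below the 0.99 of the link that was fixed, because the other four links keep multiplying in.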

3) Interaction failures, not single-point failures

Teams love one villain. One bad prompt, one bad tool, one bad model. Real failures are usually interactions. Look.

slightly vague schema
      │
      ├──→ slightly bad argument
      ├──→ slightly wrong tool result
      └──→ slightly wrong next decision

Each step looks survivable alone. Together they become a visible miss. That is why blame hunts waste time.

A schema field is loosely named. The caller fills it almost correctly. The tool returns something almost useful. The next agent reads it too literally. The final answer is wrong, yet nobody sees one dramatic break.

This is why workflow evals matter. A unit test may show that each part works alone. The system still fails in combination. The handoff is where many quiet losses begin.
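A toy version of that gap: two components that each pass their own checks, joined by a loosely named field. Everything here is hypothetical (the refund scenario, the function names, the `amount` field); it is a sketch of the failure pattern, not of any real system.

```python
def extract_refund(ticket_text):
    """Component A: pull the refund amount from a ticket, reported in cents."""
    # Hypothetical parser: expects text like "refund 1250".
    return {"amount": int(ticket_text.split()[-1])}


def approve_refund(payload, limit_dollars=50):
    """Component B: auto-approve small refunds, reading `amount` as dollars."""
    return payload["amount"] <= limit_dollars


# Component evals: each part passes on its own terms.
assert extract_refund("refund 1250") == {"amount": 1250}  # 1250 cents
assert approve_refund({"amount": 20}) is True             # 20 dollars
assert approve_refund({"amount": 80}) is False            # 80 dollars


def workflow(ticket_text):
    """The handoff: B consumes A's payload without agreeing on units."""
    return approve_refund(extract_refund(ticket_text))


# Workflow eval: a $12.50 refund should be auto-approved, but the vague
# `amount` field is read as 1250 dollars and the request is rejected.
expected = True
actual = workflow("refund 1250")
print("workflow eval passed:", actual == expected)  # workflow eval passed: False
```

Neither function has a bug by its own definition. The miss only shows up when the end-to-end task is scored.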

Now say it plainly. Multi-agent failure is often relational. It lives in coordination, sequencing, and shared state. That is also why the org chart is not enough. Boxes and arrows show ownership. They do not guarantee understanding.

Mature teams test interactions directly. They replay traces, inspect payloads, and score end-to-end tasks. They do not stop at component pass rates. Simple, no?

4) What mature teams admit plainly

Strong teams still use multi-agent. They just stop pretending it is settled science. They say the limits out loud. That honesty improves design.

They admit these points plainly.

- Long-horizon planning is still fragile.
- Compound errors still stack.
- Tool ecosystems still drift.
- State can go stale between agents.
- Some actions still need humans.
- Coordination overhead sometimes exceeds the benefit.
- The field lacks agreed-upon evaluation standards for multi-agent quality.

Look carefully at the middle of that list. Tool drift matters more than people expect. An API, schema, or permission changes. Soon the department is following yesterday's rules.

State staleness also bites hard. Agent A reads an old snapshot. Agent B acts on a newer snapshot. Agent C combines both and sounds certain. That is how the CEO gets a polished but stale briefing.
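One hedged way to catch this is to tag every finding with the version of the shared state it was read from, and refuse to merge findings that disagree. The `Finding` class, the version numbers, and the inventory example are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str
    snapshot_version: int  # version of the shared state this agent read
    summary: str


class StaleStateError(Exception):
    """Raised when findings come from different snapshots."""


def combine(findings):
    """Merge findings only if every agent read the same snapshot version."""
    versions = {f.snapshot_version for f in findings}
    if len(versions) > 1:
        raise StaleStateError(f"mixed snapshot versions: {sorted(versions)}")
    return " ".join(f.summary for f in findings)


findings = [
    Finding("agent_a", 41, "Inventory is low."),       # read an old snapshot
    Finding("agent_b", 42, "Restock order placed."),   # acted on a newer one
]

try:
    combine(findings)
except StaleStateError as err:
    print("refusing to brief the CEO:", err)
```

A strict equality check is the simplest policy. A real system might instead re-read the state or tolerate small version gaps, which is a design choice this sketch does not cover.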

Human gates remain necessary in some flows. Refund approval, trade execution, policy escalation, and production deletion all fit here. If the action is costly, uncertain automation is not enough.
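A gate like this can be expressed as a small routing rule. The sketch below is an assumption-laden illustration: the `HIGH_STAKES` set, the `route_action` function, and the 0.9 floor are all hypothetical, and the agent's confidence is treated as a hint, not as proof.

```python
# Hypothetical action types that always need a person to sign off.
HIGH_STAKES = {
    "refund_approval",
    "trade_execution",
    "policy_escalation",
    "production_deletion",
}


def route_action(action_type, agent_confidence, confidence_floor=0.9):
    """Return 'human' or 'auto' for a proposed action."""
    if action_type in HIGH_STAKES:
        return "human"  # costly actions are gated regardless of confidence
    if agent_confidence < confidence_floor:
        return "human"  # uncertain low-stakes actions also escalate
    return "auto"


print(route_action("production_deletion", agent_confidence=0.99))  # human
print(route_action("draft_reply", agent_confidence=0.95))          # auto
print(route_action("draft_reply", agent_confidence=0.60))          # human
```

The high-stakes check comes first on purpose: a confident agent should not be able to talk its way past the gate.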

The mature rule is simple. Use a single agent first. Add more agents only when the failure pattern earns it. If coordination cost exceeds quality gain, collapse the team. Simple, no?

5) The four foundations Module 11 assumes

Before the next module, check your footing. Evaluation frameworks feel abstract when the basics are shaky. See the four foundations.

The agent loop concept from Module 09.
Multi-agent coordination basics from this module.
When to split versus stay single-agent, from file 01 and file 09.
Cost and latency tradeoffs from file 12.

If one of these feels soft, pause now. Do not race ahead. Module 11 will ask harder questions about quality, traces, and silent degradation. Those questions become easier only if the basics are clear.

Think of it this way. If you cannot explain the CEO, the department, and the handoff, your eval design will stay fuzzy. If you cannot explain why the org chart sometimes hurts, your production metrics will also stay fuzzy. That is the honest admission before we move on.

Where this lives in the wild

GitHub Copilot agents: a developer productivity engineer still sees compound errors across file search, edit, and test steps, so surprising failures still appear in production.
Zendesk AI support workflows: a support operations lead still sends complex multi-department cases to humans because planning depth collapses on edge cases.
OpenAI Deep Research: a research analyst can see a small misunderstanding amplified across long search loops, producing a confident but drifting synthesis.
Morgan Stanley wealth assistant: a risk manager keeps strict approval gates because a mostly-correct financial chain is still unacceptable for high-stakes actions.
Jasper content workflows: a content lead can combine research, writing, and fact-check steps into a polished draft that is plausible but wrong.

Pause and recall

Why can a fluent multi-agent plan still be strategically shallow?
In the 90%-per-step example, why does the chain fall to 65.6%?
Why do workflow evals catch failures that component evals miss?
Why is single-agent-first a maturity rule, not a lack of ambition?

Interview Q&A

Q: Why prefer a single agent first, not a multi-agent team by default?
A: Because each extra handoff adds coordination cost, latency, and another place for state or tool errors to accumulate. Split only when the observed failure pattern justifies the extra structure.
Common wrong answer to avoid: "Because multi-agent is overhyped." Hype is not the key issue; operational overhead and error compounding are.

Q: Why are workflow evals more important than component evals for multi-agent systems?
A: Because the visible failure often emerges from interactions between acceptable parts. The chain fails at composition, not at one isolated node.
Common wrong answer to avoid: "Because components do not matter." Components still matter, but workflow quality cannot be inferred from them alone.

Q: Why not just prompt the planner harder to fix long-horizon failure?
A: Better prompting can improve local decisions, but it does not remove hidden dependencies, stale state, or the structural brittleness of long chains.
Common wrong answer to avoid: "A smarter prompt solves planning." Prompts help, but they do not turn shallow search into robust foresight.

Q: Why keep human gates in high-stakes workflows instead of trusting confidence scores?
A: Because confidence text is not the same as verified correctness, and the cost of one compounded miss can be unacceptable.
Common wrong answer to avoid: "Because humans are always more accurate." The point is risk containment, not human perfection.

Apply now (5 min)

Exercise: Take one workflow you know. Mark four links in the chain. Write the per-step accuracy you believe each link has. Then multiply them. See whether the full pipeline still looks safe.

Sketch from memory: Draw one box for the CEO, two boxes for the department, and one arrow for the handoff. Then mark where stale state, bad tool output, and a wrong stopping decision could enter.

Bridge. We have built multi-agent systems, designed their protocols, and admitted their limits. Now comes the hardest question: how do you know any of this actually works? Next up: evaluation frameworks for LLMs, agents, and RAG in production. → ../00_ai_evals_release_gates/00-eli5.md