09. Cost & Latency in Multi-Agent: the multiplication problem
~10 min read. Every extra agent, every handoff, every retry multiplies tokens and time. Control it or drown.
Built on the ELI5 in 00-eli5.md. The CEO — the orchestrator — must watch budgets. More departments means more memos, more meetings, more spend.
1) Why multi-agent multiplies cost
Picture first. Think of one packet moving through four desks. Every desk reads context, writes output, and forwards a fresh packet. That forwarded packet is the handoff. It carries tokens, not magic. See.
user brief
-> orchestrator (+800)
-> agent 1 [1500 in | 500 out | 2000 total]
-> orchestrator (+800)
-> agent 2 [1200 in | 600 out | 1800 total]
-> orchestrator (+800)
-> agent 3 [1400 in | 400 out | 1800 total]
-> orchestrator (+800)
-> agent 4 [1000 in | 300 out | 1300 total]
Now compare with one strong single agent. Single agent means one call with 4000 input tokens and 1000 output tokens, so total single-agent tokens = 4000 + 1000 = 5000. Now do the pipeline math exactly. Agent totals are 2000, 1800, 1800, and 1300; worker subtotal = 2000 + 1800 + 1800 + 1300 = 6900. Orchestrator overhead per stop = 800 tokens, and four stops means 800 × 4 = 3200. Grand total = 6900 + 3200 = 10100 tokens.
So the pipeline uses 10100 tokens, while the single agent uses 5000. Multiplier = 10100 / 5000 = 2.02×. That is before retries. Simple, no? If agent 3 fails once, you often rerun agent 3, agent 4, and the controller work around them. This is why the CEO must count packets, not just agents. One more department is never only one more call. It also means one more context package and one more chance to repeat.
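The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the token figures from the diagram; the `pipeline_tokens` helper and its `rerun_from` parameter are illustrative names, and the retry model (the failed worker and every later worker run again, each with its own orchestrator stop) is an assumption about how a rerun is charged.

```python
# (input, output) tokens per worker, copied from the diagram above.
WORKERS = [(1500, 500), (1200, 600), (1400, 400), (1000, 300)]
ORCHESTRATOR_OVERHEAD = 800        # tokens added at every orchestrator stop
SINGLE_AGENT_TOKENS = 4000 + 1000  # one call: input + output


def pipeline_tokens(workers, overhead, rerun_from=None):
    """Total tokens for one pass, plus an optional rerun of the tail.

    rerun_from is a 0-based worker index. Assumption: that worker and
    every later one run again, each with its own orchestrator stop.
    """
    per_stop = [inp + out + overhead for inp, out in workers]
    total = sum(per_stop)
    if rerun_from is not None:
        total += sum(per_stop[rerun_from:])
    return total


clean = pipeline_tokens(WORKERS, ORCHESTRATOR_OVERHEAD)
one_retry = pipeline_tokens(WORKERS, ORCHESTRATOR_OVERHEAD, rerun_from=2)

print(clean, clean / SINGLE_AGENT_TOKENS)          # 10100 2.02
print(one_retry, one_retry / SINGLE_AGENT_TOKENS)  # 14800 2.96
```

Under that retry assumption, a single failure at agent 3 pushes the multiplier from 2.02× to 2.96×, which is why retries deserve their own line in the budget.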
2) The cheap-router, expensive-generator pattern
Look. Most requests do not deserve the premium brain. Many requests are boring, repetitive, and easy to classify. So what to do? Put a cheap router first, then wake the bigger model only when needed.
incoming ticket
-> cheap router decides: simple or complex
-> simple: template / canned action
-> complex: expensive generator writes reply
Use these prices. GPT-4o-mini routes at $0.15 per million tokens, and GPT-4o generates at $2.50 per million tokens; treat both as flat rates for this estimate. Assume 1000 support tickets arrive, a full generated reply uses 1500 tokens, and 60% of the tickets are simple.
Case A: everything goes to GPT-4o. Total GPT-4o tokens = 1000 × 1500 = 1500000 tokens. Convert to millions = 1500000 / 1000000 = 1.5, so total cost = 1.5 × $2.50 = $3.75.
Case B: route first, then send only complex tickets to GPT-4o. Assume the router uses 200 tokens per ticket. Router tokens = 1000 × 200 = 200000. Convert to millions = 200000 / 1000000 = 0.2, so router cost = 0.2 × $0.15 = $0.03. Now apply the assumption that 60% of tickets are simple. Simple tickets = 1000 × 0.60 = 600, complex tickets = 1000 - 600 = 400. GPT-4o handles only those 400 tickets, so GPT-4o tokens = 400 × 1500 = 600000. Convert to millions = 600000 / 1000000 = 0.6, so GPT-4o cost = 0.6 × $2.50 = $1.50.
Total routed cost = $0.03 + $1.50 = $1.53. Savings = $3.75 - $1.53 = $2.22. Savings rate = $2.22 / $3.75 = 59.2%. That is almost the full 60% of expensive inference you avoided, minus a tiny routing fee.
This pattern works because classification is cheap and long generation is not. Let the department that only sorts work stay cheap. Let the CEO send only the hard packets upstairs. That is architecture, not just pricing.
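The two cases can be turned into a small calculator so the assumptions are easy to vary. This is a minimal sketch: the prices are the flat per-million rates quoted above, and `cost_usd`, `routed_cost`, and their parameters are illustrative names. Simple tickets are assumed to cost nothing beyond the routing call.

```python
# Flat USD rates per million tokens, as quoted in the text.
PRICE_PER_MILLION = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}


def cost_usd(model, tokens):
    """Dollar cost of a token count at the model's flat rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]


def routed_cost(tickets, simple_share, router_tokens=200, reply_tokens=1500):
    """Every ticket pays the router; only complex tickets pay the generator."""
    router = cost_usd("gpt-4o-mini", tickets * router_tokens)
    complex_tickets = round(tickets * (1 - simple_share))
    generator = cost_usd("gpt-4o", complex_tickets * reply_tokens)
    return router + generator


baseline = cost_usd("gpt-4o", 1000 * 1500)   # Case A: everything on GPT-4o
routed = routed_cost(1000, simple_share=0.60)  # Case B: route first
savings_rate = (baseline - routed) / baseline

print(f"${baseline:.2f} ${routed:.2f} {savings_rate:.1%}")  # $3.75 $1.53 59.2%
```

Lowering `simple_share` shows how quickly the saving shrinks when fewer tickets can be handled by templates.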
3) Latency control: sequence depth is the enemy
Picture before math again. If four agents stand in one line, the user feels every wait. Each pause stacks on the next pause. Latency loves depth. See.
Sequential: A 2s -> B 4s -> C 4s -> D 2s = 12s
Parallelized middle: A 2s -> [B 4s || C 4s] -> D 2s = 8s
The trick is not only faster models. The trick is shallower waiting chains. When two steps are independent, run them together. Research gathering and style-check setup can overlap. A likely policy pack or file list can be pre-fetched before the worker asks.
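One way to run independent steps together in Python is `asyncio.gather`. The sketch below mimics the A, B, C, D chain from the diagram with sleeps standing in for model calls; the `agent` coroutine, the `SCALE` factor, and the `timed` helper are illustrative assumptions, not a real agent framework.

```python
import asyncio
import time

SCALE = 0.02  # shrink the diagram's seconds so the demo finishes quickly


async def agent(name, seconds):
    """Stand-in for one agent call: wait, then return the agent's name."""
    await asyncio.sleep(seconds * SCALE)
    return name


async def sequential():
    # Four waits in a row: the user feels every one of them.
    return [
        await agent("A", 2),
        await agent("B", 4),
        await agent("C", 4),
        await agent("D", 2),
    ]


async def parallel_middle():
    a = await agent("A", 2)
    # B and C do not depend on each other, so they wait at the same time.
    b, c = await asyncio.gather(agent("B", 4), agent("C", 4))
    d = await agent("D", 2)
    return [a, b, c, d]


def timed(coro_fn):
    """Run a coroutine function; return its result and full-scale seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, (time.perf_counter() - start) / SCALE


for fn in (sequential, parallel_middle):
    order, seconds = timed(fn)
    print(fn.__name__, order, f"about {seconds:.0f}s at full scale")
```

Both runs return the same four results in the same order; the measured times should land near the 12s and 8s from the diagram, give or take event-loop overhead.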
Ways to reduce latency are plain.
- Parallelize independent workers.
- Pre-fetch likely resources before agents request them.
- Use smaller prompts for intermediate steps.
- Stop early when confidence is already sufficient.
- Keep human approval only at meaningful gates.
Now read that operationally. Parallelism removes waiting edges. Pre-fetching removes idle gaps. Smaller prompts cut transfer and model time. Early stop kills useless extra turns. Meaningful approvals stop humans from becoming accidental bottlenecks.
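Early stopping, for instance, can be sketched as a bounded refinement loop. Here `refine` is a hypothetical callable standing in for one agent turn that returns a draft plus a confidence score; the 0.85 threshold and the 4-turn cap are illustrative assumptions, not recommendations.

```python
from typing import Callable, Tuple


def run_with_early_stop(
    refine: Callable[[str], Tuple[str, float]],
    draft: str,
    threshold: float = 0.85,
    max_turns: int = 4,
) -> Tuple[str, int]:
    """Call refine until confidence clears the threshold or turns run out.

    Returns the final draft and the number of turns actually spent.
    """
    turns = 0
    for turns in range(1, max_turns + 1):
        draft, confidence = refine(draft)
        if confidence >= threshold:
            break  # good enough: skip the remaining turns
    return draft, turns


# Fake agent for the demo: each turn appends "+" and confidence grows with length.
def fake_refine(text: str) -> Tuple[str, float]:
    revised = text + "+"
    return revised, 0.3 * len(revised)


print(run_with_early_stop(fake_refine, ""))  # ('+++', 3)
```

The fake agent clears the threshold on its third turn, so the fourth turn is never paid for.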
A common mistake is tuning one prompt very hard. But the bigger loss is often the chain shape itself. A 10% faster stage inside a deep chain helps little. Removing one whole waiting edge helps more. That is why the handoff design and topology must be discussed together.
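To put numbers on that claim, the sketch below takes the 2s, 4s, 4s, 2s chain from the diagram in this section and compares a 10% speedup of one stage against running the two middle stages together. The `wall_clock` helper and the nested-list layout are illustrative assumptions.

```python
def wall_clock(stages):
    """Stages run one after another; steps inside a stage run in parallel."""
    return sum(max(stage) for stage in stages)


baseline = wall_clock([[2.0], [4.0], [4.0], [2.0]])           # deep chain
tuned_stage = wall_clock([[2.0], [4.0 * 0.9], [4.0], [2.0]])  # B made 10% faster
reshaped = wall_clock([[2.0], [4.0, 4.0], [2.0]])             # B and C together

for label, seconds in [
    ("baseline", baseline),
    ("tuned", tuned_stage),
    ("reshaped", reshaped),
]:
    print(f"{label}: {seconds:.1f}s ({1 - seconds / baseline:.0%} saved)")
# baseline: 12.0s (0% saved)
# tuned: 11.6s (3% saved)
# reshaped: 8.0s (33% saved)
```

Tuning one stage by 10% saves about 3% end to end, while removing one waiting edge saves a third.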
4) Worked example: budget planning for a content workflow
Production teams need a budget sheet, not vibes. So let us build one. Suppose a content workflow uses four specialists and one controller. Use this estimate.
| Agent | Model | Input tokens | Output tokens | Cost | Latency |
|---|---|---|---|---|---|
| Research | GPT-4o-mini | 800 | 400 | $0.0002 | 1.5s |
| Writer | GPT-4o | 1500 | 800 | $0.0060 | 3.0s |
| Reviewer | GPT-4o-mini | 1200 | 300 | $0.0002 | 1.2s |
| Publisher | GPT-4o-mini | 600 | 200 | $0.0001 | 0.8s |
| Orchestrator | GPT-4o-mini | 500×4 | 100×4 | $0.0010 | 2.0s |
Now add cost step by step. Research + Writer = $0.0002 + $0.0060 = $0.0062. Add Reviewer = $0.0062 + $0.0002 = $0.0064. Add Publisher = $0.0064 + $0.0001 = $0.0065. Add Orchestrator = $0.0065 + $0.0010 = $0.0075. Rounded total cost = about $0.008 per workflow.
Now add latency step by step. Research + Writer = 1.5s + 3.0s = 4.5s. Add Reviewer = 4.5s + 1.2s = 5.7s. Add Publisher = 5.7s + 0.8s = 6.5s. Add Orchestrator = 6.5s + 2.0s = 8.5s sequential.
Now make the estimate more realistic. Let reviewer prep run during research, which hides 1.2 seconds of checklist and rubric loading. Let 0.5 seconds of orchestration overlap with writer startup. Saved wall-clock time = 1.2 + 0.5 = 1.7s, so parallel total = 8.5 - 1.7 = 6.8s.
See what happened. Cost barely changed, but latency changed a lot. Parallelism often buys time more than money. Cheap routing often buys money more than time. Different levers, different budgets.
This is the estimate production teams need before launch. If product says, "Keep cost under one cent and time under seven seconds," you now have a testable plan. Without this sheet, people argue from vibes. With this sheet, the CEO can make trade-offs in public.
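The budget sheet is easy to keep as code so the check runs before every launch. This is a minimal sketch that takes the Cost and Latency columns from the table as given and tests them against the "one cent, seven seconds" target quoted above; `SHEET`, `HIDDEN_SECONDS`, and `within_budget` are illustrative names.

```python
# (agent, cost in USD, latency in seconds), copied from the table above.
SHEET = [
    ("Research", 0.0002, 1.5),
    ("Writer", 0.0060, 3.0),
    ("Reviewer", 0.0002, 1.2),
    ("Publisher", 0.0001, 0.8),
    ("Orchestrator", 0.0010, 2.0),
]
HIDDEN_SECONDS = 1.2 + 0.5  # reviewer prep + orchestration overlap from the text

total_cost = sum(cost for _, cost, _ in SHEET)
sequential_latency = sum(latency for _, _, latency in SHEET)
parallel_latency = sequential_latency - HIDDEN_SECONDS


def within_budget(cost, latency, max_cost=0.01, max_latency=7.0):
    """True when both the spend cap and the latency cap are respected."""
    return cost <= max_cost and latency <= max_latency


print(f"cost ${total_cost:.4f}, sequential {sequential_latency:.1f}s, "
      f"overlapped {parallel_latency:.1f}s")
print(within_budget(total_cost, sequential_latency))  # False
print(within_budget(total_cost, parallel_latency))    # True
```

The sequential plan misses the seven-second target and the overlapped plan meets it, while the cost side passes in both cases.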
5) Budget as architecture constraint
Do not treat budget as finance paperwork. Treat it as design. Token budget, latency budget, and cost budget belong in the system brief. Write them before coding starts. Simple, no?
Useful limits are plain and practical.
- Max turns per agent.
- Max total tool calls per workflow.
- Max orchestration depth.
- Max end-to-end latency.
- Max spend per workflow.
Add one more rule. Define what happens when a limit is hit. Should the system stop, downgrade, escalate, or ask a human? If that fallback is missing, the budget is fake. See.
A safe design brief might say this. Max 2 retries per agent. Max 10 total tool calls. Max depth 4. Max latency 9 seconds. Max spend $0.02 per workflow. Those numbers are not universal, but the habit is universal.
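A brief like that can be enforced in code rather than left in a document. The sketch below is one possible shape, not a standard API: `WorkflowBudget`, `BudgetExceeded`, `guarded`, and the `FALLBACKS` policy are all hypothetical names, and the defaults are the example numbers from the brief above.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a workflow crosses one of its limits."""

    def __init__(self, limit, used, allowed):
        super().__init__(f"{limit}: {used} exceeds {allowed}")
        self.limit = limit


# Hypothetical fallback policy: one explicit action per limit.
FALLBACKS = {
    "spend": "downgrade",
    "latency": "stop",
    "tool_calls": "stop",
    "depth": "escalate",
    "retries": "ask_human",
}


@dataclass
class WorkflowBudget:
    max_retries_per_agent: int = 2
    max_tool_calls: int = 10
    max_depth: int = 4
    max_latency_s: float = 9.0
    max_spend_usd: float = 0.02
    spend_usd: float = 0.0
    latency_s: float = 0.0
    tool_calls: int = 0
    retries: dict = field(default_factory=dict)

    def charge(self, cost_usd=0.0, latency_s=0.0, tool_calls=0, depth=0):
        """Record one step, then fail fast if any limit is crossed."""
        self.spend_usd += cost_usd
        self.latency_s += latency_s
        self.tool_calls += tool_calls
        for limit, used, allowed in [
            ("spend", self.spend_usd, self.max_spend_usd),
            ("latency", self.latency_s, self.max_latency_s),
            ("tool_calls", self.tool_calls, self.max_tool_calls),
            ("depth", depth, self.max_depth),
        ]:
            if used > allowed:
                raise BudgetExceeded(limit, used, allowed)

    def retry(self, agent):
        """Count a retry for one agent and enforce the per-agent cap."""
        count = self.retries.get(agent, 0) + 1
        self.retries[agent] = count
        if count > self.max_retries_per_agent:
            raise BudgetExceeded("retries", count, self.max_retries_per_agent)


def guarded(budget, **step):
    """Charge a step; return "continue" or the fallback for the limit hit."""
    try:
        budget.charge(**step)
        return "continue"
    except BudgetExceeded as exc:
        return FALLBACKS[exc.limit]
```

For example, `guarded(WorkflowBudget(), cost_usd=0.0075, latency_s=6.8, tool_calls=3, depth=2)` returns `"continue"`, while a step that pushes spend past $0.02 returns `"downgrade"`. The point of the `FALLBACKS` table is that every limit has a named action, so hitting a cap is a decision and not a crash.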
If you do not set limits, loops stay invisible. Retries stay invisible. Controller chatter stays invisible. Then the bill arrives before the bug report. And when finance asks why, nobody knows which department caused the leak. That is why cost and latency belong beside quality, not below it.
Where this lives in the wild
- OpenAI API billing dashboards — an AI platform lead often sees multi-agent flows jump to 3-5× token usage, so budget caps become mandatory.
- Anthropic Claude enterprise usage tiers — a procurement or platform owner sets per-workflow spend ceilings before allowing many Claude calls in one chain.
- AWS Bedrock with cost allocation tags — a cloud architect tags each specialist so finance can see which agent role is driving the bill.
- Stripe support automation — a support operations manager uses a cheap classifier to route roughly 70% of tickets to templates and saves premium generation for the hard cases.
- Bloomberg terminal AI assistants — a product manager working under a 2-second SLA trades model size against parallelism very aggressively.
Pause and recall
- Why does one extra handoff usually add more than one extra call?
- In the router-generator pattern, where does most of the dollar saving actually come from?
- Why is sequence depth usually a bigger latency villain than one slightly slow stage?
- What budget limits would you write before launching a five-agent workflow?
Interview Q&A
Q1. Why use a cheap router before an expensive generator, not just a better prompt on the big model? A. Because routing removes whole expensive calls, while prompt tuning usually trims only one call a little. Common wrong answer to avoid: "Because small models are always accurate enough for every task."
Q2. Why optimize orchestration depth before micro-optimizing one worker prompt? A. Depth multiplies waiting across the whole workflow, while a local prompt tweak helps only one segment. Common wrong answer to avoid: "Because latency only depends on the slowest single model."
Q3. Why set hard spend caps per workflow instead of relying on monthly budget alerts? A. Workflow caps stop runaway loops in real time; monthly alerts only tell you after the burn happened. Common wrong answer to avoid: "Because monthly billing dashboards update too slowly to matter."
Q4. Why not collapse every multi-agent system back into one large model if cost rises? A. One model may cut coordination cost, but it can lose specialization, observability, and control over failure boundaries. Common wrong answer to avoid: "Because multi-agent is always more advanced, so it must always be better."
Apply now (5 min)
Exercise. Pick one workflow you already know. Write a tiny budget sheet for it. Choose four agents or fewer. For each one, note model, token estimate, cost, and latency. Then decide one place to route cheaply and one place to parallelize.
Sketch from memory.
workflow: __________________________
agent 1: __________ cost: ________ latency: ________
agent 2: __________ cost: ________ latency: ________
agent 3: __________ cost: ________ latency: ________
agent 4: __________ cost: ________ latency: ________
cheap router at: ___________________
parallel pair: _____________________
max spend: _________________________
max latency: _______________________
Bridge. Budget is set. But what happens when things go wrong? Multi-agent failures are harder to find. The error might be in agent 2's output, but it only surfaces in agent 4's result. Next: how to find the broken handoff. → 10-debugging-multiagent.md