10. Cost, Quality, and Latency Tradeoffs — The frontier you must defend in numbers¶
~12 min read. A better answer is not better business if it arrives too late or costs too much. Memorise the 2026 price/latency table — interviewers will draw it on the whiteboard.
Built on the ELI5 in 00-eli5.md. the time budget — justified against user value, error cost, and SLA — is what turns a model choice into a business decision.
The triangle you cannot escape¶
Every reasoning deployment sits inside three forces.
More quality often means more compute. More compute often means more latency. Less latency often means less reasoning. You rarely maximise all three. Choose the point that matches the task.
Live chat support wants low latency more than max quality. Audit analysis wants quality and tolerates latency. Batch reporting tolerates both. The triangle is the same; the operating point is task-specific.
The 2026 price table you should know¶
Memorise the relative shape; the exact numbers shift quarterly. Output prices, May 2026 ($/M tokens):
| Tier | Model | Input | Output | Reasoning tokens |
|---|---|---|---|---|
| Fast non-reasoning | gpt-4.1-mini, Haiku 4.5, Gemini Flash | \(0.10–\)0.40 | \(0.40–\)1.60 | n/a |
| Mid-reasoning | o4-mini, Gemini 2.5 Pro | \(1.10–\)1.25 | \(4.40–\)10.00 | billed as output |
| Frontier reasoning | GPT-5 | $1.25 | $10.00 | billed as output |
| Frontier reasoning | o3 | $2.00 | $8.00 | billed as output |
| Frontier reasoning | Claude Sonnet 4.6 | $3.00 | $15.00 | billed as output |
| Frontier reasoning | Claude Opus 4.7 | $5.00 | $25.00 | billed as output |
| Pro-tier reasoning | o3-pro | $20.00 | $80.00 | billed as output |
| Pro-tier reasoning | GPT-5.5 Pro / 5.2 Pro | $20.00 | \(80.00–\)120.00 | billed as output |
| Open reasoning | DeepSeek R1-0528 (Together) | $0.55 | $2.19 | included |
| Open distilled | R1-Distill-Llama-70B | $0.20 | $0.80 | included |
The ratio between fast and frontier is ~25–60× on output. The ratio between fast and pro-tier is 200–500×. Routing matters precisely because of these ratios.
P50 latency typical numbers (May 2026)¶
| Config | TTFT | P50 total |
|---|---|---|
| gpt-4.1-mini, single short response | < 1 s | 1–2 s |
| Claude Haiku 4.5 streaming | < 1 s | 1–3 s |
o3 effort=low |
2–4 s | 5–10 s |
o3 effort=high |
6–10 s | 30–45 s |
GPT-5.5 Pro effort=high |
~8 s | ~67 s |
Claude Sonnet 4.6 budget=8k |
3–5 s | 10–20 s |
Claude Opus 4.7 effort=high |
4–8 s | ~28 s |
Claude Opus 4.7 effort=max |
6–12 s | 60–90+ s |
Gemini 3 Pro thinkingLevel=HIGH |
4–8 s | ~52 s |
| Gemini 2.5 Pro thinking dynamic | 2–6 s | 8–30 s |
TTFT (time to first token) is the critical number for chat UX. Anything > 5 s without a streaming indicator feels broken. Reasoning models silently consume reasoning tokens before any visible output, which is why the perceived "the model is dead" sensation is worse than the wall-clock latency suggests. Always stream a status message: "Thinking…" when reasoning is enabled.
Think in expected business loss, not raw accuracy¶
Raw model accuracy is not the right axis. What a mistake costs is. A wrong movie recommendation is cheap. A wrong tax calculation is not. A wrong refund authorisation has chargeback and audit consequences. Combine:
This is the framework that wins senior interviews.
Worked example: expected cost of two model choices¶
Serve 10,000 requests/day. Two options.
Fast model. Accuracy 88%, output cost $0.50/M, average 800 output tokens, avg error business cost $25.
- Wrong:
10000 × 0.12 = 1,200errors/day →1,200 × $25 = $30,000. - Serving:
10000 × 800 / 1e6 × $0.50 = $4. - Total: $30,004/day.
Reasoning model. Accuracy 96%, output cost $15/M (Sonnet 4.6), 800 visible + 4,000 reasoning = 4,800 effective output tokens, avg error cost $25.
- Wrong:
10000 × 0.04 = 400errors/day →400 × $25 = $10,000. - Serving:
10000 × 4800 / 1e6 × $15 = $720. - Total: $10,720/day.
The reasoning model is ~$19,300/day cheaper because error cost dominates. The expensive model is the cheaper system. Now flip it — if error cost is $0.50 instead of $25:
- Fast total:
1200 × 0.50 + 4 = $604. - Reasoning total:
400 × 0.50 + 720 = $920. - Fast wins. Error cost was too low to justify reasoning serving cost.
The lesson: the right answer depends entirely on error cost. Reasoning models win when errors are expensive. Fast models win when errors are cheap.
Latency changes the answer¶
A customer support chatbot loses users at P95 > 8 seconds. A research agent is fine at 60 seconds. Same model can be the right answer in one product lane and the wrong answer in another.
The fix: route by lane (chapter 09) and use streaming aggressively. Anthropic's extended thinking emits thinking blocks first, then answer tokens — surface "Analysing..." UI immediately so the user perceives action even before the answer streams. OpenAI hides reasoning; you need a "Thinking, this may take 30 seconds..." indicator for any reasoning-enabled call.
For tasks > 30 s, prefer async / background mode — OpenAI explicitly recommends background for o3-pro because of timeouts. Anthropic recommends batch API for budgets > 32K thinking tokens.
A practical decision checklist¶
When choosing a model for a workflow, run through:
- Estimate error cost. Average $ harm per wrong answer. Be honest.
- Estimate latency budget. P50 and P95 user-acceptable times.
- Estimate volume. Daily requests.
- Estimate accuracy lift. From golden-set A/B between candidate models.
- Compute expected cost per request including error cost.
- Pick the cheapest option that clears the quality bar.
That sentence — cheapest option that clears the bar — is the discipline. Not the fanciest, not the prestige choice. The cheapest with adequate quality.
Pareto frontier — visualising the choice¶
Plot every candidate model on cost-vs-accuracy axes. The Pareto frontier is the curve of models where no other model is both cheaper and better. Models off the frontier are strictly dominated and should be eliminated.
accuracy
▲
│ ● Opus 4.7 high
│ ● GPT-5.5
│ ● Sonnet 4.6 medium
│ ● o4-mini
│ ● gpt-4.1-mini
│
└────────────────────────────────────→ cost
In 2026 the frontier has flattened — open models like R1-Distill-70B sit on the frontier for medium tasks, pulling closed prices down. The frontier shifts every quarter; rebuild it whenever a major model ships.
Where this lives in the wild¶
- Customer support chatbots (Intercom Fin, Ada) — prefer fast non-reasoning models for the 90% of intent and FAQ traffic; reserve reasoning models for refund eligibility, account-specific multi-step issues. Error cost typically \(1–\)5 per wrong answer; routing is heavy.
- TurboTax Live Assist — error cost on tax math is high (\(50–\)500 plus brand damage); reasoning model use justified for the compute-heavy paths; batch overnight runs for return finalisation use deeper effort.
- GitHub Copilot agent mode — background coding tasks tolerate 5+ minutes; uses Opus 4.7 or GPT-5.5 with high effort; inline autocomplete uses fast non-reasoning models because TTFT < 200 ms is required.
- Perplexity Pro Deep Research — users explicitly opt into longer waits (~3 min) for higher quality; cost-per-query justified by the subscription tier; latency is part of the UX, not a problem.
- Healthcare documentation assistants (Abridge, Suki) — wrong summary has compliance cost in HIPAA-regulated environments; reasoning model use justified; human review on high-risk encounter types regardless of model.
Pause and recall¶
- What are the three forces in the triangle, and which two usually trade against each other?
- In the worked example, what flipped the answer from "use reasoning model" to "use fast model"?
- What's the typical P50 latency of o3 at
effort=high? - What does "cheapest option that clears the quality bar" mean in practice, and why is it the production discipline?
Interview Q&A¶
Q: Your app serves 1M reasoning-class queries per day at P95 ≤ 4 seconds. Walk me through model selection.
A: Constraints first. 4 s P95 rules out o3 at effort=high (P50 ~30 s), Opus 4.7 max (60 s+), and any pro-tier. Candidates: GPT-5 at medium effort (~12 s P50, often outside SLA), Sonnet 4.6 at effort=low (~8 s P50), Gemini 3 Pro thinkingLevel=LOW (~6 s P50), o4-mini (~5 s P50). At 1M/day, output cost dominates: o4-mini at \(4.40/M output × 1.6K avg out × 1M = ~\)7K/day. Gemini at \(12/M = ~\)19K. Decision: o4-mini or Gemini Flash 2.5 with thinking, tested on golden set for quality. Cascade in front for the 60–70% of traffic that needs no reasoning — drop costs by another ~5×. Final estimate: \(1.5K–\)3K/day with cascade. Defend the choice with cost-per-request, P95 measured, and accuracy on golden set.
Common wrong answer to avoid: "We'll use o3-pro for quality" — fails the 4 s SLA before you start. The constraint set rules out the prestige answer; senior loops grade you on whether you noticed.
Q: Error cost is $0.10 per wrong answer. Volume is 10M/day. Why does reasoning probably not pay here? A: Wrong-cost ceiling per day: 10M × $0.10 = $1M total errors-cost budget if accuracy is zero. Reasoning model at, say, $15/M output × 5K avg out × 10M = $750K/day in serving cost. So even if reasoning model reduces errors from 10% to 1%, it saves only $900K in error cost while costing an extra $700K in serving over a fast model. Math is borderline. Fast model at 90% accuracy: 10% × 10M × $0.10 = $100K errors + $30K serving = $130K. Reasoning model: 1% × 10M × $0.10 + $750K serving = $760K. Fast wins by $630K/day. When error cost is low, serving cost dominates; fast model wins almost regardless of accuracy gap.
Common wrong answer to avoid: "Higher accuracy always wins" — at low error cost and high volume, the serving math beats the accuracy math. Senior engineers carry this mental model.
Q: A reasoning model produces 4,000 hidden CoT tokens per call. Why does this matter for cost forecasting? A: Three reasons. Direct cost — those tokens are billed at output rate. At Claude Sonnet 4.6 ($15/M out), 4K reasoning tokens = $0.06 per call just in hidden CoT, before you account for the visible answer. Context-window consumption — hidden tokens count toward your context limit; long agent loops can run out of context faster than expected. Cache misses — provider-side prompt caching covers input prefixes; reasoning tokens are output, so they never benefit from input caching. Net: budget for ~5× the visible-token cost when reasoning is enabled, and watch context-window utilisation in long sessions.
Common wrong answer to avoid: "We're billed for visible output only" — false; reasoning tokens are output tokens by every major provider's billing definition.
Q: How do you decide whether a 2-point accuracy lift on a benchmark justifies a 4× cost increase? A: Three numbers. Volume × error cost × accuracy delta — gives the gross benefit. Volume × serving-cost delta — gives the gross cost. Confidence interval on the accuracy lift — if the benchmark only measured 200 examples, 2 points might not be statistically real. Run the math: e.g. at 1M/day, $50 error cost, 0.02 accuracy lift → \(1M/day gross benefit. If the 4× cost increase is +\)5K/day, ship it. If +$2M/day, don't. The accuracy lift only matters in context of volume and error cost; never argue from the benchmark alone.
Common wrong answer to avoid: "Always pay for accuracy" or "Never pay 4× for 2 points" — both are bad heuristics. The right answer is the math.
Apply now (5 min)¶
Pick one production workflow. Compute its current cost-per-day. Compute expected error cost using a recent week's wrong-answer rate × estimated business cost per error. Total: current operating cost. Now estimate two alternatives — one cheaper model and one more expensive — and recompute total cost. Plot on cost-vs-quality. The Pareto-optimal model is your candidate switch.
Sketch from memory: Draw the quality-cost-latency triangle. Mark which corner each of your product lanes lives in.
Bridge. We can price reasoning only if we can measure it honestly. Benchmarks lie, traces fool us, eval is its own discipline. Next. → 11-evaluating-reasoning.md