02. The cost-latency-quality triangle — pick two, the third bends¶
~15 min read. Every model selection sits on a three-cornered tradeoff. You can buy quality with cost. You can buy quality with latency. You cannot have all three at the top — the math forbids it. The matching habit is the discipline of knowing which corner you are sacrificing on this particular ticket.
Builds on 01-the-tier-anatomy.md. Now that the four cooks have names, the next question is what forces the kitchen manager to pick one over another.
1) Hook — the same task at three tiers, three different numbers¶
A product team needs to summarize a 4000-token customer-support transcript into a 250-token executive summary, once per closed ticket, roughly 50,000 times a day. The team has three candidate cooks.
Watch the numbers.
                                HAIKU 4.5    SONNET 4.6    OPUS 4.7
                                ─────────    ──────────    ────────
quality (judge score, 1-100)    78/100       91/100        94/100
P50 latency (full response)     640 ms       1900 ms       5200 ms
$/call (4000 in, 250 out)       $0.0013      $0.0150       $0.0788
$/day @ 50K                     $65          $750          $3,940
$/year                          $24K         $274K         $1.44M
Three different answers depending on which corner you weigh. The team that optimizes for quality picks Opus at $1.44M/year. The team that optimizes for latency picks Haiku at 640ms. The team that optimizes for cost picks Haiku at $24K/year.
Now the real question — what is the quality floor for this task? If a summary scoring below 85 is unacceptable to executives, Haiku is off the table regardless of how cheap it is. Sonnet clears the floor at 91. Opus clears it more comfortably at 94, but the marginal three points cost 5.3x more per call.
That is the triangle. You picked quality and cost: Sonnet at $274K/year with 1900ms latency. You sacrificed the third corner, Haiku-level latency. If the workload were a real-time chat where 1900ms felt slow, you would be forced into a different corner: either accepting Haiku quality, or keeping Sonnet quality and serving it via streaming with optimistic UI.
Simple, no? The triangle is not a slogan. It is the actual shape of every production model decision.
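The hook's arithmetic is short enough to check by hand, but a small script makes it easy to rerun with your own volume. A minimal sketch, using only the per-call prices and judge scores from the table above; the function names are illustrative, not from any library.

```python
# Figures copied from the hook table: 4000 tokens in, 250 out, 50,000 calls a day.
USD_PER_CALL = {"Haiku 4.5": 0.0013, "Sonnet 4.6": 0.0150, "Opus 4.7": 0.0788}
JUDGE_SCORE = {"Haiku 4.5": 78, "Sonnet 4.6": 91, "Opus 4.7": 94}
CALLS_PER_DAY = 50_000


def usd_per_day(model: str) -> float:
    return USD_PER_CALL[model] * CALLS_PER_DAY


def usd_per_year(model: str) -> float:
    return usd_per_day(model) * 365


def marginal(cheaper: str, pricier: str) -> tuple[float, int]:
    """Return (price ratio per call, judge points gained) for stepping up a tier."""
    ratio = USD_PER_CALL[pricier] / USD_PER_CALL[cheaper]
    points = JUDGE_SCORE[pricier] - JUDGE_SCORE[cheaper]
    return ratio, points


for model in USD_PER_CALL:
    print(f"{model:<11} ${usd_per_day(model):>8,.0f}/day  ${usd_per_year(model):>12,.0f}/year")

ratio, points = marginal("Sonnet 4.6", "Opus 4.7")
print(f"Sonnet -> Opus: {ratio:.1f}x per call for {points} judge points")
```

Swapping in your own `CALLS_PER_DAY` is usually the fastest way to see whether the yearly gap is a rounding error or a headcount.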
2) The metaphor — three knobs and a fixed budget¶
The kitchen has three knobs on the wall. Quality. Cost. Latency. There is a single budget — call it "what the business will tolerate." You can push two knobs to the top. The third is forced down by the budget.
The master chef pushes the quality knob to the top. Pushes the latency knob all the way down because she is slow. Pushes the cost knob all the way down because she is expensive. So the manager gets quality, and loses both speed and savings.
The apprentice pushes the latency and cost knobs to the top. Pushes the quality knob down because she only handles labelled work. The manager who picks her keeps the kitchen fast and cheap, but the menu shrinks.
The workhorse cook sits in the middle on all three knobs. He is the compromise that wins most of the menu. Not as good as the master chef on the tasting menu. Not as fast as the apprentice on chopping onions. But the only cook whose three knobs are all acceptable on most tickets.
The matching habit is reading the ticket and asking which knob this ticket forgives. A long batch summarization forgives latency: you can run it overnight, so latency drops to zero priority and quality wins. A real-time chat forgives quality slightly: users tolerate a 90/100 reply more than a ten-second wait. A high-volume classification forgives quality above its floor: at 100M calls a month, you choose the cheap cook even if the expensive cook would have been 2 points more accurate.
Every ticket has a forgiveness profile. Reading it is the manager's job.
3) The anatomy of the triangle¶
Three corners. Three edges. Each edge is a workload type.
The quality-cost edge (sacrifice latency): batch summarization, overnight document generation, async report writing, weekly digests. Latency does not matter, so you buy quality with waiting time instead of money: pick a mid or small tier with deeper reasoning, or batch frontier-tier calls.
The quality-latency edge (sacrifice cost): real-time chat with hard quality requirements, voice assistants that must sound smart, live trading explainers, medical decision support. The user is waiting, the answer must be good. You pay for both with money.
The cost-latency edge (sacrifice quality, within limits): high-volume classification, intent routing, log triage, spam filtering. The volume is so high that cost dominates, latency must be low to keep the queue from backing up, and quality only needs to clear a floor.
Now, the rule that traps most engineers. The triangle has a quality floor per task. Below that floor, the cook is not "lower quality"; they are unusable. You cannot keep trading quality away for cost. There is a step function: above the floor, the cook works; below it, the workload breaks.
The matching habit is — find the floor, then optimize the other two corners.
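The floor-first rule can be sketched as a two-step filter. This is an illustration under stated assumptions: the `Tier` class and `pick_tier` function are hypothetical names, and the numbers are the ones from the hook table in section 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    quality: float        # same scale as the floor (here: judge score, 1-100)
    usd_per_call: float
    p50_latency_ms: int


# Numbers from the hook table in section 1.
TIERS = [
    Tier("Haiku 4.5", 78, 0.0013, 640),
    Tier("Sonnet 4.6", 91, 0.0150, 1900),
    Tier("Opus 4.7", 94, 0.0788, 5200),
]


def pick_tier(tiers, floor: float, max_latency_ms: Optional[int] = None) -> Optional[Tier]:
    """Step 1: drop every tier below the floor. Step 2: optimize the other corners."""
    usable = [t for t in tiers if t.quality >= floor]
    if max_latency_ms is not None:
        usable = [t for t in usable if t.p50_latency_ms <= max_latency_ms]
    # Cheapest survivor wins; None means the ticket asks for all three corners at once.
    return min(usable, key=lambda t: t.usd_per_call, default=None)


print(pick_tier(TIERS, floor=85))                       # floor only
print(pick_tier(TIERS, floor=85, max_latency_ms=1000))  # floor plus a tight latency budget
```

A `None` result is the triangle saying no: with a floor of 85 and a 1000ms budget, no single tier in the table qualifies, so something has to bend (streaming, routing, or the floor itself).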
4) Worked example — an agent step at three tiers¶
A document-review agent processes legal contracts. One step in the agent loop is "given this contract clause and the company policy, decide if the clause is compliant; if not, propose a redline." This step is invoked roughly 200,000 times a month across thousands of contracts.
Measured numbers from a bake-off:
                                HAIKU 4.5    SONNET 4.6    OPUS 4.7
                                ─────────    ──────────    ────────
correctness (judge graded)      64%          89%           95%
avg in tokens                   2200         2200          2200
avg out tokens                  600          550           500
$/call                          $0.0014      $0.0149       $0.0705
$/month (200K)                  $280         $2,980        $14,100
P50 latency                     720 ms       2100 ms       4800 ms
The legal team's quality floor is hard — anything below 85% correctness generates downstream rework that costs more than the model bill. Haiku at 64% is off the table. The remaining choice is Sonnet at $2,980/month or Opus at $14,100/month.
The marginal cost of moving from Sonnet to Opus is $11,120/month, or $133,440/year, for a 6-point correctness lift (89→95). Is that worth it?
Three sub-questions decide.
One: what is the cost of a wrong answer? A missed redline that goes into a signed contract creates downstream legal risk. If the misses Opus avoids cost more than the extra $11,120 a month across the 200K calls, Opus pays for itself. The legal team estimates each missed redline at roughly $5,000 in remediation time. Six points of extra correctness means 12,000 fewer wrong answers a month, but most of those are not high-stakes. Say serious misses are 1% of calls: that is 2,000 serious misses avoided a month, worth $10M at $5,000 each.
That is Opus territory. The cost-quality math flips clearly when the cost of a wrong answer is high.
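The first sub-question reduces to a break-even count. A minimal sketch using the bake-off numbers and the legal team's $5,000 estimate; the 1% serious-miss share is the text's own "say 1%" assumption, and the variable names are illustrative.

```python
CALLS_PER_MONTH = 200_000
SONNET = {"correct": 0.89, "usd_per_call": 0.0149}
OPUS = {"correct": 0.95, "usd_per_call": 0.0705}
USD_PER_SERIOUS_MISS = 5_000      # legal team's remediation estimate
SERIOUS_SHARE_OF_CALLS = 0.01     # assumption from the text: "say 1%"

# What Opus costs on top of Sonnet for the same monthly volume.
extra_spend = (OPUS["usd_per_call"] - SONNET["usd_per_call"]) * CALLS_PER_MONTH

# Wrong answers avoided by the 6-point correctness lift.
errors_avoided = (OPUS["correct"] - SONNET["correct"]) * CALLS_PER_MONTH

# Only the high-stakes slice carries the $5,000 price tag.
serious_avoided = SERIOUS_SHARE_OF_CALLS * CALLS_PER_MONTH
value_of_lift = serious_avoided * USD_PER_SERIOUS_MISS

# How many serious misses Opus must prevent each month to cover its own bill.
break_even_misses = extra_spend / USD_PER_SERIOUS_MISS

print(f"extra spend:       ${extra_spend:,.0f}/month")
print(f"errors avoided:    {errors_avoided:,.0f}/month")
print(f"value of lift:     ${value_of_lift:,.0f}/month")
print(f"break-even misses: {break_even_misses:.1f}/month")
```

The break-even line is the useful one: at these prices Opus needs to prevent only about two to three serious misses a month, so the decision is insensitive to the exact serious-miss share.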
Two: is there a router that catches the 6%? What if you ran Sonnet on all 200K and escalated only the cases where Sonnet self-reported low confidence? If 15% of calls escalate to Opus, the bill becomes:
Sonnet on all 200K:       200,000 × $0.0149 = $2,980
Opus on 15% escalated:     30,000 × $0.0705 = $2,115
routed total:                                 $5,095/month
$5,095 vs $14,100 is a 64% saving over pure-Opus, and if the escalation classifier is accurate, correctness lifts from 89% (pure Sonnet) to close to the 95% pure-Opus number. The router is almost always the right answer when the marginal frontier cost is large.
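The escalation bill is worth parameterizing, because the escalation rate is the number you will argue about. A sketch assuming the escalate-after-Sonnet pattern described here, where every call pays for Sonnet and escalated calls pay for Opus as well; the function names are hypothetical.

```python
def routed_monthly_cost(calls: int, base_usd: float, frontier_usd: float,
                        escalation_rate: float) -> float:
    """Every call hits the base tier; escalated calls also pay the frontier tier."""
    return calls * base_usd + calls * escalation_rate * frontier_usd


def break_even_escalation_rate(base_usd: float, frontier_usd: float) -> float:
    """Escalation rate at which routing costs the same as sending everything to frontier."""
    return 1 - base_usd / frontier_usd


CALLS = 200_000
SONNET_USD, OPUS_USD = 0.0149, 0.0705

routed = routed_monthly_cost(CALLS, SONNET_USD, OPUS_USD, escalation_rate=0.15)
pure_opus = CALLS * OPUS_USD
saving = 1 - routed / pure_opus

print(f"routed:    ${routed:,.0f}/month")
print(f"pure Opus: ${pure_opus:,.0f}/month")
print(f"saving:    {saving:.0%}")
print(f"router stops saving money above {break_even_escalation_rate(SONNET_USD, OPUS_USD):.0%} escalation")
```

The sketch ignores the cost of the confidence signal itself and says nothing about correctness; it only bounds the bill.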
Three — what is the latency budget? If this step is part of an interactive review session where lawyers wait for the agent, 4800ms feels slow. Sonnet at 2100ms feels acceptable. If it is an async batch overnight, both feel fine. Latency is part of the choice.
5) Latency is not one number¶
A trap that catches engineers new to production AI — latency is treated as a single number. It is not. Three numbers matter, and they tell different stories.
LATENCY METRIC                WHAT IT MEASURES                                                         WHO CARES
first-token latency (TTFT)    time from request to first streamed token                                realtime chat, voice, IDE
full-response latency         time to last token of model response                                     async UI, agent step, tool call
end-to-end latency            time from request to last action, including tool calls and follow-ups    user-facing experience
A 2026 example. Streaming Opus 4.7 emits the first token at roughly 800ms P50. The full response on a 500-token output takes about 5200ms. If the response is part of a three-step agent loop with a tool call between each step, end-to-end is roughly 15-25 seconds. Three very different numbers, all called "latency" in casual conversation.
For real-time chat, TTFT is what users feel. Below 1 second feels instant. 1-2 seconds feels live. Above 2 seconds feels broken — even if the full response is fast, the user has already given up waiting for the cursor to move.
For agent loops, end-to-end latency dominates. A user clicking "summarize this email thread" does not see streaming tokens — they see a spinner for 6, 8, 12 seconds. Optimizing TTFT here is wasted; optimize the total round trip.
For batch jobs, none of these matter individually — throughput (tokens per second per dollar) is the right metric.
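Measuring TTFT and full-response latency separately takes only a timer around the stream. A minimal sketch, assuming your client exposes the response as an iterable of text chunks; `fake_stream` is a hypothetical stand-in for a real model stream, with delays shortened so the example runs quickly.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and full-response latency in seconds."""
    start = clock()                  # start the clock before the first chunk is requested
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start   # first streamed token arrived
        count += 1
    full = clock() - start           # last token arrived
    return {"ttft_s": ttft, "full_response_s": full, "chunks": count}


def fake_stream(first_token_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: a slow first token, then a steady trickle."""
    time.sleep(first_token_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield "tok"


result = measure_stream(fake_stream(first_token_s=0.05, per_token_s=0.01, n_tokens=5))
print(f"TTFT {result['ttft_s'] * 1000:.0f} ms, "
      f"full response {result['full_response_s'] * 1000:.0f} ms")
```

End-to-end latency needs the same timer one level up, wrapped around the whole agent loop including tool calls, not around a single model response.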
6) The "good enough" floor — the most underused concept¶
Most engineers reason about model quality on a continuous curve. "Sonnet is better than Haiku. Opus is better than Sonnet." True, but useless. The right mental model is — most tasks have a quality floor below which the model is unusable, and above which marginal quality stops mattering.
[Figure: quality (vertical axis) against cost (horizontal axis, cheap to expensive). The curve rises through Sonnet to Opus. A marker shows the floor for THIS task; Haiku falls below the floor.]
The job of the matching habit is — find the floor, then pick the cheapest cook above it. The floor lives in business reality, not benchmarks. A classifier that is 91% accurate is fine for ad targeting. The same 91% accuracy on medical triage is malpractice. The floor depends on the cost of a wrong answer.
Three questions find the floor.
- What is the worst output the business can absorb? If a customer gets a wrong refund, what does the company eat?
- What rate of wrong outputs is tolerable per 1000 calls?
- How is "wrong" detected — and what is the catch rate?
A workload with high cost-per-wrong-output and low catch rate forces a high floor — frontier-tier territory. A workload with low cost-per-wrong-output and high catch rate tolerates a low floor — small-tier territory.
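The three questions combine into one number: the cost of uncaught wrong outputs per 1000 calls. A rough sketch; the dollar figures and catch rates in the examples are hypothetical placeholders chosen only to contrast a cheap-error workload with an expensive-error one.

```python
def uncaught_cost_per_1000(error_rate: float, catch_rate: float,
                           usd_per_uncaught_error: float) -> float:
    """Expected cost of wrong outputs that slip past detection, per 1000 calls."""
    uncaught_errors = 1000 * error_rate * (1 - catch_rate)
    return uncaught_errors * usd_per_uncaught_error


def max_error_rate(budget_usd_per_1000: float, catch_rate: float,
                   usd_per_uncaught_error: float) -> float:
    """Invert the formula: the highest error rate the business can absorb (the floor)."""
    exposure = 1000 * (1 - catch_rate) * usd_per_uncaught_error
    if exposure <= 0:
        return 1.0                   # everything is caught or free, so any error rate fits
    return min(1.0, budget_usd_per_1000 / exposure)


# Same 91% accuracy (9% error), two very different workloads. Dollar figures are made up.
ad_targeting = uncaught_cost_per_1000(error_rate=0.09, catch_rate=0.0, usd_per_uncaught_error=0.01)
triage = uncaught_cost_per_1000(error_rate=0.09, catch_rate=0.5, usd_per_uncaught_error=10_000)

print(f"ad targeting: ${ad_targeting:,.2f} per 1000 calls")
print(f"triage:       ${triage:,.2f} per 1000 calls")
print(f"floor for a $500/1000 budget, 90% catch rate, $50 per miss: "
      f"{max_error_rate(500, 0.9, 50):.0%} error rate")
```

The model never appears in these functions. That is the point: the floor is set by the system around the model, and only then compared against bake-off results.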
Mid-content recall¶
- The triangle's three corners are quality, cost, and latency. What is the rule about how many you can have at the top?
- Why is "latency" not one number, and which latency metric matters for an agent loop?
- The "good enough" floor — what business question finds it?
7) The streaming caveat and the agent caveat¶
Two situations where the naive triangle reasoning goes wrong.
Streaming changes what users perceive as latency. If your interface streams tokens as they are generated, TTFT is what users feel — not the full response. A 5200ms full response from Opus that streams its first token at 800ms feels much faster than a 2100ms full response from Sonnet that returns all at once. For chat interfaces, optimizing TTFT can let you choose a slower-per-call model without users noticing.
Agents multiply latency by the number of steps. A three-step agent loop with Sonnet (2100ms each) is 6300ms. With Opus (5200ms each) it is 15600ms, a 9.3-second extra wait. The triangle bends harder for agents because every step pays the latency corner again. A model that is roughly 2.5x slower per call is still 2.5x slower per user interaction, but the absolute gap grows from 3.1 seconds on a single call to 9.3 seconds across the loop, which is much more visible.
Both caveats push you toward different routing patterns. Streaming pushes toward frontier-when-it-streams-well. Agents push toward small/mid tiers for sub-steps and frontier only for the planning step.
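The agent caveat is plain addition, which makes mixed-tier loops easy to price in latency terms. A small sketch using the per-call latencies quoted in this section; `loop_latency_ms` is an illustrative helper and it assumes steps run strictly one after another.

```python
def loop_latency_ms(step_latencies_ms, tool_latency_ms: int = 0) -> int:
    """Total latency of a sequential agent loop with a tool call between consecutive steps."""
    tool_calls = max(len(step_latencies_ms) - 1, 0)
    return sum(step_latencies_ms) + tool_latency_ms * tool_calls


SONNET_MS, OPUS_MS = 2100, 5200

sonnet_loop = loop_latency_ms([SONNET_MS] * 3)
opus_loop = loop_latency_ms([OPUS_MS] * 3)
# Frontier for the planning step only, mid tier for the two sub-steps.
mixed_loop = loop_latency_ms([OPUS_MS, SONNET_MS, SONNET_MS])

print(f"all Sonnet: {sonnet_loop} ms")
print(f"all Opus:   {opus_loop} ms (+{opus_loop - sonnet_loop} ms)")
print(f"mixed:      {mixed_loop} ms")
```

The mixed loop lands much closer to the all-Sonnet number than to the all-Opus one, which is why the planning-step-only pattern is the usual compromise.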
8) Failure modes — when the triangle is misread¶
SIGNAL → FIX
────────────
"we picked Opus for quality" without measuring floor → measure: does Sonnet clear the floor for this task? If yes, demote.
"users complain it's slow" but full-response is 1500ms → check TTFT vs full-response. If the full response is fast, the model is fine. If TTFT is the problem, switch to streaming or a smaller tier.
cost reported as $/call, not $/workload → re-aggregate at $/month per workload. The right denominator decides the tier.
frontier model on batch workload with no batch API → batch APIs charge roughly 50% of the realtime price; use them. Or drop a tier; latency does not matter for batch.
mid-tier on a hard reasoning step with rising error rate → audit error rate by case complexity. If hard cases dominate, promote.
router rule based on input length only → length is a poor proxy. Use complexity signals: entity count, ambiguity, prior failures.
quality floor not written down → write it. "<5% error rate measured by X judge on Y eval set." The number decides the tier.
latency budget is "fast" → write a real number. "P95 < 2s TTFT, P95 < 8s end-to-end." Then optimize.
The pattern across every row — measure the floor, measure the latency in the right units, measure the workload at the right denominator. Most triangle misreads come from arguing about an unmeasured corner.
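A written-down floor and latency budget can be checked mechanically against production samples. A minimal sketch: the `Budget` fields mirror the example thresholds in the last two rows of the table, the percentile uses the nearest-rank method, and the sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_error_rate: float      # e.g. 0.05 for "<5% error rate"
    p95_ttft_s: float          # e.g. 2.0 for "P95 < 2s TTFT"
    p95_end_to_end_s: float    # e.g. 8.0 for "P95 < 8s end-to-end"


def p95(samples):
    """Nearest-rank 95th percentile; integer arithmetic avoids float rounding."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def violations(budget: Budget, is_error, ttft_s, end_to_end_s):
    """Return a list of human-readable budget violations (empty means all clear)."""
    found = []
    error_rate = sum(is_error) / len(is_error)
    if error_rate >= budget.max_error_rate:
        found.append(f"error rate {error_rate:.1%} breaks the floor")
    if p95(ttft_s) >= budget.p95_ttft_s:
        found.append(f"P95 TTFT {p95(ttft_s):.1f}s over budget")
    if p95(end_to_end_s) >= budget.p95_end_to_end_s:
        found.append(f"P95 end-to-end {p95(end_to_end_s):.1f}s over budget")
    return found


# Invented samples: 3% errors, healthy TTFT, a slow tail on end-to-end.
budget = Budget(max_error_rate=0.05, p95_ttft_s=2.0, p95_end_to_end_s=8.0)
report = violations(
    budget,
    is_error=[False] * 97 + [True] * 3,
    ttft_s=[0.8] * 95 + [2.5] * 5,
    end_to_end_s=[6.0] * 90 + [9.0] * 10,
)
print(report or "all corners within budget")
```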
9) Worked numerical example — when 8x more is worth it, and when it isn't¶
A search startup runs a query-rewriting model — it rewrites a user's natural language question into an optimized search query. The current pipeline uses Sonnet 4.6 at $0.005 per call. A bake-off shows Opus 4.7 at $0.041 per call delivers a 4-point improvement in downstream search satisfaction (measured by click-through and dwell time).
8.2x more cost. Is it worth it?
volume per day: 2,000,000 calls
Sonnet cost/day: $10,000
Opus cost/day: $82,000
delta: $72,000/day, $26.3M/year
The 4-point search-satisfaction lift on 2M daily users — does it generate $26M/year of additional revenue? The team estimates that satisfied users generate $1.50 of ad revenue per session and the lift translates to a 2% revenue increase. 2M users * $1.50/user/day * 2% = $60,000/day, $21.9M/year.
The math is close — Opus delivers $21.9M of value for $26.3M of extra cost. It does not pay for itself.
But what about a router? Run Sonnet on the 80% of queries that are simple, Opus on the 20% that are complex.
Sonnet on 80%:     1,600,000 × $0.005 = $8,000/day
Opus on 20%:         400,000 × $0.041 = $16,400/day
routed total:                           $24,400/day
$24,400 - $10,000 = $14,400/day delta, $5.3M/year. If the router catches roughly 70% of the quality improvement (a typical number when the router classifier is decent), the lift is $42,000/day in revenue against $14,400/day in cost. Now Opus pays for itself, nearly three times over.
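Both plans in this example fit one function: extra cost over all-Sonnet, extra revenue from the share of the lift captured, and the difference. A sketch using only the section's numbers; `plan` is an illustrative name, and the cost of running the router classifier is ignored.

```python
CALLS_PER_DAY = 2_000_000
SONNET_USD, OPUS_USD = 0.005, 0.041
# Team estimate from the text: $1.50 per user per day, 2% revenue increase at full lift.
FULL_LIFT_USD_PER_DAY = CALLS_PER_DAY * 1.50 * 0.02


def plan(opus_share: float, lift_captured: float):
    """Return (extra cost, extra revenue, net) per day relative to all-Sonnet."""
    cost = CALLS_PER_DAY * ((1 - opus_share) * SONNET_USD + opus_share * OPUS_USD)
    extra_cost = cost - CALLS_PER_DAY * SONNET_USD
    extra_revenue = FULL_LIFT_USD_PER_DAY * lift_captured
    return extra_cost, extra_revenue, extra_revenue - extra_cost


for label, share, captured in [("pure Opus", 1.0, 1.0), ("router 80/20", 0.2, 0.7)]:
    extra_cost, extra_revenue, net = plan(share, captured)
    print(f"{label:<13} cost +${extra_cost:>7,.0f}  revenue +${extra_revenue:>7,.0f}  "
          f"net {net:>+9,.0f} $/day")
```

The two inputs to stress-test are `opus_share` and `lift_captured`: the router only wins while the share of lift it captures stays well above the share of traffic it sends to Opus.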
This is the deepest lesson of the triangle. The naive "use the frontier" or "use the mid tier" choices both lose. The router lets you have most of the quality at a fraction of the cost — bending the triangle in a way no single-model choice can.
Where this lives in the wild¶
The triangle shows up everywhere model decisions are made — but usually under different names.
- Anthropic Console — quality/cost/latency exposed side by side per model.
- OpenAI usage dashboard — cost-per-day broken down by model.
- Vertex AI Pricing Calculator — input/output cost per million by tier.
- AWS Bedrock — cost per model surfaced before model selection.
- OpenRouter — explicit cost-per-token shown next to each model in the routing UI.
- LiteLLM — built-in fallback and cost-tracking across tiers.
- Helicone, Langfuse, LangSmith, Braintrust — production traces tagged with model, cost, latency, judge score.
- Vellum, PromptLayer, Pezzo — A/B testing infrastructure that compares the same prompt across tiers.
- Cursor, Windsurf, GitHub Copilot — small models for autocomplete, larger for agent mode, exposing the tradeoff in product UI.
- Glean — small models for query reformulation, larger for synthesis.
- Notion AI — small for routing, larger for content generation.
- OpenAI Playground — side-by-side panel for comparing model outputs at identical prompts.
- Together AI / Fireworks AI / Replicate — open-weight pricing pages with explicit latency benchmarks.
- Modal / Banana / Anyscale — self-hosted GPU pricing where the cost-latency tradeoff is bent by GPU choice and batching.
- Anthropic Batch API, OpenAI Batch API: 50% discount in exchange for up to 24-hour latency. The cost-latency tradeoff made explicit.
- MMLU, GSM8K, HumanEval, MTEB — public quality benchmarks per model.
- BIG-bench — reasoning quality across model families.
- Berkeley Function-Calling Leaderboard — tool-use quality across tiers.
- HumanEval-Plus, SWE-Bench, AppWorld — code and agent benchmarks where the quality-cost curve is visible across providers.
- Artificial Analysis — independent pricing-and-latency comparison site for all major models.
- Together AI Speed Leaderboard — tokens-per-second per model on the same hardware.
- vLLM, TensorRT-LLM, llama.cpp — open serving stacks where the triangle is bent by inference optimizations.
Pause and recall¶
- State the law of the triangle in one sentence.
- Name three "edges" of the triangle (quality-cost, quality-latency, cost-latency) and one workload type for each.
- What is the difference between TTFT and full-response latency, and which matters for streaming chat?
- In the search-startup example, why did the pure-Opus plan lose money but the router plan make money?
- Give the three questions that find the "good enough" floor.
- Why does the triangle bend harder for agents than for single calls?
- When does a batch API change the cost-latency tradeoff materially?
Interview Q&A¶
Q1. Walk me through how you would decide between Sonnet 4.6 and Opus 4.7 for a given step.
A. Three measurements. First, establish the quality floor for the task, writing it as a number ("error rate below X% on eval set Y"). Second, run a bake-off on a representative sample. If Sonnet clears the floor, the default is Sonnet. Third, measure the cost of a wrong answer and the marginal cost of Opus over Sonnet. If Opus's marginal cost is less than Sonnet's marginal wrong-answer cost across the volume, Opus is justified. Otherwise, route: Sonnet by default, Opus on escalation.
Trap: Picking based on intuition or benchmark scores rather than floor-and-marginal-cost reasoning.
Q2. A PM says "our chat feature feels slow." How do you debug it?
A. Measure three latencies separately: TTFT, full-response, end-to-end. TTFT slow means the model is taking long to start streaming, fix with smaller tier or warmer routing. Full-response slow but TTFT fast means streaming is masking the cost, but agent latency or non-streaming downstream will show it. End-to-end slow with fast TTFT and full-response means tool calls or follow-up calls dominate; fix by parallelizing or pre-fetching.
Trap: Treating latency as one number. The fix is different per metric.
Q3. When does batch pricing change the tier decision?
A. Whenever the workload tolerates 24-hour latency. Both Anthropic and OpenAI offer roughly 50% discounts on batch APIs. That makes frontier batch calls compete with mid-tier realtime calls on cost. For overnight analytics, weekly digests, document processing, and any async pipeline, batch frontier often dominates realtime mid.
Trap: Assuming batch is only for "small" workloads. Frontier batch is often the right choice for high-quality async work.
Q4. Your team is using Opus for everything. How do you find what to demote?
A. Pull traces for the last 30 days, group by step type, compute per-step average error rate against ground truth and per-step volume. The demote candidates are high-volume, low-error-rate steps. Run a bake-off: same step, Sonnet vs Opus on a 200-example eval set, paired McNemar test for significance. If Sonnet matches Opus on the step, demote and save the delta.
Common wrong answer to avoid: "Demote everything to Sonnet and see what breaks." That is an A/B in production without controls, so you cannot attribute regressions cleanly.
Q5. How do you think about the cost of a wrong answer?
A. Two factors: expected cost per wrong output, times rate of wrong outputs not caught. Both depend on the catch infrastructure. If a wrong refund is caught by a human reviewer in 90% of cases, the cost is the review time. If it ships to production and a customer disputes, the cost is the refund plus complaint handling plus brand. Cost-per-wrong-output is not a model property; it is a system property.
Trap: Treating cost-per-wrong-output as a number you can read off a benchmark.
Q6. A streaming chat with a 1.5s TTFT: is that fast or slow?
A. Depends on context. For pure conversational chat with humans, 1.5s TTFT is on the edge; users tolerate up to about 2s before perceived staleness. For an agent that is also calling tools, 1.5s TTFT is one of many latency components and not the bottleneck. For voice, 1.5s is way too slow; voice needs sub-500ms TTFT. The right answer is "what is the user doing while they wait?"
Trap: Quoting a generic number like "users tolerate 2s" without asking the use case.
Q7. Why might a 5x more expensive model still pay for itself?
A. Three patterns. One: the marginal quality lift unlocks revenue larger than the cost delta (search relevance, conversion rate). Two: the marginal quality lift reduces downstream rework or human-handoff cost (legal review, customer support escalation). Three: the marginal quality lift reduces brand risk (one viral bad output costs more than years of model spend). The math depends on which pattern applies.
Trap: Justifying frontier by quality alone with no business linkage.
Q8. What is the most overrated metric in model selection, and what is the most underrated?
A. Overrated: generic benchmark scores like MMLU. They are aggregates over thousands of questions that have nothing to do with your workload, and correlation with production quality is weak. Underrated: variance. Average quality is one number; the rate of catastrophic failures (toxic, PII leak, completely wrong) is the number that decides whether the model ships. Two models with equal MMLU can have wildly different tail behavior.
Trap: Citing MMLU in an interview as a selection criterion. The interviewer will ask what you used your own eval set for.
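The paired McNemar test mentioned in Q4 only looks at the eval examples where the two models disagree. A minimal sketch of the exact (binomial) two-sided version in pure Python; the 200-example outcome counts below are hypothetical, and a stats library would normally replace the hand-rolled function.

```python
from math import comb


def mcnemar_exact(sonnet_correct, opus_correct):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Returns (sonnet_only_wins, opus_only_wins, p_value).
    """
    pairs = list(zip(sonnet_correct, opus_correct))
    b = sum(1 for s, o in pairs if s and not o)   # Sonnet right, Opus wrong
    c = sum(1 for s, o in pairs if o and not s)   # Opus right, Sonnet wrong
    n = b + c
    if n == 0:
        return b, c, 1.0                          # the models never disagree
    # Under the null, each disagreement is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical bake-off on 200 paired examples.
sonnet = [True] * 170 + [True] * 4 + [False] * 14 + [False] * 12
opus = [True] * 170 + [False] * 4 + [True] * 14 + [False] * 12

b, c, p = mcnemar_exact(sonnet, opus)
print(f"Sonnet-only wins: {b}, Opus-only wins: {c}, p = {p:.4f}")
```

With these made-up counts the disagreement is lopsided enough that "Sonnet matches Opus" would be hard to defend; a p-value near 1 on a high-volume step is what a demote candidate looks like.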
Apply now (5 min)¶
Step 1 — write the floor. Pick one model step in your system. Write its quality floor as a number — error rate, judge score, or task-specific metric. If you cannot write a number, that step does not have a measurable floor and the rest of the matching habit is guesswork.
Step 2 — measure three corners. For your current model on that step, write down its quality (vs floor), cost per call, and latency (TTFT and full-response). One row, four columns.
Step 3 — find the bend. Imagine moving up one tier and down one tier. Estimate the new quality, cost, latency. Which direction stays above the floor? Which direction is cheaper? Which is the right move for this step?
Bridge. You know the triangle exists. The next question is how to actually measure quality across cooks so the tier decision is grounded in evidence and not vibes. The bake-off — same eval set, same prompts, same judge, statistical significance.