01. The tier anatomy: four cooks for one kitchen

~14 min read. A model tier is not a brand. It is a price-per-token, a latency profile, and a capability ceiling stapled together. Frontier, mid, small, on-device — every serious AI kitchen runs all four. That is the whole law of this page.

Builds on 00-eli5.md. The kitchen does not have one cook — it has a team. Telling them apart is the first job of the kitchen manager.

1) Hook: the support inbox that needs all three cooks

Look. A real ticket lands in a real support AI at 11:42 a.m.

"Hi, my order #4481 hasn't shipped in 6 days. Can you check what happened and refund me if it's stuck? My address is the same as last time. Thanks."

Now watch what the system actually has to do.

First, decide what kind of ticket this is. Refund? Status check? Complaint? Many candidate buckets, one label out. That is a closed-set classification problem with a fixed vocabulary of maybe twelve intents. The right cook here is the one who has chopped a thousand onions and knows the rack: fast, cheap, narrow.

Second, pull structured fields out of the prose. order_id = "4481", days_since_order = 6, requested_action = "refund_if_stuck". Extraction. Reasoning is shallow, the output schema is fixed. A workhorse cook handles this without sweating.

Third, the hard part. The system has to read the order history, see that shipment was attempted twice and failed at the courier hub, decide whether that warrants a refund versus a re-ship, check the customer's lifetime value tier, draft a reply that says the right thing in the right tone, and decide whether to escalate to a human. This is multi-step reasoning, planning, and judgement stitched together. The master chef earns her rate here.

ONE TICKET, THREE COOKS
───────────────────────
classify intent      → Haiku 4.5 / GPT-4o-mini    (~$0.0003)
extract fields       → Sonnet 4.6 / GPT-4o         (~$0.004)
plan + draft + judge → Opus 4.7 / GPT-5            (~$0.035)
                       ─────────────────────────
                       total per ticket   ~$0.040

If the kitchen ran every step on the master chef, the ticket costs roughly $0.12 instead of $0.04 — three times the bill for marginal quality gain on the cheap steps. Across a million tickets a month, that gap is eighty thousand dollars. The matching habit is not a style point. It is the P&L.
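The per-ticket arithmetic above can be sketched as a static routing table. This is a minimal illustration, not a production router: the step names and the `ROUTES` structure are hypothetical, and the dollar figures are the rough per-step estimates quoted in the hook rather than vendor prices.

```python
# Hypothetical step-to-tier routing table for the example ticket.
# Costs are the rough per-step figures from the hook, not vendor list prices.
ROUTES = {
    "classify_intent": {"tier": "small", "approx_cost_usd": 0.0003},
    "extract_fields": {"tier": "mid", "approx_cost_usd": 0.004},
    "plan_draft_judge": {"tier": "frontier", "approx_cost_usd": 0.035},
}


def ticket_cost(routes: dict) -> float:
    """Sum the approximate cost of every step for one ticket."""
    return sum(step["approx_cost_usd"] for step in routes.values())


def monthly_gap(all_frontier_cost: float, matched_cost: float, tickets: int) -> float:
    """Extra monthly spend when every step runs on the frontier tier."""
    return (all_frontier_cost - matched_cost) * tickets


matched = ticket_cost(ROUTES)
print(f"matched cost per ticket: ${matched:.4f}")
# Uses the rounded $0.12 vs $0.04 per-ticket figures from the text.
print(f"monthly gap at 1M tickets: ${monthly_gap(0.12, 0.04, 1_000_000):,.0f}")
```

The table makes the routing decision reviewable: changing which cook touches a step is a one-line edit, and the cost impact is recomputed from the same data.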

2) The metaphor: three cooks and a robot prep counter

The kitchen has a master chef, a workhorse cook, an apprentice, and a small robotic prep counter near the door. Each has a different cost, a different speed, a different ceiling, a different risk profile.

The master chef takes the hardest tickets — the ones that need invention, multi-step reasoning, a recovery from a bad sauce. She is slow because she thinks. She is expensive because she is rare.

The workhorse cook owns the menu. He has cooked the standard dishes ten thousand times. He is fast enough for the lunch rush, cheap enough that the kitchen runs in the black, and accurate enough that nine out of ten tickets never need to leave his station.

The apprentice handles the labelled work. Chop onions. Plate bread. Stamp the order ticket. She moves in seconds and costs almost nothing. Ask her to invent a dish and you will get garbage. Ask her to do the same thing a million times and she is unbeatable.

The robot prep counter is bolted to the kitchen itself. It does not even talk to the supplier. Quiet. Private. Slower than the apprentice on most jobs but the only one that works when the internet goes out, when the data must not leave the premises, when the customer is sitting in front of a phone with no signal.

Four cooks. Four price points. Four jobs. The whole module is one habit — deciding which cook touches which ticket.

3) The anatomy: four tiers in 2026

┌────────────────────────────────────────────────────────────────────────┐
│ TIER         REPRESENTATIVES (2026)              IN $/M    OUT $/M     │
├────────────────────────────────────────────────────────────────────────┤
│ FRONTIER     Opus 4.7, GPT-5, Gemini 2.5 Pro      $15        $75       │
│ MID          Sonnet 4.6, GPT-4o, Gemini 2.5 Flash $3         $15       │
│ SMALL        Haiku 4.5, GPT-4o-mini, Flash-Lite   $0.25      $1.25     │
│ ON-DEVICE    Llama 3.1 8B Q4, Phi-4, Gemma 3      $0 (cap)   $0 (cap)  │
├────────────────────────────────────────────────────────────────────────┤
│              TYPICAL P50 LATENCY (full response, 500 tokens out)       │
│ FRONTIER     3000-8000 ms                                              │
│ MID          800-2500 ms                                               │
│ SMALL        300-900 ms                                                │
│ ON-DEVICE    400-2000 ms (depends on hardware, not network)            │
└────────────────────────────────────────────────────────────────────────┘

Three priced bands plus on-device. Going by the table, frontier is 5x the price of mid, mid is 12x the price of small, and each managed band is roughly 3x slower than the one below it. That is not an accident: vendors price tiers this far apart so the choice between them is forced. If they were 1.5x apart, the cheaper tier would not be worth the quality risk. At 5x or more, the math gets honest.
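A quick check of the gaps implied by the price table, as a minimal sketch. The dictionaries simply copy the table's representative prices; the `ratio` helper is a hypothetical name used only for illustration.

```python
# Representative prices per million tokens, copied from the tier table.
INPUT_PRICE = {"frontier": 15.00, "mid": 3.00, "small": 0.25}
OUTPUT_PRICE = {"frontier": 75.00, "mid": 15.00, "small": 1.25}


def ratio(prices: dict, expensive: str, cheap: str) -> float:
    """How many times more the expensive tier costs than the cheap one."""
    return prices[expensive] / prices[cheap]


for expensive, cheap in [("frontier", "mid"), ("mid", "small"), ("frontier", "small")]:
    in_ratio = ratio(INPUT_PRICE, expensive, cheap)
    out_ratio = ratio(OUTPUT_PRICE, expensive, cheap)
    print(f"{expensive} vs {cheap}: {in_ratio:.0f}x input, {out_ratio:.0f}x output")
```

With the table's numbers, the frontier-to-mid gap is 5x, the mid-to-small gap is 12x, and the end-to-end gap between frontier and small is 60x, on both input and output prices.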

Frontier: the master chef

The frontier tier is where the strongest available capability lives. In May 2026 that means Anthropic Opus 4.7, OpenAI GPT-5, Google Gemini 2.5 Pro. These models do long-horizon reasoning, novel code, hard math, agent planning with twenty-step plans, and arbitration between conflicting evidence. They also cost roughly $15 in / $75 out per million tokens, take three to eight seconds for a typical response, and have rate limits that force production teams to negotiate.

The frontier tier is reserved for tickets that genuinely require it. "Genuinely" is doing work here. If a mid-tier model clears your quality bar, running the frontier is wasted money. If it does not, no amount of prompt engineering will rescue the cheaper tier — you need the frontier.

Mid: the workhorse

The mid tier is where most production AI lives. Anthropic Sonnet 4.6, OpenAI GPT-4o, Google Gemini 2.5 Flash. Roughly $3 in / $15 out per million, about one to two and a half seconds for a typical full response, generous rate limits.

A mid-tier model in 2026 is what a frontier model was in 2024 — strong enough for almost any standard task. Extraction, summarization, drafting, light reasoning, function-calling, RAG synthesis, classification on open vocabulary. Most "we use AI for X" features at most companies are mid-tier workloads whether the team realises it or not.

Small: the apprentice

The small tier is where high-volume, narrow, well-shaped tasks live. Anthropic Haiku 4.5, OpenAI GPT-4o-mini, Google Gemini 2.5 Flash-Lite. Roughly $0.25 in / $1.25 out per million, sub-second responses, huge rate limits.

Small does not mean dumb. Small means narrow. Given a clean prompt, Haiku 4.5 can match a frontier model on closed-set intent classification at a fraction of the cost and latency, because this is exactly the kind of fast, well-shaped work small tiers are built for. What small tiers cannot do is long-horizon reasoning, novel code synthesis, or planning. Ask them and the cracks show.

On-device: the prep counter

The on-device tier runs on the box itself. No network. No vendor invoice. No data leaving the building. Llama 3.1 8B quantized to 4-bit, Microsoft Phi-4, Google Gemma 3, Apple's on-device foundation models on Apple Intelligence-capable iPhones, iPads, and Macs.

Why it exists: privacy, offline, regulatory, and latency-floor for very narrow tasks. The customer's phone running a local model is the only way some workloads can ship. Healthcare apps in jurisdictions that forbid PHI egress. On-board car assistants in tunnels. Government workflows on air-gapped networks. Consumer apps that promise "your data never leaves the device."

The on-device tier pays its bill once — the engineering cost of packaging, quantizing, and distributing the model. After that, the marginal cost per inference is electricity. For a workload that does ten million inferences a day on user hardware, that math is unbeatable.
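The "pays its bill once" argument is a break-even calculation, sketched below. The ten million calls a day comes from the text, and the per-call token counts and prices reuse this page's classification stage on the small tier. The `$250,000` fixed engineering cost is a purely hypothetical placeholder, and electricity on user hardware is treated as zero cost to the vendor.

```python
def managed_cost_per_call(in_tokens: int, out_tokens: int,
                          in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in USD of one call to a managed model billed per million tokens."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000


def break_even_days(fixed_cost_usd: float, calls_per_day: int,
                    managed_cost_per_call_usd: float) -> float:
    """Days until a one-off on-device investment beats paying per call."""
    return fixed_cost_usd / (calls_per_day * managed_cost_per_call_usd)


# 400 tokens in, 20 out, at the small-tier prices used on this page.
per_call = managed_cost_per_call(400, 20, 0.25, 1.25)

# ASSUMPTION: $250,000 to package, quantize, and distribute the model.
days = break_even_days(250_000, 10_000_000, per_call)
print(f"managed cost per call: ${per_call:.6f}")
print(f"break-even after about {days:.0f} days")
```

The shape of the result matters more than the placeholder: the fixed cost is paid once, the managed bill scales with volume, so high-volume workloads cross the break-even line quickly.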

4) Worked numerical example: the support kitchen for a month

A mid-sized e-commerce support team handles 1,000,000 tickets a month. Each ticket goes through three stages — intent classification, field extraction, and response drafting. Average tokens per stage:

STAGE         IN TOKENS   OUT TOKENS
─────────     ─────────   ──────────
classify      400         20
extract       600         150
draft         1200        400

Now run the math three ways.

Naive — everything on the frontier (Opus 4.7).

classify:  1M * (400 * $15/M + 20 * $75/M)         = $7,500
extract:   1M * (600 * $15/M + 150 * $75/M)        = $20,250
draft:     1M * (1200 * $15/M + 400 * $75/M)       = $48,000
                                                  ─────────
                                          TOTAL    = $75,750

Lazy — everything on the mid tier (Sonnet 4.6).

classify:  1M * (400 * $3/M + 20 * $15/M)          = $1,500
extract:   1M * (600 * $3/M + 150 * $15/M)         = $4,050
draft:     1M * (1200 * $3/M + 400 * $15/M)        = $9,600
                                                  ─────────
                                          TOTAL    = $15,150

Matched — classification on Haiku, extraction and draft on Sonnet, hard cases (~5% of tickets) escalate to Opus.

classify (Haiku 4.5):
  1M * (400 * $0.25/M + 20 * $1.25/M)              = $125

extract (Sonnet 4.6):
  1M * (600 * $3/M + 150 * $15/M)                  = $4,050

draft (Sonnet 4.6, 95%) + Opus (5%):
  0.95M * (1200 * $3/M + 400 * $15/M)              = $9,120
  0.05M * (1200 * $15/M + 400 * $75/M)             = $2,400
                                                  ─────────
                                          TOTAL    = $15,695

Wait. The matched plan came out slightly more expensive than lazy. Why? Escalating 5% of drafts to Opus costs $2,400 where Sonnet would have charged $480 for the same tickets, a net increase of $1,920. Moving classification from Sonnet to Haiku saves only $1,375 ($1,500 down to $125). The difference is the $545 gap between the two totals.

That is the lesson buried in the math. Tier-matching only pays when each step is genuinely on the right tier. If the lazy plan was already correct on 95% of cases, escalating the hard 5% costs real money. But those are exactly the cases where the lazy plan would have given a wrong answer and created a refund, a complaint, or a churned customer. The $2,400 escalation buys an insurance policy that the lazy plan does not have.

Compare the matched plan to the naive plan: $15,695 vs $75,750. That is roughly $60,000 a month, or about $720,000 a year, on one kitchen. Across an org with five such kitchens, the matching habit pays a senior engineer's salary several times over.
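The three plans can be recomputed with a few lines of code, which is a useful habit before presenting numbers like these. This is a minimal sketch using only the prices and token counts from this section; the names `PRICES`, `STAGES`, and `stage_cost` are illustrative.

```python
# (input $/M tokens, output $/M tokens) as used in this worked example.
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

# (input tokens, output tokens) per ticket for each stage.
STAGES = {
    "classify": (400, 20),
    "extract": (600, 150),
    "draft": (1200, 400),
}

TICKETS = 1_000_000


def stage_cost(stage: str, model: str, tickets: int = TICKETS) -> float:
    """Monthly USD cost of running one stage on one model."""
    in_tok, out_tok = STAGES[stage]
    in_price, out_price = PRICES[model]
    return tickets * (in_tok * in_price + out_tok * out_price) / 1_000_000


naive = sum(stage_cost(stage, "opus") for stage in STAGES)
lazy = sum(stage_cost(stage, "sonnet") for stage in STAGES)
matched = (
    stage_cost("classify", "haiku")
    + stage_cost("extract", "sonnet")
    + stage_cost("draft", "sonnet", tickets=950_000)  # 95% stay on mid
    + stage_cost("draft", "opus", tickets=50_000)     # 5% escalate
)

for name, total in [("naive", naive), ("lazy", lazy), ("matched", matched)]:
    print(f"{name:8s} ${total:,.0f}")
print(f"naive minus matched: ${naive - matched:,.0f} per month")
```

Changing the escalation share or the token counts in one place shows how sensitive the totals are to each assumption.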

5) Why all four tiers in one kitchen

A common pushback from teams new to this — "can't we just pick one tier and keep it simple?" Three reasons no.

The first is the cost gradient. The 60x gap in price between Haiku and Opus is not subtle. If half your workload is classification and half is hard reasoning, running the whole thing on Opus burns five figures a month in pure overhead. Running it all on Haiku produces wrong answers on the hard half. Neither extreme is rational.

The second is the latency budget. Real-time chat needs sub-second first-token latency. Frontier models often cannot hit that for typical context sizes. Batch overnight summarization can tolerate ten seconds per call, so the cost trade-off is the only thing that matters. Different latency budgets force different tiers in the same system.
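A latency budget can be turned into a simple tier filter, sketched below. It uses the upper end of the p50 full-response ranges from the tier table as a conservative stand-in; note that those are full-response numbers, not first-token latency, so a real system would measure its own figures. The function name is hypothetical.

```python
# Upper end of the p50 full-response ranges from the tier table, in ms,
# ordered from most capable to least capable managed tier.
P50_UPPER_MS = [("frontier", 8000), ("mid", 2500), ("small", 900)]


def strongest_tier_within(budget_ms: int):
    """Return the most capable tier whose p50 upper bound fits the budget."""
    for tier, upper_ms in P50_UPPER_MS:
        if upper_ms <= budget_ms:
            return tier
    return None  # nothing fits: stream, cache, or consider on-device


for budget in (10_000, 3_000, 1_000, 500):
    print(f"{budget:>6} ms budget -> {strongest_tier_within(budget)}")
```

An overnight batch job with a ten-second budget can afford any tier, while a one-second budget leaves only the small tier, which is why the same system ends up with several tiers in play.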

The third is the failure profile. Frontier models are smarter but also slower to recover from outages — provider rate limits hit them first, queues back up, and your whole stack stalls if you have no fallback. Small models with high-throughput limits give you headroom. The kitchen that has all four cooks has redundancy the single-tier kitchen does not.
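The redundancy argument can be sketched as a fallback chain. This is an illustration under stated assumptions, not any vendor's SDK: `TierUnavailable`, `call_with_fallback`, and `fake_call_model` are all hypothetical names, and the fake model simply simulates a rate-limited frontier tier.

```python
from typing import Callable, Sequence


class TierUnavailable(Exception):
    """Hypothetical error for a rate-limited or unreachable tier."""


def call_with_fallback(prompt: str, chain: Sequence[str],
                       call_model: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each tier in order and return (tier_used, reply) from the first that answers."""
    failures = []
    for tier in chain:
        try:
            return tier, call_model(tier, prompt)
        except TierUnavailable as exc:
            failures.append(f"{tier}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(failures))


def fake_call_model(tier: str, prompt: str) -> str:
    """Stand-in for a real client; pretends the frontier tier is rate limited."""
    if tier == "frontier":
        raise TierUnavailable("rate limited")
    return f"[{tier}] reply to: {prompt}"


tier_used, reply = call_with_fallback(
    "Where is order #4481?", ["frontier", "mid", "small"], fake_call_model
)
print(tier_used, "->", reply)
```

A single-tier kitchen has no second entry in the chain, so the same outage surfaces to the user as an error instead of a slightly weaker answer.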

Mid-content recall

Why does the matched plan in the worked example end up roughly the same cost as the lazy plan, and what does that gap buy?
What is the rough price ratio between adjacent tiers, and why do vendors price them this far apart?
Name one workload where on-device is the only tier that ships.

6) Failure modes: the matching habit goes wrong

SIGNAL                                 FIX
──────                                 ───
"we just use GPT-5 for everything"  →  audit per-step cost; route the cheap
                                       steps to small/mid

"we use the cheapest model for      →  identify the steps that drive quality
 everything to save cost"              complaints; promote them one tier up

frontier model on a closed-set      →  move to small tier; frontier is wasted
 classifier                            on a 12-way intent label

mid-tier on multi-step planning     →  promote planning step to frontier; keep
 with rising error rate                sub-steps on mid

small tier on long-context          →  small tiers degrade above 32K-64K
 summarization at 100K tokens          context; promote to mid or frontier

on-device model used because        →  audit: if no privacy or offline
 "it sounded smart"                    requirement, on-device is the wrong
                                       tool

mid tier on agent tool-selection    →  most cases mid is fine; for ambiguous
 with 30+ tools                        toolbelts, promote to frontier

frontier latency blocking           →  parallelize, stream, or downshift the
 user-facing chat                      non-critical sub-steps

The pattern across every row — the wrong tier is almost always too expensive for what it gives you, or too weak for what you need. The matching habit is the diagnostic. Look at each step, ask what the worst output would cost the business, and pick the cheapest cook whose worst output is still acceptable.
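The rule "pick the cheapest cook whose worst output is still acceptable" can be sketched as a selection over bake-off results. The per-case scores below are made-up illustrative numbers, and the function and variable names are hypothetical; a real bake-off would use the team's own eval set and metric.

```python
# Managed tiers ordered from cheapest to most expensive.
TIERS_CHEAPEST_FIRST = ["small", "mid", "frontier"]


def cheapest_acceptable_tier(case_scores: dict, floor: float):
    """Return the cheapest tier whose worst per-case score still clears the floor."""
    for tier in TIERS_CHEAPEST_FIRST:
        scores = case_scores.get(tier)
        if scores and min(scores) >= floor:
            return tier
    return None  # no tier clears the bar: rework the step or add human review


# ILLUSTRATIVE scores only: one list of per-case eval scores per tier.
bakeoff = {
    "small": [0.95, 0.91, 0.40],
    "mid": [0.96, 0.93, 0.88],
    "frontier": [0.97, 0.95, 0.92],
}

print(cheapest_acceptable_tier(bakeoff, floor=0.85))
```

Using the minimum rather than the mean is the point of the sketch: the small tier looks fine on average here, but its worst case is what the business has to absorb.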

Where this lives in the wild

The four-tier landscape is not a textbook diagram. It is what every serious AI infrastructure surface exposes.

Anthropic API: Opus 4.7, Sonnet 4.6, Haiku 4.5 as the explicit three tiers; pricing pages list them side by side.
OpenAI API: GPT-5, GPT-4o, GPT-4o-mini as the analogous three.
Google AI Studio / Vertex AI: Gemini 2.5 Pro, Flash, Flash-Lite; Vertex AI also hosts open-weight Llama and Mistral.
AWS Bedrock: multiple tiers across Anthropic, Meta, Mistral, Cohere, Amazon Titan, all priced by the same anatomy.
Azure OpenAI: GPT-5, GPT-4o, GPT-4o-mini exposed with Microsoft SLAs.
Together AI: open-weight models across the managed tiers, including Llama 3.3 70B and Qwen 2.5 72B at the upper end.
Fireworks AI: similar, with latency-optimized serving stacks.
Replicate: hundreds of open-weight models, mostly small and mid tier.
Modal: serverless GPU for self-hosted open-weight workloads.
Anyscale: Ray Serve plus open-weight models for tier-flexibility.
OpenRouter: routing layer that exposes every major tier behind one API.
LiteLLM: open-source proxy with tier-aware routing logic.
Helicone, Langfuse, LangSmith, Braintrust: observability platforms that report per-tier cost and latency for production traffic.
Vellum, Pezzo, PromptLayer: prompt management with per-tier eval views.
Cursor, Windsurf, GitHub Copilot: code editors that route easy completions to small tiers and hard refactors to frontier.
Glean: enterprise search that uses small models for retrieval reranking and frontier for answer synthesis.
Notion AI: small-tier classification for routing, mid for drafting.
Anthropic Console / OpenAI Playground: the per-tier toggles you click to see cost-quality tradeoffs interactively.
MMLU, HumanEval, GSM8K, BIG-bench, MTEB: public benchmarks where the tier-quality curve is published.
Berkeley Function-Calling Leaderboard: same models, scored on tool use.
Hugging Face Inference Endpoints: open-weight hosting across all tiers.
Llama 3.1 8B / Phi-4 / Gemma 3 / Apple Intelligence: on-device tier in real shipping consumer products.

Pause and recall

Name three frontier models, three mid-tier models, three small-tier models, and three on-device models as of 2026.
Roughly what is the price ratio between Haiku 4.5 and Opus 4.7 per million tokens?
Why does a kitchen need all four tiers, not just the best one it can afford?
Give one workload that belongs on small and one that belongs on frontier, with one line of reasoning each.
What is the matching habit, in one sentence?
When does on-device beat managed even when the managed version is cheaper per token?
In the worked example, why did the matched plan come out close in cost to the lazy plan — and why was it still the right plan?

Interview Q&A

Q1. Walk me through how you would pick a model for a new feature.

A. I start by writing the feature as a sequence of model-touching steps. For each step I ask three things: what is the worst output the business can absorb, what is the latency budget, and what is the volume per day. Those three drive the tier. Closed-set classification at high volume goes small. Generation in a known shape goes mid. Multi-step reasoning or planning goes frontier. I draft the routing in a table, then run a bake-off on each step to confirm the cheapest tier that clears the quality floor.

Trap: Naming a specific model first. The interviewer wants to hear the reasoning, not the brand.

Q2. Your CEO says "let's use GPT-5 for everything to keep it simple." What do you say?

A. I would run the unit economics on a representative workload. For most production systems, 60-80% of the model calls are classification, extraction, or short generation: work that Haiku, Flash-Lite, or GPT-4o-mini handle at 1-2% of the frontier cost. Using GPT-5 everywhere is typically a 5-10x cost inflation for a quality gain measurable only on the hardest steps. The simpler architecture is a routing layer, not a single-model architecture.

Common wrong answer to avoid: Agreeing because "it is simpler." Simpler in code, much more expensive in P&L.

Q3. When is on-device the right tier?

A. When at least one of four conditions holds. One: data cannot legally leave the device (healthcare PHI in some jurisdictions, regulated finance, some government). Two: the network may not be available (in-car assistants, field workers, planes). Three: a latency floor below 200ms is required and the network round trip kills it. Four: per-inference cost over user hardware amortizes to near-zero at scale (consumer apps with billions of calls).

Trap: Picking on-device for "privacy" when the actual workload streams data to a managed model anyway.

Q4. Why do mid-tier models matter more than frontier ones for most production work?

A. Because most production work is in the shape mid-tier models handle: extraction, summarization, drafting, RAG synthesis, function-calling. Frontier-only quality gains show up on novel reasoning, hard code, and long-horizon planning. If the workload does not include those, the mid tier delivers the same business outcome at one-fifth the cost.

Trap: Treating frontier as the default and mid as a downgrade. Mid is the default; frontier is the escalation.

Q5. How do you decide when to escalate a request to a frontier model?

A. Two patterns. First: the small/mid model self-flags low confidence (logprobs, explicit "I am unsure" output, schema-validation failure) and a router escalates. Second: a cheap classifier upfront tags hard cases (long history, complex multi-part requests, prior failed attempts) and routes them straight to frontier. The pattern depends on whether the hardness is detectable upfront or only after a first attempt.

Trap: Escalating every request "to be safe." That collapses to the all-frontier plan with extra latency.
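Both escalation patterns from Q5 can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the `0.8` confidence threshold, the ticket fields, and the cut-offs for "hard" tickets. The required field names reuse the extraction example from the hook.

```python
import json

# Fields the extraction step is expected to return (from the hook example).
REQUIRED_FIELDS = {"order_id", "requested_action"}


def needs_escalation(raw_output: str, confidence: float, threshold: float = 0.8) -> bool:
    """Pattern one: escalate after a first attempt that looks unreliable."""
    if confidence < threshold:
        return True  # model self-reported low confidence
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return True  # output is not valid JSON
    if not isinstance(parsed, dict):
        return True  # valid JSON but not the expected object
    return not REQUIRED_FIELDS.issubset(parsed)  # schema-validation failure


def is_hard_upfront(ticket: dict) -> bool:
    """Pattern two: tag hard cases before any model call (hypothetical cut-offs)."""
    return ticket.get("history_messages", 0) > 10 or ticket.get("prior_failed_attempts", 0) > 0


good = '{"order_id": "4481", "requested_action": "refund_if_stuck"}'
print(needs_escalation(good, confidence=0.93))
print(needs_escalation('{"order_id": "4481"}', confidence=0.93))
print(is_hard_upfront({"history_messages": 3, "prior_failed_attempts": 2}))
```

Keeping the two checks as separate functions mirrors the answer: one runs before the cheap model is called, the other runs on its output.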

Q6. A team is running 100% on Opus 4.7. How do you find what to demote?

A. Pull the trace logs, group by step type, and compute three metrics per group: error rate against ground truth, output token length, and human-review flag rate. Steps with low error rates and short outputs are the safest demote candidates. Run a bake-off: same step, Sonnet vs Opus, same eval set, significance test. If Sonnet matches Opus on that step, demote.

Common wrong answer to avoid: "Demote the cheap steps first." The cheap steps are already cheap. The demote candidates are the expensive steps that turn out not to need the frontier.
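The trace audit in Q6 can be sketched as a group-by over log rows. The trace schema (`step`, `wrong`, `out_tokens`, `flagged`) and the thresholds are hypothetical; real traces would come from whatever observability tool the team uses, and the shortlisted steps would still go through a bake-off before being demoted.

```python
from collections import defaultdict
from statistics import mean


def demote_candidates(traces, max_error_rate=0.02, max_avg_out_tokens=200,
                      max_flag_rate=0.05):
    """Return step names whose metrics suggest the frontier tier may be unnecessary."""
    groups = defaultdict(list)
    for row in traces:
        groups[row["step"]].append(row)

    candidates = []
    for step, rows in groups.items():
        error_rate = mean(1.0 if r["wrong"] else 0.0 for r in rows)
        avg_out_tokens = mean(r["out_tokens"] for r in rows)
        flag_rate = mean(1.0 if r["flagged"] else 0.0 for r in rows)
        if (error_rate <= max_error_rate
                and avg_out_tokens <= max_avg_out_tokens
                and flag_rate <= max_flag_rate):
            candidates.append(step)
    return sorted(candidates)


# ILLUSTRATIVE trace rows only.
traces = [
    {"step": "classify", "wrong": False, "out_tokens": 18, "flagged": False},
    {"step": "classify", "wrong": False, "out_tokens": 22, "flagged": False},
    {"step": "extract", "wrong": False, "out_tokens": 150, "flagged": False},
    {"step": "draft", "wrong": True, "out_tokens": 410, "flagged": True},
    {"step": "draft", "wrong": False, "out_tokens": 390, "flagged": False},
]
print(demote_candidates(traces))
```

The output is a shortlist, not a decision: each candidate still needs the same-eval-set comparison described in the answer.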

Q7. Why are tiers priced 5x or more apart instead of 1.5x apart?

A. Two reasons. One: the underlying training and inference costs are genuinely that far apart, because frontier models use larger parameter counts, longer training, and more inference compute per token. Two: vendors want the tier choice to be forcing. If Haiku were 30% cheaper than Sonnet, nobody would think hard about which to use. At 12x cheaper, the choice becomes a real architecture decision and the cheap tier gets the volume it needs to be economic.

Trap: "Vendors are gouging on the frontier." Frontier inference really is much more expensive per token.

Q8. How do you think about the difference between Haiku and on-device?

A. Haiku is a managed small-tier model: fast, cheap, network-dependent, hits a vendor API. On-device is a local model: slower per inference, zero marginal cost per call, but it pays a fixed engineering and distribution cost. Haiku wins for server-side classification at scale. On-device wins when data egress is forbidden, when offline is required, or when the per-call cost matters more than the per-engineer cost.

Trap: Treating them as the same tier because both are "small." Different deployment surfaces with different bottlenecks.

Apply now (5 min)

Step 1 — list your steps. Take one feature you have built or are building. Write out the model-touching steps in order. For each step, fill in three fields: input tokens, output tokens, calls per day.

Step 2 — assign a tier. For each step, write the tier you would assign (F/M/S/D) and one line of justification — what is the worst output the business can absorb at that step.

Step 3 — run the math. Compute monthly cost two ways: everything on frontier, and the tier-matched routing. The gap is your starting argument for the routing architecture in your next planning meeting.

Bridge. Now you can name the four cooks. The next question is what forces the choice between them — the three-way tradeoff between cost, latency, and quality, and the fact that you cannot have all three at the top.

→ 02-cost-latency-quality-triangle.md