04. Task-to-tier mapping: the routing matrix that runs the kitchen

~16 min read. Most tasks belong on a specific tier. Classification on small. Generation on mid. Planning on frontier. Build a router that respects this, and you can cut your model bill by as much as 60-80% with no measurable quality loss, depending on the tier you started from. The matching habit becomes a piece of running code.

Builds on 03-bake-off-methodology.md. The bake-off proves the tier choice is sound. This chapter turns proven tier choices into the dispatcher that actually moves the ticket to the right cook.

1) Hook: the routing layer that paid for itself in eleven days

An e-commerce support company runs an AI inbox handler at 3 million tickets a month. The original architecture — every ticket gets routed through Sonnet 4.6 for classification, extraction, and reply drafting. Monthly bill — $47,000.

A new engineer audits the traces and notices three things. Classification is a 12-way intent label — Haiku-tier work. Extraction is structured-output on short prose — also Haiku-tier on simple cases, Sonnet-tier on complex ones. Drafting is genuinely Sonnet-tier, except for the 8% of cases that involve disputes, escalations, or unusual policy questions — those need Opus.

She builds a routing layer in two weeks. The router is itself a small classifier (Haiku) that reads the incoming ticket and tags it with a complexity bucket. Each bucket maps to a tier.

INCOMING TICKET
      │
      ▼
┌─────────────┐
│  Haiku 4.5  │   ← classify intent + complexity
│  $0.0003    │
└──────┬──────┘
       │
       ├─── 62% simple ──→ Haiku 4.5 for extraction + draft  ($0.001 total)
       │
       ├─── 30% standard ─→ Sonnet 4.6 for extraction + draft ($0.013 total)
       │
       └─── 8% complex ──→ Opus 4.7 for extraction + draft   ($0.078 total)

Eleven days after the router was deployed, the projected monthly cost was recomputed at $12,400, a 74% reduction (Section 5 stress-tests that figure). Quality measured on the same eval set held steady: the router was not "downgrading" most tickets, it was matching them to the cook they always should have had.

That is the topic of this chapter. Most production AI systems are running the wrong cook on most tickets. The routing matrix tells you which cook belongs where, and the router turns that knowledge into running code.

2) The metaphor: the maître d' at the kitchen door

Every restaurant with three kinds of cooks needs a maître d' at the door. Their job — look at the ticket, decide which station it goes to, send it there. They are not the cook. They do not prepare anything. They route.

The maître d' is small, fast, and cheap — running every ticket through them costs almost nothing. Their decision matters because every ticket that goes to the wrong station either burns money (apprentice work landing on the master chef) or burns quality (master-chef work landing on the apprentice).

A good maître d' has a routing rulebook. Closed-set classification — small. Structured extraction — small unless the prose is dense. Generation in known shape — mid. Planning and judgement — frontier. Long-context summary — mid or frontier depending on stakes. Tool calling — mid for clean toolbelts, frontier for ambiguous ones.

The rulebook is not a one-time write. It is updated whenever the bake-off reveals a new tier-boundary, whenever a new model launches that shifts the boundary, whenever a workload changes shape. The matching habit lives in this rulebook.

3) The routing matrix

Here is the matrix every applied AI lead should have memorized. Eight task types. Recommended tier. The reasoning behind each.

┌─────────────────────────┬────────────┬──────────────────────────────────┐
│ TASK TYPE               │ DEFAULT    │ REASONING                        │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Closed-set              │ SMALL      │ Fixed vocabulary, shallow        │
│ classification          │            │ reasoning. Small tiers excel     │
│                         │            │ here at 50x cheaper.             │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Structured extraction   │ SMALL→MID  │ Small on clean prose. Mid when   │
│                         │            │ inference depth matters          │
│                         │            │ (multi-step reasoning over text) │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Generation              │ MID        │ Most production drafting,        │
│ (drafting, rewriting)   │            │ summarizing, polishing. Mid is   │
│                         │            │ the default; frontier only for   │
│                         │            │ high-stakes outputs.             │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Multi-step reasoning,   │ FRONTIER   │ Long-horizon planning, complex   │
│ agent planning          │ (planner)  │ tool selection across many tools,│
│                         │ MID (steps)│ recovery from sub-step failure.  │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Judgement / LLM-as-     │ FRONTIER   │ The judge must be at least as    │
│ judge                   │            │ strong as the strongest model    │
│                         │            │ it judges. Frontier always.      │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Long-context            │ MID+       │ Mid for routine summaries.       │
│ summarization (>32K)    │            │ Frontier for high-stakes (legal, │
│                         │            │ medical, financial) or for       │
│                         │            │ >100K context with reasoning.    │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Tool calling            │ MID        │ Most function-calling is mid-    │
│                         │            │ tier work. Frontier only for     │
│                         │            │ ambiguous tool selection across  │
│                         │            │ 20+ tools.                       │
├─────────────────────────┼────────────┼──────────────────────────────────┤
│ Code generation         │ MID (most) │ Mid for boilerplate, refactors,  │
│                         │ FRONTIER   │ unit tests. Frontier for         │
│                         │ (novel)    │ algorithm design, debugging      │
│                         │            │ unfamiliar codebases.            │
└─────────────────────────┴────────────┴──────────────────────────────────┘

The matrix is a starting point, not a verdict. Every cell is subject to override by a bake-off on your specific workload. But the matrix gives you the default and the reasoning, which is what every interviewer wants to hear.
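One way to carry the matrix into code is a plain lookup table with a per-workload override. The sketch below is illustrative only: the task-type keys, the `Tier` enum, and `tier_for` are hypothetical names, and rows that span a range (such as SMALL→MID) are stored at their cheaper end on the assumption that a bake-off will raise them where needed.

```python
from enum import IntEnum
from typing import Dict, Optional


class Tier(IntEnum):
    SMALL = 1
    MID = 2
    FRONTIER = 3


# Defaults transcribed from the matrix above.
# ASSUMPTION: range rows (e.g. SMALL→MID) are stored at the cheaper end.
DEFAULT_TIER: Dict[str, Tier] = {
    "classification": Tier.SMALL,
    "extraction": Tier.SMALL,
    "generation": Tier.MID,
    "planning": Tier.FRONTIER,
    "judgement": Tier.FRONTIER,
    "long_context_summary": Tier.MID,
    "tool_calling": Tier.MID,
    "code_generation": Tier.MID,
}


def tier_for(task_type: str, overrides: Optional[Dict[str, Tier]] = None) -> Tier:
    """Return the bake-off override if one exists, else the matrix default."""
    if overrides and task_type in overrides:
        return overrides[task_type]
    return DEFAULT_TIER[task_type]


# A bake-off showed extraction on this workload needs the mid tier.
print(tier_for("extraction", overrides={"extraction": Tier.MID}).name)
```

Keeping overrides in a separate dictionary leaves the defaults readable and makes every departure from the matrix traceable to a specific bake-off.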

4) Walking through each row

Closed-set classification

A closed-set classifier picks one label from a fixed vocabulary — intent detection, spam filtering, sentiment classification, content moderation buckets. The reasoning involved is shallow. The model just needs to read the input, recognize patterns, and pick a label.

TASK             EXAMPLE                              VERDICT
────             ───────                              ───────
intent routing   "is this a refund / status / new   Haiku 4.5
                  question / complaint?"

content tagging  "tag this product description with  GPT-4o-mini
                  one of 30 category labels"

spam detection   "is this email legitimate or spam?" Gemini Flash-Lite

sentiment        "rate this review positive,        Haiku 4.5
                  negative, or neutral"

The benchmark consensus in 2026 — small tiers match mid-tier accuracy on closed-set classification, often within a percentage point, at one-tenth the cost. Frontier-tier is wasted here.

Common mistake — running classification on Sonnet because "it's more accurate." Run a bake-off. The accuracy gap on closed sets is typically less than 2 points and not worth the 12x cost.
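The "run a bake-off" advice reduces to a simple selection rule: among the tiers that clear the quality floor, take the cheapest. A minimal sketch follows. The accuracy and cost numbers are made up for illustration and `cheapest_passing_tier` is a hypothetical helper, not part of any library.

```python
def cheapest_passing_tier(results, quality_floor):
    """Pick the cheapest tier whose accuracy clears the floor.

    results: list of (tier_name, accuracy, cost_per_call_usd) tuples.
    Returns the tier name, or None if no tier clears the floor.
    """
    passing = [r for r in results if r[1] >= quality_floor]
    if not passing:
        return None
    return min(passing, key=lambda r: r[2])[0]


# HYPOTHETICAL bake-off results for a closed-set classifier.
bakeoff = [
    ("small", 0.940, 0.0003),
    ("mid", 0.955, 0.0036),
    ("frontier", 0.960, 0.0180),
]

print(cheapest_passing_tier(bakeoff, quality_floor=0.93))
```

If the floor were raised above what the small tier achieves, the same function would return the next cheapest passing tier, which is the point: the floor decides, not the brand name.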

Structured extraction

Pulling fields out of text: names, dates, amounts, addresses, IDs. Small tiers handle this well on clean prose, especially with strict schemas and constrained decoding. They struggle when extraction requires multi-step reasoning, for example "find the implied delivery date based on the order timeline and the holiday calendar". That is a mid-tier task.

EXTRACTION TYPE                  VERDICT
───────────────                  ───────
named entities (people, dates)   small
order IDs, amounts, addresses    small
implied / inferred fields        mid
extraction with cross-document   mid or frontier
 reconciliation

The boundary is reasoning depth. A clean field-extraction prompt on a single document — small. An extraction that requires reading between lines, normalizing inconsistent formats, or reconciling across sources — mid.

Generation

Drafting an email, rewriting a paragraph, summarizing a meeting, generating a product description. Mid-tier is the default for most generation. Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash all produce high-quality prose for the standard cases, at roughly one-fifth the cost of frontier.

Frontier territory is reserved for high-stakes outputs — legal contracts, medical communications, regulatory filings, customer-facing announcements that go to large audiences. The quality gap on routine generation is small. The quality gap on high-stakes generation is larger and matters more.

Multi-step reasoning and agent planning

This is the frontier's home turf. Long-horizon planning across multiple tool calls, recovery from sub-step failures, arbitration between conflicting evidence, breaking down ambiguous user goals into concrete steps — frontier models are meaningfully better at this work.

But here is the key insight — the planner is on frontier, the executors are not. A well-designed agent uses frontier for the top-level planning step and mid or small for each sub-step. The frontier writes the plan; mid tiers execute it. This pattern is called tier-routing within an agent and it is how production agents stay economical.

AGENT ARCHITECTURE
──────────────────
top-level plan         → Opus 4.7
search subgraph        → Sonnet 4.6 + tool calls
extraction sub-step    → Haiku 4.5
verification step      → Sonnet 4.6
final answer drafting  → Sonnet 4.6
escalation judge       → Opus 4.7

An agent that runs every step on Opus can cost several times more than the tier-routed version at similar quality; the exact multiple depends on how many steps stay on frontier and how many tokens each step consumes. The frontier earns its keep on the planner; the rest belongs to cheaper cooks.
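To estimate the multiple for your own agent, one rough approach is to weight each step by its token usage and by a relative price per tier. The sketch below transcribes the step-to-tier layout above. The relative prices are assumptions (small = 1, mid = 12, frontier = 60), chosen only to illustrate the calculation, and `cost_multiple` is a hypothetical helper.

```python
# Step-to-tier assignment transcribed from the architecture above.
STEP_TIER = {
    "top_level_plan": "frontier",
    "search_subgraph": "mid",
    "extraction_sub_step": "small",
    "verification_step": "mid",
    "final_answer_drafting": "mid",
    "escalation_judge": "frontier",
}

# ASSUMPTION: relative per-token price of each tier, small = 1.
# Illustrative ratios, not vendor quotes.
RELATIVE_PRICE = {"small": 1.0, "mid": 12.0, "frontier": 60.0}


def cost_multiple(step_tier, step_tokens, relative_price=RELATIVE_PRICE):
    """Return how many times more an all-frontier run costs than the routed run."""
    routed = sum(relative_price[step_tier[s]] * step_tokens[s] for s in step_tier)
    all_frontier = sum(relative_price["frontier"] * step_tokens[s] for s in step_tier)
    return all_frontier / routed


# ASSUMPTION: every step consumes the same number of tokens.
equal_tokens = {step: 1_000 for step in STEP_TIER}
print(f"all-frontier / routed = {cost_multiple(STEP_TIER, equal_tokens):.1f}x")
```

The multiple is sensitive to how many tokens the frontier steps consume, so it is worth plugging in measured token counts per step rather than assuming a fixed ratio.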

Judgement and LLM-as-judge

A judge model evaluates other model outputs — pairwise comparisons, rubric scoring, error categorization. The judge must be at least as strong as the strongest model it judges. Frontier almost always.

A common mistake — using Sonnet to judge Opus outputs. The judge cannot reliably evaluate outputs that exceed its own capability. The result is that Sonnet rates Opus's answers lower than they deserve because Sonnet cannot fully appreciate the harder reasoning. Use Opus or GPT-5 as the judge, regardless of the models being compared.
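The judge rule is easy to enforce as a guard in an eval harness. A minimal sketch, assuming tiers are identified by the hypothetical names "small", "mid", and "frontier":

```python
TIER_RANK = {"small": 1, "mid": 2, "frontier": 3}


def judge_is_strong_enough(judge_tier, candidate_tiers):
    """True when the judge's tier is at least the strongest candidate's tier."""
    strongest = max(TIER_RANK[t] for t in candidate_tiers)
    return TIER_RANK[judge_tier] >= strongest


# A mid-tier judge comparing mid and frontier outputs fails the check.
print(judge_is_strong_enough("mid", ["mid", "frontier"]))
print(judge_is_strong_enough("frontier", ["mid", "frontier"]))
```

Running this check before an eval starts is cheaper than discovering afterwards that the scores were produced by an underpowered judge.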

Long-context summarization

Above roughly 32K-64K tokens, small tiers start to degrade: they miss content in the middle, hallucinate summary points not in the source, or effectively truncate the input. Mid-tier holds up well to 128K+. Frontier is strongest above 200K and for high-stakes summaries.

CONTEXT SIZE              VERDICT
────────────              ───────
< 32K tokens              small fine for low-stakes; mid for high-stakes
32K-128K tokens           mid is the default
128K-500K tokens          mid for routine; frontier for high-stakes
500K+ tokens              frontier; small/mid often unreliable
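The table above can be written as a small decision function. This is a sketch under one assumption the table does not settle: each band is treated as inclusive of its lower bound and exclusive of its upper bound. The function name is hypothetical.

```python
def summarization_tier(context_tokens: int, high_stakes: bool) -> str:
    """Map context size and stakes to a tier, following the table above."""
    if context_tokens < 32_000:
        return "mid" if high_stakes else "small"
    if context_tokens < 128_000:
        return "mid"
    if context_tokens < 500_000:
        return "frontier" if high_stakes else "mid"
    return "frontier"


print(summarization_tier(20_000, high_stakes=False))
print(summarization_tier(200_000, high_stakes=True))
```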

Tool calling

Most function-calling is mid-tier work. Sonnet, GPT-4o, and Gemini Flash handle toolbelts of 5-15 tools cleanly. Frontier becomes worth it when the toolbelt has 20+ tools, when the selection is genuinely ambiguous, or when the cost of a wrong tool call is high (production database mutation, financial transaction).

Public benchmarks like the Berkeley Function-Calling Leaderboard show the gap — frontier models score roughly 5-10 points higher on hard function-calling, but mid-tier is sufficient for the 80% of toolbelts that are well-designed.
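The three escalation conditions for tool calling can be captured in a few lines. A sketch with hypothetical parameter names; how you decide that a selection is "ambiguous" is left to your own signal, such as overlapping tool descriptions.

```python
def tool_calling_tier(num_tools: int,
                      ambiguous_selection: bool = False,
                      high_cost_of_error: bool = False) -> str:
    """Mid by default; frontier for large, ambiguous, or risky toolbelts."""
    if num_tools >= 20 or ambiguous_selection or high_cost_of_error:
        return "frontier"
    return "mid"


print(tool_calling_tier(8))
print(tool_calling_tier(8, high_cost_of_error=True))
```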

Code generation

Mid-tier handles boilerplate, refactors, unit tests, type hints, and straightforward bug fixes. Frontier earns its keep on novel algorithm design, debugging unfamiliar codebases, and architectural decisions.

In practice — code completion in an IDE is mid-tier (or small for single-line completions). Agent-style code refactoring across multiple files is frontier. The split shows up in Cursor and Windsurf — autocomplete runs on faster mid-tier or small models, "agent mode" runs on frontier.

5) Worked example: building the router

Let's design the router for the support-AI workload from the hook. The incoming ticket lands. Step one — the router itself runs.

ROUTER PROMPT (Haiku 4.5, ~50 tokens out)
─────────────────────────────────────────
Given this support ticket, output:
  intent: one of [refund, status, complaint, new_question, escalation]
  complexity: one of [simple, standard, complex]
  needs_history: yes / no

Rules for complexity:
  simple   = single clear question, no order history needed
  standard = single transaction, may need order lookup
  complex  = multi-step, disputes, escalations, unusual policy

Output strict JSON only.

The router runs every ticket through Haiku at ~$0.0003. The output drives the tier selection.
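"Strict JSON only" is a request, not a guarantee, so the dispatcher should validate what comes back before trusting it. The sketch below checks the three fields from the prompt and maps complexity to a tier. The mid-tier fallback on malformed output is an assumption about which failure is safer, and `parse_router_output` is a hypothetical helper.

```python
import json

INTENTS = {"refund", "status", "complaint", "new_question", "escalation"}
COMPLEXITY_TO_TIER = {"simple": "small", "standard": "mid", "complex": "frontier"}


def parse_router_output(raw: str) -> dict:
    """Validate the router's JSON and attach the target tier.

    ASSUMPTION: on any malformed or out-of-vocabulary output, fall back to
    the mid tier rather than guessing.
    """
    fallback = {"intent": None, "complexity": None, "needs_history": None, "tier": "mid"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    intent = data.get("intent")
    complexity = data.get("complexity")
    needs_history = data.get("needs_history")

    if not (isinstance(intent, str) and intent in INTENTS):
        return fallback
    if not (isinstance(complexity, str) and complexity in COMPLEXITY_TO_TIER):
        return fallback
    if needs_history not in ("yes", "no"):
        return fallback

    return {
        "intent": intent,
        "complexity": complexity,
        "needs_history": needs_history,
        "tier": COMPLEXITY_TO_TIER[complexity],
    }


print(parse_router_output('{"intent": "refund", "complexity": "complex", "needs_history": "yes"}'))
```

Counting how often the fallback fires is a cheap health signal for the router prompt itself.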

                  TICKET TYPES (3M tickets/month)
                  ──────────────────────────────────

Simple        (62%)  → 1.86M tickets → Haiku for entire pipeline
Standard      (30%)  → 0.90M tickets → Sonnet for extraction + draft
Complex       (8%)   → 0.24M tickets → Opus for the whole flow

                  COST BREAKDOWN
                  ──────────────

router (3M × $0.0003)                    = $900
simple   (1.86M × $0.0009 avg total)     = $1,674
standard (0.90M × $0.014 avg total)      = $12,600
complex  (0.24M × $0.078 avg total)      = $18,720
                                         ──────────
                                          $33,894

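That total is $33,894, not the $12,400 quoted in the hook, so the hook's figure does not survive the arithmetic. The per-ticket averages above are rough. Recomputing each stage from average token counts gives a more realistic picture.

Per-ticket cost breakdown (average tokens per workload; the figures imply roughly $0.25 / $1.25 per million input / output tokens for Haiku, $3 / $15 for Sonnet, and $15 / $75 for Opus, which are the rates the arithmetic implies rather than quoted vendor prices):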

SIMPLE (Haiku 4.5 throughout):
  router            400 in / 50 out  → $0.00016
  extraction        300 in / 80 out  → $0.00018
  draft             500 in / 200 out → $0.00038
  per ticket total                   → $0.00072

STANDARD (Sonnet 4.6 for extract+draft):
  router (Haiku)                     → $0.00016
  extraction       400 in / 150 out  → $0.00345
  draft            800 in / 350 out  → $0.00765
  per ticket total                   → $0.01126

COMPLEX (Opus 4.7 throughout):
  router (Haiku)                     → $0.00016
  extraction       600 in / 200 out  → $0.02400
  draft           1500 in / 500 out  → $0.06000
  per ticket total                   → $0.08416

MONTHLY (3M tickets):
  simple   1.86M × $0.00072  = $1,339
  standard 0.90M × $0.01126  = $10,134
  complex  0.24M × $0.08416  = $20,198
                              ─────────
                               $31,671

  vs all-Sonnet baseline:   ~$47,000
  savings:                    $15,329  (33% reduction)

Not 74%. Not 60-80%. 33%. Honest answer.

The bigger savings claims you hear in the wild come from organizations that were running everything on frontier before routing. From frontier to a tier-routed plan you get 70-80% savings. From mid-tier to a tier-routed plan with mostly small classification, you get 20-40% — still worth doing, but the headline number depends entirely on where you started.

The lesson — the router's savings depend on the original baseline. Quote the savings honestly with both the before-tier and the after-router configuration.
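The per-stage arithmetic above is easy to reproduce, which makes it easy to rerun with your own token counts. In the sketch below the per-million-token prices are assumptions back-derived from the per-stage figures (they are not quoted vendor prices), and the model keys are shorthand labels.

```python
# ASSUMPTION: USD per million tokens (input, output), back-derived from the
# per-stage figures above. Not quoted vendor prices.
PRICE = {
    "haiku": (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}


def call_cost(model, tokens_in, tokens_out):
    """Cost in USD of one call, given input and output token counts."""
    price_in, price_out = PRICE[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


ROUTER = call_cost("haiku", 400, 50)  # every ticket pays for the router

PER_TICKET = {
    "simple": ROUTER + call_cost("haiku", 300, 80) + call_cost("haiku", 500, 200),
    "standard": ROUTER + call_cost("sonnet", 400, 150) + call_cost("sonnet", 800, 350),
    "complex": ROUTER + call_cost("opus", 600, 200) + call_cost("opus", 1500, 500),
}

VOLUME = {"simple": 1_860_000, "standard": 900_000, "complex": 240_000}
BASELINE = 47_000  # all-Sonnet monthly bill from the hook

routed = sum(PER_TICKET[bucket] * VOLUME[bucket] for bucket in VOLUME)
savings = BASELINE - routed
print(f"routed: ${routed:,.0f}  savings: ${savings:,.0f} ({savings / BASELINE:.0%})")
```

The result lands within a few dollars of the $31,671 in the table; the small difference comes from the table rounding each stage before summing. The savings figure still comes out at about 33%.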

Mid-content recall

In the routing matrix, which tier owns closed-set classification, and why is frontier wasted there?
What is the tier-routing pattern within an agent — which step gets frontier, which steps get mid?
Why must an LLM-as-judge always be at least as strong as the strongest model it judges?

6) The router itself: design choices

Three ways to build the router.

Rule-based router. Hard-coded if/then on input features — length, keyword match, regex patterns, user tier. Cheapest. Fast. Brittle to new ticket types. Good for stable workloads where the rules rarely change.

Classifier router. A small model (Haiku 4.5 or a fine-tuned DistilBERT) tags each input with a complexity bucket. Slightly more expensive than rule-based but much more adaptive. The right choice for most production systems.

Confidence-based router. Run the cheap tier first. If its self-reported confidence (logprobs, schema-validation pass rate, explicit "I'm unsure" signal) is below threshold, escalate to a higher tier. Cheapest in expected cost when most queries are simple, but adds latency on the escalated cases.

ROUTER TYPE          COST       LATENCY                       FLEXIBILITY
───────────          ────       ───────                       ───────────
rule-based           lowest     fastest                       low
classifier           low        +50-200ms                     medium
confidence-based     variable   +200-1000ms (on hard cases)   high

Most production systems use a hybrid — rules for the obvious cases (length thresholds, known intent keywords), classifier for the ambiguous ones, confidence-based escalation as a safety net for the hardest cases.
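The hybrid can be sketched as three layers in order: rules, then classifier, then one confidence-based escalation. Everything model-facing here is a stand-in. `classify` and `run_tier` are hypothetical callables you would back with real model calls, the keyword rules are invented examples, and the 0.7 threshold is an arbitrary placeholder to tune.

```python
ESCALATE = {"small": "mid", "mid": "frontier", "frontier": "frontier"}


def rule_route(ticket: str):
    """Rules for the obvious cases; returns None when no rule fires."""
    text = ticket.lower()
    if "chargeback" in text or "dispute" in text:  # hypothetical keyword rule
        return "frontier"
    if len(text) < 80 and "where is my order" in text:  # hypothetical rule
        return "small"
    return None


def hybrid_route(ticket, classify, run_tier, min_confidence=0.7):
    """Rules first, classifier second, confidence-based escalation last.

    classify(ticket) -> tier name
    run_tier(tier, ticket) -> (answer, confidence in [0, 1])
    Both callables are hypothetical stand-ins for real model calls.
    """
    tier = rule_route(ticket) or classify(ticket)
    answer, confidence = run_tier(tier, ticket)
    # Escalate at most once, and only if a higher tier exists.
    if confidence < min_confidence and ESCALATE[tier] != tier:
        tier = ESCALATE[tier]
        answer, confidence = run_tier(tier, ticket)
    return tier, answer


# Stub implementations so the sketch runs end to end.
def stub_classify(ticket):
    return "small"


def stub_run_tier(tier, ticket):
    confidence = 0.9 if tier != "small" else 0.4
    return f"[{tier}] reply", confidence


print(hybrid_route("Can I change the size on my order?", stub_classify, stub_run_tier))
```

Capping escalation at one step keeps worst-case latency bounded at two model calls per ticket.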

7) Failure modes: routing patterns that don't pay off

SIGNAL                                FIX
──────                                ───
router itself runs on Sonnet         → router belongs on Haiku/Flash-Lite.
                                       The maître d' is a small cook.

router complexity exceeds the         → if the classifier is hard, the task
 workload's value                       isn't ready for a router. Start
                                        with rules.

every ticket routed through every   → some workloads need only the cheap
 tier "to be safe"                    tier — don't over-engineer

cost-routing without quality          → without bake-off baselines per tier,
 baselines per tier                    you'll silently downgrade hard
                                       tickets

router decisions never measured     → log router decisions plus downstream
                                       outcomes; recompute the matrix
                                       quarterly

frontier escalation triggered too     → tune thresholds; track escalation
 aggressively                         rate and cost per escalation

confidence-based router blocks UX   → cap escalation latency; fall back to
 on hard cases                        cached or default response on timeout

router rules drift from production  → periodically retrain or refresh
 distribution                         rules from production traces

users see unexplained quality       → expose the router decision in the
 differences                          trace logs; communicate to ops team

The deepest pattern across these — the router is a system on its own and needs the same observability, version control, and eval discipline as the models it dispatches to.
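Several of the fixes above come down to logging each decision with its outcome and summarizing the log on a schedule. A minimal sketch follows. The record fields and the `summarize` helper are hypothetical, and in practice the records would go to your tracing platform rather than an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RouterDecision:
    ticket_id: str
    router_version: str       # version the router like any other model
    tier: str                 # tier that produced the final answer
    escalated: bool           # True if a cheaper tier was tried first
    cost_usd: float
    passed_quality_check: bool


def summarize(decisions):
    """Aggregate the metrics a periodic router review would need."""
    n = len(decisions)
    if n == 0:
        return {}
    per_tier = Counter(d.tier for d in decisions)
    return {
        "tier_share": {tier: count / n for tier, count in per_tier.items()},
        "escalation_rate": sum(d.escalated for d in decisions) / n,
        "quality_pass_rate": sum(d.passed_quality_check for d in decisions) / n,
        "cost_per_ticket": sum(d.cost_usd for d in decisions) / n,
    }


# HYPOTHETICAL log entries.
log = [
    RouterDecision("t1", "v1", "small", False, 0.001, True),
    RouterDecision("t2", "v1", "small", False, 0.001, True),
    RouterDecision("t3", "v1", "mid", True, 0.012, True),
    RouterDecision("t4", "v1", "frontier", True, 0.080, False),
]
print(summarize(log))
```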

8) Where the matrix breaks: workloads that resist routing

Three workload types do not benefit from routing.

Pure low-volume workloads. If you handle 1,000 tickets a day, the cost delta between full-Sonnet and routed is maybe $200/month. The engineering cost of the router and its maintenance exceeds the savings. Run mid-tier for everything and move on.

High-stakes workloads. Medical diagnosis support, legal contract review, regulatory filings. The cost of a wrong answer is so high that running everything on frontier is the correct tradeoff. Routing saves money but adds a failure mode: what if the router misclassifies a hard case as simple? In high-stakes territory, that failure mode is unacceptable.

Workloads where the cheap-tier failure is invisible. If a small tier silently produces a bad answer that the system cannot detect, routing becomes dangerous. Generation tasks where the output ships unverified to a customer fall into this category. Routing here requires a downstream quality check that can catch the small-tier failures.

The matching habit is — route by default, but only when you can detect the failures of the cheaper tier. If you cannot detect them, the cost savings are an illusion.
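The three exclusions can be turned into a quick go/no-go check. In this sketch the per-ticket baseline comes from the hook ($47,000 over 3M tickets) and the 33% savings fraction from the worked example. The $2,000 monthly upkeep figure is a made-up placeholder for engineering and maintenance cost, and both function names are hypothetical.

```python
def routing_net_benefit(tickets_per_month, baseline_cost_per_ticket,
                        savings_fraction, router_upkeep_per_month):
    """Monthly model-cost savings minus the cost of keeping the router alive."""
    gross = tickets_per_month * baseline_cost_per_ticket * savings_fraction
    return gross - router_upkeep_per_month


def should_route(net_benefit, high_stakes, failures_detectable):
    """Route only if it pays, the stakes allow it, and failures are visible."""
    return net_benefit > 0 and not high_stakes and failures_detectable


BASELINE_PER_TICKET = 47_000 / 3_000_000  # all-Sonnet figure from the hook
UPKEEP = 2_000  # HYPOTHETICAL monthly engineering and maintenance cost

low_volume = routing_net_benefit(30_000, BASELINE_PER_TICKET, 0.33, UPKEEP)
high_volume = routing_net_benefit(3_000_000, BASELINE_PER_TICKET, 0.33, UPKEEP)

print(should_route(low_volume, high_stakes=False, failures_detectable=True))
print(should_route(high_volume, high_stakes=False, failures_detectable=True))
print(should_route(high_volume, high_stakes=False, failures_detectable=False))
```

At 1,000 tickets a day the gross savings come to roughly $155 a month under these assumptions, which is why the low-volume case fails the check long before stakes or detectability enter the picture.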

Where this lives in the wild

The routing matrix shows up in every serious AI product surface.

Cursor, Windsurf — fast tier for autocomplete, frontier for "agent mode" multi-file refactoring.
GitHub Copilot — small models for ghost-text completions, frontier for chat and the Workspace agent.
Glean — small models for query rewriting and rerank, frontier for answer synthesis on complex queries.
Notion AI — small for intent routing, mid for generation, frontier for complex multi-document operations.
Claude Code: routes between Sonnet and Haiku based on task complexity.
OpenAI Playground — model selector exposed per task type.
Together AI — model gateway exposing all tiers, with routing examples in their cookbook.
OpenRouter — explicit routing policies with per-task tier defaults.
LiteLLM — open-source proxy with built-in routing rules and fallbacks across tiers.
AWS Bedrock — multi-model deployments with custom routing logic via Lambda.
Azure OpenAI — per-deployment quota management that effectively forces tier discipline.
Vertex AI — explicit Gemini Pro vs Flash vs Flash-Lite selection in the SDK.
Helicone — production traces tagged by model and tier, surfacing router decisions.
Langfuse, LangSmith, Braintrust — observability platforms that measure router accuracy in production.
Vellum, PromptLayer, Pezzo — prompt management with per-tier routing config.
Modal, Banana, Anyscale — self-hosted GPU deployments where routing across model sizes is the cost lever.
Fireworks AI — exposes a model-routing API for cost-aware deployments.
Replicate — open-weight model gallery used to test cheaper tiers before committing.
Berkeley Function-Calling Leaderboard — public data on tool-calling quality across tiers, driving routing decisions.
MMLU, HumanEval, GSM8K, MTEB: benchmarks per tier that inform default routing choices.
AppWorld, SWE-Bench — agent benchmarks where tier-routing patterns are documented.
Chatbot Arena — Elo ratings per tier, useful for sanity-checking routing defaults.
Artificial Analysis — third-party pricing and latency data for routing-cost calculations.

Pause and recall

State the routing matrix's defaults for: classification, extraction, generation, planning, judgement, long-context summarization, tool calling, code generation.
Explain the tier-routing pattern within an agent — which step is frontier, which are mid.
Why is the router itself almost always on the small tier?
Name the three router design patterns (rule-based, classifier, confidence-based) and one trade-off of each.
Why does the savings percentage from routing depend heavily on the baseline tier you started from?
Give two workload types where the routing matrix does not pay off.
What is the most common mistake when introducing a router into an existing all-frontier system?

Interview Q&A

Q1. Walk me through how you'd design a router for a support-AI inbox. A. Three steps. First, audit the trace logs from the current system: group by task type (classification, extraction, drafting) and measure quality and cost per group. Second, run a bake-off for each task type at each candidate tier; the bake-off proves which tier is the cheapest that clears the quality floor. Third, build the router itself on the small tier, train it or write rules to predict task complexity, and wire each complexity bucket to the corresponding tier. Add observability: log every router decision and every escalation, and recompute the matrix quarterly. Trap: Designing the router before doing the bake-off. A router without baselines silently downgrades quality.

Q2. Why is the router itself on Haiku and not on Sonnet? A. The router runs on every request. Even at 100% accuracy, putting it on Sonnet costs roughly 12x more than Haiku per call, and the routing decision is closed-set classification — a task where Haiku and Sonnet score within 1-2 points. The whole point of the router is to avoid overpaying on cheap tasks, so it would be ironic to overpay on the router itself. Common wrong answer to avoid: "Use the same model for everything to keep it simple." That defeats the router's purpose.

Q3. A team claims a 70% savings from introducing a router. How do you verify? A. Three checks. First, what was the baseline? 70% off an all-frontier baseline is plausible; 70% off an all-mid baseline is implausibly large. Second, what is the quality measurement? Savings without a quality baseline could be silent downgrading. Third, what is the failure detection on the cheaper tier outputs? If undetected failures grow, the "savings" are converted to silent customer cost. Trap: Accepting savings claims without the baseline and quality measurements.

Q4. When does routing not pay off? A. Three cases. One — low-volume workloads where the engineering cost of the router exceeds the cost savings. Two — high-stakes workloads where the failure cost of the cheap tier exceeds the model-cost savings. Three — workloads where cheap-tier failures cannot be detected and the system ships unverified output to users. Trap: Routing everything because "it's the right pattern." It is the right pattern when the savings exceed the operational cost and the failures are detectable.

Q5. What is the tier-routing pattern inside an agent? A. The planner runs on frontier — it writes the multi-step plan, decides which tools to use, sequences the sub-steps, recovers from sub-step failures. The executors run on mid or small — each sub-step has a narrow scope (search a database, extract a field, format an output) that does not need frontier. A well-designed agent uses frontier for 1-2 calls per session and mid/small for the rest, keeping cost economical while preserving plan quality. Trap: Running every agent step on frontier "because agents are hard." Agents are hard at the planning level. The sub-steps are usually mid-tier work.

Q6. How do you decide between rule-based and classifier-based routing? A. Rule-based wins when the routing decision can be made from cheap features — input length, intent keywords, user tier — and the decision is stable over time. Classifier-based wins when the routing decision requires reading the prose (intent, complexity, topic) and the distribution shifts over time. Most production systems are hybrid — rules for the obvious cases, classifier for the ambiguous ones. Trap: Building a classifier when rules would suffice, or vice versa.

Q7. Your router escalates 25% of tickets to frontier. The bill is larger than before. What went wrong? A. Three likely causes. One — the router classifier is too aggressive, escalating tickets that mid-tier would have handled. Two — the escalation rate may be honest but the original baseline was already mid-tier, so escalation can only add cost. Three — the router itself is not actually saving cost on the non-escalated path. Diagnosis — measure quality on the non-escalated path (is mid-tier matching the floor?) and audit the escalation classifier's precision. Trap: Assuming the router is broken without measuring whether each of its decisions was correct.

Q8. The router and the matching habit — how are they related? A. The matching habit is the reasoning — which cook for which ticket. The router is the running code that implements the reasoning at production scale. The bake-off is the evidence that the matching habit is correctly calibrated for this workload. All three are needed — reason to design it, evidence to calibrate it, code to run it. An engineer who can do all three is an applied AI lead. An engineer who can only do one is not. Trap: Conflating the three or skipping any of them.

Apply now (5 min)

Step 1 — draw the matrix. For your system, list every distinct model-touching task type. For each, fill in the default tier and the reasoning in one line. The result should fit on one page.

Step 2 — find the misalignments. For each task, compare the default tier to the tier you are actually running. Mark each row green (matches), yellow (one tier off), red (two tiers off). The red rows are your immediate routing opportunities.

Step 3 — sketch the router. For one red row, write the routing decision in pseudocode. Input features, classifier or rule, dispatch logic, escalation criteria. If the sketch is more than 30 lines, the task is not ready for routing — start with the bake-off.
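Step 2's green / yellow / red marking is mechanical once the tiers are ranked. A small sketch, using a made-up task inventory; replace it with the rows from your own matrix.

```python
RANK = {"small": 1, "mid": 2, "frontier": 3}
COLORS = ("green", "yellow", "red")  # indexed by tier gap: 0, 1, 2


def alignment(default_tier: str, actual_tier: str) -> str:
    """Green if tiers match, yellow if one tier off, red if two tiers off."""
    gap = abs(RANK[default_tier] - RANK[actual_tier])
    return COLORS[gap]


# HYPOTHETICAL inventory: (task, matrix default, tier currently in production).
inventory = [
    ("intent classification", "small", "frontier"),
    ("reply drafting", "mid", "mid"),
    ("order-field extraction", "small", "mid"),
]

for task, default, actual in inventory:
    print(f"{task:26} default={default:9} actual={actual:9} {alignment(default, actual)}")
```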

Bridge. You can now reason about tiers, the triangle, the bake-off, and the routing matrix. The next layer of the matching habit is what kind of cook the kitchen prefers — closed-weight from a vendor or open-weight you control. Different tradeoffs. Different risks.

→ 05-closed-vs-open-weight.md