09. Task Routing Patterns: Do not pay grandmaster prices for calculator work
~11 min read. Strong reasoning systems are 80% routing and 20% the deep model. This chapter is the highest-leverage applied-engineer skill in the module.
Built on the ELI5 in 00-eli5.md. The time budget, spent selectively by task difficulty and risk, turns reasoning from a single model choice into a system-level routing policy.
Think in lanes, not one giant pipe
A real product has many request types: easy autocompletions, medium classifications, hard multi-step tasks, compliance-critical decisions, ambiguous open-ended planning. If all go through one model path, you waste money somewhere — either by under-reasoning the hard ones or over-paying for the easy ones.

      request
         │
         ▼
    ┌──────────┐
    │  Router  │
    └────┬─────┘
         ├──→ Fast lane (gpt-4.1-mini, haiku 4.5, no reasoning)
         ├──→ Reasoning lane (gpt-5.5, sonnet 4.6 thinking medium)
         ├──→ Deep lane (opus 4.7 effort=high, o3-pro)
         ├──→ Search/tool lane (agent loop with verifier)
         └──→ Human review (out-of-policy, low confidence, irreversible)

That is the time budget made operational. The router is the traffic cop — not the model itself. The 2026 production pattern is almost always a cascade of 3–5 lanes.
Signals a router can use
Cheap features the router computes on the request before spending the deep tier:
| Signal | What it predicts | How to extract |
|---|---|---|
| Task length / token count | Complexity correlates with token count | Count tokens of the user message |
| Presence of numeric/math content | Math benefits from reasoning | Regex / classifier |
| Code keywords (def, class, SELECT) | Code tasks vary by complexity | Pattern match |
| Constraint count | More constraints → reasoning helps | Parse for "must", "should", numbered rules |
| Multi-file context attached | Cross-file → escalate | Count attached files |
| User tier / SLA | Premium users → may justify deeper model | User metadata |
| Business risk class | Compliance, money, safety → escalate | Domain tagging |
| First-pass confidence | Low confidence → retry deeper | Logprob or self-rating |
| Prior turn was hard | Sticky escalation | Session state |
Combine into a score, or train a small classifier. Most production routers start with 5–10 hand-coded rules; learned routers come later when you have enough labeled data.
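As a rough illustration of combining these signals into a score, the sketch below uses a handful of hand-coded rules. The `Request` fields, the regexes, the weights, and the lane thresholds are all hypothetical placeholders rather than values from a production router; you would tune them against your own golden set.

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request shape; field names are illustrative."""
    text: str
    attached_files: int = 0
    high_risk: bool = False            # compliance, money, or safety domain tag
    prior_turn_escalated: bool = False


CONSTRAINT_WORDS = re.compile(r"\b(must|should)\b", re.IGNORECASE)
CODE_KEYWORDS = re.compile(r"\b(def|class|SELECT)\b")
MATH_HINT = re.compile(r"\d+\s*[-+*/^=%]\s*\d+")


def routing_score(req: Request) -> int:
    """Sum cheap signals into one integer; higher means harder or riskier."""
    score = min(len(CONSTRAINT_WORDS.findall(req.text)), 3)  # cap constraint count
    score += 1 if MATH_HINT.search(req.text) else 0
    score += 1 if CODE_KEYWORDS.search(req.text) else 0
    score += 2 if req.attached_files >= 2 else 0   # multi-file context
    score += 1 if req.prior_turn_escalated else 0  # sticky escalation
    score += 4 if req.high_risk else 0             # risk class forces the deep lane
    return score


def pick_lane(req: Request) -> str:
    """Map the score to a lane using illustrative thresholds."""
    score = routing_score(req)
    if score >= 4:
        return "deep"
    if score >= 2:
        return "reasoning"
    return "fast"
```

Keeping each rule on its own line makes it easy to log which signals fired for a given request, which is what you need later when you audit routing decisions.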
Cascade pattern: the production default
The most-used routing pattern in 2026. Run the cheapest model first; escalate only when needed.
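
    # Assumes the OpenAI Python SDK's async client.
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    # confidence, passes_schema, passes_verifier, and escalate_to_human are
    # application-defined helpers, not library calls.


    async def cascade_route(request, golden_set_thresholds):
        # The 0.85 / 0.80 cutoffs are hard-coded for readability; in practice
        # they would come from golden_set_thresholds.

        # Tier 1: fast model, no reasoning
        fast_resp = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[...],
        )
        if confidence(fast_resp) >= 0.85 and passes_schema(fast_resp):
            return fast_resp  # 80% of requests stop here

        # Tier 2: reasoning at medium effort
        reasoning_resp = await client.responses.create(
            model="gpt-5",
            input=request,
            reasoning={"effort": "medium"},
        )
        if confidence(reasoning_resp) >= 0.80 and passes_verifier(reasoning_resp):
            return reasoning_resp  # another 15% stop here

        # Tier 3: deep reasoning
        deep_resp = await client.responses.create(
            model="o3-pro",
            input=request,
            reasoning={"effort": "high"},
        )
        if not passes_verifier(deep_resp):
            # Even the deep tier failed verification: hand off to a person.
            return escalate_to_human(request, deep_resp)
        return deep_resp  # remaining 5%
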
The cascade lets you serve the 80% of easy requests on a model priced at under 1% of the deep tier per output token, while preserving deep-tier quality on the long tail. Routing every request to o3-pro instead would cost roughly 13× as much (see the worked example below) for indistinguishable median quality.
Worked example: cascade cost math at 1M requests/day
Approximate May 2026 numbers, average 1,000 output tokens per request, 80/15/5 split.
| Tier | Model | $/M out | Vol/day | Cost/day |
|---|---|---|---|---|
| Fast | gpt-4.1-mini | $0.60 | 800,000 | $480 |
| Reasoning medium | GPT-5 effort=medium | $10.00 | 150,000 | $1,500 |
| Deep | o3-pro effort=high | $80.00 | 50,000 | $4,000 |
| Cascade total | | | 1,000,000 | $5,980 |
| All-deep counterfactual | o3-pro effort=high everywhere | $80.00 | 1,000,000 | $80,000 |
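The cascade is ~13× cheaper ($80,000 / $5,980 ≈ 13.4) with no measurable quality loss on the fast-tier traffic, provided you verify that against your golden set. Note that the table bills each request only at the tier where it stops. If escalated requests also pay for the cheaper tiers they passed through, the cascade total rises to $6,600 and the ratio drops to about 12×. Latency improves as well: the 80% of requests that stop at the fast tier return much sooner, improving P50 by ~10×.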
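The arithmetic in the table can be checked with a short script. This is a sketch using only the prices, volumes, and token count quoted above; the function names are illustrative. The second function models a stricter accounting in which every request runs the fast tier first and escalated requests also pay for the tiers they passed through, which the table does not include.

```python
# Inputs copied from the worked-example table.
PRICE_PER_M_OUTPUT = {"fast": 0.60, "reasoning": 10.00, "deep": 80.00}  # $ per 1M output tokens
STOPPED_AT = {"fast": 800_000, "reasoning": 150_000, "deep": 50_000}    # requests per day
OUTPUT_TOKENS = 1_000                                                    # per request
TIER_ORDER = ("fast", "reasoning", "deep")


def tier_cost(tier: str, requests: int) -> float:
    """Daily cost of serving `requests` requests on one tier."""
    return requests * OUTPUT_TOKENS / 1_000_000 * PRICE_PER_M_OUTPUT[tier]


def cost_final_tier_only() -> float:
    """Bill each request only at the tier where it stops (the table's accounting)."""
    return sum(tier_cost(tier, n) for tier, n in STOPPED_AT.items())


def cost_with_reruns() -> float:
    """Bill every tier a request passes through, cheapest first."""
    remaining = sum(STOPPED_AT.values())
    total = 0.0
    for tier in TIER_ORDER:
        total += tier_cost(tier, remaining)  # everyone still in flight runs this tier
        remaining -= STOPPED_AT[tier]        # some stop here, the rest escalate
    return total


def cost_all_deep() -> float:
    """Counterfactual: send all traffic straight to the deep tier."""
    return tier_cost("deep", sum(STOPPED_AT.values()))
```

Under these inputs the two accountings come to about $5,980 and $6,600 per day against $80,000 for all-deep, so the conclusion holds either way: the cascade is roughly 12 to 13 times cheaper.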
This single decision — cascade vs flat routing — is the difference between a financially viable AI product and an inference-bill horror story.
Routing signals that matter most in 2026
After hundreds of cascade deployments, the signals that consistently buy quality:
- First-pass schema/verifier pass — strongest single signal. If the fast model's output passes your structured-output schema and your domain verifier, escalation rarely changes the answer.
- Confidence proxies — model self-rating, logprobs (for non-reasoning models), or a tiny classifier head. Self-rating asks the model to score its own answer 1-5; it correlates with accuracy at r ≈ 0.5–0.7 — useful, not perfect.
- Task tag at intake — "code refactor across 3+ files" or "tax calc with multi-step deductions" — pre-classified by the front-end agent. Pre-classification is cheap and well-calibrated.
- User-tier override — paid users get deeper routing on demand.
- Retry-after-failure — failed verifier? Escalate, don't retry the same model.
Anti-signals (weak predictors that engineers reach for anyway):
- Long prompt is not always hard. Many long prompts are pasted context with a simple ask.
- Short prompt is not always easy. "Prove P=NP" is only a few tokens.
- Random sampling of users for an A/B test is not routing; it's experimentation.
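The first-pass check and the escalate-on-failure rule can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `passes_schema` here only checks the required keys and types of a JSON object, and the `TIERS` list and the `min_rating` cutoff are hypothetical. A real deployment would likely use a full schema validator plus a domain verifier.

```python
import json

# Hypothetical lane order, cheapest first.
TIERS = ["fast", "reasoning", "deep", "human"]


def passes_schema(raw: str, required: dict) -> bool:
    """True if `raw` is a JSON object holding every required key with the expected type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def should_escalate(raw: str, required: dict, self_rating: int, min_rating: int = 4) -> bool:
    """Escalate when the output fails the schema or the model's 1-5 self-rating is low."""
    return (not passes_schema(raw, required)) or self_rating < min_rating


def next_tier(current: str) -> str:
    """On failure, move up one lane instead of retrying the same model."""
    index = TIERS.index(current)
    return TIERS[min(index + 1, len(TIERS) - 1)]
```

The schema check runs first because it is deterministic and free; the self-rating is only a proxy and is consulted second.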
When to use a learned router
Hand-coded rules cover 70–80% of routing decisions cheaply. A learned router (small classifier, often a fine-tuned BERT-class model or even logistic regression on extracted features) adds value when:
- You have 5,000+ labelled routing decisions with ground-truth "should-have-escalated" labels.
- Your rule-based router is systematically wrong on identifiable clusters.
- The cost of false escalation × volume > cost of training and maintaining a classifier.
Production learned routers (Anthropic's internal model picker, Anyscale's routing, Martian) report 10–25% additional cost savings on top of rule-based cascades. But they require an offline-eval loop and constant retraining as task distributions shift.
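The third condition in the list above is a break-even comparison, and a back-of-the-envelope version is easy to write down. Every input in this sketch is hypothetical: the $0.07 extra cost per false escalation is the gap between the $80 and $10 per-million prices from the worked example at 1,000 output tokens, while the false-escalation volume, the fraction a classifier would fix, and the classifier's daily cost are made-up placeholders.

```python
def learned_router_daily_savings(false_escalations_per_day: int,
                                 extra_cost_per_false_escalation: float,
                                 fraction_fixed_by_classifier: float) -> float:
    """Dollars per day saved if the classifier removes a share of false escalations."""
    return (false_escalations_per_day
            * extra_cost_per_false_escalation
            * fraction_fixed_by_classifier)


def worth_training(daily_savings: float, classifier_cost_per_day: float) -> bool:
    """True when savings exceed the amortized cost of training and maintaining the classifier."""
    return daily_savings > classifier_cost_per_day


# Hypothetical inputs: 20,000 false escalations/day, $0.07 wasted on each,
# and a classifier expected to eliminate half of them.
savings = learned_router_daily_savings(20_000, 0.07, 0.5)
```

With these placeholder numbers the savings come to about $700 per day, so a classifier costing $300 per day to run and retrain would pay for itself, and one costing $1,000 per day would not.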
Routing mistakes that hurt
| Mistake | What goes wrong |
|---|---|
| Routing only by user prestige | Free users get bad quality; paid users get over-routed |
| Routing only by token count | Long pastes get escalated unnecessarily |
| One-shot escalation on any uncertainty | Tail traffic floods deep tier |
| No measurement of routing decisions | You can't tell if your router is wrong |
| Same router across products | Different surfaces have different SLAs |
| Forgetting fallback for deep-tier outage | When o3 has an incident, the whole product breaks |
| Ignoring "human review" as a lane | Some tasks should not be answered automatically |
Measure: false escalation rate (cheap task sent to deep tier) and missed escalation rate (hard task served by fast tier). The cost of each is asymmetric — missing an escalation often costs more than over-escalating. Tune accordingly.
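Both rates can be computed from a log of routing decisions once each has a ground-truth label. The sketch below assumes a hypothetical `Decision` record holding the lane the router picked and the lane a reviewer judged sufficient; the 1:5 cost weighting is an illustrative way to encode the asymmetry, not a recommended value.

```python
from dataclasses import dataclass

# Lanes ordered from cheapest to most expensive.
TIER_RANK = {"fast": 0, "reasoning": 1, "deep": 2, "human": 3}


@dataclass
class Decision:
    routed: str  # lane the router picked
    needed: str  # lane a reviewer judged sufficient (ground-truth label)


def escalation_rates(decisions: list) -> dict:
    """Share of decisions routed too high (false) or too low (missed)."""
    if not decisions:
        raise ValueError("need at least one labelled decision")
    total = len(decisions)
    false_esc = sum(TIER_RANK[d.routed] > TIER_RANK[d.needed] for d in decisions)
    missed = sum(TIER_RANK[d.routed] < TIER_RANK[d.needed] for d in decisions)
    return {
        "false_escalation_rate": false_esc / total,
        "missed_escalation_rate": missed / total,
    }


def weighted_routing_loss(rates: dict, cost_false: float = 1.0, cost_missed: float = 5.0) -> float:
    """Single tuning target that penalizes missed escalations more heavily."""
    return (rates["false_escalation_rate"] * cost_false
            + rates["missed_escalation_rate"] * cost_missed)
```

Tracking one weighted number alongside the two raw rates makes threshold changes comparable across releases.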
Where this lives in the wild
- GitHub Copilot Auto-mode — task-aware routing released April 2026. Routes by intent: autocomplete → fast model, multi-step reasoning → o3, large refactors → Claude Opus 4.7. The user picks the capability (chat, agent, completion); Copilot picks the model.
- Cursor Auto mode — frontier picks: Opus 4.7 for architecture/refactors, GPT-5.5 for general edits, DeepSeek V4 Pro for cheap reasoning fallback. Routes per task type, transparent to user.
- Perplexity Computer (May 2026) — Claude Opus 4.6 orchestrator routes sub-agents: Gemini for deep research, code-specialized models for coding sub-tasks. Routing is recursive (orchestrator routes other reasoners).
- Intercom Fin — support automation — escalation rules built into the routing layer: simple FAQ → fast model, account-specific → reasoning model with retrieval, refund/cancel → human review queue.
- Harvey AI (legal) — cascading pipeline: retrieval → case-law specialist model → reasoning orchestrator → tool-grounded verifier. Each tier filters before the next.
Pause and recall
- In a typical cascade, what fraction of traffic stops at the fast tier and what fraction reaches the deep tier?
- In the 1M/day example, what was the cost ratio between cascade and all-deep routing?
- Name the strongest single routing signal and one common anti-signal.
- What two metrics measure router quality, and which is usually more costly?
Interview Q&A
Q: Walk me through how you'd design a routing layer for a customer support copilot at 100K conversations/day.
A: Start with a four-lane cascade. Lane 1 (fast) — Haiku 4.5 or GPT-4.1-mini for greeting, intent classification, simple FAQ; ~70% of traffic. Lane 2 (reasoning medium) — Sonnet 4.6 with effort=medium for account-specific issues that need retrieval; ~22%. Lane 3 (reasoning high + tools) — Opus 4.7 or GPT-5.5 with tool access for complex multi-step (refund eligibility, fraud review); ~6%. Lane 4 (human) — refunds > $500, account closures, anything flagged by the verifier; ~2%. Router signals: intent classifier from Lane 1 output, retrieval hit count, user tier, account risk score, business-action type. Measure false-escalation rate (Lane 3 used when Lane 2 would've sufficed) and missed-escalation rate (wrong answer that Lane 3 would've gotten right). Tune thresholds quarterly against your golden set.
Common wrong answer to avoid: "Send everything to the best model." At 100K conversations/day with reasoning enabled, you're looking at $50K-$100K/day in inference at most providers' frontier tier. The CFO will not approve. Cascade is the cost-defensible answer.
Q: Your fast tier is wrong 5% of the time. The cost of being wrong is $50 per error. The deep tier is correct 99% of the time at $0.20 per call. What do you do?
A: Math first. Fast tier expected error cost per call: 0.05 × $50 = $2.50. Deep tier expected error cost: 0.01 × $50 = $0.50. Deep tier total cost: $0.50 + $0.20 = $0.70. Fast tier total: $2.50 + (fast price, say $0.01) = $2.51. Deep is ~3.6× cheaper at this error cost. Route to deep, or better, run the fast model first and call the deep model only for the 20% of requests where the fast model's confidence is low, capturing most of the quality at close to fast prices. Show the expected-cost math; senior loops will respect it.
Common wrong answer to avoid: "Always use the model with higher accuracy." That ignores the cost of the model itself. The right comparison is total cost = serving cost + error rate × cost per error. When error cost dominates, deep is right. When error cost is low, fast is right.
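The expected-cost comparison from this answer fits in a few lines. The sketch below uses the numbers from the question plus the $0.01 fast price assumed in the answer; the function names are illustrative. It also solves for the cost per error at which the two tiers tie, which makes "when error cost dominates" concrete.

```python
def expected_cost_per_call(serving_cost: float, error_rate: float, cost_per_error: float) -> float:
    """Total cost = serving cost + error rate x cost per error."""
    return serving_cost + error_rate * cost_per_error


def break_even_error_cost(fast_serving: float, fast_error_rate: float,
                          deep_serving: float, deep_error_rate: float) -> float:
    """Cost per error above which the deep tier becomes the cheaper choice."""
    if fast_error_rate <= deep_error_rate:
        raise ValueError("deep tier must be more accurate for a break-even to exist")
    # Solve fast_serving + fast_err * c == deep_serving + deep_err * c for c.
    return (deep_serving - fast_serving) / (fast_error_rate - deep_error_rate)


fast_total = expected_cost_per_call(0.01, 0.05, 50.0)   # fast tier, 5% errors
deep_total = expected_cost_per_call(0.20, 0.01, 50.0)   # deep tier, 1% errors
threshold = break_even_error_cost(0.01, 0.05, 0.20, 0.01)
```

With these inputs the fast tier costs about $2.51 per call and the deep tier about $0.70, and the break-even sits near $4.75 per error: below that the fast tier is cheaper, above it the deep tier is.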
Q: A teammate proposes using model logprobs as the confidence signal for escalation. What's the catch?
A: Three catches. First, reasoning models hide their logprobs: o-series, Claude extended thinking, and Gemini thinking don't expose token logprobs in the response, so you can't use them there. Second, logprobs measure fluency, not correctness: a confidently wrong answer has high logprobs by definition. Third, poor calibration: modern aligned models are systematically over-confident. The better confidence signal is self-rating (ask the model to rate its confidence 1-5) or programmatic verification (does the output pass the schema?). Logprobs work as a rough signal on non-reasoning models with calibration adjustment, but they're not the right primary signal.
Common wrong answer to avoid: "Logprobs are the gold-standard confidence signal" — they were useful pre-RLHF. After alignment training, calibration drifted; modern models are over-confident on logprobs. Anthropic and OpenAI's own docs warn against trusting raw logprobs as confidence.
Q: How do you handle a deep-tier outage in your routing layer?
A: Defense in depth. Fallback model: if Opus 4.7 is down, route to GPT-5.5 or o3 automatically. Degraded mode: if all reasoning models are down, fall back to the fast model with a banner ("Deeper analysis unavailable, showing best-effort result"). Circuit breaker: after N consecutive errors, stop calling the failing tier for M seconds. Caching: serve recent answers for idempotent queries. Human queue for irreversible actions: if compliance work needs deep reasoning and it's down, queue for human review rather than auto-approve. The router is a single point of failure if you don't design for provider outages.
Common wrong answer to avoid: "We trust the provider's SLA" — every major provider has had multi-hour outages. Your router needs fallbacks built in or your product breaks when theirs does.
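The circuit-breaker and fallback pieces of that answer can be sketched without any provider SDK. Everything here is a simplified illustration: `CircuitBreaker`, `call_with_fallback`, the default of 3 failures and 30 seconds, and the `"human_queue"` sentinel are hypothetical names and values, and each tier is represented as a plain callable that raises on provider errors.

```python
import time


class CircuitBreaker:
    """Stop calling a tier after repeated failures, then retry after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and let traffic try again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_fallback(request, tiers, breakers):
    """Try each (name, callable) tier in order; fall through to a human queue."""
    for name, call in tiers:
        breaker = breakers[name]
        if not breaker.allow():
            continue                 # tier is in cooldown, skip it
        try:
            result = call(request)
        except Exception:
            breaker.record_failure()
            continue                 # try the next tier in the chain
        breaker.record_success()
        return name, result
    return "human_queue", None       # nothing answered: queue for review
```

Ordering `tiers` as preferred model, alternate provider, then fast model gives the fallback and degraded-mode behavior described above; returning the tier name lets the caller decide whether to show a best-effort banner.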
Apply now (5 min)
Take your last 100 production requests. Manually classify each into: fast-sufficient, reasoning-required, deep-required, human-required. Then check your current routing — how many were over- or under-routed? Compute the cost waste and the missed-quality cost. That's your routing-improvement budget.
Sketch from memory: Draw the four-lane cascade with one signal per escalation transition. Annotate the typical traffic split (70/22/6/2 or whatever your domain expects).
Bridge. Routing decisions become business decisions the moment cost and latency matter. Time to make the tradeoff framework explicit, with real $ and ms numbers. → 10-cost-quality-latency-tradeoffs.md