Skip to content

09. Task Routing Patterns — Do not pay grandmaster prices for calculator work

~11 min read. Strong reasoning systems are 80% routing and 20% the deep model. This chapter is the highest-leverage applied-engineer skill in the module.

Built on the ELI5 in 00-eli5.md. the time budget — spent selectively by task difficulty and risk — turns reasoning from a single model choice into a system-level routing policy.


Think in lanes, not one giant pipe

A real product has many request types: easy autocompletions, medium classifications, hard multi-step tasks, compliance-critical decisions, ambiguous open-ended planning. If all go through one model path, you waste money somewhere — either by under-reasoning the hard ones or over-paying for the easy ones.

request
┌──────────┐
│  Router  │
└────┬─────┘
     ├──→ Fast lane         (gpt-4.1-mini, haiku 4.5 — no reasoning)
     ├──→ Reasoning lane    (gpt-5.5, sonnet 4.6 thinking medium)
     ├──→ Deep lane         (opus 4.7 effort=high, o3-pro)
     ├──→ Search/tool lane  (agent loop with verifier)
     └──→ Human review      (out-of-policy, low confidence, irreversible)

That is the time budget made operational. The router is the traffic cop — not the model itself. The 2026 production pattern is almost always a cascade of 3–5 lanes.


Signals a router can use

Cheap features the router computes on the request before spending the deep tier:

Signal What it predicts How to extract
Task length / token count Complexity correlates with token count Count tokens of the user message
Presence of numeric/math content Math benefits from reasoning Regex / classifier
Code keywords (def, class, SELECT) Code tasks vary by complexity Pattern match
Constraint count More constraints → reasoning helps Parse for "must", "should", numbered rules
Multi-file context attached Cross-file → escalate Count attached files
User tier / SLA Premium users → may justify deeper model User metadata
Business risk class Compliance, money, safety → escalate Domain tagging
First-pass confidence Low confidence → retry deeper Logprob or self-rating
Prior turn was hard Sticky escalation Session state

Combine into a score, or train a small classifier. Most production routers start with 5–10 hand-coded rules; learned routers come later when you have enough labeled data.


Cascade pattern: the production default

The most-used routing pattern in 2026. Run the cheapest model first; escalate only when needed.

async def cascade_route(request, golden_set_thresholds):
    # Tier 1: fast model
    fast_resp = await client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[...],
    )
    if confidence(fast_resp) >= 0.85 and passes_schema(fast_resp):
        return fast_resp  # 80% of requests stop here

    # Tier 2: reasoning at medium effort
    reasoning_resp = await client.responses.create(
        model="gpt-5",
        input=request,
        reasoning={"effort": "medium"},
    )
    if confidence(reasoning_resp) >= 0.80 and passes_verifier(reasoning_resp):
        return reasoning_resp  # another 15% stop here

    # Tier 3: deep reasoning
    deep_resp = await client.responses.create(
        model="o3-pro",
        input=request,
        reasoning={"effort": "high"},
    )
    if not passes_verifier(deep_resp):
        return escalate_to_human(request, deep_resp)

    return deep_resp  # remaining 5%

The cascade lets you serve 80% of requests at 5% of the cost while preserving deep-tier quality on the long tail. Compare against routing every request to o3-pro: you'd spend ~15× as much for indistinguishable median quality.


Worked example: cascade cost math at 1M requests/day

Approximate May 2026 numbers, average 1,000 output tokens per request, 80/15/5 split.

Tier Model $/M out Vol/day Cost/day
Fast gpt-4.1-mini $0.60 800,000 $480
Reasoning medium GPT-5 effort=medium $10.00 150,000 $1,500
Deep o3-pro effort=high $80.00 50,000 $4,000
Cascade total 1,000,000 $5,980
All-deep counterfactual o3-pro effort=high everywhere $80.00 1,000,000 $80,000

The cascade is ~13× cheaper with no measurable quality loss on the fast-tier traffic (verified against your golden set). The savings compound: each tier also has lower latency for the 80% common case, improving P50 by ~10×.

This single decision — cascade vs flat routing — is the difference between a financially viable AI product and an inference-bill horror story.


Routing signals that matter most in 2026

After hundreds of cascade deployments, the signals that consistently buy quality:

  1. First-pass schema/verifier pass — strongest single signal. If the fast model's output passes your structured-output schema and your domain verifier, escalation rarely changes the answer.
  2. Confidence proxies — model self-rating, logprobs (for non-reasoning models), or a tiny classifier head. Self-rating asks the model to score its own answer 1-5; it correlates with accuracy at r ≈ 0.5–0.7 — useful, not perfect.
  3. Task tag at intake — "code refactor across 3+ files" or "tax calc with multi-step deductions" — pre-classified by the front-end agent. Pre-classification is cheap and well-calibrated.
  4. User-tier override — paid users get deeper routing on demand.
  5. Retry-after-failure — failed verifier? Escalate, don't retry the same model.

Anti-signals (look weak but engineers reach for them): - Long prompt is not always hard. Many long prompts are pasted context with a simple ask. - Short prompt is not always easy. "Prove P=NP" is six tokens. - Random sampling of users for A/B test is not routing — it's experimentation.


When to use a learned router

Hand-coded rules cover 70–80% of routing decisions cheaply. A learned router (small classifier, often a fine-tuned BERT-class model or even logistic regression on extracted features) adds value when:

  • You have 5,000+ labelled routing decisions with ground-truth "should-have-escalated" labels.
  • Your rule-based router is systematically wrong on identifiable clusters.
  • The cost of false escalation × volume > cost of training and maintaining a classifier.

Production learned routers (Anthropic's internal model picker, Anyscale's routing, Martian) report 10–25% additional cost savings on top of rule-based cascades. But they require an offline-eval loop and constant retraining as task distributions shift.


Routing mistakes that hurt

Mistake What goes wrong
Routing only by user prestige Free users get bad quality; paid users get over-routed
Routing only by token count Long pastes get escalated unnecessarily
One-shot escalation on any uncertainty Tail traffic floods deep tier
No measurement of routing decisions You can't tell if your router is wrong
Same router across products Different surfaces have different SLAs
Forgetting fallback for deep-tier outage When o3 has an incident, the whole product breaks
Ignoring "human review" as a lane Some tasks should not be answered automatically

Measure: false escalation rate (cheap task sent to deep tier) and missed escalation rate (hard task served by fast tier). The cost of each is asymmetric — missing an escalation often costs more than over-escalating. Tune accordingly.


Where this lives in the wild

  • GitHub Copilot Auto-mode — task-aware routing released April 2026. Routes by intent: autocomplete → fast model, multi-step reasoning → o3, large refactors → Claude Opus 4.7. The user picks the capability (chat, agent, completion); Copilot picks the model.
  • Cursor Auto mode — frontier picks: Opus 4.7 for architecture/refactors, GPT-5.5 for general edits, DeepSeek V4 Pro for cheap reasoning fallback. Routes per task type, transparent to user.
  • Perplexity Computer (May 2026) — Claude Opus 4.6 orchestrator routes sub-agents: Gemini for deep research, code-specialized models for coding sub-tasks. Routing is recursive (orchestrator routes other reasoners).
  • Intercom Fin — support automation — escalation rules built into the routing layer: simple FAQ → fast model, account-specific → reasoning model with retrieval, refund/cancel → human review queue.
  • Harvey AI (legal) — cascading pipeline: retrieval → case-law specialist model → reasoning orchestrator → tool-grounded verifier. Each tier filters before the next.

Pause and recall

  1. In a typical cascade, what fraction of traffic stops at the fast tier and what fraction reaches the deep tier?
  2. In the 1M/day example, what was the cost ratio between cascade and all-deep routing?
  3. Name the strongest single routing signal and one common anti-signal.
  4. What two metrics measure router quality, and which is usually more costly?

Interview Q&A

Q: Walk me through how you'd design a routing layer for a customer support copilot at 100K conversations/day. A: Start with a four-lane cascade. Lane 1 (fast) — Haiku 4.5 or GPT-4.1-mini for greeting, intent classification, simple FAQ; ~70% of traffic. Lane 2 (reasoning medium) — Sonnet 4.6 with effort=medium for account-specific issues that need retrieval; ~22%. Lane 3 (reasoning high + tools) — Opus 4.7 or GPT-5.5 with tool access for complex multi-step (refund eligibility, fraud review); ~6%. Lane 4 (human) — refunds > $500, account closures, anything flagged by the verifier; ~2%. Router signals: intent classifier from Lane 1 output, retrieval hit count, user tier, account risk score, business-action type. Measure false-escalation rate (Lane 3 used when Lane 2 would've sufficed) and missed-escalation rate (wrong answer that Lane 3 would've gotten right). Tune thresholds quarterly against your golden set.

Common wrong answer to avoid: "Send everything to the best model" — at 100K conversations/day with reasoning enabled, you're looking at \(50K-\)100K/day in inference at most providers' frontier tier. The CFO will not approve. Cascade is the cost-defensible answer.

Q: Your fast tier is wrong 5% of the time. The cost of being wrong is $50 per error. The deep tier is correct 99% of the time at $0.20 per call. What do you do? A: Math first. Fast tier expected error cost per call: 0.05 × $50 = $2.50. Deep tier error cost: 0.01 × $50 = $0.50. Deep tier total cost: $0.50 + $0.20 = $0.70. Fast tier total: $2.50 + (fast price, say $0.01) = $2.51. Deep is ~4× cheaper at this error cost. Route to deep — or better, use the deep model only for the 20% of requests where the fast model's confidence is low, capturing most of the quality at fast prices. Show the expected-cost math; senior loops will respect it.

Common wrong answer to avoid: "Always use the model with higher accuracy" — that ignores the cost of the model itself. The right comparison is total cost = serving cost + expected error cost × error rate. When error cost dominates, deep is right. When error cost is low, fast is right.

Q: A teammate proposes using model logprobs as the confidence signal for escalation. What's the catch? A: Three catches. First, reasoning models hide their logprobs — o-series, Claude extended thinking, Gemini thinking don't expose token logprobs in the response. You can't use them. Second, logprobs measure fluency, not correctness — a confidently-wrong answer has high logprobs by definition. Third, poor calibration — modern aligned models are systematically over-confident. The better confidence signal is self-rating (ask the model 1-5 how confident) or programmatic verification (does the output pass the schema?). Logprobs work as a rough signal on non-reasoning models with calibration adjustment, but they're not the right primary signal.

Common wrong answer to avoid: "Logprobs are the gold-standard confidence signal" — they were useful pre-RLHF. After alignment training, calibration drifted; modern models are over-confident on logprobs. Anthropic and OpenAI's own docs warn against trusting raw logprobs as confidence.

Q: How do you handle a deep-tier outage in your routing layer? A: Defense in depth. Fallback model — if Opus 4.7 is down, route to GPT-5.5 or o3 automatically. Degraded mode — if all reasoning models are down, fall back to fast model with a banner ("Deeper analysis unavailable, showing best-effort result"). Circuit breaker — after N consecutive errors, stop calling the failing tier for M seconds. Caching of recent answers for idempotent queries. Human queue for irreversible actions — if compliance work needs deep reasoning and it's down, queue for human review rather than auto-approve. The router is a single point of failure if you don't design for provider outages.

Common wrong answer to avoid: "We trust the provider's SLA" — every major provider has had multi-hour outages. Your router needs fallbacks built in or your product breaks when theirs does.


Apply now (5 min)

Take your last 100 production requests. Manually classify each into: fast-sufficient, reasoning-required, deep-required, human-required. Then check your current routing — how many were over- or under-routed? Compute the cost waste and the missed-quality cost. That's your routing-improvement budget.

Sketch from memory: Draw the four-lane cascade with one signal per escalation transition. Annotate the typical traffic split (70/22/6/2 or whatever your domain expects).


Bridge. Routing decisions become business decisions the moment cost and latency matter. Time to make the tradeoff framework explicit, with real $ and ms numbers. → 10-cost-quality-latency-tradeoffs.md