12. Routing strategies — choose the path before spending the budget

~13 min read. Different questions want different retrieval stacks. Routing is how the system picks the cheapest path that is still good enough.

Built on the ELI5 in 00-eli5.md. The multi-step plan (choosing how to split and handle a query) becomes a router when the system selects one retrieval path from several options.

1) The wall — when one RAG chain is either too weak or too expensive

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: different query types need different retrieval costs, indexes, and safety checks. This page follows a route table mapping query class to retriever stack so you can see whether route-specific retrieval actually changes the evidence path or just adds another box to the architecture diagram.

The tempting repair is to send every query through the most expensive pipeline. That keeps the system simple, and the answers may even be right, but the cost is misallocated: a definition query should not pay for multi-hop decomposition, while a contract-risk query should, because there the cost buys evidence coverage.

Root cause: a single fixed pipeline commits every query to the same retrieval cost and coverage before anything has checked what the query needs. Rule: route by evidence need and risk, not by pipeline convenience.

Mini-FAQ. "What is the control point here?" The multi-step plan is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

2) The core visual — route by query risk and evidence need

Consider an airport control tower.

Small planes and large planes do not use identical procedures.

Queries are like that too.

An FAQ question is not a compare-and-contrast question.

A product-code lookup is not a conceptual “why” question.

Routing means choosing the right runway early.

One route may use sparse-heavy search.

Another may use hybrid plus reranking.

Another may use decomposition and iterative retrieval.

incoming query
      │
      ▼
   router
      │
      ├── route A: FAQ lookup
      ├── route B: exact ID lookup
      ├── route C: compare / decompose
      └── route D: conceptual / HyDE

Good routing can save cost and improve quality at the same time, because easy queries skip expensive steps and hard queries get the retrieval they need.
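As a minimal sketch, the four branches above can be written down as a route table that maps a query class to a retriever stack. The class keys, stack step names, and cost ranks below are illustrative assumptions, not values from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str               # route label, taken from the diagram above
    retriever_stack: tuple  # ordered retrieval steps (hypothetical step names)
    relative_cost: int      # rough cost rank, 1 = cheapest (assumed values)


# Hypothetical route table: query class -> route definition.
ROUTE_TABLE = {
    "faq": Route("A: FAQ lookup", ("faq_index",), 1),
    "exact_id": Route("B: exact ID lookup", ("sparse_search",), 1),
    "compare": Route("C: compare / decompose",
                     ("decompose", "hybrid_search", "rerank"), 3),
    "conceptual": Route("D: conceptual / HyDE", ("hyde", "dense_search"), 2),
}


def stack_for(query_class: str) -> tuple:
    """Return the retriever stack for a query class (KeyError if unknown)."""
    return ROUTE_TABLE[query_class].retriever_stack
```

Keeping the table as plain data makes the route choice easy to log, diff, and review.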

3) What the router is really deciding

A router can use rules, classifiers, or a small LLM policy.

Its job is not to answer.

Its job is to select the retrieval workflow.

If every query runs the heaviest stack, latency and cost jump.

If every query runs the lightest stack, hard questions break.

So routing is a budget allocation problem.

That is why the multi-step plan matters here.

The plan is no longer only about splitting one question.

It is about choosing the whole query path.

Mature systems route on observable features.

Those include presence of IDs, compare words, time filters, broad “why” phrasing, and required output structure.
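One way to make these features observable is a handful of cheap pattern checks that run before any retrieval. This is a sketch under assumptions: the regular expressions, the ID shape (such as INC-1042), and the feature names are hypothetical, and a production router might use a classifier instead. The required-output-structure feature is left out for brevity.

```python
import re

# All patterns are illustrative assumptions; tune them to your own traffic.
COMPARE_WORDS = re.compile(r"\b(compare|versus|vs|difference between)\b",
                           re.IGNORECASE)
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")  # assumed ID shape, e.g. INC-1042
TIME_FILTER = re.compile(r"\b(Q[1-4]|20\d{2}|last (week|month|quarter|year))\b",
                         re.IGNORECASE)
WHY_PHRASING = re.compile(r"^\s*(why|how come)\b", re.IGNORECASE)


def extract_features(query: str) -> dict:
    """Return boolean routing features observed directly in the query text."""
    return {
        "has_compare_language": bool(COMPARE_WORDS.search(query)),
        "has_exact_id": bool(ID_PATTERN.search(query)),
        "has_time_filter": bool(TIME_FILTER.search(query)),
        "has_why_phrasing": bool(WHY_PHRASING.search(query)),
    }
```

A query such as "Why did INC-1042 page the on-call?" sets both the why-phrasing and exact-ID features, which is the kind of conflict a keyword-only router would miss.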

4) The worked example — trace the intermediate state

Suppose the system supports four routes.

R1 = sparse-heavy lookup for exact IDs

R2 = hybrid plus reranking for policy lookup

R3 = decomposition for compare questions

R4 = HyDE plus iterative retrieval for conceptual “why” questions

Now score one query.

Query:

“Compare Q3 and Q4 GPU margin changes across all regions.”

Router feature scores are:

contains compare language = 1.0

contains exact identifier = 0.0

contains broad why phrasing = 0.1

needs structured table output = 0.8

Compute route utilities.

R1 utility = 0.2

R2 utility = 0.5

R3 utility = 0.9

R4 utility = 0.3

router decision
R1 exact lookup      0.2
R2 hybrid policy     0.5
R3 decomposition     0.9  ← choose
R4 conceptual HyDE   0.3

Now read the intermediate state.

The query is not mainly about one ID.

It is not mainly conceptual.

It needs aligned sub-questions and comparison.

So route R3 wins.
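The decision step can be sketched as an argmax over the utilities listed above. The text does not specify how utilities are computed from the feature scores, so this sketch takes them as given; the function name and the reported margin are illustrative additions.

```python
# Route utilities copied from the worked example.
UTILITIES = {"R1": 0.2, "R2": 0.5, "R3": 0.9, "R4": 0.3}


def choose_route(utilities: dict) -> tuple:
    """Pick the highest-utility route and report its lead over the runner-up.

    Assumes at least two routes are scored.
    """
    ranked = sorted(utilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_u), (_, second_u) = ranked[0], ranked[1]
    return best, round(best_u - second_u, 2)


best, margin = choose_route(UTILITIES)
print(best, margin)  # R3 0.4
```

Reporting the margin alongside the winner is a design choice: it makes a narrow win visible in the trace instead of hiding it behind a single label.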

5) Failure modes — how the mechanism breaks

Failure one. The router uses vague labels.

Everything becomes “general search.”

You have built no real routes at all.

Failure two. The router overfits to keywords.

One query contains “why” but is really an exact incident lookup.

The system sends it to HyDE unnecessarily.

Failure three. No feedback returns from downstream quality checks.

Bad routes stay bad because nothing updates the policy.

So what should you do?

Route on a small set of clear features.

Then use downstream success and failure to improve the router.

That is the practical use of the multi-step plan.
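A minimal sketch of that feedback loop: tally downstream outcomes per route and flag routes whose success rate drops. The class name, the thresholds, and the meaning of "success" (for example, an answer that passed an evidence check) are assumptions for illustration.

```python
from collections import defaultdict


class RouteFeedback:
    """Track downstream outcomes per route so weak routes become visible."""

    def __init__(self):
        # route name -> [successes, total]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, route: str, success: bool) -> None:
        counts = self._counts[route]
        counts[0] += int(success)
        counts[1] += 1

    def success_rate(self, route: str):
        successes, total = self._counts[route]
        return successes / total if total else None

    def needs_review(self, min_samples: int = 3, min_rate: float = 0.7) -> list:
        """Routes with enough samples and a success rate below min_rate."""
        return sorted(
            route
            for route, (successes, total) in self._counts.items()
            if total >= min_samples and successes / total < min_rate
        )


feedback = RouteFeedback()
for outcome in (False, False, True):
    feedback.record("R4", outcome)
for outcome in (True, True, True):
    feedback.record("R3", outcome)
print(feedback.needs_review())  # ['R4']
```

The flagged list is a prompt for human review or retraining, not an automatic policy change.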

6) Production rules that hold up

Keep the number of routes small at first.

Name each route by what it is best at.

Make route selection inspectable.

Log chosen route, latency, and final quality outcome.

Add confidence thresholds so borderline cases can fall back to a safer route.
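The fallback rule can be sketched as two checks on the route utilities: the winner must score high enough, and it must lead the runner-up by a clear margin. The threshold values and the choice of R2 (hybrid plus reranking) as the safer default are assumptions for illustration, not recommendations from measured traffic.

```python
def route_with_fallback(utilities: dict,
                        fallback: str = "R2",
                        min_utility: float = 0.6,
                        min_margin: float = 0.15) -> str:
    """Return the top route, or the fallback when the decision is borderline.

    Assumes at least two routes are scored. Thresholds are hypothetical
    and should be calibrated on production traffic.
    """
    ranked = sorted(utilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_u), (_, second_u) = ranked[0], ranked[1]
    if best_u < min_utility or (best_u - second_u) < min_margin:
        return fallback
    return best


# Clear winner: the specialized route is used.
print(route_with_fallback({"R1": 0.2, "R2": 0.5, "R3": 0.9, "R4": 0.3}))  # R3
# Borderline scores: fall back to the safer default.
print(route_with_fallback({"R1": 0.45, "R2": 0.4, "R3": 0.5, "R4": 0.3}))  # R2
```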

Routing is not separate from evaluation.

A route is good only if it wins on the queries it claims.

And even after a route is chosen, the system still must decide whether the retrieved evidence is strong enough to answer.

That last decision belongs to confidence gates.

7) Why not one universal RAG chain under this workload

The plausible alternative is one universal RAG chain. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when different query types need different retrieval costs, indexes, and safety checks. Under that workload, route-specific retrieval earns its cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
one universal RAG chain	evidence need is simple	different query types need different retrieval costs, indexes, and safety checks	prompt wording and user trust
routing	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.

8) Production signals — know whether routing is working

A healthy trace shows that simple questions stay cheap while risky questions get stronger checks. The first metric to watch is route regret rate: how often a different route would have served the query better than the one chosen. Top-1 similarity is a weaker signal because it can improve while a binding constraint is still missing.
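Route regret rate is not given a formula here, so the sketch below uses one plausible definition: the share of logged queries where the chosen route differs from the route judged best in hindsight. The record fields and the existence of a hindsight label (from offline replay or human review) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RouteLog:
    query_id: str
    chosen_route: str
    hindsight_best_route: str  # assumed to come from offline replay or labels
    latency_ms: float
    answer_supported: bool


def route_regret_rate(logs: list) -> float:
    """Fraction of logged queries routed differently from the hindsight best."""
    if not logs:
        return 0.0
    regrets = sum(1 for r in logs if r.chosen_route != r.hindsight_best_route)
    return regrets / len(logs)


sample_logs = [
    RouteLog("q1", "R1", "R1", 40.0, True),
    RouteLog("q2", "R4", "R1", 900.0, True),   # overpaid: cheap route sufficed
    RouteLog("q3", "R3", "R3", 1200.0, True),
    RouteLog("q4", "R2", "R2", 300.0, False),
]
print(route_regret_rate(sample_logs))  # 0.25
```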

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review
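The last step of that loop can be sketched as a filter over trace records: keep the cases where the system answered but later inspection found unsupported claims. The record shape and field names are hypothetical.

```python
def false_greens(records: list) -> list:
    """Return records where the system answered without full support."""
    return [
        r for r in records
        if r["action"] == "answer" and not r["claims_supported"]
    ]


# Hypothetical trace records; "claims_supported" is filled in by later review.
traces = [
    {"query_id": "q1", "route": "R1", "action": "answer", "claims_supported": True},
    {"query_id": "q2", "route": "R2", "action": "answer", "claims_supported": False},
    {"query_id": "q3", "route": "R3", "action": "abstain", "claims_supported": False},
]
print([r["query_id"] for r in false_greens(traces)])  # ['q2']
```

Grouping the result by route shows which branch is leaking unsupported answers.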

9) Boundary — where routing helps, hurts, or wastes budget

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.

10) Wrong model — advanced RAG means adding more retrieval

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the multi-step plan does not change what the system does, it is decoration.

11) Failure taxonomy for routing

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.

12) Pattern transfer — same pressure, different system

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why is one universal RAG chain weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild

Enterprise search orchestrators — choose between FAQ search, document search, and analytic search based on query type.
Perplexity-like systems — route broad conceptual questions differently from exact factual lookups.
Support copilots — send product-code queries into sparse-heavy retrieval and policy questions into hybrid retrieval.
Analytics assistants — decompose compare-and-contrast requests instead of treating them like plain search.
Security bots — route IOC lookups differently from incident root-cause questions.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint

Why is routing mainly a budget allocation problem rather than a generation problem?
In the worked example, which features made the decomposition route win?
What happens when downstream quality never feeds back into the router?
Which false-green case would you review first for routing strategies?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would one universal RAG chain be acceptable instead?

Interview Q&A

Q: Why not just run the strongest retrieval path for every query? A: Because the strongest path is often the slowest and most expensive, and many simple queries do not need that cost.

Common wrong answer to avoid: "Because users prefer variety." — The real issue is efficiency and fit, not variety for its own sake.

Q: What makes a routing policy production-friendly? A: It is inspectable, tied to observable query features, and evaluated against downstream answer quality and cost.

Common wrong answer to avoid: "A routing policy is good if it feels intuitive." — Intuition is not enough when route errors are costly.

Q: Why is a fallback route useful? A: Because some queries are borderline, and a safer default path prevents brittle over-routing into specialized stacks.

Common wrong answer to avoid: "Fallback means the router failed." — Fallback is part of good control design, not a sign of weakness.

Q: What trace would you inspect first when a routing strategy fails? A: Inspect the artifact produced before generation: the route table mapping query class to retriever stack. Then compare it with the final answer claims to find the unsupported step.

Common wrong answer to avoid: "Start by editing the final prompt." — Prompt edits hide whether retrieval, routing, or evidence coverage failed earlier.

Q: What cost does this mechanism add? A: It adds orchestration, latency, logging, and evaluation work. The cost is justified only when it reduces unsupported answers or expensive retries.

Common wrong answer to avoid: "It is free because it is just another LLM call." — Every call consumes latency, money, observability budget, and failure surface.

Q: When should you remove or bypass this mechanism? A: Bypass it for low-risk, simple queries where the evidence need is obvious and the extra decision does not change behavior.

Common wrong answer to avoid: "Never remove advanced RAG components." — Advanced components are useful controls, not trophies.

Apply now (10 min)

List three query types in your domain and write one retrieval route that fits each one best.
Sketch from memory: draw the router with four branches and label the feature that should trigger each branch.
Reproduce from memory: explain routing strategies in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember

Routing exists because different query types need different retrieval costs, indexes, and safety checks. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is a route table mapping query class to retriever stack. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Route by evidence need and risk, not by pipeline convenience.
Inspect a route table mapping query class to retriever stack before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. A route chooses how to search. It does not decide when the answer is safe. For that final stoplight, we need explicit confidence gates. → 13-confidence-gates.md