04. Query decomposition — turning one hard ask into smaller hops¶

~14 min read. Multi-hop questions look single-shot only until you list the evidence they secretly require.

Builds on the ELI5 in 00-eli5.md. The multi-step plan (splitting a complex query into sub-questions) matters when one retrieval call cannot cover all hidden hops.

1) The wall — when one question is really several lookups¶

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: a multi-hop question contains several evidence needs that one query cannot rank together. This page follows one question as it is split into sub-questions with evidence slots, so you can see whether sub-question planning actually changes the evidence path or just adds another box to the architecture diagram.

The tempting repair is to ask one broad query and hope the top-k includes every hop. That keeps the system simple, and on easy questions it may be right. It fails on this case: a contract-risk question needs the renewal list, the threshold rule, the update amount, and the exception clauses. A single broad query usually over-ranks one part and drops another.

Root cause: the evidence path has already lost information before the model writes a sentence. Rule: split when the answer needs independently verifiable evidence slots.

Mini-FAQ. "What is the control point here?" The multi-step plan is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

2) The core visual — split the answer into evidence slots¶

Consider a manager asking, “Compare Q3 and Q4 revenue growth across all regions and explain the outlier.”

That sounds like one question.

It is not.

It hides at least four retrieval jobs.

Find Q3 numbers.

Find Q4 numbers.

Align region names.

Find the note explaining the outlier.

This is where the multi-step plan earns its salary.

Instead of forcing one bloated search, the system breaks the job apart.

complex question
      │
      ▼
┌──────────────────┐
│ the multi-step   │
│ plan             │
└───────┬──────────┘
        │
        ├── sub-query 1: Q3 by region
        ├── sub-query 2: Q4 by region
        ├── sub-query 3: normalize region aliases
        └── sub-query 4: explain outlier

One hard question often fails because retrieval cannot satisfy every hop at once.

Decomposition replaces one impossible ask with several ordinary asks.
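A minimal sketch of what the plan in the diagram could look like as data. The names `SubQuery`, `Plan`, and `plan_revenue_comparison`, the slot names, and the `depends_on` edges are all illustrative assumptions, not part of any library; a real planner might generate the steps with a model rather than hard-coding them.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuery:
    """One retrieval job that fills one named evidence slot."""
    slot: str                                            # evidence slot this hop fills
    query: str                                           # text sent to the retriever
    depends_on: list[str] = field(default_factory=list)  # slots that must be filled first


@dataclass
class Plan:
    question: str
    steps: list[SubQuery]

    def slots(self) -> list[str]:
        return [step.slot for step in self.steps]


def plan_revenue_comparison(question: str) -> Plan:
    """Hand-written plan for the revenue example (hypothetical, for illustration)."""
    return Plan(
        question=question,
        steps=[
            SubQuery("q3_by_region", "Q3 revenue growth by region"),
            SubQuery("q4_by_region", "Q4 revenue growth by region"),
            SubQuery("region_aliases", "region name aliases",
                     depends_on=["q3_by_region", "q4_by_region"]),
            SubQuery("outlier_note", "note explaining the outlier region",
                     depends_on=["region_aliases"]),
        ],
    )


plan = plan_revenue_comparison(
    "Compare Q3 and Q4 revenue growth across all regions and explain the outlier."
)
print(plan.slots())
```

Keeping the slot name separate from the query text is the design choice that matters here: the slot is what you later check for support, while the query wording can change between retries.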

3) What decomposition changes in practice¶

A single-shot query tends to mix retrieval and reasoning into one blob.

That blob confuses ranking.

Some documents match the quarter names.

Some match the regions.

Some match the explanatory note.

Few match everything.

Top-k search rewards partial overlap.

Generation then pretends the partial overlap is full support.

Decomposition fixes the retrieval side first.

Each sub-query has a tighter target.

Each result set is easier to inspect.

And the synthesis step becomes explicit rather than magical.

That means the multi-step plan gives you observability, not just better recall.

You can see which hop failed.

That is a big production advantage.
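To make "you can see which hop failed" concrete, here is a small sketch of a per-hop trace. The retriever is a toy dictionary lookup, and `run_hops`, `failed_hops`, `FAKE_INDEX`, and the document names are hypothetical stand-ins for whatever retrieval layer you actually run.

```python
from typing import Callable


def run_hops(sub_queries: dict[str, str],
             retrieve: Callable[[str], list[str]]) -> dict[str, dict]:
    """Run each sub-query on its own and record what came back."""
    trace = {}
    for name, query in sub_queries.items():
        chunks = retrieve(query)
        trace[name] = {"query": query, "chunks": chunks, "ok": bool(chunks)}
    return trace


def failed_hops(trace: dict[str, dict]) -> list[str]:
    """Names of hops that returned no supporting chunks."""
    return [name for name, record in trace.items() if not record["ok"]]


# Toy index (invented document names); the outlier note is deliberately missing.
FAKE_INDEX = {
    "Q3 revenue growth by region": ["q3-report#regions"],
    "Q4 revenue growth by region": ["q4-report#regions"],
    "region name aliases": ["glossary#regions"],
}


def fake_retrieve(query: str) -> list[str]:
    return FAKE_INDEX.get(query, [])


trace = run_hops(
    {
        "S1": "Q3 revenue growth by region",
        "S2": "Q4 revenue growth by region",
        "S3": "region name aliases",
        "S4": "note explaining the outlier region",
    },
    fake_retrieve,
)
print(failed_hops(trace))  # ['S4']
```

With a single broad query the same gap would only show up, if at all, as a vague or unsupported sentence in the final answer.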

4) The worked example — trace the intermediate state¶

Question:

“Compare Q3 and Q4 revenue growth across all regions and explain the outlier.”

Decompose it into four steps.

S1 = What was Q3 revenue growth for APAC, EMEA, LATAM, and North America?

S2 = What was Q4 revenue growth for APAC, EMEA, LATAM, and North America?

S3 = Do any region labels need normalization, such as NA versus North America?

S4 = Which document explains the largest quarter-to-quarter change?

Suppose retrieval returns these facts.

S1 returns APAC 8, EMEA 5, LATAM 3, NA 6.

S2 returns APAC 11, EMEA 4, LATAM 9, North America 7.

S3 maps NA to North America.

S4 returns “LATAM channel recovery note.”

Now compute deltas.

APAC = 11 - 8 = 3.

EMEA = 4 - 5 = -1.

LATAM = 9 - 3 = 6.

North America = 7 - 6 = 1.

aligned table
APAC           Q3 8   Q4 11  delta +3
EMEA           Q3 5   Q4 4   delta -1
LATAM          Q3 3   Q4 9   delta +6
North America  Q3 6   Q4 7   delta +1

outlier = LATAM because +6 is the largest absolute change
support note = LATAM channel recovery note

The answer is now grounded in visible sub-results.

If S4 failed, you would still know the comparison but not the explanation.

That is honest partial progress.
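The arithmetic above can be reproduced in a few lines. This is a sketch using the numbers from the example; the `ALIASES` map and the function names are illustrative, and the check for unmatched regions is an added safeguard rather than something the example requires.

```python
ALIASES = {"NA": "North America"}  # output of S3 in the example

q3 = {"APAC": 8, "EMEA": 5, "LATAM": 3, "NA": 6}             # S1 result
q4 = {"APAC": 11, "EMEA": 4, "LATAM": 9, "North America": 7}  # S2 result


def normalize(values: dict[str, int]) -> dict[str, int]:
    """Rewrite region labels to their canonical names."""
    return {ALIASES.get(region, region): value for region, value in values.items()}


def compute_deltas(q3: dict[str, int], q4: dict[str, int]) -> dict[str, int]:
    q3n, q4n = normalize(q3), normalize(q4)
    unmatched = set(q3n) ^ set(q4n)
    if unmatched:
        # Refuse to compute on misaligned labels instead of silently dropping a region.
        raise ValueError(f"regions present in only one quarter: {sorted(unmatched)}")
    return {region: q4n[region] - q3n[region] for region in q3n}


deltas = compute_deltas(q3, q4)
outlier = max(deltas, key=lambda region: abs(deltas[region]))

print(deltas)   # {'APAC': 3, 'EMEA': -1, 'LATAM': 6, 'North America': 1}
print(outlier)  # LATAM
```

If the alias map were empty, `compute_deltas` would raise instead of producing a table with NA and North America as separate rows, which is the label failure described in the next section.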

5) Failure modes — how the mechanism breaks¶

Failure one. The decomposition is too coarse.

It keeps “compare and explain” in the same sub-query.

You gain very little.

Failure two. The decomposition is too fine.

You create twenty tiny sub-queries.

Latency explodes.

Synthesis becomes the new bottleneck.

Failure three. The system forgets dependencies.

It computes the outlier before normalizing NA versus North America.

Now the math looks clean and the labels are wrong.

So what to do?

Split by evidence needs, not by grammar alone.

That is the mature use of the multi-step plan.
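One way to guard against failure three is to declare dependencies and let a topological sort choose the order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the step names and the specific edges are assumptions drawn from the revenue example.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of steps that must finish before it can run.
dependencies = {
    "S1_q3_by_region": set(),
    "S2_q4_by_region": set(),
    "S3_normalize_labels": {"S1_q3_by_region", "S2_q4_by_region"},
    "compute_deltas": {"S3_normalize_labels"},
    "S4_explain_outlier": {"compute_deltas"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A cycle in the declared dependencies raises `graphlib.CycleError`, so a plan that cannot be ordered fails at planning time rather than at synthesis time.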

6) Production rules that hold up¶

Decompose when the question contains compare, contrast, before-after, or join-like structure.

Also decompose when one answer needs facts from different document families.

Keep each sub-query answerable in one hop if possible.

Name every sub-step clearly.

Log outputs for each step.

Mark which later steps depend on earlier normalization.
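The first two rules can be approximated with a cheap gate in front of the planner. This is a deliberately crude keyword heuristic, offered as a sketch: the trigger list, the `should_decompose` name, and the `doc_families` parameter are assumptions, and a production router would likely need a classifier or evaluation data behind it.

```python
import re

# Words that often signal compare / contrast / before-after / join-like structure.
TRIGGERS = re.compile(
    r"\b(compare|contrast|versus|vs|before|after|across|between)\b",
    re.IGNORECASE,
)


def should_decompose(question: str, doc_families: int = 1) -> bool:
    """Decompose if the wording looks multi-hop or the facts span document families."""
    return bool(TRIGGERS.search(question)) or doc_families > 1


print(should_decompose(
    "Compare Q3 and Q4 revenue growth across all regions and explain the outlier."
))  # True
print(should_decompose("What is the refund window?"))  # False
```

The gate only decides whether to plan; it says nothing about how to split, which still has to follow evidence needs rather than grammar.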

Decomposition is a small planning engine.

It does not make reasoning free.

It makes reasoning visible.

And when a sub-query is still too abstract for raw search, another trick helps.

You can search using an imagined answer shape instead of the raw question.

That is HyDE.

7) Why not one giant retrieval query plus a long synthesis prompt under this workload¶

The plausible alternative is one giant retrieval query plus a long synthesis prompt. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when a multi-hop question contains several evidence needs that one query cannot rank together. Under that workload, sub-question planning earns its cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
one giant retrieval query plus a long synthesis prompt	evidence need is simple	multi-hop questions contain several evidence needs that one query cannot rank together	prompt wording and user trust
query decomposition	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.

8) Production signals — know whether query decomposition is working¶

A healthy trace shows that each sub-question has its own retrieved support. The first metric to watch is the missing-evidence-slot rate. Top-1 similarity is a weaker signal because it can improve while a binding constraint is still missing.

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.
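The page does not pin down a formula for the missing-evidence-slot rate, so the sketch below assumes one plausible definition: the fraction of required slots, across all logged traces, that came back with no retrieved support. The trace shape (slot name mapped to a list of chunk ids) is also an assumption.

```python
def missing_slot_rate(traces: list[dict[str, list[str]]]) -> float:
    """Share of evidence slots with no supporting chunks, over all traces."""
    total_slots = sum(len(trace) for trace in traces)
    if total_slots == 0:
        return 0.0  # nothing logged yet; avoid dividing by zero
    missing = sum(
        1 for trace in traces for chunks in trace.values() if not chunks
    )
    return missing / total_slots


traces = [
    {"q3": ["c1"], "q4": ["c2"], "aliases": ["c3"], "outlier_note": []},
    {"q3": ["c4"], "q4": ["c5"], "aliases": ["c6"], "outlier_note": ["c7"]},
]
print(missing_slot_rate(traces))  # 0.125
```

A per-question variant (share of questions with at least one empty slot) is equally defensible; whichever you pick, compute it from the logged slots rather than from similarity scores.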

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review

9) Boundary — where query decomposition helps, hurts, or wastes budget¶

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.
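The retry pathology can be ruled out mechanically: only retry when there is a missing slot that has not been targeted yet. The following is a sketch under that assumption; `next_action`, its parameters, and the action labels are illustrative names, not an established API.

```python
def next_action(missing: set[str],
                already_retried: set[str],
                retries_left: int,
                has_other_route: bool = False) -> str:
    """Pick answer / retry / reroute / abstain from the current evidence state."""
    if not missing:
        return "answer"        # every slot has support
    untried = missing - already_retried
    if untried and retries_left > 0:
        return "retry"         # a retry can target a slot not yet retried
    if has_other_route:
        return "reroute"       # e.g. a different index or a human queue
    return "abstain"           # nothing new to try; do not loop


print(next_action(set(), set(), 2))                        # answer
print(next_action({"outlier_note"}, set(), 2))             # retry
print(next_action({"outlier_note"}, {"outlier_note"}, 2))  # abstain
```

Because the retry branch requires an untried slot, the function cannot loop on the same weak evidence no matter how much retry budget remains.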

10) Wrong model — advanced RAG means adding more retrieval¶

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the multi-step plan does not change what the system does, it is decoration.

11) Failure taxonomy for query decomposition¶

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.

12) Pattern transfer — same pressure, different system¶

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist¶

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why is one giant retrieval query plus a long synthesis prompt weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild¶

Perplexity — complex compare-and-contrast questions often need hidden sub-searches before synthesis.
Glean enterprise assistants — multi-document business questions benefit from explicit hop planning.
OpenAI file search workflows — structured comparison prompts work better when sub-questions are separated.
Financial research copilots — quarterly comparisons naturally decompose into period-specific retrieval steps.
Databricks internal knowledge bots — incident diagnosis improves when timeline, owner, and fix are retrieved separately.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint¶

Why do multi-hop questions fail under one bloated retrieval query?
In the revenue example, which step normalized the labels before computing deltas?
What is the danger of decomposing too finely rather than too coarsely?
Which false-green case would you review first for query decomposition?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would one giant retrieval query plus a long synthesis prompt be acceptable instead?

Interview Q&A¶

Q: When does decomposition beat query expansion? A: When the question contains several evidence needs that should be answered separately rather than with alternate phrasings of one need.

Common wrong answer to avoid: "Expansion and decomposition are basically the same." — Expansion widens phrasing for one intent, while decomposition splits one task into several intents.

Q: Why is decomposition useful even if final answer quality stays imperfect? A: Because it exposes which hop failed, which makes debugging and fallback behavior far easier.

Common wrong answer to avoid: "If the final answer is still weak, decomposition added no value." — Observability is value in production systems.

Q: What is the key design principle for sub-queries? A: Each sub-query should target one evidence need cleanly and avoid mixing retrieval with later synthesis logic.

Common wrong answer to avoid: "Break the sentence at every comma." — Grammar is not the same as evidence structure.

Q: What trace would you inspect first when query decomposition fails? A: Start with the question as it was split into sub-questions with evidence slots. Then compare that split with the claims in the final answer.

Common wrong answer to avoid: "Start by rewriting the final prompt." — The prompt may be fine. The evidence path may already be broken.

Q: What cost does this add? A: Latency, logging, evaluation work, and one more place operators must understand.

Common wrong answer to avoid: "It is free because it is only another model call." — Model calls spend money, time, and failure budget.

Q: When should you skip it? A: Skip it when the question is low-risk, the evidence need is obvious, and the decision would not change.

Common wrong answer to avoid: "Never skip advanced components." — Advanced components are controls, not trophies.

Apply now (10 min)¶

Take one compare-or-diagnose question and split it into the smallest useful evidence hops.
Sketch from memory: draw the decomposition tree and mark which steps depend on earlier normalization.
Reproduce from memory: explain query decomposition in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember¶

Query decomposition exists because multi-hop questions contain several evidence needs that one query cannot rank together. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is a question split into sub-questions with evidence slots. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Split when the answer needs independently verifiable evidence slots.
Inspect how the question was split into sub-questions with evidence slots before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. Decomposition makes each hop cleaner. But some hops are still phrased too sparsely for good retrieval. In those cases, it helps to search with an imagined answer instead of the raw question. → 05-hyde-hypothetical-embeddings.md