02. Query rewriting: same intent, better handles

~12 min read. Retrieval fails early when the question arrives in human shorthand instead of searchable form.

Builds on the ELI5 in 00-eli5.md. The rewriter (query transformation before search) matters because the shelves cannot retrieve what the question never made explicit.

1) The wall: when conversation shorthand becomes retrieval ambiguity

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: human shorthand hides entities, time windows, and evidence type. This page follows the raw query, rewritten query, and changed top-3 result list so you can see whether intent-preserving rewriting actually changes the evidence path or just adds another box to the architecture diagram.

The tempting repair is to paraphrase the question until it sounds cleaner. That keeps the system simple, and on easy questions it may be right. It fails on this case: “Phoenix last week” can refer to a campaign, an office, or a payments incident. Search needs the project name, date window, and task type before retrieval can rank the right documents.

Root cause: the evidence path has already lost information before the model writes a sentence. Rule: rewrite for searchable constraints, not prettier language.

Mini-FAQ. "What is the control point here?" The rewriter. It is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

2) The core visual: raw wording gets converted into handles

Consider a student asking, “What happened with Phoenix last week?” A human teammate may understand the office gossip around that sentence. A retriever does not. It sees one vague entity, one fuzzy time range, and one unclear task. Now the rewriter steps in. It does not change the meaning. It gives the meaning handles. Handles are names, date windows, document types, and expected evidence. That is all query rewriting is. You are not making the question smarter. You are making the search path clearer.

raw user wording
       │
       ▼
┌──────────────┐
│ the rewriter │
└──────┬───────┘
       │ keeps intent
       │ makes entities explicit
       │ removes filler
       ▼
retrieval-ready query

Human wording is optimized for conversation. Retrieval wording is optimized for finding evidence.
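To make "handles" concrete, here is a minimal sketch of how they might be held as a data structure. The class name `RetrievalHandles` and its fields are assumptions made for illustration; the page itself only names the four kinds of handle (names, date windows, document types, expected evidence).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RetrievalHandles:
    """Explicit handles pulled out of a conversational question (hypothetical shape)."""
    entity: str                                       # e.g. "Project Phoenix"
    date_window: Optional[Tuple[date, date]] = None   # (start, end), None if not yet known
    doc_types: Tuple[str, ...] = ()                   # e.g. ("incident report",)
    evidence: Tuple[str, ...] = ()                    # what the answer must be backed by

    def missing(self) -> List[str]:
        """Names of handles that are still empty, so the caller can ask instead of guess."""
        gaps = []
        if not self.entity:
            gaps.append("entity")
        if self.date_window is None:
            gaps.append("date_window")
        if not self.evidence:
            gaps.append("evidence")
        return gaps


# "Phoenix" alone names something, but no time window or evidence need is explicit yet.
vague = RetrievalHandles(entity="Phoenix")
print(vague.missing())  # ['date_window', 'evidence']
```

Keeping the gaps visible as a list is one way to make the rewriter a control point: an empty handle becomes something the pipeline can act on rather than something the retriever silently guesses.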

3) What the transformation really does

A good rewrite preserves every binding constraint. That includes geography, customer tier, time range, and task type. It can drop polite filler. It cannot drop meaning. Many teams treat rewriting like paraphrasing. That is too casual. Paraphrasing may keep the vibe while losing the guardrails. Retrieval cannot afford that. Suppose the user asks, “Did premium users in APAC see latency improve after the cache rollout?” A bad rewrite says, “Did users see latency improve after caching?” Premium vanished. APAC vanished. Retrieval now ranks documents about all users in all regions, so the answer will be wrong in a specific, traceable way. So the rule is sharp. Rewrite for retrieval, not for elegance. That is why the rewriter should be logged and inspected. You want raw query, rewritten query, and final evidence all visible.
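The constraint rule above can be checked mechanically. The sketch below assumes the binding constraints have already been extracted from the raw question and that a case-insensitive substring match is good enough to detect them. Both are simplifications, and `dropped_constraints` is a hypothetical helper name.

```python
from typing import List


def dropped_constraints(constraints: List[str], rewritten: str) -> List[str]:
    """Return every binding constraint that no longer appears in the rewrite.

    Assumption: a case-insensitive substring match is enough to detect a constraint.
    """
    text = rewritten.lower()
    return [c for c in constraints if c.lower() not in text]


constraints = ["premium", "APAC"]
bad = "Did users see latency improve after caching?"
good = "APAC premium users: latency change after the cache rollout"

print(dropped_constraints(constraints, bad))   # ['premium', 'APAC']
print(dropped_constraints(constraints, good))  # []
```

A non-empty result is a reason to reject the rewrite or fall back to the raw query. Substring matching will miss paraphrased constraints, so treat this as a cheap first gate, not a proof of equivalence.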

4) The worked example: trace the intermediate state

Raw question: “What happened with Phoenix last week?”

Assume the corpus contains these candidates.

D1: Phoenix marketing campaign postmortem
D2: Project Phoenix payment incident report
D3: Phoenix office hiring plan
D4: on-call handoff for the payments team
D5: customer escalation summary for card failures

First, extract the implied pieces.

Phoenix may be a campaign, a project, or an office.

Last week becomes a date window.

What happened becomes incident plus cause plus resolution.

Now the rewrite becomes:

“Project Phoenix payment incident in the last 7 days, including timeline, root cause, mitigation, and customer impact.”
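The step from "last week" to "the last 7 days" is small enough to sketch. The version below assumes a rolling window ending today, which is how the rewrite above reads it. A calendar-week reading is equally defensible, and the function name `last_n_days` and the sample date are illustrative only.

```python
from datetime import date, timedelta
from typing import Tuple


def last_n_days(today: date, days: int = 7) -> Tuple[date, date]:
    """Turn a relative phrase like "last week" into a concrete (start, end) window.

    Assumption: a rolling window that ends today. Whether the endpoints are
    inclusive is left to the filter that consumes them.
    """
    return today - timedelta(days=days), today


# Arbitrary example date, chosen only to show the arithmetic.
start, end = last_n_days(date(2024, 5, 15))
print(start.isoformat(), "to", end.isoformat())  # 2024-05-08 to 2024-05-15
```

Resolving the window at rewrite time also pins down what "last week" meant when the question was asked, which matters if the trace is replayed days later.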

Raw retrieval scores:

D1 = 0.86

D2 = 0.61

D3 = 0.58

Rewritten retrieval scores:

D2 = 0.89

D4 = 0.79

D5 = 0.74

raw query top-3
1. D1 0.86  ─ campaign noise
2. D2 0.61  ─ actual incident
3. D3 0.58  ─ office noise

rewritten query top-3
1. D2 0.89  ─ root incident
2. D4 0.79  ─ operational timeline
3. D5 0.74  ─ customer impact

See the intermediates clearly.

Entity disambiguation moved D2 from rank 2 to rank 1.

Task expansion pulled D4 and D5 into view.

That is the whole win.

No new model knowledge was created.

The search simply stopped guessing what Phoenix meant.
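The comparison of the two top-3 lists can be made into a small diff so it does not depend on reading two tables by eye. This sketch uses only the scores given above. `rank_changes` is a hypothetical helper, and `None` stands for "outside the top-3", since the page does not give scores beyond those ranks.

```python
raw_top3 = [("D1", 0.86), ("D2", 0.61), ("D3", 0.58)]
rewritten_top3 = [("D2", 0.89), ("D4", 0.79), ("D5", 0.74)]


def rank_changes(before, after):
    """Map each doc id to (rank_before, rank_after); None means outside the top-k."""
    rank_before = {doc: i for i, (doc, _) in enumerate(before, start=1)}
    rank_after = {doc: i for i, (doc, _) in enumerate(after, start=1)}
    docs = sorted(set(rank_before) | set(rank_after))
    return {doc: (rank_before.get(doc), rank_after.get(doc)) for doc in docs}


for doc, (old, new) in rank_changes(raw_top3, rewritten_top3).items():
    print(doc, old, "->", new)
# D1 1 -> None
# D2 2 -> 1
# D3 3 -> None
# D4 None -> 2
# D5 None -> 3
```

If this diff comes back with every document at the same rank before and after, the rewrite did not change the evidence path for that query.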

5) Failure modes: how the mechanism breaks

Failure one. The rewrite removes a hidden constraint.

“APAC premium latency” becomes “latency.”

The system returns a global average and feels correct.

Failure two. The rewrite overcommits to the wrong entity.

Phoenix becomes the campaign instead of the payments project.

Now the entire retrieval set is clean, fast, and wrong.

Failure three. The rewrite becomes too long and too specific.

It packs every imagined term into one sentence.

Sparse retrieval loves it.

Dense retrieval may drift because the sentence now mixes unrelated goals.

So what to do?

Preserve intent.

Expose missing handles.

Stop before you invent assumptions.

That is the adult version of rewriting.
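One way to "stop before you invent assumptions" is to refuse to pick an entity unless the context singles one out. The sketch below is an assumption-heavy illustration: the `ALIASES` table, the `resolve_entity` function, and the idea of matching on context terms are all hypothetical, and the three Phoenix candidates come from the example on this page.

```python
from typing import Optional, Set

# Hypothetical alias table: one mention can map to several known entities.
ALIASES = {
    "phoenix": [
        "Phoenix marketing campaign",
        "Project Phoenix payment incident",
        "Phoenix office",
    ],
}


def resolve_entity(mention: str, context_terms: Set[str]) -> Optional[str]:
    """Return one entity only when it is unambiguous; otherwise return None."""
    candidates = ALIASES.get(mention.lower(), [])
    if len(candidates) == 1:
        return candidates[0]
    # Keep only candidates that some context term points at.
    matches = [
        c for c in candidates
        if any(term.lower() in c.lower() for term in context_terms)
    ]
    return matches[0] if len(matches) == 1 else None


print(resolve_entity("Phoenix", set()))        # None
print(resolve_entity("Phoenix", {"payment"}))  # Project Phoenix payment incident
```

Returning `None` gives the caller a real decision: ask the user, fan out one query per candidate, or fall back to the raw wording, rather than silently committing to the campaign.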

6) Production rules that hold up

Start by extracting named entities, dates, filters, and the user’s real task.

Then rebuild the sentence in retrieval-friendly order.

Put the main entity first.

Put the time window second.

Put the evidence request third.

Keep synonyms for expansion later.

Do not stuff them into the rewrite too early.
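The ordering rule above (entity, then time window, then evidence request) can be sketched as a small composer. The function names are hypothetical, and the output format simply mirrors the Phoenix rewrite from the worked example. Synonyms are deliberately not an input.

```python
from typing import List


def join_items(items: List[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def compose_query(entity: str, time_window: str, evidence: List[str]) -> str:
    """Build the rewrite in a fixed order: entity, time window, evidence request."""
    query = entity
    if time_window:
        query += " in " + time_window
    if evidence:
        query += ", including " + join_items(evidence)
    return query + "."


print(compose_query(
    "Project Phoenix payment incident",
    "the last 7 days",
    ["timeline", "root cause", "mitigation", "customer impact"],
))
# Project Phoenix payment incident in the last 7 days, including timeline, root cause, mitigation, and customer impact.
```

A fixed template like this is part of what makes rewrites boring: two rewrites of the same handles produce the same string, so a changed result list can be traced to a changed handle.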

A good rewrite feels boring.

That is good news.

Boring rewrites are traceable.

Traceable rewrites are debuggable.

And debuggable systems improve faster.

Once the query shape is stable, you can fan out into multiple variants safely.

That is where the next topic enters.

7) Why not let embeddings infer the missing handles under this workload?

The plausible alternative is letting embeddings infer the missing handles. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when human shorthand hides entities, time windows, and evidence type. Under that workload, intent-preserving rewriting earns its cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
letting embeddings infer the missing handles	evidence need is simple	human shorthand hides entities, time windows, and evidence type	prompt wording and user trust
query rewriting	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.
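A routing decision like the one in the Mini-FAQ might be sketched as follows. Everything here is an assumption for illustration: the `AMBIGUOUS_MENTIONS` set, the relative-time pattern, and the `high_risk_route` flag would all come from a real system's own traffic and policies.

```python
import re

# Hypothetical signals; a real system would derive these from its own logs.
AMBIGUOUS_MENTIONS = {"phoenix"}
RELATIVE_TIME = re.compile(
    r"\b(?:last|this|next)\s+(?:week|month|quarter|year)\b|\b(?:yesterday|recently)\b",
    re.IGNORECASE,
)


def needs_rewrite(query: str, high_risk_route: bool) -> bool:
    """Send a query to the rewriter only if the route is risky and it contains shorthand."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    has_shorthand = bool(words & AMBIGUOUS_MENTIONS) or bool(RELATIVE_TIME.search(query))
    return high_risk_route and has_shorthand


print(needs_rewrite("What happened with Phoenix last week?", high_risk_route=True))   # True
print(needs_rewrite("What is the office wifi password?", high_risk_route=True))       # False
print(needs_rewrite("What happened with Phoenix last week?", high_risk_route=False))  # False
```

The point of the gate is cost control: queries with no detectable shorthand skip the extra latency, and the skip itself can be logged as a decision.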

8) Production signals: know whether query rewriting is working

A healthy trace shows the raw and rewritten queries side by side, with the missing handles made explicit in the rewrite. The first metric to watch is wrong-entity retrieval rate. Top-1 similarity is a weaker signal because it can improve while a binding constraint is still missing.
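Wrong-entity retrieval rate is simple to compute once traces are labeled. The sketch below assumes each trace carries a human-labeled `intended_entity` and an entity tag on the top-ranked document (`top1_entity`). Both field names and the sample traces are made up for illustration.

```python
from typing import Dict, List


def wrong_entity_rate(traces: List[Dict[str, str]]) -> float:
    """Share of traces whose top-1 document is about a different entity than intended."""
    if not traces:
        return 0.0
    wrong = sum(1 for t in traces if t["top1_entity"] != t["intended_entity"])
    return wrong / len(traces)


# Illustrative labels only, not measured data.
sample = [
    {"intended_entity": "project", "top1_entity": "campaign"},
    {"intended_entity": "project", "top1_entity": "project"},
    {"intended_entity": "office", "top1_entity": "office"},
    {"intended_entity": "project", "top1_entity": "project"},
]
print(wrong_entity_rate(sample))  # 0.25
```

Computing the rate separately for raw and rewritten queries on the same labeled set shows whether the rewriter is fixing entity confusion or only raising similarity scores.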

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review

9) Boundary: where query rewriting helps, hurts, or wastes budget

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.

10) Wrong model: advanced RAG means adding more retrieval

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the rewriter does not change what the system does, it is decoration.

11) Failure taxonomy for query rewriting

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.
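The first item in the list above, the unlogged checkpoint, is the cheapest to prevent. A minimal sketch, assuming one JSON line per rewrite and hypothetical field names, could look like this:

```python
import json


def trace_record(raw_query, rewritten_query, top_k):
    """Serialize one rewrite checkpoint as a JSON line so the failure can be replayed."""
    return json.dumps(
        {
            "raw_query": raw_query,
            "rewritten_query": rewritten_query,
            "top_k": [{"doc_id": doc, "score": score} for doc, score in top_k],
        },
        sort_keys=True,
    )


line = trace_record(
    "What happened with Phoenix last week?",
    "Project Phoenix payment incident in the last 7 days",
    [("D2", 0.89), ("D4", 0.79), ("D5", 0.74)],
)
print(line)
```

Storing the raw query next to the rewrite matters more than the exact schema: without both, a reviewer cannot tell whether a bad answer came from the rewrite or from retrieval.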

12) Pattern transfer: same pressure, different system

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why is letting embeddings infer the missing handles weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild

Glean — resolves company terms, team names, and aliases before enterprise search runs.
Elastic AI Assistants — benefit when incident questions are rewritten into service, timeframe, and failure-mode handles.
Perplexity — rewrites messy natural questions into retrieval-ready search prompts before synthesis.
Microsoft 365 Copilot — needs entity-explicit rewrites so documents and mail results align.
Datadog Bits AI — improves incident lookup when vague chat phrasing becomes service-specific retrieval text.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint

Why is query rewriting about preserving constraints rather than making the sentence prettier?
In the Phoenix example, which hidden pieces did the rewrite surface before retrieval?
What is the specific danger of overcommitting to the wrong entity during rewriting?
Which false-green case would you review first for query rewriting?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would letting embeddings infer the missing handles be acceptable instead?

Interview Q&A

Q: Why not let the retriever handle ambiguous user wording directly? A: Because retrievers need explicit handles, and ambiguity often pushes the top ranks toward related but wrong entities.

Common wrong answer to avoid: "Embeddings understand everything anyway." — Embeddings reduce ambiguity, but they do not reliably infer which Phoenix, which week, or which task without cues.

Q: What makes a good rewrite safe in production? A: It preserves all explicit constraints, makes implied filters visible, and stays inspectable beside the original query.

Common wrong answer to avoid: "A safe rewrite is simply a more detailed sentence." — More detail is not safety if the extra detail was hallucinated by the rewrite step.

Q: Why should rewrites be logged? A: Because retrieval failures become diagnosable only when you can compare the raw question, the transformed query, and the evidence returned.

Common wrong answer to avoid: "Logging the final answer is enough." — By then the retrieval mistake is already buried under generation.

Q: What trace would you inspect first when query rewriting fails? A: Start with raw query, rewritten query, and changed top-3 result list. Then compare it with the final answer claims.

Common wrong answer to avoid: "Start by rewriting the final prompt." — The prompt may be fine. The evidence path may already be broken.

Q: What cost does this add? A: Latency, logging, evaluation work, and one more place operators must understand.

Common wrong answer to avoid: "It is free because it is only another model call." — Model calls spend money, time, and failure budget.

Q: When should you skip it? A: Skip it when the question is low-risk, the evidence need is obvious, and the decision would not change.

Common wrong answer to avoid: "Never skip advanced components." — Advanced components are controls, not trophies.

Apply now (10 min)

Rewrite three vague workplace questions into retrieval-ready form without dropping a single constraint.
Sketch from memory: draw the raw-question to rewritten-query flow and label the constraint checks.
Reproduce from memory: explain query rewriting in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember

Query rewriting exists because human shorthand hides entities, time windows, and evidence type. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is the raw query, rewritten query, and changed top-3 result list. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Rewrite for searchable constraints, not prettier language.
Inspect the raw query, rewritten query, and changed top-3 result list before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. A clean rewrite gives one good search path. But one path is still one guess. To increase recall, we often need several careful variants of the same intent. → 03-query-expansion.md