09. Query understanding and retrieval: repairing the question before it hits the shelf
~12 min read. The bookshelf is perfect. The chunks are clean. The user types "what about the other one?" and the answer comes back wrong. By the end of this page you will know why, and which six tools fix it.
Builds on 08-rag-pipeline.md. The librarian must first decode what the user really asked. Stage 1 of the pipeline lives here, and most RAG bugs live with it.
1) The hook: a vague chat that breaks naive retrieval
A user opens your support chatbot. They have been chatting for two minutes about an enterprise renewal. Then they type this.
"What about the other one — over 30 days?"
Eight words. Three landmines.
- "The other one" — references something earlier in the chat. The retriever does not have the chat. It only has the latest message.
- "Over 30 days" — a hard time constraint. Drop it and retrieval drifts to generic refund pages.
- No subject. No product. No verb. The embedding for this sentence sits in a vague neighbourhood of "the bookshelf." Nothing pulls hard.
Embed the raw text. The librarian fetches twenty chunks about general refunds. The reranker promotes the most refund-flavoured ones. The LLM writes a confident answer about the standard 30-day window. The user wanted the escalation policy past 30 days for enterprise. Wrong policy. Confident voice. Logged as a hallucination.
The bug is not in retrieval. The bug is in the question that retrieval received. This chapter is about repairing the question.
2) The metaphor: the librarian reads, then re-reads
In the ELI5, the librarian reads the user's question and walks to the shelf. In a real RAG system the librarian does more.
The librarian reads the question. Then reads it again. Asks four silent questions.
- What is the user actually asking?
- Does any word point backward to earlier turns?
- Are there hidden filters — tenant, time, product, region?
- Is one question hiding two questions?
Only after that does the librarian write the real index card and walk to the bookshelf. The raw user message is the cover of the question. The rewritten query is the question itself.
That gap — between cover and content — is where every tool in this chapter lives.
3) The six tools, in one column
┌─────────────────────────────────────────────────┐
│ RAW USER MESSAGE │
│ "what about the other one — over 30 days?" │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ A. CONVERSATIONAL REWRITE │
│ resolve "the other one" using chat history │
│ → "enterprise refund policy beyond 30 days" │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ B. QUERY REWRITE │
│ add anchors, expand acronyms │
│ → "enterprise annual plan refund policy after │
│ 30 day renewal window" │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ C. QUERY EXPANSION │
│ add synonyms, related terms │
│ → also: "credit memo", "refund eligibility" │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ D. MULTI-QUERY FAN-OUT │
│ 3-5 paraphrased queries, retrieve all │
│ → variants for billing, policy, escalation │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ E. HyDE (hypothetical doc embedding) │
│ LLM drafts a fake "ideal" answer, embed that │
│ → vector closer to real policy chunks │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ F. METADATA + FILTERED RETRIEVAL │
│ tenant=acme, doc_type=policy, lang=en │
│ → narrows the shelf before scoring │
└────────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ HYBRID RETRIEVAL (sparse + dense + RRF) │
│ BM25 + vector, fused by Reciprocal Rank Fusion │
│ → 20–60 candidate chunks │
└─────────────────────────────────────────────────┘
Six tools. Each box has its own failure mode. You will rarely use all six. You almost always need three.
4) Tool A: conversational rewrite (the chat reference problem)
Chat is where naive RAG dies first.
The user's latest message is a fragment. The full meaning lives across the last 3–6 turns. If you embed the latest message alone, you embed a fragment.
The fix is mechanical. Before retrieval, run a small LLM call that takes the chat history plus the latest message and outputs a standalone search query.
SYSTEM: Rewrite the user's latest message as a standalone
search query. Resolve pronouns and references using
the conversation. Do not invent facts. Output the
query only.
HISTORY:
user: I'm looking at our Acme enterprise renewal.
assistant: It renewed on April 10th, $48k annual.
user: What about the other one — over 30 days?
REWRITE: enterprise refund policy beyond the 30 day window
for Acme annual plan renewal
That rewrite is now searchable. The pronoun is resolved. The time constraint is preserved. The tenant is anchored.
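The rewrite step above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and the six-turn window is an assumption taken from the 3–6 turn range mentioned here.

```python
from typing import Callable

# Instruction text taken from the prompt shown above.
REWRITE_INSTRUCTION = (
    "Rewrite the user's latest message as a standalone search query. "
    "Resolve pronouns and references using the conversation. "
    "Do not invent facts. Output the query only."
)


def build_rewrite_prompt(history: list[tuple[str, str]], latest: str,
                         max_turns: int = 6) -> str:
    """Assemble the prompt from the last `max_turns` turns plus the new message."""
    lines = [f"SYSTEM: {REWRITE_INSTRUCTION}", "HISTORY:"]
    lines += [f"{role}: {text}" for role, text in history[-max_turns:]]
    lines.append(f"user: {latest}")
    lines.append("REWRITE:")
    return "\n".join(lines)


def standalone_query(history: list[tuple[str, str]], latest: str,
                     llm: Callable[[str], str]) -> str:
    """Return a standalone search query; `llm` is a hypothetical text-in, text-out call."""
    # First turn: there is nothing to resolve, so skip the extra LLM call.
    if not history:
        return latest
    return llm(build_rewrite_prompt(history, latest)).strip()
```

Passing the model in as a plain callable keeps the sketch independent of any one provider's SDK, and makes the rewriter easy to test with a stub.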
Mini-FAQ. "How do you resolve 'what about the other one?' in a chat?" You pass the last 3–6 turns plus the new message to a small LLM with the instruction above. Cost is one extra LLM call: typically 200–500 input tokens, 30–60 output tokens, around $0.0002 on GPT-4o-mini or Claude 3.5 Haiku. Latency adds 100–300 ms. Skip it on first-turn queries; run it from turn 2 onward.
Failure mode. The rewriter hallucinates a fact not in the chat. Example: history mentions only "Acme renewal." Rewriter outputs "Acme refund for $48k annual plan paid via wire." The wire-transfer detail came from nowhere. Use a strict "do not invent facts" instruction and a small, well-aligned model.
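A prompt instruction alone is hard to verify, so a cheap post-check can help. The sketch below flags numbers and capitalised names in the rewrite that never appear in the chat. It is deliberately crude and rests on assumptions: the regex and the substring match are illustrative choices, it will miss lowercase inventions such as "wire", and it will false-flag a capitalised first word.

```python
import re

# Tokens that look like facts: numbers (optionally with $, a unit, or %)
# and capitalised words. Illustrative pattern, not an exhaustive one.
_FACT = re.compile(r"\$?\d+(?:[.,]\d+)*[a-zA-Z%]*|\b[A-Z][a-zA-Z]+\b")


def ungrounded_facts(rewrite: str, chat_text: str) -> list[str]:
    """Return fact-like tokens in `rewrite` that do not occur anywhere in the chat."""
    chat_lower = chat_text.lower()
    return [tok for tok in _FACT.findall(rewrite) if tok.lower() not in chat_lower]


chat = "I'm looking at our Acme renewal."
rewrite = "Acme refund for $48k annual plan paid via wire"
print(ungrounded_facts(rewrite, chat))  # ['$48k']
```

A non-empty result is a signal to drop the rewrite or send it to a stricter judge; an empty result does not prove the rewrite is faithful.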
5) Tool B: query rewriting (anchors and acronyms)
Even single-turn questions need cleaning.
Users type "RTO for our SLA" when they mean "recovery time objective in the service level agreement." Acronyms confuse embedders trained on general text. A query rewriter expands acronyms, adds product names, adds domain vocabulary, and removes filler.
For our running example:
| Step | Output |
|---|---|
| Raw | "what about the other one — over 30 days?" |
| After conversational rewrite (A) | "enterprise refund policy beyond 30 days for Acme renewal" |
| After query rewrite (B) | "enterprise annual plan refund eligibility after 30 day renewal window, escalation and manager approval process" |
Notice what B added — "eligibility," "escalation," "manager approval." Those are the real document phrases in the corpus. The rewriter is translating the user's English into your corpus's English.
Failure mode. Over-rewriting. The model adds phrases that are not in the corpus at all. Then retrieval moves toward a neighbourhood with no real chunks. Always evaluate rewrites against retrieval recall on a held-out set, not by reading them.
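Evaluating rewrites by recall rather than by eye can be as small as the sketch below. The names are assumptions for illustration: `eval_set` is a held-out list of (query, relevant chunk ids) pairs you have labelled, and `retrieve` is whatever function returns ranked chunk ids for a query string.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(eval_set: list[tuple[str, set[str]]],
                retrieve: Callable[[str], list[str]], k: int = 20) -> float:
    """Average recall@k over a held-out set of (query, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Run `mean_recall` twice on the same held-out set, once with `retrieve` fed the raw query and once with it fed the rewritten query. Keep the rewriter only if the second number is higher.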
6) Tool C: query expansion (synonyms and related terms)
Expansion is lighter than rewriting. The query stays one query. You just attach related terms before embedding, or before BM25 scoring.
Two flavours.
- Lexical expansion — add synonyms from a thesaurus or a small LLM call. "refund" → "credit," "reimbursement," "chargeback." Helps BM25.
- Embedding expansion — average the query vector with vectors of related terms. Smooths the neighbourhood. Used less often, easy to misuse.
For our example, expansion adds credit memo, refund eligibility, post-renewal cancellation. These are not the user's words. They are the corpus's words.
Failure mode. Expand too aggressively and the query becomes a soup. Hybrid retrieval then matches anything vaguely refund-shaped. Three to five extra terms is usually the ceiling.
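Lexical expansion with a hard cap might look like the sketch below. The synonym table is a hypothetical stand-in for a thesaurus or a small LLM call, and the default cap of five follows the ceiling suggested above.

```python
def expand_query(query: str, synonyms: dict[str, list[str]], max_extra: int = 5) -> str:
    """Append up to `max_extra` related terms to the query, skipping duplicates."""
    words = query.lower().split()
    seen = set(words)
    extra: list[str] = []
    for word in words:
        for term in synonyms.get(word, []):
            if term not in seen and len(extra) < max_extra:
                extra.append(term)
                seen.add(term)
    return " ".join([query, *extra])


# Hypothetical synonym table for illustration.
table = {"refund": ["credit", "reimbursement", "chargeback"]}
print(expand_query("enterprise refund policy", table, max_extra=2))
# enterprise refund policy credit reimbursement
```

The original query text stays at the front, so BM25 still sees the user's own terms; the cap keeps the query from turning into soup.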
7) Tool D: multi-query fan-out
Sometimes one rewrite is not enough. The user's question hides several angles, and one embedding cannot cover all of them.
Fan-out generates 3–5 paraphrased queries from the same source question. Each one retrieves independently. Then the results merge.
For our running example, the fan-out might produce:
- "enterprise annual plan refund policy after 30 days"
- "manager approval workflow for late refund requests"
- "credit memo issuance for past-window cancellations"
- "escalation path for refund disputes beyond standard window"
Each query lands in a different part of the bookshelf. The union of their top-20 results gets fused (usually by Reciprocal Rank Fusion) before reranking.
┌── query 1 → top-20 ─┐
│ │
question ───┼── query 2 → top-20 ─┼── RRF fusion → 40 unique chunks
│ │
└── query 3 → top-20 ─┘
LangChain ships this as MultiQueryRetriever. LlamaIndex ships QueryFusionRetriever for paraphrase fan-out and SubQuestionQueryEngine for decomposing one question into sub-questions.
Mini-FAQ. "Why fan-out into multiple queries?" Because one embedding is one point in space. The user's question often spans a region. Three queries cover the region better than one. The cost: one extra LLM call to generate variants (~200 input tokens, ~150 output tokens, ~$0.0003) plus N times the retrieval cost. On a managed vector DB at $0.00001 per query, fanning out to 4 still costs less than half a cent.
Failure mode. Drift. The variants stop being paraphrases and become different questions. The fused result then mixes irrelevant chunks into the top-k. Cap variants at 4. Score each variant against the original and drop low-similarity ones.
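The drift guard described above can be sketched as a similarity filter. Everything named here is an assumption for illustration: `embed` is whatever embedding function you already use, and the 0.5 threshold is a placeholder to tune on your own data.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_close_variants(original: str, variants: list[str],
                        embed: Callable[[str], Sequence[float]],
                        min_sim: float = 0.5, max_variants: int = 4) -> list[str]:
    """Drop variants that drifted from the original, then cap how many survive."""
    base = embed(original)
    scored = sorted(((cosine(base, embed(v)), v) for v in variants), reverse=True)
    kept = [v for sim, v in scored if sim >= min_sim]
    # The original query always stays in the set that gets retrieved and fused.
    return [original, *kept[:max_variants]]
```

Keeping the original query in the returned list means a bad batch of variants can dilute the fused results but cannot replace the user's actual question.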
8) Tool E: HyDE (hypothetical document embeddings)
HyDE is the clever one. Worth understanding because interviewers ask.
The idea. Short queries and long documents live in different parts of embedding space. Asking "what about over 30 days?" produces a vector shaped like a question. The chunks in your bookshelf are shaped like answers — full sentences, policy paragraphs, technical prose.
So HyDE does this. Ask a small LLM to write a fake ideal answer to the query. Embed that fake answer. Use that vector to search.
query: "enterprise refund over 30 days?"
│
▼
LLM drafts a hypothetical answer:
"Enterprise annual plans may request a refund within
30 days of renewal. After 30 days, refunds require
manager approval and are issued as credit. Customers
should contact billing support to initiate the process."
│
▼
embed THAT text → search vector
│
▼
bookshelf returns chunks geometrically closer to
real policy paragraphs
The hypothetical answer is often factually wrong — that does not matter. What matters is that it has the shape of a real document. Embedding shape is what drives nearest-neighbour search.
Mini-FAQ. "What is HyDE and when does it pay off?" HyDE = Hypothetical Document Embeddings. Pays off when (a) the query is much shorter than the chunks, (b) the embedder was trained on document-document similarity rather than query-document, (c) you do not yet have query-document training pairs. Cost: one extra LLM call (~$0.0002–$0.001) and ~200–500 ms added latency. Increasingly displaced by query-aware embedders such as Voyage and Cohere Embed v3, which take an input type that tells the model whether it is embedding a query or a document.
Failure mode. The hypothetical answer drifts into a different domain. Example — user asks about "enterprise refunds," LLM invents an answer about consumer subscription refunds. Now the search vector points the wrong way. Use a small, instructable model and a short prompt.
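The whole HyDE step is two calls. A minimal sketch, assuming `generate` (prompt in, text out) and `embed` (text in, vector out) are callables you already have; the prompt wording is illustrative, and it names the domain explicitly to reduce the drift described above.

```python
from typing import Callable, Sequence

# Illustrative prompt; the domain hint ("enterprise policy") is an assumption.
HYDE_PROMPT = (
    "Write a short passage, in the style of an enterprise policy document, "
    "that answers the question.\nQuestion: {question}\nPassage:"
)


def hyde_vector(query: str,
                generate: Callable[[str], str],
                embed: Callable[[str], Sequence[float]]) -> Sequence[float]:
    """Embed a drafted hypothetical answer instead of the raw query."""
    draft = generate(HYDE_PROMPT.format(question=query))
    # The draft may be factually wrong; only its document-like shape matters here.
    return embed(draft)
```

The returned vector replaces `embed(query)` in the dense search. The draft itself is never shown to the user.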
9) Tool F: filtered and metadata-aware retrieval
Vague queries are not the only failure. The right chunk may exist in the corpus but belong to the wrong tenant, wrong language, or wrong product version. Pre-filter before vector scoring.
Every chunk in your bookshelf has metadata. Typical fields:
| Field | Example values |
|---|---|
| tenant_id | acme, globex, initech |
| doc_type | policy, faq, runbook, ticket, contract |
| language | en, de, ja |
| product | billing, api, mobile |
| version | v2, v3, v4 |
| created_at | 2024-Q1, 2025-Q3 |
| confidentiality | public, internal, restricted |
For our running example, the user is logged in as Acme admin. Their query should be filtered: tenant_id = "acme" AND doc_type IN ("policy", "faq") AND language = "en". That trims the shelf from a million chunks to maybe ten thousand before the vector index even starts scoring.
Pinecone, Weaviate, Qdrant, Milvus, pgvector, and Azure AI Search all support metadata filtering at query time. The mechanics differ — Pinecone uses metadata filters in the query, pgvector uses a SQL WHERE clause, Weaviate uses GraphQL where. The principle is identical.
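Why filter before scoring rather than after? A toy example makes the difference visible. The brute-force `top_k` below is a stand-in for a real vector index, and the chunks and vectors are made up for illustration.

```python
def top_k(chunks: list[dict], query_vec: list[float], k: int) -> list[dict]:
    """Brute-force nearest neighbours by dot product (stand-in for a vector index)."""
    def score(chunk: dict) -> float:
        return sum(a * b for a, b in zip(chunk["vec"], query_vec))
    return sorted(chunks, key=score, reverse=True)[:k]


def is_acme(chunk: dict) -> bool:
    return chunk["tenant_id"] == "acme"


# Made-up corpus: Globex chunks happen to score higher than Acme's.
chunks = [
    {"id": "globex-1", "tenant_id": "globex", "vec": [0.9, 0.1]},
    {"id": "globex-2", "tenant_id": "globex", "vec": [0.8, 0.2]},
    {"id": "acme-1", "tenant_id": "acme", "vec": [0.7, 0.3]},
    {"id": "acme-2", "tenant_id": "acme", "vec": [0.6, 0.4]},
]
query = [1.0, 0.0]

# Post-filter: take top-2 over everything, then drop other tenants.
post = [c["id"] for c in top_k(chunks, query, 2) if is_acme(c)]
# Pre-filter: narrow to the tenant first, then take top-2.
pre = [c["id"] for c in top_k([c for c in chunks if is_acme(c)], query, 2)]

print(post)  # []
print(pre)   # ['acme-1', 'acme-2']
```

Post-filtering returns nothing because both top slots went to another tenant's chunks. Pre-filtering returns the two Acme chunks that were there all along.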
Failure mode. Forgetting tenant isolation. Glean built an entire product around the fact that enterprise search must respect permissions. A retriever that ignores tenant_id leaks documents across customers. Fail to filter on confidentiality and you serve internal docs to public users. This is a security bug, not a relevance bug.
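One way to keep isolation out of the LLM's hands is to build the filter on the server from the authenticated session. A minimal sketch, with hypothetical names: `session` is whatever your auth layer produces, `suggested` is any filter the rewriter proposed, and the field names follow the metadata table above.

```python
from typing import Optional

# Fields the rewriter is allowed to suggest. Security fields are never on this list.
SUGGESTIBLE = {"doc_type", "language", "product", "version"}


def build_filter(session: dict, suggested: Optional[dict] = None) -> dict:
    """Merge optional relevance filters with mandatory, session-derived security filters."""
    flt = {k: v for k, v in (suggested or {}).items() if k in SUGGESTIBLE}
    # Mandatory filters come from the authenticated session, overriding anything else.
    flt["tenant_id"] = session["tenant_id"]
    flt["confidentiality"] = session["allowed_confidentiality"]
    return flt


session = {"tenant_id": "acme", "allowed_confidentiality": ["public", "internal"]}
print(build_filter(session, {"tenant_id": "globex", "doc_type": "policy"}))
# {'doc_type': 'policy', 'tenant_id': 'acme', 'confidentiality': ['public', 'internal']}
```

The suggested `tenant_id` is discarded because it is not on the allow-list, and the session value is written last. How the resulting dictionary maps onto a given vector store's filter syntax depends on that store.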
10) Retrieval modes: sparse, dense, hybrid
Once the query is repaired, choose how to score.
| Mode | What it captures | Where it wins | Where it fails |
|---|---|---|---|
| Sparse (BM25, SPLADE) | Exact tokens, rare terms | Product names, error codes, legal phrases, SKUs | Synonyms, paraphrase |
| Dense (vector) | Semantic meaning | Paraphrase, fuzzy intent | Rare entities, exact codes |
| Hybrid (RRF fusion) | Both | Most real-world enterprise corpora | Marginally more complex to tune |
| Multi-vector (ColBERT) | Token-level meaning | Long chunks, complex queries | Higher index size, compute cost |
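Reciprocal Rank Fusion is the dominant fusion algorithm. For each chunk, compute score = sum over retrievers of 1/(k + rank), where rank is the chunk's 1-based position in that retriever's list and k is a smoothing constant, typically 60. A chunk missing from a retriever's list contributes nothing for that retriever. Cheap, almost no tuning, robust. Elastic, Vespa, Weaviate, OpenSearch, and Azure AI Search all expose hybrid search with RRF or a close rank-based variant as a built-in option.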
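The formula above fits in a few lines. A minimal sketch, assuming each retriever returns a ranked list of chunk ids, best first; the ids are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25 = ["c1", "c2", "c3"]
dense = ["c2", "c4", "c1"]
print([chunk_id for chunk_id, _ in rrf_fuse([bm25, dense])])
# ['c2', 'c1', 'c4', 'c3']
```

Here c2 wins with 1/62 + 1/61, just ahead of c1 with 1/61 + 1/63. Chunks found by both retrievers outrank chunks found by only one, and no raw BM25 or cosine score is ever compared across retrievers.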
Multi-vector retrieval (ColBERT, ColBERTv2, ColPali for images-with-text) stores one vector per token rather than one per chunk. Recall and precision both improve. Index size grows 10–30x. Use when single-vector hybrid still misses too often.
Mini-FAQ. "Hybrid fusion — what's the overhead?" Two retrievers run in parallel. Wall-clock latency is max(BM25, dense), not their sum. RRF fusion itself is microseconds. Index cost roughly doubles — you maintain a BM25 index next to the vector index. Most production stacks decide hybrid is worth it before the first 100k chunks.
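Running the two retrievers side by side might look like the sketch below. `sparse_search` and `dense_search` are hypothetical callables that take a query and a result count; threads only overlap the work when those calls are I/O-bound, such as network requests to a search service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Search = Callable[[str, int], list[str]]


def hybrid_candidates(query: str, sparse_search: Search, dense_search: Search,
                      top_n: int = 20) -> tuple[list[str], list[str]]:
    """Run both retrievers concurrently and return their ranked id lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sparse_future = pool.submit(sparse_search, query, top_n)
        dense_future = pool.submit(dense_search, query, top_n)
        # Wall-clock time is roughly the slower of the two calls, not their sum.
        return sparse_future.result(), dense_future.result()
```

The two lists then go to the fusion step as separate rankings.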
Predict the order of A–F before reading the trace
Before reading on, write down the order in which you would apply A–F for a multi-turn enterprise support chatbot. Which would you skip on the first turn? Which would you skip for a single-shot search box?
11) The failure chain: our running query through every tool
Let's trace what each tool does for "what about the other one — over 30 days?"
| Tool | Without it | With it |
|---|---|---|
| A. Conversational rewrite | Embed the fragment → vague top-k | "Acme enterprise refund beyond 30 days" → focused top-k |
| B. Query rewrite | Acronyms, casual wording stays | Corpus-aligned terminology like "manager approval," "credit memo" |
| C. Expansion | BM25 misses "credit memo" entirely | BM25 catches the rare-but-exact match |
| D. Multi-query | One embedding, one region of shelf | 4 paraphrases cover policy, escalation, billing |
| E. HyDE | Short-query vector lands in wrong neighbourhood | Document-shaped vector lands near policy chunks |
| F. Metadata filter | Globex's policy leaks into Acme's results | Filtered to tenant=acme, doc_type=policy |
The user types eight words. The pipeline can run up to six transformations before retrieval begins. Mature systems run whichever subset their recall numbers justify. Toy demos skip all of them. That gap is the gap between "RAG works on the slide" and "RAG works in production."
12) Cost and latency: what query work actually adds
Adding query-side work is not free. Quantify before adding.
| Stage | Extra LLM calls | Extra latency (typical) | Extra cost (per query) |
|---|---|---|---|
| Conversational rewrite | 1 small | 100–300 ms | $0.0001–$0.0003 |
| Query rewrite | 1 small | 80–200 ms | $0.0001–$0.0003 |
| Query expansion (LLM) | 1 small | 80–200 ms | $0.0001–$0.0002 |
| Multi-query fan-out | 1 small (generation) + N retrievals | 100–250 ms LLM + parallel retrieves | $0.0002–$0.0004 + N × retrieve cost |
| HyDE | 1 small-medium | 200–500 ms | $0.0002–$0.001 |
| Metadata filter | 0 | <5 ms | negligible |
| Hybrid retrieval | 0 | max(BM25, dense), not sum | doubles index storage |
Stack matters. Numbers above assume a small fast LLM (GPT-4o-mini, Claude 3.5 Haiku, Gemini Flash) for query work. Use a frontier model for rewriting and you pay 10–30x more and add seconds. Most teams use a small model for query work and a larger model only for the final answer.
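To quantify a candidate stack before building it, add up the rows you plan to use. The sketch below copies the ranges from the table above and assumes the LLM stages run one after another; the stage keys are names chosen for illustration, and retrieval cost for fan-out is left out because it depends on N and on your vector store's pricing.

```python
# (min, max) per stage, copied from the table above. LLM calls only.
STAGES = {
    "conversational_rewrite": {"latency_ms": (100, 300), "cost_usd": (0.0001, 0.0003)},
    "query_rewrite": {"latency_ms": (80, 200), "cost_usd": (0.0001, 0.0003)},
    "query_expansion": {"latency_ms": (80, 200), "cost_usd": (0.0001, 0.0002)},
    "multi_query_generation": {"latency_ms": (100, 250), "cost_usd": (0.0002, 0.0004)},
    "hyde": {"latency_ms": (200, 500), "cost_usd": (0.0002, 0.001)},
}


def stack_overhead(stage_names: list[str]) -> dict[str, tuple[float, float]]:
    """Sum the (min, max) latency and cost of the chosen stages, run sequentially."""
    latency = [STAGES[name]["latency_ms"] for name in stage_names]
    cost = [STAGES[name]["cost_usd"] for name in stage_names]
    return {
        "latency_ms": (sum(lo for lo, _ in latency), sum(hi for _, hi in latency)),
        "cost_usd": (sum(lo for lo, _ in cost), sum(hi for _, hi in cost)),
    }


overhead = stack_overhead(["conversational_rewrite", "multi_query_generation"])
print(overhead["latency_ms"])  # (200, 550)
```

Conversational rewrite plus fan-out generation adds 200–550 ms and roughly $0.0003–$0.0007 per query before any retrieval runs. Stages that can run in parallel would lower the latency figure.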
Mini-FAQ. "When does query rewrite cost more than it saves?" When the corpus is small, well-curated, and queries are already structured (form-based search, internal admin tools). Adding 200 ms and an LLM call for a query that already retrieves correctly is pure waste. Measure recall@k without rewrite first. Only add rewrite if recall is below target.
13) Failure modes: the catalogue
The four that actually break production.
- Lost context references in chat. "What about the other one?" embedded as-is. Fix: conversational rewrite from turn 2 onward.
- Over-expanded query. Synonyms added until the query matches everything. Recall goes up, precision collapses, reranker cannot save it. Fix: cap expansion at 3–5 terms, score each against the original.
- Drift from original intent. Multi-query variants stop being paraphrases. HyDE invents a different domain. Fix: keep the original query in the fused set; weight it higher.
- Ignored filters and metadata. Tenant leak. Wrong language. Old product version. Fix: enforce mandatory filters server-side, not client-side. Never trust the LLM to add the tenant filter to its rewrite.
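A fifth, sneaky one. Stale rewrite cache. You cache rewrites keyed on the raw message. The same fragment ("what about the other one?") shows up in a different conversation, and the cache returns a rewrite built from someone else's history. Fix: when chat history is in scope, key the cache on the rewritten standalone query (or on the raw message plus the history it was rewritten against), never on the raw message alone.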
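One way to keep a rewrite cache from crossing conversations is to hash the history into the key. A minimal sketch; the function name and the six-turn window are assumptions for illustration.

```python
import hashlib
import json


def rewrite_cache_key(history: list[tuple[str, str]], message: str,
                      max_turns: int = 6) -> str:
    """Cache key covering both the message and the turns it was rewritten against."""
    payload = json.dumps(
        {"history": history[-max_turns:], "message": message},
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same fragment typed in two different conversations now produces two different keys, so a cached rewrite can only be reused when the surrounding turns match as well.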
Query-side work across shipped products
Real products that do meaningful query-side work. The shape repeats; the corpus, the model, and the priorities differ.
- Perplexity AI — rewrites the user's question into one or more web search queries; multi-query fan-out is core to grounded answers.
- Glean — enterprise search across SaaS; permission-aware filtering and per-user metadata gates every query.
- ChatGPT Search — conversational rewrite before web search; references in chat are resolved into standalone queries.
- You.com and Andi — consumer search-chat with query rewriting and multi-source retrieval.
- Phind — developer search; rewrites natural-language questions into code-aware queries.
- Anthropic Claude with web search — query rewriting and multi-query expansion before tool calls.
- Cohere Coral / Command — built-in query rewrite as part of the RAG API.
- Vectara — managed RAG with query rewriting and hybrid retrieval as defaults.
- Pinecone Assistant — managed pipeline that includes rewrite and metadata-filtered retrieval.
- LlamaIndex: SubQuestionQueryEngine, HyDEQueryTransform, and QueryFusionRetriever ship in core.
- LangChain: MultiQueryRetriever, SelfQueryRetriever for metadata-aware retrieval, HypotheticalDocumentEmbedder.
- Haystack: query pipelines with rewriting, expansion, and hybrid retrieval as composable nodes.
- LangGraph query agents — graph-based loops where rewriting and retrieval can re-run with feedback.
- GitHub Copilot Chat — codebase queries rewritten with file context and symbol names before retrieval.
- Cursor and Windsurf — codebase questions rewritten with the open file, the active symbol, and recent edits.
- Notion AI — workspace queries resolved against page references and active document context.
- Slack AI — channel-aware rewrites that resolve "this message" and "earlier today" into searchable queries.
- Intercom Fin — customer support; user message rewritten with conversation history before help-center search.
- Zendesk AI agents — ticket context fused into the query before knowledge-base retrieval.
- Hebbia — financial document QA; multi-query decomposition across long filings.
- Harvey — legal RAG; query rewriting into legal terminology and clause-aware metadata filters.
- Casetext CoCounsel — legal queries decomposed across case law sub-questions.
- Vespa and Elastic ELSER — sparse-dense hybrid retrieval with query-time expansion.
- Azure AI Search — query rewriting feature in semantic ranker; filter expressions for metadata.
- Google Vertex AI Search — query understanding layer before document ranking.
Different products. Same shape. Most query-side bugs in production are the same bug repeated.
Recall: query-side tools cold
- Why does embedding the raw user message fail in a chat context?
- What does conversational rewriting actually do, and what does it cost per turn?
- What is HyDE, and what shape problem does it solve?
- Why do multi-query variants need to stay close to the original intent?
- When is metadata filtering a security requirement, not a relevance one?
- What does Reciprocal Rank Fusion do, and why is it the default?
- Name two failure modes that come from too much query work, not too little.
- For a small, well-curated corpus, which tools from A–F would you skip and why?
Interview Q&A
Q1. What is HyDE and when would you use it? A. HyDE (Hypothetical Document Embeddings) asks an LLM to draft a fake answer to the user's question, then embeds that draft and uses it as the search vector. It works because chunks in the index look like answers, not questions — embedding an answer-shaped string lands closer to relevant chunks. Use it when queries are much shorter than chunks, when you don't have query-tuned embeddings, or when retrieval recall is the bottleneck. Common wrong answer to avoid: "HyDE makes the LLM answer the question directly."
Q2. How do you handle 'what about the other one?' in a chat-based RAG? A. Run a conversational rewrite before retrieval. Pass the last 3–6 turns plus the new message to a small LLM with an instruction to output a standalone search query, resolving pronouns and not inventing facts. Then embed the rewrite, not the raw message. Common wrong answer to avoid: "Concatenate the whole chat history and embed it."
Q3. Sparse vs dense vs hybrid retrieval — which one and why? A. Sparse (BM25, SPLADE) wins on exact tokens, product names, SKUs, error codes. Dense wins on paraphrase and synonym. Hybrid wins in most real corpora because real queries mix both patterns. Fuse with Reciprocal Rank Fusion — robust and tuning-free. Common wrong answer to avoid: "Dense is always better; sparse is legacy."
Q4. Why is multi-query fan-out useful, and when does it hurt? A. One embedding is one point in space; the user's intent often spans a region. Three to five paraphrased queries cover the region. It hurts when variants drift away from the original intent — the fused top-k then includes irrelevant chunks. Keep the original query in the fusion set and cap variants. Common wrong answer to avoid: "More queries always means better recall."
Q5. How does metadata filtering interact with vector search? A. Metadata predicates run before or alongside the vector index. Pinecone, Weaviate, Qdrant, Milvus, and pgvector all support filtered ANN. Filters cut the candidate set, which improves both relevance and security — tenant isolation, language, document type, version, and confidentiality should be enforced server-side, never trusted to the LLM to add. Common wrong answer to avoid: "Filter after retrieval — it's the same result." (It is not — filtered ANN preserves recall; post-filtering throws away relevant chunks that fell outside top-k.)
Q6. What is Reciprocal Rank Fusion, and why is it the default fusion algorithm?
A. RRF computes score = sum over retrievers of 1/(k + rank), k typically 60. It is rank-based, so scale differences between BM25 and vector scores do not matter. It is tuning-free, robust, and cheap. Used by Vespa, Elastic, Weaviate, OpenSearch, Azure AI Search.
Common wrong answer to avoid: "You add the scores together." (You cannot — they live on different scales.)
Q7. When does query rewriting cost more than it saves? A. When the corpus is small and well-curated, when queries are already structured, or when recall@k is already at target without rewriting. Adding 200 ms and an LLM call to fix a query that already works is pure overhead. Always measure recall first. Common wrong answer to avoid: "Rewriting always helps."
Q8. How do you prevent the rewriter from hallucinating facts not in the chat? A. Use a small, well-aligned model. Use a strict system instruction: "resolve references using only the conversation; do not invent facts." Check each rewrite against the chat, either with an LLM-as-judge or with a cheap pattern check that every number and named entity in the rewrite also appears in the conversation. Drop rewrites that fail. Common wrong answer to avoid: "Frontier models don't hallucinate on short tasks."
Apply now (10 min)
Step 1 — model the exercise. Take our running query: "what about the other one — over 30 days?" in a chat where prior turns mention an Acme enterprise renewal. Write out the six rows below for it.
| Tool | Will I use it? | What it produces for this query |
|---|---|---|
| A. Conversational rewrite | Yes | "Acme enterprise refund beyond 30 days" |
| B. Query rewrite | Yes | adds "manager approval," "credit memo" |
| C. Expansion | Maybe | adds "credit," "reimbursement" for BM25 |
| D. Multi-query | Yes (3 variants) | policy, escalation, billing angles |
| E. HyDE | Optional | only if recall@20 is below target |
| F. Metadata filter | Mandatory | tenant=acme, doc_type=policy, lang=en |
Step 2 — your turn. Pick one real query from your product. Chat-style if you have one. Write the same six rows. For each row mark cost (extra LLM calls, extra latency, extra index cost) and value (what failure mode it prevents).
Step 3 — sketch from memory. Redraw the six-tool column. Beside each box, write the output shape (string, list of strings, vector, filter expression) — not the description. If you can do this cold, you understand the layer.
What you should remember
This chapter explained why "embed the user's message" is the toy-demo move that breaks the moment real chat arrives. The user typed eight words; a production pipeline can run up to six transformations before retrieval even begins (conversational rewrite, query rewrite, expansion, multi-query fan-out, HyDE, metadata filter), and a hybrid retriever then fuses BM25 with dense vectors. Each tool answers a different failure: lost-context references, vocabulary mismatch, rare-token blindness, single-point intent, short-query geometry, tenant leak.
You also learned the asymmetric pricing of query work. A small fast LLM (Haiku, Flash, GPT-4o-mini) does the rewriting; a larger model only writes the final answer. Skipping the small model is rarely the savings it looks like: when the rewrite is needed, every wrong-chunk retrieval downstream costs more than the 100 ms the rewrite would have added.
Carry this diagnostic forward: when chat-style RAG returns generic answers on turn 2, look at the embedded text, not the model. If the raw fragment was embedded instead of a rewritten standalone query, the bug is upstream of retrieval. Fix the rewriter. Always log both the raw and rewritten query for every chat turn.
Remember:
- A pronoun is invisible to an embedder. Conversational rewrite from turn 2 onward is not optional.
- Hybrid retrieval (BM25 + dense, fused by RRF) is the production default. Dense alone misses rare tokens; sparse alone misses paraphrase.
- HyDE shines when queries are much shorter than chunks — the hypothetical answer lands in the right geometric neighbourhood.
- Metadata filters are security, not relevance. Enforce server-side; never trust the LLM to add the tenant filter.
- Always measure recall without query work first. Adding 200 ms and an LLM call to fix a query that already worked is pure waste.
Bridge. The query is now repaired. Hybrid retrieval has fetched 20–60 candidate chunks. The right ones are somewhere in that set. But 60 candidates is far too many to hand to the LLM, and bi-encoder similarity is too blunt to rank them sharply. The next file is about the cross-encoder that rescans the candidates one by one and pushes the strongest to the top.