02. Advanced RAG — Narrative Explainer¶

Companion to 03_study_material.md. That file is the quick-reference sheet. This file is the story, the failure map, and the picture in your head.

Table of contents¶

ELI5 — the promoted librarian (start here)
Chapter 1: The opening failure
1.1 The demo looked fine
1.2 The question that breaks the demo
1.3 Why 70% accuracy is a dangerous number
Chapter 2: Query transformation
2.1 Why raw user wording is not a retrieval query
2.2 Query rewriting
2.3 Query expansion
2.4 Query decomposition
2.5 Step-back prompting
2.6 Worked example A — financial comparison
2.7 Worked example B — policy exception
2.8 Worked example C — metric change diagnosis
2.9 What query transformation is not
Chapter 3: HyDE & advanced retrieval
3.1 The hypothesis
3.2 When HyDE helps and when it hurts
3.3 Parent-child retrieval
3.4 Fusion retrieval — dense + sparse
3.5 How these fit together
Chapter 4: Reranking & filtering
4.1 Retriever vs reranker
4.2 The cross-checker
4.3 Metadata filtering
4.4 MMR for diversity
4.5 Worked example — cleaning a noisy result set
Chapter 5: Agentic RAG & multi-hop
5.1 Why loops appear
5.2 Corrective RAG (CRAG)
5.3 Self-RAG and the confidence gate
5.4 Iterative retrieval
5.5 Routing between retrieval strategies
5.6 Retrieval prompts you can use tomorrow
5.7 Honest admission
Chapter 6: Recap
6.1 Failure-fix table
6.2 Key points to remember
6.3 Important interview questions
6.4 Production experience
6.5 Foundation-gap audit before Module 09
6.6 Exercises and bridge forward

ELI5 — the promoted librarian¶

In Module 07, you had a helpful librarian. You asked a question. The librarian fetched a few books. Then the model answered from those books. That was basic RAG.

Now imagine that librarian got promoted to head researcher. They still fetch books. But they do much more before trusting the first shelf.

They hear your question and think, “Is this even the best way to ask for the evidence?” So they clean the question first. That is the rewriter.

Then they think, “If a strong answer existed, what would it probably look like?” That imaginary answer helps guide the search. That is the hypothesis.

Then they say, “This question is secretly three smaller questions.” So they split it into steps. That is the multi-step plan.

Then they bring back many candidate passages and say, “Wait, these all look related, but which ones really answer the question?” That careful second pass is the cross-checker.

Finally they look at the draft answer and ask, “Did we actually get enough support, or should we search again?” That last guard is the confidence gate.

So the librarian from Module 07 did retrieval. The head researcher in Module 08 does retrieval plus judgment. That is the difference.

Suppose the user asks:

Compare Q3 and Q4 revenue growth across all regions.

A junior librarian searches that exact sentence, pulls a few quarterly summaries, misses the APAC table, misses the LATAM footnote, and the answer comes out smooth but wrong.

The promoted researcher works like this:

question
  ↓
rewriter
  ↓
retrieval plan
  ├─ maybe one search
  ├─ maybe many searches
  └─ maybe different search types
  ↓
cross-checker
  ↓
answer draft
  ↓
confidence gate
  ├─ answer now
  └─ search again

That loop is the whole module. Everything else is detail and control.

Chapter 1: The opening failure¶

1.1 The demo looked fine¶

Basic RAG gives a very convincing first demo. Ask, “What is the refund policy?” It retrieves the policy chunk and answers correctly.

Ask, “When was feature X launched?” It finds a release note and again answers correctly.

So the team relaxes. They say, “Great. Retrieval solved hallucination. Ship it.”

Then production users arrive. Production users do not ask neat benchmark questions. They ask mixed, vague, multi-step, constraint-heavy questions.

They ask things like:

- Compare Q3 and Q4 revenue growth across all regions.
- Which contracts expire next quarter for EMEA enterprise customers?
- Did latency improve after the caching rollout for premium users?
- Summarize policy exceptions approved after the August update.

These are not single-hop lookups. They are tiny research tasks. That is where naive RAG begins to crack.

1.2 The question that breaks the demo¶

Take this question:

Compare Q3 and Q4 revenue growth across all regions.

Looks harmless. Actually it hides several retrieval demands.

We need:

- Q3 numbers
- Q4 numbers
- all regions
- a stable region list
- maybe commentary explaining changes
- maybe the same definition of “growth” across documents

Now watch the naive pipeline.

user question
  ↓
embed once
  ↓
vector search top-k
  ↓
stuff chunks into prompt
  ↓
answer

The vector store may return:

- a Q4 CEO letter
- an EMEA sales highlight
- an annual summary
- an APAC operations note
- one old Q3 investor deck

Everything is related. Nothing is sufficient. That is the trap.

Maybe Q3 appears for only two regions. Maybe Q4 appears for all regions. Maybe “revenue growth” is written as “topline expansion.” Maybe LATAM is only inside a table caption.

The model now sees partial evidence. Partial evidence is dangerous because generation is strong. Strong generation plus weak retrieval creates believable nonsense.

weak retrieval
  +
strong generation
  =
confident nonsense

That is the core failure. Naive RAG often fails before the answer prompt even starts.

1.3 Why 70% accuracy is a dangerous number¶

Basic RAG often lands around the “seems decent” zone. Maybe 70% overall accuracy. Maybe 75% on easy queries. That sounds almost done. In production, it is nowhere near done.

Why? Because averages hide the hard cases. The exact questions users care most about are often the ones with multiple hops, multiple entities, explicit constraints, or comparison logic. Those can be wrong 40% of the time.

That creates four kinds of damage.

First, trust damage. Users stop believing even correct answers.

Second, operational damage. Someone makes a real decision from a polished but wrong answer.

Third, debugging damage. Teams blame the model, when the actual failure was retrieval design.

Fourth, product damage. Leadership concludes, “RAG is hype.” No. Bad retrieval pipelines are hype.

So the stakes are simple. Basic RAG is a good intern. Advanced RAG is a good analyst. The rest of this module is the promotion path.

Chapter 2: Query transformation¶

2.1 Why raw user wording is not a retrieval query¶

A user writes for humans. A retriever needs something else. It needs a query shaped for evidence lookup.

Humans omit context. Humans use vague references. Humans mix three jobs into one sentence. Humans say, “What happened with the bug from last week?” That sounds fine in conversation. It is terrible for retrieval.

Which bug? What system? What date range? What outcome counts as “what happened”? Without those handles, retrieval becomes guesswork.

So the first rule of advanced RAG is this: do not treat raw user wording as sacred retrieval input. Preserve meaning. Improve shape. That is query transformation.

2.2 Query rewriting¶

Query rewriting means: keep the intent, make the search easier.

A good rewrite:

- preserves every constraint
- resolves obvious ambiguity
- makes implied entities explicit
- removes chat filler
- converts spoken style into retrieval style

Example:

User asks:

What happened with the bug from last week?

A stronger retrieval query is:

Production incident reported last week for the checkout service, including root cause, timeline, and resolution.

Now the retriever has handles. Service. Time window. Incident type. Desired evidence.

But rewriting has one danger. It can quietly remove or change a constraint. Suppose the real user query was, “Did premium users in APAC see latency improve after the caching rollout?” If the rewrite drops premium or drops APAC, you already lost.

So senior systems log both forms:

- raw question
- rewritten query

If the answer goes wrong, you want to know whether the damage happened at rewriting, retrieval, reranking, or generation.

2.3 Query expansion¶

Sometimes the user is asking correctly, but the corpus speaks a different dialect.

The user says, “revenue growth.” The document says, “topline expansion.”

The user says, “refund.” The document says, “reimbursement.”

The user says, “North America.” The slides say, “NA.”

That is where query expansion helps. Expansion adds nearby terms, aliases, or synonyms that improve recall.

Example:

raw query:
Compare Q3 and Q4 revenue growth across all regions

expanded hints:
- revenue growth
- sales growth
- topline growth
- APAC
- EMEA
- LATAM
- North America
- NA

Notice the difference:

- rewrite improves the shape of the question
- expansion improves recall inside the corpus

Expansion is powerful. It also increases noise. You usually accept that trade-off, then let reranking rescue precision later.
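A minimal sketch of expansion, assuming a small hand-written alias table. The `ALIASES` dictionary and the `expand_query` helper are hypothetical names for illustration. A real system might build the table from a glossary or ask an LLM for variants.

```python
# Hypothetical alias table; the terms come from the examples in this section.
ALIASES = {
    "revenue growth": ["sales growth", "topline growth", "topline expansion"],
    "north america": ["NA"],
    "refund": ["reimbursement"],
}

def expand_query(query: str, aliases: dict[str, list[str]] = ALIASES) -> list[str]:
    """Return the original query plus one variant per matching alias."""
    variants = [query]
    lowered = query.lower()
    for term, alternatives in aliases.items():
        start = lowered.find(term)
        if start == -1:
            continue  # this term does not appear in the query
        for alt in alternatives:
            # Swap the matched span for the alias, leaving the rest untouched.
            variants.append(query[:start] + alt + query[start + len(term):])
    return variants

variants = expand_query("Compare Q3 and Q4 revenue growth across all regions")
for v in variants:
    print(v)
```

Each variant can be sent to the retriever as its own query. The original wording always stays first, so expansion only adds recall and never replaces the user's phrasing.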

2.4 Query decomposition¶

Some questions are simply too large for one retrieval call. No amount of clever embedding fixes that. The structure itself is multi-hop.

Example:

Which regions improved revenue growth from Q3 to Q4, and what explanations were given in the earnings commentary?

Hidden jobs:

1. find Q3 numbers by region
2. find Q4 numbers by region
3. compare the numbers
4. find commentary explaining the change
5. connect each explanation to the right region

One giant search usually returns chunks that partially touch many parts, but fully solve none.

So we use query decomposition. Break the hard question into smaller, independently retrievable sub-queries.

For this example:

1. What was Q3 revenue growth for each region?
2. What was Q4 revenue growth for each region?
3. Which regions improved or declined?
4. What explanations were given for each region?

This is the ELI5 placeholder called the multi-step plan. It matters because each retrieval call is local: it only surfaces passages close to the single query it was given. Complex business questions are often hidden workflows. Treat them like workflows.

2.5 Step-back prompting¶

Sometimes the answer lives in a broader principle, not in the exact surface wording of the question.

Example:

Can a contractor approve spend above the project cap during an emergency?

The relevant evidence may not contain that sentence. It may live under a broader section like, “emergency procurement exception handling.”

So before searching the narrow case, we ask a broader question:

What higher-level policy governs this situation?

That is step-back prompting. We retrieve at two levels:

- broad governing rule
- specific edge case or exception

ASCII picture:

specific question
  ↓
step back to governing principle
  ↓
retrieve broad policy
  +
retrieve specific exception
  ↓
combine

The broad query finds the anchor. The narrow query finds the edge case. Together they answer more reliably.

2.6 Worked example A — financial comparison¶

Let us do the full transformation flow. Slowly. This is what interviewers love to hear.

Raw user question:

Compare Q3 and Q4 revenue growth across all regions.

Step 1: diagnose the failure risk

Risks:

- multi-hop comparison
- region coverage may be incomplete
- synonyms may differ across documents
- some numbers may live inside tables

Step 2: rewrite

Rewritten query:

Compare regional revenue growth percentages for Q3 and Q4 across APAC, EMEA, LATAM, and North America.

Why this helps:

- “regional” is explicit
- “percentages” narrows the metric type
- region names are now concrete

Step 3: expand

Add terms like:

- revenue growth
- sales growth
- topline growth
- North America
- NA

Step 4: decompose

Sub-query 1: Q3 regional revenue growth percentages

Sub-query 2: Q4 regional revenue growth percentages

Sub-query 3: management commentary explaining regional change between Q3 and Q4

Step 5: optional step-back

Step-back query: quarterly regional performance summary and management commentary

Now the search plan looks like this:

raw question
  ↓
rewrite
  ↓
expand
  ↓
decompose into 3 sub-queries
  ↓
retrieve for each
  ↓
merge evidence
  ↓
compare numbers
  ↓
answer

That is already advanced RAG. No new frontier model required. Just better thinking before retrieval.
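The "retrieve for each, then merge evidence" part of the plan can be sketched in a few lines. This is a toy under stated assumptions: `CORPUS`, `retrieve`, and `retrieve_plan` are hypothetical, and the word-overlap scorer stands in for a real retriever.

```python
# Hypothetical four-document corpus for illustration only.
CORPUS = {
    "q3_appendix": "Q3 regional revenue growth percentages for APAC EMEA LATAM and North America",
    "q4_appendix": "Q4 regional revenue growth percentages for APAC EMEA LATAM and North America",
    "commentary": "Management commentary explaining regional change between Q3 and Q4",
    "hr_note": "Updated parental leave policy for all employees",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by the number of shared lowercase words."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(text.lower().split())), doc_id)
              for doc_id, text in CORPUS.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then id
    return [doc_id for _, doc_id in scored[:k]]

def retrieve_plan(sub_queries: list[str], k: int = 2) -> list[str]:
    """Run every sub-query, then merge results in first-seen order without duplicates."""
    merged: list[str] = []
    for sub_query in sub_queries:
        for doc_id in retrieve(sub_query, k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged

plan = [
    "Q3 regional revenue growth percentages",
    "Q4 regional revenue growth percentages",
    "management commentary explaining regional change between Q3 and Q4",
]
evidence = retrieve_plan(plan)
print(evidence)
```

The merge step is deliberately boring. Deduplicate, keep order, and hand a combined evidence set to the comparison step.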

2.7 Worked example B — policy exception¶

Raw question:

Are interns allowed to access customer production data for debugging?

If you search that raw sentence, you may retrieve an internship handbook, a debugging guide, and some generic data policy. Still not enough.

Rewrite:

Policy for intern access to customer production data during debugging or incident response.

Expansion:

- intern
- trainee
- temporary employee
- production data
- live customer data
- debugging
- incident response

Step-back query:

Data access control policy for customer production systems and exceptions.

Possible decomposition:

1. What is the default data-access policy for interns?
2. What is the policy for customer production data?
3. Are there debugging exceptions, and who approves them?

Now the retrieval plan matches how real policy documents are written. That is the point.

2.8 Worked example C — metric change diagnosis¶

Raw question:

Did the caching rollout reduce latency for premium users in APAC?

Rewrite:

Impact of caching rollout on latency metrics for premium users in APAC.

Expansion:

- latency
- response time
- p95 latency
- premium users
- paid tier
- APAC
- Asia Pacific

Decomposition:

1. When did the caching rollout happen in APAC?
2. What latency metrics existed before rollout for premium users?
3. What latency metrics existed after rollout for premium users?
4. Do documents attribute any change to caching?

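Now add metadata filters:

- region = APAC
- user_tier = premium
- date < rollout date for the before-rollout sub-query, date >= rollout date for the after-rollout sub-query

A single date >= rollout date filter on every sub-query would hide the baseline that the comparison needs.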

See how transformation and filtering cooperate. In production they are not isolated chapters. They are one system.

2.9 What query transformation is not¶

Query transformation is not permission to invent helpful words. It is not, “I think the user probably meant this other thing.”

It must obey one hard rule: preserve intent and preserve constraints.

If a rewrite adds a fake region, removes a date, or changes “can” into “must,” you corrupted the task.

That is why structured outputs help. For example:

{
  "rewritten_query": "...",
  "must_keep_constraints": ["APAC", "premium users", "after rollout"],
  "synonyms": ["latency", "response time", "p95"]
}

The structure makes the step auditable. That is senior engineering. Not just better results, but traceable results.
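One way to use that structure is a cheap automatic check before retrieval runs. The sketch below assumes the rewriter returns JSON with the two fields shown above. `missing_constraints` is a hypothetical helper, and the verbatim substring match is a crude stand-in for a real constraint check.

```python
import json

def missing_constraints(rewrite_json: str) -> list[str]:
    """Return the constraints that do not appear verbatim in the rewritten query."""
    payload = json.loads(rewrite_json)
    rewritten = payload["rewritten_query"].lower()
    return [c for c in payload["must_keep_constraints"] if c.lower() not in rewritten]

constraints = ["APAC", "premium users", "after rollout"]

good = json.dumps({
    "rewritten_query": "Latency change for premium users in APAC after rollout of caching",
    "must_keep_constraints": constraints,
})
bad = json.dumps({
    # "premium" was silently dropped by the rewrite.
    "rewritten_query": "Latency change for users in APAC after rollout of caching",
    "must_keep_constraints": constraints,
})

print(missing_constraints(good))
print(missing_constraints(bad))
```

A non-empty result is a reason to fall back to the raw question or to retry the rewrite. Exact matching will produce false alarms when the rewrite paraphrases a constraint, so treat it as a tripwire, not a verdict.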

Chapter 3: HyDE & advanced retrieval¶

3.1 The hypothesis¶

Now comes a very clever idea. Sometimes the raw question is too short, too abstract, or too semantically weak.

Example:

Why did churn improve after onboarding changes?

The best answer passage may talk about:

- activation rate
- time-to-value
- first-week retention
- guided setup completion

The question and the answer are related, but not always close in wording. That semantic gap hurts retrieval.

So we create the hypothesis. This is HyDE: Hypothetical Document Embeddings.

Workflow:

1. ask an LLM to draft a plausible answer passage
2. embed that passage
3. retrieve real documents with that embedding
4. throw away the fake passage
5. answer only from the real evidence

Why this can work: answer-shaped text often lands closer to real answer passages than a raw question does.

Think of it like this. A question is a request. A document is a description. Embeddings often match descriptions to descriptions better. HyDE converts the request into a description-shaped probe.

raw question      real docs
    |                |
    | semantic gap   |
    └──────┐   ┌─────┘
           v   v
     hypothetical answer
              ↓
         embedding probe

3.2 When HyDE helps and when it hurts¶

HyDE is not magic. It helps under specific conditions.

It helps when:

- the question is conceptual
- the wording is sparse
- the corpus uses different terminology
- the user implies context rather than naming it

It hurts when:

- the question is already a crisp factual lookup
- the hypothetical answer drifts into the wrong frame
- the corpus is tiny and exact-match tokens matter more

Helpful case:

How did onboarding changes affect churn?

A hypothetical answer might mention activation, setup completion, and first-week engagement. Those terms can pull the real documents closer.

Harmful case:

What was APAC revenue growth in Q4 2024?

That answer probably lives in a table. Exact quarter and region matter more than semantic expansion. Here sparse search and filters may help more.

Senior thinking is not, “Always use HyDE.” It is, “Know which failure HyDE actually fixes.”

3.3 Parent-child retrieval¶

Next problem: chunk size. Small chunks are good for precision. Large chunks are good for context. Usually you want both.

Small chunks match well, but they may lose the heading, the unit, the nearby table rows, or the explanation paragraph.

Large chunks preserve context, but too much unrelated text dilutes the match.

Parent-child retrieval solves this trade-off.

The idea:

- split documents into larger parent sections
- split each parent into smaller child chunks
- index the child chunks
- retrieve matching child chunks
- return the parent section to the generator

Picture it:

Parent section: Q4 Regional Performance
  ├─ child 1: APAC revenue table row
  ├─ child 2: EMEA commentary paragraph
  ├─ child 3: LATAM footnote
  └─ child 4: North America summary

Retrieval happens on the children. Answering happens with the parent window. That gives precision plus surrounding meaning.

For the quarterly example, a child chunk may match “LATAM revenue growth +4%.” Good. But the answer probably needs the parent section too, so you know units, definitions, and neighboring region rows.
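A minimal sketch of the mechanics, assuming each child chunk stores the id of its parent. `PARENTS`, `CHILDREN`, and `retrieve_parents` are hypothetical names, and word overlap stands in for vector search over the child index.

```python
# Hypothetical parent sections, keyed by id.
PARENTS = {
    "q4_regional": ("Q4 Regional Performance. Growth is year over year, in percent. "
                    "APAC row. EMEA commentary. LATAM footnote. North America summary."),
    "q4_hiring": "Q4 Hiring Update. Headcount by office.",
}

# Child chunks carry a pointer to their parent: (parent_id, child_text).
CHILDREN = [
    ("q4_regional", "APAC revenue table row"),
    ("q4_regional", "EMEA commentary paragraph"),
    ("q4_regional", "LATAM footnote revenue growth +4%"),
    ("q4_regional", "North America summary"),
    ("q4_hiring", "headcount by office table"),
]

def overlap(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query: str, k: int = 2) -> list[str]:
    """Match on small children, then return each distinct parent section once."""
    hits = [child for child in CHILDREN if overlap(query, child[1]) > 0]
    hits.sort(key=lambda child: overlap(query, child[1]), reverse=True)
    parent_ids: list[str] = []
    for parent_id, _ in hits[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [PARENTS[parent_id] for parent_id in parent_ids]

sections = retrieve_parents("LATAM revenue growth")
print(sections)
```

Two children matched, but both point at the same parent, so the generator receives one section instead of two fragments. That deduplication is part of the value.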

3.4 Fusion retrieval — dense + sparse¶

Now another common failure. Dense retrieval is great for meaning. Sparse retrieval is great for exact words. Real systems need both.

Dense retrieval helps when the user says, “refund,” and the document says, “reimbursement.”

Sparse retrieval helps when exact strings matter:

- invoice IDs
- product SKUs
- quarter codes
- region abbreviations
- error messages

In enterprise search, rare tokens often carry huge meaning. If the user asks for “EMEA-Q4-REV-2024,” you do not want a semantically similar slide. You want that exact token.

So advanced systems run both retrievers. Then they fuse the rankings.

query
 ├─ dense retriever   → semantic neighbors
 └─ sparse retriever  → exact-token matches
           ↓
      rank fusion
           ↓
        candidates

A common fusion method is RRF, Reciprocal Rank Fusion. It combines ranking positions, not raw scores. That is useful because dense and sparse scores live on different scales.

Simple formula:

RRF(doc) = Σ 1 / (k + rank_i(doc))

Do not over-fear the formula. The intuition is easy. If a document ranks well in multiple lists, it rises in the merged result.
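The formula fits in a few lines of code. This sketch assumes each retriever returns an ordered list of document ids, with ranks starting at 1. The constant `k` dampens the influence of top positions, and 60 is a commonly used default. The document ids are made up for illustration.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a doc appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["slide_emea", "q4_table", "ceo_letter"]     # semantic neighbors
sparse = ["q4_table", "invoice_log", "slide_emea"]   # exact-token matches
fused = rrf([dense, sparse])
print(fused)
```

`q4_table` ranks second in one list and first in the other, so it edges out `slide_emea`, which ranks first and third. Documents that appear in only one list fall to the bottom.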

3.5 How these fit together¶

At this point, advanced retrieval is a toolkit. Not a single trick.

Use HyDE when the question is semantically weak. Use parent-child when chunk precision fights chunk context. Use fusion when exact tokens and meaning both matter.

A good system might use all three. A smart system will not use all three blindly. That is why routing arrives later.

Chapter 4: Reranking & filtering¶

4.1 Retriever vs reranker¶

Keep this distinction very sharp. A retriever is built for speed. A reranker is built for judgment.

Retriever question: “Which 50 documents are probably worth checking?”

Reranker question: “Of these 50, which 5 actually answer the user?”

Most dense retrievers are bi-encoders. They embed query and document separately. Fast. Scalable. Good for recall.

Most strong rerankers are cross-encoders. They read query and document together. Slower. More precise. Good for final ordering.

bi-encoder
query → vector ----┐
                   ├─ similarity score
doc   → vector ----┘

cross-encoder
[query ; doc] → one model → relevance score

Why is the cross-encoder better? Because it can inspect token-level interactions directly. It can notice, “Yes, this paragraph answers the second half of the question, not just the general topic.”

But it is too expensive to run over the full corpus. So the usual production pattern is:

cheap broad retrieval
  ↓
expensive precise reranking

4.2 The cross-checker¶

This is the ELI5 placeholder called the cross-checker. It reads the retrieved candidates carefully and reorders them with more judgment.

Typical flow:

- retrieve top-K = 20 to 100
- rerank all K candidates
- keep top-N = 3 to 10

Why not retrieve top-5 directly? Because the best doc may be buried at rank 12. The reranker rescues it.

Mini-example.

Question:

Which regions improved revenue growth from Q3 to Q4?

Retriever top-5:

1. CEO summary of Q4 results
2. EMEA sales note
3. company annual outlook
4. Q3 regional revenue appendix
5. Q4 regional revenue appendix

Looks reasonable. Still not ideal.

The cross-encoder reranker may reorder to:

1. Q3 regional revenue appendix
2. Q4 regional revenue appendix
3. CEO summary of Q4 results
4. EMEA sales note
5. company annual outlook

Now the answer prompt sees the core evidence first. That is why reranking often gives a surprisingly large boost.

4.3 Metadata filtering¶

Sometimes the right answer is easy, if only you search the right slice. Metadata filters are not glamorous. They are wildly practical.

Common filters:

- date range
- region
- customer tier
- product line
- document type
- access scope
- team ownership

Suppose the user asks:

Did the caching rollout reduce latency for premium users in APAC?

Good filters could be:

- region = APAC
- user_tier = premium
- metric_type = latency
- date >= rollout_date for post-rollout evidence, plus a separate date < rollout_date window for the baseline

These filters remove irrelevant junk before ranking even begins. Many teams underuse this, because they hope vector search will infer everything. No. If metadata is known, use it. That is engineering, not cheating.
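A minimal sketch of pre-filtering, assuming each chunk carries its metadata as plain dictionary fields. `CHUNKS`, `apply_filters`, and the rollout date are hypothetical. Most vector stores expose an equivalent filter argument, so in practice this logic usually runs inside the store.

```python
from datetime import date

# Hypothetical chunk metadata; text and vectors are omitted for brevity.
CHUNKS = [
    {"id": "a", "region": "APAC", "user_tier": "premium", "metric_type": "latency", "date": date(2024, 9, 10)},
    {"id": "b", "region": "EMEA", "user_tier": "premium", "metric_type": "latency", "date": date(2024, 9, 12)},
    {"id": "c", "region": "APAC", "user_tier": "free", "metric_type": "latency", "date": date(2024, 9, 15)},
    {"id": "d", "region": "APAC", "user_tier": "premium", "metric_type": "latency", "date": date(2024, 7, 1)},
]

def apply_filters(chunks, equals, date_from=None, date_before=None):
    """Keep chunks that match every equality filter and fall inside the date window."""
    kept = []
    for chunk in chunks:
        if any(chunk.get(key) != value for key, value in equals.items()):
            continue
        if date_from is not None and chunk["date"] < date_from:
            continue
        if date_before is not None and chunk["date"] >= date_before:
            continue
        kept.append(chunk)
    return kept

rollout_date = date(2024, 8, 15)  # hypothetical rollout date
scope = {"region": "APAC", "user_tier": "premium", "metric_type": "latency"}

after = apply_filters(CHUNKS, scope, date_from=rollout_date)
before = apply_filters(CHUNKS, scope, date_before=rollout_date)
print([c["id"] for c in after], [c["id"] for c in before])
```

The before/after question needs two windows. Filtering only on dates after the rollout would remove the baseline that the comparison depends on.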

4.4 MMR for diversity¶

Another frequent pathology: the top results are duplicates.

You ask one question. The retriever returns:

- paragraph 3 of report A
- paragraph 4 of report A
- paragraph 5 of report A
- paragraph 3 of report B
- paragraph 2 of report A again

Every chunk looks relevant. Together they waste the prompt budget. This hurts especially for compare questions, where you need coverage across entities.

So we use MMR, Maximum Marginal Relevance. MMR balances relevance and diversity.

Simple intuition:

score = relevance
        - redundancy penalty

Formal version:

MMR(d) = λ * relevance(d, query) - (1 - λ) * max over selected s of similarity(d, s)

If lambda is high, you reward relevance more. If lambda is lower, you force more diversity.

For compare tasks, MMR helps a lot. You want evidence from APAC, EMEA, LATAM, and North America. Not five passages about one region.
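Here is a small greedy implementation of that scoring rule. The relevance scores and the `same_report` similarity function are made-up toy values. In a real system both would come from embeddings.

```python
def mmr(candidates, relevance, similarity, lam=0.7, top_n=3):
    """Greedy MMR: repeatedly pick the candidate with the best
    lam * relevance - (1 - lam) * (max similarity to anything already picked)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(candidate):
            redundancy = max((similarity(candidate, s) for s in selected), default=0.0)
            return lam * relevance[candidate] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: three paragraphs from report A, one from report B.
candidates = ["A3", "A4", "A5", "B3"]
relevance = {"A3": 0.9, "A4": 0.85, "A5": 0.8, "B3": 0.7}

def same_report(x: str, y: str) -> float:
    """Chunks from the same report count as near-duplicates."""
    return 0.9 if x[0] == y[0] else 0.1

diverse = mmr(candidates, relevance, same_report, lam=0.5, top_n=2)
greedy = mmr(candidates, relevance, same_report, lam=1.0, top_n=2)
print(diverse, greedy)
```

With lambda at 1.0 the second pick is another paragraph from report A. With lambda at 0.5 the redundancy penalty pushes report B into the second slot, even though its raw relevance is lowest.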

4.5 Worked example — cleaning a noisy result set¶

Query:

Summarize policy exceptions approved after the August update.

Retriever returns:

1. August policy overview
2. August policy overview appendix
3. August policy overview FAQ
4. exception approval log
5. September exception log
6. HR handbook note
7. policy change announcement
8. legal exception memo

Good. But messy.

Step 1: metadata filter

Keep only documents with:

- date >= August update
- document_type in {exception log, memo, policy note}

Now items 6 and 7 may drop.

Step 2: rerank

The cross-encoder reads the remaining candidates. It places logs and memos above the overview docs, because the question asks for approved exceptions, not for policy background.

Step 3: MMR

Among the top candidates, MMR keeps one overview, one log, one memo, not three nearly identical appendices.

The final result set is smaller, cleaner, and much more answerable. That is advanced RAG in practice. Not magic retrieval. Disciplined retrieval hygiene.

Chapter 5: Agentic RAG & multi-hop¶

5.1 Why loops appear¶

By now a pattern should feel obvious. Retrieve once. Inspect. If weak, do something different.

That is the birth of loops. Advanced RAG stops assuming the first retrieval is good enough. It begins to evaluate retrieval quality and choose the next action. That is why this module leads naturally into agents.

5.2 Corrective RAG (CRAG)¶

CRAG stands for Corrective RAG. The key question is simple: Was the retrieved context good enough?

If yes, answer. If no, correct course.

Possible correction moves:

- rewrite the query again
- broaden search
- narrow search with filters
- switch retriever type
- decompose the question
- call another knowledge source

ASCII flow:

retrieve
  ↓
quality check
  ├─ good → answer
  └─ weak → corrective retrieval step

The important mental shift is this: Basic RAG assumes retrieval success. CRAG checks retrieval success. That is a very big upgrade.

5.3 Self-RAG and the confidence gate¶

Self-RAG extends the same control idea inward. The model does not only answer. It can also ask:

- do I need retrieval?
- is the evidence enough?
- should I refine the answer?

You do not need the paper details right now. Keep the control pattern.

This is where the confidence gate enters. The confidence gate is a self-evaluation checkpoint. It usually sits after retrieval, and often after a draft answer.

It asks practical questions:

1. Do the retrieved chunks support the main claims?
2. Did we answer every sub-question?
3. Are the sources mutually consistent?
4. Is confidence high enough to answer, or should we retry or abstain?

Possible outputs:

- answer
- retry with rewrite
- retry with broader retrieval
- retry with decomposition
- abstain politely

That is already agent-like behavior. The system is looking at its evidence state and choosing the next move.
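The control pattern behind CRAG and the confidence gate can be sketched as a bounded retry loop. All names below are hypothetical: `toy_retrieve` is a keyword matcher, `toy_judge` stands in for an LLM or classifier that grades the evidence, and `CORRECTIONS` maps each retry decision to a query-repair function.

```python
def answer_with_gate(question, retrieve, judge, corrections, max_retries=2):
    """Retrieve, judge the evidence, and either answer, correct course, or abstain."""
    query = question
    for attempt in range(max_retries + 1):
        context = retrieve(query)
        decision = judge(question, context)
        if decision == "answer":
            return {"status": "answer", "context": context, "attempts": attempt + 1}
        if decision == "abstain" or attempt == max_retries:
            break
        query = corrections[decision](query)  # e.g. "retry_rewrite" picks the rewriter
    return {"status": "abstain", "context": [], "attempts": attempt + 1}

# Toy components for illustration only.
CORPUS = ["reimbursement policy: 30 days from delivery"]

def toy_retrieve(query):
    words = query.lower().split()
    return [doc for doc in CORPUS if any(word in doc for word in words)]

def toy_judge(question, context):
    return "answer" if context else "retry_rewrite"

CORRECTIONS = {"retry_rewrite": lambda q: q.replace("refund", "reimbursement")}

found = answer_with_gate("refund window", toy_retrieve, toy_judge, CORRECTIONS)
gave_up = answer_with_gate("parking rules", toy_retrieve, toy_judge, CORRECTIONS)
print(found["status"], found["attempts"], gave_up["status"], gave_up["attempts"])
```

The retry budget matters as much as the judge. Without `max_retries`, a system that keeps doubting its evidence would loop forever instead of abstaining.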

5.4 Iterative retrieval¶

Some questions are not answerable in one hop, even after decomposition. The next query depends on facts found in the previous step. That is iterative retrieval.

Example:

Which customers were affected by the outage caused by the service that changed deployment regions last month?

Hop one: Which service changed deployment regions last month?

Hop two: Which outage was caused by that service?

Hop three: Which customers were affected by that outage?

ASCII flow:

question
  ↓
search hop 1
  ↓
extract fact A
  ↓
search hop 2 using fact A
  ↓
extract fact B
  ↓
search hop 3 using fact B
  ↓
answer

One-shot retrieval fails here, because the second search depends on the first retrieval. That is the whole point.
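A minimal sketch of that dependency chain. The `FACTS` table and its values (`billing-api`, `OUT-17`, the customer names) are invented for illustration, and `toy_lookup` collapses "search, then extract a fact" into a dictionary lookup.

```python
# Hypothetical knowledge: each query resolves to one extracted fact.
FACTS = {
    "which service changed deployment regions last month": "billing-api",
    "which outage was caused by billing-api": "OUT-17",
    "which customers were affected by OUT-17": "Acme, Globex",
}

def toy_lookup(query: str):
    """Stand-in for one retrieval call plus fact extraction."""
    return FACTS.get(query)

HOPS = [
    "which service changed deployment regions last month",
    "which outage was caused by {prev}",
    "which customers were affected by {prev}",
]

def multi_hop(hop_templates, lookup):
    """Run hops in order, feeding each extracted fact into the next query."""
    prev = None
    trail = []
    for template in hop_templates:
        query = template.format(prev=prev)  # first template has no placeholder
        prev = lookup(query)
        trail.append((query, prev))
        if prev is None:
            break  # a failed hop makes every later hop meaningless
    return prev, trail

answer, trail = multi_hop(HOPS, toy_lookup)
for query, fact in trail:
    print(query, "->", fact)
```

The second and third queries cannot even be written until the previous fact exists. Keeping the `trail` also gives you the per-hop log needed to see where a chain broke.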

5.5 Routing between retrieval strategies¶

Now suppose you have many tools:

- rewrite
- HyDE
- dense retrieval
- sparse retrieval
- hybrid fusion
- parent-child retrieval
- decomposition
- reranker
- metadata filters

Should every query use every tool? No. That would be slow, expensive, and sometimes worse.

So advanced systems route. They choose a strategy based on the query.

Example routing logic:

- if exact IDs or codes appear, use sparse-heavy retrieval
- if the question is conceptual and vague, use rewrite + HyDE
- if the question says compare / across / difference, use decomposition
- if metadata is explicit, filter first
- if documents are long, prefer parent-child retrieval

user question
  ↓
router
  ├─ exact-token path
  ├─ conceptual path
  ├─ multi-hop path
  └─ metadata-heavy path

This is the last mental bridge before agents. Routing is tool choice. Tool choice is agent behavior.
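A rule-based router is often enough to start. The sketch below encodes three of the example rules plus the filter-first rule. The regular expressions and the path names are assumptions for illustration, and many systems replace these rules with a small classifier or an LLM call.

```python
import re

# Hyphenated uppercase codes such as EMEA-Q4-REV-2024.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+)+\b")
COMPARE_PATTERN = re.compile(r"\b(compare|across|difference)\b", re.IGNORECASE)

def route(question: str, has_metadata: bool = False) -> list[str]:
    """Return the ordered retrieval steps for a question."""
    steps = []
    if has_metadata:
        steps.append("metadata_filter")   # filter first when metadata is explicit
    if ID_PATTERN.search(question):
        steps.append("sparse_heavy")      # exact-token path
    elif COMPARE_PATTERN.search(question):
        steps.append("decompose")         # multi-hop path
    else:
        steps.append("rewrite_hyde")      # conceptual path
    return steps

print(route("Find the slide for EMEA-Q4-REV-2024"))
print(route("Compare Q3 and Q4 revenue growth across all regions"))
print(route("Why did churn improve after onboarding changes?", has_metadata=True))
```

Rules like these are easy to log and easy to argue about, which makes them a reasonable first router. The cost is brittleness: a comparison phrased without the trigger words falls through to the conceptual path.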

5.6 Retrieval prompts you can use tomorrow¶

Keep prompts simple. Keep them logged. Keep them inspectable.

Prompt 1 — the rewriter¶

You are a retrieval query rewriter.
Rewrite the user's question so it is easier to retrieve supporting documents.
Preserve every constraint exactly.
Return JSON with:
- rewritten_query
- must_keep_constraints
- synonyms_or_aliases
- suggested_metadata_filters

Why it works: The structure makes silent damage harder.

Prompt 2 — the multi-step plan¶

Break the user's question into the minimum number of sub-queries needed.
Each sub-query should be independently searchable.
Return them in dependency order.
Also say whether the final answer requires aggregation, comparison, or synthesis.

Why it works: It exposes the hidden hops.

Prompt 3 — the hypothesis¶

Write one short hypothetical document passage that would likely answer the question.
Do not claim certainty.
Use domain language likely to appear in the corpus.
Return only the passage.

Why it works: It creates a description-shaped semantic probe.

Prompt 4 — the confidence gate¶

Given the question, retrieved context, and draft answer,
judge whether the answer is fully supported.
Return one of:
- answer
- retry_rewrite
- retry_broaden
- retry_decompose
- abstain
Also provide one sentence explaining the decision.

Why it works: It turns vague self-doubt into explicit control.

5.7 Honest admission¶

Advanced RAG is powerful. It is still not magic. Let us be honest.

First, advanced retrieval cannot rescue a bad corpus. If documents are stale, missing, or mislabeled, smarter retrieval only fails more elegantly.

Second, every upgrade adds latency. Rewrite call. HyDE call. Multiple retrievals. Reranker. Confidence gate. Very quickly you can build a clever system that users hate waiting for.

Third, more moving parts mean more debugging surface. Failures can happen in rewriting, routing, fusion, reranking, or confidence calibration.

Fourth, self-evaluation is useful, not perfect. The confidence gate can still be confidently wrong. So you need logs, evals, and human spot checks.

Fifth, advanced RAG does not replace good product scoping. Some tasks should use SQL, APIs, or workflow tools directly. RAG is not the answer to every question-shaped problem.

This honesty matters. It keeps you grounded. It also prepares you for Module 09, where the same truth appears across many tools.

Chapter 6: Recap¶

6.1 Failure-fix table¶

Memorize the pattern, not the wording. This table is the module in compressed form.

| Failure | Symptom | Fix |
| --- | --- | --- |
| Raw user wording is vague | Retriever finds loosely related chunks | Query rewriting |
| Corpus uses different terms | Good docs stay hidden | Query expansion |
| Question contains many hops | One retrieval call returns partial evidence | Query decomposition |
| Question is too narrow | Governing policy stays hidden | Step-back prompting |
| Query is semantically weak | Dense retrieval misses answer neighborhood | HyDE |
| Small chunks lose context | Matched snippet lacks surrounding meaning | Parent-child retrieval |
| Exact tokens matter | Dense retrieval misses IDs or acronyms | Sparse or hybrid retrieval |
| Retriever ranking is noisy | Best docs are buried | Cross-encoder reranking |
| Results are repetitive | Prompt fills with duplicate evidence | MMR |
| Search scope is too broad | Wrong date, region, or tier dominates | Metadata filtering |
| Evidence is weak | System still answers confidently | Confidence gate |
| First search fails | One-shot pipeline gives up early | CRAG / iterative retrieval |
| Different query types need different tools | One strategy underperforms | Routing |

If you understand this table, you understand advanced RAG.

6.2 Key points to remember¶

Point one. Advanced RAG is mostly better retrieval decisions, not a smarter answer prompt.

Point two. Most production gains come from improving recall and precision separately.

Recall tools:

- rewrite
- expand
- HyDE
- fusion
- decomposition
- step-back prompting

Precision tools:

- rerank
- metadata filters
- MMR
- confidence gating

Point three. A multi-step question is a hidden workflow. Treat it like one.

Point four. Self-evaluation enters here, inside retrieval, before full agents appear.

Point five. Retrieval is already becoming a tool. Not just a fixed pipeline stage. A thing the system may choose, retry, or skip.

6.3 Important interview questions¶

Here are clean questions you should answer well. Tie every answer to a failure mode. That sounds senior.

1. Why does basic RAG plateau on hard questions?

Because it assumes the raw user question is retrieval-ready, and usually performs one-shot retrieval with weak ranking control. Hard questions need transformation, multiple hops, and evidence inspection.

2. Rewrite vs expansion vs decomposition — how are they different?

Rewrite preserves meaning while making search explicit. Expansion adds nearby terms to improve recall. Decomposition breaks one hard question into multiple retrievable units.

3. Why can HyDE outperform direct query embedding?

Because answer-shaped text can land closer to real answer passages than a short, underspecified question does. It narrows the semantic gap.
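A minimal sketch of that idea, under loud assumptions: `embed_bow` is a toy bag-of-words stand-in for a real embedding model, and `fake_llm` is a hypothetical stub standing in for the model call that drafts the answer-shaped text. Neither is a real library API. The corpus is contrived so the effect is visible.

```python
import math
from collections import Counter
from typing import Callable, List

def embed_bow(text: str) -> Counter:
    # Toy "embedding": lowercase token counts. A real system would use a dense model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(text: str, corpus: List[str]) -> str:
    # Return the single closest document to the given text.
    qv = embed_bow(text)
    return max(corpus, key=lambda doc: cosine(qv, embed_bow(doc)))

def hyde_search(question: str, corpus: List[str],
                generate_hypothetical: Callable[[str], str]) -> str:
    # HyDE: search with an imagined answer instead of the raw question.
    return search(generate_hypothetical(question), corpus)

corpus = [
    "Employees accrue 1.5 vacation days per month of service",
    "How many support tickets do I get per plan",
]
question = "how many days off do I get"

def fake_llm(q: str) -> str:
    # Hypothetical stub: a real pipeline would ask a model to draft this.
    return "Employees accrue vacation days each month of service"

direct_hit = search(question, corpus)              # question-shaped text wins
hyde_hit = hyde_search(question, corpus, fake_llm)  # answer-shaped text wins
print(direct_hit)
print(hyde_hit)
```

In this toy corpus the raw question lands on the other question-shaped document, while the hypothetical answer lands on the policy passage. The risk is the mirror image: if the drafted answer is wrong, the search is steered toward the wrong neighborhood.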

4. Why rerank top-K instead of using only the retriever?

Retrievers optimize speed and coarse recall. Rerankers read query and document jointly, so they can rescue highly relevant docs buried lower.
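The two-stage shape can be sketched as below. This is an illustration, not a production reranker: `toy_overlap_score` is a hypothetical stand-in for a cross-encoder, chosen only because it also looks at query and document together. In a real system you would pass a model-backed scoring function in its place.

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    # Score every (query, document) pair jointly, then keep the best top_n.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer: fraction of query tokens that also appear in the document.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# Pretend the first-stage retriever returned these in this (noisy) order.
candidates = [
    "Pricing overview for all plans",
    "Holiday schedule for the EMEA office",
    "Refund policy for annual startup plans in EMEA",
]
top = rerank("refund policy for startup plans in EMEA",
             candidates, toy_overlap_score, top_n=1)
print(top[0][0])
```

The design point is the split: the retriever fetches a generous top-K cheaply, and the expensive pairwise scorer only ever sees that short list.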

5. What does MMR solve?

It prevents the context window from filling with redundant near-duplicates, which is especially harmful for synthesis and compare tasks.
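A small sketch of MMR selection makes the trade-off concrete. The vectors are made-up 2-D examples, and `lambda_weight` is an illustrative parameter name: 1.0 means pure relevance, lower values penalize similarity to documents already chosen.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
        k: int, lambda_weight: float = 0.5) -> List[int]:
    # Greedily pick k document indices, balancing relevance against redundancy.
    selected: List[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_weight * relevance - (1 - lambda_weight) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 1.0]
docs = [
    [1.0, 0.9],  # index 0: most relevant
    [1.0, 0.8],  # index 1: near-duplicate of index 0
    [0.2, 1.0],  # index 2: less relevant, but a different angle
]
print(mmr(query, docs, k=2, lambda_weight=1.0))  # relevance only
print(mmr(query, docs, k=2, lambda_weight=0.5))  # relevance vs. diversity
```

With relevance only, the near-duplicate takes the second slot. With the diversity penalty switched on, the second slot goes to the document that adds something new.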

6. What is the confidence gate, practically?

A control step that inspects evidence sufficiency, answer coverage, and source consistency, then decides whether to answer, retry, or abstain.
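One way to picture the gate is as a plain decision function. Everything specific here is an assumption for illustration: the `EvidenceReport` fields, the thresholds, and the retry budget are hypothetical, and a real gate would tune them against logged failures.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ANSWER = "answer"
    RETRY = "retry"
    ABSTAIN = "abstain"

@dataclass
class EvidenceReport:
    top_score: float     # best reranker score, assumed to lie in [0, 1]
    coverage: float      # fraction of question parts with a supporting passage
    sources_agree: bool  # False if retrieved passages contradict each other
    retries_used: int    # corrective passes already spent on this question

def confidence_gate(report: EvidenceReport,
                    min_score: float = 0.6,
                    min_coverage: float = 0.8,
                    max_retries: int = 1) -> Decision:
    # Answer only when all three checks pass: sufficiency, coverage, consistency.
    sufficient = (
        report.top_score >= min_score
        and report.coverage >= min_coverage
        and report.sources_agree
    )
    if sufficient:
        return Decision.ANSWER
    # Weak evidence: spend a retry if any budget is left, otherwise abstain.
    if report.retries_used < max_retries:
        return Decision.RETRY
    return Decision.ABSTAIN

print(confidence_gate(EvidenceReport(0.9, 1.0, True, 0)))
print(confidence_gate(EvidenceReport(0.9, 0.5, True, 0)))
print(confidence_gate(EvidenceReport(0.3, 0.5, False, 1)))
```

Notice that the output is a decision, not a score. That is what makes it a control step.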

7. How is this module a bridge to agents?

Because the system is already choosing actions, routing across strategies, and deciding when to stop. That is agent behavior in a narrow domain.

6.4 Production experience¶

A few pragmatic lessons. These keep coming up in real systems.

Lesson one. Log every intermediate artifact: raw query, rewrite, filters, routing decision, retrieved IDs, reranked IDs, and confidence-gate outcome. Without this, debugging becomes theatre.
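A trace record for lesson one might look like the sketch below. The class name, field names, and the example values are illustrative assumptions, not a standard schema. The point is one structured line per question that you can grep later.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class RetrievalTrace:
    # One record per user question; every stage fills in its own field.
    raw_query: str
    rewrite: str = ""
    filters: Dict[str, str] = field(default_factory=dict)
    routing_decision: str = ""
    retrieved_ids: List[str] = field(default_factory=list)
    reranked_ids: List[str] = field(default_factory=list)
    gate_outcome: str = ""

    def to_json_line(self) -> str:
        # A single JSON line is easy to append to a log file and to search.
        return json.dumps(asdict(self), sort_keys=True)

trace = RetrievalTrace(raw_query="pto emea")
trace.rewrite = "paid time off policy for EMEA employees"
trace.filters = {"region": "EMEA"}
trace.routing_decision = "hybrid"
trace.retrieved_ids = ["c7", "c2", "c9"]
trace.reranked_ids = ["c2", "c7"]
trace.gate_outcome = "answer"
print(trace.to_json_line())
```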

Lesson two. Measure retrieval separately from generation. If the supporting chunk never appears in top-k, changing prompts is mostly decoration.
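Recall@k is the simplest way to check this. The sketch below assumes you have labeled, for each test question, which chunk IDs actually support the answer. The IDs are made up.

```python
from typing import Iterable, List

def recall_at_k(retrieved_ids: List[str], relevant_ids: Iterable[str], k: int) -> float:
    # Fraction of the labeled supporting chunks that appear in the top k results.
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

retrieved = ["c7", "c2", "c9", "c4"]  # ranked retriever output
relevant = {"c2", "c4"}               # human-labeled supporting chunks

print(recall_at_k(retrieved, relevant, k=2))  # 0.5: only c2 made the top 2
print(recall_at_k(retrieved, relevant, k=4))  # 1.0: both chunks are in the top 4
```

If recall@k is low at the k you actually pass to the model, the fix lives in retrieval, not in the answer prompt.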

Lesson three. Use metadata aggressively. Teams often spend weeks tuning embeddings for problems that a simple date or region filter would solve.

Lesson four. Start with one corrective loop. Do not build a ten-branch research agent on day one. One good retry policy beats a maze.
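A single corrective loop can be this small. The callables `retrieve`, `is_sufficient`, and `rewrite` are placeholders you would supply, and the dictionary index and lambdas in the demo are toy stand-ins.

```python
from typing import Callable, Dict, List

def answer_with_one_retry(query: str,
                          retrieve: Callable[[str], List[str]],
                          is_sufficient: Callable[[str, List[str]], bool],
                          rewrite: Callable[[str], str]) -> Dict[str, object]:
    # First pass with the query as given.
    docs = retrieve(query)
    if is_sufficient(query, docs):
        return {"outcome": "answer", "docs": docs, "attempts": 1}
    # Exactly one corrective pass: rewrite the query, then search again.
    docs = retrieve(rewrite(query))
    if is_sufficient(query, docs):
        return {"outcome": "answer", "docs": docs, "attempts": 2}
    # Still weak after the retry: stop instead of looping further.
    return {"outcome": "abstain", "docs": [], "attempts": 2}

# Toy index: only the explicit phrasing finds the policy document.
index = {"paid time off policy": ["Policy doc: 20 days PTO per year"]}

result = answer_with_one_retry(
    "pto",
    retrieve=lambda q: index.get(q, []),
    is_sufficient=lambda q, docs: len(docs) > 0,
    rewrite=lambda q: "paid time off policy",
)
print(result["outcome"], result["attempts"])
```

The retry budget is hard-coded to one on purpose. You can always add branches later, once the logs show which failures one retry does not fix.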

Lesson five. Watch latency like a hawk. Every extra model call must justify itself. Often rewrite + rerank gives most of the value.

Lesson six. Do not trust offline metrics alone. Use human review on failure clusters. A system can improve NDCG and still annoy users.

Lesson seven. Abstention is a feature, not a failure. A polite “I could not find enough support” is often the most trustworthy behavior.

6.5 Foundation-gap audit before Module 09¶

Module 09 assumes four things already feel natural. Check yourself honestly.

A. Advanced retrieval patterns
Can you explain when to use rewrite, HyDE, hybrid retrieval, reranking, and MMR?

B. When to iterate or loop
Can you say, “This query needs another search pass,” and justify why?

C. Self-evaluation concept
Do you understand the confidence gate as a decision layer, not just a score?

D. Tool-like retrieval
Can you think of retrieval as one action among many possible actions?

If any answer feels weak, re-read chapters 4 and 5. Module 09 relies on exactly that foundation.

6.6 Exercises and bridge forward¶

Try these without notes first.

Exercise 1 — medium

Take this question:

Which support issues increased after the pricing change for startups in EMEA?

Do four things:

- rewrite it
- expand it
- decompose it
- suggest metadata filters

Exercise 2 — medium

Given a corpus of long legal documents, explain why parent-child retrieval may beat flat chunk retrieval. Also give one downside.

Exercise 3 — hard

Design a retrieval router for these query types:

- exact invoice ID lookup
- broad policy question
- compare two quarters
- ask why a metric changed

For each, choose the retrieval path and justify it.

Exercise 4 — hard

Write a confidence gate policy. Say when the system should:

- answer
- retry with rewrite
- retry with decomposition
- abstain

Exercise 5 — interview mode

Explain advanced RAG in one minute using the promoted librarian analogy. Mention all five placeholders.

Now keep this bridge sentence exactly in your head:

Next module — 16_agents_tool_calling — generalizes this loop pattern. RAG is just one tool an agent can use. Agents can pick from many tools, retry, and reason about when to stop.

That is the main bridge. Module 07 taught you to retrieve. Module 08 taught you to retrieve intelligently. Module 09 will teach you to choose among many actions, with retrieval becoming only one tool in the toolbox.