10. Corrective RAG — judge retrieval, then repair it¶

~14 min read. Advanced RAG does not trust the first retrieval set blindly. It critiques the set and decides what to do next.

Built on the ELI5 in 00-eli5.md. The confidence gate (deciding whether to answer or search again) becomes a real control loop in corrective RAG and Self-RAG patterns.

1) The wall — when retrieval quality must change the route¶

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: the system still needs to notice when the whole retrieval set is weak or off-route. This page follows one artifact, a quality label that triggers rewrite, web search, or abstention, so you can see whether retrieval-quality checks with repair paths actually change the evidence path or just add another box to the architecture diagram.

The tempting shortcut is to always generate after retrieval completes. That keeps the system simple, and on easy questions it may be right. It fails on this case: the top chunks discuss pricing, but none contains the refund threshold. The correct next step is repair, not answer generation.

Root cause: the evidence path has already lost information before the model writes a sentence. Rule: bad retrieval is a control signal, not prompt material.

Mini-FAQ. "What is the control point here?" The confidence gate. It is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

2) The core visual — judge retrieval before generation¶

Consider retrieval returning six passages.

They look decent.

Two are stale.

One is contradictory.

Three only answer half the question.

A naive system keeps going.

A corrective system pauses.

That pause is the key move.

Corrective RAG asks whether retrieval quality is good enough before generation commits.

Self-RAG asks similar questions inside generation itself.

Both ideas revolve around the confidence gate.

retrieve
   │
   ▼
quality check
   │
   ├── good enough ──→ answer
   ├── weak evidence ─→ rewrite / retry
   └── broken state ──→ abstain

This is not fancy for the sake of fancy.

It is a safety habit.

3) What the corrective loop is really doing¶

Corrective RAG treats retrieval as something measurable, not sacred.

It checks support coverage, agreement, freshness, or source quality.

If the set looks weak, it triggers a repair path.

That repair path might rewrite the query.

It might switch retrievers.

It might fall back to web or tool search.

Without critique, one weak retrieval pass can poison the whole answer.

With critique, the system can spend extra work only where needed.

That is why the confidence gate is both a quality-control and a cost-control mechanism.

Easy queries pass quickly.

Hard queries earn more search.
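The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `corrective_retrieve`, the `retrieve` / `judge` / `repair` callables, and the toy stand-ins are all hypothetical names, and the repair cap of 2 is an assumed default.

```python
from typing import Callable, List, Tuple

Passages = List[str]


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], Passages],
    judge: Callable[[str, Passages], bool],
    repair: Callable[[str, Passages], str],
    max_repairs: int = 2,  # assumed cap, not a value from the text
) -> Tuple[Passages, int, bool]:
    """Return (passages, repairs_used, good_enough)."""
    current_query = query
    passages = retrieve(current_query)
    for repairs_used in range(max_repairs + 1):
        # Always judge against the user's original question.
        if judge(query, passages):
            return passages, repairs_used, True
        if repairs_used == max_repairs:
            break  # stop rule: the repair budget is spent
        current_query = repair(current_query, passages)
        passages = retrieve(current_query)
    return passages, max_repairs, False


# Toy stand-ins so the sketch runs end to end.
def toy_retrieve(q: str) -> Passages:
    if "latency dashboard" in q:
        return ["APAC premium latency report after the rollout"]
    return ["premium pricing update"]


def toy_judge(question: str, passages: Passages) -> bool:
    return any("latency" in p for p in passages)


def toy_repair(q: str, passages: Passages) -> str:
    return q + " latency dashboard"


easy = corrective_retrieve("APAC latency dashboard", toy_retrieve, toy_judge, toy_repair)
hard = corrective_retrieve("did APAC get faster?", toy_retrieve, toy_judge, toy_repair)
print(easy[1], hard[1])  # repairs used by each query: 0 1
```

The easy query passes the gate on the first retrieval and pays nothing extra. The hard query pays for exactly one repair. A query that never passes stops when the budget runs out instead of looping.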

4) The worked example — trace the intermediate state¶

Question:

“Did premium users in APAC see latency improve after the cache rollout?”

First retrieval returns four passages.

P1 — global cache rollout note

P2 — premium pricing update

P3 — APAC traffic report with no latency metric

P4 — old latency dashboard screenshot

Now score the evidence set with a simple retrieval check.

Coverage of required constraints:

premium = present in P2

APAC = present in P3

latency = present in P4, but stale (old dashboard screenshot)

after rollout = weakly present in P1

Joint support for all four constraints in one coherent set = poor
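One way to see why joint support is poor even though every constraint appears somewhere: compare coverage across the whole set with coverage inside the best single passage. The sketch below assumes constraint tags have already been assigned to each passage by hand. A real system would need a tagger or judge model for that step, and the function names are hypothetical.

```python
from typing import Dict, Set

REQUIRED: Set[str] = {"premium", "APAC", "latency", "after rollout"}

# Constraint tags per passage, hand-assigned from the worked example.
PASSAGES: Dict[str, Set[str]] = {
    "P1": {"after rollout"},  # global cache rollout note (weak match)
    "P2": {"premium"},        # premium pricing update
    "P3": {"APAC"},           # APAC traffic report, no latency metric
    "P4": {"latency"},        # old latency dashboard screenshot
}


def union_coverage(passages: Dict[str, Set[str]], required: Set[str]) -> float:
    """Fraction of constraints mentioned anywhere in the set."""
    covered = set().union(*passages.values()) & required
    return len(covered) / len(required)


def best_joint_coverage(passages: Dict[str, Set[str]], required: Set[str]) -> float:
    """Fraction of constraints covered by the single best passage."""
    best = max((len(tags & required) for tags in passages.values()), default=0)
    return best / len(required)


print(union_coverage(PASSAGES, REQUIRED))       # 1.0
print(best_joint_coverage(PASSAGES, REQUIRED))  # 0.25
```

Union coverage says every keyword was found. Joint coverage says no passage ties even two constraints together, which is the signal a keyword-overlap check would miss.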

Suppose the retrieval-quality score is 0.42 on a 0 to 1 scale.

Threshold to answer = 0.70.

Threshold to retry = 0.40.

quality check
coverage = 2/4 strong constraints
freshness = weak
consistency = mixed
final score = 0.42

0.42 < 0.70 answer threshold
0.42 ≥ 0.40 retry threshold
→ trigger rewrite + hybrid retry

The second pass rewrites the query and adds an APAC latency dashboard search.

Now the set includes an APAC premium latency report after the rollout.

The score rises to 0.81.

Only then does the system answer.
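The two thresholds from this trace map onto a small routing function. The sketch below uses the 0.70 and 0.40 values from the example. Treating scores below the retry threshold as abstain, and treating both thresholds as inclusive, are assumptions that match the diagram in section 2 rather than something the trace states outright. The name `route` is hypothetical.

```python
ANSWER_THRESHOLD = 0.70  # from the worked example
RETRY_THRESHOLD = 0.40   # from the worked example


def route(score: float) -> str:
    """Map a retrieval-quality score in [0, 1] to the next action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= ANSWER_THRESHOLD:
        return "answer"
    if score >= RETRY_THRESHOLD:
        return "retry"    # recoverable: rewrite + hybrid retry
    return "abstain"      # assumed meaning of "below retry threshold"


print(route(0.42))  # retry   (first pass)
print(route(0.81))  # answer  (second pass)
```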

5) Failure modes — how the mechanism breaks¶

Failure one. The quality check is too naive.

It sees keyword overlap and misses contradiction.

Bad evidence still passes.

Failure two. The retry policy has no stop rule.

The system loops forever on impossible questions.

That is expensive and confusing.

Failure three. The correction path is always the same.

Every failure gets another rewrite, even when metadata filtering or abstention would be better.

So what should you do?

Score evidence quality from several angles.

Then make the next action conditional, not repetitive.

That is how the confidence gate becomes a real controller.
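A conditional controller can be as plain as a function from a diagnosis to an action. The sketch below is illustrative only: the `Diagnosis` fields, the action names, and the priority order of the checks are assumptions, not a prescribed policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Diagnosis:
    missing_constraints: List[str] = field(default_factory=list)
    stale: bool = False
    contradictory: bool = False
    attempts_left: int = 2


def next_action(d: Diagnosis) -> str:
    """Pick a repair that targets the diagnosed problem."""
    if not (d.missing_constraints or d.stale or d.contradictory):
        return "answer"
    if d.attempts_left <= 0:
        return "abstain"              # stop rule for impossible questions
    if d.contradictory:
        return "cross_check_sources"  # a rewrite would not resolve a conflict
    if d.stale:
        return "filter_by_date"       # metadata filtering beats rewording
    return "rewrite_query"            # target the missing constraints


print(next_action(Diagnosis(missing_constraints=["latency"])))  # rewrite_query
print(next_action(Diagnosis(stale=True)))                       # filter_by_date
print(next_action(Diagnosis(stale=True, attempts_left=0)))      # abstain
```

Each of the three failures above maps to a branch: several signals feed the diagnosis, the retry budget forces abstention, and the repair differs by cause.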

6) Production rules that hold up¶

Define clear thresholds for answer, retry, and abstain.

Log why each threshold fired.

Use cheap checks first, like missing constraints or stale timestamps.

Reserve heavier retries for genuinely uncertain cases.

Cap retry depth.
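As a rough sketch of the first four rules, the gate below runs cheap checks before calling a heavier judge and logs the reason for every decision. The `Passage` fields, the 180-day freshness window, the choice to flag staleness only when every passage is old, and the `heavy_judge` callable are all assumptions made for illustration.

```python
import logging
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List, Optional, Set, Tuple

logger = logging.getLogger("corrective_rag")

ANSWER_THRESHOLD = 0.70  # value from the worked example
MAX_AGE_DAYS = 180       # assumed freshness window


@dataclass
class Passage:
    text: str
    constraints: Set[str]
    published: date


def cheap_check(passages: List[Passage], required: Set[str], today: date) -> Optional[str]:
    """Return a reason to retry, or None if the cheap checks pass."""
    covered: Set[str] = set()
    for p in passages:
        covered |= p.constraints
    missing = required - covered
    if missing:
        return "missing constraints: " + ", ".join(sorted(missing))
    cutoff = today - timedelta(days=MAX_AGE_DAYS)
    if all(p.published < cutoff for p in passages):
        return "every passage is older than the freshness window"
    return None


def gate(
    passages: List[Passage],
    required: Set[str],
    today: date,
    heavy_judge: Callable[[List[Passage]], float],
) -> Tuple[str, str]:
    """Return (decision, reason); the heavy judge runs only if cheap checks pass."""
    reason = cheap_check(passages, required, today)
    if reason is not None:
        decision = "retry"
    else:
        score = heavy_judge(passages)
        decision = "answer" if score >= ANSWER_THRESHOLD else "retry"
        reason = f"judge score {score:.2f} vs answer threshold {ANSWER_THRESHOLD:.2f}"
    logger.info("gate decision=%s reason=%s", decision, reason)
    return decision, reason
```

Because the reason string travels with the decision, a later replay can tell a missing-constraint retry from a low-score retry without rerunning the judge.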

Corrective RAG is a loop discipline.

It is not one paper name to memorize.

The deeper lesson is simple.

Retrieval quality should influence system behavior, not only evaluation dashboards.

And once you accept retries, search naturally becomes iterative over several turns.

That is the next topic.

7) Why asking the generator to be careful with weak evidence is not enough under this workload¶

The plausible alternative is asking the generator to be careful with weak evidence. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when the whole retrieval set is weak or off-route and nothing upstream notices. Under that workload, retrieval-quality checks with repair paths earn their cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
asking the generator to be careful with weak evidence	evidence need is simple	the system still needs to notice when the whole retrieval set is weak or off-route	prompt wording and user trust
corrective RAG	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.

8) Production signals — know whether corrective RAG is working¶

A healthy trace shows bad retrieval changing the route before answer generation. The first metric to watch is the false-green retrieval rate: how often the check passed a retrieval set that did not support the answer. Top-1 similarity is a weaker signal because it can improve while a binding constraint is still missing.

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review
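The false-green rate can be computed from reviewed traces in a few lines. The sketch assumes each trace records the gate's decision plus a reviewer's verdict on whether the answer's claims were supported, and it uses answered traces as the denominator. The field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewedTrace:
    decision: str           # "answer", "retry", or "abstain"
    claims_supported: bool  # reviewer verdict after the fact


def false_green_rate(traces: List[ReviewedTrace]) -> float:
    """Share of answered traces whose claims were later found unsupported."""
    answered = [t for t in traces if t.decision == "answer"]
    if not answered:
        return 0.0
    false_greens = sum(1 for t in answered if not t.claims_supported)
    return false_greens / len(answered)


traces = [
    ReviewedTrace("answer", True),
    ReviewedTrace("answer", True),
    ReviewedTrace("answer", False),  # false green: answered without support
    ReviewedTrace("answer", True),
    ReviewedTrace("abstain", False),
]
print(false_green_rate(traces))  # 0.25
```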

9) Boundary — where corrective RAG helps, hurts, or wastes budget¶

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.

10) Wrong model — advanced RAG means adding more retrieval¶

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the confidence gate does not change what the system does, it is decoration.

11) Failure taxonomy for corrective RAG¶

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.

12) Pattern transfer — same pressure, different system¶

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist¶

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why is asking the generator to be careful with weak evidence weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild¶

Self-RAG-style research systems — critique evidence sufficiency during generation and request more retrieval when needed.
Enterprise assistants with fallback logic — retry with alternate retrieval paths when support looks weak.
Customer support bots — abstain when policy evidence is incomplete instead of forcing an answer.
Incident copilots — trigger deeper log or dashboard retrieval when the first context set is thin.
Search orchestration stacks — route weak initial retrieval into corrective loops rather than immediate synthesis.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint¶

Why is corrective RAG a control-loop idea rather than just a retrieval trick?
In the worked example, why did the first evidence set score only 0.42?
What is the operational danger of having a retry path without a stop rule?
Which false-green case would you review first for corrective RAG?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would asking the generator to be careful with weak evidence be acceptable instead?

Interview Q&A¶

Q: What separates corrective RAG from naive retry behavior? A: Corrective RAG uses an explicit assessment of retrieval quality to choose when and how to retry, instead of repeating the same pipeline blindly.

Common wrong answer to avoid: "Corrective RAG just means search twice." — The distinctive part is the critique-driven decision, not the count of searches.

Q: Why are answer, retry, and abstain thresholds all needed? A: Because systems need different actions for strong evidence, recoverable uncertainty, and unrecoverable absence of support.

Common wrong answer to avoid: "One confidence threshold is enough." — Real systems need at least a middle zone for correction rather than premature answer or refusal.

Q: What should a retrieval-quality check look at besides relevance? A: Constraint coverage, source freshness, internal consistency, and whether the evidence jointly supports the needed claims.

Common wrong answer to avoid: "Just average the retriever scores." — High similarity does not guarantee complete or coherent support.

Q: What trace would you inspect first when corrective RAG fails? A: Start with the quality label that triggered rewrite, web search, or abstention. Then compare it with the final answer claims.

Common wrong answer to avoid: "Start by rewriting the final prompt." — The prompt may be fine. The evidence path may already be broken.

Q: What cost does this add? A: Latency, logging, evaluation work, and one more place operators must understand.

Common wrong answer to avoid: "It is free because it is only another model call." — Model calls spend money, time, and failure budget.

Q: When should you skip it? A: Skip it when the question is low-risk, the evidence need is obvious, and the decision would not change.

Common wrong answer to avoid: "Never skip advanced components." — Advanced components are controls, not trophies.

Apply now (10 min)¶

For one hard query in your domain, define what evidence would count as answer-worthy, retry-worthy, and abstain-worthy.
Sketch from memory: draw the retrieve → quality check → answer-or-retry loop and label one stop rule.
Reproduce from memory: explain corrective RAG in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember¶

Corrective RAG exists because the system still needs to notice when the whole retrieval set is weak or off-route. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is a quality label that triggers rewrite, web search, or abstention. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Bad retrieval is a control signal, not prompt material.
Inspect a quality label that triggers rewrite, web search, or abstention before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. Once a system is allowed to retry, search stops being a single event. It becomes a sequence of read, refine, and search again. That is iterative retrieval. → 11-iterative-retrieval.md