11. Iterative retrieval — search, read, refine, search again

~13 min read. Hard questions often reveal their real search terms only after the first evidence arrives.

Built on the ELI5 in 00-eli5.md. The confidence gate, which decides whether to answer or search again, is what turns retrieval from one shot into a guided loop.

1) The wall — when the first result exposes the next missing fact

Basic RAG gave us a useful baseline: embed the query, retrieve a few chunks, and generate from the result. The earlier advanced-RAG tools — the rewriter, the hypothesis, the multi-step plan, the cross-checker, and the confidence gate — add control points before generation.

The concrete failure here is sharper: some missing evidence is only visible after reading the first retrieved set. This page follows an iteration log showing a new query from a discovered gap, so you can see whether a search-read-search loop actually changes the evidence path or just adds another box to the architecture diagram.

The tempting repair is to retrieve once, then force final synthesis. That keeps the system simple, and on easy questions it may be right. It fails on this case: the first pass finds the policy, reading it reveals an exception table name, and that table name becomes the second query.

Root cause: the evidence path has already lost information before the model writes a sentence. Rule: iterate only when the current evidence reveals a specific next evidence need.

Mini-FAQ. "What is the control point here?" The confidence gate is useful only when it creates a real decision: retrieve differently, rank differently, retry, reroute, answer, or refuse.

2) The core visual — read, find the gap, search again

Consider the head researcher finding one useful page.

That page answers part of the question.

It also mentions a new entity, a hidden metric, or a missing date window.

Now the search should change.

Iterative retrieval means the system learns from evidence as it searches.

The first pass is not the finish line.

It is a probe.

The second pass can be sharper because it uses what the first pass uncovered.

query
  │
  ▼
search round 1
  │
  ▼
read evidence
  │ discover new handle
  ▼
search round 2
  │
  ▼
answer or continue

This is how real research feels.

You do not know the perfect keyword on the first try.

3) What the loop is really fixing

Single-pass retrieval assumes the first query already contains all needed handles.

That is often false.

A first document may reveal the service name, incident code, owner team, or experiment flag that actually matters.

If the system answers too early, that new handle never helps.

If the system loops carelessly, latency and drift explode.

So iterative retrieval needs control.

That control comes from the confidence gate.

The gate asks whether the current evidence is enough.

If not, the next search should be more specific than the last one.

That is the adult version of “search again.”
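
The gated loop described above can be sketched in a few lines. This is a minimal sketch, not a reference implementation: `search`, `score`, and `refine` are hypothetical callables you would supply, and the 0.8 threshold and three-round cap are placeholder values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    query: str
    evidence: list[str] = field(default_factory=list)
    rounds: int = 0


def iterative_retrieve(
    question: str,
    search: Callable[[str], list[str]],            # hypothetical retriever
    score: Callable[[str, list[str]], float],      # hypothetical confidence gate
    refine: Callable[[str, list[str]], Optional[str]],  # hypothetical query refiner
    threshold: float = 0.8,   # assumed value, calibrate on real traffic
    max_rounds: int = 3,      # assumed cap
) -> LoopState:
    """Search, read, and search again until the gate passes or the loop stalls."""
    state = LoopState(query=question)
    while state.rounds < max_rounds:
        state.evidence.extend(search(state.query))
        state.rounds += 1
        # Gate: is the current evidence enough to answer?
        if score(question, state.evidence) >= threshold:
            break
        # Otherwise the next query must differ from the last one.
        next_query = refine(state.query, state.evidence)
        if next_query is None or next_query == state.query:
            break  # nothing more specific to ask, so stop instead of repeating
        state.query = next_query
    return state
```

The loop exits in three ways: the gate passes, the refiner has nothing new to offer, or the round cap is reached. A caller would still need to decide whether to answer or abstain in the last two cases.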

4) The worked example — trace the intermediate state

Question:

“Why did premium APAC latency spike after the rollout?”

Round one query is broad.

It retrieves D1 rollout announcement, D2 global latency dashboard, and D3 APAC traffic note.

Coverage score after round one is 0.38.

Why so low?

Premium is missing.

The exact rollout component is unclear.

Latency spike cause is missing.

D3 mentions a config flag called edge-cache-v2.

That becomes the new handle.

Round two query becomes:

“premium APAC latency spike edge-cache-v2 after rollout.”

Round two retrieves D4 premium APAC dashboard and D5 edge-cache-v2 rollback note.

Coverage score rises to 0.67.

Now the evidence shows the spike but not the cause chain.

D5 mentions a timeout increase on origin retries.

Round three query becomes:

“edge-cache-v2 origin retry timeout premium APAC.”

Round three retrieves D6 timeout tuning postmortem.

Coverage score rises to 0.86.

round 1 score 0.38 → discover edge-cache-v2
round 2 score 0.67 → discover origin retry timeout
round 3 score 0.86 → enough support to answer

See the shape.

Each round added one missing handle.

That is better than hoping the first wording knew everything.
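
The trace above can be written down as data and checked mechanically. The sketch below records the three rounds and verifies two properties: each discovered handle shows up in the next query, and each round raised the coverage score. The `Round` structure and helper names are illustrative assumptions. The round-one query text is also assumed, since the example only says it is broad; the user's question stands in for it here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Round:
    number: int
    query: str
    retrieved: tuple[str, ...]
    coverage: float
    new_handle: Optional[str]  # what reading this round's evidence uncovered


TRACE = [
    # Round-one query is an assumption: the user's question is used as the broad query.
    Round(1, "Why did premium APAC latency spike after the rollout?",
          ("D1", "D2", "D3"), 0.38, "edge-cache-v2"),
    Round(2, "premium APAC latency spike edge-cache-v2 after rollout",
          ("D4", "D5"), 0.67, "origin retry timeout"),
    Round(3, "edge-cache-v2 origin retry timeout premium APAC",
          ("D6",), 0.86, None),
]


def handle_carried_forward(trace: list[Round]) -> list[bool]:
    """For each round, check that its discovered handle appears in the next query."""
    results = []
    for prev, nxt in zip(trace, trace[1:]):
        if prev.new_handle is None:
            results.append(False)
            continue
        words = prev.new_handle.lower().split()
        results.append(all(word in nxt.query.lower() for word in words))
    return results


def coverage_gains(trace: list[Round]) -> list[float]:
    """Score improvement between consecutive rounds, rounded to two decimals."""
    return [round(b.coverage - a.coverage, 2) for a, b in zip(trace, trace[1:])]
```

On this trace, `handle_carried_forward(TRACE)` returns `[True, True]` and `coverage_gains(TRACE)` returns `[0.29, 0.19]`. A trace where a handle is discovered but never used, or where gains are near zero, is the shape the next section warns about.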

5) Failure modes — how the mechanism breaks

Failure one. The loop keeps searching without learning.

Each round is a paraphrase of the same weak query.

Nothing new enters the evidence set.

Failure two. The loop chases every side detail.

One document mentions an unrelated region.

The next search drifts away from the user’s goal.

Failure three. The system has no stop rule.

A question with no answer in the corpus burns time forever.

So what to do?

Require each new round to add a concrete new handle or a new retrieval strategy.

And let the confidence gate stop the loop when the score plateaus.
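
Those two rules can be expressed as one small decision function. The sketch below is an assumption about how you might encode them: the function name, the `min_gain` of 0.05, and the cap of four rounds are all hypothetical and would need tuning against real traces.

```python
def should_continue(
    scores: list[float],          # gate score after each completed round
    new_handles: list[str],       # handles found in the latest evidence
    seen_handles: list[str],      # handles already used in earlier queries
    strategy_changed: bool = False,  # True if the next round switches retriever or filter
    max_rounds: int = 4,          # assumed cap
    min_gain: float = 0.05,       # assumed plateau threshold
) -> tuple[bool, str]:
    """Decide whether another search round is justified, and say why."""
    if len(scores) >= max_rounds:
        return False, "round cap reached"
    # Plateau: the last round barely moved the score.
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return False, "score plateaued"
    fresh = set(new_handles) - set(seen_handles)
    if fresh:
        return True, "new handle: " + ", ".join(sorted(fresh))
    if strategy_changed:
        return True, "new retrieval strategy"
    # No new handle and no new strategy: another round would be a paraphrase.
    return False, "no new handle"
```

Returning the reason alongside the decision keeps the stop rule inspectable: the string can go straight into the round log.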

6) Production rules that hold up

Cap the number of rounds.

Track what each round learned.

Refine with newly discovered entities, metrics, or filters.

Do not retry with cosmetic wording only.

Log score improvement per round.
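
One way to make these rules checkable is a per-round log record. The sketch below is illustrative only: the `RoundLog` fields, the JSON Lines output, and the token-set test for cosmetic retries are assumptions, and the token test is deliberately crude (it would miss synonyms).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RoundLog:
    round_number: int
    query: str
    learned: list[str]   # entities, metrics, or filters discovered this round
    score: float
    score_delta: float   # improvement over the previous round


def build_logs(rounds: list[tuple[str, list[str], float]]) -> list[RoundLog]:
    """Turn (query, learned, score) tuples into log records with per-round deltas."""
    logs: list[RoundLog] = []
    previous = 0.0  # the first round's delta is measured against zero
    for number, (query, learned, score) in enumerate(rounds, start=1):
        logs.append(RoundLog(number, query, list(learned), score,
                             round(score - previous, 2)))
        previous = score
    return logs


def to_jsonl(logs: list[RoundLog]) -> str:
    """Serialize one JSON object per line so traces can be replayed later."""
    return "\n".join(json.dumps(asdict(log)) for log in logs)


def adds_no_new_terms(previous_query: str, next_query: str) -> bool:
    """Crude cosmetic-retry check: the next query introduces no new token."""
    return set(next_query.lower().split()) <= set(previous_query.lower().split())
```

A retry that only reorders or drops words makes `adds_no_new_terms` return `True`, which is a cheap signal to block the round before spending latency on it.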

Iterative retrieval is best for investigative questions, root-cause questions, and layered policy asks.

It is overkill for simple FAQ lookup.

That is why not every query should use it.

Some queries need one pass.

Some need hybrid retrieval.

Some need decomposition.

Choosing that path is the routing problem.

The plausible alternative is blind retries with the same query. It is attractive because it preserves a smaller pipeline and avoids another operational surface.

That tradeoff is correct for low-risk, obvious queries. It is wrong when some missing evidence is only visible after reading the first retrieved set. Under that workload, a search-read-search loop earns its cost by making the failure inspectable before generation.

Option	Works when	Fails when	Cost moves to
blind retries with the same query	evidence need is simple	some missing evidence is only visible after reading the first retrieved set	prompt wording and user trust
iterative retrieval	failure is detectable before generation	checkpoint is unlogged or uncalibrated	retrieval orchestration, latency, evals

Mini-FAQ. "Should every query take this path?" No. The mechanism belongs on routes where the evidence risk justifies the extra latency and debugging surface.

8) Production signals — know whether iterative retrieval is working

A healthy trace shows each new search is caused by a named gap in the previous evidence. The first metric to watch is useful second-hop rate. Top-1 similarity is weaker because it can improve while a binding constraint is still missing.

The review loop starts with false greens: cases where the system answered, but later inspection found unsupported claims. Those cases reveal whether the checkpoint is protecting the right boundary.

user complaint
   -> retrieval trace
   -> evidence check
   -> answer / retry / route / abstain
   -> false-green review
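
This page does not pin down a formula for useful second-hop rate, so the sketch below assumes one: among traces that ran at least two rounds, the fraction where a document first retrieved after round one ended up cited in the final answer. The trace shape (`rounds` as lists of document ids, `cited` as a set) is also hypothetical.

```python
from typing import Optional


def useful_second_hop_rate(traces: list[dict]) -> Optional[float]:
    """Share of multi-round traces where a later round supplied cited evidence.

    Each trace is assumed to look like:
        {"rounds": [["D1", "D2"], ["D4"]], "cited": {"D4"}}
    """
    multi_round = [t for t in traces if len(t["rounds"]) >= 2]
    if not multi_round:
        return None  # no second hops happened, so the rate is undefined
    useful = 0
    for trace in multi_round:
        first_round_docs = set(trace["rounds"][0])
        # Documents that only appeared in round two or later.
        later_only = set().union(*trace["rounds"][1:]) - first_round_docs
        if later_only & set(trace["cited"]):
            useful += 1
    return useful / len(multi_round)
```

A low rate under this definition suggests the extra rounds are paraphrasing rather than learning. It says nothing about whether the cited claims are actually supported, which is what the false-green review is for.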

9) Boundary — where iterative retrieval helps, hurts, or wastes budget

Strong fit: the bad evidence pattern is visible before generation. Weak fit: the corpus is missing the truth, metadata is stale, or the system has no alternative route to take. Pathology: the pipeline keeps retrying because it wants an answer, not because the current evidence revealed a new evidence need.

Scale limit: each checkpoint spends latency, money, logs, and operator attention. Route the mechanism to the queries that need it; do not make every query pay for the hardest query.

10) Wrong model — advanced RAG means adding more retrieval

The wrong model says more variants, rerankers, loops, and confidence scores make the system safer by default.

The better model is narrower: each component must relieve one named pressure and create one visible decision. If the confidence gate does not change what the system does, it is decoration.

11) Failure taxonomy for iterative retrieval

The checkpoint is not logged, so the failure cannot be replayed.
The threshold came from a demo set and does not match production traffic.
The retry repeats the same weak evidence instead of targeting a missing slot.
The metric improves while one required constraint stays unsupported.
Metadata or routing rules go stale and silently hide the right document.
Extra candidates crowd the prompt with duplicates.
Operators optimize the easy number instead of unsupported-answer rate.

12) Pattern transfer — same pressure, different system

Caching has the same shape: add the mechanism only when the access pattern earns it.
Observability has the same shape: traces matter because production bugs start as symptoms.
Distributed systems have the same shape: the useful part is the guarantee under failure.
Evals have the same shape: average wins do not matter if false greens leak to users.

13) Design review checklist

What exact pressure forced this mechanism to exist?
What artifact proves the mechanism changed retrieval?
Why are blind retries with the same query weaker on this workload?
Which metric should improve first?
Which cost rises first?
When should the system answer, retry, reroute, or abstain?

Where this lives in the wild

Perplexity-style answer engines — often refine follow-up search based on what early evidence reveals.
Incident investigation copilots — search logs, dashboards, and postmortems in successive rounds as new handles emerge.
Enterprise research assistants — turn one business question into a short search dialogue with the corpus.
Security analysis bots — refine queries with IOC names or alert IDs discovered in earlier documents.
Support troubleshooting tools — use first-round symptoms to launch second-round product-specific searches.
Enterprise policy assistants — use the pattern when user wording hides policy scope, dates, or exception clauses.
Customer support copilots — need the control when a pleasant answer without full evidence can mislead a customer.
Incident copilots — benefit when service names, runbooks, dashboards, and postmortems disagree or use aliases.
Legal document search — requires visible evidence boundaries because one missing clause can reverse the answer.
Healthcare knowledge assistants — need stricter support checks because fluent synthesis is not a safety guarantee.
Financial research tools — use it to separate ticker-like literal anchors from semantic business questions.
Internal engineering search — exposes whether the system found the exact config, RFC, ticket, or deploy note.
Knowledge-base migration projects — reveal stale, duplicate, and missing documents before retrieval quality is blamed.
Evaluation harnesses — turn the mechanism into a measurable false-green and false-red review loop.
Agentic workflows — reuse the same control point before a tool call, follow-up search, or escalation step.

Recall checkpoint

Why is iterative retrieval more than repeating the same search several times?
In the worked example, which new handle appeared after round one and changed round two?
What should make the system stop searching even if the answer is still not perfect?
Which false-green case would you review first for iterative retrieval?
What cost does this mechanism move into retrieval orchestration or evaluation?
When would blind retries with the same query be acceptable instead?

Interview Q&A

Q: When is iterative retrieval genuinely better than a stronger first-pass query? A: When key search terms are unknown until the system reads early evidence and uncovers them.

Common wrong answer to avoid: "It is better whenever the first answer feels uncertain." — Uncertainty alone is not enough if the next search has no new direction.

Q: What is the core failure mode of naive iterative retrieval? A: Repeating near-identical searches that consume latency without adding new evidence or narrowing the hypothesis.

Common wrong answer to avoid: "The main failure is just higher cost." — Cost matters, but the deeper issue is loops that do not learn.

Q: What should each extra round contribute? A: A new handle, a tighter filter, a changed retriever, or some other concrete improvement to the evidence search.

Common wrong answer to avoid: "Each round should simply use more tokens." — More tokens are not the same as more signal.

Q: What trace would you inspect first when iterative retrieval fails? A: Start with the iteration log showing a new query from a discovered gap. Then compare it with the final answer claims.

Common wrong answer to avoid: "Start by rewriting the final prompt." — The prompt may be fine. The evidence path may already be broken.

Q: What cost does this add? A: Latency, logging, evaluation work, and one more place operators must understand.

Common wrong answer to avoid: "It is free because it is only another model call." — Model calls spend money, time, and failure budget.

Q: When should you skip it? A: Skip it when the question is low-risk, the evidence need is obvious, and the decision would not change.

Common wrong answer to avoid: "Never skip advanced components." — Advanced components are controls, not trophies.

Apply now (10 min)

Take one diagnosis-style question and write the exact new handle that a first evidence pass might reveal.
Sketch from memory: draw three search rounds and annotate what changed between each one.
Reproduce from memory: explain iterative retrieval in five sentences, including the pressure, mechanism, alternative, metric, and boundary.

What you should remember

Iterative retrieval exists because some missing evidence is only visible after reading the first retrieved set. The mechanism is not valuable because it sounds advanced; it is valuable when it changes the evidence path before generation.

The artifact to inspect is an iteration log showing a new query from a discovered gap. If that artifact does not explain why retrieval, ranking, retry, routing, or refusal changed, the component is not carrying its operational weight.

Remember:

Iterate only when the current evidence reveals a specific next evidence need.
Inspect an iteration log showing a new query from a discovered gap before blaming the final prompt.
Prefer visible evidence controls over hidden prompt hope.
Review false greens before celebrating average retrieval scores.
Every advanced RAG component should relieve one pressure and create one decision.

Bridge. Iterative retrieval is powerful, but not every query deserves a loop. Systems need a quick way to decide which retrieval path fits which question. That is routing. → 12-routing-strategies.md