10. Reranking — the second pass that saves the answer
~12 min read. Retrieval got you close. Reranking gets you right.
Built on the ELI5 in 00-eli5.md. The librarian has already pulled 20 books from the bookshelf. Now the librarian sits down and reads the first pages carefully before choosing which ones go on the reading desk. Continues from 09-query-and-retrieval.md.
1) When the right chunk is in top-20 but never reaches top-3
A user asks support, "Can enterprise customers get a refund within 30 days of renewal?" The retriever pulls the top-20 chunks from a 200,000-chunk help corpus. Look at the first three: rank 1 is "General refund overview" at 0.86, rank 2 is "Billing error troubleshooting" at 0.83, rank 3 is "Annual plan cancellation steps" at 0.82. None of these answer the question. The actual answer — "Enterprise renewal refund policy — within 30 days of renewal date, eligible" — sits at rank 11 with a score of 0.74.
The right chunk is in the candidate pool. It is just not at the top. If you stuff the top-3 directly into the reading desk, the LLM answers from generic refund text, and the answer comes back fluent and wrong. Not a retrieval-failed problem. Not a model-too-small problem. A ranking problem: cosine similarity loved the word "refund" and ignored "enterprise," "30 days," and "renewal."
Reranking is the rescue. A second-pass model reads each query-chunk pair together, holds the question in mind while scanning the chunk, and rescores. Rank 11 moves up to rank 1, the right chunk reaches the prompt, the answer becomes grounded.
2) Why the librarian has to read each candidate twice
The bookshelf is huge, and the librarian cannot read every book carefully — so the work splits into two passes. The first pass walks the aisles fast: spine, title, index card on the cover, quick scan, wide net, pull 20 books that look related. The second pass is different. The librarian sits at the desk, opens each of those 20 books to the first page, and reads each one with the question held in mind, scoring it against the actual ask.
The asymmetry is the point. First pass is fast and shallow because reading every book carefully is impossible at scale. Second pass is slow and deep because trusting the quick scan alone is reckless — the cosine score will love the wrong books exactly often enough to wreck precision near the top. Cover ground first, then sharpen.
This is exactly what bi-encoders and cross-encoders do.
3) Bi-encoder vs cross-encoder — the core visual
The single most important diagram in reranking.
BI-ENCODER (retrieval)              CROSS-ENCODER (reranking)
──────────────────────              ─────────────────────────
  query       chunk                 ┌──────────────────────┐
    │           │                   │ [CLS] query [SEP]    │
    ▼           ▼                   │ chunk [SEP]          │
  ┌────┐      ┌────┐                └──────────┬───────────┘
  │ENC │      │ENC │                           │
  │ A  │      │ B  │                           ▼
  └─┬──┘      └─┬──┘                  ┌────────────────┐
    │           │                     │  TRANSFORMER   │
    ▼           ▼                     │  joint over    │
  q-vec       d-vec                   │  both texts    │
    │           │                     └────────┬───────┘
    └─── dot ───┘                              │
          │                                    ▼
          ▼                               ┌─────────┐
     similarity                           │  score  │
    (precomputed                          │  (one   │
     for all docs)                        │ number) │
                                          └─────────┘
towers are SEPARATE                 tokens MIX inside
docs embedded ONCE offline          must run PER (query, chunk) pair
one dot product per chunk           one transformer pass per pair
microseconds per chunk              30 to 150 ms per pair
Read the two columns. On the left, query and chunk go through their own encoder towers. The towers never talk. The chunk vector was computed once at indexing time and is just sitting in the vector DB.
On the right, query tokens and chunk tokens sit inside the same transformer. Self-attention lets every query token attend to every chunk token. Negation, time constraints, entity names — all get attended to jointly.
That joint attention is the precision. That joint attention is also the cost.
Mini-FAQ. "Why don't we just retrieve with a cross-encoder?" Because you cannot precompute. To retrieve one query out of a million-chunk corpus with a cross-encoder, you would have to run the joint transformer one million times — per query. At 50 ms each that is 14 hours. Bi-encoder retrieval does the same job in 20 ms because the chunk vectors are already in the index. Use the bi-encoder to find candidates; use the cross-encoder to rank them.
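The 14-hour figure is plain arithmetic and easy to check. A minimal sketch, assuming the FAQ's own numbers (one million chunks, 50 ms per pair, pairs scored strictly one after another); the function name is illustrative only.

```python
def full_scan_hours(num_chunks: int, ms_per_pair: float) -> float:
    """Hours needed to score every chunk once with a per-pair model, sequentially."""
    total_ms = num_chunks * ms_per_pair
    return total_ms / 1000 / 3600


# One cross-encoder pass per chunk over the whole corpus, for a single query.
hours = full_scan_hours(num_chunks=1_000_000, ms_per_pair=50)
print(f"{hours:.1f} hours per query")  # prints "13.9 hours per query"
```

Parallel hardware shrinks the wall-clock time but not the total compute, which is why the cross-encoder only ever sees the short candidate list.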
4) The worked example — top-20 in, top-3 out
Back to the refund query. Here is what the two stages actually produce.
Query. "Can enterprise customers get a refund within 30 days of renewal?"
After bi-encoder retrieval (top-20, ordered by cosine similarity):
| Rank | Chunk title | Score |
|---|---|---|
| 1 | General refund overview | 0.86 |
| 2 | Billing error troubleshooting | 0.83 |
| 3 | Annual plan cancellation steps | 0.82 |
| 4 | Refund FAQ for trial users | 0.80 |
| 5 | Stripe webhook refund events | 0.79 |
| ... | ... | ... |
| 11 | Enterprise renewal refund policy | 0.74 |
| ... | ... | ... |
| 17 | Legacy refund exception memo (mentions 30-day) | 0.68 |
| 18 | Tax handling on partial refunds | 0.66 |
| 19 | Refund notification email template | 0.65 |
| 20 | Currency conversion for refunds | 0.63 |
Notice. The two chunks that actually answer the question — rank 11 and rank 17 — are buried under broad refund chatter. The cosine similarity loves the word "refund." It does not care about "enterprise," "30 days," or "renewal."
After cross-encoder reranking:
| Rank | Chunk title | Rerank score |
|---|---|---|
| 1 | Enterprise renewal refund policy | 0.97 |
| 2 | Legacy refund exception memo | 0.88 |
| 3 | General refund overview | 0.62 |
| 4 | Annual plan cancellation steps | 0.55 |
| 5 | Refund FAQ for trial users | 0.31 |
| ... | ... | ... |
| 20 | Currency conversion for refunds | 0.04 |
Rank 11 jumped to rank 1. Rank 17 jumped to rank 2. The billing-error chunk crashed.
Why? Because the cross-encoder read the query and each chunk together. It noticed "enterprise" appeared in chunk 11. It noticed "30 days" appeared in chunk 17. It noticed billing-error chunks said nothing about renewal.
Now the top-3 going to the reading desk include both chunks that answer the question. Same retrieval, same corpus, same query. One cheap second pass changed everything.
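The retrieve-then-rerank flow from this example can be sketched in a few lines. This is a toy under stated assumptions, not a production pipeline: the 2-d vectors are made up, and `toy_pair_scorer` is a hypothetical word-overlap stand-in for a real cross-encoder, used only so the sketch runs without a model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """First pass: rank precomputed chunk vectors by dot product, keep top-k ids."""
    ranked = sorted(index, key=lambda cid: dot(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def rerank(
    query: str,
    candidates: List[str],
    chunks: Dict[str, str],
    score_pair: Callable[[str, str], float],
    n: int,
) -> List[Tuple[str, float]]:
    """Second pass: score each (query, chunk) pair jointly, keep the top-n."""
    scored = [(cid, score_pair(query, chunks[cid])) for cid in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


def toy_pair_scorer(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)


chunks = {
    "general": "general refund overview",
    "billing": "billing error troubleshooting",
    "enterprise": "enterprise renewal refund policy within 30 days",
}
# Made-up 2-d vectors; a real index would hold embeddings from a bi-encoder.
index = {"general": [0.9, 0.1], "billing": [0.8, 0.2], "enterprise": [0.7, 0.3]}

query = "enterprise refund within 30 days of renewal"
candidates = retrieve([1.0, 0.0], index, k=3)
top = rerank(query, candidates, chunks, toy_pair_scorer, n=2)
print(candidates)                # ['general', 'billing', 'enterprise']
print([cid for cid, _ in top])   # ['enterprise', 'general']
```

The design point is the seam between the two functions: `retrieve` only touches precomputed vectors, while `rerank` takes any callable that scores a (query, chunk) pair, so the toy scorer can later be swapped for a real model.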
5) Why rerank at all — the recall vs precision split
Retrieval optimises for recall@k — the right chunk should be somewhere in the top-20. Reranking optimises for precision@n — the right chunk should be at the top of the top-3. Different jobs, different tools.
Bi-encoders are recall machines: they cast a wide net and tolerate some noise near the top. Cross-encoders are precision machines: they tighten the top but cannot do recall at scale. You want both — cheap recall first, expensive precision second.
Typical quality lifts from adding a strong reranker on a noisy enterprise corpus:
- BEIR-style retrieval benchmarks: +8 to +15 points of nDCG@10.
- Real production support corpora: 10 to 20 percentage points of absolute lift in answer correctness.
- Multilingual corpora: lifts can exceed 20 points because dense embeddings degrade faster than rerankers across languages.
These numbers are not promises. They are what teams typically report. Measure on your corpus.
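To measure on your own corpus you need the metrics themselves. A minimal sketch of recall@k, precision@k, and binary-relevance nDCG@k, applied to the refund example; the chunk ids `c1` to `c20` are hypothetical labels for the retrieval ranks, and the post-rerank order below only preserves where the two relevant chunks land.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top-k."""
    hits = sum(1 for cid in ranked[:k] if cid in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k slots filled by relevant chunks."""
    hits = sum(1 for cid in ranked[:k] if cid in relevant)
    return hits / k


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: gain 1 per relevant chunk, log2 position discount."""
    dcg = sum(
        1 / math.log2(pos + 2)
        for pos, cid in enumerate(ranked[:k])
        if cid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


relevant = {"c11", "c17"}                      # the two chunks that answer the query
before = [f"c{i}" for i in range(1, 21)]       # bi-encoder order
after = ["c11", "c17"] + [c for c in before if c not in relevant]

print(recall_at_k(before, relevant, 20))  # 1.0
print(recall_at_k(before, relevant, 3))   # 0.0
print(recall_at_k(after, relevant, 3))    # 1.0
print(ndcg_at_k(before, relevant, 10))    # 0.0
print(ndcg_at_k(after, relevant, 10))     # 1.0
```

Recall@20 is identical before and after, because reranking only reorders the pool. The whole gain shows up in the top-3 numbers.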
6) The common rerankers — a working menu
Five families. Pick the one that matches your latency, budget, and quality target.
1. Managed API rerankers.
- Cohere Rerank v3 — multilingual, strong out-of-box quality, around $2 per 1,000 searches with 100 documents each at the time of writing. Latency 100 to 300 ms for a batch of 50 pairs.
- Voyage Rerank-2 — competitive quality, similar pricing tier, popular with Anthropic-stack teams.
- Jina Reranker v2 — open weights and a hosted API; multilingual; cheap.
- Mixedbread mxbai-rerank-large-v1 — open weights, runs well on a single GPU.
Managed APIs trade per-call cost for zero infra burden.
2. Open-weight cross-encoders.
- BGE Reranker (bge-reranker-base, bge-reranker-large, bge-reranker-v2-m3) — BAAI models, the open default for many teams.
- MS MARCO MiniLM cross-encoders — small, fast, English-only, classic.
- Salesforce LlamaRank — newer entrant, strong on long passages.
Self-hosted means GPU cost, but no per-call fee. A single A10 GPU can rerank 200 to 500 pairs per second with a small cross-encoder.
3. Late-interaction models — the ColBERT family.
ColBERT keeps token-level vectors per chunk, not one chunk-level vector. At query time it does a fast MaxSim over token vectors. Slower than a bi-encoder, faster than a cross-encoder, more precise than a bi-encoder.
Production deployments: Vespa ColBERT, RAGatouille (training library), PLAID (efficient ColBERT serving).
Use ColBERT when you want cross-encoder-grade precision but cannot afford per-pair transformer runs.
Mini-FAQ. "What's a 'late interaction' model like ColBERT?" Bi-encoder = no interaction (one vector per chunk, compared only by a final dot product). Cross-encoder = full interaction from the first layer (every query token attends to every chunk token, at full cost). ColBERT = late interaction (query and chunk are encoded separately into token vectors and meet only at the end, through a cheap MaxSim aggregation instead of full self-attention). Think of it as the middle seat.
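The MaxSim aggregation is small enough to write out. A minimal sketch, assuming token vectors are already computed and using made-up 2-d vectors; a real ColBERT model produces one normalised vector per token, and serving engines compute this far more efficiently than nested Python loops.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], chunk_tokens: List[Vector]) -> float:
    """For each query token, take its best-matching chunk token, then sum."""
    return sum(max(dot(q, d) for d in chunk_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
chunk_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first query token

print(maxsim_score(query_tokens, chunk_a))  # 2.0
print(maxsim_score(query_tokens, chunk_b))
```

Chunk B scores lower because nothing in it matches the second query token. A single pooled vector per chunk could blur that difference; keeping per-token vectors is what preserves it.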
4. LLM-as-reranker.
Send the query and each candidate to an LLM with a prompt like "rate relevance 0 to 10." Or use pairwise ranking — show the LLM two chunks and ask which is more relevant. Or use listwise — show the LLM all 20 chunks at once and ask for a ranking.
Quality can match or beat cross-encoders, especially with GPT-4-class models. Latency and cost are brutal. Typical use: agentic systems where you already pay for an LLM call, or domains for which no trained reranker exists.
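Pairwise ranking can be sketched as a round-robin where each chunk earns a point per comparison it wins. This is one possible arrangement, not a standard recipe: `judge` is a hypothetical callable that would wrap an LLM call and return "a" or "b", and `stub_judge` is a word-overlap placeholder so the sketch runs offline.

```python
from itertools import combinations
from typing import Callable, Dict, List

Judge = Callable[[str, str, str], str]  # (query, chunk_a, chunk_b) -> "a" or "b"


def pairwise_rerank(query: str, chunks: List[str], judge: Judge) -> List[str]:
    """Rank chunks by how many head-to-head comparisons each one wins."""
    wins: Dict[int, int] = {i: 0 for i in range(len(chunks))}
    for i, j in combinations(range(len(chunks)), 2):
        winner = judge(query, chunks[i], chunks[j])
        wins[i if winner == "a" else j] += 1
    order = sorted(wins, key=lambda i: wins[i], reverse=True)
    return [chunks[i] for i in order]


def stub_judge(query: str, a: str, b: str) -> str:
    """Placeholder for an LLM call: prefers the chunk sharing more query words."""
    q_words = set(query.lower().split())
    overlap_a = len(q_words & set(a.lower().split()))
    overlap_b = len(q_words & set(b.lower().split()))
    return "a" if overlap_a >= overlap_b else "b"


chunks = [
    "general refund overview",
    "billing error troubleshooting",
    "enterprise renewal refund policy",
]
ranking = pairwise_rerank("enterprise renewal refund", chunks, stub_judge)
print(ranking[0])  # enterprise renewal refund policy
```

A full round-robin needs n(n-1)/2 judge calls, so 20 candidates mean 190 LLM calls per query. That count is a concrete view of why the latency and cost are described as brutal.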
5. Learning-to-rank (LTR) classical.
OpenSearch Learning-to-Rank, Elastic LTR — gradient-boosted trees over hand-crafted features (BM25 score, recency, click rate, embedding similarity). Still alive in production at search-heavy companies. Not deep learning, but extremely tuneable.
7) Predict the rerank budget before reading the cost section
Before you read on, answer in your head:
- Why can you not use a cross-encoder for first-pass retrieval over a million chunks?
- What does a bi-encoder precompute that a cross-encoder cannot?
- In the refund example, why did rank 11 jump to rank 1 after rerank?
- If your reranker latency is 200 ms per pair and you rerank top-50, what is your rerank stage budget?
(Answer to the last one: 10 seconds if the 50 pairs run one after another, which is unacceptable. You would batch the 50 pairs and run them in parallel on a GPU; real wall-clock latency lands at 100 to 300 ms total.)
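The budget question reduces to two small formulas. A minimal sketch: the sequential case uses the numbers from the question, while the batched case assumes one forward pass per batch at 200 ms, a figure picked from inside the 100 to 300 ms range quoted above rather than measured.

```python
import math


def sequential_ms(num_pairs: int, ms_per_pair: float) -> float:
    """Total latency when pairs are scored one after another."""
    return num_pairs * ms_per_pair


def batched_ms(num_pairs: int, batch_size: int, ms_per_batch: float) -> float:
    """Wall-clock estimate when each batch costs one forward pass."""
    return math.ceil(num_pairs / batch_size) * ms_per_batch


print(sequential_ms(50, 200) / 1000)                      # 10.0 (seconds)
print(batched_ms(50, batch_size=50, ms_per_batch=200))    # 200 (milliseconds)
```

The batched estimate ignores queueing and tokenisation overhead, so treat it as a lower bound to verify with real p95 measurements.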
8) Cost, latency, quality — the three-way tradeoff
Reranking is not free. Three budgets to track.
Latency per pair.
- Small cross-encoder (MiniLM, 22M params) on CPU: 30 to 60 ms per pair.
- Base cross-encoder (bge-reranker-base, roughly 280M params) on GPU: 5 to 15 ms per pair batched.
- Large cross-encoder (bge-reranker-large, roughly 560M params) on GPU: 15 to 40 ms per pair batched.
- Cohere Rerank API: 100 to 300 ms total for batches of 50.
- LLM-as-reranker with GPT-4-class model: 1 to 4 seconds.
Cost per 1,000 search-rerank operations (100 candidates each, at time of writing):
- Cohere Rerank v3: around $2.
- Voyage Rerank-2: around $0.50 to $2.
- Jina Reranker hosted: around $0.20 to $0.50.
- Self-hosted BGE-large on an A10 spot instance: roughly $0.10 to $0.30 amortised.
- LLM-as-reranker (GPT-4o-mini, listwise): $0.50 to $5 depending on prompt size.
Quality.
- Strong cross-encoder over a strong bi-encoder: +5 to +15 points nDCG@10.
- ColBERT over a strong bi-encoder: +3 to +10 points.
- LLM listwise reranker over a strong cross-encoder: 0 to +3 points, sometimes negative on calibration-sensitive tasks.
Read the three lists side by side. The biggest jump is bi-encoder to cross-encoder. The next jump (cross-encoder to LLM) is small, slow, and expensive.
Do the first jump. Hesitate on the second.
9) Rerank everything vs rerank when needed
Two production strategies.
Strategy A — rerank everything. Every query goes through retrieve → rerank → select. Simple. Predictable cost. Predictable latency. Default for most teams.
Strategy B — rerank when needed. Cheap signal first. Rerank only when the signal is weak.
Signals that say "skip rerank":
- Top-1 retrieval score is far above top-2 (clear winner).
- Top-3 scores are tightly clustered above a threshold and the query is short and direct.
- The query came from a cached high-confidence path.
Signals that say "rerank":
- Top-k scores are flat (the bi-encoder is uncertain).
- The query contains multiple constraints (time, entity, role).
- The corpus is noisy or contains many near-duplicates.
A typical adaptive system reranks 60 to 80% of queries and saves 20 to 40% of rerank cost. Useful at scale. Adds operational complexity at small scale. Default to Strategy A until you have measurements.
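A gate for Strategy B might look like the sketch below. Every threshold here (`clear_gap`, `flat_spread`, `long_query_words`) is a hypothetical starting value to be tuned on your own eval set, and word count is only a crude proxy for "multiple constraints".

```python
from typing import List


def should_rerank(
    scores: List[float],
    query: str,
    clear_gap: float = 0.10,
    flat_spread: float = 0.05,
    long_query_words: int = 8,
) -> bool:
    """Decide whether to send a query's candidates to the reranker.

    `scores` are first-pass retrieval scores, sorted high to low.
    """
    if len(scores) < 2:
        return False  # nothing to reorder
    top_gap = scores[0] - scores[1]
    spread = scores[0] - scores[-1]
    if spread <= flat_spread:
        return True   # flat scores: the bi-encoder is uncertain
    if len(query.split()) >= long_query_words:
        return True   # long query: likely carries several constraints
    if top_gap >= clear_gap:
        return False  # clear winner at rank 1
    return True       # no skip signal fired, so rerank by default


print(should_rerank([0.92, 0.70, 0.65], "reset password"))  # False
print(should_rerank(
    [0.86, 0.83, 0.82],
    "Can enterprise customers get a refund within 30 days of renewal?",
))  # True
```

The gate fails open: when no skip signal fires it reranks, which keeps the behaviour close to Strategy A until the thresholds have been validated.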
10) Cascaded rerankers — when one pass is not enough
Big teams sometimes run two reranker stages.
1,000,000 chunks
│
▼
┌────────────────────────────┐
│ Bi-encoder retrieval │ ~20 ms
│ top-200 │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ Cheap cross-encoder │ ~50 ms (200 pairs)
│ MiniLM, top-50 │
└────────────┬───────────────┘
▼
┌────────────────────────────┐
│ Expensive cross-encoder │ ~200 ms (50 pairs)
│ BGE-large or Cohere, top-5 │
└────────────┬───────────────┘
▼
reading desk
The cheap reranker is a cost filter — it removes the clearly off-topic candidates so the expensive reranker only sees plausible ones, and its budget goes further on the hard cases.
Cascading pays off when:
- Top-k from retrieval has to be large (say 200+) to hit recall targets.
- The expensive reranker latency or cost would be prohibitive on the full top-k.
Cascading is overkill when:
- Top-50 from retrieval already captures the right chunks.
- A single strong cross-encoder fits the latency budget.
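The cascade is a loop over (scorer, keep) stages. A minimal sketch, assuming each stage is any callable that scores a (query, chunk) pair; `cheap_scorer` and `expensive_scorer` are toy word-overlap stand-ins for a small and a large cross-encoder, not real models.

```python
from typing import Callable, List, Tuple

PairScorer = Callable[[str, str], float]
Stage = Tuple[PairScorer, int]  # (scorer, how many candidates to keep)


def cascade_rerank(query: str, candidates: List[str], stages: List[Stage]) -> List[str]:
    """Run stages in order; each stage rescores only what the previous one kept."""
    survivors = list(candidates)
    for score_pair, keep in stages:
        survivors.sort(key=lambda chunk: score_pair(query, chunk), reverse=True)
        survivors = survivors[:keep]
    return survivors


def cheap_scorer(query: str, chunk: str) -> float:
    """Toy filter: 1.0 if the chunk shares any word with the query, else 0.0."""
    return 1.0 if set(query.split()) & set(chunk.split()) else 0.0


def expensive_scorer(query: str, chunk: str) -> float:
    """Toy precise scorer: number of query words present in the chunk."""
    return float(len(set(query.split()) & set(chunk.split())))


candidates = [
    "currency conversion table",
    "general refund overview",
    "enterprise renewal refund policy",
    "billing error troubleshooting",
    "refund faq for trial users",
]
result = cascade_rerank(
    "enterprise renewal refund",
    candidates,
    stages=[(cheap_scorer, 3), (expensive_scorer, 2)],
)
print(result)  # ['enterprise renewal refund policy', 'general refund overview']
```

The expensive scorer runs on three chunks instead of five here. At the scale in the diagram the same shape turns 200 expensive calls into 50.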
Mini-FAQ. "When is LLM-as-reranker worth it?" Three cases. One — you have no trained reranker for your domain and need decent quality fast (legal, medical, specialised technical). Two — you are already running an agent loop that calls an LLM, so the marginal cost is small. Three — you need explanations alongside scores ("this chunk is relevant because..."), which a classical reranker cannot give. Outside those cases, a cross-encoder is cheaper and faster.
11) Failure modes — what actually breaks
Failure 1. Domain mismatch. You took a reranker trained on MS MARCO web passages and pointed it at your medical-records corpus. The reranker scores look reasonable. The ordering is subtly wrong. Recall@3 drops 10 points and nobody notices for a month. Mitigation. Use a domain-tuned reranker or fine-tune on labelled in-domain pairs (even 500 to 2,000 labelled pairs help).
Failure 2. Calibration is broken. The reranker orders chunks correctly but its absolute scores are not probabilities. You set a threshold of 0.5 to decide "no relevant chunk found." On corpus A the threshold means "decent match." On corpus B the threshold means "perfect match." Same model, different distribution. Mitigation. Calibrate per corpus. Use relative thresholds (top-1 score vs top-k mean) instead of absolute thresholds.
Failure 3. Reranker latency blows the budget. You added BGE-large to a real-time chat product. p95 latency went from 800 ms to 2.1 seconds. Users churn. Mitigation. Batch pairs on GPU. Use a smaller model. Cascade. Or rerank a smaller top-k.
Failure 4. Top-k too small to feed the reranker. You retrieve top-5 and rerank them. But the right chunk was at rank 12 from retrieval, so it never reached the reranker. Reranking cannot promote what it cannot see. Mitigation. Retrieve top-30 to top-100 before reranking. The reranker is cheap per pair; widening the candidate pool is usually the right move.
Failure 5. Reranker hallucinates relevance for LLM-as-reranker. The LLM gives high scores to plausible-sounding but irrelevant chunks because it pattern-matches surface words. Mitigation. Use a trained cross-encoder where possible. For LLM rerankers, force pairwise comparisons rather than absolute scores — pairwise survives calibration drift better than pointwise scoring.
Failure 6. Chunk length mismatch. The reranker was trained on 256-token passages. Your chunks are 1,200 tokens. Tokens past the model's context window get truncated, so the reranker scores only the opening part of every chunk. Mitigation. Match chunk size to the reranker's training distribution. Or use a long-context reranker (some BGE and Jina variants handle 1k-2k tokens).
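The relative threshold suggested under Failure 2 can be sketched as a margin check. This is one reading of "top-1 score vs top-k mean": it compares the best score to the mean of the remaining candidates, and the `min_margin` default of 0.15 is a hypothetical value to calibrate per corpus.

```python
from statistics import mean
from typing import List


def has_confident_top1(scores: List[float], min_margin: float = 0.15) -> bool:
    """True when the best rerank score stands clearly above the rest of the top-k.

    Uses a margin instead of a fixed cutoff, so the check behaves the same
    when a corpus shifts every score up or down.
    """
    if len(scores) < 2:
        return False  # no basis for comparison
    ranked = sorted(scores, reverse=True)
    return ranked[0] - mean(ranked[1:]) >= min_margin


corpus_a = [0.97, 0.88, 0.62, 0.55]  # clear winner, high absolute scores
corpus_b = [0.47, 0.38, 0.12, 0.05]  # same shape, every score 0.5 lower
flat = [0.52, 0.51, 0.50, 0.49]      # no winner, yet top-1 clears a 0.5 cutoff

print(has_confident_top1(corpus_a))  # True
print(has_confident_top1(corpus_b))  # True
print(has_confident_top1(flat))      # False
```

An absolute "abstain below 0.5" rule would get two of these three cases wrong: it rejects corpus B despite a clear winner and accepts the flat list despite having none.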
Mini-FAQ. "Should I always rerank?" No. If your retrieval already hits >95% recall@3 on a representative eval set, reranking adds latency and cost for negligible gain. If your retrieval sits at 60 to 80% recall@3, reranking is the cheapest quality lever you have.
12) The rerank slot across shipped stacks
The reranking layer is now standard. Names and shapes vary; the role does not.
- Cohere Rerank v3 — managed API; powers reranking inside many enterprise RAG stacks.
- Voyage AI rerank-2 — managed reranker often paired with Anthropic models.
- Jina Reranker v2 — open weights plus hosted API; multilingual.
- BGE Reranker (BAAI) — open weights; default self-hosted choice for many teams.
- Mixedbread mxbai-rerank-large-v1 — open weights, strong on English.
- Salesforce LlamaRank — newer long-passage reranker.
- ColBERT — late-interaction model; deployed in Vespa and via RAGatouille.
- PLAID — efficient ColBERT serving for production.
- Pinecone managed reranking — first-class rerank step in Pinecone Assistant.
- Vectara — RAG-as-a-service with built-in reranker and faithfulness scoring.
- Azure AI Search semantic ranker — Microsoft's hosted cross-encoder stage.
- Vertex AI Search ranking API — Google's managed reranker.
- Amazon Kendra — proprietary semantic ranking layer over retrieval.
- OpenSearch Learning-to-Rank — gradient-boosted LTR plugin.
- Elastic LTR — Elastic's learning-to-rank module.
- Weaviate hybrid reranking modules — pluggable rerankers post hybrid retrieval.
- LlamaIndex SentenceTransformersRerank / CohereRerank / FlagEmbeddingReranker — orchestration-side rerankers.
- LangChain Cohere rerank / CrossEncoderReranker / FlashrankRerank — same idea, different framework.
- Perplexity AI — internal reranker stage before answer composition.
- Glean — enterprise search reranker tuned to org context.
- GitHub Copilot Chat — repo-context reranker for code chunks.
- Cursor / Windsurf — code-aware reranking over file embeddings.
- Hebbia / Harvey / Casetext — domain-tuned rerankers for finance and legal.
The pipeline diagram is the same everywhere. What changes is which reranker fills box 4.
13) Recall — eight questions on the second pass
- What does a bi-encoder produce that a cross-encoder cannot, and why does that matter for retrieval?
- Why is a cross-encoder more precise but unusable for first-pass retrieval over millions of chunks?
- In the refund example, name two reasons the cosine score loved the wrong chunks.
- What is a late-interaction model and where does ColBERT sit between bi- and cross-encoders?
- Give two signals that say "rerank this query" and two that say "skip rerank."
- When does cascaded reranking pay off, and when is it overkill?
- Your reranker scores look fine but answer quality dropped after a corpus migration. Hypothesise three causes.
- What is the typical nDCG@10 lift from adding a strong cross-encoder reranker over a strong bi-encoder?
14) Interview Q&A
Q1. Bi-encoder vs cross-encoder — what is the actual difference? A. A bi-encoder embeds query and chunk separately into independent vectors, then compares with dot product or cosine. Chunk vectors can be precomputed at indexing time, so retrieval scales. A cross-encoder feeds query and chunk together into one transformer, lets every token attend to every token, and outputs a single relevance score. It cannot be precomputed, so it runs per (query, chunk) pair at query time. Bi-encoder = fast recall. Cross-encoder = sharp precision. Common wrong answer to avoid: "Cross-encoders are just bi-encoders with more parameters."
Q2. Why don't we just use a cross-encoder for retrieval? A. Because we would need to run the joint transformer once per (query, chunk) pair against the whole corpus. At a million chunks and 50 ms per pair that is roughly 14 hours per query. Bi-encoders do retrieval in milliseconds because the chunk vectors live in the index. Common wrong answer to avoid: "Cross-encoders are less accurate than bi-encoders."
Q3. Where does reranking sit in the RAG pipeline and what is its job? A. Between retrieval and selection. Retrieval pulls top-k (often 20 to 100). The reranker rescores each candidate against the query and reorders them. Selection then keeps the top-n (usually 3 to 8) for the prompt. The job is to fix precision near the top, where the bi-encoder is noisy. Common wrong answer to avoid: "Reranking re-runs retrieval with a better model."
Q4. When would you choose ColBERT over a standard cross-encoder? A. When you need cross-encoder-grade precision but cannot afford the per-pair transformer cost. ColBERT keeps token-level vectors per chunk and uses a cheap MaxSim aggregation at query time. It scales further than full cross-encoders while preserving more interaction than bi-encoders. Trade-off: bigger index (per-token vectors), more storage, more engineering. Production setups: Vespa, PLAID. Common wrong answer to avoid: "ColBERT is just a faster bi-encoder."
Q5. What is LLM-as-reranker and when is it worth the cost? A. Use an LLM to score or rank candidates directly — pointwise (rate 0–10), pairwise (which of two is more relevant), or listwise (rank all 20). Quality can match strong cross-encoders, sometimes better, especially with GPT-4-class models. Worth the cost in three cases: no trained reranker exists for your domain; you're inside an agent loop already paying for LLM calls; you need score explanations. Outside those, a cross-encoder is 10 to 100× cheaper and faster. Common wrong answer to avoid: "LLM rerankers are always better because LLMs are smarter."
Q6. Should you always rerank? When would you skip it? A. No. Skip when retrieval already hits high recall@3 on your eval set, when latency budgets are extremely tight, or when the corpus is small and clean. Rerank when retrieval top-k scores are flat, queries have multiple constraints, the corpus is noisy, or your eval shows precision@3 is the bottleneck. Default to rerank-everything until your measurements say otherwise. Common wrong answer to avoid: "Rerank every query maximally — quality is everything."
Q7. Your reranker has good ordering but bad absolute scores. What is happening and how do you fix it? A. Calibration drift. Cross-encoders are trained to rank, not to produce calibrated probabilities. Absolute scores shift across corpora and even queries. If you use an absolute threshold (e.g., "abstain if score < 0.5"), the threshold means different things in different contexts. Fix: calibrate per corpus with a held-out set, or switch to relative thresholds — top-1 score vs the mean of top-k, or score gap between top-1 and top-2. Common wrong answer to avoid: "Retrain the reranker from scratch."
Q8. How do you design a cascaded reranker pipeline and when does it pay off? A. Stage 1: bi-encoder, retrieve top-200. Stage 2: cheap cross-encoder (MiniLM or similar), top-50. Stage 3: expensive cross-encoder or Cohere Rerank, top-5. Pays off when (a) top-k from retrieval must be large to hit recall, and (b) running the strongest reranker on the full top-k blows latency or cost. Overkill when top-50 already captures the right chunks and a single strong reranker fits the budget. The trap is adding a cascade before measuring whether one stage is enough. Common wrong answer to avoid: "Always cascade — more stages always help."
15) Apply now (10 min)
Step 1 — model the exercise. Here is the trace I would write for the refund query.
| Stage | Input | Output | Latency I would log |
|---|---|---|---|
| Retrieval | query vector | top-20 chunks, cosine scores | retrieval p95 |
| Rerank | (query, chunk) × 20 pairs | 20 scores, reordered | rerank latency per batch, p95 |
| Selection | top-20 reranked | top-3 deduped | dedup hit rate |
For each stage, record one failure mode and one metric. Reranker latency drift is the one I would alert on first.
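A trace like the table above can be captured with a thin timing wrapper. A minimal sketch, assuming each stage is a plain function that returns a list; the three `fake_*` stages are placeholders so the code runs, and `run_stage` is an illustrative helper rather than part of any framework.

```python
import time
from typing import Any, Callable, Dict, List


def run_stage(name: str, fn: Callable[..., Any], trace: List[Dict[str, Any]], *args: Any) -> Any:
    """Run one pipeline stage and append its name, latency, and output size to the trace."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    trace.append({"stage": name, "latency_ms": elapsed_ms, "n_out": len(result)})
    return result


# Placeholder stages; swap in the real retriever, reranker, and selector.
def fake_retrieve(query: str) -> List[str]:
    return [f"chunk-{i}" for i in range(20)]


def fake_rerank(query: str, chunks: List[str]) -> List[str]:
    return list(reversed(chunks))


def fake_select(chunks: List[str]) -> List[str]:
    return chunks[:3]


trace: List[Dict[str, Any]] = []
query = "Can enterprise customers get a refund within 30 days of renewal?"
candidates = run_stage("retrieval", fake_retrieve, trace, query)
reranked = run_stage("rerank", fake_rerank, trace, query, candidates)
selected = run_stage("selection", fake_select, trace, reranked)

for row in trace:
    print(f"{row['stage']:<10} {row['n_out']:>3} out  {row['latency_ms']:.3f} ms")
```

One trace row per stage per query is the raw material; p95 figures come from aggregating these rows across many queries in whatever metrics system you already use.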
Step 2 — your turn. Pick one query from your product. Write the top-10 chunks your retriever would pull. Mark which two or three you believe are actually relevant. Then — without running the model — predict how a cross-encoder would reorder them, and write one sentence per swap explaining why.
Step 3 — sketch from memory. Redraw the bi-encoder vs cross-encoder diagram. Two columns. Label which side is precomputed and which is not. Label one number on each side (latency per pair). Mark where the librarian sits and which chunks land on the reading desk after the second pass. If you can do this cold, you understand the layer.
What you should remember
This chapter explained why retrieval alone is not enough: cosine similarity loves word overlap, and the chunk that actually answers the question often sits at rank 11, not rank 1. The fix is a second pass — a cross-encoder that reads each (query, chunk) pair jointly and rescores. Same retrieval, same corpus, same query. One cheap pass changes which chunks reach the reading desk.
You learned the asymmetry that makes the two-pass design unavoidable: bi-encoders precompute chunk vectors once at indexing and scale to millions, but they cannot let query and chunk tokens see each other; cross-encoders let every token attend to every token but cannot precompute, so they only fit on a small candidate pool. ColBERT sits in the middle seat — token-level vectors with a cheap MaxSim — when you need cross-encoder precision without per-pair transformer cost. LLM-as-reranker is a hammer worth using only when no trained reranker exists, when you are already inside an agent loop, or when you need explanations.
Carry this diagnostic forward: when answer quality drops, before blaming the generator or swapping retrievers, read the rerank ordering. If the right chunk reached top-20 but lost top-3, the rerank step is where the answer was lost. If the right chunk never reached top-20, widen retrieval before adding any reranker — reranking cannot promote what it cannot see.
Remember:
- Retrieval optimises recall@k; reranking optimises precision@n. Different jobs, different models.
- A cross-encoder cannot do retrieval: at 50 ms per pair, scoring a million chunks for a single query takes ~14 hours of compute. Use it as the second pass on top-20 to top-200.
- The biggest quality jump in the stack is bi-encoder → cross-encoder. The jump from cross-encoder to LLM-listwise is small, slow, and expensive.
- Calibration drift is real: cross-encoders are trained to rank, not to produce probabilities. Prefer relative thresholds (top-1 minus top-2 gap) over absolute ones.
- Reranking cannot promote what retrieval did not return. If recall@20 is poor, widen retrieval before adding rerank layers.
Bridge. The right chunks are now in the top-3. The reading desk is set. But putting chunks in front of the LLM is not the same as building the answer brief. Order, instructions, citations, abstention rules — all matter. Next file: how to assemble the prompt that turns evidence into a grounded answer.