12. Retrieval metrics — grading the librarian before grading the writer
~12 min read. Two RAG systems both score recall@10 = 1.00. One feels brilliant. The other feels broken. By the end, you will know why, and which metric would have caught the difference.
Builds on 11-prompt-augmentation.md and the ELI5 in 00-eli5.md. The librarian must be graded on the books pulled, not the essay written. Generation grading comes next.
The hook — same recall, different feel
Picture two systems answering the same support query. Both pull 10 chunks. Both have 4 relevant chunks in those 10. Their recall@10 is identical — 4 out of 4 possible relevant chunks were in the top-10 candidate pool. On paper, equal.
Now look at how each one ranks those 4 relevant chunks.
System A: R . . R . . . R . R ← relevant at ranks 1, 4, 8, 10
System B: . . . . . . R R R R ← relevant at ranks 7, 8, 9, 10
System A puts a good chunk at rank 1. The user sees the answer instantly. System B buries the first good chunk at rank 7. The reading desk only fits 3 chunks. The LLM never reads a relevant one. The user sees a hallucination.
Same recall. Different lives. The metric you choose decides whether you see this gap or not. Recall@10 cannot tell A and B apart. MRR can. nDCG can. That is the whole point of this chapter.
Grade the librarian first. The writer cannot fix what the librarian never brought.
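To make the gap concrete, here is a minimal Python sketch that scores both rankings from the hook. It assumes the four relevant chunks in each top-10 are the only relevant chunks in the corpus; the list encoding and the function names are illustrative, not taken from any library.

```python
# 1 = relevant, 0 = not relevant, in rank order (rank 1 first).
SYSTEM_A = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]  # relevant at ranks 1, 4, 8, 10
SYSTEM_B = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # relevant at ranks 7, 8, 9, 10
TOTAL_RELEVANT = 4  # assumption: no relevant chunk exists outside the top-10


def recall_at_k(flags, k, total_relevant):
    """Fraction of all relevant chunks that appear in the top-k."""
    return sum(flags[:k]) / total_relevant


def reciprocal_rank(flags):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


for name, flags in [("A", SYSTEM_A), ("B", SYSTEM_B)]:
    print(
        name,
        f"recall@10={recall_at_k(flags, 10, TOTAL_RELEVANT):.2f}",
        f"recall@3={recall_at_k(flags, 3, TOTAL_RELEVANT):.2f}",
        f"RR={reciprocal_rank(flags):.3f}",
    )
```

Both systems tie on recall@10. They split as soon as you shrink k to the reading-desk size or look at the reciprocal rank of the first relevant chunk.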
The running example — one query, one top-10 result
We will keep one example for the whole page. Watch every metric land on the same numbers.
Query. "Can enterprise customers request a refund after 30 days?"
Golden set for this query (the chunks a human judge marked relevant):
| Chunk ID | Grade | Why it is relevant |
|---|---|---|
| D1 | 3 — highly relevant | Direct policy: "Enterprise refunds after 30 days require manager approval" |
| D2 | 2 — partially relevant | Mentions enterprise refund timeline but no >30-day clause |
| D5 | 1 — weakly relevant | General refund FAQ |
| D9 | 3 — highly relevant | Renewal clause with explicit >30-day handling |
Four relevant chunks exist in the corpus. Two are highly relevant, one partial, one weak.
The system's top-10 retrieval, in order:
| Rank | Chunk | Grade | Relevant? |
|---|---|---|---|
| 1 | D7 | 0 | no |
| 2 | D1 | 3 | yes |
| 3 | D3 | 0 | no |
| 4 | D5 | 1 | yes |
| 5 | D4 | 0 | no |
| 6 | D2 | 2 | yes |
| 7 | D8 | 0 | no |
| 8 | D6 | 0 | no |
| 9 | D9 | 3 | yes |
| 10 | D10 | 0 | no |
This single table will be reused for every metric below. Memorize the shape. Two relevant chunks in the top-5, all four in the top-10, first relevant at rank 2.
Recall@k — did we even find it?
Recall@k asks one question. Of all the relevant chunks that exist, how many did we surface in the top-k?
Formula. recall@k = |relevant ∩ top-k| / |relevant|.
Worked on our example.
Total relevant in corpus = 4 (D1, D2, D5, D9)
Top-3 contains: D1 → 1 relevant
recall@3 = 1 / 4 = 0.25
Top-5 contains: D1, D5 → 2 relevant
recall@5 = 2 / 4 = 0.50
Top-10 contains: D1, D5, D2, D9 → 4 relevant
recall@10 = 4 / 4 = 1.00
Read those three lines slowly. Recall@5 jumped to 0.50. Recall@10 jumped to 1.00. The denominator never moved — the four relevant chunks always existed. Only k changed.
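The same arithmetic as a short Python sketch. The variable and function names are illustrative assumptions; the chunk IDs and the relevant set come straight from the running example. Note that the top-3 holds D7, D1, D3, so only D1 counts at k = 3.

```python
# Running example: chunk IDs in retrieved order, plus the golden relevant set.
RANKED = ["D7", "D1", "D3", "D5", "D4", "D2", "D8", "D6", "D9", "D10"]
RELEVANT = {"D1", "D2", "D5", "D9"}


def recall_at_k(ranked, relevant, k):
    """|relevant ∩ top-k| / |relevant|."""
    if not relevant:
        raise ValueError("recall is undefined when no relevant chunks exist")
    return len(set(ranked[:k]) & relevant) / len(relevant)


for k in (3, 5, 10):
    print(f"recall@{k} = {recall_at_k(RANKED, RELEVANT, k):.2f}")
```

The denominator is the size of the golden relevant set, not the number of relevant chunks the retriever happened to return. Dividing by the latter would make every system look perfect.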
What it tells you. Whether the right books exist in your shortlist at all. If recall@k is low, no amount of rerank or prompt tuning helps. The evidence is not in the room.
What it hides. Where in the top-k the relevant chunks landed. Recall@10 = 1.00 looks perfect — but if all four relevant chunks landed at ranks 7-10, the LLM with a top-3 budget still loses.
Mini-FAQ. "Recall vs precision — when each?" Recall asks "did we find them all?" Precision@k asks "of the k we showed, how many were relevant?" On our example, precision@5 = 2/5 = 0.40, precision@10 = 4/10 = 0.40. Use recall when missing evidence is dangerous — compliance, legal, medical. Use precision when the reading desk is tight and every chunk costs tokens. In a RAG pipeline, recall@k matters more upstream (retrieval), precision matters more downstream (after rerank, before stuffing).
Mini-FAQ. "What is hit@k?" Hit@k is the binary cousin of recall@k. Hit@k = 1 if at least one relevant chunk is in top-k, else 0. On our example, hit@1 = 0 (D7 at rank 1 is not relevant), hit@2 = 1 (D1 lands). Hit@k is what FAQ bots and "one true answer" lookups actually care about.
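A minimal sketch of hit@k on the running example, assuming the same list-and-set encoding (the names are illustrative). Averaging the 0/1 values over a whole golden set gives a hit rate.

```python
RANKED = ["D7", "D1", "D3", "D5", "D4", "D2", "D8", "D6", "D9", "D10"]
RELEVANT = {"D1", "D2", "D5", "D9"}


def hit_at_k(ranked, relevant, k):
    """1 if at least one of the top-k chunks is relevant, else 0."""
    return int(any(chunk in relevant for chunk in ranked[:k]))


print("hit@1 =", hit_at_k(RANKED, RELEVANT, 1))
print("hit@2 =", hit_at_k(RANKED, RELEVANT, 2))
```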
Precision@k — of what we showed, how much was good?
Formula. precision@k = |relevant ∩ top-k| / k.
Notice precision can fall as k grows, while recall can only rise or stay flat. On our example, precision@2 = 1/2 = 0.50 but precision@3 = 1/3 ≈ 0.33. They pull against each other. That is why we look at both.
In production, the operating range is precision@n where n is the reading desk size — usually 3 to 8 chunks. Higher precision@n at the stuffing step means less wasted prompt space.
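To see the two metrics pull against each other, here is a small sketch that sweeps k over the running example. The helper names are assumptions made for illustration.

```python
RANKED = ["D7", "D1", "D3", "D5", "D4", "D2", "D8", "D6", "D9", "D10"]
RELEVANT = {"D1", "D2", "D5", "D9"}


def relevant_in_top_k(ranked, relevant, k):
    """Count how many of the top-k chunks are in the relevant set."""
    return sum(1 for chunk in ranked[:k] if chunk in relevant)


def precision_at_k(ranked, relevant, k):
    """|relevant ∩ top-k| / k."""
    return relevant_in_top_k(ranked, relevant, k) / k


def recall_at_k(ranked, relevant, k):
    """|relevant ∩ top-k| / |relevant|."""
    return relevant_in_top_k(ranked, relevant, k) / len(relevant)


for k in (2, 3, 5, 10):
    p = precision_at_k(RANKED, RELEVANT, k)
    r = recall_at_k(RANKED, RELEVANT, k)
    print(f"k={k:>2}  precision={p:.2f}  recall={r:.2f}")
```

Precision wobbles between 0.33 and 0.50 as k grows, while recall climbs from 0.25 to 1.00. Neither number alone tells you what a 3-chunk reading desk will actually receive.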
MRR — how quickly did the first good chunk show up?
Mean Reciprocal Rank. For each query, find the rank of the first relevant chunk. Take 1 divided by that rank. Average across all queries in the golden set.
Formula. MRR = (1/N) × Σ (1 / rank_of_first_relevant_chunk_i).
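On our example, the first relevant chunk is D1 at rank 2, so the reciprocal rank for this query is 1/2 = 0.50.
If our golden set had three queries, with first relevant at ranks 2, 1, and 5, then MRR = (1/2 + 1/1 + 1/5) / 3 = 1.70 / 3 ≈ 0.567.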
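A minimal Python sketch of both steps: the per-query reciprocal rank, then the mean across queries. The function names are illustrative, and returning 0.0 when no relevant chunk is retrieved is a common convention assumed here, not something the text specifies.

```python
RANKED = ["D7", "D1", "D3", "D5", "D4", "D2", "D8", "D6", "D9", "D10"]
RELEVANT = {"D1", "D2", "D5", "D9"}


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant chunk; 0.0 if none is retrieved."""
    for rank, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(reciprocal_ranks):
    """Average the per-query reciprocal ranks."""
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


rr_example = reciprocal_rank(RANKED, RELEVANT)
# Three-query golden set with first relevant chunks at ranks 2, 1, and 5.
mrr_three = mean_reciprocal_rank([1 / 2, 1 / 1, 1 / 5])
print(f"RR = {rr_example:.2f}, MRR = {mrr_three:.3f}")
```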
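Why reciprocal? Because slipping from rank 1 to rank 2 should hurt far more than slipping from rank 9 to rank 10. The first slip halves the score (1.00 to 0.50); the second barely moves it (0.11 to 0.10). The reciprocal punishes misses at the top hard and flattens the tail.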
| Rank of first relevant | Reciprocal | Felt impact |
|---|---|---|
| 1 | 1.00 | Instant answer |
| 2 | 0.50 | One scroll |
| 5 | 0.20 | Visible scroll |
| 10 | 0.10 | User gives up |
| 50 | 0.02 | Effectively zero |
Use MRR when one true doc exists per query. Support FAQ. "What is my account balance?" There is exactly one right page. You only need it early.
Do not use MRR when several relevant docs matter. A research query needs many of them. MRR rewards finding the first one and stops paying attention.
Mini-FAQ. "What's the difference between MRR and nDCG?" MRR cares about the first relevant hit. nDCG cares about the whole ranking, with graded relevance, log-discounted by position. MRR behaves like a stripped-down relative of nDCG: relevance is binary, only the first hit counts, and the discount is 1/rank rather than 1/log2(rank + 1), so it is not a strict special case. Use MRR for one-answer-per-query tasks. Use nDCG when many chunks combine to form a good answer.
nDCG — did we rank stronger evidence higher?
Normalized Discounted Cumulative Gain. The name is heavy. The idea is two moves: reward higher relevance more, and discount it by position.
Three pieces.
- Gain. A relevance grade per result. Use 2^grade − 1 by convention (so grade 3 → gain 7, grade 2 → gain 3, grade 1 → gain 1, grade 0 → gain 0).
- Discount. Divide each gain by log2(rank + 1). Rank 1 is divided by 1 (no penalty); the deeper the rank, the larger the divisor and the less the gain counts.
- Normalize. Compare DCG against the ideal DCG (IDCG) — what you would get if you ranked every relevant doc perfectly first.
Formula. DCG@k = Σ (2^grade_i − 1) / log2(i + 1), summed over ranks i = 1 to k. nDCG@k = DCG@k / IDCG@k.
Now the worked computation on our example at k=10.
| Rank i | Chunk | Grade | Gain = 2^grade − 1 | log2(i+1) | Discounted gain |
|---|---|---|---|---|---|
| 1 | D7 | 0 | 0 | 1.000 | 0.000 |
| 2 | D1 | 3 | 7 | 1.585 | 4.416 |
| 3 | D3 | 0 | 0 | 2.000 | 0.000 |
| 4 | D5 | 1 | 1 | 2.322 | 0.431 |
| 5 | D4 | 0 | 0 | 2.585 | 0.000 |
| 6 | D2 | 2 | 3 | 2.807 | 1.069 |
| 7 | D8 | 0 | 0 | 3.000 | 0.000 |
| 8 | D6 | 0 | 0 | 3.170 | 0.000 |
| 9 | D9 | 3 | 7 | 3.322 | 2.107 |
| 10 | D10 | 0 | 0 | 3.459 | 0.000 |
Now the ideal. If we ranked perfectly: D1 (grade 3) at rank 1, D9 (grade 3) at rank 2, D2 (grade 2) at rank 3, D5 (grade 1) at rank 4. The rest are zero.
| Rank i | Ideal grade | Gain | log2(i+1) | Discounted |
|---|---|---|---|---|
| 1 | 3 | 7 | 1.000 | 7.000 |
| 2 | 3 | 7 | 1.585 | 4.416 |
| 3 | 2 | 3 | 2.000 | 1.500 |
| 4 | 1 | 1 | 2.322 | 0.431 |
| 5..10 | 0 | 0 | — | 0.000 |
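DCG@10 = 4.416 + 0.431 + 1.069 + 2.107 = 8.023
IDCG@10 = 7.000 + 4.416 + 1.500 + 0.431 = 13.347
nDCG@10 = 8.023 / 13.347 ≈ 0.601
Read that ratio. The system scored 60.1% of the best possible ranking quality. Recall@10 said 1.00: perfect. nDCG@10 said 0.60: meh. Same retrieval result. nDCG saw the missed opportunity at rank 1 that recall could not see.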
That is exactly the System A vs System B gap from the opening hook. nDCG is the metric that distinguishes them.
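The two tables above can be reproduced with a short Python sketch. It assumes the 2^grade − 1 gain and log2(rank + 1) discount stated earlier; the names are illustrative. Some libraries default to a linear gain instead, so check the convention before comparing scores across tools.

```python
import math

# Grades in retrieved order: D7, D1, D3, D5, D4, D2, D8, D6, D9, D10.
RETRIEVED_GRADES = [0, 3, 0, 1, 0, 2, 0, 0, 3, 0]
# Grades of every relevant chunk in the golden set: D1, D2, D5, D9.
GOLDEN_GRADES = [3, 2, 1, 3]


def dcg_at_k(grades, k):
    """Sum of (2^grade - 1) / log2(rank + 1) over the first k ranks."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(retrieved_grades, golden_grades, k):
    """DCG@k divided by the DCG@k of the ideal (descending-grade) ordering."""
    ideal = sorted(golden_grades, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(retrieved_grades, k) / idcg if idcg > 0 else 0.0


dcg = dcg_at_k(RETRIEVED_GRADES, 10)
idcg = dcg_at_k(sorted(GOLDEN_GRADES, reverse=True), 10)
ndcg = ndcg_at_k(RETRIEVED_GRADES, GOLDEN_GRADES, 10)
print(f"DCG@10 = {dcg:.3f}, IDCG@10 = {idcg:.3f}, nDCG@10 = {ndcg:.3f}")
```

The ideal ordering is built from the golden set, not from whatever the retriever returned. If a grade-3 chunk were missing from the top-10 entirely, IDCG would still count it, and nDCG would drop accordingly.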
Use nDCG when graded relevance matters — when several chunks contribute, and putting the strongest one first matters more than putting any-relevant first.
MAP — averaging precision across the relevance ladder
Mean Average Precision. For one query, compute precision at each rank where a relevant chunk appears, then average those values. Then average across queries.
On our example:
First relevant at rank 2 (D1): precision@2 = 1/2 = 0.500
Second relevant at rank 4 (D5): precision@4 = 2/4 = 0.500
Third relevant at rank 6 (D2): precision@6 = 3/6 = 0.500
Fourth relevant at rank 9 (D9): precision@9 = 4/9 ≈ 0.444
AP = (0.500 + 0.500 + 0.500 + 0.444) / 4 ≈ 0.486
MAP = average of AP over all queries
MAP rewards both coverage and early ranking. It is binary on relevance (yes/no) but sensitive to position. Many academic benchmarks still report MAP.
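The AP walk-through above, as a minimal sketch. The function names are illustrative. Dividing by the size of the golden relevant set (rather than by the number of relevant chunks retrieved) is the usual convention and is assumed here; on the running example both give the same result because all four relevant chunks were retrieved.

```python
RANKED = ["D7", "D1", "D3", "D5", "D4", "D2", "D8", "D6", "D9", "D10"]
RELEVANT = {"D1", "D2", "D5", "D9"}


def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant chunk appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant chunks that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(ap_values):
    """Average AP over all queries in the golden set."""
    return sum(ap_values) / len(ap_values)


ap = average_precision(RANKED, RELEVANT)
print(f"AP = {ap:.3f}")
```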
Predict recall@3 and the diagnostic metric before reading on
Before reading on, answer two questions on paper.
- If our system's top-3 budget feeds the LLM, what is recall@3 in this example, and which relevant chunks make it through?
- Which metric — recall@3, MRR, or nDCG@3 — would best diagnose "the first chunk is junk"?
Now continue.
The golden set — without it, every metric lies
Every metric above assumes you know which chunks are relevant for each query. That ground truth is the golden set. Without it, you cannot compute any of these numbers honestly.
A golden set has three pieces per query:
- The query string.
- The list of relevant chunk IDs (and grades if using nDCG/MAP).
- Metadata — query type, intent, difficulty tag.
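One way to hold those three pieces is a small record type. This is a sketch only: the class name, field names, and metadata values are hypothetical, chosen to mirror the list above and the refund query from the running example.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenQuery:
    """One golden-set entry: query text, graded relevant chunks, and tags."""

    query: str
    relevant: dict[str, int]  # chunk ID -> relevance grade (1 to 3)
    metadata: dict[str, str] = field(default_factory=dict)

    def relevant_ids(self):
        """Binary view of the labels, for recall, MRR, and MAP."""
        return set(self.relevant)


REFUND_QUERY = GoldenQuery(
    query="Can enterprise customers request a refund after 30 days?",
    relevant={"D1": 3, "D2": 2, "D5": 1, "D9": 3},
    # Hypothetical tag values, used only to show the shape of the metadata.
    metadata={"query_type": "policy_lookup", "intent": "support", "difficulty": "medium"},
)

print(sorted(REFUND_QUERY.relevant_ids()))
```

Keeping grades in the record and deriving the binary set from them means one golden set can feed both the graded metric (nDCG) and the binary ones.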
How big? 50 queries to start sanity checking. 200 to track week-over-week regression. 1000+ to evaluate model swaps confidently.
Mini-FAQ. "How do you build a golden set if you don't have one?" Four pragmatic sources:
1. Log mining. Pull real user queries from production. Sample across intent buckets so support, research, and lookup are all represented.
2. LLM-generated questions. Feed each document to an LLM and ask it to generate 3 questions the doc answers. Then the doc is the golden chunk for that query. Cheap. Beware: questions may be too easy.
3. Subject-matter expert (SME) labeling. Slowest, most expensive, most trustworthy. Required for legal/medical/financial domains.
4. User feedback loops. Thumbs up/down on retrieved chunks in the product. Free golden data over time.
A mix beats any single source. Start with LLM-generated, validate a sample with humans, augment with production logs once the system is live.
Leakage warning. If the golden set queries also live inside your training data (for fine-tuned embedders) or in your few-shot examples (for query rewriters), the score is inflated. Hold out the eval set. Never train on it. Re-shuffle when in doubt.
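A first-pass leakage check can be as simple as comparing normalized query strings. The sketch below is an assumption about how you might do it, with hypothetical function names and sample data. It only catches exact and near-exact duplicates; paraphrased leaks need fuzzy or embedding-based matching on top.

```python
import re


def normalize(query):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", query.lower()).strip()


def find_leaked_queries(eval_queries, training_queries):
    """Return eval queries whose normalized form also appears in training data."""
    training_keys = {normalize(q) for q in training_queries}
    return [q for q in eval_queries if normalize(q) in training_keys]


# Hypothetical sample data for illustration.
eval_queries = [
    "Can enterprise customers request a refund after 30 days?",
    "How do I rotate an API key?",
]
training_queries = [
    "can enterprise customers request a refund after 30 days",
    "What is the SLA for premium support?",
]

leaked = find_leaked_queries(eval_queries, training_queries)
print(leaked)
```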
Offline vs online — two different jobs
Offline metrics use a static golden set. You run them on demand. Recall@k, MRR, nDCG, MAP — these are the offline crew. Good for catching regressions before deploy. Good for comparing two retrievers head-to-head. Cheap, deterministic, repeatable.
Online metrics use real users. Click-through rate on cited sources. Thumbs-up rate. Abstention rate. Session length. Follow-up question rate. These tell you whether the user found the system useful. Slow, noisy, but honest.
Healthy teams run both. Offline catches the obvious regression in CI. Online catches the subtle thing the golden set never imagined — like users asking a brand-new question category your golden set does not cover.
Failure modes — where metrics quietly lie
These are the traps that look like progress.
- Single-answer task, ranking metric. You use nDCG on an FAQ where there is one true page. The metric rewards spreading relevance across ranks. You ship a retriever that looks better on nDCG and worse for users.
- Multi-doc task, single-hit metric. You use MRR on a research query that needs 5 sources combined. MRR is 1.0 because the first chunk is relevant. But you never measure whether chunks 2-5 were also good. Users get partial answers and you cannot see it.
- No golden set. You eyeball ranking quality on 5 queries. You ship. Production traffic hits a query distribution your 5 queries never sampled. Quality cratered weeks ago — you have no instrument to see it.
- Golden-set leakage. Your fine-tuned embedder saw the eval queries during training. Offline recall@10 = 0.95. Production recall@10 = 0.60. The gap is a shock and you cannot explain it.
- Top-k tunnel vision. You only measure recall@10 but the user-facing system shows top-3. Recall@10 = 1.0 means nothing if the four relevant chunks are at ranks 7, 8, 9, 10.
- Long tail invisible. Your top-50 catches almost everything, so recall@50 is great. But the reranker fails to lift those chunks into the top-3 you actually show. A strong recall@50 is hiding a ranking problem between the candidate pool and the prompt.
- Graded labels treated as binary. You squash grades 3, 2, 1 into "relevant." You lose nDCG's signal and nothing tells you that the highly relevant chunk is at rank 8.
- One number obsession. A single dashboard tile shows nDCG@10. A change ships. nDCG@10 went up 2 points. But recall@3 quietly went down 8 points. The user with a small prompt budget is now worse off.
The fix to every one of these: track more than one metric, and pick the metric to match the task.
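One way to enforce "more than one metric" is a regression gate that compares every tracked metric, not just the headline tile. The sketch below is illustrative: the function name, the tolerance, and the scores are hypothetical, with the numbers chosen to mirror the "nDCG up 2 points, recall@3 down 8 points" case above.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: (old, new)} for each metric that dropped by more than tolerance.

    Raises KeyError if the candidate run is missing a baseline metric,
    so a metric cannot silently disappear from the report.
    """
    return {
        name: (old, candidate[name])
        for name, old in baseline.items()
        if old - candidate[name] > tolerance
    }


# Hypothetical scores for illustration.
baseline = {"ndcg@10": 0.80, "recall@3": 0.70, "mrr": 0.65}
candidate = {"ndcg@10": 0.82, "recall@3": 0.62, "mrr": 0.66}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

A gate like this would fail the change on recall@3 even though the headline nDCG@10 improved.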
Retrieval metrics across eval, search, and observability stacks
These metrics appear across the eval, search, and observability stacks.
- RAGAS — open-source RAG eval framework. Computes context recall, context precision, and answer relevance against a golden set.
- TruLens — LLM eval and observability with retrieval scoring built in.
- Arize Phoenix — LLM-eval platform with retrieval metrics, embedding drift, and tracing.
- LangSmith — LangChain's eval and tracing platform; supports custom retrieval evaluators on datasets.
- LangFuse — open-source LLM observability with eval pipelines including retrieval metrics.
- Weights & Biases Prompts / Weave — experiment tracking for RAG evals over time.
- Helicone — LLM observability with retrieval traces and custom evals.
- Comet Opik — LLM eval platform with built-in retrieval metric primitives.
- BeIR benchmark — academic retrieval benchmark; reports nDCG@10 as primary metric across 18 datasets.
- MTEB (Massive Text Embedding Benchmark) — community standard for embedder evaluation; uses nDCG, MAP, recall.
- MS MARCO — Microsoft's passage ranking benchmark; canonical MRR@10 leaderboard.
- TREC — long-running IR evaluation series; MAP and nDCG are the staple metrics.
- ir_measures — Python library for IR metric computation. Standard in academic IR.
- pytrec_eval — Python wrapper around trec_eval; reference implementation.
- ranx — fast IR metrics library with statistical testing.
- sentence-transformers — has an `InformationRetrievalEvaluator` that computes recall@k, MRR, nDCG, MAP during training.
- Pinecone Inference + Eval — managed retrieval scoring inside Pinecone's stack.
- Vectara — RAG-as-a-service; reports retrieval and faithfulness metrics per query.
- Cohere Rerank eval — Cohere documents nDCG@10 deltas across rerank tiers.
- Vespa evaluation — built-in ranking experiment tooling with offline replay.
- Elastic Relevance Workbench — A/B retrieval experiments with MRR and nDCG.
- OpenSearch Search Relevance — judgment-set-driven retrieval eval in the OpenSearch UI.
- Glean — enterprise search reports nDCG and CTR as paired offline/online metrics.
- Perplexity — uses online click and follow-up signals as their dominant eval surface.
- Notion AI Q&A — internal eval over workspace golden sets before each release.
- GitHub Copilot retrieval eval — code retrieval evaluated with MRR-style metrics on golden code-search sets.
Same metric family. Different surface, different golden set.
Recall — eight questions on the metric zoo
- Two systems have identical recall@10. Which metric reveals which one ranks better?
- On our worked example, what is recall@5 and why is it lower than recall@10?
- Why does nDCG use 2^grade − 1 instead of just the grade?
- When would you choose MRR over nDCG, and vice versa?
- What is the difference between hit@k and recall@k?
- Name three sources for building a golden set when you have none.
- Why does golden-set leakage inflate offline scores but not online performance?
- What is the single biggest reason to track more than one retrieval metric?
Interview Q&A
Q1. How do you measure if your RAG retrieval is good? A. Build a golden set of queries with labeled relevant chunks. Compute recall@k to confirm the right chunks exist in the candidate pool, MRR or hit@k for single-answer queries, and nDCG@k for graded multi-doc queries. Track over time, separate offline metrics (golden set) from online (click-through, thumbs, abstention). Always check the metric matches the task. Common wrong answer to avoid: "I look at the final answers and judge them by feel."
Q2. Recall@k = 1.0 and users still complain. What's happening? A. The relevant chunks are in the top-k but ranked too low to fit the reading desk. The selector takes top-n (n < k) and misses them. The fix: measure recall@n where n is the prompt budget, or add nDCG@k to see the ranking quality recall@k cannot show. Common wrong answer to avoid: "Recall@k = 1.0 means retrieval is fine, blame the LLM."
Q3. When would MRR mislead you? A. On multi-document tasks. MRR rewards the first relevant hit and stops. If your task needs 5 sources combined — research, summarization, comparison — MRR can be perfect while the system still under-retrieves. Common wrong answer to avoid: "MRR is always a good summary metric."
Q4. Why is nDCG normalized? A. Because raw DCG depends on how many relevant docs exist for each query. A query with 10 relevant docs has higher achievable DCG than one with 2. Normalizing by IDCG makes scores comparable across queries. Common wrong answer to avoid: "To keep it between 0 and 1 for aesthetic reasons."
Q5. How do you build a golden set from scratch? A. Start with LLM-generated questions per document — fast and cheap. Validate a sample with human SMEs. Mine real user queries from production logs to capture the live distribution. Add user thumbs-up/down as a continuous source. Hold queries out of any training data to prevent leakage. Tag by intent so you can slice metrics. Common wrong answer to avoid: "Hand-label 50 queries and never update it."
Q6. Why split offline and online metrics? A. Offline metrics are cheap, repeatable, fast, and great for catching regressions in CI before deploy. But they only measure what the golden set already imagines. Online metrics — click-through, follow-up rate, thumbs — catch the distribution shifts the golden set never sampled. You need both. Common wrong answer to avoid: "Online metrics replace offline metrics once the system is live."
Q7. Your team reports nDCG@10 = 0.85 and ships a new embedder. Three weeks later, user satisfaction drops. What investigations would you run? A. First, check if the golden set distribution still matches production query distribution. Mine recent queries; see if a new intent cluster emerged. Second, recompute nDCG@k at the actual k that hits the prompt — often k=3 or k=5, not 10. Third, check for golden-set leakage in the new embedder's training data. Fourth, look at per-bucket nDCG, not the aggregate; one query class could have collapsed. Common wrong answer to avoid: "Roll back the embedder."
Q8. Recall@k vs precision@k — which one matters more, and when? A. They matter at different stages. Recall@k matters upstream — at retrieval, k is large (20-100), and you want to ensure the relevant chunks are somewhere in the candidate pool. Precision matters downstream — after rerank and at selection time, you have a tight prompt budget and need most of the chunks you show to be relevant. A good pipeline optimizes recall at retrieval, precision after rerank. Common wrong answer to avoid: "Always optimize for the F1 of the two."
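The per-bucket check from Q7 can be sketched in a few lines. The bucket names and per-query scores below are hypothetical; the point is only that a healthy-looking aggregate can sit on top of one collapsed query class.

```python
from collections import defaultdict
from statistics import mean


def mean_by_bucket(rows):
    """rows: iterable of (bucket, score) pairs. Returns {bucket: mean score}."""
    buckets = defaultdict(list)
    for bucket, score in rows:
        buckets[bucket].append(score)
    return {bucket: mean(scores) for bucket, scores in buckets.items()}


# Hypothetical per-query nDCG scores tagged by intent bucket.
rows = (
    [("faq", 0.90)] * 4
    + [("research", 0.85)] * 4
    + [("billing", 0.20)] * 2
)

overall = mean(score for _, score in rows)
per_bucket = mean_by_bucket(rows)
print(f"overall = {overall:.2f}")
for bucket, score in sorted(per_bucket.items()):
    print(f"{bucket:<9} {score:.2f}")
```

The aggregate lands around 0.74, which may pass a dashboard glance, while the billing bucket sits at 0.20.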
Apply now (10 min)
Step 1 — model the exercise. Here is the metric-selection table I would build for our refund-policy example:
| Task type | Best primary metric | Why | Secondary |
|---|---|---|---|
| One true doc per query (FAQ) | hit@3 or MRR | Need it early | recall@10 as sanity |
| Several relevant chunks (research) | nDCG@10 | Graded ranking matters | recall@10 |
| Compliance / legal lookup | recall@k at the prompt-budget k | Missing a fact is dangerous | precision@n |
| Long-tail rare queries | nDCG@k + per-bucket slicing | Aggregates hide regressions | online thumbs |
Step 2 — your turn. Pick five real queries from your product. For each, label which task type it is from the table above. Then commit to one primary metric per query type. Write it down.
Step 3: score the example yourself. Without scrolling up, recompute recall@5, the reciprocal rank, and nDCG@10 on the running example. Check your numbers against the worked computations.
Step 4 — sketch from memory. Draw the top-10 retrieval table. Beside each rank, write the grade. Underneath, write the three metric values. If you can do this cold, you understand the page.
What you should remember
This chapter explained why "the answers look good" is not a metric and why a single dashboard tile lies. The metric zoo — recall@k, hit@k, MRR, nDCG@k, MAP — exists because different tasks have different failure shapes. Single-answer FAQ wants hit@k or MRR. Multi-source research wants nDCG with graded labels. Compliance lookup wants recall at the prompt-budget k, not at k=10. Pick the metric that matches the task, then track at least one secondary so a single score cannot hide a regression in the other dimension.
You also learned that every metric assumes a golden set, and a leaked golden set silently inflates offline numbers. Mine production logs, use LLM-generated questions to bootstrap, validate with SMEs in regulated domains, and never let eval queries reach training data. Offline catches regressions in CI; online (CTR, thumbs, follow-up rate) catches the distribution shifts the golden set never imagined.
Carry this diagnostic forward: when offline metrics are green and users complain, recompute the metric at the real k your prompt budget uses, then slice per query bucket. The aggregate hides which bucket collapsed.
Remember:
- Choose the metric to match the task. Single-answer ≠ multi-source ≠ compliance lookup.
- Recall@k at the wrong k is theatre. Use the k your prompt budget actually consumes.
- Golden-set leakage is the silent inflator. Hold queries out of training; re-shuffle when in doubt.
- Offline + online together. Offline catches regressions; online catches blind spots.
- Per-bucket metrics beat aggregates. One query class collapsing is invisible in the global average.
Bridge. Retrieval metrics grade the librarian — did the right books reach the reading desk, and in what order? But there is a second failure mode the librarian cannot cause. The writer can have all the right evidence open and still write something the evidence does not support. That is faithfulness. The next file goes deep on how to measure it.