11. Evaluation Metrics for IR: Measure whether the delivery route is actually good
~15 min read. Search quality feels subjective, but we still need disciplined ways to measure it.
Built on the ELI5 in 00-eli5.md. The address label has already produced candidate letters and a ranked delivery route. Now we judge whether that route and its postmark score logic are good enough.
1) Precision@K and Recall@K
Look. Precision@K asks a narrow question: of the top K results, how many are relevant?
Recall@K asks the complementary question: of all relevant results that exist, how many did we capture in the top K?
Worked mini example. Suppose the top 5 returned letters are:
[relevant, relevant, not, relevant, not]
Relevant count inside the top 5 = 3. So Precision@5 = 3/5 = 0.60.
Now suppose there are 6 relevant letters in total. Then Recall@5 = 3/6 = 0.50.
See. Precision cares about cleanliness. Recall cares about coverage. Both matter.
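The mini example can be reproduced with a short Python sketch. The function names and the 1/0 label encoding are illustrative assumptions, not part of any particular library, and precision here divides by K itself.

```python
def precision_at_k(labels, k):
    """Fraction of the top-k slots filled by relevant results (denominator is k)."""
    return sum(labels[:k]) / k


def recall_at_k(labels, k, total_relevant):
    """Fraction of all relevant items in the corpus that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(labels[:k]) / total_relevant


# 1 = relevant, 0 = not relevant, in ranked order (the five letters above).
ranked = [1, 1, 0, 1, 0]
p5 = precision_at_k(ranked, 5)
r5 = recall_at_k(ranked, 5, total_relevant=6)
print(f"Precision@5 = {p5:.2f}, Recall@5 = {r5:.2f}")
# prints: Precision@5 = 0.60, Recall@5 = 0.50
```

Note that recall needs the total number of relevant items, which the ranked list alone cannot tell you. That number has to come from judgments.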
2) MRR: did the first good result arrive early?
Mean Reciprocal Rank is excellent when users want one good answer fast.
For one query, find the rank of the first relevant result.
If it appears at rank 1, reciprocal rank = 1/1 = 1.0.
If it appears at rank 3, reciprocal rank = 1/3 ≈ 0.333.
If no relevant result appears, score = 0.
Average that over many queries, and you get MRR. Simple, no?
MRR is strong for navigational search, FAQ lookup, and support search where the first useful hit matters most.
But it ignores later relevant results. That becomes a limitation for richer search tasks.
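A minimal Python sketch of the same rule, using the three cases just described (rank 1, rank 3, no hit). The helper names and the toy query lists are assumptions made for illustration.

```python
def reciprocal_rank(labels):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a non-empty list of per-query label lists."""
    return sum(reciprocal_rank(labels) for labels in queries) / len(queries)


queries = [
    [1, 0, 0],  # first relevant at rank 1 -> 1.0
    [0, 0, 1],  # first relevant at rank 3 -> 1/3
    [0, 0, 0],  # nothing relevant         -> 0.0
]
mrr = mean_reciprocal_rank(queries)
print(round(mrr, 3))  # 0.444
```

Everything after the first relevant hit is skipped by the loop, which is exactly the limitation noted above.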
3) NDCG: graded relevance with position decay
Now the more senior metric. NDCG handles graded relevance, not just relevant versus not relevant.
A perfect answer may deserve 3. A decent answer may deserve 2. A weak but useful answer may deserve 1.
The metric also discounts lower positions. A highly relevant result at rank 1 matters more than the same result at rank 5.
Standard DCG formula (exponential-gain form), summed over positions i = 1 to K:
DCG@K = Σ (2^rel_i - 1) / log2(i + 1)
Then normalize by the ideal ranking:
NDCG@K = DCG@K / IDCG@K
Picture first.
gain
▲
│ rank1 keeps full value
│ rank2 gets discounted
│ rank3 gets discounted more
│ rank4 gets discounted more
└──────────────────────────────▶ rank position
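To put numbers on the picture, this small Python sketch prints the discount factor 1 / log2(rank + 1) from the DCG formula for the first five ranks. The function name `discount` is an illustrative choice.

```python
import math


def discount(rank):
    """Position discount used in the DCG formula: 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1)


for rank in range(1, 6):
    print(rank, round(discount(rank), 3))
# 1 1.0
# 2 0.631
# 3 0.5
# 4 0.431
# 5 0.387
```

Rank 1 keeps its full gain, and the drop from rank 1 to rank 2 is the steepest step in the curve.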
4) Worked numerical example: NDCG@5 step by step
Suppose returned relevance labels are:
[3, 0, 2, 1, 0]
Compute DCG@5.
Position 1:
(2^3 - 1) / log2(2) = 7 / 1 = 7.000
Position 2:
(2^0 - 1) / log2(3) = 0 / 1.585 = 0
Position 3:
(2^2 - 1) / log2(4) = 3 / 2 = 1.500
Position 4:
(2^1 - 1) / log2(5) = 1 / 2.322 ≈ 0.431
Position 5:
(2^0 - 1) / log2(6) = 0 / 2.585 = 0
Add them:
DCG@5 = 7.000 + 0 + 1.500 + 0.431 + 0 = 8.931
Now compute the ideal order.
Sort labels descending:
[3, 2, 1, 0, 0]
Ideal DCG pieces:
Position 1:
7 / 1 = 7.000
Position 2:
3 / 1.585 ≈ 1.893
Position 3:
1 / 2 = 0.500
Position 4:
0
Position 5:
0
So IDCG@5 = 7.000 + 1.893 + 0.500 = 9.393
Finally:
NDCG@5 = 8.931 / 9.393 ≈ 0.951
That is quite strong. The list is not ideal, but it is close.
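The hand calculation can be checked with a short Python sketch. The function names are illustrative. As in the worked example, the ideal ranking is built by sorting only the returned labels; a fuller evaluation would usually build it from every judged document for the query, which is an assumption to keep in mind.

```python
import math


def dcg_at_k(rels, k):
    """DCG@k with exponential gain: sum of (2^rel - 1) / log2(position + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 1)
        for pos, rel in enumerate(rels[:k], start=1)
    )


def ndcg_at_k(rels, k):
    """DCG@k divided by the DCG@k of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


labels = [3, 0, 2, 1, 0]
print(round(dcg_at_k(labels, 5), 3))                        # 8.931
print(round(dcg_at_k(sorted(labels, reverse=True), 5), 3))  # 9.393
print(round(ndcg_at_k(labels, 5), 3))                       # 0.951
```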
5) Offline metrics versus online metrics
Offline metrics use labels or judgments. They are fast for experiments.
They are reproducible. NDCG, MRR, precision, and recall all live here.
Online metrics use live user behavior.
Common examples are:
- CTR
- dwell time
- reformulation rate
- zero-result rate
- add-to-cart rate for commerce search
Both worlds matter. Offline metrics help iteration speed.
Online metrics reveal real user behavior. But online metrics are noisier and slower.
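As a rough illustration of how three of these online metrics fall out of a search log, here is a Python sketch. The `SearchEvent` schema, its field names, and the sample events are all hypothetical, and CTR is assumed here to mean the fraction of searches that received at least one click.

```python
from dataclasses import dataclass


@dataclass
class SearchEvent:
    """Hypothetical log row, invented for this example."""
    query: str
    num_results: int
    clicked: bool        # at least one result was clicked
    reformulated: bool   # the user rewrote the query right afterwards


def online_metrics(events):
    """Compute simple per-search rates from a non-empty list of events."""
    n = len(events)
    return {
        "ctr": sum(e.clicked for e in events) / n,
        "zero_result_rate": sum(e.num_results == 0 for e in events) / n,
        "reformulation_rate": sum(e.reformulated for e in events) / n,
    }


events = [
    SearchEvent("red shoes", 40, True, False),
    SearchEvent("rde shoes", 0, False, True),
    SearchEvent("return policy", 12, False, True),
    SearchEvent("gift card", 8, True, False),
]
metrics = online_metrics(events)
print(metrics)
```

With only four events the rates are meaningless as estimates, which hints at why online metrics need far more traffic and time than an offline judged list.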
6) Why NDCG often beats MRR for ranked retrieval
MRR only cares about the first relevant hit. That is useful, but incomplete for many search experiences.
NDCG rewards putting highly relevant results early, while still recognizing secondary useful results.
So for document search, product search, and content discovery, NDCG is often the better offline north star. Yes?
Use the metric that matches the product task.
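To make the contrast concrete, here is a minimal Python sketch with two invented rankings of the same five judged documents. It assumes any label above 0 counts as relevant for the reciprocal rank, and it uses the exponential-gain DCG from section 3. Both lists put a relevant document first, so their reciprocal rank is identical, but only one of them puts the best document first.

```python
import math


def reciprocal_rank(rels):
    """1/rank of the first result with relevance > 0, else 0.0."""
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg(rels):
    return sum((2 ** rel - 1) / math.log2(pos + 1)
               for pos, rel in enumerate(rels, start=1))


def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


list_a = [3, 2, 1, 0, 0]  # best documents first
list_b = [1, 0, 0, 2, 3]  # a weak hit first, the best documents buried

for name, rels in (("A", list_a), ("B", list_b)):
    print(name, reciprocal_rank(rels), round(ndcg(rels), 3))
```

The reciprocal rank is 1.0 for both lists, while NDCG is 1.0 for list A and clearly lower for list B. MRR cannot see that list B buried its two strongest documents.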
6) Why not one aggregate accuracy score under this workload
The tempting alternative is one aggregate accuracy score. It keeps the system simple, and on a toy corpus it often looks good enough.
It breaks when teams need metrics that match ranking quality, not just answer vibes. At that point the search system needs an inspectable artifact: a judged result list with Precision@K, Recall@K, MRR, and NDCG. Without that artifact, the team argues about ranking quality from anecdotes instead of traces.
| Option | Works when | Fails when | Cost moves to |
|---|---|---|---|
| one aggregate accuracy score | corpus is small or intent is obvious | teams need metrics that match ranking quality, not just answer vibes | user trust and manual debugging |
| IR evaluation metrics | the failure can be measured before serving | traces or judgments are missing | indexing, scoring, evals, and review |
Mini-FAQ. "Is this overkill for a simple app?" Maybe. The mechanism earns its place when query failures are frequent, expensive, or hard to diagnose from the final result list alone.
7) Production signals: know whether IR evaluation metrics are working
Healthy behavior: a judged result list with Precision@K, Recall@K, MRR, and NDCG explains why the top results changed.
First metric to watch: metric disagreement by query class, for example NDCG rising while MRR falls on one class of queries.
Misleading metric: average click-through rate. Clicks are position-biased, presentation-biased, and often do not prove satisfaction.
Expert graph: break quality down by query class, source type, language, freshness, and result position. Search failures hide in slices long before the global dashboard moves.
bad search result
-> query trace
-> candidate generation
-> scoring / ranking artifact
-> judged list or user feedback
-> targeted tuning change
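Breaking quality down by slice can be as simple as grouping per-query scores. In this Python sketch the query classes and the NDCG@5 values are made-up numbers, and `mean_by_slice` is an illustrative helper rather than an existing API.

```python
from collections import defaultdict


def mean_by_slice(rows):
    """rows: iterable of (query_class, metric_value) pairs -> mean per class."""
    buckets = defaultdict(list)
    for query_class, value in rows:
        buckets[query_class].append(value)
    return {cls: sum(vals) / len(vals) for cls, vals in buckets.items()}


# Hypothetical per-query NDCG@5 scores, tagged with a query class.
rows = [
    ("navigational", 0.95), ("navigational", 0.90),
    ("typo", 0.40), ("typo", 0.30),
    ("long_question", 0.80), ("long_question", 0.85),
]

by_slice = mean_by_slice(rows)
overall = sum(value for _, value in rows) / len(rows)
worst = min(by_slice, key=by_slice.get)
print("overall:", round(overall, 2))
print("worst slice:", worst, round(by_slice[worst], 2))
```

In this toy data the overall mean looks acceptable while the typo slice sits far below it, which is the kind of failure a global dashboard hides.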
8) Boundary: where IR evaluation metrics help and where they do not
Strong fit: the failure is visible in candidate generation, scoring, ranking, or judged result lists.
Weak fit: the corpus does not contain the answer, the user's intent is unknowable, or the business objective is unresolved.
Pathology: the team keeps tuning scores when the real issue is missing content, stale content, or unclear product policy.
Scale limit: every extra retrieval branch, feature, or reranker spends latency and operator attention. Put expensive steps on the queries where the quality gain is measurable.
9) Wrong model: one metric defines search quality
The wrong model sounds plausible because it works on simple examples.
Production search is harsher. Users bring short queries, typos, rare entities, ambiguous intent, changing corpora, and biased feedback. The ranking stack has to expose those pressures instead of hiding them behind one score.
If IR evaluation metrics cannot change candidate generation, ranking, evaluation, or debugging, they are not carrying their weight.
10) Failure taxonomy for IR evaluation metrics
- Candidate failure — the right document never enters the candidate set.
- Scoring failure — the right document is present but ranked too low.
- Intent failure — the system optimizes for the wrong interpretation of the query.
- Calibration failure — scores from different sources are compared as if they mean the same thing.
- Feedback failure — clicks or labels reflect position, popularity, or annotator disagreement rather than true relevance.
- Freshness failure — stale documents outrank newer but necessary content.
- Debugging failure — no trace connects query, candidates, scores, and final route.
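The first two entries in this taxonomy can be told apart mechanically once a trace records both the candidate set and the final ranking. A minimal Python sketch, assuming a hypothetical trace with document ids; the function name and ids are invented for illustration.

```python
def classify_failure(relevant_id, candidate_ids, ranked_ids, k):
    """Separate candidate failures from scoring failures for one judged document."""
    if relevant_id not in candidate_ids:
        return "candidate failure"  # never retrieved, so ranking cannot help
    if relevant_id not in ranked_ids[:k]:
        return "scoring failure"    # retrieved, but ranked below the cutoff
    return "ok"


# Hypothetical trace for one query.
candidates = {"d1", "d2", "d3", "d7"}
ranked = ["d2", "d3", "d1", "d7"]

print(classify_failure("d9", candidates, ranked, k=3))  # candidate failure
print(classify_failure("d7", candidates, ranked, k=3))  # scoring failure
print(classify_failure("d2", candidates, ranked, k=3))  # ok
```

The other failure types (intent, calibration, feedback, freshness) need labels or metadata beyond this trace, so they are left out of the sketch.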
11) Pattern transfer: where this returns later
- RAG uses the same candidate-generation and ranking chain before answer synthesis.
- Vector databases make the latency and recall tradeoff physical.
- Advanced RAG reuses query understanding, hybrid fusion, and reranking under stricter evidence constraints.
- Evals in production reuse judged lists, slice analysis, and false-positive/false-negative review.
12) Design review checklist
- What failure does this mechanism catch: candidate, scoring, intent, calibration, feedback, or freshness?
- What artifact would you inspect first: query parse, postings list, vector neighbors, feature row, rerank score, or judged list?
- Why is one aggregate accuracy score weaker for this workload?
- Which query slice should improve first?
- Which latency, memory, or labeling cost rises first?
- What rollback signal tells you the tuning made search worse?
Where this lives in the wild
- Google Search: ranking analysts track NDCG and click metrics together.
- Amazon product search: search scientists watch CTR, add-to-cart, and zero-result rate.
- Support search at Atlassian: relevance engineers care about first useful hit and reformulation rate.
- Semantic Scholar: IR researchers use NDCG and Recall@K across benchmark datasets.
- GitHub Copilot Chat: ML engineers measure Recall@K for chunk retrieval and answer success downstream.
- Enterprise search: mixes exact product names, acronyms, policies, and natural-language questions.
- Ecommerce search: balances lexical match, popularity, personalization, freshness, and business rules.
- Support knowledge bases: need high recall for policy questions and high precision for top answers.
- Code search: exact identifiers and semantic intent both matter.
- Legal search: missing one relevant document can be worse than showing extra documents.
- Medical literature search: query expansion helps, but false positives are expensive.
- RAG retrievers: use IR as the evidence gateway before generation.
- Recommendation feeds: reuse ranking ideas even when the item source is not text.
- Ad search: relevance competes with auction and business constraints.
- Academic search: citations, freshness, author authority, and topical match all interact.
Recall checkpoint
- What does Precision@K measure that Recall@K does not?
- Why is MRR useful for one-good-answer tasks?
- In the NDCG example, what made the score high but not perfect?
- Why should offline and online metrics both be monitored?
- Which artifact would you inspect first for IR evaluation metrics?
- What query slice would you use to prove the improvement is real?
- What is the first cost this mechanism adds?
Interview Q&A
Q: Why is NDCG often preferred over MRR for general IR evaluation? A: Because NDCG handles graded relevance and discounts lower positions gracefully. MRR only cares about the first relevant result.
Common wrong answer to avoid: "MRR is always better because it is simpler."
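Q: Why can Precision@K improve while Recall@K worsens? A: Because a system can return a cleaner but narrower list, surfacing fewer total relevant results even as the visible list looks better. The trade-off appears when the cutoff or the size of the returned list changes, for example a higher Precision@3 alongside a lower Recall@10. At one fixed K over the same judged set, both metrics share the same numerator and move together.
Common wrong answer to avoid: "Precision and recall always move together."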
Q: Why are online metrics not enough by themselves? A: Because user behavior is biased by layout, seasonality, traffic mix, and experimentation noise. Offline judgments still help controlled comparison.
Common wrong answer to avoid: "CTR is pure relevance, so offline labels are unnecessary."
Q: Why can high offline relevance still fail in production? A: Because latency, freshness, UI presentation, and user intent mix all affect the real experience. A strong delivery route on paper may still disappoint live users.
Common wrong answer to avoid: "If NDCG goes up, product success is guaranteed."
Q: What artifact would you inspect first when IR evaluation metrics fail? A: I would inspect the judged result list with Precision@K, Recall@K, MRR, and NDCG, then walk backward to query parsing, candidate generation, and score construction.
Common wrong answer to avoid: "Just look at the final top result." — The final result hides whether the failure happened in candidate generation, scoring, or evaluation.
Q: How do you know the change helped rather than just moved scores around? A: Track metric disagreement by query class on a judged query slice and compare it with latency, zero-result rate, and false-positive review.
Common wrong answer to avoid: "The average score went up." — Search scores are not comparable across all rankers, branches, or query types unless calibrated.
Q: When should you avoid this mechanism? A: Avoid it when the corpus is tiny, the query class is low-risk, or the team lacks labels and traces to evaluate the change.
Common wrong answer to avoid: "More ranking sophistication is always better." — Sophistication without evaluation adds latency and hides failure modes.
Apply now (10 min)
Exercise. Take one ranked list of five results.
Assign relevance labels from 0 to 3. Now compute Precision@5 and NDCG@5.
Sketch. If two lists have the same precision but different NDCG, you have seen why graded ranking metrics matter.
- Reproduce from memory: explain IR evaluation metrics in terms of pressure, artifact, metric, boundary, and failure mode.
What you should remember
IR evaluation metrics exist because teams need metrics that match ranking quality, not just answer vibes. The practical question is not whether the score sounds elegant; it is whether the system can show why a document entered the candidate set, why it received its score, and why it appeared at its final position.
The artifact to inspect is the judged result list with Precision@K, Recall@K, MRR, and NDCG. If you cannot inspect it, you cannot reliably debug relevance.
Remember:
- Search quality fails in candidate generation, scoring, intent understanding, calibration, feedback, and freshness.
- Watch metric disagreement by query class before trusting global averages.
- A good ranking system is explainable enough for operators, not just accurate enough on a benchmark.
- Every retrieval upgrade should name the quality gain and the latency, memory, or labeling cost it adds.
Bridge. Once we can measure a search delivery route, the next job is to improve it using practical tuning knobs in production. → 12-search-relevance-tuning.md