04. BM25 — The scoring function search teams actually ship¶

~15 min read. BM25 keeps useful repetition, but stops repetition from becoming cheating.

Built on the ELI5 in 00-eli5.md. The sorting bins still fetch candidate letters for the address label. BM25 now gives each letter a sharper postmark score before the delivery route is finalized.

1) Picture first: mentions help, but only up to a point¶

Look. If a letter mentions machine learning once, that matters. If it mentions the phrase five times, that usually matters more.

If it mentions it one hundred times, you should become suspicious. Maybe the letter is long. Maybe it is repetitive. Maybe extra mentions are not adding new evidence.

That is the BM25 intuition. Term frequency should help, but with saturation. And long letters should not get a free advantage. ASCII picture first.

score contribution
▲
│                     k1=5
│                  ╭──────────
│              ╭───╯
│          ╭───╯
│      ╭───╯    k1=1.2
│  ╭───╯   ╭──────────
└──┴──────────────────────────▶ term frequency
   1   2   5   10   20

See. Both curves rise. Both curves flatten.

Lower k1 flattens earlier. That is saturation. Simple, no?
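To see the picture as numbers, here is a minimal Python sketch. It assumes a letter of exactly average length, so the length part of BM25 equals 1 and only k1 matters. The helper name tf_saturation is made up for this illustration.

```python
def tf_saturation(tf: float, k1: float) -> float:
    """BM25 term-frequency part for a document of average length.

    Assumption for this sketch: dl == avgdl, so the length factor is 1.
    """
    return tf * (k1 + 1) / (tf + k1)


for k1 in (1.2, 5.0):
    ceiling = k1 + 1  # the curve approaches k1 + 1 as tf grows
    for tf in (1, 2, 5, 10, 20):
        value = tf_saturation(tf, k1)
        share = value / ceiling
        print(f"k1={k1:<4} tf={tf:<3} contribution={value:.2f} "
              f"({share:.0%} of its ceiling)")
```

With k1 = 1.2 the curve is already past 80% of its ceiling at tf = 5. With k1 = 5 it is only halfway there at the same tf.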

2) The two knobs: k1 and b¶

BM25 has two famous parameters.

k1 controls term-frequency saturation. Higher k1 means repeated mentions keep helping longer. Lower k1 means extra repeats stop helping sooner. Typical range is about 1.2 to 2.0.

b controls document-length normalization. b = 0 means no length correction. b = 1 means full length correction. A common default is 0.75.

So BM25 asks two healthy questions. How much should repetition matter? And how much should length matter?

That is why teams trust it. It matches real search behavior better than raw TF-IDF.
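The b knob is easiest to see in isolation. The sketch below computes only the length factor, 1 - b + b × dl/avgdl, which BM25 multiplies into k1 in the denominator. The function name and the sample lengths are assumptions chosen for illustration.

```python
def length_factor(dl: float, avgdl: float, b: float) -> float:
    """Length correction used inside BM25: 1 - b + b * dl / avgdl."""
    return 1 - b + b * dl / avgdl


AVGDL = 100  # hypothetical average letter length, in tokens

for b in (0.0, 0.75, 1.0):
    half = length_factor(50, AVGDL, b)     # letter half as long as average
    double = length_factor(200, AVGDL, b)  # letter twice as long as average
    print(f"b={b:<4} half-length factor={half:.3f} double-length factor={double:.3f}")
```

A factor above 1 enlarges the denominator, so the same tf earns less. A factor below 1 does the opposite. At b = 0 both letters get a factor of 1 and length stops mattering.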

3) Formula, after the picture¶

A common BM25 term score looks like this:

score = IDF(q) × ((tf × (k1 + 1)) / (tf + k1 × (1 - b + b × dl/avgdl)))

Where:

IDF(q) = inverse document frequency of query term q (section 4 uses ln(1 + (N - n + 0.5)/(n + 0.5)), with N documents in the corpus and n of them containing q)
tf = term frequency in the document
dl = document length
avgdl = average document length in the corpus
k1 = saturation control
b = length normalization control

The total postmark score is the sum over query terms. Yes, the formula looks busy. But the story is simple.

More mentions help. Longer letters get discounted. Extra mentions flatten instead of growing forever.

Keep that picture in your head, and the formula becomes friendly.
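The formula can be sketched as a small scorer over tokenized letters. This is a minimal illustration, not how Lucene implements it. The function name bm25_scores and the three sample letters are assumptions, and the IDF form is the one used in the worked example in section 4.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document; query and docs are token lists."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        length_factor = 1 - b + b * len(doc) / avgdl
        total = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a missing term contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            total += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_factor)
        scores.append(total)
    return scores


docs = [
    "machine learning for search ranking".split(),
    "machine machine machine learning notes notes notes notes".split(),
    "postal routes and sorting bins".split(),
]
scores = bm25_scores(["machine", "learning"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The third letter shares no query term, so it scores exactly 0. The sum over query terms is the whole postmark score.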

4) Worked numerical example: query `machine learning`¶

We have three letters. Use k1 = 1.5 and b = 0.75.

Document lengths are:

D1 length = 100
D2 length = 300
D3 length = 60

Average length: avgdl = (100 + 300 + 60) / 3 = 460 / 3 ≈ 153.3

Term counts:

D1: machine=2, learning=2
D2: machine=6, learning=6
D3: machine=0, learning=0

Assume each query term appears in 2 of 3 docs. So for both terms, IDF = ln(1 + (3 - 2 + 0.5)/(2 + 0.5))

IDF = ln(1 + 1.5/2.5) = ln(1.6) ≈ 0.470

Now D1 term factor: Length part = 1 - b + b×dl/avgdl = 0.25 + 0.75×100/153.3 = 0.25 + 0.489 = 0.739

Multiply by k1: 1.5 × 0.739 = 1.109

Now term factor: (2 × 2.5) / (2 + 1.109) = 5 / 3.109 ≈ 1.608

Per-term BM25 score for D1: 0.470 × 1.608 ≈ 0.756

Two terms total: 0.756 + 0.756 = 1.512

Now D2 term factor. Length part = 0.25 + 0.75×300/153.3 = 0.25 + 1.467 = 1.717

Multiply by k1: 1.5 × 1.717 = 2.576

Term factor: (6 × 2.5) / (6 + 2.576) = 15 / 8.576 ≈ 1.749

Per-term BM25 score for D2: 0.470 × 1.749 ≈ 0.822

Two terms total: 0.822 + 0.822 = 1.644

D3 has no query terms, so score = 0.

Final order:

D2 = 1.644
D1 = 1.512
D3 = 0

Notice the key point. D2 mentions each term three times as often as D1. But its final score is only slightly higher.

That is BM25 doing useful work: saturation flattens the extra mentions, and length normalization discounts D2 for being three times longer.
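The arithmetic above can be checked with a few lines of Python. This sketch plugs in the stated lengths, counts, k1, and b directly. The variable names are made up here, and unrounded intermediate values can differ from the hand calculation in the last decimal.

```python
import math

K1, B = 1.5, 0.75
lengths = {"D1": 100, "D2": 300, "D3": 60}
tf_per_term = {"D1": 2, "D2": 6, "D3": 0}  # same count for "machine" and "learning"

avgdl = sum(lengths.values()) / len(lengths)
idf = math.log(1 + (3 - 2 + 0.5) / (2 + 0.5))  # each term appears in 2 of 3 docs


def doc_score(name: str) -> float:
    tf = tf_per_term[name]
    if tf == 0:
        return 0.0
    length_factor = 1 - B + B * lengths[name] / avgdl
    per_term = idf * tf * (K1 + 1) / (tf + K1 * length_factor)
    return 2 * per_term  # two query terms with identical statistics


totals = {name: doc_score(name) for name in lengths}
for name, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name} = {total:.3f}")
```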

5) Why BM25 beats TF-IDF in practice¶

TF-IDF often over-rewards repeated terms. It also handles document length more crudely. BM25 improves both.

Long catalog descriptions do not automatically crush short precise titles. Repeating a term 50 times does not explode the score linearly. So the delivery route becomes more stable.

That is why BM25 is the default in Lucene-based systems. Lucene uses it. Elasticsearch uses it. Solr uses it. OpenSearch uses it.

Teams keep the sorting bins, but swap in BM25 as the default postmark score. That is a very practical combination.
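A small comparison makes the title-versus-description point concrete. The sketch below sets a raw term count (a stand-in for unnormalized TF-IDF, with the shared IDF factored out) against the BM25 term part. The lengths and counts are hypothetical, and real TF-IDF variants often add log scaling or cosine normalization that this sketch leaves out.

```python
def bm25_tf_part(tf, dl, avgdl, k1=1.2, b=0.75):
    """TF and length part of BM25; IDF is omitted because both docs share it."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))


AVGDL = 100  # hypothetical average field length, in tokens
title = {"tf": 1, "dl": 5}            # short, precise title
description = {"tf": 10, "dl": 500}   # long catalog description

raw_title, raw_description = title["tf"], description["tf"]
bm25_title = bm25_tf_part(title["tf"], title["dl"], AVGDL)
bm25_description = bm25_tf_part(description["tf"], description["dl"], AVGDL)

print(f"raw TF : title={raw_title}  description={raw_description}")
print(f"BM25   : title={bm25_title:.3f}  description={bm25_description:.3f}")
```

Under raw counts the description wins ten to one. Under BM25 with these numbers the short title edges ahead, because ten mentions spread over 500 tokens are weaker evidence than one mention in five.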

6) The next limitation¶

BM25 still depends on lexical overlap. If the address label says "doctor for heart attack", and the letter says "cardiology for myocardial infarction", BM25 cannot invent the synonym bridge by itself.

So what to do? First-stage scoring is now strong. But query understanding must get smarter.
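The gap is easy to check by hand. This sketch compares the content words of the two phrases from the example. The one-word stop list is an assumption for illustration, not a real analyzer.

```python
STOPWORDS = {"for"}  # tiny hypothetical stop list


def content_overlap(query: str, document: str) -> set:
    """Return the non-stopword tokens shared by the query and the document."""
    query_terms = set(query.lower().split()) - STOPWORDS
    doc_terms = set(document.lower().split()) - STOPWORDS
    return query_terms & doc_terms


shared = content_overlap("doctor for heart attack",
                         "cardiology for myocardial infarction")
print(shared)
```

The shared set is empty, so every query term has tf = 0 in the letter and BM25 has nothing to add up.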

6) Why not plain TF-IDF everywhere under this workload¶

The tempting alternative is plain TF-IDF everywhere. It keeps the system simple, and on a toy corpus it often looks good enough.

It breaks when repeated terms and long documents start to dominate: term frequency should help but saturate, and document length should not earn a free advantage. At that point the search system needs an inspectable artifact: a BM25 score breakdown by term, with k1 and b. Without that artifact, the team argues about ranking quality from anecdotes instead of traces.

Option	Works when	Fails when	Cost moves to
plain TF-IDF everywhere	corpus is small or intent is obvious	repeated terms or long documents dominate the ranking	user trust and manual debugging
BM25	the failure can be measured before serving	traces or judgments are missing	indexing, scoring, evals, and review

Mini-FAQ. "Is this overkill for a simple app?" Maybe. The mechanism earns its place when query failures are frequent, expensive, or hard to diagnose from the final result list alone.
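One possible shape for that artifact is sketched below. It reuses D1 from the worked example in section 4. The function name bm25_breakdown and the dictionary layout are assumptions for illustration, not the output format of any particular engine.

```python
import math


def bm25_breakdown(term_stats, dl, avgdl, n_docs, k1=1.2, b=0.75):
    """Explain one document's BM25 score term by term.

    term_stats maps each query term to (tf in this document, df in the corpus).
    """
    length_factor = 1 - b + b * dl / avgdl
    rows = []
    for term, (tf, df) in term_stats.items():
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf_part = tf * (k1 + 1) / (tf + k1 * length_factor) if tf > 0 else 0.0
        rows.append({"term": term, "tf": tf, "df": df, "idf": idf,
                     "tf_part": tf_part, "score": idf * tf_part})
    return {"k1": k1, "b": b, "dl": dl, "avgdl": avgdl,
            "length_factor": length_factor, "terms": rows,
            "total": sum(row["score"] for row in rows)}


trace = bm25_breakdown({"machine": (2, 2), "learning": (2, 2)},
                       dl=100, avgdl=460 / 3, n_docs=3, k1=1.5, b=0.75)
for row in trace["terms"]:
    print(f'{row["term"]:<9} tf={row["tf"]} idf={row["idf"]:.3f} '
          f'tf_part={row["tf_part"]:.3f} score={row["score"]:.3f}')
print(f'total={trace["total"]:.3f} (k1={trace["k1"]}, b={trace["b"]})')
```

A trace like this lets a reviewer see which term carried the score and whether the length factor or the saturation did the damage.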

7) Production signals — know whether BM25 is working¶

Healthy behavior: a BM25 score breakdown by term, with k1 and b, explains why the top results changed.

First metric to watch: manually judged NDCG@10 after BM25 tuning.

Misleading metric: average click-through rate. Clicks are position-biased, presentation-biased, and often do not prove satisfaction.

Expert graph: break quality down by query class, source type, language, freshness, and result position. Search failures hide in slices long before the global dashboard moves.

bad search result
   -> query trace
   -> candidate generation
   -> scoring / ranking artifact
   -> judged list or user feedback
   -> targeted tuning change
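The metric and the slicing can be sketched together. This is one common NDCG formulation with linear gains, and some teams use 2^gain - 1 instead. The query classes and judgment grades below are hypothetical data, invented only to show the shape of the calculation.

```python
import math
from collections import defaultdict


def dcg(gains, k=10):
    """Discounted cumulative gain over the first k positions."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains[:k]))


def ndcg(ranked_gains, judged_gains, k=10):
    """NDCG@k: DCG of the served order divided by DCG of the ideal order."""
    ideal = dcg(sorted(judged_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical judged queries: (query class, gains in served order, all judged gains)
judged = [
    ("product-name", [3, 2, 0, 1], [3, 2, 1, 0]),
    ("product-name", [3, 3, 1, 0], [3, 3, 1, 0]),
    ("how-to",       [0, 1, 0, 2], [2, 1, 0, 0]),
]

by_slice = defaultdict(list)
for query_class, ranked, all_judged in judged:
    by_slice[query_class].append(ndcg(ranked, all_judged))

slice_ndcg = {name: sum(vals) / len(vals) for name, vals in by_slice.items()}
for name, value in slice_ndcg.items():
    print(f"{name:<13} NDCG@10 = {value:.3f}")
```

In this toy data the global average would look healthy while the how-to slice is clearly weaker. That is the kind of gap a per-slice view is meant to expose.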

8) Boundary — where BM25 helps and where it does not¶

Strong fit: the failure is visible in candidate generation, scoring, ranking, or judged result lists.

Weak fit: the corpus does not contain the answer, the user's intent is unknowable, or the business objective is unresolved.

Pathology: the team keeps tuning scores when the real issue is missing content, stale content, or unclear product policy.

Scale limit: every extra retrieval branch, feature, or reranker spends latency and operator attention. Put expensive steps on the queries where the quality gain is measurable.

9) Wrong model — BM25 is old so dense retrieval replaces it¶

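The wrong model sounds plausible because it works on simple examples: dense retrieval bridges synonyms that BM25 misses.

Production search is harsher. Users bring short queries, typos, rare entities, ambiguous intent, changing corpora, and biased feedback. Exact product names, identifiers, and rare terms are where lexical scoring still earns its place, which is why BM25 usually stays in the stack as a baseline or as one branch of hybrid retrieval. The ranking stack has to expose those pressures instead of hiding them behind one score.

If adding or removing BM25 does not change candidate generation, ranking, evaluation, or debugging, it is not carrying its weight.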

10) Failure taxonomy for BM25¶

Candidate failure — the right document never enters the candidate set.
Scoring failure — the right document is present but ranked too low.
Intent failure — the system optimizes for the wrong interpretation of the query.
Calibration failure — scores from different sources are compared as if they mean the same thing.
Feedback failure — clicks or labels reflect position, popularity, or annotator disagreement rather than true relevance.
Freshness failure — stale documents outrank newer but necessary content.
Debugging failure — no trace connects query, candidates, scores, and final route.
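The first two entries can be told apart mechanically once a trace exists. The sketch below is a minimal triage helper with made-up names and document ids. It covers only candidate and scoring failures; the other categories need judgments or product context that a trace alone does not hold.

```python
def triage(relevant_id, candidate_ids, ranked_ids, k=10):
    """Classify where a known-relevant document was lost for one query."""
    if relevant_id not in candidate_ids:
        return "candidate failure"   # never fetched from the sorting bins
    if relevant_id not in ranked_ids[:k]:
        return "scoring failure"     # fetched, but ranked below the cutoff
    return "ok"


# Hypothetical trace for one query: doc ids are invented for illustration.
candidates = {"d1", "d2", "d3", "d7"}
ranked = ["d3", "d1", "d7", "d2"]

print(triage("d9", candidates, ranked, k=2))
print(triage("d2", candidates, ranked, k=2))
print(triage("d3", candidates, ranked, k=2))
```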

11) Pattern transfer — where this returns later¶

RAG uses the same candidate-generation and ranking chain before answer synthesis.
Vector databases make the latency and recall tradeoff physical.
Advanced RAG reuses query understanding, hybrid fusion, and reranking under stricter evidence constraints.
Evals in production reuse judged lists, slice analysis, and false-positive/false-negative review.

12) Design review checklist¶

What failure does this mechanism catch: candidate, scoring, intent, calibration, feedback, or freshness?
What artifact would you inspect first: query parse, postings list, vector neighbors, feature row, rerank score, or judged list?
Why is plain TF-IDF everywhere weaker for this workload?
Which query slice should improve first?
Which latency, memory, or labeling cost rises first?
What rollback signal tells you the tuning made search worse?

Where this lives in the wild¶

Elasticsearch at Shopify — search engineers tune BM25 field weights for product catalog queries.
OpenSearch in Confluence search — platform teams rely on BM25 defaults for document retrieval.
New York Times site search — relevance engineers balance long articles against short headlines.
Apache Solr in ecommerce search — search specialists use BM25 to stop verbose descriptions from dominating.
Confluence search — knowledge search teams trust BM25 as the first strong lexical baseline.
Enterprise search — mixes exact product names, acronyms, policies, and natural-language questions.
Ecommerce search — balances lexical match, popularity, personalization, freshness, and business rules.
Support knowledge bases — need high recall for policy questions and high precision for top answers.
Code search — exact identifiers and semantic intent both matter.
Legal search — missing one relevant document can be worse than showing extra documents.
Medical literature search — query expansion helps, but false positives are expensive.
RAG retrievers — use IR as the evidence gateway before generation.
Recommendation feeds — reuse ranking ideas even when the item source is not text.
Ad search — relevance competes with auction and business constraints.
Academic search — citations, freshness, author authority, and topical match all interact.

Recall checkpoint¶

Why is term-frequency saturation more realistic than linear TF growth?
What does k1 control, in plain language?
What does b control, in plain language?
In the example, why did D2 not beat D1 by three times?
Which artifact would you inspect first for BM25?
What query slice would you use to prove the improvement is real?
What is the first cost this mechanism adds?

Interview Q&A¶

Q: Why is BM25 usually preferred over TF-IDF in production search? A: Because BM25 models term saturation and document length better. That makes ranking more robust on long, repetitive, real-world text.

Common wrong answer to avoid: "BM25 is just TF-IDF with a different constant."

Q: Why not set b = 0 and skip length normalization entirely? A: Because then long letters accumulate term matches too easily. You lose protection against verbosity bias.

Common wrong answer to avoid: "Document length should never affect relevance."

Q: Why not set k1 extremely high so repetition always helps more? A: Because excessive repetition often adds little new evidence. High k1 weakens saturation and can reward spammy or bloated text.

Common wrong answer to avoid: "More term mentions are always proportionally better."

Q: Why does BM25 still need help from query understanding or dense retrieval? A: Because BM25 is still lexical. Without shared words between address label and letter, the strong formula has nothing to score.

Common wrong answer to avoid: "BM25 automatically understands meaning beyond tokens."

Q: What artifact would you inspect first when BM25 fails? A: I would inspect the BM25 score breakdown by term, with k1 and b, then walk backward to query parsing, candidate generation, and score construction.

Common wrong answer to avoid: "Just look at the final top result." — The final result hides whether the failure happened in candidate generation, scoring, or evaluation.

Q: How do you know the change helped rather than just moved scores around? A: Track manually judged NDCG@10 on a judged query slice before and after BM25 tuning, and weigh it against latency, zero-result rate, and false-positive review.

Common wrong answer to avoid: "The average score went up." — Search scores are not comparable across all rankers, branches, or query types unless calibrated.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is tiny, the query class is low-risk, or the team lacks labels and traces to evaluate the change.

Common wrong answer to avoid: "More ranking sophistication is always better." — Sophistication without evaluation adds latency and hides failure modes.

Apply now (10 min)¶

Exercise. Take one short document and one long document. Give both the same query term counts.

Now ask whether the longer one should always rank higher. Sketch.

same tf
short letter ──→ smaller length penalty
long letter  ──→ bigger length penalty

If your intuition says the short focused letter often deserves the win, you already understand BM25.

Reproduce from memory: explain BM25 with its pressure, artifact, metric, boundary, and failure mode.

What you should remember¶

BM25 exists because term frequency should help but saturate, and document length should not earn a free advantage. The practical question is not whether the score sounds elegant; it is whether the system can show why a document entered the candidate set, why it received its score, and why it appeared at its final position.

The artifact to inspect is the BM25 score breakdown by term, with k1 and b. If you cannot inspect it, you cannot reliably debug relevance.

Remember:

Search quality fails in candidate generation, scoring, intent understanding, calibration, feedback, and freshness.
Watch manually judged NDCG@10 after BM25 tuning by query slice before trusting global averages.
A good ranking system is explainable enough for operators, not just accurate enough on a benchmark.
Every retrieval upgrade should name the quality gain and the latency, memory, or labeling cost it adds.

Bridge. BM25 gives a strong lexical postmark score, but if the address label itself is messy or ambiguous, the whole delivery route still suffers. → 05-query-understanding.md