13. Honest Admission — What search and IR still do not fully solve¶

~14 min read. Mature search engineers sound calm because they know relevance is never fully finished.

Built on the ELI5 in 00-eli5.md. We still have the address label, the sorting bins, each candidate letter, and the delivery route. This final lesson is an honest look at where the postmark score story still becomes uncertain.

1) Relevance is subjective, not a law of physics¶

Look. Two users can type the same query and want different things.

"apple watch band" could mean sporty, luxury, cheap, or official brand-only.

"python course" could mean beginner coding, data science, or snake-care jokes if the user is being playful.

So relevance is not one stable truth. It is a mix of user intent, context, time, and product goals.

That means even the ideal delivery route is partly conditional. Simple, no?

Senior people stop pretending otherwise.

2) Labels are expensive, and humans disagree¶

Learning to Rank needs labels. Labels cost money. Labels cost time. Labels also disagree.

Worked example.

One query is: best budget headphones

One product page gets these five relevance labels from humans: [3, 3, 2, 1, 0]

Mean label = (3 + 3 + 2 + 1 + 0) / 5 = 9 / 5 = 1.8

Now look at the spread. Two people called it perfect. One person called it useless. Two others landed in between.

So what is the ground truth exactly? That is the annotation bottleneck.

Crowdsourced labels help, but disagreement never vanishes. The model learns from fuzzy supervision.
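One way to make the disagreement visible is to report the spread next to the mean. This is a minimal sketch using the five labels from the worked example; the function name `summarize_labels` and the choice of population standard deviation are illustrative assumptions, not a prescribed method.

```python
from statistics import mean, pstdev

def summarize_labels(labels):
    """Return mean, population std dev, and range for one (query, document) pair."""
    return {
        "mean": mean(labels),
        "stdev": pstdev(labels),
        "range": max(labels) - min(labels),
    }

labels = [3, 3, 2, 1, 0]  # the five human judgments from the worked example
summary = summarize_labels(labels)
print(summary)
```

The range covers every label value the judges used, from 0 to 3, so the mean of 1.8 does not match any single judge's opinion. That is the sense in which the supervision is fuzzy.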

3) Neural retrieval is powerful, but often opaque¶

Bi-encoders can retrieve semantically aligned documents beautifully. But when they rank one letter above another, explaining why can be hard.

Was it topic overlap? Was it a training artifact? Was it hidden lexical leakage? Was it domain mismatch?

This black-box feeling makes debugging harder.

Sparse methods are easier to inspect. You can see exact tokens and term weights. Dense methods give better semantic reach, but often weaker interpretability.

Yes? That trade-off is still real.

4) Vocabulary mismatch is reduced, not defeated¶

People sometimes speak as if dense retrieval solved synonym gaps forever. Not true.

Out-of-domain queries still hurt. Rare slang still hurts. New product names still hurt. Specialized medical or legal jargon still hurts.

If the training data did not represent that language well, the embedding space may map it badly.

So the old vocabulary mismatch problem returns, just wearing a smarter shirt.

That is worth admitting.

5) Freshness versus quality has no perfect formula¶

Search systems often want both. Fresh content matters. High-quality evergreen content also matters. But there is no universally principled blend.

Worked numerical intuition.

Suppose Article A has quality 0.9 and freshness 0.2. Suppose Article B has quality 0.7 and freshness 0.9.

If final score is 0.5×quality + 0.5×freshness, then A = 0.45 + 0.10 = 0.55 and B = 0.35 + 0.45 = 0.80. B wins.

If product leadership says quality should dominate, use 0.8×quality + 0.2×freshness. Then A = 0.72 + 0.04 = 0.76 and B = 0.56 + 0.18 = 0.74. Now A wins.

See the issue. The winner flips because the business preference changed, not because physics changed.
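The flip can be reproduced in a few lines. This sketch assumes the same linear blend as the worked numbers, with the freshness weight defined as one minus the quality weight; the helper names `blend` and `winner` are made up for illustration.

```python
def blend(quality, freshness, w_quality):
    """Linear blend; freshness gets whatever weight quality leaves over."""
    return w_quality * quality + (1.0 - w_quality) * freshness

articles = {"A": (0.9, 0.2), "B": (0.7, 0.9)}  # (quality, freshness)

def winner(w_quality):
    """Return the top-scoring article name and all scores for a given weight."""
    scores = {name: blend(q, f, w_quality) for name, (q, f) in articles.items()}
    return max(scores, key=scores.get), scores

for w in (0.5, 0.8):
    name, scores = winner(w)
    print(w, name, {k: round(v, 2) for k, v in scores.items()})

# Quality weight at which A and B tie:
# w*qa + (1-w)*fa = w*qb + (1-w)*fb  =>  w = (fb - fa) / ((qa - qb) + (fb - fa))
(qa, fa), (qb, fb) = articles["A"], articles["B"]
crossover = (fb - fa) / ((qa - qb) + (fb - fa))
print(round(crossover, 3))
```

For these two articles the tie point is a quality weight of 7/9, roughly 0.78. Any weight below it favors B and any weight above it favors A, which is why moving from 0.5 to 0.8 changes the winner.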

6) Clicks are biased by position¶

Users click higher-ranked results more often. Sometimes because they are better. Sometimes because they are higher. That is position bias.

So click data is useful, but contaminated.

A poor top result can still collect many clicks, especially on mobile where few alternatives are visible.

Then LTR models learn a feedback loop. Top stays top because top was shown. Not because top was best. This makes causal evaluation hard.
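A toy examination model shows how this contamination arises. The sketch assumes a click happens only when the user examines a position and the result there is relevant. The examination probabilities and relevance values are invented for illustration, not measurements.

```python
# Hypothetical probability that a user even looks at each rank (assumed values).
EXAMINE_PROB = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def expected_ctr(relevance, position):
    """Examination model: a click needs both examination and relevance."""
    return EXAMINE_PROB[position] * relevance

def debiased_relevance(ctr, position):
    """Inverse-propensity correction: divide out the examination probability."""
    return ctr / EXAMINE_PROB[position]

poor_top = expected_ctr(relevance=0.3, position=1)
good_low = expected_ctr(relevance=0.8, position=5)
print(poor_top, good_low)  # the poor result at rank 1 collects more clicks

# Dividing by the examination probability recovers the underlying relevance.
print(debiased_relevance(poor_top, 1), debiased_relevance(good_low, 5))
```

The correction only works here because the examination probabilities are known by construction. In a real system they have to be estimated, which is part of why causal evaluation stays hard.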

7) Can one model unify retrieval and ranking?¶

That is still an open practical question. Cross-encoders are accurate but slow. Bi-encoders are fast but coarse.

Late-interaction ideas, including ColBERT-style approaches, look promising. They preserve richer token-level matching than vanilla dense retrieval, while staying cheaper than full cross-encoding.

But they are not a solved final answer yet. Serving cost, index size, and operational complexity still matter.

So the field is moving, but not settled.
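The difference between single-vector scoring and late interaction can be shown with toy vectors. This sketch uses hand-written 2-dimensional "token embeddings" and mean pooling as a stand-in for a bi-encoder; both are simplifying assumptions, and no real model is involved.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(vectors):
    """Collapse a list of token vectors into one averaged vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_vector_score(query_vecs, doc_vecs):
    """Bi-encoder-style score: one pooled vector per side, one dot product."""
    return dot(mean_pool(query_vecs), mean_pool(doc_vecs))

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: each query token keeps its best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]      # two toy query-token embeddings
doc_exact = [[1.0, 0.0], [0.0, 1.0]]  # matches each query token directly
doc_blurry = [[0.5, 0.5]]             # a single averaged-looking token

print(single_vector_score(query, doc_exact), single_vector_score(query, doc_blurry))
print(late_interaction_score(query, doc_exact), late_interaction_score(query, doc_blurry))
```

Pooled scoring gives both documents the same score, while the token-level score separates them. The price is that late interaction keeps one vector per document token instead of one per document, which is where the index-size and serving-cost concerns come from.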

8) What to say honestly in interviews¶

If an interviewer asks, “What do we still not understand well?” do not bluff. Say this calmly.

Relevance is subjective. Labels disagree. Clicks are biased. Neural models are hard to interpret. Freshness and quality trade off. One universal retrieval-plus-ranking model is still not operationally solved.

That answer sounds mature, not weak. Senior people trust honest boundaries.

9) Why not one more ranking model under this workload¶

The tempting alternative is one more ranking model. It keeps the system simple, and on a toy corpus it often looks good enough.

It breaks when relevance depends on users, context, labels, freshness, and biased feedback loops. At that point the search system needs an inspectable artifact: a decision table separating subjective, stale, opaque, and biased relevance failures. Without that artifact, the team argues about ranking quality from anecdotes instead of traces.

Option	Works when	Fails when	Cost moves to
one more ranking model	corpus is small or intent is obvious	relevance depends on users, context, labels, freshness, and biased feedback loops	user trust and manual debugging
honest IR limits	the failure can be measured before serving	traces or judgments are missing	indexing, scoring, evals, and review

Mini-FAQ. "Is this overkill for a simple app?" Maybe. The mechanism earns its place when query failures are frequent, expensive, or hard to diagnose from the final result list alone.

10) Production signals: know whether honest IR limits is working¶

Healthy behavior: the decision table separating subjective, stale, opaque, and biased relevance failures explains why the top results changed.

First metric to watch: unresolved relevance-disagreement rate.

Misleading metric: average click-through rate. Clicks are position-biased, presentation-biased, and often do not prove satisfaction.

Expert graph: break quality down by query class, source type, language, freshness, and result position. Search failures hide in slices long before the global dashboard moves.

bad search result
   -> query trace
   -> candidate generation
   -> scoring / ranking artifact
   -> judged list or user feedback
   -> targeted tuning change
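The slice breakdown recommended above can be sketched in a few lines. The judged records, the `query_class` values, and the per-query scores are invented for illustration; any per-query quality metric could stand in for `score`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-query quality scores, each tagged with a query class.
judged = [
    {"query_class": "head", "score": 0.92},
    {"query_class": "head", "score": 0.90},
    {"query_class": "head", "score": 0.94},
    {"query_class": "head", "score": 0.88},
    {"query_class": "rare_entity", "score": 0.35},
]

def quality_by_slice(records, key):
    """Group per-query scores by a slice key and average within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {name: mean(scores) for name, scores in groups.items()}

overall = mean(record["score"] for record in judged)
by_class = quality_by_slice(judged, "query_class")
print(round(overall, 3), {k: round(v, 3) for k, v in by_class.items()})
```

With these made-up numbers the global average sits near 0.80 while the rare-entity slice is at 0.35. The same grouping works for source type, language, freshness bucket, or result position by changing the key.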

11) Boundary: where honest IR limits helps and where it does not¶

Strong fit: the failure is visible in candidate generation, scoring, ranking, or judged result lists.

Weak fit: the corpus does not contain the answer, the user's intent is unknowable, or the business objective is unresolved.

Pathology: the team keeps tuning scores when the real issue is missing content, stale content, or unclear product policy.

Scale limit: every extra retrieval branch, feature, or reranker spends latency and operator attention. Put expensive steps on the queries where the quality gain is measurable.

12) Wrong model: search quality has a final objective truth¶

The wrong model sounds plausible because it works on simple examples.

Production search is harsher. Users bring short queries, typos, rare entities, ambiguous intent, changing corpora, and biased feedback. The ranking stack has to expose those pressures instead of hiding them behind one score.

If honest IR limits cannot change candidate generation, ranking, evaluation, or debugging, it is not carrying its weight.

13) Failure taxonomy for honest IR limits¶

Candidate failure — the right document never enters the candidate set.
Scoring failure — the right document is present but ranked too low.
Intent failure — the system optimizes for the wrong interpretation of the query.
Calibration failure — scores from different sources are compared as if they mean the same thing.
Feedback failure — clicks or labels reflect position, popularity, or annotator disagreement rather than true relevance.
Freshness failure — stale documents outrank newer but necessary content.
Debugging failure — no trace connects query, candidates, scores, and final route.
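The first two entries of the taxonomy can be told apart mechanically once a trace exists. This is a minimal sketch, assuming a trace that records the candidate set and the final ranked list for a query with a known-relevant document; the function name `diagnose` and the cutoff `k` are hypothetical. The other failure types need judgments or policy decisions that a simple check like this cannot supply.

```python
def diagnose(relevant_doc, candidates, ranked, k=10):
    """Classify where a known-relevant document was lost.

    candidates: doc ids returned by candidate generation
    ranked: doc ids in final order, best first
    k: how many top positions count as "shown"
    """
    if relevant_doc not in candidates:
        return "candidate failure"
    if relevant_doc not in ranked[:k]:
        return "scoring failure"
    return "ok"

print(diagnose("d7", {"d1", "d2"}, ["d1", "d2"]))
print(diagnose("d7", {"d1", "d2", "d7"}, ["d1", "d2", "d7"], k=2))
print(diagnose("d7", {"d1", "d2", "d7"}, ["d1", "d2", "d7"], k=3))
```

Checking candidate generation first matters because no amount of score tuning can surface a document that never entered the candidate set.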

14) Pattern transfer: where this returns later¶

RAG uses the same candidate-generation and ranking chain before answer synthesis.
Vector databases make the latency and recall tradeoff physical.
Advanced RAG reuses query understanding, hybrid fusion, and reranking under stricter evidence constraints.
Evals in production reuse judged lists, slice analysis, and false-positive/false-negative review.

15) Design review checklist¶

What failure does this mechanism catch: candidate, scoring, intent, calibration, feedback, or freshness?
What artifact would you inspect first: query parse, postings list, vector neighbors, feature row, rerank score, or judged list?
Why is one more ranking model weaker for this workload?
Which query slice should improve first?
Which latency, memory, or labeling cost rises first?
What rollback signal tells you the tuning made search worse?

Where this lives in the wild¶

Google Search — ranking scientists wrestle with subjective intent and click bias daily.
Etsy search — relevance engineers balance freshness, conversion, and lexical precision.
Glean enterprise search — ML engineers face annotation scarcity and domain-specific jargon constantly.
GitHub Copilot Chat — applied researchers debug dense retrieval failures that are hard to explain.
Google News — ranking teams struggle with freshness versus authority every hour.
Enterprise search — mixes exact product names, acronyms, policies, and natural-language questions.
Ecommerce search — balances lexical match, popularity, personalization, freshness, and business rules.
Support knowledge bases — need high recall for policy questions and high precision for top answers.
Code search — exact identifiers and semantic intent both matter.
Legal search — missing one relevant document can be worse than showing extra documents.
Medical literature search — query expansion helps, but false positives are expensive.
RAG retrievers — use IR as the evidence gateway before generation.
Recommendation feeds — reuse ranking ideas even when the item source is not text.
Ad search — relevance competes with auction and business constraints.
Academic search — citations, freshness, author authority, and topical match all interact.

Recall checkpoint¶

Why is relevance not a single fixed ground truth?
What does the label-disagreement example show about supervision?
Why can click logs mislead a ranking model?
What makes late-interaction approaches promising but not fully solved?
Which artifact would you inspect first for honest IR limits?
What query slice would you use to prove the improvement is real?
What is the first cost this mechanism adds?

Interview Q&A¶

Q: Why is “relevance is subjective” a serious technical point, not a vague slogan? A: Because ranking objectives, labels, and product choices depend on user intent and context. Different users can legitimately prefer different results for the same query.

Common wrong answer to avoid: "Relevance is subjective only because judges are inconsistent."

Q: Why does better neural retrieval not remove the need for lexical systems? A: Because out-of-domain language, rare identifiers, and exact symbolic queries still need token fidelity. Dense models reduce some mismatch but do not eliminate it.

Common wrong answer to avoid: "Embeddings solved vocabulary mismatch completely."

Q: Why are clicks not reliable ground truth for LTR? A: Because click behavior depends on rank position, presentation, trust, and user patience, not only intrinsic relevance.

Common wrong answer to avoid: "A clicked result is always more relevant than an unclicked result."

Q: Why is one unified retrieval-and-ranking model still a hard production problem? A: Because the best accuracy models are often too slow, while the fastest retrieval models lose fine-grained interaction quality.

Common wrong answer to avoid: "We only need a larger transformer and the problem disappears."

Q: What artifact would you inspect first when honest IR limits fails? A: I would inspect the decision table separating subjective, stale, opaque, and biased relevance failures, then walk backward to query parsing, candidate generation, and score construction.

Common wrong answer to avoid: "Just look at the final top result." — The final result hides whether the failure happened in candidate generation, scoring, or evaluation.

Q: How do you know the change helped rather than just moved scores around? A: Track unresolved relevance-disagreement rate on a judged query slice and compare it with latency, zero-result rate, and false-positive review.

Common wrong answer to avoid: "The average score went up." — Search scores are not comparable across all rankers, branches, or query types unless calibrated.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is tiny, the query class is low-risk, or the team lacks labels and traces to evaluate the change.

Common wrong answer to avoid: "More ranking sophistication is always better." — Sophistication without evaluation adds latency and hides failure modes.

Apply now (10 min)¶

Exercise. Take one query from your own domain.

Write two different user intents for it. Then write one reason your current ranking might favor the wrong one.

Sketch.

same address label
   ├─ user intent A ──→ preferred letter set A
   └─ user intent B ──→ preferred letter set B

If the ideal result changes with user context, you have seen why search remains an open problem.

Reproduce from memory: explain honest IR limits with its pressure, artifact, metric, boundary, and failure mode.

What you should remember¶

Honest IR limits exists because relevance depends on users, context, labels, freshness, and biased feedback loops. The practical question is not whether the score sounds elegant; it is whether the system can show why a document entered the candidate set, why it received its score, and why it appeared at its final position.

The artifact to inspect is the decision table separating subjective, stale, opaque, and biased relevance failures. If you cannot inspect it, you cannot reliably debug relevance.

Remember:

Search quality fails in candidate generation, scoring, intent understanding, calibration, feedback, and freshness.
Watch unresolved relevance-disagreement rate by query slice before trusting global averages.
A good ranking system is explainable enough for operators, not just accurate enough on a benchmark.
Every retrieval upgrade should name the quality gain and the latency, memory, or labeling cost it adds.

Bridge. Search retrieves letters by similarity, but the next module asks a different question: what if we navigate explicit relationships between entities instead? → ../10_knowledge_graph_retrieval/00-eli5.md