05. Query Understanding — Clean the address label before you search

~14 min read. Many bad search results begin with a badly interpreted query, not a bad index.

Built on the ELI5 in 00-eli5.md. The user's address label reaches the same sorting bins and candidate letters. Our job now is to rewrite that address label so the later postmark score has a fair chance.

1) First ask: what is the user trying to do?

See. Not every query has the same job.

Some queries are navigational. The user wants one exact page. Example: linkedin login.

Some queries are informational. The user wants to learn. Example: how to clean suede shoes.

Some queries are transactional. The user wants to buy or do something. Example: buy running shoes size 10.

Why does this matter? Because retrieval strategy changes with intent. Navigational queries need very sharp exactness. Informational queries benefit from broader coverage. Transactional queries care about filters, stock, brand, and price.

So the address label is not just a bag of words. It is a task request.

Simple, no?
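To make the three intent classes concrete, here is a minimal rule-based sketch. The cue word lists and the function name `classify_intent` are hypothetical, chosen only for illustration; a production system would more likely learn intent from labeled queries than from a hand-written list.

```python
import re

# Hypothetical cue lists, for illustration only.
TRANSACTIONAL_CUES = {"buy", "order", "price", "cheap", "deal"}
NAVIGATIONAL_CUES = {"login", "signin", "homepage", "website"}
INFORMATIONAL_CUES = {"how", "what", "why", "guide", "tutorial"}


def classify_intent(query: str) -> str:
    """Return a coarse intent label based on cue words in the query."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Purchase cues are checked first, so "how to buy ..." counts as transactional.
    if tokens & TRANSACTIONAL_CUES:
        return "transactional"
    if tokens & NAVIGATIONAL_CUES:
        return "navigational"
    if tokens & INFORMATIONAL_CUES:
        return "informational"
    return "unknown"


for q in ["linkedin login", "how to clean suede shoes", "buy running shoes size 10"]:
    print(q, "->", classify_intent(q))
```

The `unknown` label matters. A query with no clear cue should fall back to a neutral retrieval strategy rather than be forced into a class.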

2) Query parsing: structure hidden inside small text

Users type short things. But short things often contain structure.

brand:Nike size:10 red trail shoes is not only free text. It contains fields. It contains attributes.

It may contain negation too. laptop sleeve not leather means leather should be excluded.

Good parsers detect:

field constraints like brand:Nike
numeric filters like size:10
negations like not leather
phrases like machine learning
operators typed explicitly or implicitly

ASCII flow.

raw query
   │
   ▼
┌─────────┐   ┌────────────────┐   ┌───────────┐   ┌───────────────┐
│ parser  │ → │ intent model   │ → │ expander  │ → │ search engine │
└─────────┘   └────────────────┘   └───────────┘   └───────────────┘

Look. A parser is not decoration. It decides what part of the address label should hit text search, what part should hit filters, and what part should be ignored or rewritten.
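The split between filters, exclusions, phrases, and free text can be sketched in a few lines. This is a toy parser under stated assumptions: the `ParsedQuery` class and `parse_query` function are hypothetical names, phrases are only recognized when the user quotes them, and filter values stay as strings (so size:10 is the string "10", not a number).

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    filters: dict = field(default_factory=dict)   # field constraints, e.g. brand -> nike
    excluded: list = field(default_factory=list)  # negated terms
    phrases: list = field(default_factory=list)   # quoted phrases, kept whole
    terms: list = field(default_factory=list)     # remaining free text


def parse_query(raw: str) -> ParsedQuery:
    parsed = ParsedQuery()
    # Pull out quoted phrases first so their words are not split or negated.
    parsed.phrases = [p.lower() for p in re.findall(r'"([^"]+)"', raw)]
    rest = re.sub(r'"[^"]+"', " ", raw)

    tokens = rest.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        low = tok.lower()
        if ":" in tok and not tok.startswith(":") and not tok.endswith(":"):
            name, value = tok.split(":", 1)          # field constraint
            parsed.filters[name.lower()] = value.lower()
        elif low == "not" and i + 1 < len(tokens):
            parsed.excluded.append(tokens[i + 1].lower())  # "not leather"
            i += 1                                   # skip the negated term
        else:
            parsed.terms.append(low)
        i += 1
    return parsed


print(parse_query("brand:Nike size:10 red trail shoes"))
print(parse_query("laptop sleeve not leather"))
```

Note the weakness. The naive `not` rule would misread an unquoted to be or not to be, which is one reason real parsers combine rules with phrase and entity detection.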

3) Query expansion and spell correction

Users misspell. Users abbreviate. Users use local slang.

So what to do? We enrich the query. That can mean spelling correction. That can mean synonym expansion. That can mean adding related terms.

Edit distance is one old-school spelling clue. nke is one edit (an inserted i) away from nike, so correction becomes plausible. Neural spell correction can do even better, especially with context.

Expansion must stay disciplined. coat might expand to jacket, overcoat, and parka. But expanding too aggressively can flood the candidate pool. Precision falls.

So query understanding is never just “add more words.” It is “add the right words for this intent.”
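The edit-distance clue can be checked directly. Below is a small sketch of Levenshtein distance (insert, delete, and substitute each cost 1) plus a candidate filter. The function names and the tiny vocabulary are assumptions for illustration; real spell correctors also weigh word frequency and context.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def correction_candidates(word, vocabulary, max_distance=1):
    """Vocabulary entries within max_distance edits of word."""
    return sorted(v for v in vocabulary if edit_distance(word, v) <= max_distance)


vocab = ["nike", "nuke", "ink", "like"]  # hypothetical vocabulary
print(edit_distance("nke", "nike"))       # 1
print(correction_candidates("nke", vocab))  # ['nike', 'nuke']
```

Edit distance alone cannot choose between nike and nuke. Both are one edit away. That tie is why a scoring model, or query context such as shoes, has to make the final call.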

4) Worked example: `buy nke shoes`

Picture the raw query as a messy envelope. The clerk should not send it unchanged. She should inspect it first.

Raw address label: buy nke shoes

Step 1: spell correction.

Candidate corrections for nke:

nike
nuke
ink

Suppose the correction model scores them:

nike = 0.82
nuke = 0.07
ink = 0.03

Highest score wins. Corrected query becomes: buy nike shoes

Step 2: intent classification.

This looks transactional. Why? Because buy strongly hints purchase intent. So we prioritize product inventory pages over blog posts.

Step 3: expansion.

Add close commercial synonyms: shoes → shoes, sneakers, footwear

Keep brand exact: nike stays nike.

Step 4: build structured query.

One possible final form is: (brand:nike)^3 AND (title:shoes OR title:sneakers OR body:footwear)

See the numeric intuition. If brand match weight is 3, title match weight is 2, and a body synonym match is worth 1, then a Nike product titled running shoes gets: brand score 3 + title score 2 = 5

A blog post mentioning Nike in body only gets: brand score 0 + body synonym score 1 = 1

Under a strict AND the blog post would not match at all, because its brand field does not match; the sums treat each clause as a soft match to show the gap. Either way the delivery route favors the product page. That feels right for a shopping query.
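The arithmetic in this worked example can be replayed in code. This is a sketch, not a real search engine: the correction scores are copied from the example, the body weight of 1 is inferred from the blog-post sum, the two documents are made up, and clauses are scored additively (soft match) rather than with a strict AND.

```python
# Scores copied from the worked example; the correction model itself is not shown.
correction_scores = {"nike": 0.82, "nuke": 0.07, "ink": 0.03}
best_correction = max(correction_scores, key=correction_scores.get)

# brand = 3 and title = 2 come from the text; body = 1 is an assumption.
FIELD_WEIGHTS = {"brand": 3, "title": 2, "body": 1}

# Structured query: which terms are wanted in which field.
structured_query = {
    "brand": {best_correction},
    "title": {"shoes", "sneakers"},
    "body": {"footwear"},
}


def score(doc: dict) -> int:
    """Add the field weight once for every field that matches a wanted term."""
    total = 0
    for field_name, wanted in structured_query.items():
        doc_tokens = set(doc.get(field_name, "").lower().split())
        if doc_tokens & wanted:
            total += FIELD_WEIGHTS[field_name]
    return total


# Hypothetical documents.
product = {"brand": "Nike", "title": "running shoes", "body": "lightweight trainer"}
blog = {"brand": "", "title": "my marathon diary", "body": "I love Nike footwear"}

print(best_correction)  # nike
print(score(product))   # 5
print(score(blog))      # 1
```

The blog post mentions Nike, but only in its body, and the query asks for nike in the brand field. That is exactly why the structured form separates the two documents where a bag of words would not.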

5) Stopwords and entities are trickier than they look

People say, “Just remove common words.” Careful. Sometimes common words matter.

The Who is a band. If you drop the, you may damage the entity. to be or not to be is a phrase. Removing function words can distort meaning.

Entity recognition matters too. Apple support could mean the company. apple nutrition likely means the fruit. Context disambiguation uses nearby terms.

So the address label is a compressed semantic object. You should respect that.
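One way to respect entities is to check a protected-phrase list before dropping any stopword. The sketch below assumes a tiny hand-written stopword set and phrase list; both are hypothetical, and in practice the protected phrases would come from an entity dictionary or a recognition model.

```python
STOPWORDS = {"the", "a", "an", "to", "or", "of", "is"}

# Hypothetical protected phrases, stored as token tuples.
PROTECTED_PHRASES = [
    ("the", "who"),
    ("to", "be", "or", "not", "to", "be"),
]


def remove_stopwords(query: str) -> list:
    """Drop stopwords, but keep any protected phrase as one whole unit."""
    tokens = query.lower().split()
    kept = []
    i = 0
    while i < len(tokens):
        # Does a protected phrase start at this position?
        match = next(
            (p for p in PROTECTED_PHRASES if tuple(tokens[i:i + len(p)]) == p),
            None,
        )
        if match:
            kept.append(" ".join(match))  # keep the entity intact
            i += len(match)
        elif tokens[i] in STOPWORDS:
            i += 1                        # ordinary stopword, safe to drop
        else:
            kept.append(tokens[i])
            i += 1
    return kept


print(remove_stopwords("the who tour dates"))   # ['the who', 'tour', 'dates']
print(remove_stopwords("the history of jazz"))  # ['history', 'jazz']
```

The same word the is kept in one query and dropped in the other. The decision depends on what surrounds it, not on the word alone.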

6) Good query understanding does not replace retrieval

It prepares retrieval. That is the right mental model.

A parser cannot fix an empty index. A synonym list cannot replace a strong BM25 baseline. A spell checker cannot explain whole-document semantics.

But without query understanding, even good sorting bins and a strong postmark score feel clumsy. Search quality is a pipeline. This stage cleans the input.

6) Why not send raw user text directly to retrieval under this workload?

The tempting alternative is to send raw user text directly to retrieval. It keeps the system simple, and on a toy corpus it often looks good enough.

It breaks when short user queries hide structure that retrieval needs before scoring. At that point the search system needs an inspectable artifact: a parsed query with intent, entities, filters, spell fixes, and expansions. Without that artifact, the team argues about ranking quality from anecdotes instead of traces.

Option	Works when	Fails when	Cost moves to
send raw user text directly to retrieval	corpus is small or intent is obvious	short user queries hide structure that retrieval needs before scoring	user trust and manual debugging
query understanding	the failure can be measured before serving	traces or judgments are missing	indexing, scoring, evals, and review

Mini-FAQ. "Is this overkill for a simple app?" Maybe. The mechanism earns its place when query failures are frequent, expensive, or hard to diagnose from the final result list alone.
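What might the inspectable artifact look like? One possible shape is a small record logged once per query. The `QueryTrace` class and its field names are assumptions made for this sketch; the point is only that every rewrite the system applied is written down next to the raw text.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class QueryTrace:
    raw: str                                         # what the user typed
    intent: str                                      # predicted intent label
    entities: list = field(default_factory=list)     # recognized entities
    filters: dict = field(default_factory=dict)      # field -> value
    spell_fixes: dict = field(default_factory=dict)  # original -> corrected
    expansions: dict = field(default_factory=dict)   # term -> added terms


trace = QueryTrace(
    raw="buy nke shoes",
    intent="transactional",
    entities=["nike"],
    filters={"brand": "nike"},
    spell_fixes={"nke": "nike"},
    expansions={"shoes": ["sneakers", "footwear"]},
)

# One JSON line per query is easy to log, diff, and search during a review.
trace_line = json.dumps(asdict(trace), sort_keys=True)
print(trace_line)
```

With a line like this in the logs, a bad result can be traced to a wrong spell fix or a wrong intent instead of being debated from the final result list.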

7) Production signals — know whether query understanding is working

Healthy behavior: the parsed query (intent, entities, filters, spell fixes, and expansions) explains why the top results changed.

First metric to watch: wrong-intent retrieval rate.

Misleading metric: average click-through rate. Clicks are position-biased, presentation-biased, and often do not prove satisfaction.

Expert graph: break quality down by query class, source type, language, freshness, and result position. Search failures hide in slices long before the global dashboard moves.

bad search result
   -> query trace
   -> candidate generation
   -> scoring / ranking artifact
   -> judged list or user feedback
   -> targeted tuning change
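To make the slice advice concrete, here is a sketch that computes wrong-intent retrieval rate per slice. It assumes one definition of the metric, which the text does not spell out: the share of judged queries where the predicted intent differs from the intent a judge assigned. The rows and field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical judged rows: one per query, sliced by language.
judged_rows = [
    {"slice": "en", "expected_intent": "transactional", "predicted_intent": "transactional"},
    {"slice": "en", "expected_intent": "navigational", "predicted_intent": "navigational"},
    {"slice": "en", "expected_intent": "informational", "predicted_intent": "informational"},
    {"slice": "en", "expected_intent": "transactional", "predicted_intent": "informational"},
    {"slice": "es", "expected_intent": "transactional", "predicted_intent": "informational"},
    {"slice": "es", "expected_intent": "navigational", "predicted_intent": "navigational"},
]


def wrong_intent_rate_by_slice(rows):
    """Fraction of queries per slice whose predicted intent is wrong."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for row in rows:
        totals[row["slice"]] += 1
        if row["predicted_intent"] != row["expected_intent"]:
            wrong[row["slice"]] += 1
    return {name: wrong[name] / totals[name] for name in totals}


rates = wrong_intent_rate_by_slice(judged_rows)
print(rates)  # {'en': 0.25, 'es': 0.5}
```

The global rate here is 2 wrong out of 6. The es slice is twice as bad as en, and the global number hides that.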

8) Boundary — where query understanding helps and where it does not

Strong fit: the failure is visible in candidate generation, scoring, ranking, or judged result lists.

Weak fit: the corpus does not contain the answer, the user's intent is unknowable, or the business objective is unresolved.

Pathology: the team keeps tuning scores when the real issue is missing content, stale content, or unclear product policy.

Scale limit: every extra retrieval branch, feature, or reranker spends latency and operator attention. Put expensive steps on the queries where the quality gain is measurable.

9) Wrong model — query understanding is just spell correction

The wrong model sounds plausible because it works on simple examples.

Production search is harsher. Users bring short queries, typos, rare entities, ambiguous intent, changing corpora, and biased feedback. The ranking stack has to expose those pressures instead of hiding them behind one score.

If query understanding cannot change candidate generation, ranking, evaluation, or debugging, it is not carrying its weight.

10) Failure taxonomy for query understanding

Candidate failure — the right document never enters the candidate set.
Scoring failure — the right document is present but ranked too low.
Intent failure — the system optimizes for the wrong interpretation of the query.
Calibration failure — scores from different sources are compared as if they mean the same thing.
Feedback failure — clicks or labels reflect position, popularity, or annotator disagreement rather than true relevance.
Freshness failure — stale documents outrank newer but necessary content.
Debugging failure — no trace connects query, candidates, scores, and final route.
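The first two entries in this taxonomy can be told apart mechanically once a trace exists. The sketch below is a simplification: the function name `diagnose`, the document ids, and the default cutoff of 10 are assumptions, and it says nothing about the intent, calibration, feedback, or freshness failures.

```python
def diagnose(relevant_id, candidates, ranked, top_k=10):
    """Classify a miss as a candidate failure or a scoring failure."""
    if relevant_id not in candidates:
        # The right document never entered the candidate set.
        return "candidate failure"
    if relevant_id not in ranked[:top_k]:
        # It was retrieved, but ranked below the cutoff.
        return "scoring failure"
    return "ok"


# Hypothetical document ids.
candidates = {"d1", "d2", "d3", "d4"}
ranked = ["d3", "d1", "d4", "d2"]

print(diagnose("d9", candidates, ranked, top_k=2))  # candidate failure
print(diagnose("d2", candidates, ranked, top_k=2))  # scoring failure
print(diagnose("d1", candidates, ranked, top_k=2))  # ok
```

The distinction decides where to look next. A candidate failure points back at parsing, spelling, and expansion. A scoring failure points at weights and ranking features.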

11) Pattern transfer — where this returns later

RAG uses the same candidate-generation and ranking chain before answer synthesis.
Vector databases make the latency and recall tradeoff physical.
Advanced RAG reuses query understanding, hybrid fusion, and reranking under stricter evidence constraints.
Evals in production reuse judged lists, slice analysis, and false-positive/false-negative review.

12) Design review checklist

What failure does this mechanism catch: candidate, scoring, intent, calibration, feedback, or freshness?
What artifact would you inspect first: query parse, postings list, vector neighbors, feature row, rerank score, or judged list?
Why is sending raw user text directly to retrieval weaker for this workload?
Which query slice should improve first?
Which latency, memory, or labeling cost rises first?
What rollback signal tells you the tuning made search worse?

Where this lives in the wild

Google Search autocomplete — search quality engineers infer intent from tiny noisy queries.
Amazon retail search — relevance teams parse brand, size, color, and purchase intent.
Booking.com search — travel search engineers separate destination entities from date and filter terms.
Instacart grocery search — catalog search specialists resolve misspellings and brand variants fast.
Spotify search — discovery engineers disambiguate artist names, songs, and common words used as titles.
Enterprise search — mixes exact product names, acronyms, policies, and natural-language questions.
Ecommerce search — balances lexical match, popularity, personalization, freshness, and business rules.
Support knowledge bases — need high recall for policy questions and high precision for top answers.
Code search — exact identifiers and semantic intent both matter.
Legal search — missing one relevant document can be worse than showing extra documents.
Medical literature search — query expansion helps, but false positives are expensive.
RAG retrievers — use IR as the evidence gateway before generation.
Recommendation feeds — reuse ranking ideas even when the item source is not text.
Ad search — relevance competes with auction and business constraints.
Academic search — citations, freshness, author authority, and topical match all interact.

Recall checkpoint

Why is intent classification useful before retrieval begins?
What kinds of structure might be hidden inside a short query?
Why can synonym expansion improve recall yet hurt precision?
Why is The Who a warning against blind stopword removal?
Which artifact would you inspect first for query understanding?
What query slice would you use to prove the improvement is real?
What is the first cost this mechanism adds?

Interview Q&A

Q: Why does query understanding often matter more than changing the scorer? A: Because the scorer only sees the query representation you hand it. If the address label is misspelled, ambiguous, or badly parsed, even a great ranking function works on damaged input.

Common wrong answer to avoid: "Ranking can compensate for any query problem later."

Q: Why not expand every query with as many synonyms as possible? A: Because broad expansion inflates recall while often hurting precision. Good systems expand selectively by intent, domain, and context.

Common wrong answer to avoid: "More query terms always means better search coverage."

Q: Why treat navigational and transactional queries differently? A: Because the desired success condition differs. Navigational queries want one exact destination. Transactional queries need filters, product attributes, and commercial relevance.

Common wrong answer to avoid: "All queries should use the same retrieval strategy."

Q: Why are entities harder than they first seem? A: Because the same surface word can refer to different things, and common tokens may be essential inside named entities. Context decides the intended meaning.

Common wrong answer to avoid: "Entity handling is just another synonym table."

Q: What artifact would you inspect first when query understanding fails? A: I would inspect the parsed query with intent, entities, filters, spell fixes, and expansions, then follow the trace forward into candidate generation and score construction.

Common wrong answer to avoid: "Just look at the final top result." — The final result hides whether the failure happened in candidate generation, scoring, or evaluation.

Q: How do you know the change helped rather than just moved scores around? A: Track wrong-intent retrieval rate on a judged query slice and compare it with latency, zero-result rate, and false-positive review.

Common wrong answer to avoid: "The average score went up." — Search scores are not comparable across all rankers, branches, or query types unless calibrated.

Q: When should you avoid this mechanism? A: Avoid it when the corpus is tiny, the query class is low-risk, or the team lacks labels and traces to evaluate the change.

Common wrong answer to avoid: "More ranking sophistication is always better." — Sophistication without evaluation adds latency and hides failure modes.

Apply now (10 min)

Exercise. Take three messy queries from real life. Rewrite each into: intent, filters, spelling fixes, and expansion terms.

Sketch.

buy nke shoes
  ├─ intent: transactional
  ├─ fix: nike
  ├─ expand: shoes|sneakers|footwear
  └─ search form: brand + title + body

If that rewrite changes the likely results, you have already improved search quality.

Reproduce from memory: explain query understanding with its pressure, artifact, metric, boundary, and failure mode.

What you should remember

Query understanding exists because short user queries hide structure that retrieval needs before scoring. The practical question is not whether the score sounds elegant; it is whether the system can show why a document entered the candidate set, why it received its score, and why it appeared at its final position.

The artifact to inspect is the parsed query with intent, entities, filters, spell fixes, and expansions. If you cannot inspect it, you cannot reliably debug relevance.

Remember:

Search quality fails in candidate generation, scoring, intent understanding, calibration, feedback, and freshness.
Watch wrong-intent retrieval rate by query slice before trusting global averages.
A good ranking system is explainable enough for operators, not just accurate enough on a benchmark.
Every retrieval upgrade should name the quality gain and the latency, memory, or labeling cost it adds.

Bridge. Even a beautifully cleaned address label still depends on words, so next we learn how vectors match meaning beyond exact lexical overlap. → 06-dense-retrieval.md