Retrieval & Ranking: Interview Questions

This file goes deeper than rag-fundamentals.md on the retrieval and ranking algorithms — BM25 internals, ANN index trade-offs (HNSW vs IVF vs DiskANN), hybrid score normalization, reranker selection (Cohere Rerank, ColBERT, LLM-as-judge), diversity (MMR), ranking metrics (NDCG, MRR, recall@k), and embedding fine-tuning for retrieval. If rag-fundamentals is the "I built a basic RAG" interview, this is the "I tuned retrieval until recall@5 hit 0.9" interview. The senior tell is naming a concrete metric, a concrete decision, and a concrete number.

For chunking, query rewriting, HyDE, citation, and Graph/Agentic RAG, see rag-fundamentals.md and rag-advanced.md.

BM25 and sparse retrieval

Q: "Explain BM25 in detail. Why is it still relevant in 2026?"

Tags: senior · very-common · conceptual · source: standard senior RAG probe; reported in 2026 AI engineer loops; Hybrid Search 2026 guides

Answer outline: - BM25 is a probabilistic relevance scoring function for sparse (keyword) retrieval. For a query Q with terms q_i against document D: score(Q,D) = Σ_i IDF(q_i) · ( f(q_i, D) · (k1+1) ) / ( f(q_i, D) + k1·(1 - b + b·|D|/avgdl) ), where f(q_i, D) is the number of times q_i occurs in D, |D| is the length of D in tokens, and avgdl is the average document length in the corpus. - Three components: term frequency (how often the term appears in the doc), inverse document frequency (how rare the term is across the corpus), and length normalization (longer docs aren't unfairly boosted just for having more words). - Knobs: k1 (typically 1.2-2.0) controls term-frequency saturation: higher means more reward for repeated occurrences. b (typically 0.75) controls length normalization: b=0 ignores length, b=1 fully normalizes. - Why still relevant in 2026: it handles exact-keyword and rare-term queries (product codes, function names, error strings, proper nouns) that dense vectors miss. Zero learned parameters → no training data, no embedding cost, instant deployment. - The 2026 frame: BM25 is the baseline you must beat. Most modern hybrid setups run BM25 + dense in parallel and fuse with RRF. Pure dense rarely wins on enterprise queries with named entities. - Numbers to drop: "k1 between 1.2 (the Lucene/Elasticsearch default) and 1.5 with b=0.75 is the standard starting point", "indexes via inverted index (Lucene/Elasticsearch/OpenSearch/Tantivy)", "BM25 search latency: ~10-50ms on millions of docs"
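The scoring formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production scorer: the function name `bm25_scores`, the whitespace tokenization, and the Lucene-style non-negative IDF variant are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every tokenized doc in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term at least once.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            # Assumed IDF variant (Lucene-style, never negative); others exist.
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            # Saturating term frequency with length normalization.
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error 503 from the payments api".split(),
    "the api returns json".split(),
    "how to bake bread".split(),
]
scores = bm25_scores(["503", "api"], docs)
```

The first document matches both terms, including the rarer "503", so it scores highest; the third matches nothing and scores zero.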

Common follow-ups: - "What's the role of IDF?" - "Why doesn't BM25 work for paraphrase queries?" - "Walk me through tuning k1 and b."

Traps: - Calling BM25 "old" or "obsolete". Frontier 2026 retrieval still relies on it as one leg of hybrid. - Confusing TF-IDF and BM25 — BM25 is a refined TF-IDF with saturation and length normalization.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "When does BM25 beat dense retrieval?"

Tags: senior · common · scenario · source: Sparse vs Dense Retrieval 2026 guides; standard RAG-tradeoff probe

Answer outline: - Exact-keyword queries: product codes ("SKU-2031-XR"), function names ("torch.nn.MultiheadAttention"), error strings, regulation IDs, person names. Dense vectors treat these as fuzzy semantic blobs and miss exact matches. - Out-of-domain queries: when the query distribution at runtime differs from what the embedding model was trained on. BM25 is distribution-free; embeddings degrade. - Rare terms / long-tail vocabulary: medical codes, legal citations, technical jargon. The IDF component in BM25 strongly rewards rare terms; embeddings often blur them. - Short queries with low semantic content: "API 503 error code 7" — BM25 nails it; dense sometimes returns conceptually-related but wrong results. - Multi-lingual setups where you have language-specific BM25 analyzers (stemming, tokenization) but a generic embedding model that's weaker on a specific language. - The honest answer: in 2026 production, you rarely choose pure BM25 vs pure dense. You run both via hybrid, and the cases above are where the BM25 leg of hybrid contributes most. - Numbers to drop: "exact-keyword queries: BM25 recall@10 often 95%+ vs dense 60-80%", "rare-term IDF boost: a term appearing in <1% of docs gets massive BM25 weight"

Common follow-ups: - "Can you tune an embedding model to handle exact-keyword queries?" - "Why not just rewrite the query?"

Traps: - Claiming BM25 is always worse than dense. The evidence is mixed and task-dependent.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "What is SPLADE / learned sparse retrieval and when do you use it?"

Tags: staff · occasional · conceptual · source: Sparse vs Dense Retrieval 2026 guides; ML Journey BM25 vs Embeddings

Answer outline: - SPLADE (SParse Lexical AnD Expansion) is a learned sparse retrieval method: a transformer (typically BERT) predicts a sparse vector over the vocabulary, where each non-zero dimension corresponds to a vocabulary term with a learned importance weight. - It combines BM25's strengths (sparse, interpretable, exact-match friendly, fits inverted indexes) with neural model's strengths (semantic understanding, query expansion). - Trained with a contrastive objective on (query, positive doc, negative doc) triples, plus a regularizer to keep the vectors sparse. - Versus BM25: SPLADE handles paraphrase better because the model expands query terms semantically. Versus dense: SPLADE is interpretable (you can see which terms drive the score) and fits existing inverted-index infra. - When to use: when pure BM25's lexical gap is your main quality issue and you want sparse-index deployability. When deploying SPLADE, you can co-locate it with BM25 in the same Lucene/Elasticsearch infra. - 2026 maturity: less common than dense + BM25 hybrid. SPLADE's training data dependence and added complexity push most teams toward simpler hybrid setups. - Numbers to drop: "SPLADE-v3 outperforms BM25 by 5-15% NDCG on MS MARCO", "trade-off: 3-5× larger sparse vectors than BM25"

Common follow-ups: - "Why not just use a dense retriever?" - "How does SPLADE compare to ColBERT?"

Traps: - Calling SPLADE "just BM25 with a neural model". It's a fundamentally learned representation, not just term reweighting.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Dense retrieval & embedding models

Q: "How do you choose an embedding model for a RAG system in 2026?"

Tags: mid · very-common · design · source: MongoDB / Milvus / StackAI embedding-model guides 2026; standard senior RAG probe

Answer outline: - Don't pick the headline-best on MTEB. Pick the right one for your query/document distribution. - Five axes to compare: - Domain match: does the model's training data cover your content type? (Code, legal, medical, multilingual.) BGE / E5 / Cohere embed-v3 / OpenAI text-embedding-3-small and 3-large are the usual general-purpose English candidates; consider specialized models (CodeBERT, BioBERT) for narrow domains. - Asymmetry handling: are queries short and docs long? Models that distinguish queries from documents (Cohere's separate query/document input types, or prefix- and instruction-based models such as E5 and BGE) tend to outperform symmetric encoding for short-query/long-doc retrieval. - Multilingual: if there is any non-English content, choose a multilingual model (Cohere multilingual-v3, Voyage multilingual, BGE-M3); this usually beats a translate-then-embed pipeline. - Dimensionality: smaller dims = cheaper storage / faster search. text-embedding-3-small (1536d, truncatable) vs 3-large (3072d). Matryoshka-style models let you truncate: store the full vector, search on a truncated prefix. - Cost & latency: API ($0.02-0.13 / 1M tokens) vs self-hosted (BGE/E5 free but ops cost). At high volume, self-hosted often wins. - Always benchmark on your data. MTEB is a starting point; your specific corpus + query distribution decides. - Process: pick 3-5 candidates, build an eval set of 100-300 (query, relevant-doc) pairs from real usage or hand-labeled, measure recall@10 / NDCG@10, pick the winner. - Numbers to drop: "text-embedding-3-small: $0.02/1M tokens, 1536d (truncatable to 512d)", "Cohere embed-v3: $0.10/1M tokens, multilingual", "self-hosted BGE-large: free, ops cost, ~50ms/embedding on GPU", "eval set: 100-300 labeled pairs minimum"
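The benchmarking step can be made concrete with two small metric functions. This is a sketch assuming binary relevance labels; the names `recall_at_k`, `ndcg_at_k`, `evaluate`, and the `run`/`qrels` dictionaries are illustrative, not from any particular library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(run, qrels, k=10):
    """Average both metrics over an eval set.

    run:   {query_id: ranked doc ids from one candidate model}
    qrels: {query_id: doc ids labeled relevant}
    """
    n = len(qrels)
    return {
        "recall": sum(recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / n,
        "ndcg": sum(ndcg_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / n,
    }

qrels = {"q1": ["d1"], "q2": ["d7", "d9"]}
run_model_a = {"q1": ["d1", "d2"], "q2": ["d3", "d7"]}
metrics_a = evaluate(run_model_a, qrels, k=2)
```

Running `evaluate` once per candidate model on the same `qrels` gives directly comparable numbers for the "pick the winner" step.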

Common follow-ups: - "What's the difference between symmetric and asymmetric embedding models?" - "When would you fine-tune embeddings?" - "How does Matryoshka help?"

Traps: - Picking by MTEB rank only. Distribution mismatch dominates the leaderboard delta. - Forgetting to measure on your data. Default models often surprise on niche content.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/01_ai_engineering/08_rag_system_design/

Q: "When would you fine-tune an embedding model?"

Tags: senior · common · scenario · source: standard senior RAG probe; reported in 2026 AI engineer loops

Answer outline: - Default: don't. Off-the-shelf models (OpenAI 3-large, Cohere embed-v3, BGE) cover most cases. Fine-tuning embeddings is a real commitment. - Fine-tune when: - Domain gap is large: medical, legal, scientific, internal-jargon-heavy corpora where stock models lose recall. - Asymmetry is extreme: very short queries against very long structured docs. - Specific failure mode is repeatable: you have labeled examples of "the query should match document X but doesn't" — a few thousand of these are training material. - Recipe: collect (query, positive_doc, hard_negative_doc) triples — 5-50k typical. Use SentenceTransformers / Sentence-T5 with MultipleNegativesRankingLoss or InfoNCE. 1-3 epochs, low LR. Use a frozen base + lightweight adapter (LoRA on the embedding encoder) to minimize forgetting. - Hard-negative mining is the key trick: use BM25 + your current embedder to find docs that look like positives but aren't. These force the model to learn the discriminating signal. - Evaluate on a held-out (query, doc) eval set — recall@5, NDCG@10. Plus a generalization eval — make sure you didn't overfit to a narrow domain at the cost of general capability. - Re-tune when the base embedding model upgrades. Embedding fine-tunes are bound to the base, like LLM fine-tunes. - Numbers to drop: "5-50k labeled pairs typical", "hard-negative mining boosts NDCG@10 by 5-15% over random negatives", "LoRA on the encoder: ~1-5% of base params"
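Hard-negative mining from the recipe above can be sketched as follows. The function name, the `skip_top` knob, and the data layout are assumptions for illustration; the ranking would come from BM25 or the current embedder.

```python
def mine_hard_negatives(query_id, ranked_ids, positives, per_query=2, skip_top=0):
    """Build (query, positive, hard_negative) triples from one retriever ranking.

    ranked_ids: doc ids from the current retriever, best first.
    positives:  set of doc ids labeled relevant for this query.
    skip_top:   optionally skip the first few non-positives, which are the most
                likely to be unlabeled positives (an assumption; tune per corpus).
    """
    # Highly ranked docs that are NOT labeled relevant look like positives
    # to the current retriever, which is what makes them hard negatives.
    non_positives = [d for d in ranked_ids if d not in positives]
    negatives = non_positives[skip_top:skip_top + per_query]
    return [(query_id, pos, neg) for pos in sorted(positives) for neg in negatives]

triples = mine_hard_negatives(
    "q1",
    ranked_ids=["d4", "d1", "d9", "d2"],
    positives={"d1"},
)
```

The resulting triples are the input format a contrastive loss such as MultipleNegativesRankingLoss or InfoNCE expects.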

Common follow-ups: - "Where do labeled triples come from?" - "Walk me through hard-negative mining." - "How is this different from fine-tuning an LLM?"

Traps: - Fine-tuning without hard negatives. Random negatives are too easy; the model doesn't learn. - Skipping the generalization eval. Embedding fine-tunes overfit to narrow distributions.

Related cross-cutting: Retrieval, Fine-tuning vs alternatives · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/00_ai_foundation/06_adaptation_compression/

Q: "Where do embeddings fail? Discuss negation, temporal reasoning, and precision requirements."

Tags: senior · common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Embeddings collapse meaning into a single vector, and some signals don't survive that compression. - Negation: "documents that don't mention COVID" embeds nearly identically to "documents that mention COVID". The negation token doesn't materially shift the vector. Fix: structured query parsing (handle "not X" as a filter, not as embedded text), or query-rewriting that converts negation to metadata filters. - Temporal reasoning: "papers published after 2024" — embedding has no notion of recency. Fix: extract date constraints to metadata filters; never rely on the vector for time. - Numerical precision: "documents about deals over $10M" — embeddings treat "$10M" as a semantic blob, not a comparison. Fix: extract numerical constraints to filters. - Exact-keyword / rare entities: product codes, person names, function names — covered earlier; BM25 leg of hybrid handles these. - Multi-hop relations: "documents referencing the company that acquired X" requires reasoning across documents. Single-document embedding can't capture this. Fix: agentic retrieval, multi-hop query decomposition, or knowledge-graph RAG. - Polysemy: "Apple" the company vs the fruit. Without context, embedding picks one; sometimes wrong. Fix: query expansion with disambiguating terms, or context-aware retrieval. - The senior insight: embedding is one retrieval signal, not the whole story. Production retrieval combines vectors with structured filters, BM25, and sometimes graph traversal — each handles a class of query the others miss. - Numbers to drop: "negation queries: dense recall@10 drops to ~30-50% vs ~80%+ on positive queries", "temporal queries: pure-dense effectively random above the recency cliff"
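The "extract constraints to metadata filters" fix can be illustrated with a toy parser. The regex patterns and the filter keys (`year_gt`, `must_not_contain`) are hypothetical; a production system would more likely use an LLM or a proper query parser for this step.

```python
import re

def split_constraints(query):
    """Pull simple temporal and negation constraints out of a query string.

    Returns (text_for_embedding, filters). Patterns are illustrative only.
    """
    filters = {}
    text = query

    # Temporal constraint: "after 2024" / "published after 2024".
    m = re.search(r"\b(?:published\s+)?after\s+(\d{4})\b", text, flags=re.IGNORECASE)
    if m:
        filters["year_gt"] = int(m.group(1))
        text = text[:m.start()] + text[m.end():]

    # Negation: "that don't mention X" becomes an exclusion filter.
    m = re.search(r"\bthat\s+(?:don't|do\s+not)\s+mention\s+(\w+)", text, flags=re.IGNORECASE)
    if m:
        filters["must_not_contain"] = m.group(1)
        text = text[:m.start()] + text[m.end():]

    # Only the remaining text is sent to the embedding model.
    return " ".join(text.split()), filters

text, filters = split_constraints("papers that don't mention COVID published after 2024")
```

The point of the split is that the vector search only sees the semantic remainder, while the date and negation are enforced exactly by the metadata layer.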

Common follow-ups: - "How do you handle negation in practice?" - "Walk me through a multi-hop retrieval design."

Traps: - Believing a stronger embedding will fix all these. The failures are structural, not capacity-bound.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/01_ai_engineering/10_knowledge_graph_retrieval/

ANN indexes: HNSW, IVF, ScaNN, DiskANN

Q: "How does HNSW work, and when do you use it?"

Tags: senior · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026; standard ANN-internals probe

Answer outline: - HNSW = Hierarchical Navigable Small World graph. The index is a multi-layer proximity graph where each layer is a sparser version of the layer below. - Search: start at the top (sparsest) layer, greedy-walk toward the query. Drop down to the next layer, refine. Continue to the bottom layer where the actual nearest neighbors live. Result: log-scale traversal, sub-linear search time even on billions of vectors. - Build parameters: - M: max connections per node (16-64 typical). Higher M → better recall but more memory and slower build. - efConstruction: build-time search quality (typically 100-500). Higher → better-quality graph but slower build. - Query parameters: - efSearch (or ef): runtime search quality (typically 50-500). Higher → better recall but slower query. This is the runtime knob you tune for recall/latency. - When to use: most production setups. Excellent recall (95-99% of true top-k), fast queries (<10ms for millions of vectors), graceful degradation. - Trade-offs: in-memory (HNSW graph + vectors live in RAM), build cost (minutes-hours for millions of docs), expensive to update at scale (deletes are tombstones, rebuild periodically). - Numbers to drop: "M=16-32, efConstruction=200, efSearch=100 is a robust default", "HNSW recall@10: 95-99% at efSearch=100", "memory: ~1.5-2× the raw vector size for the graph overhead"
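The layer-by-layer descent can be sketched with a deliberately simplified search. This toy version keeps a single candidate per layer (roughly efSearch=1) and takes the graphs as given; real HNSW keeps a candidate heap of size ef at the bottom layer and builds the graph itself. All names here are illustrative.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Walk one layer: hop to the closest neighbor until nothing is closer."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        neighbors = graph.get(current, [])
        if not neighbors:
            return current
        best = min(neighbors, key=lambda n: math.dist(vectors[n], query))
        best_dist = math.dist(vectors[best], query)
        if best_dist >= current_dist:
            return current  # local minimum for this layer
        current, current_dist = best, best_dist

def layered_search(layers, vectors, query, entry):
    """Descend from the sparsest layer to the densest, reusing each layer's
    result as the entry point for the next one."""
    for graph in layers:  # layers[0] is the top (sparsest) layer
        entry = greedy_search(graph, vectors, query, entry)
    return entry

# Six points on a line; the top layer only links nodes 0 and 3.
vectors = {i: (float(i), 0.0) for i in range(6)}
top_layer = {0: [3], 3: [0]}
bottom_layer = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
nearest = layered_search([top_layer, bottom_layer], vectors, (4.2, 0.0), entry=0)
```

The top layer jumps from node 0 to node 3 in one hop, and the bottom layer refines to node 4, which is why the sparse upper layers cut traversal cost.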

Common follow-ups: - "What's the trade-off between M and efConstruction?" - "How do you handle deletes in HNSW?" - "What's the difference between HNSW and IVF?"

Traps: - Saying HNSW is "just k-NN" — it's an approximate nearest neighbor structure with tunable recall/latency. - Confusing build-time and query-time parameters. efConstruction is build, efSearch is query.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "Compare HNSW, IVF, and DiskANN. When do you pick each?"

Tags: staff · common · conceptual · source: standard staff-level ANN-tradeoff probe; vector database loops 2026

Answer outline: - HNSW: graph-based, in-memory, best recall/latency on million-to-billion scale when RAM fits. Default for most production setups. Cons: memory-hungry, expensive deletes. - IVF (Inverted File): clusters vectors into a fixed number of partitions (centroids via k-means; nlist in Faiss terminology); search visits only the few partitions whose centroids are closest to the query (nprobe). Lower memory overhead than HNSW, faster builds, lower recall at equivalent latency. Often paired with quantization (IVF-PQ, IVF-OPQ) for billion-scale on commodity hardware. - DiskANN (Vamana graph): designed for out-of-core search, where the index lives on SSD and only the working set is in RAM. Best for billion+ vectors on tight memory budgets. Cons: SSD-bound latency, more complex ops. - ScaNN (Google): partitioning plus anisotropic quantization, optimized for maximum inner-product search. "Asymmetric" here means an unquantized query is scored against quantized database vectors; it says nothing about query or document length. - FLAT (brute force): exact search, no index. Fits sub-1M vectors where you can afford O(N) per query, and is sometimes preferable for its engineering simplicity. - Decision tree: - <1M vectors → FLAT or HNSW; pick FLAT if simplicity matters. - 1M-100M vectors with RAM headroom → HNSW. - 100M-billion vectors → HNSW (lots of RAM) or IVF-PQ + reranking (less RAM). - Billion+ vectors with tight memory → DiskANN. - Numbers to drop: "HNSW: 1.5-2× vector size in RAM", "IVF-PQ: 0.1-0.3× via product quantization", "DiskANN: ~10% of vectors hot in RAM, rest on SSD"
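The IVF idea of scanning only the closest partitions, and the recall it can cost, shows up even in a toy example. In this sketch the centroids are supplied by hand (a real index learns them with k-means), and `nprobe` follows Faiss naming for the number of partitions visited; the function names are made up for illustration.

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid (one inverted list each)."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in by_distance[:nprobe] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]

centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (0.0, 1.0), "b": (1.0, 0.0), "c": (9.0, 10.0), "d": (4.9, 4.9)}
lists = build_ivf(vectors, centroids)

query = (5.2, 5.2)
fast = ivf_search(query, vectors, centroids, lists, nprobe=1)
thorough = ivf_search(query, vectors, centroids, lists, nprobe=2)
```

Here the true nearest neighbor "d" sits just across a partition boundary, so probing one partition misses it and probing both finds it. That boundary effect is the recall/latency trade-off the partition-probing knob controls.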

Common follow-ups: - "How does product quantization fit in?" - "What's the recall hit from PQ?" - "Why is DiskANN harder to operate?"

Traps: - One-size-fits-all answer. The right index depends on scale, memory budget, update pattern, latency SLO.

Related cross-cutting: Retrieval, Cost & latency · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Q: "What is product quantization (PQ) and what does it cost in recall?"

Tags: senior · common · conceptual · source: standard senior ANN probe; vector DB interview loops 2026

Answer outline: - PQ splits each vector into M sub-vectors, runs k-means on each sub-vector position independently to find K centroids, and replaces each sub-vector with its centroid ID (typically 8 bits = 256 centroids). - A 1536-dim FP32 vector (6144 bytes) compresses to 96 bytes (with M=96 sub-vectors, K=256 centroids), which is 64× smaller. - Search: instead of computing exact distances, compute an approximate distance using lookup tables of centroid distances. Very fast (just additions), but lossy. - Quality hit: typically a 5-15% NDCG@10 drop vs FLAT, depending on M and the data distribution. Mitigation: PQ + reranking. Retrieve top-100 with PQ, then rerank those 100 with exact distances on full vectors. Recovers most of the lost recall. - Variants: OPQ (Optimized PQ) rotates the embedding space first to align with PQ's grid structure, usually worth 1-3% recall over plain PQ. - Trade-off summary: PQ trades roughly 10% retrieval quality for a ~64× memory reduction in the configuration above. Reranking trades a few ms of latency for most of that quality back. - Numbers to drop: "PQ 8x96 (M=96, 8-bit) compresses 1536-dim FP32 from 6144 to 96 bytes (64×)", "recall hit: 5-15% NDCG@10", "PQ + reranking with top-100: closes 70-90% of the recall gap"
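The compression arithmetic and the lookup-table search can be checked with a short sketch. The codebooks below are hand-written two-centroid toys (a real index learns 256 centroids per sub-vector with k-means), and the function names are illustrative.

```python
def compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Raw FP32 size divided by PQ code size for one vector."""
    raw_bytes = dim * bytes_per_float
    pq_bytes = num_subvectors * bits_per_code / 8
    return raw_bytes / pq_bytes

def build_tables(query, codebooks):
    """Per sub-vector, the squared distance from the query slice to every centroid.
    Built once per query and reused for every document."""
    sub_dim = len(query) // len(codebooks)
    tables = []
    for m, codebook in enumerate(codebooks):
        q_sub = query[m * sub_dim:(m + 1) * sub_dim]
        tables.append([sum((a - b) ** 2 for a, b in zip(q_sub, c)) for c in codebook])
    return tables

def adc_distance(tables, codes):
    """Approximate squared distance to a PQ-encoded doc: one lookup per sub-vector."""
    return sum(tables[m][code] for m, code in enumerate(codes))

ratio = compression_ratio(dim=1536, num_subvectors=96)

# Toy setup: 4-dim vectors, 2 sub-vectors, 2 centroids per sub-vector.
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 2.0)]]
tables = build_tables([1.0, 1.0, 0.0, 0.0], codebooks)
near = adc_distance(tables, codes=(1, 0))   # doc encoded as centroids (1,1) and (0,0)
far = adc_distance(tables, codes=(0, 1))    # doc encoded as centroids (0,0) and (2,2)
```

Scoring a document costs one table lookup and one addition per sub-vector, which is where the speed comes from; the error comes from replacing each sub-vector with its nearest centroid.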

Common follow-ups: - "Why does PQ work?" - "When is PQ worth the recall hit?" - "Walk me through PQ + reranking."

Traps: - Treating PQ as free. The recall drop is real; only reranking redeems it.

Related cross-cutting: Retrieval, Cost & latency · Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Hybrid search & fusion

Q: "How do you tune the alpha in hybrid (dense + BM25) scoring?"

Tags: senior · common · scenario · source: standard senior hybrid-retrieval probe; reported across 2026 RAG loops

Answer outline: - Two main fusion strategies: - Score-weighted: score = α · dense_score + (1-α) · bm25_score. Requires score normalization because dense scores (typically [-1, 1] cosine) and BM25 scores (unbounded positive) live in incomparable spaces. - Rank-based (RRF): ignores raw scores; combines ranks via RRF(d) = Σ 1/(k + rank_i(d)) with k≈60 by convention. Prevents score-scale mismatch. Standard 2026 default. - For score-weighted hybrid, normalize per-retriever (min-max or z-score over the top-K) before combining. Then sweep α on a held-out eval set; typical sweet spot α=0.4-0.7 (slightly dense-leaning, but query-distribution-dependent). - For RRF, the only tunable is k (default 60). Doesn't need normalization. Less performant on rare-tail edges but robust and standard. - The 2026 default: start with RRF (k=60). Move to score-weighted only if you have a clear evaluation case for it. - Per-query alpha: some 2026 systems route by query type — keyword-heavy queries weight BM25 higher, paraphrase queries weight dense higher. A small classifier decides. - Always evaluate on a labeled eval set; "alpha = 0.7" is meaningless without per-query-class metrics. - Numbers to drop: "RRF with k=60 is the standard. Score-weighted typically α=0.4-0.7.", "hybrid NDCG@10 lift over pure-dense: 5-20% on mixed-query workloads"
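Score-weighted fusion with per-retriever normalization can be sketched like this. Min-max normalization over each retriever's returned top-K is one of the two options named above; the function names and the choice to score missing documents as 0 are assumptions for the example.

```python
def min_max(scores):
    """Rescale one retriever's scores to [0, 1] over its returned top-K."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}  # all tied: treat as equally strong
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(dense, bm25, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * bm25, on normalized scores.
    A doc missing from one retriever gets 0 from it (an assumed convention)."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in set(dense_n) | set(bm25_n)
    }
    return sorted(fused, key=fused.get, reverse=True)

dense = {"a": 0.82, "b": 0.80, "c": 0.40}   # cosine similarities
bm25 = {"b": 12.0, "c": 11.0, "d": 3.0}     # unbounded BM25 scores
ranking = weighted_fusion(dense, bm25, alpha=0.5)
```

Without the `min_max` step the BM25 scores (around 3 to 12) would swamp the cosine scores (below 1) and alpha would barely matter. Sweeping alpha then means calling `weighted_fusion` over a grid of values and scoring each ranking on the labeled eval set.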

Common follow-ups: - "Why does RRF avoid the normalization problem?" - "How would you build a per-query α router?" - "What's the failure mode of score-weighted without normalization?"

Traps: - Combining raw scores without normalization. The result is dominated by whichever retriever has higher absolute scores. - Not evaluating per query class. The aggregate alpha hides large per-class differences.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "Walk me through implementing RRF for hybrid retrieval."

Tags: mid · common · coding · source: standard coding-round probe in 2026 RAG loops; cited in Elasticsearch hybrid search 2026 guide

Answer outline: - Function: given multiple ranked lists of documents (each retriever produces one), produce a fused ranking. - For each document d appearing in any list: RRF_score(d) = Σ_i 1 / (k + rank_i(d)), where rank_i(d) is d's rank in retriever i's list (1-indexed), and k is a constant (60 by convention). - Sketch ~15 lines of Python: aggregate document IDs across retrievers, compute RRF score for each, sort descending, return top-K. - Edge cases: a doc appearing in only one list still gets a score (1/(k+rank)). A doc missing from a list contributes 0 from that retriever. k=60 means rank 1 contributes 1/61 ≈ 0.0164; rank 100 contributes 1/160 ≈ 0.00625 — significant rank differentiation. - Why k=60: empirically chosen in the original RRF paper. Smaller k weights top ranks more aggressively; larger k flattens. - Senior tell: candidate notes that RRF naturally handles the "rare strong signal" case — a doc ranked top by one retriever but absent from the other still scores well. Score-weighted fusion can suppress this. - Numbers to drop: "k=60 standard", "RRF compute: O(N) per fusion (N = union size of retriever results)", "in production: fuse on top-50 per retriever, return top-10 for reranking"
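One way the ~15-line Python sketch could look is shown below. The function name `rrf` and its `top_k` parameter are illustrative choices; the scoring follows the formula in the outline with 1-indexed ranks.

```python
from collections import defaultdict

def rrf(rankings, k=60, top_k=10):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    rankings: list of lists, each ordered best-first by one retriever.
    Returns [(doc_id, score), ...] sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] += 1.0 / (k + rank)
    # Docs missing from a list simply receive no contribution from it.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

dense = ["d1", "d2", "d3"]
bm25 = ["d3", "d1", "d4"]
fused = rrf([dense, bm25])
```

In this example d1 (ranks 1 and 2) edges out d3 (ranks 3 and 1), and the two documents that appear in only one list still receive a score and land behind them.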

Common follow-ups: - "What if a doc is ranked low in one retriever and high in another?" - "Why not just average ranks?"

Traps: - Combining scores instead of ranks (that's score-weighted fusion, not RRF). - Forgetting that documents missing from a list contribute 0 — handle them carefully if you're not unioning.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "What's the difference between hybrid retrieval and ensemble retrieval?"

Tags: senior · occasional · conceptual · source: standard senior retrieval-architecture probe; reported in 2026 RAG loops

Answer outline: - Hybrid typically means combining a few well-known retrievers (dense + BM25, sometimes + structured filter) via a fusion step. - Ensemble is a broader pattern: multiple retrievers, possibly heterogeneous (multiple dense models, multiple BM25 variants with different analyzers, multiple metadata-filter strategies), fused together. - In practice: hybrid is the common case (2-3 retrievers); ensemble is the over-engineered case where you stack 5-10 retrievers. More retrievers ≠ better — recall ceiling caps out, latency adds up, fusion noise creeps in. - The 2026 norm: 2-3 retrievers max (dense + BM25 + sometimes a metadata-filtered pre-retrieval). Anything more usually doesn't justify the latency. - Ensemble pays when retrievers fail in uncorrelated ways. If retriever A and retriever B fail on the same queries, adding B doesn't help. Run a diversity analysis before stacking. - Numbers to drop: "diminishing returns after 3 retrievers in most setups", "hybrid latency: dense (~30-100ms) + BM25 (~10-30ms) + fusion (<1ms) + rerank (~50-200ms)"
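The "do the retrievers fail in uncorrelated ways" check can be sketched as a marginal-recall measurement: how many queries do the existing retrievers miss that the candidate retriever rescues? The function name and the hit definition (any relevant doc in the top-k) are assumptions for illustration.

```python
def marginal_recall(existing_runs, new_run, qrels, k=10):
    """Measure what a candidate retriever adds on top of the existing ones.

    existing_runs: list of {query_id: ranked doc ids}, one per current retriever.
    new_run:       {query_id: ranked doc ids} for the candidate retriever.
    qrels:         {query_id: doc ids labeled relevant}.
    Returns (gain, missed, rescued), where gain is the share of all queries
    that only the new retriever answers within the top-k.
    """
    def hit(run, q):
        return bool(set(run.get(q, [])[:k]) & set(qrels[q]))

    missed = [q for q in qrels if not any(hit(r, q) for r in existing_runs)]
    rescued = [q for q in missed if hit(new_run, q)]
    return len(rescued) / len(qrels), missed, rescued

qrels = {"q1": ["a"], "q2": ["b"], "q3": ["c"]}
dense_run = {"q1": ["a"], "q2": ["x"], "q3": ["y"]}
bm25_run = {"q1": ["a"], "q2": ["b"], "q3": ["z"]}
gain, missed, rescued = marginal_recall([dense_run], bm25_run, qrels)
```

A gain near zero means the new retriever fails on the same queries as the existing ones and is unlikely to justify its latency.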

Common follow-ups: - "When does ensemble pay off?" - "How would you measure if a new retriever adds signal?"

Traps: - Adding retrievers without checking they're complementary.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Rerankers

Q: "What's a cross-encoder reranker and why is it so much better than the retrieval scoring?"

Tags: mid · very-common · conceptual · source: Sentence Embeddings hackerllama post; standard senior reranking probe; reported across 2026 RAG loops

Answer outline: - A cross-encoder takes the (query, document) pair as joint input to a transformer and outputs a single relevance score. The model can attend across query and document tokens together. - Versus a bi-encoder (the embedding model): bi-encoders encode query and document separately, then compare via dot product or cosine. No cross-attention; the comparison is limited to what survives in the final vectors. - The asymmetry: bi-encoders are fast (precompute doc embeddings, just embed the query at runtime, compare via ANN) but coarse. Cross-encoders are slow (full transformer pass per (query, doc) pair) but precise. - Production pattern: retrieve top-50 with hybrid (fast, recall-oriented), rerank top-50 with a cross-encoder (slow but precise), pass top-5 to the LLM. Best of both worlds. - Quality gain: cross-encoder reranking typically lifts NDCG@10 by 10-30% over the raw retrieval scores. It's the single highest-leverage RAG improvement after getting basic retrieval working. - Numbers to drop: "retrieve N=50, rerank to K=5", "cross-encoder per-pair: 10-50ms on GPU; CPU-only ~100-300ms", "NDCG@10 lift: 10-30% over raw retrieval"

Common follow-ups: - "Why can't the bi-encoder just be better?" - "How does ColBERT fit in here?" - "What if reranking latency is too high?"

Traps: - Calling the reranker "just another retriever". Architecturally distinct — cross-attention is the key. - Reranking with the same encoder used for retrieval. Defeats the purpose; the cross-attention is what helps.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "How do you choose between Cohere Rerank, BGE, Jina, ColBERT, and an LLM-as-reranker?"

Tags: senior · common · scenario · source: Reranking in RAG 2026 guides (Vaibhav Dixit, ZeroEntropy); standard senior reranker probe

Answer outline: - Cohere Rerank (v3 / Nimble): SaaS API, multilingual, ~$2/1k queries. Best when speed-to-ship matters and the budget can absorb per-query cost. Roughly 100ms for top-50 reranking. - BGE reranker / Jina reranker (open weights, self-hostable): run on your own GPU. Licenses vary by model and version, so check before commercial use. Comparable quality to Cohere on English; some lag on multilingual. Cost: GPU ops + engineer time vs zero per-query fee. Default for high-volume self-hosted setups. - ColBERT (ColBERTv2): late-interaction reranker. Encodes query and document as sets of token vectors, then sums each query token's maximum similarity (max-sim) over the document tokens. Faster than a cross-encoder, slightly lower quality, scales to larger candidate sets (top-100 to top-500). Good when retrieval and reranking are merged into one stage or the candidate set is too big for a cross-encoder. - LLM-as-reranker: feed (query, candidate docs) to a chat LLM ("rank these by relevance"). Most flexible: it can follow instructions and handle complex relevance criteria. Cons: 10× cost and latency vs specialized rerankers, less benchmark-validated. Use only when you have unusual ranking criteria that specialized models can't express. - Decision: start with BGE / Cohere for a straightforward cross-encoder rerank. ColBERT for very large candidate sets. LLM-as-reranker only for unusual relevance criteria. - Numbers to drop: "Cohere Rerank: $2/1k queries", "BGE reranker base: ~50-100ms per top-50 on a single GPU", "ColBERT-v2: 5-10× faster than cross-encoder at slightly lower quality", "LLM-as-reranker: 10× cost and latency vs specialized"

Common follow-ups: - "When does LLM-as-reranker make sense?" - "How does ColBERT's late-interaction work?" - "What's the latency budget for reranking?"

Traps: - Reaching for LLM-as-reranker by default. Specialized cross-encoders are cheaper and better on most workloads.

Related cross-cutting: Retrieval, Cost & latency · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "Explain ColBERT's late-interaction architecture."

Tags: staff · occasional · conceptual · source: ColBERT / ColBERTv2 papers; standard staff-level retrieval probe 2026

Answer outline: - Bi-encoder: encode query into one vector, document into one vector, compare by dot product. Fast but compresses meaning. - Cross-encoder: encode (query + doc) jointly through a full transformer, output one score. Precise but slow. - ColBERT (late interaction) is in between. Encode the query as a set of token vectors (one per token); encode the document as another set of token vectors. Score = sum over query tokens of the maximum similarity (max-sim) with any document token. The "late interaction" is that the comparison happens at the per-token level after independent encoding. - Advantage over bi-encoder: per-token granularity, less information loss. - Advantage over cross-encoder: queries and documents are encoded independently, so document vectors can be precomputed offline. - ColBERTv2 adds residual compression so token-vector storage is feasible at scale. - Use case: rerank a large candidate set (top-100 to top-500) faster than a cross-encoder would. Also a primary retriever in PLAID/ColBERT-based search engines. - Numbers to drop: "ColBERT stores ~50-200 token vectors per document", "5-10× faster than cross-encoder at slightly lower quality", "storage: 10-30× larger than single-vector embedding"
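The max-sim scoring rule is small enough to write out directly. In this sketch the token vectors are hand-made 2-d toys standing in for encoder output, and the helper names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens; each document is a list of precomputed token vectors.
query_vecs = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]   # covers both query tokens
doc_b = [(1.0, 0.0), (0.7, 0.1)]               # covers only the first

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
```

Because `doc_a` and `doc_b` never see the query during encoding, their token vectors can be computed and stored offline; only the cheap max-and-sum runs at query time.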

Common follow-ups: - "Why is storage so much higher than bi-encoder?" - "When does ColBERT replace a regular cross-encoder?"

Traps: - Confusing ColBERT with cross-encoder. Late vs full interaction is the key distinction.

Related cross-cutting: Retrieval · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "Should you use an LLM as a reranker?"

Tags: senior · common · scenario · source: ZeroEntropy LLM-as-reranker guide 2026; standard senior reranker probe

Answer outline: - Most of the time: no. Specialized cross-encoder rerankers (Cohere, BGE, Jina) trained for ranking match or outperform LLM-based reranking on most standard workloads, at roughly 10× lower cost and 5-10× lower latency. - LLM-as-reranker pros: flexibility (you can instruct it with task-specific relevance criteria, e.g. "prefer documents that cite legal precedent"), interpretability (the LLM can output reasoning for its ranking). - LLM-as-reranker cons: 10× the cost, 5-10× the latency, sensitivity to prompt phrasing, position bias (LLMs favor documents earlier in the list, sometimes severely). - The 2026 evidence: cross-encoders trained for reranking beat LLM-as-reranker on standard ranking benchmarks (MTEB rerank tasks, BEIR). LLMs occasionally spike on narrow tasks but lose overall. - When LLM-as-reranker makes sense: unusual relevance criteria that specialized models can't express, very small candidate sets (<10) where the cost is negligible, or when you need explainable ranking output. - Pointwise (score each doc individually) is most common; listwise (the LLM sees all candidates and reorders them) is more powerful, but prompt length and cost grow with the candidate count. - Numbers to drop: "LLM rerank cost: 10× specialized cross-encoder", "Cohere Rerank: $2/1k queries vs LLM: $20-50/1k queries", "specialized cross-encoder: 50-100ms; LLM-as-reranker: 500ms-2s"

Common follow-ups: - "What's position bias in LLM-as-reranker?" - "How do you mitigate it?" - "Pointwise vs listwise — which?"

Traps: - Defaulting to LLM-as-reranker because "LLMs are smarter". They are, but specialized models are smarter at this specific task.

Related cross-cutting: Retrieval, Cost & latency · Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Diversity — MMR and beyond¶

Q: "What is MMR (Maximal Marginal Relevance) and when do you use it?"¶

Tags: senior · common · conceptual · source: standard senior retrieval-diversity probe; reported in 2026 RAG loops

Answer outline: - MMR addresses redundancy. Pure top-K retrieval returns the K most similar documents — which often means 5 near-duplicate chunks from the same document. The model sees the same fact restated 5 times. - MMR selects iteratively: pick the document with the highest weighted score λ · Sim(d, query) - (1-λ) · max_{d' in selected} Sim(d, d'). The first term rewards relevance; the second penalizes similarity to already-selected documents. - λ in [0, 1]: λ=1 is pure relevance (top-K), λ=0 is pure diversity. Typical sweet spot: λ=0.5-0.7. - Use cases: - Multi-document Q&A: ensure the top-K covers different sources, not 5 chunks of the same doc. - Long-document summarization: pick chunks from different sections of the same document. - Comparison queries: "compare X and Y" needs chunks about both, not five about one. - Less critical when reranking already enforces diversity (some rerankers do, most don't) or when the retrieval corpus is naturally diverse (low duplication). - Numbers to drop: "λ=0.5-0.7 typical", "MMR overhead: O(K^2) similarity comparisons after retrieval — negligible at K=10-50", "MMR can lift answer quality 5-15% on multi-source queries"
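A minimal sketch of the greedy MMR selection described above, in plain Python. The function names, the toy 2-d vectors, and the choice of cosine for both similarity terms are assumptions made for illustration; a real pipeline would reuse the embeddings and similarity function of its retriever.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k, lam=0.7):
    """Greedily pick k doc indices maximizing lam*relevance - (1-lam)*redundancy."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected doc.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 is a different angle.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]
print(mmr(query, docs, k=2, lam=1.0))  # pure relevance keeps both near-duplicates
print(mmr(query, docs, k=2, lam=0.5))  # lower lambda swaps one duplicate for doc 2
```

The inner `max` over `selected` is where the O(K^2) comparison cost comes from; at K=10-50 that is small next to retrieval and reranking.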

Common follow-ups: - "How does MMR interact with reranking?" - "Why not just deduplicate?" - "What's a failure mode of MMR?"

Traps: - Setting λ=0 (pure diversity) — you get unrelated documents. - Applying MMR before reranking. Better to rerank first, then MMR-diversify the top-K of the reranked list.

Related cross-cutting: Retrieval Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Q: "How do you handle duplicate or near-duplicate documents in retrieval?"¶

Tags: senior · common · design · source: standard senior retrieval-hygiene probe; reported in 2026 RAG loops

Answer outline: - Three points to attack: indexing, retrieval, and post-retrieval. - Indexing-time dedupe: hash-based exact dedupe (SHA on normalized text), n-gram-overlap near-dedupe (Jaccard / MinHash), embedding-cosine near-dedupe (vectors with cos > 0.95 are likely duplicates). Best to do all three at ingestion. - Retrieval-time dedupe: if dedupe at indexing missed some, deduplicate the top-K results post-query. Hash by chunk text, or cluster by embedding and pick a representative. - Post-retrieval dedupe via MMR: enforces diversity even when duplicates slipped through. - For corpora with legitimate near-duplicates (versions of the same doc, translations, paraphrases): keep them as separate IDs but cluster at retrieval time so only one representative goes to the LLM. - For RAG-specific cases: a single source document chunked into many overlapping windows will produce near-duplicate retrievals. Either reduce overlap, or MMR-diversify. - Numbers to drop: "n-gram dedupe threshold: 8-gram Jaccard > 0.5", "embedding dedupe threshold: cosine > 0.95", "expect 5-20% of any web/wiki corpus to be near-duplicate without dedupe"
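A rough sketch of the first two indexing-time checks (exact hash, then n-gram Jaccard). Everything here is illustrative: the helper names are made up, the shingle size is 3 words so the toy strings are long enough to compare (the outline suggests 8-grams on real text), and the pairwise loop is O(n^2), which is where MinHash/LSH would come in at corpus scale.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_key(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text, n=3):
    """Set of word n-grams; texts shorter than n words become one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedupe(chunks, n=3, threshold=0.5):
    """Keep the first occurrence; drop exact and near-duplicate chunks."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for chunk in chunks:
        key = exact_key(chunk)
        if key in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(chunk, n)
        if any(jaccard(sh, other) > threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        seen_hashes.add(key)
        kept.append(chunk)
        kept_shingles.append(sh)
    return kept

chunks = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",       # exact dup after normalization
    "The quick brown fox jumps over the lazy dog today",  # near-dup
    "Vector indexes trade recall for latency",
]
print(dedupe(chunks))
```

To address the version-pointer trap, a production variant would record which kept chunk each dropped chunk mapped to instead of discarding it silently.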

Common follow-ups: - "What's the trade-off between MinHash and embedding-based dedupe?" - "How do you choose the representative when documents are near-duplicate?"

Traps: - Ignoring dedupe. Production corpora always have more duplicates than expected. - Aggressive dedupe at indexing without keeping a version pointer — you lose recall for legitimate version-distinct queries.

Related cross-cutting: Retrieval Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/01_ai_engineering/06_evidence_data_pipelines/

Ranking metrics¶

Q: "How do you evaluate a RAG pipeline? What metrics would you use? (NDCG, MRR, precision@k, recall)"¶

Tags: mid · very-common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Two layers: retrieval evaluation (is the retriever pulling the right docs?) and end-to-end evaluation (is the final answer right?). - For retrieval: - Recall@K: of all the relevant documents for the query, what fraction appear in the top K? Most direct measure of "are we finding the right stuff?" Recall@5 / Recall@10 typical. - Precision@K: of the K retrieved, how many are relevant? Useful when noise hurts (LLM gets confused by irrelevant context). - MRR (Mean Reciprocal Rank): rewards getting the first relevant doc high. Sensitive to position. Best for "first-answer" use cases (single-doc Q&A). - NDCG@K (Normalized Discounted Cumulative Gain): rewards rank position and graded relevance. The standard general-purpose ranking metric. - For end-to-end (after the LLM generates): - Faithfulness: does the answer make claims supported by the retrieved documents? Critical for hallucination. - Answer relevance: does the answer address the user's question? A factually correct but off-topic answer fails this; factual errors are caught by faithfulness and answer-correctness checks instead. - Context precision / recall (RAGAS-style): given the gold answer, was the right context retrieved (recall) and was it ranked near the top (precision)? - For interview gravitas, also name: a labeled eval set (200-500 (query, gold doc, gold answer) triples), a cadence (run on every retriever change), and a gate (don't ship if NDCG@10 drops > 3%). - Numbers to drop: "eval set: 200-500 labeled triples minimum", "production retrieval target: recall@10 ≥ 0.9 on the labeled set", "NDCG@10 regression threshold: -3% blocks ship"
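The retrieval-side metrics are short enough to write out. The sketch below assumes binary relevance, unique doc IDs in each ranked list, and made-up IDs; function names are illustrative rather than taken from any eval library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that is relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs is a list of (retrieved_list, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical eval rows: ranked doc IDs and the labeled relevant set per query.
q1 = (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d4"})
q2 = (["d5", "d6"], {"d5"})

print(recall_at_k(*q1, k=5))           # 2 of 3 relevant docs found
print(precision_at_k(*q1, k=5))        # 2 of 5 retrieved docs relevant
print(mean_reciprocal_rank([q1, q2]))  # first hits at rank 2 and rank 1
```

Note the different denominators: recall divides by the number of relevant docs, precision by K, which is why the two can move in opposite directions as K grows.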

Common follow-ups: - "What's the difference between MRR and NDCG?" - "When does precision@K matter more than recall@K?" - "How do you label the gold relevant docs?"

Traps: - Listing only end-to-end metrics. Retrieval-side metrics localize where the bug lives. - Reporting only an aggregate. Slice per query class; an average can hide a class that fails badly.

Related cross-cutting: Retrieval, Production patterns Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Walk me through NDCG@K. Why is it better than precision@K?"¶

Tags: senior · common · conceptual · source: standard senior ranking-metrics probe; reported in 2026 RAG loops

Answer outline: - NDCG (Normalized Discounted Cumulative Gain) rewards both relevance grade and rank position. - DCG@K = Σ (2^rel_i - 1) / log2(i + 1), where rel_i is the graded relevance of the doc at position i (e.g., 0=irrelevant, 1=related, 2=relevant, 3=highly relevant). Logarithmic discount of later positions means top positions matter much more. - NDCG = DCG / IDCG, where IDCG is the DCG of the ideal ranking. Normalizes to [0, 1]. - Better than precision@K because: - Graded relevance: precision treats relevant/irrelevant as binary; NDCG uses a scale (some docs are more relevant than others). - Position-aware: a relevant doc at rank 1 contributes more than at rank 10. Precision@K weights all positions equally. - Best when: you have graded relevance labels (most production setups can produce these) and you care about ranking order, not just the K-set. - MRR is a special case useful when you care only about the first relevant doc. Precision/recall are good for set-based retrieval where rank within the set doesn't matter. - Numbers to drop: "NDCG@10 production target: ≥0.7 on RAG retrieval eval set", "rank position-1 contributes 1.0 to DCG; rank 10 contributes ~0.29 (binary rel)"
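The DCG and NDCG definitions above translate directly into a few lines of Python. One simplification to flag: this sketch computes the ideal ranking from the labels of the retrieved list only, whereas a full evaluation would build IDCG from every labeled document for the query.

```python
import math

def dcg_at_k(rels, k):
    """DCG@K = sum over positions i=1..K of (2^rel_i - 1) / log2(i + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels[:k], start=1)
    )

def ndcg_at_k(rels, k):
    """DCG of the given order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded labels in ranked order (0=irrelevant .. 3=highly relevant); made-up example.
ranked_labels = [3, 0, 2, 1]
print(round(ndcg_at_k(ranked_labels, k=4), 3))

# Per-position discount for a binary-relevant doc at ranks 1 and 10.
print(1 / math.log2(1 + 1), round(1 / math.log2(10 + 1), 2))
```

Swapping the irrelevant doc at rank 2 with the grade-2 doc at rank 3 would bring the score to 1.0, which is the position sensitivity that precision@K cannot express.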

Common follow-ups: - "How do you produce graded-relevance labels at scale?" - "When does precision@K still make sense?"

Traps: - Saying NDCG is "always better". Not true if you only have binary labels and don't care about rank order.

Related cross-cutting: Retrieval Related module: learning/01_ai_engineering/07_search_relevance_ranking/

Production scenarios¶

Q: "Your RAG retrieval recall@5 is 0.6. How do you debug?"¶

Tags: senior · very-common · debugging · source: standard senior RAG-debugging probe; reported in 2026 AI engineer loops

Answer outline: - Localize the failure layer. For 30-50 of the failing queries, manually inspect what was retrieved vs what should have been. Cluster failures: - The gold doc isn't in the top-K but is in top-50 → ranking issue (the retriever finds it but ranks it low). Add or improve reranking. - The gold doc isn't in top-50 either → recall issue. Retriever simply doesn't find it. Bigger problem. - The gold doc isn't in the index at all → indexing issue. Data pipeline missed it. - No clear gold doc → ambiguity issue. The query needs rewriting / clarification. - Per-cluster fix: - Ranking issue → add a cross-encoder reranker, tune α/RRF, improve embedding. - Recall issue → switch to hybrid (BM25 + dense), tune chunking, try a stronger embedding model, fine-tune the embedder on your domain. - Indexing issue → audit the ingestion pipeline. Often it's a parsing bug (PDF tables lost), a stale dump, a filter excluding valid docs. - Ambiguity issue → query rewriting, HyDE, multi-query retrieval. - The senior tell: candidate names the labeled eval set as the diagnostic tool and per-failure-class triage as the method. "Try a better embedding" without diagnosis is junior. - Numbers to drop: "manual review of 30-50 failures gives clear cluster signal", "expect 5-15% of failures to be indexing bugs even in mature pipelines"
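The four-way triage can be automated as a first pass before the manual review. This is a hypothetical helper, assuming each eval row carries a gold doc ID (or `None` when no clear gold doc exists), the retriever's ranked IDs to depth 50, and the set of IDs present in the index.

```python
from collections import Counter

def classify_failure(gold_id, ranked_ids, indexed_ids, k=5, deep_k=50):
    """Label one eval query with the failure layer from the triage above."""
    if gold_id is None:
        return "ambiguity"   # no clear gold doc: query needs rewriting
    if gold_id not in indexed_ids:
        return "indexing"    # the data pipeline never indexed it
    if gold_id in ranked_ids[:k]:
        return "ok"
    if gold_id in ranked_ids[:deep_k]:
        return "ranking"     # found, but ranked too low: reranker territory
    return "recall"          # indexed, yet not even in the deep candidate set

indexed = {f"d{i}" for i in range(1, 10)}
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
rows = [("d2", ranked), ("d6", ranked), ("d7", ranked), ("d42", ranked), (None, ranked)]

print(Counter(classify_failure(gold, r, indexed) for gold, r in rows))
```

The resulting counts tell you which fix list to open first; the 30-50 manual inspections then confirm whether the labels themselves are right.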

Common follow-ups: - "How would you build the eval set?" - "What if the gold doc is too long to fit in any single chunk?" - "When does the right answer come from BM25 instead of dense?"

Traps: - Jumping to "switch to a better embedding model" without diagnosis. Often the bug is in chunking or indexing.

Related cross-cutting: Retrieval, Production patterns Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your retrieval is too slow. Walk me through optimizing it."¶

Tags: senior · common · scenario · source: standard senior retrieval-latency probe; reported in 2026 RAG loops

Answer outline: - Profile first. What's slow: dense ANN, BM25, reranking, fusion, or the network round-trip to the vector DB? - Levers by component: - Dense ANN slow: tune efSearch down (recall trade-off), use a smaller index (sharding), switch from HNSW to IVF-PQ for memory efficiency, move to a GPU-accelerated vector DB. - BM25 slow: usually fast already; if not, smaller analyzer chain, or a faster index (Tantivy, Vespa). - Reranking slow: reduce candidate set (top-50 → top-20 before rerank), use a smaller/faster reranker (BGE-base vs BGE-large, ColBERT for batched workloads), batch rerank scoring. - Fusion slow: O(N) over top-K results; usually negligible. If it's slow, you're doing too much. - Network round-trip: co-locate vector DB with app, persistent connections, batch retrievals. - Architectural moves: - Cache common queries: semantic cache returns instant top-K for repeat queries. - Asynchronous prefetch: when the app classifies intent, start retrieval in parallel with downstream planning. - Tier the retrieval: cheap retriever first, expensive rerank only on a fraction of traffic that needs it. - Numbers to drop: "dense ANN target: <50ms p95 on 10M vectors", "rerank target: <200ms p95 for top-50", "retrieval cache hit rate: 30-60% on repeat-query workloads"
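"Profile first" can start as a per-stage timer wrapped around each call. The class below is an illustrative sketch, not a tracing library: stage names are arbitrary strings, the `time.sleep` calls stand in for real retriever and reranker calls, and p95 uses the simple nearest-rank method.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Collects wall-clock samples (in ms) per named pipeline stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def p95(self, name):
        """Nearest-rank 95th percentile, or None if the stage has no samples."""
        values = sorted(self.samples.get(name, []))
        if not values:
            return None
        index = (95 * len(values) + 99) // 100 - 1  # integer ceil(0.95 * n) - 1
        return values[index]

timer = StageTimer()
for _ in range(20):
    with timer.stage("dense_ann"):
        time.sleep(0.001)   # placeholder for the vector DB call
    with timer.stage("rerank"):
        time.sleep(0.002)   # placeholder for the reranker call

for name in ("dense_ann", "rerank"):
    print(name, "p95 ms:", round(timer.p95(name), 2))
```

In a real service these samples would go to the existing tracing or metrics stack; the point is only that each lever in the list maps to one stage whose p95 you can watch before and after the change.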

Common follow-ups: - "What's the latency budget for retrieval in a chat product?" - "Trade-off between recall and latency?"

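Traps: - Cutting candidate-set size without measuring recall. Reranking top-N down to top-K only helps when N is comfortably larger than K, because the gold doc has to be in the candidate set before the reranker can promote it. Measure recall@N before and after any cut.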

Related cross-cutting: Cost & latency, Retrieval Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/05_agent_performance_economics/

Q: "How do you scale a RAG system to 10M+ articles? Discuss sharding, caching, and retrieval optimization."¶

Tags: senior · common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - 10M docs ≈ 10M chunks if you chunk at ~1 chunk/doc; more realistically 30-100M chunks at 500-1500 token chunks. Plan for the chunk count, not the doc count. - Index choice: HNSW if memory fits (roughly 60 GB of raw vectors for 10M chunks at 1536-dim FP32, plus graph overhead; this scales linearly with chunk count), IVF-PQ if memory is tight (1-2 GB compressed), DiskANN for billion+. - Sharding: partition the index across N nodes. Each query hits all shards (scatter-gather) for global top-K, or routes to a shard based on metadata (more efficient but requires good partitioning). Most vector DBs (Milvus, Qdrant, Vespa) handle sharding natively. - Replication: each shard replicated 2-3× for HA and read throughput. Reads load-balanced across replicas. - Caching: provider-side prompt cache + application-side semantic cache for common queries. Expect 30-60% cache hit on FAQ-style traffic. - Hybrid leg: BM25 indexed in Elasticsearch / OpenSearch / Tantivy. Same sharding pattern. - Reranking: gather the top-50 across all shards, then rerank once on the merged list. Reranker on a dedicated GPU pool, batched. - Update pipeline: incremental updates via append-only WAL, periodic compaction. Bulk re-embed on embedding-model change is the heaviest periodic cost. - Monitoring: per-shard recall, per-shard latency, replication lag, embedding-version coverage (some chunks may lag if a re-embed is in progress). - Numbers to drop: "10M chunks × 1536-dim FP32 = ~60 GB raw vectors", "shard count: 4-16 typical at 10M scale", "re-embed for model swap: hours-days; plan storage for parallel old+new during transition"
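Two pieces of this answer are easy to check with a few lines: the raw-vector memory arithmetic and the scatter-gather merge. The sketch below is illustrative; the 96-byte PQ code size is an assumed example value, and the shard results are made-up `(score, doc_id)` pairs where higher scores are better.

```python
import heapq

def raw_vector_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors in decimal GB (FP32 = 4 bytes per value)."""
    return num_vectors * dim * bytes_per_value / 1e9

def scatter_gather(shard_results, k):
    """Merge per-shard (score, doc_id) hits into a global top-k by score."""
    merged = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])

print(raw_vector_gb(10_000_000, 1536))      # FP32 vectors for 10M chunks
print(raw_vector_gb(10_000_000, 96, 1))     # assumed 96-byte PQ codes, before IVF overhead

# Each shard returns its own local top hits; the coordinator merges them.
shard_a = [(0.91, "a"), (0.40, "b")]
shard_b = [(0.88, "c"), (0.87, "d")]
print(scatter_gather([shard_a, shard_b], k=3))
```

For the merged top-K to be exact, every shard has to return at least K local hits; asking each shard for fewer is one of the ways naive scatter-gather loses recall.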

Common follow-ups: - "What goes wrong with naive scatter-gather?" - "How do you handle a stale shard?" - "What about updates — append-only or in-place?"

Traps: - Skipping the re-embed plan. Production teams get bitten when the embedding model upgrades. - No per-shard monitoring. A single bad shard degrades overall recall silently.

Related cross-cutting: Retrieval, Production patterns Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "Your RAG works on the eval set but bombs on real production queries. What's happening?"¶

Tags: senior · very-common · debugging · source: standard senior distribution-shift probe; reported across 2026 RAG loops

Answer outline: - Classic distribution-shift signal. Your eval set was curated; production traffic is wild. - Triage: - Sample 100-300 production queries; cluster them. - Identify clusters underrepresented in your eval set (different intents, longer/shorter queries, code/numerical content, multilingual, edge cases). - For each cluster, manually run retrieval and check recall. - Common production-side issues: - Query length skew: eval set has medium-length questions; users send 3-word fragments or 200-word essays. Both retrieve badly without query rewriting. - Out-of-domain content: production includes content types your eval ignored (tables, code, errors). - Multi-intent queries: real users ask compound questions; retrieval finds neither half. - Spelling / typos / non-canonical phrasing: BM25 misses; dense partially handles. - Stale content references: users ask about content that was indexed weeks ago and is now wrong. - Fix: expand the eval set with the production clusters (this is the durable fix), then improve retrieval for the worst-performing clusters. - Process change: continuous-loop sampling of production traffic for eval-set growth. Static eval sets decay; production traffic doesn't. - Numbers to drop: "eval set should grow weekly by 5-20 production-derived examples", "common gap: eval covers 60-80% of intent space; production exposes the rest"

Common follow-ups: - "How do you decide which production queries to add to the eval?" - "What's the most common production-vs-eval gap you've seen?"

Traps: - Treating the eval set as static. It must grow with production.

Related cross-cutting: Retrieval, Production patterns Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you handle the 'lost in the middle' problem?"¶

Tags: senior · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior RAG probe

Answer outline: - "Lost in the middle" = LLMs attend most to the start and end of their context; information in the middle is more likely to be ignored, even though it's present. - The retrieval implication: passing 10 chunks ranked by relevance loses signal if the model only attends to chunks 1-2 and 9-10. - Mitigations: - Position-aware ordering: put the most relevant chunk at position 1 and the second-most relevant at the very last position (both ends are well attended). Some teams interleave by rank: [rank1, rank3, rank5, ..., rank4, rank2]. - Fewer chunks: pass top-3 or top-5 instead of top-10. Smaller context, less middle. - Chunk summarization: collapse less-relevant chunks into short summaries; they stay in the context but take fewer tokens. - Citation-required answers: force the model to cite source chunk IDs; this acts as a soft attention forcer. - For long-context models (1M+ token windows): the effect is still present though attenuated. Don't assume "we have a big context window" means lost-in-the-middle is solved. - Eval: include "needle in haystack" tests in your eval suite to measure how well the model uses middle-positioned facts. - Numbers to drop: "top-K = 3-5 is the practical sweet spot for most chat workloads", "needle-in-haystack accuracy varies 30-80% at position-50 across models", "position-bias mitigation: reorder so most-relevant at position 1, second-most at the last position"
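The [rank1, rank3, rank5, ..., rank4, rank2] ordering is a two-slice operation. A minimal sketch, assuming the input list is already sorted most-relevant-first (the function name is made up):

```python
def reorder_for_position_bias(ranked_chunks):
    """Place odd ranks at the front and even ranks, reversed, at the back.

    The best chunk ends up first, the second-best last, and the weakest
    chunks land in the middle of the context.
    """
    front = ranked_chunks[0::2]   # rank1, rank3, rank5, ...
    back = ranked_chunks[1::2]    # rank2, rank4, ...
    return front + back[::-1]

chunks = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_position_bias(chunks))
```

Whether this helps is model-dependent, so it belongs behind the same needle-in-haystack eval mentioned above rather than being switched on by default.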

Common follow-ups: - "How do you know if your model has this problem?" - "Does it apply to long-context models?"

Traps: - Cramming more context "to be safe". Often makes things worse.

Related cross-cutting: Retrieval, Cost & latency Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/01_ai_engineering/08_rag_system_design/

Q: "Walk me through a multi-stage retrieval pipeline you'd deploy in production."¶

Tags: senior · very-common · design · source: standard senior RAG-architecture probe; reported across 2026 RAG loops

Answer outline: - Stage 0 — query understanding: classify intent (FAQ vs procedural vs comparative), extract entities, parse temporal/numerical constraints into structured filters, rewrite for retrieval if needed. - Stage 1 — candidate generation: hybrid retrieval (BM25 + dense) with metadata pre-filter. Retrieve top-50 from each retriever, fuse via RRF → top-50 combined. - Stage 2 — reranking: cross-encoder reranker (Cohere Rerank / BGE / Jina) on the top-50 → top-10 by relevance. - Stage 3 — diversity / MMR: optional, apply MMR on the reranked top-10 to drop near-duplicate chunks → top-5. - Stage 4 — context assembly: order chunks (most-relevant first or position-aware to mitigate lost-in-the-middle), add citations/IDs for traceability, format with the prompt template. - Stage 5 — generation: pass to LLM with cite-or-refuse instructions. - Stage 6 — post-generation grounding check: claim extraction + verifier on the response, flag unsupported claims. - Each stage observable (span per stage), each stage has a fallback (e.g., if rerank fails, ship the raw retrieval top-5). - Numbers to drop: "stage latency budget: query understanding 50ms, retrieval 50-100ms, rerank 100-200ms, generation depends on model — total <2s p95 for chat"
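Stages 1-2 plus the "every stage has a fallback" rule can be sketched as follows. The RRF constant k=60 is the commonly used default rather than something this pipeline mandates, and `rerank` is a hypothetical callable standing in for whichever cross-encoder client is deployed.

```python
def rrf_fuse(rankings, k=60, top_n=50):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

def retrieve(bm25_ranking, dense_ranking, rerank, final_k=5):
    """Fuse both legs, rerank, and fall back to the fused order if rerank fails."""
    candidates = rrf_fuse([bm25_ranking, dense_ranking])
    try:
        return rerank(candidates)[:final_k]
    except Exception:
        # Fallback named in the outline: ship the raw retrieval top-K.
        return candidates[:final_k]

def broken_reranker(candidates):
    raise TimeoutError("reranker unavailable")  # simulated outage

bm25 = ["a", "b", "c"]
dense = ["b", "d", "a"]
print(rrf_fuse([bm25, dense]))
print(retrieve(bm25, dense, rerank=broken_reranker, final_k=2))
```

RRF only needs ranks, not scores, which is why it sidesteps the score-normalization problem between BM25 and cosine similarity.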

Common follow-ups: - "Where would you cut to hit a 500ms latency budget?" - "What metric do you alarm on for each stage?"

Traps: - Describing stages without naming the fallback for each.

Related cross-cutting: Retrieval, Architecture choices Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/01_ai_engineering/08_rag_system_design/

Q: "How do you handle metadata filtering alongside vector search?"¶

Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior RAG probe

Answer outline: - Metadata filters (date range, tenant ID, document type, language) typically narrow the candidate set before or during vector search. - Three patterns: - Pre-filter (filter first, then search the subset): cheapest when filters are very selective (1-10% of corpus passes). Risk: when the filter passes a large share of the corpus, searching the subset degrades toward brute force; when it is applied naively on top of a graph index, the surviving nodes can be poorly connected and recall@K drops. - Post-filter (search first, filter results): simpler. Risk: if the filter rejects most of the top-K, you get few survivors. - Filtered search (built into the index): modern vector DBs (Qdrant, Weaviate, Pinecone, Milvus) support filter-aware HNSW where the graph traversal skips nodes that fail the filter. Best of both worlds. - For multi-tenant: always filter by tenant ID. Filter-aware HNSW or strict tenant-level partitioning to prevent any cross-tenant leakage. - Index design: filter fields must be indexed (B-tree or inverted) so the filter step is fast. Otherwise the filter becomes the bottleneck. - Eval: test recall under filters. A query that reaches recall@10=0.9 unfiltered may drop to 0.6 under a tight tenant or date filter. That can be expected when the filter leaves few candidates, but it is worth measuring. - Numbers to drop: "filter-aware HNSW: <2× the unfiltered query latency on selective filters", "naive post-filter: can need 10× the candidate-set size to maintain recall@K"
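The post-filter failure mode is easy to reproduce on toy data. In this sketch the `hits` list stands in for scored vector-search results over the whole corpus (an exact search, for simplicity), and the field names and the `overfetch` parameter are illustrative assumptions.

```python
def post_filter(hits, predicate, k, overfetch=1):
    """Search first (top k * overfetch by score), then drop hits failing the filter."""
    top = sorted(hits, key=lambda h: h["score"], reverse=True)[:k * overfetch]
    return [h for h in top if predicate(h)][:k]

def pre_filter(hits, predicate, k):
    """Filter first, then take the top k of the allowed subset."""
    allowed = [h for h in hits if predicate(h)]
    return sorted(allowed, key=lambda h: h["score"], reverse=True)[:k]

# Toy corpus: the requesting tenant's docs all score below another tenant's docs.
hits = [
    {"id": i, "tenant": "acme" if i >= 6 else "other", "score": (10 - i) / 10}
    for i in range(10)
]

def is_acme(hit):
    return hit["tenant"] == "acme"

print([h["id"] for h in post_filter(hits, is_acme, k=3)])               # no survivors
print([h["id"] for h in post_filter(hits, is_acme, k=3, overfetch=3)])  # overfetch recovers them
print([h["id"] for h in pre_filter(hits, is_acme, k=3)])
```

The overfetch factor needed depends on filter selectivity, which is the intuition behind the "10× the candidate-set size" figure above. For tenant isolation the filter should still be enforced inside the store, not only in application code.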

Common follow-ups: - "When would you use post-filter instead?" - "How do you avoid cross-tenant leaks?"

Traps: - Filter applied in the application after vector search. Misses results; bad performance.

Related cross-cutting: Retrieval Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Embedding fine-tuning and adaptation¶

Q: "Walk me through fine-tuning an embedding model for a domain."¶

Tags: staff · occasional · design · source: standard staff-level retrieval probe; reported in 2026 RAG infra loops

Answer outline: - Step 1: labeled data. Collect (query, positive doc, hard negative doc) triples. 5-50k typical. Sources: production query logs paired with the retrieved doc the user thumbed up (positive), and a doc that looks relevant but that the user didn't engage with (hard negative). - Step 2: hard-negative mining. Use BM25 + your current embedder to find docs that look similar to the positive but aren't relevant. Random negatives are too easy; hard negatives force the model to learn discriminating features. Critical step. - Step 3: loss. MultipleNegativesRankingLoss (an InfoNCE-style contrastive loss that treats the other positives in the batch as negatives) is the standard; it also accepts explicit hard negatives when you train on (query, positive, negative) triples. Either way, the loss pulls positives close and pushes negatives away. - Step 4: training. SentenceTransformers framework. LoRA on the encoder (1-5% of params) to minimize forgetting. Low LR (1e-5 to 5e-5), 1-3 epochs, batch size 32-128. - Step 5: eval. Held-out (query, doc) pairs. Recall@5, NDCG@10. Plus a generalization eval (MTEB subset or a separate held-out domain) to catch overfitting. - Step 6: deploy. Re-embed the full corpus with the new model. This is the expensive step (hours to days at typical embedding-API throughput), so plan accordingly. - Pitfalls: overfitting to a narrow domain (mitigate by mixing in general data), skipping hard-negative mining (random negatives produce weak models), and underestimating re-embed cost (budget storage for parallel old+new indexes during the transition). - Numbers to drop: "5-50k labeled triples", "hard-negative mining gives 5-15% NDCG@10 lift over random negatives", "re-embed 10M chunks: hours-days at typical API throughput"
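To make Step 3 concrete, here is the in-batch-negatives objective written out in plain Python, with no training framework involved. It assumes unit-normalized vectors (so the dot product is cosine similarity) and a similarity scale of 20, an assumed temperature-style constant; the toy 2-d vectors are invented.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i.

    Every other doc in the batch acts as a negative. Mined hard negatives can
    be appended after the positives (len(doc_vecs) > len(query_vecs)).
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * dot(q, d) for d in doc_vecs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its query
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the wrong query
hard_negative = [[0.8, 0.6]]              # close to query 0 but not its positive

print(in_batch_negatives_loss(queries, aligned_docs))
print(in_batch_negatives_loss(queries, swapped_docs))
print(in_batch_negatives_loss(queries, aligned_docs + hard_negative))
```

The third call shows why hard negatives matter: a negative that sits close to the query raises the loss even when the positive is perfectly aligned, so the gradient has something to work with.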

Common follow-ups: - "How would you mine hard negatives at scale?" - "When is fine-tuning not worth it?"

Traps: - Random negatives. Always do hard-negative mining. - Skipping the generalization eval.

Related cross-cutting: Retrieval, Fine-tuning vs alternatives Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/00_ai_foundation/06_adaptation_compression/

Q: "What's Matryoshka representation learning and why does it matter for production?"¶

Tags: senior · common · conceptual · source: OpenAI text-embedding-3 docs; standard senior embedding probe 2026

Answer outline: - Matryoshka embeddings are trained so that truncating the vector (keeping only the first D dimensions) preserves most of the retrieval quality. - Practical effect: store full-dimension embeddings (e.g., 3072d for text-embedding-3-large), search with truncated vectors (e.g., 512d) for speed, optionally re-rank the shortlist with full dimensions. Re-normalize truncated vectors before cosine or dot-product search. - Why it matters in production: lets you trade quality for speed and index size at query time without re-embedding the corpus. The full vectors stay in storage; the search index and queries get cheaper. - Use case: low-latency tier (chat) uses truncated; offline tier (analytics) uses full. - OpenAI text-embedding-3 models are Matryoshka-trained. text-embedding-3-large is 3072d, truncatable down to ~512d with minimal quality loss. - Trade-off: at aggressive truncation (D=128-256), quality drops noticeably. At D=512-1024, quality is typically within 1-3% of full-dim. - Numbers to drop: "text-embedding-3-large: 3072d full; OpenAI's published MTEB averages are 64.6 at 3072d, 64.1 at 1024d, and 62.0 at 256d", "index-size savings: linear with truncation; ~3-6× at typical levels"
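The "search truncated, rescore full" pattern looks like this in miniature. The sketch assumes the stored vectors come from a Matryoshka-trained model; the 4-d toy vectors and function names are invented, and a real system would precompute the truncated corpus vectors in the ANN index instead of truncating per query.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale back to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def two_stage_search(query, corpus, dims, shortlist, k):
    """Coarse pass on truncated vectors, then rescore the shortlist at full dims."""
    q_short = truncate_and_normalize(query, dims)
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: -dot(q_short, truncate_and_normalize(corpus[i], dims)),
    )[:shortlist]
    return sorted(coarse, key=lambda i: -dot(query, corpus[i]))[:k]

query = [0.6, 0.0, 0.8, 0.0]
corpus = [
    [0.6, 0.0, 0.8, 0.0],    # true best match
    [0.6, 0.0, -0.8, 0.0],   # identical to doc 0 in the first two dimensions
    [0.0, 1.0, 0.0, 0.0],
]
print(two_stage_search(query, corpus, dims=2, shortlist=2, k=1))
```

Docs 0 and 1 are indistinguishable at 2 dimensions, and only the full-dimension rescoring separates them. That is the reason to keep the full vectors around rather than storing the truncated copy alone.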

Common follow-ups: - "When does truncation hurt quality the most?" - "Why doesn't simple PCA achieve the same thing?"

Traps: - Truncating non-Matryoshka embeddings. Doesn't have the same property. - Storing truncated as your only copy. Then you can't recover full quality without re-embedding.

Related cross-cutting: Retrieval, Cost & latency Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/