Advanced RAG: Interview Questions
Where the senior-loop RAG questions live: query rewriting, hybrid scoring, graph RAG, agentic RAG, multi-modal, the "long-context killed RAG" debate. If rag-fundamentals.md covers the pipeline, this file covers the variants and the second-order optimizations.
Query rewriting & decomposition
Q: "What is query rewriting, and how does it improve retrieval?"
Tags: mid · very-common · conceptual · source: Kalyan RAG Hub Q25; Analytics Vidhya 40 Qs Q21
Answer outline:
- The user's literal query is rarely the best retrieval query. Rewriting normalizes vocabulary, adds context, and resolves pronouns.
- Three common forms: standalone-question rewrite (resolve pronouns and ellipsis from conversation), multi-query expansion (generate 3-5 paraphrases, retrieve for each, fuse via RRF), and decomposition (break a multi-part question into sub-queries).
- Use a small/fast LLM (Haiku, gpt-4o-mini); rewriting latency budget ~100-300ms.
- Always cache rewrites for repeated queries; at temperature 0 the output is near-deterministic, so a cached rewrite is a faithful stand-in for a fresh call.
- Numbers to drop: "Multi-query expansion with 3 paraphrases: ~5-12 points recall lift on conversational benchmarks; ~2× embedding cost."
Common follow-ups: - "When does rewriting hurt rather than help?" - "Pronoun resolution failures — how do you catch them?"
Traps: - Rewriting every query — adds latency and can dilute focused queries. - Conflating rewrite (for retrieval) with prompt-construction (for generation).
Related cross-cutting: Prompt chaining vs single-shot vs agent loop
Related module: learning/01_ai_engineering/09_advanced_rag_patterns/
Q: "Explain how HyDE works and when it helps."
Tags: mid · common · conceptual · source: Kalyan RAG Hub Q27; HyDE paper (Gao et al.)
Answer outline: - HyDE (Hypothetical Document Embeddings): use an LLM to generate a hypothetical answer to the query; embed that, not the query; retrieve. - Helps when query and document live in different lexical/syntactic spaces (short query, long answer-style document). - The hallucinated content doesn't matter — what matters is that the hypothetical embedding lands in the same neighborhood as real answer-shaped documents. - Cost: one extra LLM call per query (~150-300ms with a small model). - Doesn't help on factoid queries that already resemble document language; can hurt on rare-entity queries where the hallucinated answer drifts. - Numbers to drop: "HyDE on BEIR-zeroshot: average +3-5 points nDCG@10 over query embedding alone."
Common follow-ups: - "Why doesn't the hallucination in HyDE matter?" - "When does HyDE underperform?"
Traps: - Defending HyDE universally — it has clear failure modes. - Confusing HyDE (generate hypothetical answer for retrieval) with answer-generation.
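The HyDE flow described in this answer can be sketched in a few lines. This is a toy illustration under stated assumptions, not a reference implementation: `embed` is a bag-of-words stand-in for a real encoder, and `generate_hypothetical` is a hypothetical stub for the LLM call that drafts an answer-shaped passage.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_hypothetical(query: str) -> str:
    """Hypothetical stub for the LLM call; the canned text may be factually wrong."""
    canned = {
        "why is my build slow": "incremental compilation cache misses increase build time",
    }
    return canned.get(query, query)

def hyde_retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Embed the hypothetical answer, not the literal query.
    probe = embed(generate_hypothetical(query))
    return sorted(docs, key=lambda d: cosine(probe, embed(d)), reverse=True)[:k]

docs = [
    "cache misses in incremental compilation are a common cause of long build time",
    "my favorite slow cooker recipes",
]
top = hyde_retrieve("why is my build slow", docs)
```

With this toy encoder the literal query shares more tokens with the recipe document ("my", "slow") than with the build document, while the hypothetical answer lands next to the build document. That is the vocabulary-gap effect HyDE relies on.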
Q: "What is HyPE and how does it compare to HyDE?"
Tags: senior · occasional · conceptual · source: Kalyan RAG Hub Q28-29
Answer outline: - HyPE (Hypothetical Prompt Embedding): the document side gets enriched, not the query. At ingest time, generate hypothetical questions that each chunk could answer; embed those questions; store them as additional vectors pointing back to the chunk. - Retrieval at query time becomes question-to-question matching — usually a more natural similarity than question-to-answer. - Cost is paid offline (once per chunk) instead of per query → cheaper at runtime than HyDE. - Storage cost: 3-5× index size if generating multiple questions per chunk. - Numbers to drop: "HyPE at index time: ~$10-30 per 100k chunks with a small LLM; retrieval latency unchanged from baseline."
Common follow-ups: - "Why is HyPE cheaper at runtime than HyDE?" - "When would you combine HyPE with traditional embeddings?"
Traps: - Treating HyPE as a query-time technique — the work is at ingest. - Forgetting the storage cost of multiple question vectors per chunk.
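A minimal sketch of the ingest-time pattern described above, assuming a toy Jaccard `similarity` in place of embedding cosine and a hypothetical `generate_questions` stub in place of the LLM call. It shows the two properties the answer leans on: the index holds one entry per generated question, and query time is pure question-to-question matching.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().replace("?", "").split())

def similarity(a: str, b: str) -> float:
    """Toy Jaccard similarity (assumption: stands in for embedding cosine)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def generate_questions(chunk: str) -> list[str]:
    """Hypothetical stub for the ingest-time LLM call."""
    canned = {
        "Refunds are issued within 14 days of a return.": [
            "how long do refunds take?",
            "when will I get my money back?",
        ],
        "Standard shipping takes 3-5 business days.": [
            "how long does shipping take?",
        ],
    }
    return canned[chunk]

def build_index(chunks: list[str]) -> list[tuple[str, int]]:
    # One index entry per hypothetical question, each pointing back to its chunk.
    return [(q, i) for i, chunk in enumerate(chunks) for q in generate_questions(chunk)]

def search(index: list[tuple[str, int]], chunks: list[str], query: str) -> str:
    # Query time: match the user question against stored questions; no LLM call.
    _, chunk_id = max(index, key=lambda entry: similarity(query, entry[0]))
    return chunks[chunk_id]

chunks = [
    "Refunds are issued within 14 days of a return.",
    "Standard shipping takes 3-5 business days.",
]
index = build_index(chunks)
```

Note that `index` has three entries for two chunks, which is the storage multiplier the answer warns about.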
Q: "How do you handle multi-hop questions in RAG?"
Tags: senior · common · scenario · source: applied_ai_interview_focus.md; DataCamp Top 30 (production challenges)
Answer outline: - A multi-hop question requires assembling facts from multiple chunks (e.g., "Who is the CFO of the company that acquired Acme?"). - Single-shot retrieval fails — no single chunk has the answer; rerankers can't compose facts. - Decomposition pattern: LLM breaks the question into sub-questions; retrieve for each; assemble. Robust but adds latency. - Agentic/iterative pattern: the LLM decides which sub-question to ask next based on partial answers (Self-RAG, ReAct). Higher latency, handles harder queries. - Graph RAG pattern: index entities + relations as a graph; multi-hop becomes graph traversal. Best for relationship-heavy domains. - Numbers to drop: "Decomposition: ~2-4× the cost of single-shot RAG; faithfulness gains 10-25 points on multi-hop benchmarks like 2WikiMultiHopQA."
Common follow-ups: - "Decomposition vs agentic RAG — when each?" - "How does Graph RAG handle the same question?"
Traps: - Pretending single-shot retrieval can solve multi-hop. Interviewers test this. - Always reaching for agentic — decomposition is often enough and predictable.
Related cross-cutting: Agent vs chain
Q: "What are the pros and cons of query transformation techniques?"
Tags: mid · common · conceptual · source: Kalyan RAG Hub Q26
Answer outline: - Pros: lifts recall on poorly-formed queries; aligns query vocabulary with document vocabulary; handles ambiguity via multi-query expansion; resolves multi-turn conversation context. - Cons: adds 100-400ms latency per LLM call; adds cost; can introduce noise on already-good queries (over-rewriting effect); harder to debug — the retrieved set depends on the rewrite, not the user's literal words. - Always-on rewriting is wasteful; gate it by query classifier ("does this query need rewriting?") to spend the extra call only when likely to help. - Cache rewrites aggressively — exact-string + semantic cache both work. - Numbers to drop: "Gated rewriting: ~30-40% of queries hit the rewrite path; rest go through unchanged. Cost stays under ~$0.0002 added per query."
Common follow-ups: - "How do you know a query needs rewriting?" - "What's the failure mode when you over-rewrite?"
Traps: - Defending always-on rewriting without acknowledging the cost. - Skipping caching — same query rewritten 1000 times daily wastes calls.
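Gating plus caching, as argued above, might look like the sketch below. The gate rule (short or pronoun-bearing queries) is an illustrative assumption rather than a recommended classifier, and `rewrite` is a hypothetical stub for the small-model call; the `calls` counter exists only to make cache hits visible.

```python
from functools import lru_cache

PRONOUNS = {"it", "they", "them", "this", "that", "those", "he", "she"}
calls = {"rewrite": 0}

def needs_rewrite(query: str) -> bool:
    """Toy gate (assumption): rewrite only short or pronoun-bearing queries."""
    words = query.lower().strip("?").split()
    return len(words) < 4 or any(w in PRONOUNS for w in words)

@lru_cache(maxsize=10_000)
def rewrite(query: str) -> str:
    """Hypothetical stub for a small-model rewrite call; cached by exact string."""
    calls["rewrite"] += 1
    return f"[standalone] {query}"

def retrieval_query(query: str) -> str:
    # Normalize case and whitespace so trivially different strings share a cache key.
    normalized = " ".join(query.lower().split())
    return rewrite(normalized) if needs_rewrite(normalized) else normalized
```

A well-formed query passes through untouched; a pronoun-bearing follow-up pays for one rewrite, and repeats of it are served from the cache.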
Q: "To minimize RAG latency, which pre-retrieval enhancement do you choose?"
Tags: senior · occasional · scenario · source: Kalyan RAG Hub Q30
Answer outline:
- Order by added query-time latency: none ≈ HyPE (work paid offline) < cached rewrite < small-model rewrite < HyDE < multi-query expansion < decomposition < agentic.
- For tight latency budgets (<500ms P95 end-to-end), prefer HyPE: the cost is paid offline and nothing is added at runtime.
- Multi-query expansion can run its retrievals in parallel, so the retrieval latency is that of one retrieval, not N, but the cost is N×.
- Cache the rewrite/expansion output by query hash; cache hit rates are often 30-60% on production traffic.
- Numbers to drop: "HyPE: ~0ms added at query time. Small-model rewrite: 100-200ms. HyDE: 200-400ms. Decomposition (3 hops): 600-1500ms."
Common follow-ups: - "Show your latency budget for a 1-second P95 end-to-end RAG." - "When does the rewrite latency actually matter?"
Traps: - Ignoring offline vs runtime cost distinction. - Forgetting cache warm rates in latency math.
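The cache-rate point can be made concrete with a little arithmetic. This sketch uses a deliberately simple two-point model (every request is either a cache hit or a miss); the 1 ms hit latency is an assumption, and the miss latencies are midpoints of the ranges quoted above.

```python
def expected_added_ms(miss_ms: float, hit_rate: float, hit_ms: float = 1.0) -> float:
    """Mean added latency when a fraction `hit_rate` of queries is served from cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

def p95_added_ms(miss_ms: float, hit_rate: float, hit_ms: float = 1.0) -> float:
    """P95 under the two-point model: the slowest 5% are misses
    unless hits cover at least 95% of traffic."""
    return hit_ms if hit_rate >= 0.95 else miss_ms

for name, miss_ms in [("small-model rewrite", 150.0), ("HyDE", 300.0)]:
    mean = expected_added_ms(miss_ms, hit_rate=0.5)
    tail = p95_added_ms(miss_ms, hit_rate=0.5)
    print(f"{name}: mean +{mean:.1f} ms, P95 +{tail:.1f} ms")
```

Under this model a 50% hit rate halves the mean added latency but leaves P95 at the full miss latency, so the cache helps cost and average latency more than it helps a P95 budget.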
Hybrid scoring
Q: "Explain Reciprocal Rank Fusion (RRF) and why it's used for hybrid retrieval."
Tags: mid · very-common · conceptual · source: Kalyan RAG Hub Q51; Pinecone hybrid docs
Answer outline:
- Score-fusion problem: BM25 scores are unbounded and corpus-dependent (often roughly 0 to 20 in practice); dense returns cosine in [-1, 1]. You can't sum them directly.
- RRF sidesteps score scales: for each candidate d and retriever r, compute 1 / (k + rank_r(d)); sum across retrievers. Typical k = 60.
- Hyperparameter-free in practice — k=60 is robust across domains.
- Robust to score-distribution shifts when one retriever's distribution changes (e.g., embedding model upgrade).
- Available natively in Pinecone, Weaviate, Qdrant, Vespa, Elasticsearch.
- Numbers to drop: "RRF typical lift over best single retriever: 5-15 points nDCG@10 across enterprise QA benchmarks."
Common follow-ups: - "When does weighted sum beat RRF?" - "What happens when one retriever dominates the result set?"
Traps: - Mixing in absolute scores — RRF uses ranks for a reason. - Setting k too low (<20) — makes the fusion overly rank-sensitive.
Related cross-cutting: Sparse vs dense vs hybrid retrieval
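The fusion rule above, score(d) = Σ_r 1 / (k + rank_r(d)), fits in a few lines. A minimal sketch: the two input rankings are made-up document IDs, and ranks are 1-based.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; only ranks are used, never the retrievers' raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["d3", "d1", "d2"]   # illustrative sparse result order
dense_ranking = ["d1", "d4", "d3"]  # illustrative dense result order
fused = rrf([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `d1` (ranks 2 and 1) edges out `d3` (ranks 1 and 3), and documents seen by only one retriever fall to the bottom, so `fused_ids` is `["d1", "d3", "d4", "d2"]`.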
Q: "When does pure dense retrieval beat hybrid?"
Tags: senior · occasional · conceptual · source: Pinecone learning center; Weaviate hybrid blog
Answer outline: - Long-form paraphrastic queries on conceptual corpora where rare-entity coverage doesn't matter — dense alone often wins. - Domains where lexical overlap is misleading (medical synonyms, paraphrased customer complaints, multilingual data). - When BM25 isn't tuned for the corpus (no stopword list, no stemming, no field weighting) — hybrid carries broken sparse, hurting overall. - Rare case in 2026 — hybrid is the default for a reason. But if your corpus is conceptual prose and your queries are paraphrastic, dense-only can be cheaper and simpler. - Numbers to drop: "In practice, hybrid wins ~80% of corpora I've benchmarked; dense-only wins ~15%, sparse-only ~5%."
Common follow-ups: - "How do you know your sparse side is tuned?" - "Why is hybrid default in 2026?"
Traps: - Defending dense-only universally — interview gold for the interviewer. - Skipping the BM25 tuning question — most teams use defaults and don't know it.
Q: "Compare keyword-based retrieval and semantic retrieval. When do you use hybrid?"
Tags: mid · common · conceptual · source: Kalyan RAG Hub Q50, Q52
Answer outline: - Keyword (BM25): scores by token overlap with IDF weighting. Strong on exact tokens (SKUs, error codes, function names, dates); weak on synonyms ("revenue" vs "income"). - Semantic (dense): embedding similarity. Strong on paraphrase; weak on out-of-vocab entities and exact-match. - Hybrid: combine both. Use when your queries mix both regimes (most real product queries do). - Default decision rule: ship hybrid unless you have evidence one side is dead weight (no rare entities → pure dense possible; no paraphrasing → pure sparse possible). - Numbers to drop: "Migration from pure-dense to hybrid in support-ticket corpora: ~7-12 point lift in Context Precision@5 on rare-entity queries."
Common follow-ups: - "Walk me through what happens when a query has both a SKU and a paraphrase." - "How do you decide if sparse is dead weight in your corpus?"
Traps: - Saying "we use BM25 because it's faster": speed is not BM25's advantage, since ANN-backed dense retrieval is already fast; BM25 earns its place through exact-token matching. - Forgetting that hybrid is two indexes, not one; operational cost matters.
Q: "Compare general re-rankers and instruction-following re-rankers in RAG."
Tags: senior · occasional · conceptual · source: Kalyan RAG Hub Q64
Answer outline: - General re-rankers (cross-encoders): BGE-reranker, Cohere Rerank v3 — trained on generic relevance pairs. Score = is this chunk relevant to this query? - Instruction-following re-rankers: newer (RankZephyr, RankGPT, RankT5 with instructions) — take a natural-language instruction ("rerank for recency" or "prefer chunks about pricing") alongside query. Useful when relevance depends on task framing. - Trade-off: instruction-following are larger and slower (often LLM-based, ~50-200ms per pair); general are smaller and faster (~5-30ms per pair). - For most production RAG, general re-rankers are the right default. Instruction-following helps when "relevance" varies by user role / persona / task. - Numbers to drop: "Cohere Rerank v3 on top-50: ~30-80ms total at API; ~5ms self-hosted small variant. RankGPT on top-50: ~1-3s with a small LLM."
Common follow-ups: - "When would you swap Cohere Rerank for an LLM-based rerank?" - "Cost-quality math at 1M queries/day for each?"
Traps: - Treating all re-rankers as interchangeable. - Forgetting the latency cost of instruction-following at top-50.
Related cross-cutting: Cross-encoder rerank vs bigger LLM at generation
Re-ranking strategies
Q: "Why is the cross-encoder typically used as the re-ranker rather than the bi-encoder?"
Tags: mid · common · conceptual · source: Kalyan RAG Hub Q65
Answer outline:
- Bi-encoder processes query and doc independently; similarity is a dot product. Document vectors are pre-indexable, so retrieval is a sublinear ANN lookup. Loses query-document interaction.
- Cross-encoder processes query+doc jointly through transformer attention; scores their interaction directly. Much more accurate (4-10 points nDCG lift typical), but there is no pre-indexing, so it must score each pair at query time.
- Production pipeline: bi-encoder retrieves top-50 cheaply; cross-encoder scores those 50 pairs and reorders into top-5. Best of both.
- Reranker latency at top-50: ~30-100ms with BGE-reranker; ~50-200ms with Cohere Rerank API.
- Numbers to drop: "Bi-encoder retrieval: ~5-10ms over 1M chunks. Cross-encoder rerank on top-50: ~30-100ms. Combined: ~50-120ms P95."
Common follow-ups: - "When can you skip the cross-encoder entirely?" - "Where does ColBERT sit between the two?"
Traps: - Using a cross-encoder for first-pass retrieval — O(N) blowup. - Skipping rerank because "embeddings are good enough" — measure before you decide.
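The two-stage pipeline above can be sketched with toy scorers. Both scoring functions are assumptions for illustration: `cheap_score` stands in for bi-encoder similarity and `pair_score` for a cross-encoder forward pass. The `pair_calls` counter shows that the expensive scorer only runs on the first-stage candidates.

```python
pair_calls = {"n": 0}

def cheap_score(query: str, doc: str) -> int:
    """Toy first-stage score (assumption: stands in for bi-encoder similarity)."""
    return len(set(query.split()) & set(doc.split()))

def pair_score(query: str, doc: str) -> float:
    """Toy pairwise scorer (assumption: stands in for a cross-encoder pass).
    Unlike the first stage, it sees word order via the phrase check."""
    pair_calls["n"] += 1
    return cheap_score(query, doc) + (2.0 if query in doc else 0.0)

def retrieve_then_rerank(query: str, docs: list[str], first_stage_k: int = 3, final_k: int = 1) -> list[str]:
    # Stage 1: cheap score over the whole corpus, keep a small candidate set.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:first_stage_k]
    # Stage 2: expensive score over the candidates only.
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)[:final_k]

docs = [
    "reset password link expired",
    "password reset steps for admins",
    "how to reset your password",
    "billing address update",
    "reset router to factory settings",
]
best = retrieve_then_rerank("password reset", docs)
```

The first stage ties the three password documents and would return the first one; the pairwise scorer promotes "password reset steps for admins" while being called three times rather than once per document in the corpus.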
Q: "If a RAG system retrieves 20 candidates but only 5 fit in context, why does re-ranking matter?"
Tags: mid · common · scenario · source: Kalyan RAG Hub Q66
Answer outline: - Without re-ranking, you're trusting the bi-encoder's ranking — usually decent for top-50 recall but poor at top-5 precision. - The top-K cutoff means the order matters: the gold chunk at rank 8 is invisible to the generator. - Re-ranking moves likely-relevant chunks up; Context Precision@5 typically lifts by 10-25 points. - Without re-ranking, you'd need to either expand K (pollutes prompt with noise) or accept lower faithfulness. - Numbers to drop: "Top-5 without rerank vs top-5 with rerank: Context Precision@5 0.55 → 0.78 typical on enterprise QA."
Common follow-ups: - "What if you can't afford the rerank latency?" - "How would expanding K=20 in the prompt compare?"
Traps: - Expanding K to compensate for missing rerank — adds cost and noise. - Forgetting context budget — bigger K means more tokens shipped to the LLM.
Q: "In real-time RAG with strict latency, how do you reduce re-ranking overhead while preserving quality?"
Tags: senior · occasional · scenario · source: Kalyan RAG Hub Q72
Answer outline: - Smaller reranker model: BGE-reranker-base (~mini) instead of -large; ~3-5× faster, ~2-3 point precision hit. - Smaller candidate set: rerank top-20 not top-50 — limits the worst-case latency. - Distillation: distill a slower cross-encoder into a smaller student tuned for your domain. - Batched/quantized inference: INT8 quantization on the reranker; ~2× speedup, minimal quality loss. - Selective reranking: classify if the query "needs rerank"; skip for high-confidence top-K from the bi-encoder. - Caching: cache (query, top-5) pairs by query hash for repeated queries. - Numbers to drop: "Selective rerank gate: ~40-60% of queries bypass rerank, saving ~30-50ms P95 with minimal quality loss."
Common follow-ups: - "How do you decide a query 'needs rerank'?" - "ColBERT-as-reranker — where does it fit?"
Traps: - Skipping rerank entirely for latency. - Quantizing without measuring per-domain quality hit.
Graph RAG
Q: "When does Graph RAG beat vector RAG?"
Tags: senior · common · scenario · source: Neo4j Agentic RAG blog; Pipeline vs Agentic vs Graph RAG (Lanham)
Answer outline: - Graph RAG wins on multi-hop relationship queries ("Which engineers reporting to managers in the EU work on projects funded by Stripe?") — graph traversal answers what vector similarity can't compose. - Wins on cross-document reasoning where the answer requires combining facts from many docs into a coherent chain. - Wins on explainability — the traversal path is the citation chain. - Loses on conceptual/semantic queries where the answer is in prose and there's no useful graph structure to extract. - Hybrid (vector + graph) is increasingly common: vector for retrieval breadth, graph for hop expansion. - Numbers to drop: "Microsoft GraphRAG benchmark: ~20-50 point lift on multi-hop questions vs naive vector RAG on the same corpus; ~5-10× ingest cost."
Common follow-ups: - "What's the ingest cost difference?" - "What does the graph schema look like in your domain?"
Traps: - Adopting Graph RAG when the domain is conceptual prose — pays the cost without the win. - Confusing Graph RAG with vector RAG that uses metadata filters.
Related module: learning/01_ai_engineering/10_knowledge_graph_retrieval/
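To see why traversal composes facts that similarity search cannot, here is a minimal sketch of a two-hop query over a triple store, using the "CFO of the company that acquired Acme" style of question. All entity names and relation labels are made up for illustration, and a plain list of triples stands in for a graph database.

```python
Triple = tuple[str, str, str]

# Made-up facts; in a real system each could come from a different document.
triples: list[Triple] = [
    ("Globex", "acquired", "Acme"),
    ("Globex", "has_cfo", "Dana Reyes"),
    ("Initech", "has_cfo", "Sam Lee"),
]

def subjects_of(facts: list[Triple], relation: str, obj: str) -> list[str]:
    return [s for s, r, o in facts if r == relation and o == obj]

def objects_of(facts: list[Triple], subject: str, relation: str) -> list[str]:
    return [o for s, r, o in facts if s == subject and r == relation]

def cfo_of_acquirer(facts: list[Triple], target: str) -> list[list[Triple]]:
    """Hop 1: who acquired `target`? Hop 2: who is that company's CFO?
    Returns the full path so it can double as the citation chain."""
    paths = []
    for acquirer in subjects_of(facts, "acquired", target):
        for cfo in objects_of(facts, acquirer, "has_cfo"):
            paths.append([(acquirer, "acquired", target), (acquirer, "has_cfo", cfo)])
    return paths

paths = cfo_of_acquirer(triples, "Acme")
```

No single triple contains the answer; it only appears by joining two of them, and the returned path is the explanation.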
Q: "How is a knowledge graph constructed from unstructured text for Graph RAG?"
Tags: senior · occasional · design · source: Microsoft GraphRAG paper; Neo4j blog
Answer outline: - Entity extraction: an LLM passes over each chunk and extracts (entity, type) tuples. Domain-specific schemas (Drug, Disease, Symptom) or open-domain. - Relation extraction: the same pass extracts (subject, relation, object) triples; canonicalize entities via embedding similarity + LLM dedup. - Community detection: group densely connected sub-graphs (Leiden algorithm); summarize each community for high-level retrieval. - Query time: entity-linking the question → graph traversal → return subgraph + community summaries to the LLM. - Ingest is expensive — full corpus pass with an LLM. Worth it for relationship-heavy domains. - Numbers to drop: "Microsoft GraphRAG paper: ~$10-50 per 1M tokens of ingest at frontier-model prices; ~10× the cost of vector-RAG ingest."
Common follow-ups: - "Entity canonicalization — how do you stop 'NYC' and 'New York City' from being two nodes?" - "What's the role of community summaries?"
Traps: - Underestimating ingest cost. - Skipping entity canonicalization — kills traversal quality.
Q: "What are the trade-offs of Graph RAG versus traditional vector RAG?"
Tags: senior · common · conceptual · source: MarsDevs 2026 Guide; MachineLearningMastery agentic RAG
Answer outline: - Strengths: multi-hop, explainability (the path is the citation), cross-document consistency, schema-enforced relationships. - Weaknesses: expensive ingest (LLM pass per chunk), brittle to schema changes, harder to update incrementally, can't recover from extraction errors without reprocessing. - Operational: graph DB (Neo4j, FalkorDB, Memgraph) adds a new system to operate; vector store doesn't go away. - Use case fit: legal compliance graphs (regulations → controls → evidence), medical (drug-disease-symptom), enterprise org/project graphs. - Numbers to drop: "Graph ingest typical cost on 1M-token corpus: ~$50-200 with mid-tier model; query latency adds ~50-200ms for traversal."
Common follow-ups: - "How do you handle schema evolution?" - "What if extraction misses an entity — how do you find that out?"
Traps: - Treating Graph RAG as a vector-RAG replacement rather than a complement. - Underestimating the ops burden of running a graph DB.
Agentic RAG
Q: "How does Agentic RAG differ architecturally from classical single-pass RAG?"
Tags: senior · very-common · conceptual · source: Analytics Vidhya 40 Qs Q39; Agentic RAG survey (arxiv 2501.09136)
Answer outline: - Classical RAG: linear pipeline — one retrieval, one generation. Predictable cost and latency. - Agentic RAG: an LLM-controlled loop that decides when to retrieve, what to retrieve, whether the retrieval was sufficient, whether to refine the query, when to stop. Tools = retrievers, not just one. - Common patterns: Self-RAG (model decides per-claim whether to retrieve and self-critiques output), Corrective RAG (CRAG — evaluator decides if retrieval was good; triggers correction if not), multi-step ReAct over retrieval tools. - The shift: retrieval becomes a tool the agent calls, not a fixed pipeline stage. - Numbers to drop: "Agentic RAG cost: typically 2-5× single-pass cost (multiple LLM calls); latency 2-4× (sequential reasoning + retrievals); answer quality lift on hard queries: 10-30 points."
Common follow-ups: - "What stops it from looping forever?" - "When is the agentic overhead worth it?"
Traps: - Confusing agentic RAG with multi-query expansion — agentic decides dynamically; expansion is fixed at query time. - Skipping budget/stopping rules in your design.
Related cross-cutting: Prompt chaining vs single-shot vs agent loop
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What new trade-offs does Agentic RAG introduce in cost, latency, and control?"
Tags: senior · common · scenario · source: Analytics Vidhya 40 Qs Q40; Agentic RAG survey
Answer outline: - Cost: multiplies. Each iteration is another LLM call + retrieval. Budget caps (max iterations, max tokens) are non-negotiable. - Latency: sequential reasoning adds time; parallel tool calls help where the agent picks them. - Determinism: lower. Two runs of the same query can take different paths. Bad for caching and SLAs. - Observability: non-negotiable. Each step is a span; without traces, debugging is impossible. - Eval shifts from "is the answer right" to "is the trajectory reasonable + the answer right". Trajectory evals are needed. - Numbers to drop: "Set max iterations 3-5 for production; hard cost cap (~$0.10-0.50 per query). Beyond that, fall back to classical RAG path."
Common follow-ups: - "What's your kill switch when an agent loops?" - "Cost cap math at 1M queries/day."
Traps: - No budget caps. Agents run until they break. - No trajectory eval. The answer can be right by accident.
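The budget caps and fallback described above could be enforced with a loop like this one. It is a sketch under assumptions: `step` and `fallback` are hypothetical callables standing in for one agent iteration and the classical RAG path, and the cap values are picked from the ranges quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Budget:
    max_iterations: int = 4
    max_cost_usd: float = 0.25
    spent_usd: float = 0.0
    trace: list = field(default_factory=list)  # one entry per step, for trajectory evals

def agentic_answer(
    query: str,
    step: Callable[[str, int], tuple[Optional[str], float]],
    fallback: Callable[[str], str],
    budget: Optional[Budget] = None,
) -> tuple[str, Budget]:
    """Run `step` until it yields an answer or a cap is hit, then fall back."""
    budget = budget or Budget()
    for i in range(budget.max_iterations):
        answer, cost = step(query, i)
        budget.spent_usd += cost
        budget.trace.append((i, cost, answer is not None))
        if answer is not None:
            return answer, budget
        if budget.spent_usd >= budget.max_cost_usd:
            break  # cost cap reached; stop before the iteration cap
    return fallback(query), budget

def never_done(query: str, i: int) -> tuple[Optional[str], float]:
    return None, 0.10  # hypothetical agent step that keeps asking for more retrieval

result, used = agentic_answer("q", never_done, fallback=lambda q: "classical RAG answer")
```

Because the cap is checked after each step, spend can overshoot the cap by at most one step's cost; a stricter design would estimate the cost before calling `step`.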
Q: "Explain Self-RAG and when you would use it."
Tags: senior · common · conceptual · source: Self-RAG paper (Asai et al., 2023); Self-Corrective Agentic RAG (Ramasamy)
Answer outline: - Self-RAG trains the model to emit special tokens that decide: (a) should I retrieve right now? (b) is the retrieved passage relevant? (c) is my generated claim supported by it? - The model self-critiques per generated segment; segments deemed unsupported can be revised or dropped. - Use when generation quality is the bottleneck — common in fact-heavy domains where hallucinations are critical to catch. - Requires fine-tuning or a base model that supports these reflection tokens; not zero-shot on every LLM. - Numbers to drop: "Self-RAG paper: ~5-10 point lift on long-form factual QA over standard RAG; ~1.5-2× generation cost."
Common follow-ups: - "How does Self-RAG compare to a separate verifier pass?" - "When is the fine-tuning requirement a deal-breaker?"
Traps: - Conflating Self-RAG with self-consistency (different technique). - Promising Self-RAG without acknowledging the fine-tuning requirement.
Q: "What is Corrective RAG (CRAG) and how does it differ from Self-RAG?"
Tags: senior · common · conceptual · source: CRAG paper (Yan et al.); MachineLearningMastery agentic RAG
Answer outline: - CRAG adds an evaluator between retrieval and generation: classify each retrieved chunk as Correct, Ambiguous, or Incorrect. - If Correct: proceed normally. If Incorrect: trigger a web search or alternative retrieval. If Ambiguous: do both and let the LLM reconcile. - Differs from Self-RAG: CRAG decides at retrieval evaluation; Self-RAG decides during generation. CRAG is closer to a pluggable evaluator; Self-RAG requires model behavior changes. - Often combined with query rewriting on the "Incorrect" branch. - Numbers to drop: "CRAG paper: ~3-7 points faithfulness lift on out-of-domain queries by triggering external search."
Common follow-ups: - "How does the evaluator decide Correct vs Incorrect?" - "Cost picture when the Incorrect branch triggers a web search?"
Traps: - Treating the evaluator as free — it's an extra LLM call per retrieval. - Confusing the two approaches.
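The three-way branch above reduces to a small router. In this sketch the evaluator is assumed to emit a single confidence score per retrieval, the thresholds 0.7 and 0.3 are arbitrary illustrative values, and `fallback_search` is a hypothetical callable standing in for web search or an alternative retriever.

```python
from typing import Callable

def grade(score: float, upper: float = 0.7, lower: float = 0.3) -> str:
    """Map an evaluator confidence score to a label (thresholds are assumptions)."""
    if score >= upper:
        return "correct"
    if score < lower:
        return "incorrect"
    return "ambiguous"

def select_context(
    score: float,
    retrieved: list[str],
    fallback_search: Callable[[], list[str]],
) -> tuple[str, list[str]]:
    label = grade(score)
    if label == "correct":
        return label, retrieved                   # proceed normally
    if label == "incorrect":
        return label, fallback_search()           # discard retrieval, search elsewhere
    return label, retrieved + fallback_search()   # keep both; the LLM reconciles

web = lambda: ["web result"]
```

Only the "incorrect" and "ambiguous" branches pay for the fallback call, which is where the extra cost in the follow-up question comes from.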
Q: "Design an agentic RAG system for a legal research assistant."
Tags: senior · common · design · source: synthesized from multiple loops; Agentic RAG survey
Answer outline: - Tools: case law retrieval (per jurisdiction), statute retrieval, internal memo retrieval, citation expander (Bluebook), recency filter. - Planner step: LLM decomposes "What's the precedent on X in California?" into sub-queries: identify the cause of action, retrieve California case law, retrieve federal precedent for context, retrieve related secondary sources. - Loop: for each sub-question, retrieve → evaluate sufficiency → optionally rewrite query → re-retrieve. Max iterations: 5. - Synthesis: the agent composes the final brief with mandatory citations; a verifier pass checks every cited case actually exists (citation hallucination is a known failure mode in legal LLMs). - Approval gate: any draft going to a client is reviewed by a human attorney. - Cost cap: ~$1-3 per query; queries are high-value (lawyer-hours saved). - Numbers to drop: "Real legal AI products report ~30-50% lawyer-hour savings; cost cap typically 1-2% of attorney's hourly rate."
Common follow-ups: - "How do you handle citation hallucination?" - "Where does Graph RAG fit (precedent chains)?"
Traps: - No verifier pass on citations — known failure mode (Mata v. Avianca). - Letting the agent draft client-facing output without human review.
Multi-modal RAG
Q: "How do you build a multi-modal RAG system that indexes images alongside text?"
Tags: senior · occasional · design · source: LlamaIndex multi-modal blog; LangChain multi-modal RAG
Answer outline: - Approach A — separate indexes: text embeddings (BGE) and image embeddings (CLIP) live in two indexes. Query is encoded by both; results fuse via RRF. - Approach B — joint embedding (CLIP-style): text and image share the same embedding space. Single index, cross-modal retrieval. - Approach C — caption-then-embed: caption each image with a vision-LLM at ingest time; store the caption-as-text in the same text index. Cheap and surprisingly competitive. - For documents: PDFs with embedded images — extract images with OCR + captioning; index alongside the surrounding text chunks. - Generation: ship retrieved images to a multi-modal LLM (GPT-4o, Claude 3.5 Sonnet, Gemini); generation cost is higher per image. - Numbers to drop: "Caption-then-embed at ingest: ~$0.001-0.005 per image with small VLM. Storage cost dominated by raw image storage, not embeddings."
Common follow-ups: - "When does caption-then-embed beat CLIP?" - "Cost of multi-modal generation vs text-only?"
Traps: - Defending CLIP as universally better — caption-then-embed wins on cost and on text-heavy queries. - Forgetting image storage costs.
Q: "How do you index tables and figures from PDFs for RAG retrieval?"
Tags: senior · occasional · scenario · source: LlamaIndex blog on PDF parsing; Unstructured.io docs
Answer outline:
- Tables: use a table-aware parser (Camelot, Azure Document Intelligence, Unstructured). Each table row → templated sentence ("Customer Acme paid $50k in Q3 2024"); embed the sentence. Keep the source table as metadata for citation.
- Figures: caption with a VLM; embed the caption-as-text. Optionally also store an image embedding for image-similarity queries.
- Preserve page_number, bbox, figure_id in metadata so the LLM can show "see Figure 3 on page 14".
- For numerical tables, consider hybrid retrieval — exact-match on the table cells helps with "What was Q3 revenue?" queries.
- Numbers to drop: "Switching from naive PDF-to-text to table-aware parsing: ~25-40 percentage point lift in Context Precision on financial QA (LlamaIndex 2025 benchmark)."
Common follow-ups: - "What if a single PDF has 100 tables — how do you store them?" - "How do you handle multi-page tables?"
Traps: - Flattening tables to prose — destroys structure and exact-match. - Skipping bbox metadata — citations can't point to the right region.
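The row-to-sentence step can be illustrated as follows. This is a sketch, assuming the table has already been parsed into dictionaries; the column names, `table_id`, and bounding-box values are hypothetical. The embedded text is the templated sentence, while the original row and its location travel along as metadata so structure and citations are not lost.

```python
def row_to_sentence(row: dict) -> str:
    """Template one parsed table row into an embeddable sentence."""
    return (
        f"Customer {row['customer']} paid ${row['amount_usd'] // 1000}k "
        f"in {row['quarter']} {row['year']}"
    )

def to_record(row: dict, table_id: str, page_number: int, bbox: tuple) -> dict:
    return {
        "text": row_to_sentence(row),   # what gets embedded
        "metadata": {
            "table_id": table_id,
            "page_number": page_number,
            "bbox": bbox,               # region to highlight when citing
            "source_row": row,          # structured values kept for exact-match lookups
        },
    }

row = {"customer": "Acme", "amount_usd": 50_000, "quarter": "Q3", "year": 2024}
record = to_record(row, table_id="t1", page_number=14, bbox=(72, 310, 540, 332))
```

Keeping `source_row` alongside the sentence is what separates this from the "flattening tables to prose" trap: the prose is only the retrieval handle, not the stored truth.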
Long-context vs RAG
Q: "Is RAG still relevant in the era of long-context LLMs (1M+ tokens)?"
Tags: senior · very-common · conceptual · source: Kalyan RAG Hub Q2
Answer outline:
- Yes, for several reasons even when context is "free":
    - Lost-in-the-middle: even long-context models (Gemini 1.5 Pro at 1M+ tokens, Claude 3.5 at 200k) show a recall drop for facts in the middle of long contexts. RAG focuses attention.
    - Cost: 1M tokens at 2026 prices is still ~$0.50-2.50 per query. RAG ships ~5k tokens; that's 100-500× cheaper.
    - Latency: TTFT scales with context. 1M-context queries take 5-30s; RAG keeps it under 2s.
    - Freshness/updates: long context doesn't solve "the doc just changed". Reindexing one document is faster than rebuilding and re-sending the whole context.
    - Multi-tenant: you can't put 1M tokens per tenant in every call; RAG is per-tenant data access by construction.
- Numbers to drop: "Lost-in-the-middle on Gemini 1.5: ~20% recall drop for facts at the middle of a 750k-token context vs the edges."
Common follow-ups: - "When would long-context replace RAG?" - "Hybrid long-context + RAG — when?"
Traps: - Saying RAG is obsolete — interview red flag. - Ignoring cost/latency math in the defense.
Related cross-cutting: RAG vs fine-tune vs prompt engineering
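The cost argument is easy to check with arithmetic. In this sketch the per-million-token prices are assumptions chosen to reproduce the per-query range quoted above; the token counts (1M for long-context, ~5k for RAG) come from the answer.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    return tokens / 1_000_000 * usd_per_million

LONG_CONTEXT_TOKENS = 1_000_000
RAG_TOKENS = 5_000

for price in (0.50, 2.50):  # assumed input prices in USD per 1M tokens
    long_cost = input_cost_usd(LONG_CONTEXT_TOKENS, price)
    rag_cost = input_cost_usd(RAG_TOKENS, price)
    print(f"${price:.2f}/M: long-context ${long_cost:.2f}, RAG ${rag_cost:.4f}, "
          f"ratio {long_cost / rag_cost:.0f}x")
```

The ratio is 200× at any price because it depends only on the token counts (1,000,000 / 5,000), which sits inside the 100-500× range; shipping more or fewer retrieved tokens moves it within that band.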
Q: "When would a long-context LLM replace your RAG pipeline?"
Tags: senior · common · scenario · source: synthesized from 2026 production posts
Answer outline: - The corpus is small and static (≤ 100k tokens): stuff the whole thing every query, cache aggressively. - The query requires holistic understanding of the entire corpus (summarize this 200-page report) where chunking would lose narrative arc. - Latency budget is loose and cost is not the bottleneck (internal research tools, not customer-facing). - The team can't operate a vector store / index pipeline (early-stage prototypes). - For most production-grade applications with growing corpora, RAG still wins on cost, freshness, and operational maturity. - Numbers to drop: "Tipping point: corpus < 50k tokens, low query volume (<100/day) — long-context can be cheaper than running RAG infra."
Common follow-ups: - "Show your cost crossover math." - "What stops you from sharding the long-context approach?"
Traps: - Dismissing long-context entirely. - Forgetting context caching — most providers offer prompt caching that makes long-context feasible for some workloads.
Q: "What is Cache-Augmented Generation (CAG) and when do you prefer it over RAG?"
Tags: senior · occasional · conceptual · source: DataCamp Top 30 Q24
Answer outline:
- CAG: pre-load the entire (small) corpus into the model's context using prompt caching; no retrieval at query time. The "cache" is the cached prompt prefix.
- Cost model: pay full price for the prefix when the cache is first written; subsequent queries pay a discounted rate for the cached prefix plus full price for the new tokens (Anthropic bills cache reads at 10% of the base input price; OpenAI also discounts cached input).
- Wins when: corpus is small (≤ 100-200k tokens), low update rate, query latency budget is tight (no retrieval round-trip).
- Loses when: corpus grows, updates frequently, multi-tenant (cache per tenant).
- Numbers to drop: "Anthropic prompt caching: 90% cost reduction for cached portion, 5-min TTL on cache. Re-prime cost: full prompt billed once per refresh."
Common follow-ups: - "What's the operational difference between CAG and just stuffing context?" - "When does the cache TTL bite you?"
Traps: - Treating CAG as a one-size-fits-all alternative. - Forgetting cache TTLs — costs explode if refresh rate exceeds usage rate.
Related cross-cutting: Exact-match cache vs semantic cache vs prefix cache
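The TTL trap can be quantified with a simplified cost model. The assumptions here are loud ones: a cache hit bills the prefix at 10% of the base input price, a miss bills it at full price (any cache-write surcharge is ignored), and the $3 per million tokens price and token counts are illustrative.

```python
def query_cost_usd(
    prefix_tokens: int,
    new_tokens: int,
    usd_per_million: float,
    cache_hit: bool,
    cached_read_fraction: float = 0.10,
) -> float:
    """Input cost of one query: cached prefix at a discount, new tokens at full price."""
    prefix_rate = cached_read_fraction if cache_hit else 1.0
    return (prefix_tokens * prefix_rate + new_tokens) * usd_per_million / 1_000_000

def expected_cost_usd(hit_rate: float, prefix_tokens: int, new_tokens: int, usd_per_million: float) -> float:
    hit = query_cost_usd(prefix_tokens, new_tokens, usd_per_million, cache_hit=True)
    miss = query_cost_usd(prefix_tokens, new_tokens, usd_per_million, cache_hit=False)
    return hit_rate * hit + (1.0 - hit_rate) * miss

# 100k-token corpus prefix, 500-token question, assumed $3 per 1M input tokens.
busy = expected_cost_usd(0.99, 100_000, 500, 3.0)   # traffic keeps the cache warm
quiet = expected_cost_usd(0.10, 100_000, 500, 3.0)  # most queries arrive after the TTL expired
```

When queries arrive less often than the TTL, most of them re-prime the cache and the per-query cost approaches the uncached price, which is the "refresh rate exceeds usage rate" failure named in the traps.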
Q: "Compare reasoning vs non-reasoning LLMs for RAG systems."
Tags: senior · common · conceptual · source: Kalyan RAG Hub Q22
Answer outline:
- Non-reasoning (gpt-4o, Claude 3.5 Sonnet): fast, predictable cost; good for direct extraction from clear context.
- Reasoning (o-series, Claude opus thinking, R1): slower, more expensive; better for multi-hop synthesis, conflict resolution, complex multi-step questions across multiple chunks.
- Decision rule by query class: simple lookup → non-reasoning; multi-hop / complex reasoning / contradiction handling → reasoning models.
- Cost: reasoning models often 5-30× the cost-per-token; latency can be 5-60s depending on thinking budget.
- Hybrid pattern: route by query classifier; default to non-reasoning, escalate to reasoning on hard queries.
- Numbers to drop: "Cost ratio at 2026 prices: gpt-4o ~$0.0025/1k output, o1 ~$0.06/1k output; ~24× differential."
Common follow-ups: - "How do you build the query classifier?" - "When is the latency of a reasoning model fatal?"
Traps: - Defaulting to reasoning models — overkill for most factoid RAG. - Ignoring per-query cost math.
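The hybrid routing pattern above can be sketched with a deliberately naive heuristic classifier. The cue list, the two-source threshold, and the function names are illustrative assumptions; a production router would more likely be a trained classifier or a small LLM.

```python
import re

# Hypothetical lexical cues for multi-hop or contradiction-style questions.
MULTI_HOP_CUES = re.compile(
    r"\b(compare|versus|vs|why|reconcile|contradict\w*|difference between)\b",
    re.IGNORECASE,
)


def route_model(query: str, n_retrieved_sources: int) -> str:
    """Default to the cheap tier; escalate only when the query looks multi-hop."""
    has_cue = MULTI_HOP_CUES.search(query) is not None
    spans_sources = n_retrieved_sources >= 2
    return "reasoning" if (has_cue and spans_sources) else "non-reasoning"


def blended_cost_per_1k(escalation_rate: float, cheap: float, expensive: float) -> float:
    """Expected cost per 1k output tokens for a given escalation rate."""
    return (1 - escalation_rate) * cheap + escalation_rate * expensive
```

Using the per-1k figures quoted in the outline (0.0025 and 0.06) and a 10% escalation rate, `blended_cost_per_1k(0.10, 0.0025, 0.06)` comes to roughly 0.00825, about 3.3× the non-reasoning price rather than 24×. That is the per-query cost math the second trap refers to.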
Q: "What is context engineering, and is it the same as RAG?"¶
Tags: senior · occasional · conceptual · source: synthesized from 2026 production posts; Adil Shamim
Answer outline: - Context engineering is the broader discipline of choosing what information the LLM sees per query. RAG is one tool inside it. - Other context-engineering moves: tool definitions, few-shot examples, persona/system prompts, memory summaries, conversation history. - Context engineering trade-offs: budget allocation across (retrieved chunks, examples, history, tools, instructions). RAG is just the "retrieved chunks" slot. - For senior loops, the question tests whether you treat the LLM context as a finite resource to budget, not as infinite. - Numbers to drop: "Typical 8k-context budget split: ~60-70% retrieved chunks, ~10-15% history/memory, ~10% system + instructions, ~5-10% examples."
Common follow-ups: - "What gets cut first when the budget gets tight?" - "How do you measure if your context engineering is good?"
Traps: - Conflating context engineering with prompt engineering. - Treating context as infinite.
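A minimal sketch of treating the context window as a budget. The slot names and percentage split are assumptions chosen to fall inside the ranges quoted above, and the whitespace token counter is a stand-in for a real tokenizer.

```python
def allocate_budget(total_tokens: int, shares_pct: dict) -> dict:
    """Split a context window across slots using integer percentages."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {slot: total_tokens * pct // 100 for slot, pct in shares_pct.items()}


def pack_chunks(chunks, budget, count_tokens=lambda s: len(s.split())):
    """Greedily keep chunks (assumed already in relevance order) within budget."""
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # lower-ranked chunks are the first thing cut
        packed.append(chunk)
        used += cost
    return packed


# Hypothetical split for an 8k window.
SHARES = {"retrieved_chunks": 65, "history": 15, "system": 10, "examples": 10}
```

`allocate_budget(8000, SHARES)` gives the retrieved-chunks slot 5,200 tokens; `pack_chunks` then enforces that cap by dropping the lowest-ranked chunks first, which is one possible answer to "what gets cut first".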
Q: "How do you handle ambiguous queries in agentic RAG that needs user clarification?"¶
Tags: senior · occasional · scenario · source: Analytics Vidhya 40 Qs Q23
Answer outline: - Detection: confidence signals from the retriever (low top-K scores), the LLM ("I'm not sure which X you mean"), or an intent classifier. - Three response strategies: - Ask: the agent asks a single clarifying question; UX cost is one extra turn. - Disambiguate: retrieve for each plausible interpretation; present a "did you mean X or Y?" with citations. - Best-guess + flag: answer the most likely interpretation; surface "answered for interpretation X" so the user can correct. - For low-stakes, "best-guess + flag" wins on UX. For high-stakes (legal, medical), "ask" wins. - The choice is product, not architecture — but defend it with metric numbers. - Numbers to drop: "Clarifying-question rate target: <8% of queries on a mature production assistant. Higher → underlying retrieval or classifier is broken."
Common follow-ups: - "What's the threshold for triggering clarification?" - "How do you avoid annoying users with too many clarifications?"
Traps: - Always-clarify mode — destroys UX. - Never-clarify mode — silently wrong answers on ambiguous queries.
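One way to make the clarification threshold concrete is a small decision function over retriever scores. The thresholds (`min_top`, `min_margin`) and the two-way split between "ask" and "best-guess + flag" are illustrative assumptions; the "disambiguate" strategy is left out to keep the sketch short.

```python
def clarification_strategy(scores, high_stakes, min_top=0.5, min_margin=0.1):
    """Pick a response strategy from per-interpretation retrieval scores.

    scores: best retriever score for each candidate interpretation.
    Returns "answer", "ask", or "best_guess_and_flag".
    """
    if not scores:
        return "ask"  # nothing retrieved: no basis for a guess
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    margin = top - ranked[1] if len(ranked) > 1 else top
    ambiguous = top < min_top or margin < min_margin
    if not ambiguous:
        return "answer"
    return "ask" if high_stakes else "best_guess_and_flag"
```

Logging how often each branch fires gives the clarifying-question rate directly, so the thresholds can be tuned against the target quoted above instead of being set by feel.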
Q: "What are common failure modes of advanced RAG in production?"¶
Tags: senior · common · debugging · source: Analytics Vidhya 40 Qs Q30
Answer outline: - Stale index after embedding-model upgrade: old embeddings + new query embeddings → silent recall collapse. - Re-ranker drift: new domain data the re-ranker wasn't trained on; quality regresses on a slice you didn't measure. - Prompt injection from retrieved docs: poisoned doc with "ignore previous instructions" leaks via retrieval. - Agentic loop runaway: no budget caps; one bad query exhausts the LLM rate limit. - Multi-tenancy ACL leakage: a chunk from tenant A retrieved for tenant B because filters missed. - Lost-in-the-middle on long contexts: the gold chunk is in the prompt but the model ignores it. - Citation hallucination: the LLM cites a chunk ID that exists but doesn't support the claim. - Numbers to drop: "Real production audit: ~30-40% of incidents trace to stale index or missing ACL filter; ~20% to prompt-injection from poisoned content."
Common follow-ups: - "Which of these have you debugged? Walk me through it." - "Prompt injection from retrieved content — your defense?"
Traps: - Focusing on retrieval failures only; ignoring agentic-loop and ACL failures. - No mention of prompt injection from retrieved content — it's a 2026 hot topic.
Related cross-cutting: Input vs output guardrails
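Two of the failure modes above (stale index after an embedding-model upgrade, and ACL leakage) can be caught by a cheap post-retrieval guard. This is a defense-in-depth sketch, assuming each chunk stores its tenant and the embedding model that produced its vector; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    tenant_id: str
    embedding_model: str  # model version that produced the stored vector


def guard_results(chunks, tenant_id, query_embedding_model):
    """Return (safe_chunks, violations); anything suspicious is dropped."""
    safe, violations = [], []
    for chunk in chunks:
        if chunk.tenant_id != tenant_id:
            violations.append((chunk.chunk_id, "acl_leak"))
        elif chunk.embedding_model != query_embedding_model:
            violations.append((chunk.chunk_id, "stale_embedding"))
        else:
            safe.append(chunk)
    return safe, violations
```

The guard does not replace the pre-filter in the vector store. Its value is that a non-empty `violations` list is an alertable signal that the filter or the re-embedding job has failed.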
Q: "Walk me through a multi-stage retrieval strategy."¶
Tags: senior · common · design · source: Analytics Vidhya 40 Qs Q32
Answer outline: - Stage 1 — broad recall: hybrid (BM25 + dense) over the full corpus, top-100. Cheap, high recall. - Stage 2 — cross-encoder rerank: re-rank top-100 → top-20. Adds precision. - Stage 3 — diversity/MMR: select top-5 from top-20 with maximal marginal relevance to avoid redundancy. - Stage 4 — context compression (optional): for verbose chunks, an LLM extracts only the relevant sentences before assembling the prompt. - Each stage has its own cost/latency budget — Stage 1 is dominant in volume; Stage 4 is the most expensive per chunk. - Numbers to drop: "Multi-stage typical: stage 1 ~10ms, stage 2 ~50-100ms, stage 3 ~5ms, stage 4 (LLM compression) ~200-500ms per chunk. Skip 4 unless answer-token budget is tight."
Common follow-ups: - "Where does query rewriting fit?" - "When do you skip stage 4?"
Traps: - Skipping diversity — top-K becomes near-duplicates. - Putting LLM compression too early — should be after rerank.
Related cross-cutting: Cross-encoder rerank vs bigger LLM at generation
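The fusion and diversity steps of the pipeline above can be sketched in pure Python. Reciprocal rank fusion is used here as one common way to merge the BM25 and dense rankings in stage 1 (an assumption, since the outline does not name the fusion method), and the `relevance` dictionary stands in for cross-encoder scores from stage 2.

```python
import math


def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(candidates, relevance, vectors, top_n, lam=0.7):
    """Maximal marginal relevance: trade relevance against redundancy."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < top_n:
        def score(doc_id):
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With two near-duplicate high-scoring chunks and one distinct lower-scoring chunk, `mmr` keeps one of the duplicates and the distinct chunk, which is the behavior the "skipping diversity" trap is about.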
Q: "How would you handle multilingual data in a RAG system?"¶
Tags: senior · occasional · scenario · source: Analytics Vidhya 40 Qs Q38
Answer outline: - Multilingual embedding model: Cohere multilingual, BGE-M3, multilingual-e5 — embed all languages into a shared space; cross-lingual retrieval works natively. - Per-language indexes: one index per language; route by language detection. Simpler ops, no cross-lingual queries. - Translate-then-embed: translate everything to English at ingest; lossy and expensive but lets you use English-only encoders. - For queries in mixed languages or code-switching, multilingual encoders are the only viable path. - Citations are tricky — if doc is in French and user is English, do you cite the French original or a translation? Surface both. - Numbers to drop: "BGE-M3 supports 100+ languages in a single 1024-d embedding space; recall typically 5-15 points below English-only on per-language benchmarks."
Common follow-ups: - "How does language detection sit in the pipeline?" - "Cost picture of translate-then-embed vs multilingual?"
Traps: - Defaulting to translate-then-embed without acknowledging information loss. - Forgetting language metadata for citation.
Q: "Design a RAG system that handles both structured (knowledge graph) and unstructured (text) data simultaneously."¶
Tags: staff · occasional · design · source: Kalyan RAG Hub Q56
Answer outline: - Parallel retrievers: one over the text vector index, one over the KG (Cypher/SPARQL or graph traversal). - Router: an intent classifier or LLM router decides whether the query needs structured lookup (relationship/aggregation) or text recall (definition/example). Many queries need both. - Result fusion: convert KG hits to text snippets ("Entity X is related to Y via Z, source: triple_id_42") and merge with text chunks before reranking. - Unified prompt: the LLM sees a mixed context — citations point to either source type, distinguished by metadata. - For 2026, this hybrid is increasingly the default in enterprise RAG (drug-disease lookups + clinical notes; compliance graphs + policies). - Numbers to drop: "Hybrid KG + text on a medical QA benchmark: ~15-25 points faithfulness lift on multi-hop questions vs text-only RAG."
Common follow-ups: - "How does the router decide?" - "What if the KG and text disagree?"
Traps: - Treating KG as a replacement for text — it complements, not replaces. - No conflict-resolution rule for disagreeing sources.
Related module: learning/01_ai_engineering/10_knowledge_graph_retrieval/
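The result-fusion step described above can be sketched as follows: KG triples are verbalized into short text snippets, tagged with their source type, and merged with text chunks into one context. The tuple layouts and names are hypothetical, and the reranking step that would normally follow the merge is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_type: str  # "kg" or "text"
    source_id: str
    text: str


def verbalize_triple(triple_id, subject, predicate, obj):
    """Turn a (subject, predicate, object) triple into a citable sentence."""
    sentence = f"{subject} {predicate.replace('_', ' ')} {obj}."
    return Evidence("kg", triple_id, sentence)


def build_context(kg_triples, text_chunks):
    """Merge KG and text evidence; each line carries a typed citation tag."""
    evidence = [verbalize_triple(*triple) for triple in kg_triples]
    evidence += [Evidence("text", chunk_id, body) for chunk_id, body in text_chunks]
    return "\n".join(f"[{e.source_type}:{e.source_id}] {e.text}" for e in evidence)
```

Keeping `source_type` in the citation tag lets the answer distinguish a graph fact from a passage, and it gives a conflict-resolution rule something to key on when the two disagree.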
Q: "What is contextualization in RAG and how does it affect performance?"¶
Tags: senior · occasional · conceptual · source: DataCamp Top 30 Q21
Answer outline: - Contextualization = enriching the retrieved chunk with surrounding context so the LLM understands its role in the source document. Counters the "isolated chunk problem". - Techniques: contextual chunk headers (prepend section title + parent heading to each chunk); document-level summary as preface (top of each chunk has a 1-sentence "what doc is this from" line); Anthropic-style contextual retrieval (LLM adds 50-100 token preamble to each chunk at ingest). - Cost: extra ingest tokens for each contextualization; one LLM pass per chunk for Anthropic-style. - Benefit: ~10-30 percentage point lift in retrieval precision on benchmark tests (Anthropic 2024 contextual retrieval blog: 49% reduction in top-20 retrieval failure rate when contextual embeddings are combined with contextual BM25). - Numbers to drop: "Anthropic contextual retrieval blog: ~$1.02 per million document tokens for contextualization (Claude 3 Haiku with prompt caching); 49% reduction in retrieval failures."
Common follow-ups: - "When does contextualization not pay off?" - "How does it compare to small-to-big retrieval?"
Traps: - Confusing contextualization with chunk overlap. - Skipping the ingest-time cost in the trade-off discussion.
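The cheapest technique above, contextual chunk headers, needs no LLM call. A minimal sketch, assuming the ingest pipeline already knows each chunk's document title and heading path (the function name and header format are illustrative):

```python
def contextualize(chunk_text, doc_title, section_path):
    """Prepend a one-line header so the chunk is interpretable in isolation."""
    header = f"Document: {doc_title} | Section: {' > '.join(section_path)}"
    return f"{header}\n{chunk_text}"
```

For example, a chunk reading "The fee is waived after 12 months." becomes retrievable for "account closure fee" once the header names the document and the section "Fees > Account closure". A common design choice is to embed and index the contextualized string while storing the raw chunk separately for display and citation.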
Q: "How do you handle long documents that exceed model context limits in RAG?"¶
Tags: senior · common · scenario · source: Analytics Vidhya 40 Qs Q35
Answer outline: - Hierarchical retrieval: index at multiple granularities (paragraph, section, doc); retrieve at paragraph level, expand to section as context. - Map-reduce: retrieve top-K chunks; the LLM summarizes each (map), then composes from summaries (reduce). Expensive in latency, but handles arbitrary length. - Outline retrieval: index a structured outline (TOC, headings); retrieve outline first, then drill down into selected sections. - Compression: LLM-driven context compression (LLMLingua, recompressor) prunes the retrieved chunks down to ~30-50% of their tokens. - Hybrid: the right mix depends on query type (factoid vs summarization). - Numbers to drop: "LLMLingua-2 reports ~2-4× compression with <1 point quality loss on typical RAG benchmarks."
Common follow-ups: - "When does map-reduce make sense vs hierarchical?" - "Cost of compression vs just using more context tokens?"
Traps: - Defaulting to map-reduce for every query — overkill on factoids. - Skipping compression mention in tight-budget scenarios.
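The hierarchical option above (retrieve at paragraph level, expand to section) can be sketched as a lookup plus a token budget. The mapping structures and the whitespace token counter are assumptions for illustration.

```python
def expand_to_sections(hit_paragraph_ids, paragraph_to_section, sections,
                       token_budget, count_tokens=lambda s: len(s.split())):
    """Map paragraph hits to their parent sections, deduplicated, within budget.

    hit_paragraph_ids is assumed to be in relevance order.
    """
    chosen, seen, used = [], set(), 0
    for paragraph_id in hit_paragraph_ids:
        section_id = paragraph_to_section[paragraph_id]
        if section_id in seen:
            continue  # two hits in one section cost its tokens only once
        cost = count_tokens(sections[section_id])
        if used + cost > token_budget:
            continue  # skip oversized sections, keep trying smaller ones
        seen.add(section_id)
        chosen.append(section_id)
        used += cost
    return chosen
```

When a section does not fit the budget, a natural fallback is to send the paragraph hit alone or a compressed version of the section, rather than dropping the evidence entirely.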
Q: "What is the role of metadata in advanced RAG retrieval?"¶
Tags: mid · common · conceptual · source: Analytics Vidhya 40 Qs Q25
Answer outline: - Metadata = anything about the chunk that isn't the chunk text: source, timestamp, author, ACL/tenant, document_type, section_path, language. - Pre-filter: hard filters at retrieval (tenant_id, date range, document_type) cut the candidate set before similarity scoring. Cheap and essential for multi-tenant or temporal queries. - Post-filter / boost: soft signals (recency boost, source-authority boost) modify rerank scores. - Citation enrichment: metadata flows into the LLM prompt so the answer can say "from contract_v2.pdf, page 14". - A RAG system without metadata is a toy. Production RAG is metadata-heavy. - Numbers to drop: "Metadata filters typically cut the search space by 90-99% for tenant-scoped queries → both speed and precision win."
Common follow-ups: - "Pre-filter vs post-filter — when each?" - "How do you handle high-cardinality metadata (millions of tenant IDs)?"
Traps: - Treating metadata as optional. - Post-filtering when pre-filtering would be cheaper and safer.
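A sketch of the two metadata roles above: a hard pre-filter that runs before similarity scoring, and a soft recency boost applied to rerank scores. The dictionary keys mirror the metadata fields listed in the outline; the half-life and weight values are arbitrary assumptions.

```python
from datetime import date


def prefilter(chunks, tenant_id, doc_types=None):
    """Hard filter: only chunks the caller may see, optionally by type."""
    return [
        c for c in chunks
        if c["tenant_id"] == tenant_id
        and (doc_types is None or c["document_type"] in doc_types)
    ]


def recency_boost(score, updated, today, half_life_days=180, weight=0.2):
    """Soft signal: add a bonus that halves every half_life_days."""
    age_days = max((today - updated).days, 0)
    return score + weight * 0.5 ** (age_days / half_life_days)
```

In a real vector store the pre-filter would be pushed down into the index query rather than applied in Python, but the ordering is the point: the tenant filter decides what is eligible, and the boost only reorders what survived it.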
Q: "What is span highlighting / offset-based citation in RAG, and why does it matter?"¶
Tags: senior · occasional · conceptual · source: Analytics Vidhya 40 Qs Q37
Answer outline:
- Span highlighting: each retrieved chunk includes the character offsets that map back to the source document. Citations point not just to the chunk but to the exact substring.
- Critical for: legal/medical/finance domains where users need to verify the source verbatim; UIs that highlight the supporting text in the source PDF.
- Forced-citation prompts: the LLM must emit [source_id:start-end] for each claim; without offsets, citations are unverifiable.
- Implementation: at chunk creation, store (doc_id, char_start, char_end); map back to PDF page+bbox if needed.
- Numbers to drop: "Span-level citations: 2-5× higher user trust in surveyed legal-AI products vs document-level citations alone."
Common follow-ups: - "How do you maintain offsets through chunking?" - "What if the LLM cites a span that doesn't support the claim?"
Traps: - Document-level citations only — interview red flag for high-stakes domains. - Forgetting to validate the citation post-generation.
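A minimal sketch of offset-based citation, assuming fixed-size character chunking and the `[source_id:start-end]` citation format from the outline above. The names are hypothetical, and end offsets are treated as exclusive.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    doc_id: str
    char_start: int
    char_end: int  # exclusive


def chunk_with_offsets(doc_id, text, size):
    """Fixed-size chunking that records where each chunk sits in the source."""
    return [Span(doc_id, i, min(i + size, len(text)))
            for i in range(0, len(text), size)]


CITATION = re.compile(r"\[([\w.-]+):(\d+)-(\d+)\]")


def resolve_citations(answer, docs):
    """Return (doc_id, start, end, quoted_text or None) for each citation."""
    resolved = []
    for doc_id, start, end in CITATION.findall(answer):
        s, e = int(start), int(end)
        source = docs.get(doc_id)
        valid = source is not None and 0 <= s < e <= len(source)
        resolved.append((doc_id, s, e, source[s:e] if valid else None))
    return resolved
```

`resolve_citations` only proves that a cited span exists and returns its verbatim text. Checking that the span actually supports the claim is a separate post-generation step, for example an entailment check between the claim and the quoted text.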