RAG Fundamentals — Interview Questions

The single highest-yield topic in 2026 AI engineer interviews. "Design a RAG system for customer support" is the most-reported question across companies. Every component is a failure mode worth interrogating.

Pipeline anatomy

Q: "What is Retrieval-Augmented Generation (RAG), and why is it critical for LLMs?"

Tags: screen · very-common · conceptual · source: Adil Shamim — Top 20 RAG Interview Questions Q1; DataCamp Top 30 RAG Q1

Answer outline: - RAG = retriever (fetches relevant evidence from external sources) + generator (LLM that answers given the evidence). Inference-time grounding, no weight changes. - Solves three concrete LLM weaknesses: frozen training data (no fresh facts), no access to private/enterprise data, and confident hallucinations on tail knowledge. - Reach for RAG when failures are "missing/stale facts"; reach for fine-tuning when failures are "wrong format/tone/behavior". They are not substitutes — production systems usually use both. - Adds latency (retrieval round-trip) and cost (embedding + extra context tokens) — quantify both before defending the architecture. - Numbers to drop: "Typical RAG adds 50-200ms retrieval latency and 1-3k tokens of context, ~$0.002-0.005 extra per query at 2026 prices."

Common follow-ups: - "Why not just put everything in the LLM's context window now that we have 1M+ token windows?" - "When is RAG the wrong tool?"

Traps: - Describing RAG as "an LLM that searches the internet" — it searches your indexed corpus. - Saying RAG "teaches the model new facts" — it injects them at runtime; weights don't move. - Treating it as a single box. RAG is a chain of 5-7 components, each its own failure mode.

Related cross-cutting: RAG vs fine-tune vs prompt engineering

Related module: learning/01_ai_engineering/08_rag_system_design/

Q: "Walk me through the full RAG pipeline step-by-step."

Tags: mid · very-common · conceptual · source: Adil Shamim Top 20 Q9; Kalyan RAG Hub Q15

Answer outline: - Offline (indexing): load → clean → chunk → embed → upsert into vector store, with metadata (source, timestamp, ACLs). - Online (query time): user query → optional rewrite/expand → embed query → ANN search top-K → optional rerank → assemble prompt with system + question + chunks + "answer only from context" → LLM generate → optional verification/citation pass. - Trace every stage as a span: each is a failure mode you'll need to debug. Tag chunk IDs into the trace so you can attribute hallucinations. - The number you optimize end-to-end is answer faithfulness, not retrieval recall — recall is necessary, not sufficient. - Numbers to drop: "Mature production pipelines: 8-12 stages, P95 end-to-end 600-1500ms, retrieval is 5-15% of that latency, generation is 70-85%."

Common follow-ups: - "Where does reranking sit in the pipeline and why?" - "Which stage would you cache first?"

Traps: - Skipping the verification/citation pass — interviewers will probe it. - Listing components without naming a failure mode for each. - Forgetting metadata. Filtering by ACL/tenant/time is not optional in production.

Related module: learning/01_ai_engineering/08_rag_system_design/08-rag-pipeline.md

Q: "What are the two main components of a RAG application's architecture?"

Tags: screen · very-common · conceptual · source: Adil Shamim Top 20 Q7

Answer outline: - Retriever: any system that turns a query into a ranked list of evidence chunks. Sparse (BM25), dense (bi-encoder), or hybrid. - Generator: the LLM that synthesizes an answer conditioned on the query + retrieved context. Often the same model that handles regular chat. - Production reality has a third critical component most candidates miss: the evaluator — faithfulness + relevance scoring is what closes the loop. - Frameworks (LangChain, LlamaIndex, Haystack) wrap these three, but the architecture is framework-agnostic.

Common follow-ups: - "What's the third component most people forget?" - "Can the retriever and generator share a model?"

Traps: - Conflating retriever quality with generator quality — different metrics, different fixes. - Mentioning frameworks instead of the abstract architecture.

Q: "Explain the indexing process in a RAG pipeline and why it's essential."

Tags: mid · very-common · conceptual · source: Kalyan RAG Hub Q11; Adil Shamim Top 20 Q8

Answer outline: - Ingest → parse (PDF/HTML/Markdown) → clean → chunk → enrich with metadata → embed → upsert into vector store with the metadata as filterable fields. - Idempotency matters: re-ingesting the same source must not create duplicates. Use a stable content hash as primary key. - Incremental indexing > full reindex. Tag chunks with source_id + version so reindex is a delta. - Run a tiny smoke retrieval after every batch — "indexed 10k docs, retrieved top-5 for a canary query, found expected doc" — to catch silent failures. - Numbers to drop: "1M chunks at 1536-dim float32 ≈ 6GB raw; with PQ-compression 256-byte codes, ~256MB. Embedding cost: ~$5-50 per million 500-token chunks at 2026 prices."
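The idempotency point can be sketched as follows. This is a minimal illustration, assuming an in-memory dict stands in for the vector store and that the embedding step is omitted; `chunk_id` and `upsert` are hypothetical helper names, not part of any library.

```python
import hashlib

def chunk_id(source_id: str, text: str) -> str:
    """Stable primary key: the same source and text always hash to the same id."""
    digest = hashlib.sha256(f"{source_id}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]

def upsert(store: dict, source_id: str, version: int, chunks: list[str]) -> int:
    """Write chunks keyed by content hash; return how many new keys were created."""
    created = 0
    for text in chunks:
        key = chunk_id(source_id, text)
        if key not in store:
            created += 1
        # Overwriting an existing key refreshes metadata without duplicating the chunk.
        store[key] = {"source_id": source_id, "version": version, "text": text}
    return created

store: dict = {}
batch = ["Refunds take 5 days.", "Contact support by email."]
first = upsert(store, "handbook.md", 1, batch)
second = upsert(store, "handbook.md", 1, batch)  # re-ingest of the same source
print(first, second, len(store))
```

Re-ingesting the same batch creates no new keys, which is the property the outline asks for. A real pipeline would also delete keys whose content no longer appears in the newer version of the source.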

Common follow-ups: - "How do you handle a document that gets updated in place?" - "What's the cost of getting the chunking wrong before you notice?"

Traps: - Treating indexing as a one-time job. Documents update; embeddings drift when models update. - Forgetting backup/versioning of the index itself.

Q: "What types of data sources can a RAG system ingest?"

Tags: screen · common · conceptual · source: Adil Shamim Top 20 Q6; DataCamp Top 30 Q4

Answer outline: - Unstructured: PDFs, Word docs, HTML, Markdown, Slack/Teams transcripts, email, support tickets, source code, audio transcripts. - Structured: SQL tables (each row → templated text), knowledge graphs (each triple → templated text), CSVs. - Hybrid: PDFs with tables — use a table-aware parser (Camelot, Tabula, Azure Document Intelligence) and keep table chunks separate. - Source-specific extraction is the largest source of garbage-in / garbage-out. Always test the parser before defending an embedding model.

Common follow-ups: - "How do you ingest a 200-page PDF with tables and figures?" - "Why don't you just OCR everything to text?"

Traps: - Saying "we just dump it into LangChain" — the loader choice matters more than the framework. - Forgetting that audio/video transcripts have their own chunking constraints (speaker turn boundaries).

Related module: learning/01_ai_engineering/06_evidence_data_pipelines/

Q: "What are the main benefits of using RAG instead of just relying on an LLM's internal knowledge?"

Tags: screen · very-common · conceptual · source: DataCamp Top 30 Q2

Answer outline: - Freshness: knowledge updates by ingesting new docs, not by retraining a model. - Source attribution: every claim cites a chunk; auditable answers. - Hallucination control: the model has the evidence on the desk; the prompt instructs "answer only from these". - Private data: training on customer data is a compliance hazard; injecting at inference keeps it out of the weights, though ACL filtering and data-handling rules still apply. - Cost: cheaper than fine-tuning for fact-heavy tasks; you pay tokens, not GPU-hours. - Numbers to drop: "RAG iteration loop: hours. Fine-tune loop: days. Update freshness target: docs available in retrieval within 5 minutes of ingest."

Common follow-ups: - "What does RAG not fix that fine-tuning does?" - "Your customer says 'just train it on our manual' — what do you tell them?"

Traps: - Claiming RAG removes hallucinations entirely — it reduces them; doesn't eliminate. - Claiming RAG eliminates fine-tuning — they compose.

Related cross-cutting: RAG vs fine-tune vs prompt engineering

Q: "Give examples of real-world applications where RAG demonstrated value."

Tags: screen · common · conceptual · source: Kalyan RAG Hub Q10; DataCamp Top 30 Q3

Answer outline: - Customer support: ground answers in product docs + ticket history. Reduces resolution time and human escalations. - Internal knowledge bots: HR policies, engineering wikis, runbooks. Best ROI use case in most companies. - Legal/medical/finance: high-stakes domains where citations are mandatory and hallucinations are unacceptable. - Code search & code Q&A: index repo + docs; answer "where do we handle X?" with file pointers. - Be specific in the interview: name the company or product if you can. "Like Notion AI Q&A on your workspace" beats "knowledge management".

Common follow-ups: - "Which of these saw the biggest ROI in your past work?" - "Where does RAG not fit — what kinds of products would you say no to?"

Traps: - Listing generic "chatbot" without a concrete domain. The trap is being too abstract. - Forgetting compliance constraints — legal/medical RAG without audit logging is unshippable.

Chunking

Q: "Why is chunking necessary in RAG?"

Tags: screen · very-common · conceptual · source: Kalyan RAG Hub Q12; Adil Shamim Top 20 Q15

Answer outline: - Embedding models have a fixed input window (e.g., 512 tokens for BGE, 8192 for text-embedding-3-large). Past that, content gets truncated or averaged into a noisy vector. - Long documents mix multiple topics; one vector for the whole doc represents the average of unrelated ideas → low retrieval precision. - LLMs have context budgets; you ship the top-K chunks, not the top-K full documents. - Optimal chunk size depends on document type: docs that are conceptual paragraphs chunk well at 300-800 tokens; code/tables need structure-aware splits. - Numbers to drop: "Rule of thumb: 256-512 token chunks with 10-20% overlap is the default starting point. Tune from there using retrieval metrics."

Common follow-ups: - "What happens with chunks that are too small?" - "Your domain has 50-page contracts. What chunk strategy?"

Traps: - Defending one chunk size for all data types. - Skipping the overlap discussion — chunk boundaries lose context without it.

Q: "How do you choose chunk size for a RAG system?"

Tags: mid · very-common · scenario · source: Kalyan RAG Hub Q13

Answer outline: - Start at 300-500 tokens with 10-20% overlap as a default. This works for most prose corpora. - Constraint 1: embedding model's max input — don't exceed it. - Constraint 2: LLM context budget for top-K — if K=5 and budget is 4k tokens, max chunk size is ~800. - Constraint 3: information density — code/tables need smaller chunks; narrative prose tolerates larger. - Tune empirically using a labeled eval set: vary chunk size, measure Context Precision/Recall and downstream faithfulness. Don't pick by intuition. - Numbers to drop: "Eval setup: 100 (query, golden-chunk) pairs. Sweep chunk size in {128, 256, 512, 1024}. Pick by Context Precision@5."
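The first two constraints reduce to simple arithmetic, sketched below. The function name and parameters are illustrative assumptions; the sketch ignores tokens spent on the system prompt and the question, which a real budget would subtract first.

```python
def max_chunk_tokens(context_budget: int, top_k: int, embed_max_input: int) -> int:
    """Upper bound on chunk size from the LLM context budget and the embedder's input limit."""
    per_chunk_budget = context_budget // top_k  # constraint 2: K chunks must fit the budget
    return min(per_chunk_budget, embed_max_input)  # constraint 1: never exceed the embedder

# The outline's example: K=5 with a 4k-token budget.
roomy_embedder = max_chunk_tokens(context_budget=4000, top_k=5, embed_max_input=8192)
small_embedder = max_chunk_tokens(context_budget=4000, top_k=5, embed_max_input=512)
print(roomy_embedder, small_embedder)  # 800 512
```

With a 512-token embedder the embedding limit binds before the context budget does, so the sweep over chunk sizes should stop at 512 for that model.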

Common follow-ups: - "How big can overlap go before it starts hurting?" - "What if the same fact spans two chunks?"

Traps: - Picking by intuition without an eval set. - Confusing chunk size with retrieval K.

Q: "What are the trade-offs between chunking documents into larger versus smaller chunks?"

Tags: mid · very-common · conceptual · source: DataCamp Top 30 Q19; Kalyan RAG Hub Q14

Answer outline: - Small chunks (≤256 tokens): sharper embeddings, better precision, but lose long-range context. Multi-step reasoning across the doc breaks. - Large chunks (≥1024 tokens): preserve context, but embeddings average too many ideas → lower precision; cost more LLM tokens per retrieved chunk. - The "right" size depends on whether queries are factoid (small) or summarization-style (large). - A robust default: small chunks for retrieval, with the parent document or neighboring chunks fetched alongside for context expansion ("small-to-big" retrieval). - Numbers to drop: "Small-to-big: index at 256 tokens, retrieve top-5, then expand each to its 1024-token parent block. Best of both."
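A minimal sketch of the small-to-big expansion step, assuming each indexed small chunk carries a `parent_id` and that parent blocks live in a plain dict. The retrieval call itself is omitted, and `hits` stands in for its ranked output; all names here are hypothetical.

```python
def expand_to_parents(hits: list[dict], parents: dict[str, str]) -> list[str]:
    """Swap each retrieved small chunk for its parent block, deduplicated in rank order."""
    seen: set[str] = set()
    contexts: list[str] = []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:  # two small chunks may share one parent
            seen.add(parent_id)
            contexts.append(parents[parent_id])
    return contexts

parents = {
    "sec-1": "Full text of section 1 ...",
    "sec-2": "Full text of section 2 ...",
}
hits = [  # pretend ranked output of a search over 256-token chunks
    {"chunk_id": "c3", "parent_id": "sec-2"},
    {"chunk_id": "c1", "parent_id": "sec-1"},
    {"chunk_id": "c4", "parent_id": "sec-2"},
]
contexts = expand_to_parents(hits, parents)
print(len(contexts))  # 2
```

Deduplicating by parent matters for cost: without it, two hits from the same section would ship the same 1024-token block twice.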

Common follow-ups: - "What's small-to-big retrieval?" - "When would you ship 100-token chunks?"

Traps: - Treating it as a single dial — most production setups use multiple chunk sizes for different document types. - Ignoring cost — bigger chunks mean more tokens shipped per query.

Q: "What are the common chunking methods, and what are their pros and cons?"

Tags: mid · common · conceptual · source: DataCamp Top 30 Q18; Kalyan RAG Hub Q34

Answer outline: - Fixed-size: cheap, deterministic; breaks across sentences/sections randomly. Default for prototype. - Sentence/paragraph: respects natural boundaries; uneven sizes. Use NLTK/spaCy for sentence splits. - Recursive character (LangChain default): tries paragraph → line → word → character fallback. Good general purpose. - Semantic: embed sentences, split where embedding similarity drops below threshold. Better coherence; 2-3× slower to index. - Structure-aware: for Markdown/code, split on headings/functions. Always preferred when structure exists. - Late chunking: embed full document first, then chunk the token-level embeddings. Preserves long-range context but limited to docs that fit the encoder's window. - Numbers to drop: "Semantic chunking adds ~10-20% to indexing cost vs fixed-size for ~5-10% improvement in Context Precision on prose."
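As one illustration of the structure-aware option, the sketch below splits Markdown on ATX headings using only the standard library. It is a simplified assumption of how such a splitter could look: it does not skip `#` lines inside fenced code blocks, and `split_markdown_by_heading` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")

def split_markdown_by_heading(text: str) -> list[dict]:
    """Return one chunk per heading, keeping the heading text as metadata."""
    sections: list[dict] = []
    current = {"heading": "", "lines": []}
    for line in text.splitlines():
        if HEADING.match(line):
            if current["heading"] or current["lines"]:
                sections.append(current)  # close the previous section
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] or current["lines"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

doc = "# Install\npip install acme\n\n# Usage\nRun `acme start`.\nThen open the UI."
sections = split_markdown_by_heading(doc)
for section in sections:
    print(section["heading"], "->", repr(section["text"]))
```

Sections that exceed the embedding model's input limit would still need a secondary split, for example the recursive fallback described above.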

Common follow-ups: - "When does semantic chunking lose to fixed-size?" - "What is late chunking and when does it help?"

Traps: - Defending semantic chunking universally — it's not always worth the cost. - Forgetting Markdown/code chunking — structure-aware always beats character-based for structured docs.

Q: "What is the purpose of character overlap during chunking?"

Tags: screen · common · conceptual · source: Kalyan RAG Hub Q8

Answer outline: - Without overlap, a sentence that bridges two chunks gets split — one chunk has the subject, the next has the verb. Retrieval misses both. - Overlap (10-20% of chunk size) duplicates the boundary tokens so context survives the split. - Cost: indexing size grows by the overlap percentage; cost of retrieval is unchanged. - Excessive overlap (>30%) inflates the index and creates near-duplicate retrievals. - Numbers to drop: "Standard: 50-token overlap on 256-token chunks (~20%). Adds 20% to vector store size; recovers most boundary-cut answers."
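A minimal sketch of fixed-size chunking with overlap, assuming the text is already tokenized into a list; the placeholder tokens and the function name are illustrative only.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap` each time."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # the window already reached the end
            break
    return chunks

tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, size=256, overlap=50)
no_overlap = chunk_with_overlap(tokens, size=256, overlap=0)
print(len(chunks), len(no_overlap))  # 5 4
```

The first 50 tokens of each chunk repeat the last 50 of the previous one, so a sentence cut at a boundary appears whole in at least one chunk as long as it is shorter than the overlap.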

Common follow-ups: - "What if a fact spans more than 2× the overlap?" - "Does overlap matter for code chunks?"

Traps: - Setting overlap = 0 because "embeddings are smart" — they aren't, for cut sentences. - Setting overlap > 50% — you're just paying for duplicate retrieval.

Q: "How does chunking strategy differ for structured documents (PDFs with tables/figures) versus plain text?"

Tags: senior · common · scenario · source: Kalyan RAG Hub Q37

Answer outline: - Plain text: recursive character or semantic chunking works. - Structured PDF: parse with table-aware tools (Camelot, Azure Document Intelligence, Unstructured.io). Tables and figures get their own chunks with structured metadata, not flattened into prose. - For tables: serialize each row as a sentence ("Customer Acme paid $50k in Q3") plus keep the original table as metadata for citation. Embed the row-as-sentence. - For figures: caption-as-text + image hash; multimodal RAG handles the image vector separately. - Code: chunk by function/class boundaries (tree-sitter); never split mid-function. - Numbers to drop: "A real benchmark: switching from naive PDF text extraction to table-aware parsing lifted Context Precision@5 from 0.42 to 0.71 on a financial QA dataset (LlamaIndex blog, 2025)."
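The row-as-sentence idea for tables might look like the sketch below. The column names (`customer`, `amount`, `quarter`) and the sentence template are assumptions chosen to reproduce the example sentence above; a real table needs a template per schema.

```python
def row_to_sentence(row: dict[str, str]) -> str:
    """Template one table row into a sentence that an embedding model can encode."""
    return f"Customer {row['customer']} paid {row['amount']} in {row['quarter']}."

def table_to_chunks(table_id: str, rows: list[dict[str, str]]) -> list[dict]:
    """Embed the sentence; keep the raw row in metadata so citations can show the table."""
    return [
        {
            "text": row_to_sentence(row),
            "metadata": {"table_id": table_id, "row_index": i, "raw_row": row},
        }
        for i, row in enumerate(rows)
    ]

rows = [
    {"customer": "Acme", "amount": "$50k", "quarter": "Q3"},
    {"customer": "Globex", "amount": "$12k", "quarter": "Q3"},
]
table_chunks = table_to_chunks("revenue-by-customer", rows)
print(table_chunks[0]["text"])  # Customer Acme paid $50k in Q3.
```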

Common follow-ups: - "How do you embed a table for retrieval?" - "Where does multi-modal RAG enter for figures?"

Traps: - Flattening tables to text and embedding — destroys all structure. - Treating PDFs as one homogeneous input. PDFs are heterogeneous; the parser owns the answer.

Related module: learning/01_ai_engineering/06_evidence_data_pipelines/

Q: "Your chunk size is 512 but legal documents have 50-page contracts. What breaks?"

Tags: senior · very-common · scenario · source: applied_ai_interview_focus.md (synthesized from multiple loops)

Answer outline: - 512-token chunks on a 50-page contract = ~150 chunks per doc. Top-K=5 retrieves 5 disconnected paragraphs; nuanced clauses with cross-references break. - Symptom: "answer is technically in the docs but the model says 'not found' or hallucinates a synthesis". - Fix 1: hierarchical / parent-document retrieval — retrieve at small chunks, expand to clause-level or section-level parents for the LLM. - Fix 2: structure-aware chunking by contract section (Articles, Clauses) — legal docs have rigid structure to exploit. - Fix 3: metadata-aware re-ranking — boost chunks whose section_id matches a query intent classifier. - Numbers to drop: "On a real legal corpus: switching to section-aware chunks raised faithfulness from 0.61 to 0.84 even with the same embedding model."

Common follow-ups: - "Show me the prompt change after you switch to parent retrieval." - "How do you handle a clause that cross-references another clause by name?"

Traps: - Just raising chunk size to 8000 — embeddings get noisy; LLM context fills up; cost balloons. - Ignoring structure — legal docs have section numbers; use them.

Related cross-cutting: Chunk size trade-offs

Q: "Page 1 of a financial report says 'all amounts in thousands.' When you chunk page-by-page, that qualifier is lost. How do you preserve document-wide context?"

Tags: senior · common · scenario · source: 2026 RAG loop (document-context probe); Anthropic Contextual Retrieval

Answer outline: - Name the failure: independent chunks lose document-global facts — units, currency, effective date, defined terms ("the Company means Acme"). Retrieve the table chunk without "in thousands" and the model reports 5,000 instead of 5,000,000. The qualifier can be 40 pages away, so bigger chunks don't fix it. - Fix 1 — Contextual Retrieval (Anthropic pattern): before embedding, prepend a short LLM-generated blurb situating each chunk in its document ("This table reports Q3 revenue in thousands of USD for Acme Corp"). Embed + index the contextualized chunk. Pairs well with BM25 + rerank. - Fix 2 — document-level metadata propagation: extract global facts once per doc (units, currency, fiscal year, parties) and attach to every chunk as metadata or a prepended line. - Fix 3 — hierarchical / parent retrieval: retrieve the small chunk, expand to the section or document header that carries the qualifier. - Fix 4 — inject the global facts into the answer-time preamble regardless of which chunks retrieved. - Cost note: contextualizing every chunk is N extra LLM calls at index time — make it cheap by prompt-caching the full document prefix (exactly what Anthropic's writeup does). - Numbers to drop: "Contextual Retrieval: ~35% fewer retrieval failures with contextual embeddings alone, ~49% with contextual BM25 added, ~67% with reranking on top (Anthropic 2024)", "global metadata extracted once per doc, attached to all chunks", "index-time cost amortized with prompt caching"
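Fix 2 (metadata propagation) can be sketched without any LLM call. The example assumes the global facts were already extracted once per document; the fact keys and the bracketed header format are illustrative choices, not a standard.

```python
def with_global_context(chunk: str, facts: dict[str, str]) -> str:
    """Prepend document-wide facts so every chunk carries them into the embedding and the prompt."""
    header = "; ".join(f"{key}: {value}" for key, value in facts.items())
    return f"[{header}]\n{chunk}"

# Hypothetical facts extracted once from page 1 of the report.
doc_facts = {"company": "Acme Corp", "units": "thousands of USD", "period": "Q3"}
page_chunks = ["Revenue 5,000", "Operating expenses 3,200"]

contextualized = [with_global_context(chunk, doc_facts) for chunk in page_chunks]
print(contextualized[0])
```

Every chunk now states its units, so a retrieved table row no longer depends on page 1 being retrieved alongside it. The same dict can also be stored as filterable metadata.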

Common follow-ups: - "How do you keep index-time contextualization cheap?" (cache the document prefix across its chunks) - "What global facts would you extract for a contract vs a financial report?"

Traps: - Assuming bigger chunks fix it — the qualifier may be tens of pages away. - Re-feeding the whole document per chunk with no caching → index-time cost explosion.

Related cross-cutting: Chunk size trade-offs

Related module: learning/01_ai_engineering/09_advanced_rag_patterns/

Embeddings & similarity

Q: "What are embeddings, and how are they used in RAG retrieval?"

Tags: screen · very-common · conceptual · source: Kalyan RAG Hub Q43

Answer outline: - An embedding is a fixed-dimensional vector (typically 768-3072 dim) where semantically similar inputs have geometrically close vectors. - Training objective: contrastive — pull similar pairs together, push dissimilar apart (e.g., InfoNCE on sentence pairs). - At index time: every chunk → vector. At query time: query → vector. Retrieval = approximate nearest-neighbor search. - The embedding model is the semantic backbone of the system — switching it is a full reindex. - Numbers to drop: "BGE-large (1024d), text-embedding-3-large (3072d), Cohere embed-v3 (1024d) are the 2026 production defaults. 3072d ≈ 12KB per chunk at float32."

Common follow-ups: - "Why are the dimensions what they are?" - "Can two semantically opposite sentences have similar embeddings?"

Traps: - Calling them "BERT embeddings" — most production retrieval uses sentence-encoders specifically trained for similarity (E5, BGE, Cohere, OpenAI), not raw BERT [CLS]. - Forgetting normalization. Most retrieval relies on L2-normalized vectors.

Related module: learning/01_ai_engineering/08_rag_system_design/05-embeddings.md

Q: "What role does cosine similarity play in RAG, and why is it preferred?"

Tags: screen · very-common · conceptual · source: Kalyan RAG Hub Q9, Q49

Answer outline: - Cosine measures the angle between two vectors, ignoring magnitude. For sentence embeddings, magnitude often encodes irrelevant signal (sentence length, frequency). - Equivalent to dot product when both vectors are L2-normalized — most production stacks normalize and use dot product for speed. - Euclidean distance is sensitive to magnitude; less stable across documents of varying length. - Choice depends on the model — most modern sentence-encoders are explicitly trained with cosine, so cosine is what you should use at retrieval. - Numbers to drop: "Cosine on normalized vectors ≈ 1× the cost of dot product. Index types (HNSW, IVF) all support both."
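The equivalence between cosine and dot product on L2-normalized vectors is easy to check numerically. The sketch below uses plain Python lists and toy 2-dimensional vectors as an assumption; production code would use a vectorized library.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
scaled_a = [30.0, 40.0]  # same direction as a, ten times the magnitude

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(cos_ab, 6), round(dot_normalized, 6))         # both 0.96
print(round(cosine(scaled_a, b), 6), dot(scaled_a, b))    # cosine unchanged, raw dot grows
```

Scaling a vector changes the raw dot product but not the cosine, which is why mixing normalized and unnormalized vectors in one dot-product index skews the ranking.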

Common follow-ups: - "When would you actually pick Euclidean?" - "What happens if you forget to normalize?"

Traps: - Saying cosine is "better than Euclidean universally" — it's matched to how the model was trained. - Mixing normalized and unnormalized vectors in the same index.

Q: "How do you choose an embedding model for a RAG system?"

Tags: mid · very-common · scenario · source: Kalyan RAG Hub Q44; Adil Shamim Top 20

Answer outline: - Use the MTEB leaderboard as the starting filter; pick top-5 candidates that fit your domain (English-only? multilingual? code? legal?). - Benchmark on your data with a small labeled set (~100 query-chunk pairs). MTEB ranks generic; your domain may invert. - Trade off cost (open-source self-hosted vs API) vs quality vs dimension (higher dim = more storage + slower ANN search but usually better recall). - Consider context length: BGE-small (512 tokens) vs text-embedding-3-large (8192) — chunk-size strategy depends on it. - Stability matters: an OpenAI deprecation forces a full reindex. Self-hosted (BGE, Stella, GTE) gives you version control. - Numbers to drop: "Production sweet spots in 2026: BGE-large-en-v1.5 (1024d, open, ~$0/M tokens self-hosted), text-embedding-3-large (3072d, ~$0.13/M tokens), Cohere embed-v3 (1024d, multilingual)."

Common follow-ups: - "When does fine-tuning the embedding model pay off?" - "Your stack is text-embedding-ada-002 from 2023. What do you do?"

Traps: - Defending OpenAI embeddings without acknowledging vendor lock-in. - Picking by MTEB only, without testing on your domain.

Q: "What's the difference between sparse and dense embeddings, and when do you use each?"

Tags: mid · common · conceptual · source: Kalyan RAG Hub Q54

Answer outline: - Sparse (BM25, TF-IDF, SPLADE): vectors are mostly zeros with high values on the words actually present. Strong on exact-match (rare entities, IDs, SKUs, code symbols). - Dense (BGE, OpenAI, Cohere): all dimensions filled by a neural encoder. Strong on semantic similarity ("revenue" ↔ "income"), but can miss exact match (product codes, names). - Hybrid combines both via RRF or weighted sum — almost always wins in production. The interview cliché is "hybrid is the default in 2026". - SPLADE is the bridge: sparse vectors learned by a transformer, so you get learned term expansion while staying interpretable and servable from an inverted index. - Numbers to drop: "On enterprise QA: pure dense ~0.62 nDCG@10, pure BM25 ~0.55, hybrid (RRF) ~0.71. The 9-point lift is why hybrid wins."

Common follow-ups: - "What is RRF and how does it combine scores?" - "Your data is product SKUs in support tickets — would you pick dense or sparse first?"

Traps: - Saying dense is always better — it loses on exact-match retrieval. - Forgetting hybrid exists in the answer.

Related cross-cutting: Sparse vs dense vs hybrid retrieval

Q: "What is approximate nearest neighbor (ANN) search, and why is it used in RAG?"

Tags: mid · common · conceptual · source: Kalyan RAG Hub Q46-Q47

Answer outline: - Exact nearest-neighbor over 1M+ vectors is O(N) per query and too slow. - ANN trades recall for speed: HNSW builds a navigable small-world graph (typical), IVF clusters into Voronoi cells, PQ compresses vectors into byte codes. - HNSW is the production default (Qdrant, Weaviate, pgvector): logarithmic-time queries, recall@10 typically 0.95-0.99 with proper params (efSearch, M). - For 100M+ vectors, IVF-PQ or DiskANN — trade more recall for less memory. - Numbers to drop: "HNSW on 10M 768d vectors: ~5-10ms P95 query, ~30GB RAM. IVF-PQ same data, ~1-3GB RAM, ~20ms P95, ~3-5 percentage points recall hit."
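Because ANN is approximate, its recall has to be measured against exact search. The sketch below shows the measurement only: a brute-force O(N) scan supplies the ground truth, and `approx_ids` is a hard-coded stand-in for whatever an ANN index would return. Vectors and ids are toy values.

```python
def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force search: score every vector by dot product and keep the best k."""
    def score(vector_id: str) -> float:
        return sum(q * x for q, x in zip(query, vectors[vector_id]))
    return sorted(vectors, key=score, reverse=True)[:k]

def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.7, 0.7],
}
exact_ids = exact_top_k([1.0, 0.0], vectors, k=2)
approx_ids = ["a", "d"]  # pretend output of an ANN index that missed "b"
recall = recall_at_k(approx_ids, exact_ids)
print(exact_ids, recall)  # ['a', 'b'] 0.5
```

Averaging this over a sample of real queries gives the recall@k figure to track while tuning index parameters such as efSearch.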

Common follow-ups: - "What parameter tunes HNSW for higher recall?" - "When do you graduate from pgvector to a dedicated vector DB?"

Traps: - Treating ANN as a black box; not naming HNSW/IVF. - Forgetting recall — ANN is approximate; measure it.

Q: "What is quantization for embeddings, and what trade-offs does it introduce?"

Tags: senior · occasional · conceptual · source: Kalyan RAG Hub Q58-Q60

Answer outline: - Float32 1536-dim vector = 6KB. Scalar quantization (float32 → int8) cuts to 1.5KB with ~1-2% recall loss. Binary quantization (1 bit per dim) cuts to 192 bytes with ~5-10% recall loss. - Product Quantization (PQ): split vector into M subvectors, replace each with the nearest centroid index in a learned codebook. ~30× compression typical. - Use cases: scaling to 100M+ vectors, edge deployment, multi-tenant RAG where storage cost dominates. - Validate per domain — recall hit varies wildly. Some domains tolerate binary; others fall apart. - Numbers to drop: "Cohere binary embeddings on MIRACL: ~95% of float32 recall at 1/32 the storage."
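The storage figures above follow from bits per dimension, and binary quantization itself is a sign test. The sketch below checks the arithmetic and shows a toy binarization compared by Hamming distance; the threshold at zero and the helper names are assumptions for illustration.

```python
def bytes_per_vector(dim: int, bits_per_dim: int) -> int:
    """Raw storage for one vector, ignoring index overhead."""
    return dim * bits_per_dim // 8

float32_bytes = bytes_per_vector(1536, 32)  # 6144 bytes, about 6KB
int8_bytes = bytes_per_vector(1536, 8)      # 1536 bytes, about 1.5KB
binary_bytes = bytes_per_vector(1536, 1)    # 192 bytes
print(float32_bytes, int8_bytes, binary_bytes)

def binarize(vector: list[float]) -> list[int]:
    """Binary quantization: keep only the sign of each dimension."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a: list[int], b: list[int]) -> int:
    """Number of differing bits; smaller means more similar."""
    return sum(x != y for x, y in zip(a, b))

code_a = binarize([0.2, -0.1, 0.7, -0.4])
code_b = binarize([0.1, 0.3, 0.5, -0.2])
distance = hamming(code_a, code_b)
print(code_a, code_b, distance)  # [1, 0, 1, 0] [1, 1, 1, 0] 1
```

At 1M chunks the float32 figure is about 6GB, which matches the indexing numbers quoted earlier. A common pattern is to search the binary codes first and rescore the shortlist with full-precision vectors.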

Common follow-ups: - "When would you skip quantization?" - "Matryoshka embeddings — how do they compare?"

Traps: - Defending quantization without measuring recall on your data. - Confusing embedding quantization with model quantization (different concepts).

Retrieval basics

Q: "How does the retriever work in a RAG system? What are common retrieval methods?"

Tags: screen · very-common · conceptual · source: DataCamp Top 30 Q6; Kalyan RAG Hub Q40

Answer outline: - The retriever maps a query → ranked list of candidate chunks. Three families: - Sparse (lexical): BM25, TF-IDF — score = term-frequency × inverse-document-frequency. Strong on exact match, weak on synonyms. - Dense (semantic): bi-encoder embeds query and chunks; ranked by cosine. Strong on synonyms, weak on rare entities/IDs. - Hybrid: combine sparse + dense scores via Reciprocal Rank Fusion (RRF) or weighted sum. Production default. - Optionally followed by a reranker (cross-encoder) on top-K. - Numbers to drop: "Typical stack: retrieve top-50 hybrid, rerank to top-5 with cross-encoder. Total latency ~50-100ms."

Common follow-ups: - "Where does the reranker sit and why?" - "How does RRF actually compute the fused score?"

Traps: - Saying "the retriever uses cosine similarity" — that's the scoring, not the retrieval. ANN over an index is the retrieval. - Confusing retriever output (chunks) with generator input (prompt).

Q: "Describe what a hybrid search is and how it works."

Tags: mid · very-common · conceptual · source: DataCamp Top 30 Q12; Kalyan RAG Hub Q51

Answer outline: - Run BM25 and dense retrieval in parallel; merge their result lists. - RRF (Reciprocal Rank Fusion): for each candidate, sum 1 / (k + rank) across both retrievers (typical k=60). Robust, and k is the only knob, rarely tuned. - Weighted sum: normalize both score distributions to [0,1], then α × dense + (1-α) × sparse. Tunable but fragile to score-distribution shifts. - Hybrid almost always lifts recall by 5-15 percentage points over either alone on real corpora. - Costs ~2× retrieval compute and a second index to store; run in parallel, latency tracks the slower retriever rather than doubling. - Numbers to drop: "Pinecone hybrid, Weaviate hybrid, Qdrant fusion, Vespa all support RRF natively in 2026."
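The RRF formula above fits in a few lines. This sketch assumes each retriever returns a ranked list of document ids with rank starting at 1; the ids and the two example rankings are made up.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd2', 'd4']
```

Only ranks enter the formula, so the incompatible score scales of BM25 and cosine never need to be normalized. Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever.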

Common follow-ups: - "Why RRF over weighted sum?" - "When does pure dense actually beat hybrid?"

Traps: - Skipping hybrid in a senior interview — it's the production default. - Defending weighted sum without normalization (one score type will dominate).

Q: "How do you choose the right retriever for a RAG application?"

Tags: mid · common · scenario · source: DataCamp Top 30 Q11

Answer outline: - Default to hybrid (BM25 + dense). The interview cliché but it's correct in 2026. - Pure dense if your queries are paraphrastic and entities are infrequent (e.g., conceptual Q&A on Wikipedia). - Pure sparse if your queries are heavy on rare entities (product SKUs, error codes, function names). - Add a cross-encoder reranker if Context Precision@5 is below 0.7 after retrieval. - Domain-fine-tune the embedding model if recall plateaus — usually +2-8 points on niche corpora. - Numbers to drop: "Order to add complexity: BM25 → hybrid → hybrid + rerank → hybrid + rerank + FT embeddings. Each step is ~2-4× the cost of the previous."

Common follow-ups: - "When does fine-tuning the embedding model beat upgrading it?" - "Show your cost model for adding rerank on 1M queries/day."

Traps: - Picking based on benchmark scores alone; not measuring on your domain. - Adding all four layers at once before validating each.

Q: "How do you handle ambiguous or incomplete user queries in a RAG system?"

Tags: mid · common · scenario · source: DataCamp Top 30 Q10; Kalyan RAG Hub Q24

Answer outline: - Query rewriting: an LLM expands "How do I reset?" → "How do I reset my Acme account password?" using session context. - Multi-query expansion: generate 3-5 paraphrases; retrieve for each; union (often with RRF). - HyDE: generate a hypothetical answer to the query; embed that and retrieve. Powerful for short queries. - Clarifying back-and-forth: if confidence is low, ask the user before retrieving. Most production systems skip this in favor of expansion. - Conversational rewrite: for multi-turn, rewrite the query using last 2-3 turns as standalone. - Numbers to drop: "Multi-query expansion: 3 paraphrases at retrieval lifts recall by 5-12 points on conversational benchmarks; adds 2× embedding cost."

Common follow-ups: - "What's HyDE and when does it help?" - "How do you know the query needs rewriting?"

Traps: - Blindly expanding every query — adds cost and can dilute focused queries. - Missing the conversational rewrite case; multi-turn is where most production RAG breaks.

Q: "How does a RAG system maintain context in a multi-turn conversation?"

Tags: mid · common · scenario · source: DataCamp Top 30 Q17

Answer outline: - Standalone query rewriting: before retrieval, rewrite "What about Q4?" → "What was Acme's revenue in Q4 2024?" using the conversation history. This is the most reliable approach. - History in retrieval: embed the last N turns + current query as one retrieval query. Simpler but noisier. - Memory summarization: maintain a rolling summary of the conversation; include in the system prompt. - Separate chains: retrieval-time uses standalone rewrite; generation-time receives full conversation history. Don't conflate them. - Numbers to drop: "Standalone rewrite latency: ~150-300ms with a small model (Haiku, gpt-4o-mini). Cuts retrieval failures by 30-50% on real conversational logs."

Common follow-ups: - "When does standalone rewriting hurt rather than help?" - "Pronoun resolution in the rewrite — how do you handle it?"

Traps: - Stuffing whole conversation history into the retrieval query — noisy, low precision. - Forgetting the difference between retrieval context and generation context.

Citations & faithfulness

Q: "How does RAG help reduce hallucinations in LLM-generated responses?"

Tags: screen · very-common · conceptual · source: Kalyan RAG Hub Q6

Answer outline: - The LLM no longer guesses from parametric knowledge; it has the evidence on the desk and the prompt instructs "answer only from this context". - Mitigates two failure modes: stale training data (fixed by retrieval freshness) and tail-knowledge confabulation (fixed by source attribution). - Does NOT eliminate hallucinations — generators can still hallucinate when context is missing/wrong/conflicting. - Pairs with citation generation: every claim cites a chunk ID; downstream validators can check the claim against the cited chunk. - Numbers to drop: "RAG cuts hallucination rate by ~30-60% on fact-heavy benchmarks (paper: Lewis et al. 2020 + follow-ups), but doesn't go to zero."

Common follow-ups: - "What's left after RAG — when can the model still hallucinate?" - "How do you detect a hallucination at runtime?"

Traps: - Claiming RAG eliminates hallucinations. - Skipping the "answer only from context" prompt — without it, the model still uses parametric knowledge.

Q: "How do you detect and mitigate hallucinations in a production RAG system?"¶

Tags: senior · very-common · scenario · source: applied_ai_interview_focus.md; DataCamp Top 30 Q34

Answer outline: - Define operationally: hallucination = assertion in the answer not supported by the retrieved context. This is faithfulness, distinct from factual correctness. - Layer 1 (prompt): "answer only from context; if not present, say you don't know". Cuts the hallucination rate by roughly 20-40%. - Layer 2 (citations): force the LLM to emit [chunk_id] after each claim; reject answers without citations. - Layer 3 (verifier pass): a small LLM judges each claim against its cited chunk (NLI-style or pairwise). Reject low-faithfulness answers; retry or fall back. - Layer 4 (eval gate in CI): faithfulness scored offline on a golden set; block deploy if it regresses. - Numbers to drop: "Faithfulness target: ≥0.85 on the golden set. Cost of verifier: ~$0.0005-0.002 per answer with a small model; ~150-300ms latency."

Common follow-ups: - "Cost of LLM-as-judge at 1M queries/day?" - "Your faithfulness is up but users complain more — what's wrong?"

Traps: - Reaching for BLEU/ROUGE — they don't measure faithfulness. - Confusing retrieval failure (no relevant chunk found) with generation hallucination (relevant chunk found, model ignored it).

Related cross-cutting: Hallucination mitigation choices Related module: learning/01_ai_engineering/08_rag_system_design/13-faithfulness-ragas.md
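
The citation layer above can be approximated with a cheap regex gate that runs before any LLM verifier. The sketch below assumes citations look like `[chunk_id]` and sit before the sentence-ending punctuation; the sentence splitter is deliberately naive, and every name here is hypothetical.

```python
import re
from typing import List, Set

CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")

def split_sentences(answer: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace; enough for a sketch."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def check_citations(answer: str, retrieved_ids: Set[str]) -> dict:
    """Flag sentences with no citation and citations to chunks never retrieved."""
    sentences = split_sentences(answer)
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited = {cid for s in sentences for cid in CITATION.findall(s)}
    unknown = sorted(cited - retrieved_ids)
    return {
        "ok": bool(sentences) and not uncited and not unknown,
        "uncited": uncited,
        "unknown_ids": unknown,
    }
```

A passing check only proves that citations exist and point at retrieved chunks. Whether the cited chunk actually supports the claim is the verifier's job.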

Q: "How do you ensure the generated output stays consistent with the retrieved information?"¶

Tags: mid · common · scenario · source: DataCamp Top 30 Q34

Answer outline: - Prompt engineering: explicit "answer only from the context below; if not present, say 'I don't know'". - Citations: every claim tagged with [chunk_id]. Strip answers that fail a citation regex check. - Post-hoc validation: a faithfulness scorer (RAGAS, custom NLI judge) runs on (answer, cited chunks) and gates the response. - Constrained decoding: force JSON schema where the schema includes a cited_chunk_ids field — eliminates "forgot to cite". - Lowering temperature (0.0-0.3) reduces creative interpolation but doesn't fix attribution failures. - Numbers to drop: "Citation requirement + post-hoc verifier: faithfulness ~0.92 typical; without either, ~0.65-0.75."

Common follow-ups: - "What if the model cites a chunk but the claim isn't actually in it?" - "Difference between citation and groundedness?"

Traps: - Relying on prompt alone — models violate "answer only from context" silently. - Treating temperature=0 as a hallucination fix.
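
For the structured-output approach above, the application still has to validate what the model returns. A minimal sketch, assuming the model was asked for a JSON object with `answer` and `cited_chunk_ids` fields; the schema constant is illustrative and how you pass it to a provider's structured-output feature is not shown.

```python
import json
from typing import List, Optional

# Illustrative JSON Schema; field names beyond cited_chunk_ids are assumptions.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "cited_chunk_ids": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["answer", "cited_chunk_ids"],
}

def parse_grounded_answer(raw: str, retrieved_ids: List[str]) -> Optional[dict]:
    """Return the parsed answer only if it is well-formed and cites retrieved chunks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    answer = data.get("answer")
    cited = data.get("cited_chunk_ids")
    if not isinstance(answer, str) or not isinstance(cited, list):
        return None
    if not cited or not all(isinstance(c, str) for c in cited):
        return None  # "forgot to cite" or malformed IDs
    if not set(cited) <= set(retrieved_ids):
        return None  # cites a chunk that was never in the prompt
    return {"answer": answer, "cited_chunk_ids": cited}
```

Returning `None` gives the caller a single place to decide between retrying and falling back.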

Q: "How do you handle conflicting information across retrieved sources?"¶

Tags: senior · very-common · scenario · source: applied_ai_interview_focus.md; DataCamp Top 30 Q7

Answer outline: - Detect: if top-K chunks disagree (LLM-judge step or NLI between chunk pairs), flag the conflict. - Resolution policies you'll defend: (a) Recency wins: boost chunks with newer timestamps; works for policies, pricing, product specs. (b) Authority wins: boost chunks from authoritative sources via metadata (e.g., source_tier = primary). (c) Acknowledge conflict: instruct the LLM to present both views with citations rather than silently picking one. Most defensible for high-stakes domains. - Citation is non-negotiable: without it, the user can't adjudicate. - Numbers to drop: "Conflict-aware fallback ('We found conflicting info; here are both views'): user satisfaction stays high; silent picking causes the biggest complaints when it is wrong."

Common follow-ups: - "Where does the conflict-detector live in the pipeline?" - "Two sources, both authoritative, both current — what do you ship?"

Traps: - Letting the LLM silently pick one source. - Hardcoding recency — sometimes the older source is canonical (legal, regulatory).
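
One way to wire the three resolution policies above into a single step is sketched below. The `disagree` callable stands in for the LLM-judge or NLI check; the `Chunk` fields, the `"primary"` tier value, and the policy names are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    updated: date
    source_tier: str  # hypothetical metadata: "primary" or "secondary"

def resolve(
    chunks: List[Chunk],
    disagree: Callable[[Chunk, Chunk], bool],
    policy: str,
) -> dict:
    """Flag conflicting chunk pairs, then order chunks per the chosen policy."""
    conflicts = [
        (a.chunk_id, b.chunk_id)
        for i, a in enumerate(chunks)
        for b in chunks[i + 1:]
        if disagree(a, b)
    ]
    if policy == "recency":
        ordered = sorted(chunks, key=lambda c: c.updated, reverse=True)
    elif policy == "authority":
        # False sorts before True, so primary sources come first (stable sort).
        ordered = sorted(chunks, key=lambda c: c.source_tier != "primary")
    elif policy == "acknowledge":
        ordered = list(chunks)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return {
        "chunks": ordered,
        "conflicts": conflicts,
        # Tells the prompt builder to ask for both views with citations.
        "present_both_views": bool(conflicts) and policy == "acknowledge",
    }
```

Keeping the policy a per-source setting rather than a global constant avoids hardcoding recency for corpora where the older document is canonical.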

Q: "How do you handle out-of-date or stale information in a RAG system?"¶

Tags: senior · common · scenario · source: DataCamp Top 30 Q32

Answer outline: - Ingest cadence: target docs-available-in-retrieval within X minutes of source change. Use change-feed (webhook, CDC) when possible, scheduled crawl as fallback. - TTL on chunks: each chunk carries valid_until; the retrieval filter drops expired chunks automatically. - Recency boost: rerank with a freshness weight (e.g., score = base × exp(-age/τ)). - Source-of-truth pointers: for fast-changing data (prices, inventory), don't embed it — call a fresh API at generation time. RAG over the API spec, fetch the live value. - Version pruning: when a doc updates, mark old chunks deprecated rather than deleting (audit trail). - Numbers to drop: "Target update-to-retrieval lag in 2026 production: <5 min for high-priority sources; <1 hour acceptable for slow-moving docs."

Common follow-ups: - "Embed prices vs call API — defend your choice." - "How do you know what's stale before a customer complains?"

Traps: - "Reindex weekly" — too coarse for most domains. - Letting old chunks linger without TTL or deprecation flag.
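
The TTL filter and the recency boost `score = base × exp(-age/τ)` from the outline can be combined as below. This is an application-side sketch; in practice the expiry filter would usually be pushed into the vector store query. The `Hit` fields and the 30-day default for τ are assumptions.

```python
import math
from datetime import datetime
from typing import List, Optional, TypedDict

class Hit(TypedDict):
    chunk_id: str
    score: float                     # base relevance score from retrieval
    updated_at: datetime             # timezone-aware
    valid_until: Optional[datetime]  # None means no expiry
    deprecated: bool

def freshness_score(base: float, age_days: float, tau_days: float = 30.0) -> float:
    """score = base * exp(-age / tau); tau sets how fast old content decays."""
    return base * math.exp(-max(age_days, 0.0) / tau_days)

def filter_and_boost(hits: List[Hit], now: datetime, tau_days: float = 30.0) -> List[Hit]:
    """Drop expired or deprecated chunks, then rank by freshness-weighted score."""
    live = [
        h for h in hits
        if not h["deprecated"] and (h["valid_until"] is None or h["valid_until"] > now)
    ]

    def boosted(h: Hit) -> float:
        age_days = (now - h["updated_at"]).total_seconds() / 86400
        return freshness_score(h["score"], age_days, tau_days)

    return sorted(live, key=boosted, reverse=True)
```

With τ = 30 days, a chunk's score is multiplied by about 0.37 after one month, so τ should be set per source type rather than globally.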

Q: "What is faithfulness in RAG, and how is it measured?"¶

Tags: mid · very-common · conceptual · source: Kalyan RAG Hub Q97; RAGAS docs

Answer outline: - Faithfulness = the proportion of claims in the answer that are entailed by the retrieved context. Distinct from factual correctness (which compares to ground truth, not context). - Measurement (RAGAS approach): decompose the answer into atomic claims; for each claim, an LLM judges whether the context supports it (entailment). Score = supported / total. - Range [0, 1]. Production target typically ≥0.85. - A high-faithfulness answer can still be factually wrong if the context was wrong; faithfulness measures the generator's discipline, not the system's truth. - Numbers to drop: "RAGAS faithfulness on a 100-sample golden set: ~$0.20-0.50 to run with a small judge model. Refresh weekly."

Common follow-ups: - "Difference between faithfulness and answer relevance?" - "What's the cheapest way to compute faithfulness at scale?"

Traps: - Conflating faithfulness with factual correctness — they diverge whenever context is wrong. - Using BLEU/ROUGE as a faithfulness proxy — they measure surface n-gram overlap, not entailment.
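
The supported / total ratio described above reduces to a few lines once claim decomposition and the entailment judge are treated as pluggable. This sketch does not reproduce the RAGAS API; `is_supported` is a hypothetical callable that would wrap an LLM or NLI judge.

```python
from typing import Callable, List

def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic claims that the judge says the context entails."""
    if not claims:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    """Stand-in judge: case-insensitive substring match, NOT real entailment."""
    return claim.lower() in context.lower()
```

For example, with three claims of which the judge supports two, the score is 2/3. The context being wrong does not lower this number, which is why faithfulness and factual correctness have to be tracked separately.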

Failure modes & debugging¶

Q: "What's the difference between a retriever returning the wrong document and a generator ignoring the right one — how do you debug?"¶

Tags: senior · very-common · debugging · source: applied_ai_interview_focus.md

Answer outline: - Two distinct failure modes; isolate by inspecting the trace. - Retrieval failure: the gold chunk isn't in top-K. Diagnosis: compute Context Recall on the golden set; if it is low, retrieval is the culprit. - Generation failure (context-ignored): gold chunk is in top-K but the answer is wrong or hallucinated. Diagnosis: faithfulness score is low while Context Recall is high. - Triage path: golden set with (query, gold_chunk_id, gold_answer). Compute Context Recall first; only if recall is fine do you investigate generation. - Fix retrieval: better embeddings, rerank, hybrid, chunking. Fix generation: prompt, a model swap (often to a larger one), structured output, citation enforcement. - Numbers to drop: "Real production split (LangSmith survey 2025): ~60% of RAG bugs are retrieval, ~30% generation, ~10% both."

Common follow-ups: - "Show the actual debug trace — what fields do you read first?" - "Faithfulness 0.95, Context Recall 0.4 — what's broken?"

Traps: - Tuning the generator before measuring retrieval. - Skipping the golden set — without it, you can't separate the two failures.
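
The triage path above (recall first, generation second) can be sketched over a golden set as follows. Context Recall is simplified here to a hit rate: whether the gold chunk appears in the top-K. The record fields and the 0.85 threshold are assumptions taken from the targets quoted elsewhere in this section.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvalRecord:
    query: str
    gold_chunk_id: str
    retrieved_ids: List[str]  # ranked, best first
    faithfulness: float       # from an offline scorer, in [0, 1]

def context_recall_at_k(records: List[EvalRecord], k: int) -> float:
    """Share of golden-set queries whose gold chunk is in the top-k results."""
    if not records:
        raise ValueError("golden set is empty")
    hits = sum(1 for r in records if r.gold_chunk_id in r.retrieved_ids[:k])
    return hits / len(records)

def triage(record: EvalRecord, k: int = 5, min_faithfulness: float = 0.85) -> str:
    """Blame retrieval first; only then look at the generator."""
    if record.gold_chunk_id not in record.retrieved_ids[:k]:
        return "retrieval"
    if record.faithfulness < min_faithfulness:
        return "generation"
    return "ok"
```

A record with high faithfulness but a missing gold chunk still triages as `"retrieval"`: the generator was faithful to the wrong evidence.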

Q: "What are the common challenges and limitations of RAG systems?"¶

Tags: mid · common · conceptual · source: DataCamp Top 30 Q27; Kalyan RAG Hub Q3

Answer outline: - Quality ceiling = quality of retrieval. Bad retrieval ⇒ bad answer, no matter how good the generator. - Retrieval failure modes: paraphrase mismatch, rare entity miss, multi-hop questions where no single chunk has the answer. - Latency: retrieval adds 50-200ms; reranking adds more. Tight P95 budgets squeeze the design. - Cost: embeddings + reranker + LLM context tokens; can dominate query cost. - Maintenance: index drift, embedding model deprecations, source schema changes — RAG is a system, not a one-shot. - Privacy/multi-tenancy: chunks from different tenants must never mix; ACL filtering at retrieval is mandatory. - Numbers to drop: "Cost split in a mature RAG app: ~10% embeddings, ~5-10% vector store, ~70-80% generation tokens, ~5-10% reranker."

Common follow-ups: - "Which of these has bitten you in production?" - "Multi-hop questions — how do you handle them?"

Traps: - Forgetting maintenance/ops costs. - Skipping multi-tenancy — it's interview gold.

Q: "What happens when you have a weak retriever in a RAG system?"¶

Tags: mid · common · debugging · source: Kalyan RAG Hub Q39

Answer outline: - Symptoms: low Context Recall@K, generator either refuses ("not found in docs") or fabricates from parametric memory. - Even a perfect generator can't recover from a missing gold chunk — there's no information to ground on. - Diagnostic ladder: (a) is gold chunk in top-50? if not, retrieval is fundamentally broken; (b) is gold in top-50 but not top-5? rerank is the fix; (c) gold in top-5 but generator misses? prompt/generator issue. - Fixes: switch embedding model, add hybrid, add rerank, fine-tune embeddings on domain, fix chunking strategy. - Numbers to drop: "Floor for production: Context Recall@10 ≥ 0.85. Below 0.7 means the architecture is wrong, not the parameters."

Common follow-ups: - "Order of fixes — which do you try first and why?" - "What if Context Recall is high but Context Precision is low?"

Traps: - Blaming the generator for what retrieval broke. - Cranking K to 50 in production to "hide" weak retrieval — destroys precision and context budget.
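
The diagnostic ladder (a) to (c) above maps directly onto a small function. A sketch, assuming you have the gold chunk ID, the ranked candidate list before reranking cutoffs, and a correctness label for the answer; the returned strings are illustrative.

```python
from typing import List

def diagnose(
    gold_id: str,
    ranked_ids: List[str],
    answer_correct: bool,
    wide_k: int = 50,
    narrow_k: int = 5,
) -> str:
    """Walk the ladder: wide candidate set, then final top-k, then generator."""
    if gold_id not in ranked_ids[:wide_k]:
        return "retrieval: gold chunk missing from the wide candidate set"
    if gold_id not in ranked_ids[:narrow_k]:
        return "rerank: gold chunk retrieved but ranked too low"
    if not answer_correct:
        return "generation: gold chunk in context but the answer is wrong"
    return "ok"
```

Running this over the whole golden set and counting the outcomes gives a rough ordering of which fix to try first.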

Q: "What happens when you have a weak generator in a RAG system?"¶

Tags: mid · common · debugging · source: Kalyan RAG Hub Q23

Answer outline: - Symptoms: high Context Recall but low faithfulness; answers ignore retrieved context or stitch together confidently wrong syntheses. - A weak generator may also fail at instruction-following: it won't emit citations and ignores "say I don't know if not in context". - Fixes: upgrade model (Haiku→Sonnet, gpt-4o-mini→gpt-4o); tighter prompt with examples; structured output forcing citation field; small SFT on (context, golden answer) pairs. - Sometimes the right fix is less context, not a better model: the Lost-in-the-Middle problem degrades generators when contexts get too long. - Numbers to drop: "Lost-in-the-middle: claim recall drops ~20% when the relevant chunk sits in the middle of a 30k-token context vs the start or end."

Common follow-ups: - "Lost-in-the-middle — how do you mitigate it?" - "Smaller cheaper model with rerank vs bigger model without — defend."

Traps: - Assuming bigger model = better RAG — Lost-in-the-Middle hits big models too. - Ignoring the prompt; the generator failure may be a prompt failure in disguise.
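
One common mitigation for Lost-in-the-Middle is to reorder the reranked chunks so the strongest ones sit at the start and end of the context. The source does not prescribe this technique; the sketch below is one assumed approach, and trimming the number of chunks is often the simpler first step.

```python
from typing import List, TypeVar

T = TypeVar("T")

def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place top-ranked items at the edges and the weakest in the middle.

    `ranked` is best-first. Even positions fill the front, odd positions
    fill the back in reverse, so rank 1 is first and rank 2 is last.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]
```

For five chunks ranked 1 to 5, the prompt order becomes 1, 3, 5, 4, 2.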

Q: "What are the possible reasons for a poorly performing retriever?"¶

Tags: senior · common · debugging · source: Kalyan RAG Hub Q38

Answer outline: - Chunking wrong for the domain: legal/code/tables need structure-aware chunking; one-size-fits-all destroys precision. - Embedding-domain mismatch: generic encoder on a niche corpus (medical, legal, code) — fine-tune or pick a domain-specific encoder. - Stale index: the encoder was upgraded but the index wasn't reindexed. - No hybrid: dense-only misses rare entities; sparse-only misses paraphrases. Hybrid is the cheap fix. - Metadata not used: time/source/ACL filters not applied at retrieval; noise from irrelevant tenants. - Wrong K or wrong distance metric. Especially: mixing normalized and unnormalized vectors. - Numbers to drop: "Hybrid alone often lifts Context Precision@5 by 5-15 percentage points before any tuning."

Common follow-ups: - "How do you decide which fix to attempt first?" - "When does fine-tuning the embedding model beat upgrading to a stronger off-the-shelf one?"

Traps: - Reaching for fine-tuning first when chunking is the real bug. - Skipping metadata filtering — produces hard-to-debug noise.
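
The distance-metric pitfall above is easy to reproduce: with unnormalized vectors, a dot-product index rewards vector length, not direction. A toy sketch in pure Python, with made-up two-dimensional vectors:

```python
import math
from typing import Dict, List

def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))

def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def rank(query: List[float], docs: Dict[str, List[float]], normalized: bool) -> List[str]:
    """Rank doc names by dot product; with normalized=True this is cosine."""
    q = normalize(query) if normalized else query

    def score(vec: List[float]) -> float:
        return dot(q, normalize(vec) if normalized else vec)

    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

docs = {"a": [1.0, 0.0], "b": [3.0, 4.0]}  # "a" points exactly along the query
query = [1.0, 0.0]
print(rank(query, docs, normalized=False))  # raw dot product favors the longer "b"
print(rank(query, docs, normalized=True))   # cosine favors the aligned "a"
```

The same flip happens silently when part of an index was written with normalized embeddings and part without.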

Q: "How do you ensure data privacy and security in a RAG system, especially when handling sensitive information?"¶

Tags: senior · common · scenario · source: DataCamp Top 30 Q35

Answer outline: - Ingest pipeline: PII detection (Presidio, AWS Macie) before embedding; redaction or chunk-level access tags. - At-rest encryption: the vector store stores chunks; treat it like any database (KMS, customer-managed keys). - In-transit encryption: TLS everywhere. - Access controls at retrieval: ACLs on chunks as metadata; query filter enforces user→tenant→chunk visibility before retrieval. Never rely on post-retrieval filtering. - No PII to third-party LLM unless contracted: route PII queries to in-region / self-hosted models. - Audit logging: every query + retrieved chunks + answer logged with user/session ID for compliance review. - Prompt injection from poisoned docs: treat retrieved content as untrusted; strip instruction-like patterns. - Numbers to drop: "GDPR/HIPAA-compliant deployments: in-region inference, customer-managed KMS, 30-day audit log retention minimum."

Common follow-ups: - "Where does ACL filtering happen — at the vector DB or in the application?" - "Indirect prompt injection via retrieved docs — how do you defend?"

Traps: - Filtering after retrieval instead of before — leaks via timing/result counts. - Treating embeddings as anonymized — they're often partially invertible.

Related cross-cutting: Input vs output guardrails
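
The "filter before retrieval" rule above can be illustrated with an in-memory search. In a real deployment the same predicate would be passed to the vector store as a metadata filter on the query; the `Chunk` and `User` shapes and the group-based ACL model are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: FrozenSet[str]
    vector: Sequence[float]

@dataclass(frozen=True)
class User:
    tenant_id: str
    groups: FrozenSet[str]

def visible(chunk: Chunk, user: User) -> bool:
    """A chunk is visible only within the tenant and to a shared group."""
    return chunk.tenant_id == user.tenant_id and bool(chunk.allowed_groups & user.groups)

def search(query_vec: Sequence[float], chunks: List[Chunk], user: User, k: int = 5) -> List[str]:
    """Apply the ACL BEFORE scoring so hidden chunks never enter the ranking."""
    candidates = [c for c in chunks if visible(c, user)]
    scored = sorted(
        candidates,
        key=lambda c: sum(q * x for q, x in zip(query_vec, c.vector)),
        reverse=True,
    )
    return [c.chunk_id for c in scored[:k]]
```

Because invisible chunks are removed before ranking, they cannot displace visible results or change the result count, which is what post-retrieval filtering gets wrong.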

Q: "How do you ensure the reliability and robustness of a RAG system in production?"¶

Tags: senior · common · scenario · source: DataCamp Top 30 Q29

Answer outline: - Trace every stage: OpenTelemetry GenAI conventions, LangSmith, or Phoenix. Spans per retrieval/rerank/generate. - SLOs: P95 latency, faithfulness, Context Recall, refusal rate. Alert on regressions. - Fallbacks: retrieval returns nothing → graceful "I don't know"; vector store down → fall back to BM25 / cache. - Timeouts at every IO boundary: retrieval timeout, LLM timeout, rerank timeout. No unbounded waits. - Eval gate in CI: a 50-100 question regression set runs on every deploy; faithfulness/Context Recall must not regress more than X%. - Shadow traffic for risky changes: prompt or model swap → mirror live traffic, compare offline before promotion. - Rate limit & queue: the LLM provider will throttle; queue and degrade gracefully. - Numbers to drop: "Production SLO example: P95 < 1.5s, faithfulness ≥0.85, refusal rate ≤8%, deploy gates on all three."

Common follow-ups: - "Your faithfulness drops 5 points after a model upgrade — what do you do?" - "Where does the kill switch live?"

Traps: - No eval gate — every deploy is a coin flip. - No timeouts — one slow tool call cascades into hung requests.

Related module: learning/01_ai_engineering/03_agent_observability_debugging/
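
The timeout and fallback bullets above can be sketched with `asyncio.wait_for`. The `primary` and `fallback` callables stand in for a vector-store client and a BM25 index; their names, the 0.3s default, and the returned dict shape are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, List

Retriever = Callable[[str], Awaitable[List[str]]]

async def retrieve_with_fallback(
    query: str,
    primary: Retriever,
    fallback: Retriever,
    timeout_s: float = 0.3,
) -> dict:
    """Bound the primary retrieval call; degrade to the fallback on failure."""
    try:
        hits = await asyncio.wait_for(primary(query), timeout=timeout_s)
        source = "vector"
    except (asyncio.TimeoutError, ConnectionError):
        # Vector store slow or down: fall back rather than hang the request.
        hits = await fallback(query)
        source = "bm25"
    if not hits:
        # Nothing retrieved: refuse gracefully instead of generating ungrounded.
        return {"source": source, "hits": [], "answer": "I don't know."}
    return {"source": source, "hits": hits, "answer": None}
```

The fallback call itself should get its own timeout in a real service; it is left unbounded here to keep the sketch short.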

Q: "How is the prompt sent to the LLM in a RAG system different from a standard non-RAG prompt?"¶

Tags: screen · common · conceptual · source: Kalyan RAG Hub Q17

Answer outline: - A RAG prompt has five distinct sections: system role, retrieved context (chunks with IDs), task instructions, the user question, and the output format (often a JSON schema). - A non-RAG prompt has system + question; the model relies on parametric knowledge. - The system prompt is stricter in RAG: "answer only from the context", "cite chunk IDs", "say 'I don't know' if absent". - Context is labeled, e.g. [Chunk 1] (source: contract_v2.pdf, page 14), so citations are unambiguous. - Token budget matters: RAG prompts can hit 5-15k tokens; prompt caching on the system portion saves a lot. - Numbers to drop: "Anthropic prompt caching on the system + instruction block: up to 90% cost reduction on the cached input tokens for repeated calls. Worth setting up once you exceed ~1k queries/day."

Common follow-ups: - "Where do you put the question — top or bottom of the prompt?" - "How do you enforce citations at the prompt level?"

Traps: - Putting the question above the context — lost-in-the-middle on long contexts. - Skipping chunk IDs — without them, citations become unverifiable strings.
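
The prompt layout described above (labeled chunks, strict system rules, question at the bottom) can be assembled as in this sketch. The exact wording, the `RetrievedChunk` fields, and the returned `system`/`user` split are assumptions; adapt them to your provider's message format.

```python
from typing import List, TypedDict

class RetrievedChunk(TypedDict):
    chunk_id: str
    source: str
    page: int
    text: str

# Static block: a good candidate for prompt caching because it never changes.
SYSTEM_PROMPT = (
    "Answer only from the context. Cite chunk IDs in square brackets after each claim. "
    "If the answer is not in the context, say \"I don't know\"."
)

def build_rag_prompt(question: str, chunks: List[RetrievedChunk]) -> dict:
    """Assemble context first and the question last, with every chunk labeled."""
    context_blocks = [
        f"[{c['chunk_id']}] (source: {c['source']}, page {c['page']})\n{c['text']}"
        for c in chunks
    ]
    user = (
        "Context:\n" + "\n\n".join(context_blocks)
        + "\n\nOutput format: JSON with fields \"answer\" and \"cited_chunk_ids\"."
        + f"\n\nQuestion: {question}"
    )
    return {"system": SYSTEM_PROMPT, "user": user}
```

Keeping the system block byte-identical across calls is what makes it cacheable; anything query-specific belongs in the user message.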

Q: "How would you design a RAG system for a customer support chatbot?"¶

Tags: senior · very-common · design · source: DataCamp Top 30 Q30; applied_ai_interview_focus.md

Answer outline: - Sources: product docs, help center, past resolved tickets, runbooks. Each with metadata: source_type, product_area, last_updated, ACL (tenant_id). - Ingest: structure-aware chunking (Markdown → by heading, tickets → conversation turns). 300-500 token chunks, 15% overlap. - Index: hybrid (BM25 + dense, e.g., BGE-large), Qdrant or Pinecone, ACL filter on every retrieval. - Retrieval: rewrite query with last 3 turns → hybrid top-50 → cross-encoder rerank to top-5 → assemble prompt with citations. - Generation: small model (Haiku/gpt-4o-mini) with citation enforcement and "escalate to human if not in docs" instruction. - Eval gate: 200-question golden set covering top 20 issue categories; faithfulness ≥ 0.85, Context Recall ≥ 0.85, deflection rate target ≥ 60%. - Ops: trace every query, LangSmith for debugging, alerts on faithfulness regressions, kill switch to deflect to humans. - Numbers to drop: "Target P95 latency 1.5s end-to-end. Cost ~$0.003-0.01 per query at 2026 prices. Deflection economics: $5-15 saved per ticket if accuracy is high enough that humans trust it."

Common follow-ups: - "Where does the human-in-the-loop sit?" - "What's the kill switch?"

Traps: - Skipping ACL/tenancy — interview gold. - Forgetting tickets as a source — past resolutions are the highest-value data. - No eval framework — interviewer will ask "is this vibes-based?"

Related cross-cutting: RAG vs fine-tune vs prompt engineering
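
The retrieval step of the design above (hybrid top-50, rerank to top-5) needs some way to merge the BM25 and dense rankings. The source does not name a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one assumed choice, with the conventional constant k = 60. The three search and rerank callables are hypothetical stand-ins for real clients.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per chunk."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def retrieve(
    query: str,
    bm25_search: Callable[[str], List[str]],
    dense_search: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    fused_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Hybrid retrieval: fuse both rankings, keep fused_k, rerank to final_k."""
    fused = rrf_fuse([bm25_search(query), dense_search(query)])[:fused_k]
    return rerank(query, fused)[:final_k]
```

RRF only needs ranks, so it sidesteps the problem of BM25 and cosine scores living on different scales. The ACL filter is assumed to be applied inside both search callables.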