Fine-Tuning & Model Adaptation — Interview Questions

The senior-loop question that separates "I read a blog about LoRA" from "I shipped a fine-tune in prod" is always "would you fine-tune here?" — and the right answer is usually "no, not yet". In 2026, full-parameter fine-tuning of models >20B is rare; PEFT (LoRA/QLoRA) plus DPO is the industry default for production adaptation. Expect questions on the decision, the recipe, the failure modes, and the eval discipline that surrounds it.

When to fine-tune (vs RAG vs prompting)

Q: "When would you fine-tune vs use prompt engineering?"

Tags: screen · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026; reported across multiple companies

Answer outline: - Reframe as a decision tree, not a binary. Prompt → RAG → fine-tune is the cost ladder; climb only when the rung below has failed for a specific, measured reason. - Stay on prompts when the failure is about instructions — wrong format, wrong tone, ignoring a constraint. A clearer prompt or a few-shot exemplar fixes most of these. - Move to RAG when the failure is about knowledge the model doesn't have — current facts, proprietary docs, customer-specific data. - Fine-tune when the failure is about behavior — a style, a structured output the model can't reliably hit, a domain vocabulary, an alignment objective, or latency from a too-long prompt you've already trimmed. - The fourth lever exists: when you need both knowledge and behavior, you fine-tune on retrieved-context examples — but that's an advanced answer, not the default. - Numbers to drop: "fine-tune only after 500-2000 labeled examples exist", "PEFT recipes converge in 1-3 epochs on 7B models, ~$50-200 of GPU time", "fine-tuning to save tokens pays back at >100k requests/day"
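
The ladder above can be sketched as a small routing function. This is only an illustration of the reasoning, not a policy: the `Failure` enum, the `next_lever` function, and the 500-example cutoff (the low end of the 500-2000 range quoted above) are assumptions introduced here.

```python
from enum import Enum


class Failure(Enum):
    INSTRUCTIONS = "instructions"  # wrong format, wrong tone, ignored constraint
    KNOWLEDGE = "knowledge"        # missing, stale, or proprietary facts
    BEHAVIOR = "behavior"          # style, structure, vocabulary, prompt-length latency


def next_lever(failure: Failure, prompt_saturated: bool, labeled_examples: int) -> str:
    """Return the cheapest rung of the ladder that addresses the measured failure."""
    if failure is Failure.INSTRUCTIONS and not prompt_saturated:
        return "prompt"
    if failure is Failure.KNOWLEDGE:
        return "rag"
    # Behavior failures, or instruction failures that survive a saturated prompt,
    # justify a fine-tune only once enough labeled data exists (assumed cutoff: 500).
    if labeled_examples >= 500:
        return "fine-tune"
    return "collect-data"


print(next_lever(Failure.BEHAVIOR, prompt_saturated=True, labeled_examples=1200))
```

In an interview, the value of a sketch like this is the ordering of the checks: the cheaper lever is always tested first.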

Common follow-ups: - "What's the smallest signal that would push you from prompting to fine-tuning?" - "Walk me through your decision process for a customer-support chatbot." - "If a startup CTO told you 'we want to fine-tune our own model' — what would you push back on?"

Traps: - Treating fine-tuning as the prestige answer. Senior interviewers want to hear reluctance, not enthusiasm. - Conflating "fine-tune for knowledge injection" with what fine-tuning actually does. Models forget facts they were trained on; they don't reliably acquire new factual knowledge via SFT. - Skipping cost. The interviewer wants to hear about labeling cost, GPU cost, eval cost, and the recurring cost of re-tuning when the base model upgrades.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Fine-tune or use prompt-engineered RAG?"

Tags: mid · very-common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - This is a "decide and defend" question. Pick RAG by default and force the interviewer to give you a reason to reconsider. - RAG wins when content updates faster than you can retrain (compliance docs, product catalogs, news), when citation/provenance is required, when content is too large to ever fit in any context window. - Fine-tuning wins when the output format or style matters more than the facts (structured JSON the base model can't hit, domain-specific phrasing, refusing in a brand-safe way), when the prompt is so long that latency becomes prohibitive, or when you have a closed-domain vocabulary the base model badly mishandles. - The mature answer: do both. Fine-tune for behavior (how to answer), keep RAG for content (what to say). Most production stacks I've seen mix these. - Numbers to drop: "RAG iteration loop is hours; fine-tune iteration loop is days-to-weeks", "RAG cost is dominated by retrieval and tokens; FT cost is GPU hours + labeling hours"

Common follow-ups: - "Give me a concrete example where you'd pick fine-tune over RAG." - "Can you fine-tune on RAG outputs to make the model more retrieval-friendly?"

Traps: - Claiming fine-tuning solves hallucinations. It usually doesn't — and can make them worse by overfitting to surface patterns. - Forgetting that base-model upgrades break fine-tunes but generally improve RAG.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "When should you choose fine-tuning over RAG over prompt engineering?"

Tags: mid · common · design · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Treat the three as layers, not alternatives. Most production systems use all three: a base prompt, RAG for fresh facts, plus a light fine-tune for output discipline. - Prompt engineering is the always-on default. It costs minutes; it can be reverted; it touches no infra. - The RAG layer goes in when you have a corpus of authoritative content and need citations or freshness. Add a vector store, a retrieval policy, and an answer-grounded eval. Cost: weeks of integration, but each iteration is cheap. - Fine-tuning is the last layer because it's the most expensive to set up and the hardest to debug; it becomes the highest-leverage option only once prompts and RAG are saturated. Use it when you can articulate what behavior you need that prompting + RAG cannot produce. - A senior tell: the candidate names a guardrail metric that would make them roll back the fine-tune (e.g., "if generalization on the off-domain eval drops by more than 5 points"). - Numbers to drop: "prompt iteration: minutes", "RAG iteration: hours", "FT iteration: days", "eval set per layer: 50-200 examples per slice"

Common follow-ups: - "What's the minimum bar of evidence before you'd kick off a fine-tune?" - "How do you stop the team from over-relying on fine-tuning?"

Traps: - Treating the three as mutually exclusive. The interview answer should always show layering. - Skipping the "is this prompt actually saturated?" check. Most teams fine-tune too early.

Related cross-cutting: Architecture choices, Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is fine-tuning, and when should you fine-tune an LLM?"

Tags: screen · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Fine-tuning continues training a pre-trained model on a smaller, task-focused dataset so the weights shift toward a target distribution of inputs and outputs. - The valid reasons to do it: structured-output reliability the base model misses, domain-specific style/vocabulary, latency reduction by encoding instructions into weights, alignment to a specific safety policy, or smaller-model substitution for a frontier model in narrow tasks (cost play). - The invalid reasons: "we want our own model" (vanity), "to add facts" (use RAG), "the prompt is messy" (clean the prompt). - Pre-conditions: a held-out eval, labeled training data with enough variety (typically 500-2000 high-quality examples for SFT, 50k+ pairs for RLHF/DPO if you go that far), a rollback path, and a re-tune plan for base-model upgrades. - Numbers to drop: "500-2000 SFT examples for narrow tasks", "10-20% latency reduction is the typical win when you replace a long instruction prompt with a tuned model", "expect 1-3 epochs of training, learning rate 1e-4 to 2e-4 for LoRA"

Common follow-ups: - "How do you know fine-tuning is done?" - "How does fine-tuning compare to instruction tuning?"

Traps: - Calling fine-tuning "training from scratch" — that's pre-training, very different cost and discipline. - Skipping data quality. The candidate who jumps to "epochs and learning rate" before discussing data curation is junior.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Explain the difference between full fine-tuning and parameter-efficient fine-tuning (PEFT)."

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Full fine-tuning updates every weight in the model. You need GPU memory for the weights, gradients, optimizer states (Adam keeps two extra values per parameter), and activations. That is roughly 4-5x the model's parameter memory if everything is held in FP16, and about 16 bytes per parameter (8x the FP16 weights) before activations under standard mixed precision, where the Adam states and a master copy of the weights stay in FP32. - PEFT freezes the base model and trains a small number of additional parameters that ride on top — LoRA adapters, prefix tokens, prompt embeddings. Memory savings are 5-10x; you store and ship only the adapter (often tens to a few hundred MB) instead of a new copy of the model. - PEFT also reduces catastrophic forgetting: the original weights stay frozen and the learned update is small and removable, so the base model's general capabilities drift far less than under full FT. The adapted model can still regress off-domain, so keep that eval. - In 2026, full fine-tuning of models >20B is rare in production. PEFT is the default; full FT is reserved for foundation labs and a few specialized industrial pipelines. - Trade-off: PEFT slightly underperforms full FT in very narrow, high-data regimes — but the deployability and forgetting trade-offs almost always win. - Numbers to drop: "full FT of a 7B model with mixed-precision Adam: ~140 GB GPU memory including activations. LoRA on the same model: ~20 GB. Adapter ships at 50-200 MB."
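
The memory figures above come from simple bytes-per-parameter accounting. The sketch below shows two common ways of counting; the `training_memory_gb` helper is introduced here for illustration, and activations (which depend on batch size and sequence length) are deliberately excluded.

```python
def training_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for weights, gradients, and optimizer state; activations excluded."""
    return n_params * bytes_per_param / 1e9


N = 7e9  # a 7B-parameter model

# Everything in FP16: weights 2 + grads 2 + Adam first and second moments 2 + 2.
all_fp16 = training_memory_gb(N, 2 + 2 + 2 + 2)

# Standard mixed precision: FP16 weights 2 + FP16 grads 2
# + FP32 master weights 4 + FP32 Adam moments 4 + 4.
mixed = training_memory_gb(N, 2 + 2 + 4 + 4 + 4)

# LoRA: the frozen FP16 base needs no gradients or optimizer state;
# adapter state is small at roughly 0.1-1% of the parameter count.
lora_base = training_memory_gb(N, 2)

print(all_fp16, mixed, lora_base)  # 56.0 112.0 14.0
```

The gap between these totals and the quoted ~140 GB (full FT) or ~20 GB (LoRA) is activation memory plus framework overhead.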

Common follow-ups: - "Why does PEFT use less memory if the forward pass is the same?" - "When would you still pick full FT over PEFT?"

Traps: - Claiming PEFT is always equal to full FT in quality. It's usually within 1-3% on most benchmarks, but the gap exists. - Forgetting that PEFT requires the same base model at inference — you can't ship just the adapter to someone with a different base.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

LoRA & PEFT family

Q: "What is PEFT/LoRA and when would you use it?"

Tags: mid · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - LoRA (Low-Rank Adaptation) freezes the pretrained weight matrix W and adds a trainable low-rank update ΔW = (α/r)·B·A, where B is d×r, A is r×k, r is typically 4-64, and α is a fixed scaling constant. Effective W during training is W + (α/r)·B·A; only A and B receive gradients. - The hypothesis: task-specific updates lie on a low-rank manifold, so you don't need to update every singular component of W to shift behavior. - Use it for: narrow domain adaptation (legal, medical), style/format adherence, multi-tenant adapter farms where one base model serves many tenants with swappable adapters, and any case where full fine-tuning a 7B+ model would not fit on the hardware you have. - Skip LoRA and go to full FT only when you have huge data (>100k examples), need maximum task specialization, and don't care about adapter portability. - Numbers to drop: "typical LoRA rank: 8-16 for narrow tasks, 32-64 for harder tasks", "alpha = 2r is a common heuristic", "adapter parameter count: ~0.1-1% of base"
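
To make the parameter-count claim concrete, here is a small worked count. The model shape is an assumption (a Llama-7B-like layout with 32 layers, hidden size 4096, and adapters on two attention projections per layer); `lora_params` is a helper name introduced here.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return d * r + r * k


# Assumed shape: 32 layers, hidden size 4096, LoRA on two projections per layer.
hidden, layers, adapted_per_layer, r = 4096, 32, 2, 16
per_matrix = lora_params(hidden, hidden, r)
total = per_matrix * adapted_per_layer * layers
base_params = 6.7e9  # assumed base model size

print(per_matrix, total, f"{100 * total / base_params:.3f}%")  # 131072 8388608 0.125%
```

Adding more target matrices or raising r moves the fraction up toward the 1% end of the quoted range.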

Common follow-ups: - "Which weight matrices do you apply LoRA to?" - "How does rank affect quality?" - "How do you merge a LoRA back into the base model for deployment?"

Traps: - Saying "LoRA freezes all weights" — it freezes the base weights; A and B are very much trainable. - Confusing LoRA's rank r with attention head count. - Forgetting that the merge step (W + B·A → W') is what lets you deploy a single fused model without runtime adapter overhead.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is LoRA (Low-Rank Adaptation), and how does it work?"

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - LoRA approximates the weight update during fine-tuning as the product of two low-rank matrices: ΔW = α/r · B·A, where rank r ≪ min(d, k). - During the forward pass, the model computes h = W·x + (α/r)·B·(A·x); only A and B carry gradients. The base W never updates. - A is initialized with Gaussian noise; B is initialized to zero — so the adapter starts as a no-op and learns from there. This stability trick is part of why LoRA trains so cleanly. - Applied per-layer, typically only to attention projections (q_proj, v_proj are most common; some recipes add k_proj, o_proj, and the MLP layers for higher capacity). - At inference, you can either keep the adapter as a separate compute path (allows hot-swapping) or merge it back: W' = W + (α/r)·B·A, after which there's zero overhead. - Numbers to drop: "trainable params for a 7B LoRA at r=16: ~10M params (~0.15% of base)", "memory for LoRA training: 20-25% of full FT memory"
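
The forward pass, the zero-init no-op, and the merge identity above can be checked numerically. This is a toy NumPy sketch with made-up sizes, using the same convention as the formulas (W is d×k, B is d×r, A is r×k); `lora_forward` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 5, 2, 4.0  # toy sizes; W maps k inputs to d outputs

W = rng.normal(size=(d, k))    # frozen base weight
A = rng.normal(size=(r, k))    # Gaussian init
B = np.zeros((d, r))           # zero init, so the adapter starts as a no-op
x = rng.normal(size=k)


def lora_forward(W, A, B, x, alpha, r):
    # h = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))


# At initialization the adapted output equals the base output.
assert np.allclose(lora_forward(W, A, B, x, alpha, r), W @ x)

# Pretend training moved B away from zero, then merge for inference.
B = rng.normal(size=(d, r))
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(W, A, B, x, alpha, r))
```

Dropping the `alpha / r` factor in either place is exactly the kind of scaling bug the traps below warn about: the merged and unmerged paths stop agreeing.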

Common follow-ups: - "Why initialize B to zero?" - "What's the role of alpha?" - "If you apply LoRA to all linear layers, what changes?"

Traps: - Mixing up which matrix is initialized to zero — it's B, not A. A's Gaussian init is what gives the adapter expressivity from the start. - Forgetting the alpha/r scaling — most subtle bugs in custom LoRA implementations are scaling errors.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is QLoRA, and how does it enable fine-tuning on consumer hardware?"

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - QLoRA quantizes the frozen base model to 4-bit (NF4 — "normal float 4") while training LoRA adapters in higher precision (BF16/FP16) on top. - Three pieces beyond standard LoRA: (1) 4-bit NF4 quantization of base weights, (2) double quantization of the quantization constants themselves, (3) paged optimizers that page Adam states to CPU when GPU memory spikes. - Result: a 65B base model fine-tunes on a single 48 GB GPU; a 7B model fine-tunes on a 12 GB consumer card. - The 4-bit base is dequantized on the fly for each matmul, so the forward and backward matmuls run in BF16/FP16 and only the stored weights are 4-bit — quality remains close to LoRA-FP16 (within ~1% on most benchmarks). - Trade-off: training is slower than LoRA-FP16 because of the dequantization overhead. Convergence quality is comparable. - Numbers to drop: "QLoRA of a 7B model: ~6 GB VRAM training. LoRA-FP16: ~20 GB. Full FT: ~140 GB", "training throughput penalty: ~20-30% vs LoRA-FP16"
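
Why double quantization matters is easiest to see as bits-per-parameter arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block with a 32-bit constant, then 8-bit constants re-quantized in groups of 256); the function names are introduced here for illustration.

```python
def overhead_bits_per_param(block: int = 64, second_block: int = 256,
                            double_quant: bool = True) -> float:
    """Average bits per weight spent on quantization constants."""
    if not double_quant:
        return 32 / block  # one FP32 constant per block of weights
    # 8-bit first-level constants, plus one FP32 constant per group of them.
    return 8 / block + 32 / (block * second_block)


def base_weights_gb(n_params: float, double_quant: bool = True) -> float:
    """Storage for 4-bit base weights including quantization constants."""
    bits = 4 + overhead_bits_per_param(double_quant=double_quant)
    return n_params * bits / 8 / 1e9


plain = base_weights_gb(65e9, double_quant=False)
double = base_weights_gb(65e9, double_quant=True)
print(round(plain, 1), round(double, 1))  # 36.6 33.5
```

Under these assumptions the constants cost about 0.5 bits per parameter without double quantization and about 0.127 with it, which is roughly 3 GB saved on a 65B model and part of why it fits on a 48 GB card.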

Common follow-ups: - "Why NF4 instead of regular INT4?" - "What's 'double quantization' and why does it matter?" - "Does QLoRA hurt quality vs regular LoRA?"

Traps: - Saying QLoRA "quantizes the adapter" — it quantizes the base. The adapter stays in higher precision so gradient updates are stable. - Skipping the paged optimizer — this is the third pillar and the question often probes whether you know all three.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is QLoRA and how does it differ from LoRA? When would you choose one over the other?"

Tags: mid · common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - LoRA: base model stays in FP16 (or BF16), adapter trained in same precision. Fast, simple, the default if you can afford the GPU memory. - QLoRA: base model quantized to 4-bit NF4, adapter trained in FP16 with on-the-fly dequant for matmuls. Slower per step but 3-4x less memory. - Pick LoRA when you have enough VRAM (A100 80 GB for ≤13B models, multiple GPUs for larger) and want max throughput. - Pick QLoRA when memory is the binding constraint — fine-tuning a 70B model on a single 48 GB card, running multiple training jobs on shared GPUs, or any consumer-hardware setting. - Quality difference is small (<1% on most benchmarks). The training-time difference is meaningful (~20-30% slower for QLoRA). - Numbers to drop: "LoRA-FP16 on 7B: ~20 GB. QLoRA on 7B: ~6 GB. Throughput: 1.0 vs 0.75."

Common follow-ups: - "What's the quality gap, in your experience?" - "If you had unlimited GPUs, would you ever pick QLoRA?"

Traps: - Claiming QLoRA quality is worse than LoRA by a large margin — published gaps are small. - Saying "QLoRA is just LoRA on a quantized model" — missing the paged-optimizer and double-quantization details.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Explain Prefix Tuning and Prompt Tuning. How are they different from LoRA?"

Tags: senior · common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Prompt tuning: prepend a small set of trainable embedding vectors to the input (no actual tokens — just learned vectors in embedding space). Train only those vectors; rest of the model is frozen. Very few parameters (often <0.01% of base). - Prefix tuning: prepend trainable vectors to every transformer layer's key/value cache, not just the input. More parameters than prompt tuning, more expressivity, harder to train but stronger. - LoRA: inject trainable low-rank decompositions into the weight matrices themselves. The adapter affects the entire computation, not just a prefix. - Capacity ordering (rough): LoRA > prefix tuning > prompt tuning. Practical: LoRA has won the deployment market because adapter merging gives zero-overhead inference and the quality/cost ratio is best. - Prompt tuning shines for very large models (>10B) where even tiny vector tuning catches up to fuller methods. - Numbers to drop: "prompt tuning: 10-100 trainable vectors of model_dim each", "prefix tuning: ~0.1% of base params", "LoRA: 0.1-1% of base params"
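
The parameter ordering above follows directly from where the trainable vectors live. A rough count, assuming a Llama-7B-like shape (32 layers, model_dim 4096) and 20 trainable positions; the helper names are introduced here, and the reparameterization MLP that some prefix-tuning recipes use during training is ignored.

```python
def prompt_tuning_params(n_vectors: int, model_dim: int) -> int:
    """Trainable vectors prepended once, at the input embedding layer."""
    return n_vectors * model_dim


def prefix_tuning_params(prefix_len: int, model_dim: int, n_layers: int) -> int:
    """One key vector and one value vector per prefix position, in every layer."""
    return 2 * prefix_len * model_dim * n_layers


model_dim, n_layers, base_params = 4096, 32, 6.7e9  # assumed model shape
prompt = prompt_tuning_params(20, model_dim)            # 81,920
prefix = prefix_tuning_params(20, model_dim, n_layers)  # 5,242,880

for name, n in [("prompt tuning", prompt), ("prefix tuning", prefix)]:
    print(f"{name}: {n:,} params, {100 * n / base_params:.4f}% of base")
```

Under these assumptions prompt tuning lands well under 0.01% of the base and prefix tuning near 0.1%, matching the figures quoted above.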

Common follow-ups: - "Why does prompt tuning underperform LoRA at small model sizes?" - "How would you serve 100 different prefix-tuned tasks on one base?"

Traps: - Treating prefix tuning and prompt tuning as the same. They differ in where the trainable vectors live. - Forgetting that prompt tuning's parameters live in embedding space, not the token vocabulary — they don't correspond to any real token.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is adapter-based fine-tuning?"

Tags: mid · common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Adapters insert small trainable bottleneck modules (down-project → non-linearity → up-project) inside each transformer block. Base weights are frozen; only the adapters train. - Historically the first PEFT family, predating LoRA. Original Houlsby/Pfeiffer adapters add latency at inference because the bottleneck is an extra sequential step in the forward pass, and its non-linearity prevents folding it into the base weights. - LoRA superseded vanilla adapters in practice because a LoRA update is linear and can be merged into the base weights at inference (W' = W + B·A, up to the α/r scale) with zero overhead. - Adapters are still useful for multi-task setups where you want to keep adapters separate and route requests by task — e.g., 50 customer-specific adapters hot-loaded on demand. - Numbers to drop: "adapter bottleneck dim: 8-64", "inference overhead of unmerged adapters: 10-20% latency"
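
The bottleneck structure can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact Houlsby or Pfeiffer module: biases and layer norm are omitted, ReLU stands in for the non-linearity, and the zero-initialized up-projection (so the adapter starts as the identity) is an assumption made here.

```python
import numpy as np


def adapter(h, W_down, W_up):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down, 0.0)  # down-project, then non-linearity
    return h + z @ W_up              # up-project and add back to the input


rng = np.random.default_rng(0)
d_model, bottleneck = 16, 4
W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
W_up = np.zeros((bottleneck, d_model))  # zero init: adapter starts as identity
h = rng.normal(size=(3, d_model))       # 3 token positions

assert np.allclose(adapter(h, W_down, W_up), h)
print(W_down.size + W_up.size)  # 128 trainable weights for this toy adapter
```

Because of the non-linearity between the two projections, there is no single matrix that could be added to the base weights to reproduce this module, which is the structural difference from LoRA.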

Common follow-ups: - "Why did LoRA win over adapters?" - "When would you still pick adapters?"

Traps: - Lumping all PEFT methods together. The interviewer wants to hear the specific structural difference: bottleneck blocks vs low-rank weight updates vs prefix vectors.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What are the key hyperparameters for fine-tuning (learning rate, epochs, batch size, LoRA rank)?"

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Learning rate: the single most impactful hyperparameter. For full FT: 1e-5 to 5e-5. For LoRA: 1e-4 to 3e-4 (LoRA tolerates, and usually needs, a higher LR because only a small, zero-initialized update is trained while the base stays frozen). For QLoRA: 2e-4 is a robust default. - Epochs: 1-3 for SFT on instruction data; more than 5 epochs usually overfits. Better signal than epoch count: track held-out eval loss and early-stop. - Batch size: depends on memory. Use gradient accumulation to hit an effective batch of 32-128 sequences for stable updates. Don't chase huge batches — small batches with more steps often generalize better. - LoRA rank: 8-16 for style/format tasks; 32-64 for harder tasks (math, code). Doubling rank rarely doubles quality; returns diminish roughly logarithmically. - Alpha: a scaling factor on the LoRA update. Common choice: alpha = 2r (so the effective scaling alpha/r is 2). Some recipes use alpha = r. - Warmup steps: 3-10% of total. Critical for stability — without warmup, the first updates can wreck attention layers. - Sequence length: set a sensible max (1024-4096 for most tasks) and truncate or pad to it. Too long wastes compute on padding and attention; too short truncates examples. - Numbers to drop: "LoRA LR: 2e-4, alpha=32, r=16, effective batch 64, 2 epochs, 3% warmup — robust default for narrow tasks"
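
The default recipe above turns into concrete step counts with a little arithmetic. The sketch below assumes a hypothetical 2,000-example dataset on one GPU; the helper names and the constant-after-warmup schedule are simplifications introduced here (real runs usually decay the LR after warmup).

```python
import math


def effective_batch(per_device: int, grad_accum: int, n_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * n_devices


def schedule(n_examples: int, eff_batch: int, epochs: int, warmup_frac: float):
    """Return (total optimizer steps, warmup steps)."""
    total = math.ceil(n_examples / eff_batch) * epochs
    return total, max(1, round(total * warmup_frac))


def lr_at(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then constant (decay omitted for brevity)."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


eff = effective_batch(per_device=4, grad_accum=16)  # 64
total, warmup = schedule(n_examples=2000, eff_batch=eff, epochs=2, warmup_frac=0.03)
print(eff, total, warmup)  # 64 64 2
```

On a small dataset the whole run is only a few dozen optimizer steps, which is why per-device batch size and effective batch size must not be confused.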

Common follow-ups: - "If your loss plateaus immediately, which hyperparameter do you change first?" - "How do you pick LoRA rank?" - "When would you raise alpha vs raise rank?"

Traps: - Naming "epochs" as the most impactful hyperparameter. It's LR by a wide margin. - Forgetting warmup. A naive run with no warmup on a small dataset often produces a worse model than the base. - Conflating per-device batch size with effective batch size (which includes gradient accumulation).

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Implement LoRA adapter from scratch."

Tags: senior · common · coding · source: reported coding round (multiple FAANG/AI-lab loops, 2026)

Answer outline: - Sketch in PyTorch in ~30 lines. Subclass nn.Module. Hold the frozen base linear layer plus two trainable matrices A (in_features × r) and B (r × out_features). Initialize A with kaiming_uniform_(a=sqrt(5)), B as zeros. - Forward: return self.base(x) + (alpha / r) * (x @ A @ B). - Freeze the base: self.base.weight.requires_grad_(False) (and the bias, if present). Verify with a grad-flow assertion in the test. - Show how to merge the adapter back into the base: base.weight.data += (alpha / r) * (A @ B).T. Test that pre-merge and post-merge produce identical outputs on the same input. - Be ready to handle dimensions correctly — A is (in_features, r), B is (r, out_features), so x @ A @ B is shape (batch, seq, out_features). Many candidates write it transposed. - Senior tell: candidate writes a test alongside the implementation — same output before/after merge, plus a check that only A and B receive gradients. - Numbers to drop: "for a 4096×4096 linear at r=16: 4096·16 + 16·4096 = 131,072 params vs 16,777,216 for full layer — 128× compression"
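
One way the outline above might look in code. This is a minimal sketch under the conventions stated in the outline (A is in_features × r, B is r × out_features); the class name `LoRALinear` and the toy sizes are choices made here, not a reference implementation.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze weight and bias
        self.scale = alpha / r
        self.A = nn.Parameter(torch.empty(base.in_features, r))
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # no-op at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A @ self.B)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # nn.Linear stores weight as (out_features, in_features), hence the transpose.
        self.base.weight.add_(self.scale * (self.A @ self.B).T)
        return self.base


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(8, 4), r=2, alpha=4.0)
x = torch.randn(3, 5, 8)  # (batch, seq, in_features)

assert torch.allclose(layer(x), layer.base(x))  # zero-init B: adapter is a no-op

with torch.no_grad():
    layer.B.normal_()  # stand-in for training having moved B

layer(x).sum().backward()
assert layer.base.weight.grad is None  # base received no gradient
assert layer.A.grad is not None and layer.B.grad is not None

before = layer(x).detach()
merged = layer.merge()  # use the returned nn.Linear from here on
assert torch.allclose(merged(x), before, atol=1e-5)
```

After `merge()`, calling the wrapper itself would apply the update twice, so serving code should use the returned `nn.Linear` (or the wrapper should track a merged flag).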

Common follow-ups: - "Where would you insert this in the transformer block?" - "How would you implement multi-LoRA serving (5 adapters on the same base)?" - "What goes wrong if you forget to set B to zero?"

Traps: - Setting both A and B to random init — the adapter starts non-trivially and breaks the base model's behavior immediately. - Mishandling the alpha/r scaling, leading to gradient explosions or vanishing updates. - Forgetting requires_grad_(False) on the base — your training loop quietly updates billions of params.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/00_ai_foundation/04_autoregressive_generation/

SFT, instruction tuning, and data prep

Q: "What is instruction tuning, and why is it important for chat models?"

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026); also Adil Shamim 100+ interviews list

Answer outline: - Instruction tuning is supervised fine-tuning on (instruction, response) pairs where the model learns to follow instructions in natural language rather than just predict the next token from a corpus. - Without it, base models are completion engines — give them "write a poem about cats" and they may continue with "and other animals" rather than writing the poem. - The recipe: collect or generate a diverse pool of instruction-response pairs (FLAN, Alpaca, ShareGPT, internal corpus), apply a chat template, SFT for 1-3 epochs. - It's a prerequisite for almost every modern chat-style alignment pipeline: SFT first, then preference learning (RLHF/DPO) on top. - Numbers to drop: "10k-100k diverse instruction pairs for a basic instruct model; LIMA result showed 1k high-quality pairs can match larger sets if curated", "the chat template matters — train and inference templates must be byte-identical"

Common follow-ups: - "Difference between instruction tuning and pre-training?" - "How big should the instruction dataset be?" - "What goes wrong if your training chat template doesn't match inference?"

Traps: - Conflating instruction tuning with alignment. Instruction tuning teaches format; alignment (RLHF/DPO) teaches preference. - Skipping template hygiene. A mismatched bos/eos/system token between training and serving silently degrades the model.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is instruction tuning and how does it differ from pre-training?"

Tags: screen · common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Pre-training: next-token prediction on trillions of tokens of unlabeled text. Objective is generic language modeling. Output: a base model that can complete patterns but not follow instructions cleanly. - Instruction tuning: supervised fine-tuning on labeled (instruction, response) pairs, typically 10k-1M examples, for 1-3 epochs. Objective is to bias the model toward producing the response given the instruction. - The compute gap is orders of magnitude. Pre-training: thousands to tens of thousands of GPU-days. Instruction tuning: tens to hundreds of GPU-hours. - Data difference: pre-training data is web-scale and noisy; instruction data is small but heavily curated, often with human review or LLM-graded quality filters. - Numbers to drop: "pre-training: trillions of tokens, ~$1-10M+ compute. Instruction tuning: 50k-500k examples, $1k-10k of GPU."

Common follow-ups: - "Why does the model 'forget' some pretraining knowledge after instruction tuning?" - "Could you skip pretraining and just instruction-tune a random initialization?"

Traps: - Treating instruction tuning as just "more pretraining". The loss is still next-token cross-entropy, but the data is curated (instruction, response) pairs and, in most recipes, the loss is computed only over the response tokens, with the instruction prefix masked out. - Forgetting to apply that mask in your own pipeline, so the model is also trained to reproduce the instruction text.
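
Response-only loss is usually implemented by giving prompt positions a label the loss function ignores. A minimal sketch, assuming token IDs are already available (the IDs below are made up) and using -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; `build_labels` is a name introduced here.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token IDs for an instruction and its response.
input_ids, labels = build_labels(prompt_ids=[101, 7, 42, 9], response_ids=[55, 56, 2])
print(input_ids)  # [101, 7, 42, 9, 55, 56, 2]
print(labels)     # [-100, -100, -100, -100, 55, 56, 2]
```

The one-position shift between inputs and targets is typically handled inside the model or loss code, so the labels here line up index-for-index with `input_ids`.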

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/05_llm_training_pipeline/, learning/00_ai_foundation/06_adaptation_compression/

Q: "What is the difference between SFT (Supervised Fine-Tuning) and alignment training?"

Tags: mid · common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - SFT teaches the model what correct output looks like by maximum-likelihood training on (input, output) pairs. Single reference response per input. - Alignment training teaches the model to prefer better outputs over worse outputs using pairs or ranked sets — chosen vs rejected. The objective is preference, not exact-match. - SFT alone gives an instruction-following model but doesn't reduce hallucination, harmful output, or sycophancy as effectively. Alignment (RLHF/DPO) shapes those behaviors. - Modern pipelines: SFT → DPO (or RLHF) → eval → ship. Sometimes a final round of "rejection sampling SFT" where you sample many outputs, keep the top-scored by a reward model, and SFT on those. - Numbers to drop: "SFT data: 10k-500k single responses. DPO data: 10k-100k preference pairs. RLHF needs continuous reward-model evaluation during training."
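
The difference in objective is clearest side by side. The sketch below works on sequence-level log-probabilities as plain floats: `sft_loss` is the negative log-likelihood of the single reference response, and `dpo_loss` is the standard DPO objective, which compares the policy against a frozen reference model on a (chosen, rejected) pair. The function names and the `beta=0.1` default are choices made here.

```python
import math


def sft_loss(logp_reference_response: float) -> float:
    """Negative log-likelihood of the one reference response."""
    return -logp_reference_response


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) on log-probs."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Policy identical to the reference: no preference learned yet, loss is log 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy raises the chosen response and lowers the rejected one: loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

Note that the DPO loss depends only on relative movement from the reference, whereas SFT pushes up the absolute likelihood of one target.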

Common follow-ups: - "Could you skip SFT and go straight to DPO from a base model?" - "Why is alignment usually done after SFT and not before?"

Traps: - Calling alignment "another epoch of SFT" — the loss function is fundamentally different. - Skipping the preference data discussion. Where do those pairs come from? (Human raters, LLM judges, behavioral signals — see the question below on converting implicit user behavior into training signals.)

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "How do you prepare a dataset for fine-tuning an LLM?"

Tags: mid · very-common · design · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Start from the eval, not the training set. Build the held-out eval first; it sets your quality bar. - Source candidate examples: production logs (de-PII'd), human-written gold examples, synthetic generation from a stronger model, scraped from a domain corpus. - Quality > quantity. LIMA showed 1k carefully curated pairs can beat 50k noisy ones. Aim for 500-5000 high-quality examples for a narrow task; 10k-100k for general instruction-tuning. - De-duplicate aggressively (exact match + n-gram overlap + embedding similarity). Near-duplicates inflate apparent dataset size but cause overfitting. - Stratify by intent / difficulty / response length so training is balanced. A naive dataset can have 80% short answers and 20% long, biasing the model toward terseness. - Apply a consistent chat template before tokenization. Train and inference must use byte-identical templates. - Hold out an eval set from the same distribution as the training set, plus an off-domain set to detect catastrophic forgetting. - Numbers to drop: "eval set ≥ 200 examples", "5-15% of total as held-out eval", "dedupe by 8-gram overlap or embedding cosine ≥ 0.95"
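
The de-duplication and leakage checks above can be prototyped with the standard library. This is a small-scale sketch: it uses whitespace tokens, exact-match plus 8-gram Jaccard overlap, and a quadratic scan, and the 0.5 Jaccard threshold is an assumption made here. Embedding-similarity dedupe and anything scalable (for example MinHash) are left out.

```python
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a and b else 0.0


def dedupe(examples, n: int = 8, threshold: float = 0.5):
    """Drop exact duplicates, then near-duplicates by n-gram Jaccard overlap."""
    kept, kept_grams, seen = [], [], set()
    for text in examples:
        norm = " ".join(text.lower().split())
        grams = ngrams(text, n)
        if norm in seen or any(jaccard(grams, g) >= threshold for g in kept_grams):
            continue
        seen.add(norm)
        kept.append(text)
        kept_grams.append(grams)
    return kept


def leaks(train, eval_set, n: int = 8):
    """Eval examples that share at least one n-gram with the training set."""
    train_grams = set().union(*(ngrams(t, n) for t in train))
    return [e for e in eval_set if ngrams(e, n) & train_grams]


a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
c = "completely different sentence about fine tuning language models with low rank adapters"
print(dedupe([a, b, c, a]) == [a, c])  # True
print(leaks([a], [b, c]) == [b])       # True
```

Running `leaks` between the training and eval sets before every training run is the cheap guard against the train/eval contamination trap listed below.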

Common follow-ups: - "How much synthetic data is too much?" - "How do you spot-check 10k examples?" - "Walk me through your de-duplication pipeline."

Traps: - Skipping the eval set discussion. Senior interviewers want to hear about it first. - Using synthetic data without human spot-check — it amplifies the generator model's biases and hallucinations. - Not de-duplicating between training and eval. Massive leakage; metrics look great but model is broken.

Related cross-cutting: Architecture choices Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/01_ai_engineering/06_evidence_data_pipelines/

Q: "How do you convert implicit user behavior (edits, acceptance, rejection) into training signals for model improvement?"

Tags: senior · common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Implicit signals are gold for preference data because they scale without labelers. The trick is converting noisy behavioral signals into clean preference pairs. - Acceptance vs rejection: the simplest case — user accepted suggestion A but rejected suggestion B → preference pair (A > B). Filter for cases where the user saw both. - Edits: if the model produced X and the user edited it to X', then X' is the preferred answer. Pair (X', X) as (chosen, rejected). Filter out trivial edits (typo fixes don't carry behavioral signal). - Implicit timing: long dwell time on a response, then a follow-up that doesn't repeat the question → weak positive signal. Quick re-ask → weak negative. - Use the signals to (a) train a reward model on preference pairs, (b) directly SFT on accepted/edited outputs, or (c) run DPO on the curated pairs. - Critical: dedupe, filter low-confidence signals (where action is ambiguous), and run an offline eval on a frozen holdout before promoting any behaviorally-trained model. - Numbers to drop: "keep only edits whose normalized edit distance from the model output exceeds ~0.3, which drops trivial edits", "10k-50k clean preference pairs typically enough for a DPO round"
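
Turning edit events into preference pairs might look like the sketch below. The event schema (`prompt`, `model_output`, `user_edit`) is hypothetical, and `1 - SequenceMatcher.ratio()` from the standard library is used as a stand-in for a normalized edit distance; the 0.3 cutoff is read here as "keep edits that changed more than about 30% of the text".

```python
from difflib import SequenceMatcher


def edit_ratio(original: str, edited: str) -> float:
    """Proxy for normalized edit distance: 0.0 identical, 1.0 nothing shared."""
    return 1.0 - SequenceMatcher(None, original, edited).ratio()


def edits_to_pairs(events, min_ratio: float = 0.3):
    """Keep substantive edits as (chosen, rejected) preference pairs."""
    pairs = []
    for e in events:
        if edit_ratio(e["model_output"], e["user_edit"]) > min_ratio:
            pairs.append({
                "prompt": e["prompt"],
                "chosen": e["user_edit"],       # what the user actually wanted
                "rejected": e["model_output"],  # what the model produced
            })
    return pairs


events = [
    {"prompt": "Refund status?",
     "model_output": "Your refund will be processsed in 5 days.",
     "user_edit": "Your refund will be processed in 5 days."},  # typo fix
    {"prompt": "Can I return this?",
     "model_output": "No.",
     "user_edit": "I'm sorry, but we can't accept this return because "
                  "the order shipped more than 30 days ago."},  # real rewrite
]
pairs = edits_to_pairs(events)
print(len(pairs), pairs[0]["rejected"])  # 1 No.
```

The same filter also helps with the feedback-loop trap below only partially: it removes noise, not selection bias, so a frozen holdout eval is still needed.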

Common follow-ups: - "How do you avoid reinforcing the model's own biases by training on its own selected outputs?" - "What if your acceptance rate is dominated by lazy users who accept everything?"

Traps: - Treating implicit signals as ground truth. They're noisy by definition — selection bias, click fatigue, exploration vs exploitation. - Forgetting feedback loops. If the model decides what to show and you train on what users picked from that subset, you compound bias quickly.

Related cross-cutting: Production patterns Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How would you design a model that can solve math problems? Walk through data collection, supervised fine-tuning, post-training, and evaluation."¶

Tags: senior · common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Data: curate problem-solution pairs from textbooks, contest problems (AMC, AIME, GSM8K, MATH), Olympiad archives, and synthetic generation from a stronger model. Each example must include the full chain of reasoning, not just the final answer. - Quality filter: verify every solution programmatically when possible (sympy for algebra, exec for code) and discard examples with wrong final answers. Synthetic data without verification is worse than no data. - SFT phase: train on (problem, full reasoning + final answer) for 2-3 epochs. Loss masked on the problem; computed over the response. - Post-training: rejection-sampling SFT. Sample N (typically 64-256) candidate solutions per problem, keep only those with verified-correct final answers, SFT again on those. - Then preference learning: DPO on (correct chain, plausible-but-wrong chain) pairs to teach the model to distinguish good and bad reasoning even when both look fluent. - Evaluation: pass@1 and pass@k on GSM8K, MATH, held-out internal sets. Also a self-consistency metric (sample 16, majority vote) — improvement here vs single-shot indicates better calibrated reasoning. - Senior tell: candidate mentions programmatic verification as the linchpin and self-consistency as a deployable inference-time technique. - Numbers to drop: "GSM8K: ~7.5k train, 1.3k test (8.5k total)", "MATH benchmark: ~12.5k competition problems", "rejection sampling: 64 candidates × 100k problems, keep one verified-correct sample per problem → up to 100k high-quality SFT pairs (fewer if some problems yield no correct sample)"
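The verification, rejection-sampling, and self-consistency steps above can be sketched as follows. This is an illustrative sketch under one assumption: each solution ends with a GSM8K-style `#### <answer>` marker. The function names are hypothetical, and a real pipeline would use a richer checker (for example sympy) for non-numeric answers.

```python
from collections import Counter
from fractions import Fraction


def extract_final_answer(solution: str) -> str:
    """Assumed convention: the final answer follows the last '####' marker."""
    return solution.rsplit("####", 1)[-1].strip()


def parse_number(text: str):
    """Parse an answer as an exact rational, or return None if unparseable."""
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(solution: str, reference: str) -> bool:
    """Programmatic check: compare the final answer with the reference exactly."""
    answer = parse_number(extract_final_answer(solution))
    return answer is not None and answer == Fraction(reference)


def rejection_sample(candidates, reference):
    """Keep only candidate solutions whose final answer verifies."""
    return [c for c in candidates if is_correct(c, reference)]


def majority_vote(candidates):
    """Self-consistency: the most common parseable final answer, or None."""
    answers = [parse_number(extract_final_answer(c)) for c in candidates]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

Comparing answers as `Fraction` values treats `4`, `4.0`, and `8/2` as the same answer, which avoids discarding correct solutions over formatting.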

Common follow-ups: - "Why is verification critical for math but harder for, say, summarization?" - "How would you extend this to multi-step proofs?" - "What's the right base model size for this?"

Traps: - Skipping verification — synthetic math data with wrong solutions degrades the model fast. - Treating "math" as a uniform skill. Word problems, algebra, calculus, proofs all need different recipes. - Not naming the evals (GSM8K, MATH, AIME). A senior interviewer wants to hear specific benchmarks.

Related cross-cutting: Architecture choices Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/01_ai_engineering/15_reasoning_routing_verification/

Alignment — RLHF, DPO, RLAIF, GRPO¶

Q: "What is RLHF and why is it important?"¶

Tags: screen · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - RLHF = Reinforcement Learning from Human Feedback. Three-stage pipeline: (1) SFT to get an instruction-following base, (2) train a reward model on pairwise human preferences, (3) RL-fine-tune the SFT model to maximize reward, with a KL penalty against the SFT model to prevent drift. - It's important because SFT alone can't shape preference — the model knows how to produce instruction-following text, but RLHF teaches it which kind of instruction-following text is preferred (helpful, harmless, calibrated, not sycophantic). - The reward model is the heart of it. It's a separate model (usually a copy of the SFT model with a regression head) trained on (chosen, rejected) pairs to score outputs. - The RL algorithm is typically PPO — but PPO is finicky, requires careful KL balancing, and is largely being replaced by DPO in 2026. - Numbers to drop: "100k+ preference pairs typical for a production reward model", "KL coefficient β = 0.01-0.1, tuned by trial", "PPO clipping ε = 0.1-0.2"
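To make the KL penalty in stage 3 concrete, here is a minimal sketch of the shaped reward the RL step optimizes for one sampled response. It assumes you already have per-token log-probabilities of that response under the current policy and under the frozen SFT model; `kl_shaped_reward` is a hypothetical helper, and the sum of log-ratios is a one-sample estimate of the KL term, not an exact divergence.

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Reward-model score minus beta times a one-sample KL estimate.

    logp_policy / logp_ref: per-token log-probs of the sampled response
    under the current policy and the frozen SFT (reference) model.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("token log-prob lists must have the same length")
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


# A response the policy likes far more than the SFT model did gets penalized.
shaped = kl_shaped_reward(1.0, [-0.1, -0.2], [-0.5, -0.6], beta=0.05)
```

With beta at 0 the policy is free to chase the reward model wherever it leads; raising beta trades reward for staying close to the SFT distribution.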

Common follow-ups: - "Why do you need a KL penalty?" - "What does the reward model actually score?" - "What's wrong with PPO?"

Traps: - Saying RLHF "teaches the model facts" — it doesn't, it shapes preference and behavior. - Skipping the KL penalty discussion. Without it, the policy reward-hacks (games) the reward model and drifts off-distribution.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Explain the RLHF pipeline: supervised fine-tuning, reward model training, and PPO. How does DPO simplify this?"¶

Tags: senior · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Stage 1 — SFT: train the base model on (instruction, response) pairs. Output: a model that follows instructions. - Stage 2 — reward model: take the SFT model, swap the LM head for a scalar regression head, train it on pairwise preference data (chosen, rejected) with a Bradley-Terry-style loss: maximize log σ(r(chosen) - r(rejected)). - Stage 3 — PPO: RL-fine-tune the SFT model. Each step, sample a response, score it with the reward model, compute the advantage, do a PPO update, apply a KL penalty against the frozen SFT model to keep distributions close. - DPO simplifies by collapsing stages 2 and 3 into one. Mathematical trick: the optimal RL solution has a closed-form relationship to the reward, so you can train directly on preference pairs without ever materializing the reward model. - DPO loss: -log σ(β · [log(π(y_chosen|x) / π_ref(y_chosen|x)) - log(π(y_rejected|x) / π_ref(y_rejected|x))]). β plays the role of the KL coefficient in PPO: a larger β keeps the policy closer to the reference model. - Result: comparable alignment quality on many preference-tuning tasks, ~10× simpler to implement, no online sampling loop, much easier to debug. - Numbers to drop: "DPO trains in hours where PPO took days", "β = 0.1-0.5 typical for DPO"
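Both losses named above fit in a few lines when written for a single preference pair. The sketch below is plain Python for readability and assumes sequence-level log-probabilities have already been computed; a real implementation would operate on batched tensors. The function names are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair.

    Inputs are log-probs of the full chosen / rejected responses under the
    policy being trained (logp_*) and the frozen reference model (ref_logp_*).
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one. Note that the reference log-probs are still required, which is why DPO needs the SFT model at training time.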

Common follow-ups: - "Why doesn't DPO need a separate reward model?" - "When would PPO still beat DPO?" - "How sensitive is DPO to β?"

Traps: - Calling DPO "the same as PPO but easier" — the math is genuinely different (closed-form solution vs iterative sampling). - Forgetting that DPO still needs a reference model (the SFT model) at training time.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is DPO and RLHF? When would you prefer one over the other?"¶

Tags: senior · very-common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - RLHF = three-stage pipeline (SFT → reward model → PPO). DPO = one-stage replacement (SFT → DPO directly on preference pairs). - Pick DPO by default in 2026. It's simpler, faster, has no online sampling loop, no reward-model drift, easier to reproduce. - Pick RLHF (or related online methods like GRPO) when: (a) the reward signal must come from a programmatic verifier you can call mid-training (math correctness, code execution, tool-use success), (b) you need to explore beyond the preference data because your data is sparse, (c) you're at frontier scale where the marginal alignment quality matters. - Hybrid: SFT → DPO → small PPO refinement with a verifier reward. Common in modern reasoning models. - Numbers to drop: "DPO training time: 2-8 hours on a 7B model with 50k pairs. PPO equivalent: 1-3 days."

Common follow-ups: - "Why is online RL better for math/code tasks?" - "Can DPO and RLHF be combined?" - "Have you tried GRPO? How does it differ?"

Traps: - Claiming DPO has fully replaced RLHF — frontier labs still use online RL for reasoning training. - Conflating DPO with offline-RL more broadly. DPO is a specific algorithm derived from RLHF's closed-form solution.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/01_ai_engineering/15_reasoning_routing_verification/

Q: "What is RLHF (Reinforcement Learning from Human Feedback), and how is it used to align LLMs?"¶

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - RLHF aligns model outputs to human preferences using RL, not just supervised data. Pipeline: SFT base → train reward model on (chosen, rejected) pairs → PPO-train the SFT model to maximize reward minus a KL penalty against the SFT distribution. - The KL penalty is the safety belt — without it the policy drifts off-distribution toward reward-hacking outputs (nonsense that the reward model loves). - It's used because pure SFT doesn't capture which of many valid responses humans prefer. RLHF teaches helpfulness, harmlessness, honesty as soft preferences rather than hard constraints. - Modern variants: RLAIF (preference data labeled by an AI model rather than humans, much cheaper), constitutional AI (preferences derived from a written constitution + critic model), GRPO (group-relative policy optimization, used in recent reasoning models). - Numbers to drop: "Helpful preference data: 60-100k pairs. Harmlessness data: 30-50k pairs. Combined RLHF: a week of GPU time for a 7B."

Common follow-ups: - "Why does the reward model overfit so easily?" - "Can you align without humans in the loop?"

Traps: - Treating RLHF as a single algorithm. It's a pipeline; each stage has its own failure modes (reward hacking, KL collapse, sycophancy emergence). - Skipping the discussion of reward-hacking — a senior interviewer always probes here.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What is RLAIF (RL from AI Feedback), and how does it differ from RLHF?"¶

Tags: senior · common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - RLAIF replaces the human labeler in the preference-pair generation step with a strong LLM acting as judge. The judge compares two candidate responses under a rubric or constitution and labels one chosen and the other rejected. - Same downstream pipeline as RLHF — reward model trained on these pairs, then PPO or DPO on top. - Why: human preference labels cost $1-10 per pair, often more for expert domains. AI labels cost cents and can be scaled to millions. - Trade-off: AI judgment is biased toward the judge's own preferences and limitations. A judge that prefers verbose, polished answers will train a verbose, polished model — and may miss subtle harms a human would catch. - The constitutional-AI variant: provide the judge with a written constitution (e.g., "responses should be helpful, harmless, honest"), and use the judge's critiques to guide preference labeling. - Numbers to drop: "human preference data: $1-10/pair. AI labels: ~$0.001-0.01/pair. Quality of resulting model: within 1-3% of the human-labeled equivalent on most benchmarks."
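One way to keep the judge honest is to label a small overlap set with both humans and the judge and measure agreement before scaling up. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance); the function names and the label encoding are illustrative assumptions.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the AI judge and the human picked the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge_labels)
    p_o = agreement_rate(judge_labels, human_labels)
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement if both raters labeled independently at their own base rates.
    p_e = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


# "A" / "B" = which of the two candidate responses was preferred.
judge = ["A", "A", "B", "B"]
human = ["A", "A", "B", "A"]
```

Kappa matters because a judge that always prefers the first (or the longer) response can show high raw agreement on a skewed set while adding little real signal.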

Common follow-ups: - "How do you validate AI labels against human ground truth?" - "When does RLAIF break down?" - "What is constitutional AI?"

Traps: - Saying RLAIF is "free" — it's not, the judge model has its own inference cost and bias. - Forgetting that you still need some human labels to calibrate the judge.

Related cross-cutting: Fine-tuning vs alternatives, Production patterns Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What's GRPO and why are reasoning models using it?"¶

Tags: staff · occasional · conceptual · source: post-training guides (Sundeep Teki SFT/RLHF/DPO/GRPO post, 2026); recent reasoning-model papers

Answer outline: - GRPO = Group Relative Policy Optimization. A PPO-family algorithm that drops the value (critic) network and instead computes advantage by comparing a group of sampled responses to each other. - For each prompt, sample G responses (G = 8-64 typical), score each with a (verifier or reward) function, compute advantage relative to the group mean: A_i = (r_i - mean(r)) / std(r). - Skips the value-function approximation that PPO needs — works well when the reward signal is a hard, programmatic verifier (correctness check on math/code). - Why it took off for reasoning models: when the reward is binary correctness (passes the test or doesn't), a learned critic is noisy and unnecessary. Group-relative advantage is more stable. - Trade-off: needs more samples per prompt, more inference compute during training. Pays off when the reward signal is high-quality. - Numbers to drop: "G = 8-64 samples per prompt", "training compute: roughly 5-10× a DPO run on the same prompts, because every step samples G responses, but much higher-quality reasoning outputs"
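The group-relative advantage is small enough to write out directly. This sketch follows the formula above for one prompt's group of samples; the `eps` term and the use of the population standard deviation are assumptions added here so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one prompt's G samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary verifier rewards for G = 4 samples of the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group is correct (or every sample is wrong) all advantages are zero and the prompt contributes no gradient, which is one reason prompt difficulty selection matters for this family of methods.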

Common follow-ups: - "How is GRPO different from PPO?" - "Why does it work well for reasoning but not for open-ended chat?"

Traps: - Calling GRPO "just PPO without a critic" — true mechanically, but the group-relative advantage trick is the key insight that makes it stable.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/01_ai_engineering/15_reasoning_routing_verification/, learning/00_ai_foundation/06_adaptation_compression/

Quantization¶

Q: "Explain quantization. What are the trade-offs between model size, speed, and accuracy?"¶

Tags: mid · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Quantization reduces the bit-width of model weights (and sometimes activations) from FP16/BF16 down to INT8, INT4, or specialized formats like NF4 and FP8. - Goal: shrink model size, reduce memory bandwidth (the dominant cost at inference), allow larger models to fit on smaller hardware. - Trade-offs: size scales linearly with bit-width (FP16 → INT4 = 4× shrink). Speed depends on the kernel support — INT8 matmul kernels are widely supported; INT4 needs specialized kernels (vLLM AWQ/GPTQ, llama.cpp Q4). Accuracy degradation is usually small (<1% on most benchmarks for INT8, 1-3% for INT4) but task-dependent. - Activation quantization is harder than weight quantization because activations have outliers — methods like SmoothQuant migrate the difficulty from activations to weights via per-channel scaling. - Numbers to drop: "INT8 cuts VRAM by 50% with ~0-1% quality loss", "INT4 (AWQ/GPTQ) cuts by 75% with 1-3% quality loss", "FP8 on H100/H200: similar to INT8 in size, often better quality"
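The size side of the trade-off is simple arithmetic: weight memory is parameter count times bits per weight. The sketch below works it through for a 7B-parameter model; `weight_memory_gb` is an illustrative helper that counts weights only and ignores quantization metadata (scales, zero-points), the KV cache, and activations.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


sizes = {fmt: weight_memory_gb(7e9, fmt) for fmt in BITS_PER_WEIGHT}
# fp16: 14 GB, int8: 7 GB (50% cut), int4: 3.5 GB (75% cut)
```

Real INT4 checkpoints come out somewhat larger than this estimate because per-group scales are stored alongside the weights.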

Common follow-ups: - "INT8 vs INT4 — when do you pick which?" - "What's the difference between weight-only and weight+activation quantization?" - "Why does INT4 hurt some models more than others?"

Traps: - Conflating model size with inference speed. Quantization mostly helps memory bandwidth; raw FLOPs may not drop as much. - Forgetting that quantization quality is highly model-dependent — smaller models lose more from aggressive quantization.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "Explain quantization and model distillation for inference optimization."¶

Tags: mid · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - These are the two main post-training optimizations for inference cost. Both ship a smaller/cheaper model than what you trained. - Quantization: reduce bit-width of weights. Cheap to apply (hours), preserves model capability mostly intact, no new training run. INT8 weights are essentially free; INT4 has small but real quality cost. - Distillation: train a smaller "student" model to mimic the outputs (or internal representations) of a larger "teacher". Expensive (a full training run), but the student can be 5-10× smaller while retaining 80-95% of teacher quality on the target task. - They compose: distill the model down to a smaller architecture, then quantize the student. Common for edge deployment. - Pick quantization first — it's the cheapest move and often enough. Add distillation when quantization saturates and you still need more speedup. - Numbers to drop: "INT4 quantization: 2-4× throughput improvement, hours to apply", "distillation: 3-10× smaller model, training cost similar to a fine-tune, retains 80-95% quality"

Common follow-ups: - "Walk me through distillation for a specific task." - "What gets lost in distillation that quantization preserves?"

Traps: - Treating these as alternatives. They're complementary; production pipelines often use both. - Forgetting that distillation requires labeled data or a teacher to query at scale — neither is free.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/05_agent_performance_economics/

Q: "What is model quantization?"¶

Tags: screen · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Mapping the model's parameters (and optionally activations) from higher-precision floats (FP32, FP16) to lower-precision formats (INT8, INT4, NF4, FP8). - The mapping is usually affine: quantized = round(real / scale) + zero_point, and dequantization recovers real ≈ scale × (quantized - zero_point). Each tensor (or each channel) has its own scale and zero_point so that the dynamic range is well-used. - Two main flavors: post-training quantization (PTQ — applied after training, fast) and quantization-aware training (QAT — simulate quantization noise during training, more accurate but expensive). - Why it matters: smaller weights → less memory → fewer bytes to read from HBM per token → faster inference (because LLM inference is memory-bandwidth-bound). - Modern stack (2026): GPTQ and AWQ for INT4 weight-only; FP8 on H100/H200/B200 with native hardware support; INT8 SmoothQuant for activation+weight on broader hardware. - Numbers to drop: "FP16 → INT4 = 4× memory reduction, ~3× throughput improvement", "FP8 on H100: similar size to INT8, often better quality"
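A minimal round trip through an affine (asymmetric) quantizer shows what scale and zero_point do. This is a per-tensor sketch on a plain Python list using an unsigned integer range; the helper names are illustrative, and production kernels work per-channel or per-group on packed tensors.

```python
def quant_params(values, n_bits=8):
    """Pick scale and zero_point so [min, max] maps onto [0, 2^n_bits - 1]."""
    qmax = 2 ** n_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, n_bits=8):
    """quantized = clamp(round(real / scale) + zero_point)."""
    qmax = 2 ** n_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """real ≈ scale * (quantized - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, 0.0, 0.5, 1.0]
scale, zero_point = quant_params(weights)
restored = dequantize(quantize(weights, scale, zero_point), scale, zero_point)
```

Each restored value lands within about one quantization step (`scale`) of the original, and 0.0 survives exactly, which matters for padding and sparse weights.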

Common follow-ups: - "Walk me through PTQ vs QAT." - "Why is INT4 weight-only more common than INT4 weight+activation?"

Traps: - Treating quantization as a single technique. The method (GPTQ, AWQ, SmoothQuant, GGUF), the precision (INT8 vs INT4), and the granularity (per-tensor, per-channel, per-group) all matter.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "What's the difference between INT8, GPTQ, AWQ, and GGUF?"¶

Tags: senior · common · conceptual · source: VRLA Tech LLM Quantization 2026 guide; Local AI Master GGUF/GPTQ/AWQ comparison

Answer outline: - INT8: classic 8-bit weight quantization. Well-supported on every GPU. ~50% memory reduction, near-zero quality loss. Often the default if you can spare the size. - GPTQ: layer-by-layer post-training quantization that minimizes per-layer reconstruction error using second-order information (approximated Hessian). Produces high-quality INT4 models but is slow to compute — hours for large models. Good when you want INT4 and have time to do it carefully. - AWQ (Activation-aware Weight Quantization): uses a calibration pass to identify the most important weight channels (those with largest activation magnitudes) and protects them by rescaling those channels before quantization, rather than keeping them in higher precision. Faster to compute than GPTQ, often slightly higher quality. The current best-practice for vLLM serving. - GGUF: file format (the successor to GGML) used by llama.cpp. Supports many quantization schemes (Q4_K_M, Q5_K_S, Q8_0, etc.) — these are GGUF-specific quantization recipes that interleave block-wise scales. Targeted at CPU and consumer-GPU inference, not vLLM-style server deployments. - Decision: AWQ for GPU serving with vLLM/TensorRT-LLM. GPTQ for legacy/wide-support. GGUF for llama.cpp / Ollama / Mac. INT8 when you want maximum safety and have memory to spare. - Numbers to drop: "AWQ retains ~95% quality at INT4; GPTQ ~90%; GGUF Q4_K_M ~92%", "AWQ + vLLM is the production default for self-hosted inference in 2026"

Common follow-ups: - "Why does AWQ outperform GPTQ in practice?" - "When would you ever pick GGUF over AWQ?"

Traps: - Treating GGUF as a quantization method — it's a file format that supports many methods. - Conflating AWQ and GPTQ. They're both INT4 weight-only PTQ approaches but use different optimization criteria.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "You quantized your LLM, but accuracy dropped significantly. How do you minimize quantization loss?"¶

Tags: senior · common · debugging · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - First, localize the loss. Run the model at full precision and quantized side-by-side on the same eval slices; identify which slice degraded most. Math/code typically degrades more than chit-chat. - Second, check the calibration set. AWQ and GPTQ both calibrate on a small dataset; if your calibration is off-distribution, weights get quantized for the wrong activation pattern. Use 128-512 examples from your production distribution. - Third, reduce the group size for finer granularity. Per-group quantization with group_size=128 is a common default; try 64 or 32. Smaller groups = better quality, but a slightly larger model because more scales are stored. - Fourth, try a different method. If GPTQ degraded, try AWQ. If both INT4 methods are bad, fall back to INT8 — much safer floor. - Fifth, mixed precision: keep critical layers (output projection, first/last transformer blocks) at higher precision while quantizing the rest aggressively. SmoothQuant-style outlier handling helps if activations have outliers. - Last resort: quantization-aware training (QAT) — add fake-quant ops during a small SFT pass so weights learn to be quantization-friendly. Expensive but recovers most of the loss. - Numbers to drop: "AWQ INT4 with group_size=128, calibration on 256 in-distribution examples — typical setup", "QAT recovery: closes ~70% of the PTQ quality gap"
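The group-size lever is easy to see on a toy example: one outlier weight inflates the shared scale of its whole group, so smaller groups confine the damage. The sketch below uses simple symmetric absmax INT4 quantization on a Python list, which is a stand-in chosen for illustration and not how GPTQ or AWQ pick their scales.

```python
def quantize_group(group, n_bits=4):
    """Symmetric absmax round trip for one group sharing a single scale."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for INT4
    scale = max(abs(w) for w in group) / qmax or 1.0
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in group]


def mean_abs_error(weights, group_size, n_bits=4):
    """Average reconstruction error when quantizing in groups of group_size."""
    restored = []
    for start in range(0, len(weights), group_size):
        restored.extend(quantize_group(weights[start:start + group_size], n_bits))
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# 64 small weights plus one large outlier at index 5.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[5] = 4.0

coarse = mean_abs_error(weights, group_size=64)  # outlier sets the scale for all 64
fine = mean_abs_error(weights, group_size=8)     # outlier only hurts its own group
```

With a single group of 64, every small weight rounds to zero; with groups of 8, only the outlier's group is affected. The cost is one extra scale per group, which is where the size increase comes from.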

Common follow-ups: - "Walk me through the calibration data choice." - "How do you decide which layers to keep at higher precision?"

Traps: - Just dropping the bit-width without investigating which layers / which tasks are failing. - Calibrating on web-scale data when production is a narrow domain.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/

Distillation¶

Q: "Explain knowledge distillation."¶

Tags: mid · common · conceptual · source: standard mid/senior AI loop opener; LLM Interview Questions 2026 (multiple sources)

Answer outline: - Distillation trains a small student model to reproduce the behavior of a large teacher. Student has fewer parameters but should perform close to the teacher on the target task. - Three flavors: (1) response distillation — student SFTs on teacher's generated completions (Alpaca, Vicuna, Orca recipes); (2) logit distillation — student matches the teacher's full softmax distribution (KL-divergence loss between student and softened-teacher logits, DistilBERT-style); (3) reasoning distillation — student trains on teacher's chain-of-thought traces, transferring not just answers but the reasoning skill. - The "dark knowledge" idea: the teacher's full distribution over outputs (including the relative confidences across alternatives) carries far richer supervision than one-hot labels. Temperature-softened logits expose this. - Practical: training set is teacher outputs on a wide prompt distribution. Student typically 5-10× smaller. Retains 80-95% of teacher quality on the target task. - Numbers to drop: "distillation temperature: T=2-4 typical for logit distillation", "student size: typically 10-30% of teacher params", "training data: 100k-1M teacher generations"
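The logit-distillation loss mentioned above can be written for a single token position in plain Python. This sketch follows the standard temperature-softened formulation (KL from teacher to student, multiplied by T squared so gradient magnitudes stay comparable across temperatures); the function names are illustrative, and it assumes moderate logit values so no probability underflows to zero.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher_T || student_T) * T^2 for one token position."""
    p = softmax(teacher_logits, temperature)  # softened teacher distribution
    q = softmax(student_logits, temperature)  # softened student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 2.0, 0.5]
student = [3.0, 2.5, 0.0]
loss = distillation_loss(student, teacher, temperature=2.0)
```

Raising the temperature flattens the teacher distribution, so the student also gets signal about how the teacher ranks the non-top tokens, which is the "dark knowledge" one-hot labels discard.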

Common follow-ups: - "When does logit distillation beat response distillation?" - "How do you pick the student architecture?" - "What about reasoning distillation — how is it different?"

Traps: - Treating distillation as "just SFT on teacher outputs". Response distillation is that simple, but logit distillation requires the teacher's full softmax — much higher infra cost. - Forgetting that distilled models inherit teacher biases, hallucinations, and memorized data. Re-run safety evals on the student.

Related cross-cutting: Cost & latency, Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/05_agent_performance_economics/

Q: "How would you distill a 70B model down to 7B for production?"¶

Tags: senior · common · design · source: standard senior cost-optimization probe; multiple FAANG loops 2026

Answer outline: - First, scope the task. Distillation works best when the student is specialized — pick the narrow set of tasks the student needs to do well. Trying to distill all of the teacher's capability into a 10× smaller model usually fails. - Pick the student carefully. Either a smaller checkpoint from the same family (Llama-7B if teacher is Llama-70B — minimizes architectural drift) or a recipe-matched smaller model. - Generate training data: query the teacher on a wide, in-distribution prompt set. For each prompt, capture the response (and optionally logits for the next-token distribution at every position). 100k-1M prompts typical. - Response distillation is the default: student SFTs on teacher (prompt, response) pairs. Cheap; works well for narrow tasks. - Logit distillation if you have infra to capture and store teacher logits at scale. KL loss between student and teacher softened distribution. Adds 2-5% quality but expensive. - Add reasoning distillation for tasks where teacher's chain-of-thought is the value (math, code) — capture the CoT and SFT student on it. - Evaluation: compare student vs teacher on the target eval set, plus an off-task eval to check for over-specialization. Target: student matches teacher within ~5% on target tasks. - Numbers to drop: "300k teacher generations + 2 epochs SFT = a robust distillation baseline", "expect 80-95% of teacher quality on target tasks, 50-70% off-task"

Common follow-ups: - "Where do you source the prompts to query the teacher on?" - "How do you avoid distillation overfitting to teacher hallucinations?" - "Why not just quantize the teacher instead?"

Traps: - Trying to preserve general capability with a 10× smaller model. The student has to be specialized. - Skipping the safety re-eval. Distilled models inherit teacher leaks; treat the student as a new artifact.

Related cross-cutting: Cost & latency, Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/05_agent_performance_economics/

Catastrophic forgetting & evaluation¶

Q: "What is catastrophic forgetting, and how do you prevent it during fine-tuning?"¶

Tags: mid · very-common · conceptual · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026); also referenced in 2026 LLM interview Q&A roundups

Answer outline: - Catastrophic forgetting: the model loses general capabilities while learning the narrow task you fine-tuned on. After fine-tuning, the model is better on your task but worse on everything else. - Mechanism: SGD updates on a narrow distribution shift weights in directions that overwrite features useful for other tasks. Commonly cited contributors are gradient interference in attention weights, representational drift in mid-layers, and loss-landscape flattening around prior-task minima. - Mitigations: - PEFT (LoRA/QLoRA): keeps base weights frozen; forgetting can only happen via the adapter. Drastically reduces but doesn't eliminate forgetting. - Low learning rate + early stopping: smaller updates, fewer epochs. Track an off-domain eval; stop when it starts dropping. - Mix in general data: include 10-30% of generic instruction-following data alongside your task-specific data. Replay-style memory. - EWC / weight regularization: add a penalty term that resists updating weights important for prior tasks (rarely used in LLM fine-tuning; more common in continual learning research). - Self-distillation: use the original model as a teacher and add a soft KL loss to its outputs on a held-out general set. - Eval discipline is the safety net: always run an off-task eval set in addition to your task-specific one. If your off-task scores drop more than 5%, stop or roll back. - Numbers to drop: "general-data replay ratio: 10-30% of training mix", "off-task eval drop threshold: <5% to ship"
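The replay mitigation comes down to how the training mix is assembled. The sketch below builds a shuffled mix in which a chosen fraction of the final dataset is general-purpose data; `build_training_mix` is a hypothetical helper, and it interprets the replay ratio as a share of the final mix, which is one reasonable reading of the 10-30% figure.

```python
import random


def build_training_mix(task_examples, general_examples, replay_ratio=0.2, seed=0):
    """Return task data plus sampled general data so that roughly
    replay_ratio of the final mix is general replay data."""
    if not 0 <= replay_ratio < 1:
        raise ValueError("replay_ratio must be in [0, 1)")
    # Solve n_replay / (n_task + n_replay) = replay_ratio for n_replay.
    n_replay = round(len(task_examples) * replay_ratio / (1 - replay_ratio))
    n_replay = min(n_replay, len(general_examples))
    rng = random.Random(seed)
    mix = list(task_examples) + rng.sample(list(general_examples), n_replay)
    rng.shuffle(mix)  # interleave so each batch sees both distributions
    return mix


task = [("task", i) for i in range(80)]
general = [("general", i) for i in range(1000)]
mix = build_training_mix(task, general, replay_ratio=0.2)
```

Shuffling rather than appending the replay data matters: training on all task data first and general data last would just move the forgetting problem around.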

Common follow-ups: - "How do you measure forgetting quantitatively?" - "Why is LoRA less prone to forgetting than full FT?" - "What's the difference between forgetting and overfitting?"

Traps: - Calling forgetting the same as overfitting. Overfitting = memorizing training set; forgetting = losing other capabilities. Different problems, different fixes. - Skipping the off-task eval. If you don't measure it, you can't know if it happened.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "How do you evaluate a fine-tuned model's performance?"¶

Tags: mid · very-common · design · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Multi-slice evaluation, not a single number. - Slice 1 — target task: held-out eval drawn from the same distribution as training. Hard accuracy or rubric-based grade. This proves the model learned the task. - Slice 2 — off-task / generalization: a fixed set of general instruction-following examples (e.g., a slice of Alpaca-eval, MMLU subset, HumanEval). This catches catastrophic forgetting. - Slice 3 — safety: prompt injection set, PII probe set, harmful-content red-team set. Catches alignment regression. - Slice 4 — production replay: if you have real production traffic, sample 100-500 examples and grade them with an LLM judge calibrated against humans. Catches distribution shift between your training data and reality. - Compare to two baselines: the base model (was the fine-tune worth doing?) and the previous fine-tune (regression check). Promote only when target improves and off-task is within tolerance and safety holds. - Numbers to drop: "target eval ≥ 200 examples, off-task ≥ 200 examples", "off-task drop tolerance: <5% to ship", "calibrate LLM judge to ≥85% agreement with humans before trusting it"
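The promotion rule above can be encoded as a small gate that a release script calls after evals finish. This is a sketch with assumed conventions: scores are higher-is-better, each model is a dict with `target`, `off_task`, and `safety` keys (hypothetical names), and the 5% tolerance is treated as a relative drop.

```python
def relative_drop(baseline: float, candidate: float) -> float:
    """Fractional drop from baseline (negative if the candidate improved)."""
    return (baseline - candidate) / baseline if baseline else 0.0


def should_promote(candidate, baselines, max_off_task_drop=0.05):
    """Promote only if, against every baseline, the target improves,
    off-task stays within tolerance, and safety does not regress."""
    for base in baselines:
        if candidate["target"] <= base["target"]:
            return False
        if relative_drop(base["off_task"], candidate["off_task"]) > max_off_task_drop:
            return False
        if candidate["safety"] < base["safety"]:
            return False
    return True


base_model = {"target": 0.60, "off_task": 0.70, "safety": 0.98}
previous_ft = {"target": 0.74, "off_task": 0.69, "safety": 0.98}
candidate = {"target": 0.80, "off_task": 0.68, "safety": 0.98}
decision = should_promote(candidate, [base_model, previous_ft])
```

Keeping the gate as code rather than a judgment call makes the per-slice requirement explicit: a large target gain cannot buy back an off-task or safety regression.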

Common follow-ups: - "What's your single most important eval metric?" - "How do you avoid overfitting to your eval set?"

Traps: - Reporting a single average number. Senior interviewers want per-slice metrics. - Calling the held-out test set "the eval" — that's one slice, not the whole story. - Skipping safety / off-task slices. Big production fires come from regressions on those, not on the target task.

Related cross-cutting: Production patterns Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your fine-tuned LLM forgot its general capabilities after domain-specific fine-tuning. How do you fix catastrophic forgetting?"¶

Tags: senior · common · debugging · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Confirm the diagnosis first. Run an off-task eval (MMLU, HumanEval, BBH, or a held-out internal general set). If it dropped by more than 3-5% from the base model, you have forgetting. - First lever — switch to PEFT. If you full-fine-tuned, redo it as LoRA with r=16-32. LoRA isolates the task-specific update and protects the base. - Second — reduce intensity. Lower the learning rate (try 5× smaller), reduce epochs, or add early stopping on the off-task eval. - Third — mix replay data. Add 10-30% of generic instruction-following examples to the training mix. The model relearns to handle both distributions. - Fourth — self-distillation. Use the base model as a soft teacher on a held-out general set: add a small KL loss to its outputs during training. - Fifth — roll back and re-evaluate the decision. If forgetting is unavoidable at the data quality you have, consider whether the task is small enough to handle with prompting or a separate small model that you route to (a "specialist" gated by a router). - Numbers to drop: "10-30% replay ratio is a robust default", "LR reduction: 5× is the first dial to turn"

Common follow-ups: - "Which layers tend to drift the most?" - "How does this differ from overfitting?"

Traps: - Reaching for more epochs to "fix it" — more training makes forgetting worse. - Skipping the diagnosis. You can't fix forgetting until you've measured the off-task drop.
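The diagnosis step amounts to a per-benchmark comparison against the base model. A minimal sketch, assuming scores are accuracies in percent and that the threshold is read as percentage points; `forgetting_report` is a hypothetical helper and the numbers are illustrative, not measurements.

```python
def forgetting_report(base_scores, tuned_scores, max_drop=3.0):
    """Flag benchmarks whose score fell by more than `max_drop` points."""
    report = {}
    for name, base in base_scores.items():
        drop = base - tuned_scores[name]
        report[name] = {"drop": drop, "regressed": drop > max_drop}
    return report


# Illustrative numbers only, not measurements.
base = {"mmlu": 65.0, "humaneval": 40.0}
tuned = {"mmlu": 58.5, "humaneval": 39.0}
report = forgetting_report(base, tuned)
regressed = [name for name, row in report.items() if row["regressed"]]
```

Reporting the drop per benchmark, rather than one averaged number, makes it easier to see which capability regressed.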

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/

Q: "Your fine-tuned model memorized training data verbatim instead of learning patterns. How do you fix overfitting?"

Tags: mid · common · debugging · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Diagnose: compare training loss to held-out eval loss. If training loss is much lower (typical sign: training loss <0.5, eval loss >1.5), you're overfitting. Confirm by prompting the model with the first half of a training example and checking if it completes the rest verbatim. - First — reduce epochs. 1 epoch on a small dataset is often enough; 2-3 is the practical max. Anything beyond 5 epochs almost always overfits. - Second — add more diverse training data. Synthetic augmentation (paraphrase, vary length, mix problem types) helps if you can't get more human data. - Third — lower LoRA rank. r=64 is more prone to overfit than r=8 on a small dataset. Match rank to dataset size. - Fourth — increase weight decay and dropout. Standard regularization helps less in LLM fine-tuning than in classic ML, but it's a knob. - Fifth — early stopping on held-out eval. Track per-epoch eval loss; pick the checkpoint with best eval, not best train. - Sixth — check for training/eval leakage. Often "overfitting" is actually data leakage — your eval set contains examples that appear in (or are paraphrases of) training. - Numbers to drop: "1-3 epochs for SFT, 1-2 for QLoRA on small data", "weight decay: 0.01-0.1", "early-stop patience: 1-2 epochs of no improvement"
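The verbatim-completion probe from the diagnosis step can be sketched as follows. It assumes a `generate(prefix) -> str` callable that wraps whatever inference stack is in use (a hypothetical interface), splits each example at the character midpoint, and uses a 0.9 similarity threshold that is an arbitrary starting point rather than a recommended value.

```python
from difflib import SequenceMatcher


def verbatim_completion_rate(examples, generate, min_similarity=0.9):
    """Fraction of examples whose second half the model reproduces nearly verbatim."""
    if not examples:
        return 0.0
    hits = 0
    for text in examples:
        mid = len(text) // 2
        prefix, expected = text[:mid], text[mid:]
        # Compare only as many characters as the true continuation has.
        completion = generate(prefix)[: len(expected)]
        similarity = SequenceMatcher(None, expected, completion, autojunk=False).ratio()
        if similarity >= min_similarity:
            hits += 1
    return hits / len(examples)
```

Running the same probe on held-out examples gives a baseline: a high rate on training data and a low rate on held-out data points to memorization rather than a generally predictable format.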

Common follow-ups: - "How do you detect training/eval leakage at scale?" - "Why is LLM fine-tuning so prone to memorization?"

Traps: - Confusing overfitting with forgetting. Overfitting = high train acc / low eval acc. Forgetting = high task acc / low off-task acc. They co-occur but are different. - Skipping the leakage check. Often the "overfit" model has just memorized leaked eval data.
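A first-pass leakage check can be done with word n-gram overlap between the training and eval sets. This is a sketch under assumptions: the 8-word window is a choice made here, `word_ngrams` and `leaked_eval_indices` are hypothetical names, and exact n-gram matching will not catch paraphrases (those need an embedding-similarity pass).

```python
def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in `text` (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_eval_indices(train_texts, eval_texts, n=8):
    """Indices of eval examples that share at least one n-gram with training data."""
    train_grams = set()
    for text in train_texts:
        train_grams |= word_ngrams(text, n)
    return [i for i, text in enumerate(eval_texts) if word_ngrams(text, n) & train_grams]
```

Flagged eval examples should be removed from the eval set (or the matching rows removed from training) before any overfitting conclusion is drawn.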

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your fine-tuned LLM produces factually wrong outputs due to training data quality issues. How do you fix it?"

Tags: senior · common · debugging · source: Amit Shekhar — AI Engineering Interview Questions repo (GitHub, 2026)

Answer outline: - Confirm the data is the issue, not the recipe. Sample 50-100 wrong outputs and trace each one back to the training data: if the wrong answer (or a conflicting one) appears there, it's a data-quality problem; if not, it's a generalization problem. - Step 1: programmatic filters. Run the training data through automated checks: format validity, length distribution, language detection, factual verification where possible (regex for dates, sympy for math, run code for code). - Step 2: LLM-judge cleaning. Have a strong model grade each training example on accuracy and quality, then drop the bottom 10-20%. LLM judges aren't perfect but they're scalable. - Step 3: de-duplicate aggressively. Near-duplicates with conflicting answers confuse the model; keep one consistent version. - Step 4: manual spot-check. Have a domain expert review a random 100-200 examples to catch systemic issues the automated filters missed. - Step 5: if the data is salvageable after cleaning, re-train. If not, switch strategy: RAG with curated source documents may be safer than fine-tuning on unreliable text. - Numbers to drop: "drop bottom 10-20% by LLM-judge score", "dedupe by 8-gram or embedding cosine ≥ 0.95", "always retain ≥ 200 manually verified gold examples"
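Steps 2 and 3 can be sketched on plain dictionaries. The record shape (`prompt`, `answer`, `judge_score`) and both function names are assumptions for illustration; the duplicate check here only normalizes case and whitespace, so a real pipeline would add the n-gram or embedding matching mentioned in the numbers above.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_lowest_scored(examples, drop_fraction=0.15):
    """Drop the lowest-scoring `drop_fraction` of examples by judge score."""
    ranked = sorted(examples, key=lambda ex: ex["judge_score"])
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


def conflicting_duplicates(examples):
    """Map each normalized prompt to its answers when it has more than one."""
    answers = defaultdict(set)
    for ex in examples:
        answers[normalize(ex["prompt"])].add(normalize(ex["answer"]))
    return {prompt: found for prompt, found in answers.items() if len(found) > 1}
```

Conflicts returned by `conflicting_duplicates` are good candidates for the manual spot-check, since an automated rule cannot tell which of two answers is the correct one.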

Common follow-ups: - "How would you validate the LLM judge before trusting it?" - "When would you abandon fine-tuning and switch to RAG?"

Traps: - Assuming "more data" fixes quality. More bad data makes the model more confidently wrong. - Skipping manual spot-check. Automated filters miss systemic biases that humans catch in 10 minutes of skimming.

Related cross-cutting: Production patterns Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you decide when a fine-tune is done?"

Tags: senior · common · scenario · source: standard senior AI loop probe; reported across multiple 2026 interview reports

Answer outline: - "Done" is defined by the eval, not by training loss. Held-out task accuracy plateaus or starts dropping → done. - The full release gate: target accuracy on the held-out set ≥ baseline + X% (statistically meaningful margin), off-task accuracy within 5% of base model, safety evals pass, production replay (if available) shows no regression. - Watch the training loss curve as a sanity check, not a stopping signal. Loss can keep dropping while generalization gets worse. - Use early stopping on the off-task eval as the safety belt — if generalization drops, stop and pick the last good checkpoint. - If the gate isn't met after a planned number of epochs, treat it as a failed run: investigate (data quality, LR, rank, base model fit) and re-plan. Don't keep training. - Numbers to drop: "stop when held-out eval stops improving for 1-2 epochs", "expected eval lift: 5-15% over base on the target task; if you only see 1-2%, the recipe is wrong"
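The release gate can be written down as an explicit check so that "done" is a function of eval results rather than a judgment call. A minimal sketch: `EvalResult` and `release_gate` are hypothetical names, accuracies are assumed to be fractions in the 0-1 range, the 5% thresholds are read as absolute differences, and the production-replay check is left out because it depends on infrastructure the outline does not specify.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_acc: float      # held-out target-task accuracy, 0-1
    off_task_acc: float  # general-capability accuracy, 0-1
    safety_pass: bool    # did the safety evals pass?


def release_gate(base, tuned, min_task_lift=0.05, max_off_task_drop=0.05):
    """Return (passed, failures) for a candidate fine-tune against its base."""
    failures = []
    if tuned.task_acc - base.task_acc < min_task_lift:
        failures.append("task lift below margin")
    if base.off_task_acc - tuned.off_task_acc > max_off_task_drop:
        failures.append("off-task regression")
    if not tuned.safety_pass:
        failures.append("safety evals failed")
    return not failures, failures
```

Returning the list of failures, not just a boolean, gives the follow-up investigation (data quality, LR, rank, base model fit) a starting point.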

Common follow-ups: - "If training loss is still going down but eval is flat, what do you do?" - "What about overnight runs — when do you set the max-epochs cap?"

Traps: - Stopping by training loss alone. Most overfit fine-tunes happen this way. - Treating "more epochs" as the lever when eval has already plateaued.
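Stopping on the eval curve instead of the training loss reduces to a patience rule over per-epoch eval scores. A sketch, assuming higher scores are better and one eval per epoch; `pick_checkpoint` is a hypothetical helper.

```python
def pick_checkpoint(eval_scores, patience=2):
    """Return (best_epoch, stop_epoch) as 0-based indices into `eval_scores`."""
    if not eval_scores:
        raise ValueError("need at least one eval score")
    best_idx, best = 0, eval_scores[0]
    for i, score in enumerate(eval_scores[1:], start=1):
        if score > best:
            best, best_idx = score, i
        elif i - best_idx >= patience:
            # No improvement for `patience` epochs: stop and keep the best one.
            return best_idx, i
    return best_idx, len(eval_scores) - 1
```

For example, with scores `[0.60, 0.68, 0.71, 0.70, 0.69, 0.72]` and `patience=2`, the rule stops at epoch index 4 and keeps the checkpoint from index 2, never seeing the later 0.72. That is the trade-off patience controls.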

Related cross-cutting: Production patterns Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Mixed / scenario

Q: "A startup CTO asks you to fine-tune their own private LLM. What do you push back on?"

Tags: senior · common · scenario · source: standard senior pushback question; multiple AI infra interview loops 2026

Answer outline: - Three questions to ask before agreeing: (1) what specific behavior do you need that prompting + RAG can't deliver? (2) how much labeled data do you have, and is it clean? (3) what's the budget for ongoing re-tuning when the base model upgrades? - Common failure modes I'd warn them about: fine-tuning "to add knowledge" (doesn't work reliably, use RAG), fine-tuning on <500 examples (rarely generalizes), fine-tuning without a held-out eval (no way to know if it worked), full fine-tuning a 70B (cost is wildly under-budgeted by most teams). - Recommend the cheaper path first: a strong prompt + RAG layer, and a small LoRA fine-tune only after that's measured and saturated. - If they still want to proceed: PEFT (LoRA/QLoRA), 7B-13B base, narrow task scope, 1000+ labeled examples, formal eval set, plan for a 2-week re-tune cycle every quarter. - Total cost realism: a credible fine-tune project is $20-50k of engineer time + $1-5k of GPU/labeling, then ~$5k/quarter for re-tunes. Most CTOs hear "fine-tune" and budget $500 of GPU, which is wildly wrong. - Numbers to drop: "$20-50k engineering time per project, $1-5k GPU, $5k/quarter re-tune"
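The cost ranges above add up quickly, which is easy to show with plain arithmetic. The sketch below assumes three re-tunes in the first year (one per remaining quarter after the initial project); that count and the helper name `first_year_cost` are assumptions, while the dollar figures are the ones quoted in the outline.

```python
def first_year_cost(engineering, gpu_and_labeling, retune_per_quarter, retunes=3):
    """One-time project cost plus re-tunes; `retunes=3` is an assumption."""
    return engineering + gpu_and_labeling + retune_per_quarter * retunes


low = first_year_cost(20_000, 1_000, 5_000)
high = first_year_cost(50_000, 5_000, 5_000)
naive_budget = 500  # the GPU-only figure from the outline
```

Under these assumptions the first year lands between $36k and $70k, which is 72× to 140× the $500 GPU-only budget.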

Common follow-ups: - "What if they insist on full fine-tuning despite your pushback?" - "Have you ever fine-tuned and regretted it?"

Traps: - Just saying yes to keep the customer happy. Senior interviewers want to see pushback. - Pushing back without offering an alternative.

Related cross-cutting: Fine-tuning vs alternatives, Cost & latency Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/01_ai_engineering/12_model_vendor_strategy/

Q: "Your team fine-tuned a 13B model for code generation. The base model just got upgraded to a stronger checkpoint. What do you do?"

Tags: senior · common · scenario · source: standard senior MLOps probe; 2026 AI infra loops

Answer outline: - Don't auto-upgrade. The fine-tune is bound to the exact base weights; switching the base usually breaks it. - Step 1: A/B the new base model without your fine-tune against your current production model (old base + fine-tune). If the new base alone beats current production, the fine-tune is no longer earning its keep, so drop it. - Step 2: if you still need the fine-tune, re-tune from the new base. Same recipe, same training data, same eval. Cost: roughly $1-5k of GPU time and 1-2 engineer-weeks. - Step 3: compare the new tuned model to the old one on your full eval suite, including off-task and safety. Promote only if it strictly dominates (no slice regresses and the target task improves). - Step 4: version everything. The fine-tune's identity is (base_version, training_data_version, recipe_version). Tag and store the artifact with all three pinned. - Long-term: budget for one re-tune cycle per quarter (or per base-model upgrade), and consider whether prompting+RAG would now beat your fine-tune given the stronger base. - Numbers to drop: "re-tune cost: $1-5k GPU + 1-2 engineer-weeks", "expect a re-tune at every base upgrade or quarterly"
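The identity triple from Step 4 maps naturally onto an immutable record with a derived tag. This is one possible sketch, not a prescribed scheme: the class name, the `ft-` prefix, the 12-character hash truncation, and the example version strings are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneIdentity:
    base_version: str
    training_data_version: str
    recipe_version: str

    def artifact_tag(self):
        """Deterministic short tag derived from all three pinned versions."""
        key = "|".join((self.base_version, self.training_data_version, self.recipe_version))
        return "ft-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

    def compatible_with(self, base_version):
        """An adapter should only be loaded onto the base it was trained from."""
        return self.base_version == base_version
```

Because the tag changes whenever any of the three versions changes, a re-tune on a new base can never silently overwrite the old artifact.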

Common follow-ups: - "How do you decide when a fine-tune is no longer worth maintaining?" - "What if you can't reproduce the original training data?"

Traps: - Assuming the fine-tune transfers across base versions. It rarely does cleanly. - Forgetting to A/B the bare new base — sometimes the upgrade alone is the win.

Related cross-cutting: Production patterns, Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "Walk me through fine-tuning a small model to replace GPT-4o on a narrow task."

Tags: senior · common · design · source: cost-optimization scenario; reported in AI engineer loops 2026

Answer outline: - This is a distillation + fine-tune pipeline. Goal: 50-95% cost savings, sub-5% quality loss. - Step 1: scope the task narrowly. "Replace GPT-4o" only works for a single, well-defined behavior (e.g., a specific JSON extraction task, a routing classifier, a summarization style). If you need GPT-4o's general capability, you can't replace it. - Step 2: generate teacher labels. Run GPT-4o on 10k-100k production-like prompts and capture the outputs. This is your training set. Spot-check 1-2% manually for quality. - Step 3: pick the student. 7B-13B open-weight base (Llama, Mistral, Qwen) or a small commercial model. Match the student to the task. Smaller models from the same family as the teacher tend to distill better than radically different architectures, but with a closed teacher like GPT-4o you train on its output text only, so pick on task fit, license, and serving cost. - Step 4: fine-tune (LoRA or full FT depending on size). 1-3 epochs, LR 1e-4 to 2e-4 for LoRA (full FT typically needs a lower LR), modest LoRA rank. - Step 5: eval rigorously. Side-by-side comparison on held-out data: agreement rate with GPT-4o on a 500-example set, plus a human-graded slice of 50-100 examples. - Step 6: deploy with a fallback. Route to the student by default and fall back to GPT-4o on low-confidence cases (the student outputs malformed JSON or hedging language, or a router flags the input). This bounds your worst case. - Numbers to drop: "$500-2000 to generate 50k teacher labels", "student 7B inference: 5-20× cheaper than GPT-4o per call", "agreement with teacher: 90-97% on narrow tasks"
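For a JSON extraction task, the agreement rate in the eval step can be computed on parsed values so that key order and whitespace do not count as disagreement. A minimal sketch, assuming both models' outputs are available as strings in matching order; `canonical` and `agreement_rate` are hypothetical helpers, and exact match after canonicalization is a strict definition of agreement that a real eval might relax per field.

```python
import json


def canonical(output):
    """Canonical JSON text if the output parses, else the stripped raw text."""
    try:
        return json.dumps(json.loads(output), sort_keys=True)
    except ValueError:
        return output.strip()


def agreement_rate(teacher_outputs, student_outputs):
    """Fraction of positions where teacher and student agree after canonicalization."""
    if len(teacher_outputs) != len(student_outputs):
        raise ValueError("output lists must be the same length")
    if not teacher_outputs:
        raise ValueError("need at least one pair to compare")
    matches = sum(canonical(t) == canonical(s) for t, s in zip(teacher_outputs, student_outputs))
    return matches / len(teacher_outputs)
```

Agreement with the teacher is a proxy, which is why the outline pairs it with a human-graded slice: the student can agree with a wrong teacher label.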

Common follow-ups: - "How do you pick the confidence threshold for fallback?" - "What goes wrong if you skip the manual spot-check?"

Traps: - Trying to replace GPT-4o "in general" with a 7B. The task scope must be narrow. - Skipping the fallback. The student will fail in surprising ways; you need a safety net.
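The safety net can be as simple as validating the student's output and calling the teacher when validation fails. The sketch below covers only the malformed-JSON and missing-key cases; `route`, `student`, and `teacher` are hypothetical names, both models are assumed to be callables that take a prompt and return a string, and the teacher is assumed to return valid JSON.

```python
import json


def route(prompt, student, teacher, required_keys):
    """Return (parsed_output, source), preferring the student when its output is valid."""
    raw = student(prompt)
    try:
        parsed = json.loads(raw)
    except ValueError:
        parsed = None
    # Accept the student only if it produced an object with every required key.
    if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
        return parsed, "student"
    return json.loads(teacher(prompt)), "teacher"
```

Logging the `source` field per request gives the fallback rate, which is the number to watch after deployment: a rising rate is an early sign that production inputs have drifted away from the distillation set.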

Related cross-cutting: Cost & latency, Architecture choices Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/02_ai_infrastructure/05_agent_performance_economics/

Q: "What's the smallest signal that would push you from prompting to fine-tuning?"

Tags: senior · common · scenario · source: standard pushback probe; reported in 2026 senior loops

Answer outline: - Three concrete signals: (1) Saturation on the prompt: I've tried 5+ prompt revisions with structured eval, and the failure rate on a specific class of inputs has plateaued at an unacceptable level. (2) Cost / latency wall: my prompt is 4k+ tokens of instructions and examples, and the per-call cost or latency is now the binding constraint at scale (>100k requests/day). (3) Behavioral mismatch the prompt can't fix: the model reliably produces a structurally wrong output (wrong JSON shape, wrong style, wrong refusal pattern) and prompt tweaks don't fix it. - Even with one of these signals, I'd first try: RAG (if knowledge is the issue), a stronger or a smaller specialized model (if capability or cost is the issue), few-shot examples drawn from production failures (if it's a pattern problem), structured-output decoding (if it's a format problem). - Only when those fail do I move to fine-tuning. The signal must be repeatable and measurable on an eval set, not "the model sometimes does X". - Numbers to drop: "5+ prompt iterations with measured eval", "prompt > 4k tokens before fine-tune is cost-justified at low scale", "100k+ requests/day for cost arbitrage"
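The cost-arbitrage signal can be checked with a payback calculation before any training run. A rough sketch under simplifying assumptions: `payback_days` is a hypothetical helper, the per-token price is a placeholder rather than a real vendor rate, the price is assumed to be the same before and after fine-tuning, and latency gains are ignored.

```python
def payback_days(requests_per_day, prompt_tokens_saved, usd_per_million_input_tokens, one_time_cost_usd):
    """Days until prompt-token savings cover the one-time fine-tuning cost."""
    daily_savings = requests_per_day * prompt_tokens_saved * usd_per_million_input_tokens / 1_000_000
    if daily_savings <= 0:
        return float("inf")
    return one_time_cost_usd / daily_savings


# Placeholder inputs: 100k requests/day, 3,500 prompt tokens removed per call,
# a hypothetical $1 per million input tokens, and a $35k project cost.
days = payback_days(100_000, 3_500, 1.0, 35_000)
```

With these placeholder inputs the savings are $350 per day and the project pays back in 100 days; at a tenth of the traffic the same project takes 1,000 days, which is why request volume is part of the gate.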

Common follow-ups: - "What if the prompt is short but the model still fails on a class of inputs?" - "How do you avoid premature fine-tuning?"

Traps: - Citing "the model isn't working" as a signal. Too vague — interviewer wants a quantitative gate. - Skipping the alternatives. Senior interviewers want you to resist fine-tuning until cheaper options are exhausted.

Related cross-cutting: Fine-tuning vs alternatives Related module: learning/00_ai_foundation/06_adaptation_compression/, learning/00_ai_foundation/07_prompting_fundamentals/