LLM Fundamentals — Interview Questions

The "do you know how this thing actually works" round. Most candidates can talk RAG and prompts; fewer can clearly explain BPE, the Q/K/V projections, why KV cache matters, what GQA buys over MHA, or why a 1M-token context window often produces worse output than a 32k one. The senior tell is connecting mechanism to production consequence — "memory bandwidth-bound decode" follows from KV cache size, "lost in the middle" follows from position-bias in attention, etc.

For decoding parameters (temperature, top-p, top-k) and prompt-related techniques, see prompt-engineering.md.

Tokenization

Q: "What is tokenization in LLMs? How does it work?"

Tags: screen · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); MyEngineeringPath 2026; standard LLM screen opener

Answer outline: - Tokenization splits raw text into discrete units (tokens) that the model maps to embedding vectors. The model sees integer token IDs, never raw characters. - Token boundaries are not words. Subword tokenizers (BPE, WordPiece, SentencePiece) split rare words into pieces. "tokenization" might be ["token", "ization"]; "unbelievable" might be ["un", "believ", "able"]. - Why subword: handles out-of-vocabulary words without UNK tokens, keeps vocabulary size manageable (~30k-50k in older models, 100k-256k in current ones), works across morphologically rich languages. - Modern tokenizers in 2026: OpenAI models use byte-level BPE via tiktoken (cl100k, o200k); Claude uses its own BPE-family tokenizer; Llama 2 used SentencePiece BPE and Llama 3 moved to a tiktoken-style byte-level BPE; Gemini uses SentencePiece. Vocabulary sizes have grown to 100k-256k as multilingual coverage expanded. - Practical consequences: 1 English word ≈ 1.3 tokens on average; code and non-Latin scripts are token-heavier (Mandarin sometimes 2-3 tokens per character on older tokenizers, near 1 on modern ones); long numbers like phone numbers fragment into many tokens. - Cost implication: API pricing is per-token. Better tokenizers compress more text into fewer tokens → cheaper for the same content. GPT-4o's o200k tokenizer is ~10-20% more compact than the older cl100k on most content. - Numbers to drop: "vocabulary size: 100k-256k typical in 2026", "1 English word ≈ 1.3 tokens", "code: 2-3× the token-per-char ratio of prose"

Common follow-ups: - "Walk me through BPE." - "Why subword rather than character-level?" - "Why is tokenization a security concern?"

Traps: - Saying "tokens are words". They aren't. - Skipping the cost angle. The interviewer often probes pricing implications.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "Explain BPE (Byte Pair Encoding)."

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - BPE starts with a vocabulary of individual bytes (or characters). It then iteratively merges the most frequent adjacent pair across the training corpus, adding the merged pair as a new vocabulary entry. - Training loop: count adjacent pair frequencies → merge the most frequent pair → repeat until the target vocabulary size is reached (base symbols + number of merges). Each merge becomes a learned tokenizer rule. - Encoding: split the input into base symbols, then apply the learned merges in priority order (earliest-learned first) until no rule applies. - Tiktoken / cl100k / o200k: OpenAI uses byte-level BPE (operates on UTF-8 bytes, not Unicode characters). This means out-of-vocabulary input is impossible: in the worst case a string falls back to one token per byte. - Versus WordPiece (BERT) and SentencePiece (Llama 2, Gemini): WordPiece chooses merges by a likelihood-based score rather than raw pair frequency. SentencePiece treats input as a raw Unicode stream, with whitespace as an ordinary symbol and no whitespace pre-tokenization, which is useful for languages without word boundaries (Chinese, Japanese). - 2026 maturity: byte-level BPE has won the mainstream. Newer tokenizers (Llama 3, GPT-4o) have larger vocabularies (128k-256k) for better multilingual compression. - Numbers to drop: "GPT-4o o200k vocabulary: 200,019 tokens", "Llama 3 tokenizer: 128k vocabulary", "byte-level BPE: no OOV possible"
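The training loop and encoding step described above can be sketched in a few lines. This is a toy, character-level version written for illustration: the function names (`train_bpe`, `apply_merge`, `encode`) and the tiny corpus are hypothetical, and production tokenizers such as tiktoken operate on bytes with a pre-tokenization regex and far more efficient data structures.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from single characters; identical words are counted once with a frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = train_bpe(["hug", "hug", "hug", "pug", "pun", "bun"], num_merges=2)
print(merges)                 # learned merge rules, in order
print(encode("mug", merges))  # an unseen word still encodes, just into more pieces
```

The merge list is the tokenizer: the same rules in a different order would produce different token boundaries, which is why merge order matters and why token IDs are tied to a specific tokenizer version.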

Common follow-ups: - "Why byte-level rather than character-level?" - "What's the trade-off between vocabulary size and embedding cost?" - "Why does the BPE merge order matter?"

Traps: - Confusing BPE with word-level tokenization. BPE is subword. - Forgetting that token IDs depend on tokenizer version. Tokenizer changes break fine-tunes.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "Why does tokenization matter for LLM behavior?"

Tags: senior · common · conceptual · source: LLM Fundamentals 2026 (MyEngineeringPath); standard senior probe

Answer outline: - Tokenization is upstream of everything the model does. Bad tokenization → bad behavior on the affected content. - Specific failure modes: - Numerical reasoning: "12345" might tokenize as [12, 345] or [1, 2345] depending on tokenizer. Different splits → different model performance on math. Newer tokenizers reduce this with consistent digit handling, either one token per digit (Llama 2) or fixed groups of up to three digits (cl100k, Llama 3). - Code: indentation, unusual whitespace, and common identifiers may not have efficient encodings, leading to inflated token counts and weaker performance on specific languages. - Multilingual: older tokenizers used 2-3× more tokens for non-English text. The same prompt in Hindi or Arabic could cost 2-3× more API spend. - Rare entities / proper nouns: long unfamiliar names fragment heavily, sometimes losing semantic coherence. - Adversarial tokens: research has shown that specific tokens cause LLMs to behave unexpectedly (the "SolidGoldMagikarp" effect: a token that exists in the vocabulary but was almost never seen during model training, so its embedding is essentially untrained and it produces strange outputs). - Operational consequence: when comparing models, equal characters doesn't mean equal tokens, so cost and context-fit differ. - Numbers to drop: "older multilingual penalty: 2-3× tokens for non-English. Modern tokenizers (o200k, Llama 3): much closer to 1×.", "per-digit number tokenization in 2024+ models improved arithmetic benchmarks by 5-15%"

Common follow-ups: - "How do you handle non-English content cost-effectively?" - "What's an example of a tokenization-related failure?"

Traps: - Treating tokenization as a solved, invisible step. It still bites in production.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Transformer architecture

Q: "What is the difference between encoder-only, decoder-only, and encoder-decoder Transformer architectures?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Encoder-only (BERT, RoBERTa): bidirectional attention, so every token can attend to every other token. Trained on masked language modeling (predict masked tokens given the full sequence). Output: contextual embeddings, not generation. Used for classification, NER, and embeddings for retrieval. - Decoder-only (GPT, Claude, Llama, Gemini): causal (masked) attention, so each token attends only to itself and earlier tokens. Trained on next-token prediction. Used for generation. The dominant architecture in 2026. - Encoder-decoder (T5, BART, original Transformer): an encoder processes the input fully (bidirectional), a decoder generates output autoregressively while cross-attending to encoder outputs. Best for translation and summarization, where the input is consumed as a whole and the output is generated separately. - Why decoder-only dominates in 2026: it is simpler to scale (one stack of parameters handles both reading the prompt and generating the answer), it can be prompted into virtually any task via instruction tuning, and a separate encoder and decoder is overhead for the chat-style use case. - Encoder-only still wins for: dense embeddings (BGE, E5 are encoder-based for retrieval), classification, NER, where you don't need generation. - Numbers to drop: "decoder-only is the 2026 default for chat/generation", "encoder-only is the default for retrieval embeddings", "encoder-decoder still wins on specific seq2seq tasks but is fading"

Common follow-ups: - "Why doesn't BERT generate well?" - "Why did encoder-decoder lose to decoder-only?" - "What about T5 and Flan-T5?"

Traps: - Saying "all LLMs are decoder-only". Embedding models are encoder-based. - Confusing causal masking with not seeing future tokens at training time. Training does see them; masking enforces they aren't attended to.

Related cross-cutting: — Related module: learning/00_ai_foundation/03_transformer_mechanics/

Q: "Walk me through what happens when a transformer processes a token."

Tags: senior · common · conceptual · source: standard senior transformer-internals probe; reported in 2026 AI engineer loops

Answer outline: - Input: a sequence of token IDs. - Step 1 (embedding): each token ID is looked up in the embedding table, giving a vector of size dim_model (e.g., 4096 for Llama 3 8B). - Step 2 (positional information): a position-dependent vector is added to the embeddings (sinusoidal or learned), or a rotation is applied to Q and K inside each attention layer (RoPE), so the model knows token order. - Step 3 (transformer blocks): typically 20-100+ blocks stacked. Each block has: - Layer norm (typically RMSNorm in modern models). - Multi-head attention: project to Q, K, V via three learned matrices; split into heads; compute scaled dot-product attention per head; concatenate; project back. - Residual connection around the attention. - Another layer norm + feed-forward (an MLP, often gated with SwiGLU or GeGLU). The FFN expands to ~4× the hidden dim in the classic 2-layer design, or roughly 2.7-3.5× with gated variants (3.5× in Llama 3 8B), then projects back. - Residual connection around the FFN. - Step 4: final layer norm. - Step 5 (output head): a linear projection from dim_model to vocabulary size, producing one logit per vocabulary entry. - Step 6: softmax over the logits to get a probability distribution over the next token. - During training: the loss is the cross-entropy between the model's predicted distribution at each position and the actual next token. During inference: sample (or take the argmax) from the distribution at the last position. - Numbers to drop: "Llama 3 8B: 32 blocks, dim_model=4096, 32 attention heads, FFN dim=14336", "FFN typically 2-4× the parameter count of attention", "total parameters ≈ blocks × (attention_params + FFN_params) + embedding and output-head parameters"
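The parameter arithmetic in the outline can be checked with a short calculation. The sketch below assumes a Llama-style block: GQA with 8 KV heads, a three-matrix SwiGLU FFN, two RMSNorm weight vectors per block, no biases, and untied input-embedding and output-head matrices. The vocabulary size of 128,256 is also an assumption not stated above; the function name is hypothetical.

```python
def llama_style_param_count(n_blocks, d_model, n_q_heads, n_kv_heads, d_ffn, vocab_size):
    """Rough parameter count for a Llama-style decoder (assumptions in the lead-in)."""
    d_head = d_model // n_q_heads
    attention = (
        d_model * n_q_heads * d_head         # W_q
        + 2 * d_model * n_kv_heads * d_head  # W_k and W_v (smaller under GQA)
        + n_q_heads * d_head * d_model       # W_o
    )
    ffn = 3 * d_model * d_ffn                # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                      # two RMSNorm weight vectors per block
    per_block = attention + ffn + norms
    embeddings = 2 * vocab_size * d_model    # input embedding + untied output head
    total = n_blocks * per_block + embeddings + d_model  # + final norm
    return {"attention": attention, "ffn": ffn, "per_block": per_block, "total": total}


counts = llama_style_param_count(
    n_blocks=32, d_model=4096, n_q_heads=32, n_kv_heads=8, d_ffn=14336, vocab_size=128256
)
print(f"attention per block: {counts['attention'] / 1e6:.1f}M")
print(f"FFN per block:       {counts['ffn'] / 1e6:.1f}M")
print(f"FFN / attention:     {counts['ffn'] / counts['attention']:.1f}x")
print(f"total:               {counts['total'] / 1e9:.2f}B")
```

Under these assumptions the FFN holds about 4.2× the parameters of attention in each block, and the embedding plus output-head matrices add roughly another billion parameters on top of the blocks.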

Common follow-ups: - "Why two layer norms per block?" - "What's the difference between LayerNorm and RMSNorm?" - "Why SwiGLU?"

Traps: - Saying "attention is the whole transformer". FFN is typically 2-4× the params. - Skipping residual connections.

Related cross-cutting: — Related module: learning/00_ai_foundation/03_transformer_mechanics/

Attention mechanism

Q: "What is self-attention, and how does it work in Transformers?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Self-attention: each token's output is a weighted sum of value vectors from the tokens it may attend to (itself included), where the weights come from learned similarity. - Mechanics: given input embeddings X (shape [seq_len, dim]): - Project to Queries Q, Keys K, Values V via three learned weight matrices: Q = X·W_q, K = X·W_k, V = X·W_v. - Compute attention scores: scores = Q·K^T / sqrt(d_k) (the division keeps score variance roughly constant as d_k grows, so softmax does not saturate). - In causal (decoder) self-attention, apply a mask that sets scores[i, j] = -inf for j > i so each token attends only to itself and earlier tokens. - Softmax over each row to get attention weights summing to 1. - Output: attn = softmax(scores) · V. - The intuition: Q asks "what am I looking for"; K advertises "this is what I am"; V is "this is what I provide if I match". Keys with a large dot product against the query get the most weight. - Why dot product: simple, efficient, well-conditioned with sqrt scaling. - Numbers to drop: "Q/K/V projections each: dim × dim parameters (in plain MHA)", "attention is O(seq_len) per query position and O(seq_len²) over the whole sequence: the quadratic-scaling problem"
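The mechanics listed above translate almost line for line into code. The following is a minimal single-head sketch in plain Python (no NumPy, so it runs anywhere); the helper names are hypothetical, and a real implementation would use batched tensor operations and multiple heads.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax; -inf entries become exactly 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention. Returns (outputs, attention_weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = []
    for i, row in enumerate(scores):
        # Scale by sqrt(d_k), then mask out positions j > i (the future).
        masked = [s / math.sqrt(d_k) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v), weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, dim 2
identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for learned W_q, W_k, W_v
outputs, weights = causal_self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the printed weight matrix sums to 1 and is zero above the diagonal, which is the causal mask at work. The `scores` matrix is seq_len × seq_len, which is where the quadratic cost comes from.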

Common follow-ups: - "Why three separate projections — why not just compare X with itself?" - "Why scale by sqrt(d_k)?" - "What's the difference between self-attention and cross-attention?"

Traps: - Mixing up Q, K, V semantics. Q is the current token's question; K and V come from all tokens being attended to. - Forgetting the sqrt(d_k) scaling. Without it, dot products grow with dimension and softmax saturates.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/00_ai_foundation/03_transformer_mechanics/

Q: "Explain the Query, Key, and Value in attention."

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Q (query), K (key), V (value) are three separate projections of the same input embeddings. Each is computed by multiplying input X by a learned weight matrix: Q = X·W_q, K = X·W_k, V = X·W_v. - The metaphor: imagine searching a database. Q is your query, K is the index keys, V is the records. You compute similarity between Q and each K to decide which Vs to retrieve and how strongly to weight them. - For each query position, the attention output is a weighted sum of all Vs, where weights come from softmax(Q·K^T / sqrt(d_k)). - Why three different projections rather than X-with-X: gives the model expressive flexibility. The same token can be "looking for" something different than what it "advertises", and what it "provides" if matched can differ from both. Empirically, three separate projections beat parameter-tied alternatives. - In cross-attention (encoder-decoder): Q comes from the decoder side, K and V come from the encoder side. The decoder is "querying" the encoded source. - Numbers to drop: "in MHA, Q/K/V each split into N heads. Llama 3 8B: 32 heads, d_k = d_v = 128 per head"

Common follow-ups: - "Why not just have one projection?" - "How is Q different from K?" - "What is cross-attention?"

Traps: - Saying Q, K, V are different inputs. They're different projections of the same input (in self-attention).

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "What are multi-head attention mechanisms? Why use multiple attention heads?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Multi-head attention runs N independent attention operations in parallel, each with its own Q, K, V projection matrices. Each head operates in a d_model / N-dimensional subspace. - Why: a single head averages all the relationships a token has into one vector. Multiple heads can specialize — different heads attend to different types of relationships (syntactic, semantic, positional). Empirically much better than a single big head. - Mechanics: head_i computes its own attention; outputs are concatenated and passed through a final output projection (W_o). - Cost: total parameter count is the same as a single big head (each head is d_model / N dimensional, N heads × that dim = d_model). The win is expressivity, not capacity. - In 2026: classical MHA has been partially displaced by Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) which share K/V across heads to reduce KV cache size — important for inference efficiency. - Numbers to drop: "Llama 3 8B: 32 query heads, 8 key/value heads (GQA)", "d_k per head: 64-128 typical", "concat + W_o projection unifies head outputs"

Common follow-ups: - "What does each head learn? Can you interpret them?" - "What's the difference between MHA, MQA, and GQA?" - "Why not just one bigger head?"

Traps: - Claiming heads have distinct interpretable roles. Some do; many don't. Interpretability research is mixed. - Saying MHA increases parameters. It doesn't — same total, just split.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "What is Grouped-Query Attention (GQA), and how does it differ from Multi-Head Attention (MHA)?"

Tags: senior · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - MHA: N query heads, N key heads, N value heads. Full expressive flexibility but the KV cache grows linearly with N during inference. - MQA (Multi-Query Attention): N query heads, 1 key head, 1 value head shared across all queries. Smallest KV cache; some quality loss. - GQA (Grouped-Query Attention): the compromise. N query heads and G key/value heads (1 < G < N), with N/G query heads sharing each key/value head. Llama 3 8B has 32 Q heads and 8 KV heads, so groups of 4 Q heads share each KV head. - Why it matters: decode is memory-bandwidth-bound (see the "Why is LLM inference memory-bandwidth-bound" question). At long context or large batch, KV cache reads become a major share of per-token cost. Smaller KV cache → faster decode, lower memory pressure, and room for larger batches. - Quality trade-off: GQA loses very little quality vs MHA (~0.5-1% on benchmarks) while dramatically shrinking the KV cache. MQA loses more (1-3%) but compresses further. - 2026 default: GQA in nearly all open-weight production models (Llama 3, Mistral, Qwen); closed frontier models (Claude, Gemini, GPT-4-class) do not publish their attention layout. Pure MHA is mostly found in research models and older, smaller models. - Numbers to drop: "Llama 3 8B GQA: 32 Q / 8 KV heads → 4× KV cache reduction vs MHA", "MQA: N× KV cache reduction vs MHA for N heads (8-32× typical), larger quality hit"
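The head-sharing pattern is easy to make concrete. The sketch below assumes contiguous grouping (query heads 0-3 share KV head 0, heads 4-7 share KV head 1, and so on), which is a common convention but an assumption here; the function names are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_reduction_vs_mha(n_q_heads, n_kv_heads):
    """How many times smaller the KV cache is than with one KV head per query head."""
    return n_q_heads / n_kv_heads


# Llama 3 8B layout from the text: 32 query heads, 8 KV heads.
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])  # first two groups
print(kv_cache_reduction_vs_mha(32, 8))   # GQA
print(kv_cache_reduction_vs_mha(32, 1))   # MQA is the G = 1 special case
print(kv_cache_reduction_vs_mha(32, 32))  # MHA is the G = N special case
```

Seen this way, MHA and MQA are the two endpoints of one knob, and GQA picks a point in between.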

Common follow-ups: - "Why is KV cache size the binding constraint?" - "How do you train a model with GQA?" - "When would you still use MHA?"

Traps: - Confusing GQA with MQA. GQA has multiple KV heads; MQA has one. - Saying GQA hurts quality significantly. The hit is small; the inference win is large.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/00_ai_foundation/03_transformer_mechanics/

Q: "What is Flash Attention?"

Tags: senior · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - FlashAttention is an IO-aware exact attention implementation. Same math as standard attention; faster execution by minimizing memory traffic. - Standard attention materializes the full N×N score matrix in HBM (GPU global memory). That costs O(N²) memory, and the repeated reads and writes of that matrix are slow because HBM bandwidth, not compute, is the bottleneck. - FlashAttention tiles the computation: load blocks of Q, K, V into the much faster on-chip SRAM, compute attention for the block with a running (online) softmax, and never materialize the full attention matrix in HBM. Output is written back to HBM block by block. - Three generations: FlashAttention-1 (2022) introduced the technique, -2 (2023) parallelized better across SMs, -3 (2024) added Hopper-specific (FP8, async) optimizations. - Performance: 2-4× faster than naive attention, with much higher memory efficiency. Enables long-context training and inference that would otherwise OOM. - Critically: FlashAttention is exact. It computes the same function as naive attention, and outputs match up to floating-point rounding (not necessarily bit-for-bit, because the summation order differs). Not an approximation. - Numbers to drop: "FlashAttention-2: 2-4× speedup over PyTorch's standard attention", "memory: O(N) instead of O(N²) for the attention computation", "enables 100k+ token context windows that would otherwise OOM"
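The reason tiling can be exact is that softmax can be computed incrementally, keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that idea for a single query row in plain Python. It illustrates the numerics only, not the GPU kernel: the function names are hypothetical, and the real implementation also tiles over queries and manages SRAM explicitly.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: materialize every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / z
            for c in range(len(values[0]))]


def tiled_attention_row(q, keys, values, block_size):
    """Same result, visiting K/V block by block with a running (online) softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[c] for e, v in zip(exps, v_blk))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0, 0.1]
keys = [[1.0, 0.0, 0.5, 2.0], [0.2, 0.3, -1.0, 0.0], [3.0, 1.0, 1.0, 1.0],
        [-0.5, 0.5, 0.5, -0.5], [0.0, 2.0, 0.1, 0.4]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5],
          [2.0, -1.0, 0.0], [1.0, 1.0, 1.0]]
print(naive_attention_row(q, keys, values))
print(tiled_attention_row(q, keys, values, block_size=2))
```

The two printed rows agree up to floating-point rounding even though the tiled version never holds more than `block_size` scores at once, which is the sense in which FlashAttention is exact rather than approximate.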

Common follow-ups: - "Is it an approximation?" - "Why does memory bandwidth matter?" - "What's FlashAttention-3?"

Traps: - Calling it an approximation. It's exact. - Conflating FlashAttention (algorithmic optimization) with sparse attention (which is approximate).

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/02_ai_infrastructure/02_inference_serving_systems/

KV cache

Q: "What is KV cache, and how does it speed up inference?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior LLM mechanics probe

Answer outline: - During autoregressive generation, the model produces one token at a time. Each new token's attention requires Keys and Values for all prior tokens in the sequence. - Without caching: each step re-runs the whole prefix, recomputing K and V projections for every prior token and attention over all positions. Attention work is O(N²) per generation step. - With KV cache: K and V for prior tokens are computed once (when each token was first processed) and stored. Each new token only computes K and V for itself, appends to the cache, and reuses the rest. - Result: per-step attention compute drops from O(N²) to O(N) during decode. Massive speedup. - Cost: memory. The KV cache stores K and V tensors for every layer × every prior token × every KV head. For Llama 3 8B's layout (32 layers, 8 KV heads, d_head = 128) at 32k tokens, FP16, batch size 1: 128 KiB per token, roughly 4 GB just for the KV cache (it would be ~16 GB with full MHA). - This is a large part of why decode is memory-bandwidth-bound rather than compute-bound: each generation step must read the model weights and the entire KV cache from HBM. - Numbers to drop: "KV cache size per token: 2 × n_layers × n_kv_heads × d_head × bytes_per_element. Llama 3 8B at 32k tokens FP16: ~4 GB.", "per-step attention cost: quadratic in sequence length without a KV cache, linear with one"
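The per-token size formula is worth working through once by hand. The sketch below plugs in the Llama 3 8B layout quoted elsewhere in this document (32 layers, 8 KV heads, d_head = 128) at FP16; the function name and the 32k sequence length are illustrative choices.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_element=2, batch_size=1):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_element
    return per_token * seq_len * batch_size


GIB = 2 ** 30

gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=32 * 1024)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=32 * 1024)

print(f"per token (GQA): {kv_cache_bytes(32, 8, 128, 1) / 1024:.0f} KiB")
print(f"32k tokens, GQA: {gqa / GIB:.0f} GiB")
print(f"32k tokens, MHA: {mha / GIB:.0f} GiB")
print(f"batch of 16, GQA: {kv_cache_bytes(32, 8, 128, 32 * 1024, batch_size=16) / GIB:.0f} GiB")
```

With these inputs the formula gives 128 KiB per token and 4 GiB at 32k tokens. The last line shows why the cache constrains batch size: it scales linearly with the number of concurrent sequences, so a batch of 16 such requests needs 64 GiB for the cache alone.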

Common follow-ups: - "Why does GQA help here?" - "What's PagedAttention?" - "How does the KV cache size constrain batch size?"

Traps: - Confusing KV cache with prompt cache. Prompt cache (provider-side) reuses across requests; KV cache (in-process) reuses within a single generation. - Forgetting that the KV cache scales with sequence length. Long contexts are KV-cache-heavy.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "Why is LLM inference memory-bandwidth-bound and not compute-bound?"

Tags: senior · common · conceptual · source: standard senior inference-perf probe; reported in 2026 AI infra loops

Answer outline: - Modern GPUs have far more FLOPs than memory bandwidth. The ratio of peak FLOPs to peak bandwidth defines an arithmetic intensity ridge: operations need a certain FLOPs/byte ratio to fully utilize compute. Below the ridge, you're memory-bound. - LLM decode has very low arithmetic intensity: each generation step reads the entire model's weights + KV cache from HBM but does very few FLOPs per byte read. The compute units sit idle waiting on memory. - Specifically: at batch size 1, a 7B-parameter model reads ~14 GB of FP16 weights to produce one token's worth of FLOPs (a small stack of matrix-vector products). Arithmetic intensity ≈ 1-2 FLOPs/byte; the H100 ridge is in the hundreds of FLOPs/byte (roughly 300 for dense BF16 at 3.35 TB/s). Massively memory-bound. - Implications: - Batching helps: serving N requests in parallel reuses each weight load N times, pushing arithmetic intensity up. Continuous batching is a huge inference win. - Quantization helps: INT4 weights mean ~4× less weight traffic per token than FP16, so decode speedup can approach linear when weight reads dominate. - GPU memory bandwidth matters more than FLOPs: H200 (4.8 TB/s) vs H100 (3.35 TB/s) gives ~40% decode speedup at the same FLOPs. - Prefill is different: all prompt tokens go through the model in one pass, so each weight read is reused across every prompt token and arithmetic intensity is high. Prefill is often compute-bound. - Numbers to drop: "arithmetic intensity for decode at batch 1: ~1-2 FLOPs/byte. H100 ridge: hundreds of FLOPs/byte.", "H200 vs H100 bandwidth: 4.8 vs 3.35 TB/s. Decode throughput nearly matches the ratio."
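A back-of-the-envelope roofline makes the argument quantitative. The sketch below is a deliberately simplified model: it counts only weight reads (ignoring KV cache and activation traffic), assumes 2 FLOPs per parameter per token, and uses an assumed H100 peak of 989 TFLOP/s for dense BF16 alongside the 3.35 TB/s bandwidth quoted above. The function name is hypothetical.

```python
def decode_roofline(n_params, bytes_per_param, batch_size, peak_flops, peak_bandwidth):
    """Upper-bound decode model: weights are read once per step and shared by the batch."""
    flops = 2 * n_params * batch_size         # one multiply-add per weight per sequence
    bytes_read = n_params * bytes_per_param   # weight traffic only (simplifying assumption)
    step_time = max(flops / peak_flops, bytes_read / peak_bandwidth)
    return {
        "intensity": flops / bytes_read,          # FLOPs per byte
        "ridge": peak_flops / peak_bandwidth,     # FLOPs per byte needed to saturate compute
        "tokens_per_s": batch_size / step_time,   # across the whole batch
    }


H100_FLOPS = 989e12  # assumed dense BF16 peak, FLOP/s
H100_BW = 3.35e12    # bytes/s
H200_BW = 4.8e12     # bytes/s

for batch in (1, 64, 512):
    r = decode_roofline(7e9, 2, batch, H100_FLOPS, H100_BW)
    bound = "memory" if r["intensity"] < r["ridge"] else "compute"
    print(f"batch {batch:>3}: intensity {r['intensity']:.0f} FLOPs/byte, "
          f"ridge {r['ridge']:.0f}, {bound}-bound, {r['tokens_per_s']:.0f} tok/s")

h100 = decode_roofline(7e9, 2, 1, H100_FLOPS, H100_BW)
h200 = decode_roofline(7e9, 2, 1, H100_FLOPS, H200_BW)
print(f"H200 / H100 at batch 1: {h200['tokens_per_s'] / h100['tokens_per_s']:.2f}x")
```

In this model, throughput grows linearly with batch size until intensity crosses the ridge, after which the step becomes compute-bound. Real systems hit limits earlier because KV cache reads also grow with batch size and context length.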

Common follow-ups: - "Does this change as batch size grows?" - "If decode is memory-bound, why does FP8 help?" - "How does this affect speculative decoding?"

Traps: - Saying decode is compute-bound because the GPU is "busy". Memory is the bottleneck; the SMs are starved.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/02_ai_infrastructure/02_inference_serving_systems/

Positional encoding

Q: "What is positional encoding, and why is it needed in Transformers?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Attention on its own is permutation-equivariant: scrambling the input tokens just scrambles the outputs the same way, so each token's output ignores where it sat in the sequence. The model has no notion of order unless we inject it. - Positional encoding fixes this by adding (or rotating) position-dependent information into the token embeddings. - Variants: - Sinusoidal (original 2017): fixed sinusoidal functions of position added to embeddings. No learned parameters. Works but degrades on very long contexts. - Learned (BERT, GPT-2): learn a position embedding per index (up to a max length). Simple but doesn't extrapolate beyond training-time max length. - RoPE (Rotary Position Embedding): instead of adding a position vector, rotate the Q and K vectors by an angle that depends on position. Encodes relative position naturally; extrapolates better. Now the dominant choice (Llama, Mistral, Gemma, Qwen, many others). - ALiBi (Attention with Linear Biases): adds a distance-dependent bias to attention scores. No vector manipulation; easy to extrapolate. - Why this matters in 2026: long-context performance hinges on how the model handles positions beyond its training-time max. RoPE with scaling tricks (NTK scaling, YaRN) is the dominant approach for extending context to 100k+ tokens. - Numbers to drop: "sinusoidal: original transformer", "learned absolute: BERT/GPT-2", "RoPE: Llama family, dominant in 2026", "ALiBi: BLOOM, MPT"
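As a concrete reference point for the oldest variant, the sinusoidal scheme can be written in a few lines. The sketch below follows the original Transformer formulation (sine on even dimensions, cosine on odd ones, wavelengths set by a base of 10000); the function name is hypothetical.

```python
import math


def sinusoidal_position_encoding(position, d_model, base=10000.0):
    """Fixed encoding: PE[2k] = sin(pos / base^(2k/d)), PE[2k+1] = cos(pos / base^(2k/d))."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


for pos in (0, 1, 2):
    print(pos, [round(v, 4) for v in sinusoidal_position_encoding(pos, d_model=8)])
```

Low dimensions oscillate quickly and high dimensions slowly, so every position gets a distinct vector without any learned parameters. The vector is simply added to the token embedding, in contrast to RoPE, which rotates Q and K inside attention.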

Common follow-ups: - "Why doesn't learned positional encoding extrapolate?" - "What's NTK / YaRN scaling?" - "How does ALiBi compare to RoPE?"

Traps: - Claiming the transformer "knows" position from token order. It doesn't — positional encoding is required.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "How does Rotary Position Embedding (RoPE) work, and why is it preferred over learned positional embeddings?"

Tags: senior · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - RoPE rotates the Q and K vectors by an angle proportional to position before computing attention. Specifically: split Q (and K) into 2D pairs; for each pair, rotate by angle θ_i × position, where θ_i is a per-pair frequency. - Effect: when computing Q·K^T for tokens at positions m and n, the result depends only on the relative offset (m - n), not on absolute positions. The model naturally captures relative position. - Why preferred: - Extrapolation: RoPE can be extended to positions beyond training-time max via frequency scaling (NTK, YaRN, position interpolation). Learned absolute embeddings have no entry at all for positions beyond the trained maximum. - Relative-by-construction: no separate "absolute" and "relative" position systems. Cleaner architecture. - No extra parameters: the rotation is determined by position; no learned position parameters. - Composability: the rotation touches only Q and K inside attention. Values, the residual stream, and the FFN are unchanged, so it drops into an existing block without other modifications. - Used in: Llama family (including Llama 3), Mistral, Gemma, Qwen, most 2026 open-weight models. - Long-context tricks: NTK scaling adjusts the rotation frequencies to handle longer contexts; YaRN combines NTK-style scaling with attention temperature scaling for better extrapolation; position interpolation linearly compresses positions into the trained range. - Numbers to drop: "RoPE base frequency: typically 10000 (legacy) or larger (500k in Llama 3, 1M+ for some long-context models)", "extension via scaling: 2-4× context length with minimal training; 10×+ with continued pretraining"
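The relative-position property can be verified numerically. The sketch below pairs adjacent dimensions (0,1), (2,3), and so on, as in the original RoPE formulation; some implementations instead pair dimension i with i + d/2, which is equivalent up to a permutation. The function names are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent 2D pair of `vec` by position * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair frequency: fast for early pairs, slow for late
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 0.25, 2.0]

# Same relative offset (2) at very different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
# Different relative offset (1).
other = dot(rope(q, 5), rope(k, 4))

print(f"offset 2 near the start: {near:.6f}")
print(f"offset 2 far along:      {far:.6f}")
print(f"offset 1:                {other:.6f}")
```

The first two scores match up to rounding, while the third differs: the attention score sees only how far apart the two tokens are. Raising `base` slows every rotation, which is the knob that NTK-style scaling adjusts.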

Common follow-ups: - "How exactly does the rotation work?" - "What's NTK scaling?" - "Why don't all positions degrade gracefully?"

Traps: - Conflating RoPE with adding sinusoidal vectors. RoPE rotates; sinusoidal adds.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Mixture of Experts

Q: "What is Mixture of Experts (MoE), and how does it work in models like Mixtral?"

Tags: senior · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - MoE replaces the dense FFN (feed-forward) layer in each transformer block with a set of expert FFNs and a small router that picks which experts to activate per token. - For each token, the router (a small linear classifier over the token's hidden state) outputs scores over experts, picks the top-K (typically K=2) experts, and combines their outputs using the normalized router scores as weights. - Result: only K of N experts compute per token. The model has many more total parameters but only a fraction are active per forward pass (sparse activation). Total params grow; FLOPs per token stay roughly constant. - Examples: Mixtral 8x7B (8 experts, top-2) has 47B total params but ~13B activated per token. Mixtral 8x22B (8 experts, top-2): 141B total, ~39B activated. Newer MoE models (DeepSeek-V3, Llama 4 MoE variants in 2026) scale this further. - Why: better compute efficiency. At the same FLOPs per token, a sparse model with many more total parameters typically matches or beats a dense model, because capacity grows without growing per-token compute. - Trade-offs: - Memory cost: all experts must be resident in GPU memory even though only K activate per token. Total params determine memory; active params determine compute. - Routing instability: poorly-trained routers can collapse (one expert handles everything). Auxiliary load-balancing losses mitigate this. - Serving complexity: different tokens in a batch hit different experts, so expert-parallel serving distributes experts across GPUs and shuffles tokens between them. - Numbers to drop: "Mixtral 8x7B: 47B total, 13B active per token", "compute ~ active params; memory ~ total params", "top-K typical: 1-2 in Mixtral-style models; fine-grained designs such as DeepSeek-V3 route to more, smaller experts"
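The routing step is small enough to sketch directly. The version below normalizes with a softmax over only the selected top-K logits, which is one common choice (Mixtral-style) and an assumption here; other designs apply softmax over all experts first. The expert functions and names are hypothetical stand-ins for real FFNs.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for expert_index, gate in top_k_route(router_logits, k):
        expert_out = experts[expert_index](token)  # the other experts never execute
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output


# Toy experts: expert i just scales the token by i.
experts = [lambda x, s=s: [s * v for v in x] for s in range(4)]
token = [1.0, 2.0]
logits = [0.1, 2.0, -1.0, 2.0]  # in a real model: hidden_state @ W_router

print(top_k_route(logits))
print(moe_layer(token, logits, experts))
```

Only two of the four experts run for this token, yet all four must be loaded, which is the "compute follows active params, memory follows total params" point above.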

Common follow-ups: - "Why does sparse activation help?" - "What's load-balancing loss for?" - "How does serving differ for MoE vs dense?"

Traps: - Saying MoE "uses fewer parameters". It uses more total; fewer active. - Forgetting memory cost. The MoE model is as memory-heavy as its full parameter count.

Related cross-cutting: Cost & latency Related module: learning/00_ai_foundation/03_transformer_mechanics/

Context window & long-context

Q: "What's a context window? How big is too big?"

Tags: mid · very-common · conceptual · source: standard senior context-window probe; MyEngineeringPath LLM Fundamentals 2026

Answer outline: - Context window = the maximum number of tokens the model processes in one forward pass (input + so-far-generated output). - 2026 landscape: 128k-200k tokens is common (GPT-4o at 128k, Claude Sonnet 4.x at 200k); Gemini 2.5 Pro reaches 1-2M; Claude long-context tiers go to 1M; experimental research targets 10M+. - "How big is too big", three angles: - Quality: the effective context window is much smaller than the nominal one. Models reliably use the first ~32k-64k tokens; performance on middle positions degrades ("lost in the middle"); deep-middle accuracy on needle-in-haystack tests drops 30-80% across models. - Cost: input tokens are usually 4-5× cheaper than output, but at 1M-context, the per-call cost is still huge. Many providers price long-context tiers higher. - Latency: the attention part of prefill scales quadratically with context length (O(N²)), while the rest scales linearly. A 1M-token prefill takes seconds even on premium hardware. - Practical rule: pass the smallest context that contains what the model needs. RAG + reranking to top-5 chunks beats stuffing top-100 chunks even on long-context models. - Numbers to drop: "GPT-4o: 128k; Claude 3.5/4 Sonnet: 200k", "Gemini 2.5 Pro: 1-2M", "effective: 32-64k reliable on most models", "prefill attention: scales O(N²), so long context is expensive in both cost and latency"

Common follow-ups: - "What's 'lost in the middle'?" - "Is RAG dead with 1M-token context?" - "How do you measure effective context?"

Traps: - Treating nominal context as effective. They diverge sharply. - Always-pass-everything thinking. Smaller, well-curated context usually wins.

Related cross-cutting: Cost & latency, Retrieval Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/01_ai_engineering/08_rag_system_design/

Q: "What are the failure modes of large context windows?"

Tags: senior · common · conceptual · source: MyEngineeringPath LLM Fundamentals 2026; standard senior probe

Answer outline: - Three primary failure modes: - Lost in the middle: information in the middle of long context is attended to less than start/end. Needle-in-haystack accuracy drops 30-80% in mid-positions across many models. - Distractor sensitivity: as context grows, the model is more likely to be misled by irrelevant content. Adding more chunks can make answers worse if they contain near-misses or contradictions. - Cost and latency: prefill is O(N²) compute; the per-call cost for long context is high even at "cheap" per-token rates. Per-call latency seconds, not milliseconds. - Secondary issues: - Position-bias artifacts in answers: tendency to over-cite or over-quote material from the start of context. - Quality degradation on multi-document context: even when each doc is short, packing many into context can confuse the model. - Caching gets harder: long static-prefix caching helps, but if any part of the long context varies per call, cache hit rate drops. - 2026 reality: papers and benchmarks show effective context is far shorter than nominal. The maximum-effective-context for real-world reasoning is typically 30-60% of the nominal context window on most models. - Mitigations: aggressive retrieval (don't stuff context; retrieve the right 5-10 chunks), reorder so the most-important content is at position 1 and the last position, citation-required output (forces attention to source chunks). - Numbers to drop: "effective context: 30-60% of nominal", "lost-in-the-middle accuracy drop: 30-80%", "1M-context prefill: 2-10s wall-clock typical"
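The reordering mitigation is simple to implement. The sketch below is one possible policy, not a standard algorithm: given chunks already ranked best-first by a retriever or reranker, it alternates them between the front and the back of the prompt so the weakest chunks land in the middle. The function name and chunk labels are hypothetical.

```python
def reorder_for_edges(chunks_ranked_best_first):
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_ranked_best_first):
        # Ranks 0, 2, 4, ... fill the front; ranks 1, 3, 5, ... fill the back.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["chunk_1", "chunk_2", "chunk_3", "chunk_4", "chunk_5"]  # chunk_1 is most relevant
print(reorder_for_edges(ranked))
```

The best chunk ends up first and the second-best last, which lines up with the positions where long-context models tend to attend most reliably. Whether this helps for a given model is an empirical question, so it is worth measuring on a needle-in-haystack style eval before adopting it.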

Common follow-ups: - "How do you measure effective context?" - "What's the right way to use a 1M-token window?" - "Walk me through your latency budget at 100k context."

Traps: - Assuming a bigger context window solves your problems. Usually it just shifts them.

Related cross-cutting: Cost & latency, Retrieval Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/01_ai_engineering/08_rag_system_design/

Scaling laws

Q: "Explain scaling laws for LLMs."

Tags: senior · common · conceptual · source: standard senior LLM-internals probe; reported in 2026 AI engineer loops

Answer outline: - Empirical relationships between model size, training data size, training compute, and final loss/quality. - Kaplan et al. (2020): test loss scales as a power law in compute, parameters, and data. Quality improves smoothly with more of any of those, with no obvious cliff. - Chinchilla / Hoffmann et al. (2022) refined this: for a given compute budget, the optimal ratio is roughly 20 tokens of training data per parameter. Earlier models (GPT-3 175B trained on 300B tokens, under 2 tokens per parameter) were under-trained per Chinchilla and should have seen more data for their size. - 2026 picture: most production models follow Chinchilla-style ratios or are deliberately over-trained (more tokens per parameter than Chinchilla-optimal) because smaller models with more training are easier to deploy. Llama 3 8B: ~15T training tokens (15T / 8B ≈ 1,900, so ~2000 tokens/param; far above Chinchilla optimal, but justified by better inference economics). - Implications: - For deployable models, more data per parameter usually beats simply making the model bigger. - Compute spent on training vs serving is a trade-off: a smaller-but-more-trained model is cheaper to serve. - Scaling has not yet plateaued in 2024-2026 on most benchmarks, though gains per dollar are slowing. - Architectural innovations (MoE, GQA, FlashAttention) shift the compute/quality curve favorably. - Numbers to drop: "Chinchilla optimal: ~20 tokens per parameter", "Llama 3 8B: ~2000 tokens/param, heavily over-trained", "scaling laws are empirical; they're observed, not derived"
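The tokens-per-parameter figures above are simple arithmetic, and it helps to be able to reproduce them on a whiteboard. The sketch below uses only the numbers quoted in the answer (300B tokens for GPT-3 175B, ~15T tokens for Llama 3 8B, a ~20:1 Chinchilla ratio); the function names are illustrative.

```python
def tokens_per_param(train_tokens, params):
    """Training tokens seen per model parameter."""
    return train_tokens / params


def chinchilla_optimal_tokens(params, ratio=20.0):
    """Approximate compute-optimal token count at ~20 tokens/param."""
    return ratio * params


gpt3_ratio = tokens_per_param(300e9, 175e9)      # under 2 tokens/param
llama3_8b_ratio = tokens_per_param(15e12, 8e9)   # 1875 tokens/param
llama3_8b_optimal = chinchilla_optimal_tokens(8e9)  # 160B tokens

# How far past the ~20:1 ratio Llama 3 8B was trained (about 94x).
overtrain_factor = llama3_8b_ratio / 20.0
```

GPT-3 lands at roughly 1.7 tokens per parameter (well under 20), while Llama 3 8B lands at 1,875, which is where the "~2000 tokens/param" talking point comes from.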

Common follow-ups: - "Why over-train a small model?" - "Has scaling plateaued?" - "What about MoE — does it change the scaling law?"

Traps: - Treating scaling laws as physical laws. They're empirical fits with regime-dependent slopes.

Related cross-cutting: — Related module: learning/00_ai_foundation/05_llm_training_pipeline/, learning/00_ai_foundation/03_transformer_mechanics/

Generation behavior

Q: "Why do LLMs hallucinate?"

Tags: screen · very-common · conceptual · source: MyEngineeringPath LLM Fundamentals 2026; standard 2026 LLM screen

Answer outline: - LLMs are next-token predictors trained to output fluent continuations, not correct ones. Hallucination is the training objective working as designed: confident, fluent text whose factuality wasn't directly optimized. - Specific causes: - Gaps in training data: facts the model never saw get plausibly invented. - Outdated training data: facts shifted post-cutoff; the model still emits the old facts. - Miscalibration: the model doesn't reliably know what it doesn't know. RLHF reduces this but doesn't eliminate it. - Pressure to answer: instruction-tuned models default to producing an answer rather than refusing, even when uncertain. - Compression artifacts: training compresses billions of facts into billions of parameters. Some facts get fuzzy. - Mitigations: RAG (ground generation in retrieved sources), citation-required output, lower temperature for factual tasks, post-hoc verification (claim-extraction + entailment check), self-consistency (sample N, majority-vote). - The honest answer: hallucination can be reduced (RAG + verification + RLHF cuts severe cases 60-95%), not eliminated. Plan for it as a known failure mode. - Numbers to drop: "RAG + grounding-check cuts severe hallucination by 60-95%", "self-consistency at N=5: catches 60-80% of confidently-wrong outputs"
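The self-consistency mitigation (sample N, majority-vote) is short enough to sketch. This is a minimal illustration under stated assumptions: `sample_fn` is a hypothetical zero-argument callable that returns one normalized answer string per call (in practice it would wrap a model call at T>0), and `min_agreement` is an illustrative threshold, not a standard value.

```python
from collections import Counter


def self_consistency(sample_fn, n=5, min_agreement=0.6):
    """Sample n answers and return (winner_or_None, agreement_rate).

    Returns None as the winner when the most common answer does not
    reach min_agreement, which callers can treat as "abstain or escalate".
    """
    answers = [sample_fn() for _ in range(n)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n
    winner = top_answer if agreement >= min_agreement else None
    return winner, agreement


# Stubbed sampler standing in for five model calls.
canned = iter(["Paris", "Paris", "Lyon", "Paris", "Paris"])
result = self_consistency(lambda: next(canned), n=5)
```

Here `result` is `("Paris", 0.8)`. Low agreement does not prove the answer is wrong, and high agreement does not prove it is right; the vote only catches cases where the model's samples disagree with each other.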

Common follow-ups: - "What's the difference between closed-domain and open-domain hallucination?" - "Why doesn't RAG fully solve it?" - "Are bigger models less prone to hallucinate?"

Traps: - Claiming hallucination can be eliminated. - Saying "the model is lying". The model isn't being deceptive; it's optimizing the wrong objective for factuality.

Related cross-cutting: Production patterns, Architecture choices Related module: learning/03_ai_security_safety/00_safety_guardrail_design/, learning/01_ai_engineering/08_rag_system_design/

Q: "What is the difference between temperature 0 and temperature 1?"

Tags: screen · very-common · conceptual · source: MyEngineeringPath LLM Fundamentals 2026; standard screen-tier probe

Answer outline: - Temperature scales the logits before softmax: softmax(logits / T). - T=0 is the limit in which the softmax becomes a one-hot vector over the highest-logit token (the formula itself divides by zero at T=0, so implementations special-case it). In practice, T=0 means greedy decoding: always pick the most-probable next token. Deterministic in principle. - T=1 is no scaling: sample from the model's natural distribution. Diverse output, less predictable. - T>1 flattens the distribution (more random); 0<T<1 sharpens it. - Use T=0 for: extraction, classification, structured output, factual Q&A, anywhere correctness matters more than variety. - Use T=0.3-0.7 for: balanced output, code generation (some variety helps recover from local-optimum mistakes), conversational responses. - Use T>0.7 for: creative writing, brainstorming where variety matters. - Subtleties: T=0 is not perfectly deterministic in practice, because tie-breaking and floating-point accumulation order can differ between runs and providers. To get as close to reproducible as possible, set both temperature and a seed where the API offers one. - For decoding-parameter deep-dive (top-p, top-k, repetition penalty), see prompt-engineering.md. - Numbers to drop: "T=0 default for extraction/classification", "T=0.3-0.7 for balanced", "T>1 rarely used in production"
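The softmax(logits / T) formula is easy to show concretely. The sketch below is a plain-Python illustration, not any provider's implementation; treating T=0 as an argmax one-hot is the convention described above, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities for the given temperature."""
    if temperature <= 0:
        # T=0 is handled as greedy decoding: one-hot on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
greedy = softmax_with_temperature(logits, 0.0)
sharp = softmax_with_temperature(logits, 0.5)
natural = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

For these logits the top token's probability falls as T rises: it is 1.0 at T=0, highest of the sampled cases at T=0.5, lower at T=1, and lowest at T=2, which is the "sharpen vs flatten" behavior in one picture.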

Common follow-ups: - "How is temperature different from top-p?" - "Why isn't T=0 fully deterministic?"

Traps: - Conflating temperature and top-p. They're orthogonal knobs.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/00_ai_foundation/07_prompting_fundamentals/

Q: "What are different decoding strategies (greedy, beam search, sampling)? When do you use each?"

Tags: mid · common · conceptual · source: standard senior decoding probe; reported in 2026 LLM loops

Answer outline: - Greedy: pick the single highest-probability token at each step. Fast, deterministic. Best for tasks with one right answer (classification, extraction). - Beam search: maintain top-N partial sequences (beams), expand each, keep top-N overall. More globally optimal than greedy but tends to produce repetitive, generic text. Standard in classical NMT; rarely used in modern open-ended generation. - Sampling: sample from the next-token distribution. Plain sampling from the full distribution (T=1, no truncation) is often too random because unlikely tail tokens still get picked; production setups use either top-K (sample from the K most-likely tokens) or top-P / nucleus (sample from the smallest set of tokens whose cumulative probability is ≥ p, typically p=0.9-0.95). - Speculative decoding: a separate technique (covered in cost-latency-optimization.md) that uses a draft model to propose tokens while the target model verifies them in parallel. It preserves the target model's output distribution, so it changes speed, not which decoding strategy you are using. - Use case mapping: - Extraction / classification / structured output: greedy (T=0). - Code completion / structured generation: top-p with low T (0.2-0.5). - Conversational / creative: top-p with mid T (0.5-0.8). - Multi-output diverse sampling (eval, brainstorming): higher T (0.7-1.0) with top-p. - 2026 production default: top-p (p=0.9-0.95) with T=0 to 0.7 depending on task. - Numbers to drop: "top-p = 0.9-0.95 typical", "beam search rarely used in 2026 chat", "self-consistency: sample N=5-10 then majority-vote"
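The difference between top-K and top-P truncation is clearest on a small distribution. The sketch below is a plain-Python illustration over a list of token probabilities indexed by token ID; the function names are hypothetical, and real inference stacks apply the same idea to logits on the accelerator.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    """Draw one token ID from a filtered {token_id: prob} mapping."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
by_k = top_k_filter(probs, k=2)      # keeps token IDs 0 and 1
by_p = top_p_filter(probs, p=0.9)    # keeps token IDs 0, 1, and 2
token = sample(by_p, random.Random(0))
```

Top-K always keeps a fixed number of candidates, while top-P adapts: a peaked distribution keeps few tokens and a flat one keeps many, which is one reason nucleus sampling is the more common production default.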

Common follow-ups: - "Why doesn't beam search work well for chat?" - "Top-K vs top-P — which?" - "How does self-consistency use sampling?"

Traps: - Recommending beam search for open-ended generation. It tends toward repetitive output.

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/, learning/00_ai_foundation/07_prompting_fundamentals/

Q: "Explain the attention mechanism to a product manager."

Tags: senior · common · conceptual · source: MyEngineeringPath LLM Fundamentals 2026; classic senior-level "explain it simply" probe

Answer outline: - "Imagine reading a long sentence — when you reach the word 'it', your brain instantly figures out which noun 'it' refers to by paying attention to the right earlier words. The attention mechanism in transformers is the math version of this. - For each word the model is processing, it computes a relevance score with every other word in the input. The highest-scoring words get the most influence on how the current word gets interpreted. - That's how transformers handle long dependencies — they're not reading left-to-right and forgetting; they have a way to look back at any earlier word that matters. - The 'multi-head' part means the model does this several times in parallel — each parallel pass can focus on a different kind of relationship (one head might track who-is-doing-what; another might track time references)." - Senior tell: candidate can compress correctly without losing the key intuition. Avoid jargon (Q/K/V, softmax) unless the PM asks; lead with the analogy. - Why this matters: a chunk of senior interviews include "explain to a non-technical stakeholder" questions because in real product orgs you'll need to. - Numbers to drop: not needed for this answer style.
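If the PM's follow-up pushes toward the mechanics, the "relevance score with every other word" idea can be backed by a small sketch. This is a deliberately simplified single-head version in plain Python: it omits the learned Q/K/V projections, causal masking, and multiple heads, and the toy vectors are made up for illustration.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of float vectors.

    Returns one (weights, output) pair per query: weights are the
    relevance scores after softmax, output is the weighted mix of values.
    """
    dim = len(keys[0])
    results = []
    for q in queries:
        # Relevance of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        results.append((weights, output))
    return results


# One query that matches the first key more closely than the second.
weights, output = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)[0]
```

The first key gets the larger weight, so the output leans toward the first value vector. That is the "highest-scoring words get the most influence" sentence expressed as arithmetic.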

Common follow-ups: - "Why is this better than older approaches like RNNs?" - "If I make my input twice as long, what happens?"

Traps: - Jumping straight to Q, K, V. Loses the listener. - Over-correcting toward toy analogies that miss the point ("the model 'focuses' like a human" — partially true but vague).

Related cross-cutting: — Related module: learning/00_ai_foundation/02_tokens_embeddings_context/

Q: "How would you design a system that uses both a small model and a large model?"

Tags: senior · very-common · design · source: MyEngineeringPath LLM Fundamentals 2026; standard senior architecture probe

Answer outline: - This is the model-routing question. Frame the answer as a tiered system, not just two models. - Layers (covered in depth in cost-latency-optimization.md): - Small model handles the bulk of traffic that's "easy enough" — short answers, format-bound tasks, classification, intent extraction. - Large model handles the long tail — multi-step reasoning, ambiguous questions, anything the small model can't ace. - Routing mechanism: - Static rules: by query type, by length, by tenant tier. - Classifier router: small fast model labels query complexity; routes accordingly. - Confidence cascade: small model attempts; if its confidence is low (logprob, structured-output schema failure, hedging language detected), escalate to the large model. - Calibrate on real traffic. Run shadow comparisons of small vs large on the same queries; learn the routing threshold from data. - Cost win at scale: typically 60-70% of traffic to the small model at 5-10× cost reduction; quality regression <1-3% on most workloads. - Always have a fallback path. If the small model can't produce a parseable response, fall through to the large model rather than failing. - Numbers to drop: "60-70% of traffic to small model typical", "5-10× cost reduction per routed call", "router itself ~$0.001/call with <300ms latency"
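The confidence cascade with a fallback path can be sketched as follows. Everything here is an assumption for illustration: `small_model` and `large_model` are hypothetical callables that return a `(text, confidence)` pair, the expected output is JSON, and the `min_confidence` threshold of 0.8 is a placeholder that would be learned from shadow traffic as described above.

```python
import json


def answer_with_cascade(query, small_model, large_model, min_confidence=0.8):
    """Try the small model first; escalate on low confidence or bad JSON."""
    text, confidence = small_model(query)
    if confidence >= min_confidence:
        try:
            return {"tier": "small", "output": json.loads(text)}
        except json.JSONDecodeError:
            pass  # unparseable output: fall through to the large model
    text, _ = large_model(query)
    # A parse failure here propagates to the caller, which should
    # have its own error handling for the final tier.
    return {"tier": "large", "output": json.loads(text)}


def stub_large(query):
    # Stand-in for the expensive model.
    return '{"label": "billing"}', 0.99


confident = answer_with_cascade(
    "refund?", lambda q: ('{"label": "billing"}', 0.95), stub_large
)
unsure = answer_with_cascade(
    "refund?", lambda q: ('{"label": "other"}', 0.40), stub_large
)
broken = answer_with_cascade(
    "refund?", lambda q: ("not json", 0.99), stub_large
)
```

Only the first call is served by the small tier; the low-confidence and unparseable cases both escalate. Logging which branch fired on each request is what makes the threshold tunable later.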

Common follow-ups: - "How do you handle the routing classifier being wrong?" - "What if the small model handles 60% of traffic but those 60% are the cheap users — how does cost actually shake out?" - "How would you A/B routing changes safely?"

Traps: - A binary routing decision without a fallback. Production needs an escape hatch. - Skipping calibration. Routing thresholds tuned without real data are guesses.

Related cross-cutting: Cost & latency, Architecture choices Related module: learning/01_ai_engineering/12_model_vendor_strategy/, learning/02_ai_infrastructure/05_agent_performance_economics/