Cost & Latency Optimization — Interview Questions¶
The senior-loop differentiator. Anyone can build a working LLM feature; few can keep one profitable at 100k+ requests/day. Expect a layered design question: routing + caching + batching + smaller models + serving infra. The candidate who names the dominant cost driver before reaching for fixes is the one who's actually shipped this. In 2026, a well-tuned routing + caching + batching stack typically cuts API spend 70-85%.
Routing & model tiering¶
Q: "How would you reduce token costs in an LLM-powered product at scale?"¶
Tags: mid · very-common · design · source: standard senior AI engineer interview opener; reported across DataCamp, MyEngineeringPath, MockExperts 2026 question lists
Answer outline: - Start with measurement. "Reduce cost" without a cost breakdown is theatre — name your dominant driver before naming fixes. Top suspects: long system prompts replayed every call, oversized model for the task, no caching, unbatched async work. - The five-lever playbook in priority order: (1) caching — highest ROI, lowest risk; (2) model routing/tiering — pick the cheapest model that passes the task eval; (3) prompt compression / shorter outputs; (4) batching async traffic; (5) distillation or fine-tune to a smaller open-weight model for hot paths. - Combined, these typically cut spend 70-85% without quality regression. Caching alone is 60-90% on cache hits; routing alone is 40-70% on full traffic. - Name the trade-off you'll measure: every cost cut must come with an eval gate on quality, or you're trading dollars for silent regressions. - Numbers to drop: "Anthropic cache hit: 0.1× input cost (90% off). Batch API: 50% off, 24h SLA. Router can send 60-70% of traffic to a 5-10× cheaper model with a 1-3% quality delta on most tasks."
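To make the measurement step concrete, here is a minimal sketch that attributes per-call cost to prompt components so the dominant driver is visible before any fix is chosen. The component names, token counts, and per-million-token prices are hypothetical placeholders, not figures from any provider.

```python
def cost_breakdown(components, prices_per_mtok):
    """Return each component's share of the per-call cost.

    components: name -> (kind, tokens), where kind is "input" or "output".
    prices_per_mtok: kind -> dollars per million tokens (hypothetical values).
    """
    costs = {
        name: tokens * prices_per_mtok[kind] / 1e6
        for name, (kind, tokens) in components.items()
    }
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


# Hypothetical agent-style call: a large system prompt replayed on every request.
call = {
    "system_prompt": ("input", 8000),
    "retrieved_context": ("input", 2000),
    "user_message": ("input", 200),
    "completion": ("output", 400),
}
prices = {"input": 3.0, "output": 15.0}

shares = cost_breakdown(call, prices)
dominant = max(shares, key=shares.get)
print(dominant, round(shares[dominant], 2))
```

With these placeholder numbers the replayed system prompt accounts for roughly two thirds of the spend, which would point at caching before model swaps.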
Common follow-ups: - "Walk me through how you'd measure your dominant cost driver." - "Of these levers, which would you do first and why?" - "How do you avoid quality regressions while cutting cost?"
Traps: - Jumping to "use a smaller model" without measuring whether the prompt or the cache is the actual problem. - Treating cost optimization as a one-shot exercise. It's continuous — model prices change, traffic patterns shift, base models upgrade. - Forgetting the eval gate. Senior interviewers always probe: "how do you know the optimization didn't break quality?"
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "What's a router model and how do you build one?"¶
Tags: senior · very-common · design · source: Mavik Labs LLM cost optimization 2026 guide; MindStudio router model post; standard senior loop probe
Answer outline:
- A router is a small, fast classifier that decides which downstream model handles each request. The bet: 60-70% of production traffic is easy enough for a cheap model; only the hard 30-40% needs the expensive one.
- Build it in three layers. (1) Intent classifier: a short-text classifier (often a 1-3B model or even a fine-tuned BERT) labeling the request as simple/medium/complex or by task category. (2) Cheap-tier model: the default route, e.g. Haiku, GPT-4o-mini, or a self-hosted 7B. (3) Expensive-tier model: the fallback, e.g. Sonnet/Opus, GPT-4-turbo, or another frontier model.
- Add a confidence escalation: if the cheap model's output fails a structural check (malformed JSON, refusal, hedging language) or the classifier confidence is low, fall back to the expensive tier automatically.
- Calibrate with side-by-side comparison: route 5-10% of traffic in shadow mode to both tiers, compare answers with an LLM judge, and learn the routing thresholds from real data.
- The router itself must be cheap: ~$0.001 per classification with ~50-500ms latency. Otherwise the routing cost eats the savings.
- Numbers to drop: "60-70% of traffic to cheap tier", "router cost: $0.001/req, ~430ms latency", "savings: 40-70% on full traffic budget at <3% quality regression"
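The classify, try-cheap, escalate flow described above can be sketched as follows. The three callables (`classify`, `cheap_model`, `expensive_model`), the `min_confidence` value, and the JSON shape being checked are hypothetical stand-ins for illustration; a real router would call an actual classifier and model clients.

```python
import json


def passes_structural_check(text):
    """Accept cheap-tier output only if it is a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def route(request, classify, cheap_model, expensive_model, min_confidence=0.8):
    """Return (tier, output). Escalates on low confidence or malformed cheap output."""
    label, confidence = classify(request)
    if label == "simple" and confidence >= min_confidence:
        output = cheap_model(request)
        if passes_structural_check(output):
            return "cheap", output
    # Hard request, uncertain classifier, or failed structural check: escape hatch.
    return "expensive", expensive_model(request)


# Hypothetical stubs standing in for the classifier and the two model tiers.
def classify(request):
    return ("simple", 0.95) if len(request) < 40 else ("complex", 0.9)


def cheap_model(request):
    return '{"answer": "ok"}' if "refund" in request else "I'm not sure"


def expensive_model(request):
    return '{"answer": "detailed"}'


print(route("refund status?", classify, cheap_model, expensive_model))
print(route("hello", classify, cheap_model, expensive_model))
```

The second call shows the escalation path: the classifier is confident, but the cheap output is not valid JSON, so the request falls through to the expensive tier.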
Common follow-ups: - "How do you train the router?" - "What if the classifier itself is wrong?" - "How do you handle the case where the cheap model refuses?"
Traps: - Building a fancy router with a frontier model — the router cost cancels the savings. - No fallback path. Every router must have an escape hatch to the expensive tier. - Skipping the calibration phase. Routing thresholds tuned without real production traffic are guesses.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/12_model_vendor_strategy/
Q: "When would you route to a cheaper model vs always use the strongest?"¶
Tags: mid · common · scenario · source: AI engineer cost-tradeoff probe; reported in multiple 2026 senior loops
Answer outline: - Default to "always use the strongest" only at low scale (<10k requests/day) or for high-stakes, single-call interactions (legal, medical, irreversible actions). - Route to cheaper models when (a) request volume creates real cost pressure, (b) the task has a clear difficulty distribution, (c) you have evals to detect routing failures, (d) the latency of the cheap model is materially better. - A senior tell: candidate names the signal used to route — request length, presence of specific keywords (code, math, image), conversation depth, intent classifier confidence — not just "we route by classifier". - The other senior tell: candidate names the quality bar — e.g., "the cheap tier must achieve ≥95% agreement with the expensive tier on a 500-example eval before we route any traffic to it." - Numbers to drop: "5-10× cost reduction per routed call", "quality delta typically 1-3% on most tasks if calibrated", "cheap-tier latency advantage: 2-5× faster TTFT"
Common follow-ups: - "What happens at the boundary — requests right at the threshold?" - "How do you A/B routing changes safely?"
Traps: - Routing without a calibration eval set. - Routing on a single signal (e.g., request length only). Real-world prompts don't split cleanly by length.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "How do you decide between a frontier API model and a self-hosted open-weight model?"¶
Tags: senior · common · design · source: standard senior cost question; reported in 2026 AI infra interviews
Answer outline:
- Frontier API wins for low-to-medium volume, fast iteration, edge cases that need the strongest reasoning, and any team that doesn't have ML infra ops capacity.
- Self-hosted wins at high volume (typically >50k requests/day where the per-call savings amortize ops overhead), strict data residency / privacy requirements, latency targets where round-trip to an API provider is too slow, and predictable workloads where you can keep GPUs hot.
- Math: at $4/M input + $20/M output (frontier tier), a 1k-input/200-output call costs ~$0.008. Self-hosted on H100 with a 7B model at ~$3/hour amortized: ~$0.0005 per call at decent utilization. Break-even depends on utilization: at 30% GPU utilization you're paying full GPU cost for partial throughput.
- Hidden costs of self-hosted: GPU ops, model upgrades, quantization, vLLM tuning, on-call. Budget 1-2 FTE-engineers for any serious self-hosted production stack.
- Hybrid: self-host the hot path (the cheap router tier), use frontier API for the slow path (the expensive fallback). Common 2026 architecture.
- Numbers to drop: "frontier API: $0.001-0.05/call typical. Self-hosted 7B: $0.0001-0.001/call at high utilization, but FTE ops cost is real."
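The break-even arithmetic can be worked through in a few lines. This is a sketch under stated assumptions: the peak throughput of 6,000 calls per GPU-hour is chosen only so that $3/hour reproduces the ~$0.0005/call figure above, and the $20,000/month ops cost is a hypothetical placeholder for the FTE overhead.

```python
def api_cost_per_call(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Per-call API cost in dollars; prices are per million tokens."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6


def self_hosted_cost_per_call(gpu_dollars_per_hour, peak_calls_per_hour, utilization):
    """The GPU is paid for the whole hour; only `utilization` of peak throughput is served."""
    return gpu_dollars_per_hour / (peak_calls_per_hour * utilization)


api = api_cost_per_call(1000, 200, 4.0, 20.0)
hosted_full = self_hosted_cost_per_call(3.0, 6000, 1.0)   # assumed peak throughput
hosted_30 = self_hosted_cost_per_call(3.0, 6000, 0.3)     # same GPU at 30% utilization

# Hypothetical fixed ops overhead (people, on-call) for the self-hosted stack.
ops_dollars_per_month = 20_000
breakeven_per_day = ops_dollars_per_month / (api - hosted_30) / 30

print(f"API ${api:.4f}/call, self-hosted ${hosted_full:.4f} (100%) vs ${hosted_30:.4f} (30%)")
print(f"break-even at roughly {breakeven_per_day:,.0f} calls/day")
```

The point of the sketch is the shape of the calculation: utilization divides the effective throughput, and the fixed ops cost sets the volume below which the API stays cheaper.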
Common follow-ups: - "What's your break-even volume estimate?" - "How do you handle base-model upgrades on self-hosted?" - "When would you reverse this decision?"
Traps: - Forgetting FTE ops cost on self-hosted. The GPU bill is only half of the actual cost. - Comparing $/token without comparing quality on your specific eval.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/12_model_vendor_strategy/, learning/02_ai_infrastructure/02_inference_serving_systems/
Caching¶
Q: "Walk me through caching for LLM applications."¶
Tags: mid · very-common · design · source: standard senior AI loop opener; Mavik Labs / Morph cost-optimization 2026 guides
Answer outline:
- Two layers, both required at scale. (1) Provider-side prompt caching: the API provider stores the KV cache of your static prompt prefix (system prompt + few-shot examples + tool definitions); subsequent calls hit the cache for the prefix and only pay full price for the new suffix. (2) Application-side response caching: you store the full response keyed by a request hash and return it instantly without calling the LLM at all.
- Provider-side prompt caching is the bigger win for agent-style workloads where the same 5-20k-token system prompt is replayed every turn. Anthropic, OpenAI, and Google all offer it. Anthropic charges 0.1× input on cache hits = 90% savings.
- Application-side caching has two flavors: exact-match (hash the request) and semantic (embed the request, look up similar embeddings, return the cached answer if similarity is above a threshold). Semantic caching catches paraphrases (the same question worded differently) but introduces false-positive risk.
- TTL is the catch. Anthropic's prompt cache TTL is 5 minutes by default (1 hour available at 2× write cost). Application cache TTLs are yours to set: minutes to hours for FAQ, longer for stable docs.
- Cache hit rate is the senior metric. Agent stacks with stable system prompts typically achieve a 60-80% cache hit rate on the prompt prefix.
- Numbers to drop: "Anthropic 5-min TTL cache: 1.25× write cost, 0.1× hit cost. 1-hour TTL: 2× write. Cache hit rate target: 60-80% for agents."
Common follow-ups: - "How does semantic caching work?" - "What's the failure mode of semantic caching?" - "How do you set TTL for prompt caching?"
Traps: - Treating "caching" as one thing. Provider-side prefix caching and application-side response caching are different problems with different trade-offs. - Skipping cache invalidation. If your system prompt changes, the cache must invalidate or you serve stale instructions. - Forgetting that prompt caching's structure matters — the cacheable prefix must be identical byte-for-byte across calls.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "Explain semantic caching. When does it bite you?"¶
Tags: senior · common · conceptual · source: AWS LLM caching blog 2026; standard senior ops probe
Answer outline: - Semantic caching embeds the incoming query, searches a vector index of previous queries, and if a previous query's embedding has cosine similarity above a threshold (typical: 0.9-0.97), returns that cached answer instead of calling the LLM. - Win: handles paraphrases. "What's the refund policy?" and "How do I get my money back?" hit the same cache entry. Exact-match caches miss this. - Bite cases: (1) semantically similar questions with materially different answers — "what's the price of plan A" vs "what's the price of plan B" can have near-identical embeddings; (2) time-sensitive questions where the cached answer is stale; (3) personalized answers where the same question deserves different answers per user. - Mitigation: include the user/tenant/context in the cache key; set a low TTL on time-sensitive intents; use a higher similarity threshold for sensitive intents; never semantic-cache anything with PII or user-specific data without scoping the key. - Eval discipline: track cache false-positive rate — how often the cached answer is wrong for the new query. Should be <1% to ship. - Numbers to drop: "similarity threshold: 0.95 default, 0.97 for higher precision", "semantic cache hit rate: 30-60% for repeat-question workloads", "false-positive target: <1%"
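A minimal in-memory sketch of the lookup described above, with the cache scoped per tenant. Embeddings are passed in as plain vectors so the example stays self-contained; the `SemanticCache` class and its linear scan are illustrative stand-ins for a real embedding model and vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = {}  # tenant_id -> list of (embedding, answer)

    def put(self, tenant_id, embedding, answer):
        self.entries.setdefault(tenant_id, []).append((embedding, answer))

    def get(self, tenant_id, embedding):
        """Return the best cached answer for this tenant, or None below the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries.get(tenant_id, []):
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put("tenant-1", [1.0, 0.0], "Refunds are processed within 5 days.")

hit = cache.get("tenant-1", [0.99, 0.05])       # near-paraphrase, same tenant
other_tenant = cache.get("tenant-2", [0.99, 0.05])  # same query, different tenant
dissimilar = cache.get("tenant-1", [0.6, 0.8])     # same tenant, different question
```

Scoping the lookup by tenant before comparing embeddings is what keeps one tenant's answer from being served to another, regardless of how similar the queries are.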
Common follow-ups: - "How do you measure false-positive rate?" - "What about embedding drift over time?"
Traps: - Setting the threshold too low (0.85-0.9): false positives go up fast. - Not scoping the cache key by user/tenant/context. A single global cache leaks answers across users. - Caching tool-use responses semantically. The tool might have side effects; serving a cached tool result is dangerous.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "Your prompts contain per-user data, which is killing your cache hit rate. How do you cache without breaking personalization?"¶
Tags: senior · common · design · source: 2026 cost-optimization loop; standard prompt-caching probe
Answer outline:
- Diagnose the two cache layers separately — personalization breaks each differently.
- Prefix (prompt) cache: providers cache the static prefix and charge ~0.1× on a hit. Per-user data injected near the top invalidates the whole prefix. Fix is ordering: static system prompt + tool defs + few-shot at the top (cacheable), user-specific context at the very end (the small uncached suffix). You keep the 90% discount on the big static block and pay full price only on the personalized tail.
- Response cache: don't key on the raw prompt. Key on (normalized_intent, user_or_tenant_id, relevant_context_hash). Same question, different user → different key → no cross-user leak. Same user, same intent → hit.
- Semantic cache: scope the vector key by user/tenant and exclude PII from the embedded text; raise the threshold for personalized intents.
- What's genuinely uncacheable (truly user-specific generative answers) shouldn't be forced into a cache — route it to a cheaper tier instead. Caching isn't the only cost lever.
- The senior tell: name the privacy failure mode — an unscoped cache key serves one user's answer to another.
- Numbers to drop: "prefix cache: 0.1× input on hit = 90% off the static block", "put the ~5-20k static tokens first, per-user tail last", "response key = intent + tenant + context hash", "never a single global hash table"
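The ordering and keying rules above can be sketched in a few lines. The function names, the prompt layout, and the example tenants are hypothetical; the point is only that the static block comes first and the response key includes the tenant.

```python
import hashlib
import json


def build_prompt(static_prefix, user_context, user_message):
    """Static block first (cacheable prefix), per-user data last (uncached tail)."""
    return f"{static_prefix}\n\n{user_context}\n\n{user_message}"


def response_cache_key(normalized_intent, tenant_id, relevant_context):
    """Key on intent + tenant + context, never on the raw prompt alone."""
    payload = json.dumps(
        {"intent": normalized_intent, "tenant": tenant_id, "context": relevant_context},
        sort_keys=True,  # stable key regardless of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


STATIC_PREFIX = "You are a support agent.\n[tool definitions]\n[few-shot examples]"

prompt_a = build_prompt(STATIC_PREFIX, "plan: pro", "What is my renewal date?")
prompt_b = build_prompt(STATIC_PREFIX, "plan: free", "What is my renewal date?")

key_a = response_cache_key("renewal_date", "tenant-a", {"plan": "pro"})
key_b = response_cache_key("renewal_date", "tenant-b", {"plan": "pro"})
```

Both prompts share a byte-identical prefix, so a provider-side prefix cache can serve either user, while the two response keys differ because the tenant differs.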
Common follow-ups: - "How much of your prompt is actually static?" (system + tools + few-shot — usually the majority for agent workloads) - "What's the cache key for a logged-in vs anonymous user?"
Traps: - Putting personalization at the top of the prompt, silently destroying the prefix cache. - A global response cache with no user scoping — both a correctness and a privacy bug. - Semantic-caching anything containing PII.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/03-prompt-caching.md
Q: "How does Anthropic prompt caching work and how do you maximize its hit rate?"¶
Tags: mid · common · conceptual · source: Anthropic prompt caching docs; DEV.to 2026 post on 5-minute TTL change; standard 2026 cost-optimization probe
Answer outline:
- Mechanism: Anthropic stores the KV cache for prefix segments of your prompt marked with cache_control. When a subsequent request comes in with the exact same prefix (byte-identical), it reuses the cached KV state instead of recomputing.
- Cost: writing to cache costs 1.25× normal input for the 5-minute TTL tier, 2× for the 1-hour tier. Reading from cache costs 0.1× normal input — a 90% discount on the cached prefix.
- TTL: the default is 5 minutes. Per the DEV.to post cited in the tags, Anthropic shortened this from 60 minutes in early 2026; for many production workloads that single change raised effective costs 30-60% unless the workload was restructured around the new TTL.
- Maximize hit rate by: (1) pinning stable content in the cacheable prefix (system prompt, tool definitions, knowledge-base context) and putting the dynamic user message after the cache breakpoint; (2) keeping the cache warm: each hit refreshes the TTL, so when gaps between calls in a session can approach 5 minutes, replay the cached prefix at least every 4 minutes; (3) using the 1-hour TTL for long-running agent sessions where the 2× write cost amortizes.
- The cache key is the byte-identical prefix. A single whitespace change invalidates the cache from that point onward.
- Numbers to drop: "5-min TTL: 1.25× write / 0.1× hit. 1-hour TTL: 2× write / 0.1× hit. Hit rate target: 60-80% for agents with stable prompts. Replay cadence: every 4 min for 5-min TTL."
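The write and read multipliers above imply a simple break-even rule, sketched here. It assumes one cache write followed by hits on every later call within the TTL, which is the best case; real hit rates will be lower.

```python
def relative_prefix_cost(calls, write_mult, read_mult=0.1):
    """Cost of the cached prefix over `calls` requests, relative to no caching.

    1.0 means the same cost as sending the prefix uncached every time.
    Assumes one cache write, then a hit on every subsequent call.
    """
    cached_total = write_mult + (calls - 1) * read_mult
    return cached_total / calls


for calls in (1, 2, 3, 10):
    five_min = relative_prefix_cost(calls, write_mult=1.25)
    one_hour = relative_prefix_cost(calls, write_mult=2.0)
    print(f"{calls:>2} calls: 5-min tier {five_min:.3f}x, 1-hour tier {one_hour:.3f}x")
```

Under these assumptions the 5-minute tier pays for itself on the second call, while the 1-hour tier needs a third call before it beats sending the prefix uncached.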
Common follow-ups: - "Why did Anthropic drop the TTL?" - "How would you architect an agent to maximize cache hits?" - "What invalidates the cache silently?"
Traps: - Putting dynamic content (the user message, timestamps, per-request IDs) ahead of the static system prompt and tool definitions, which kills the cache. - Not realizing that even minor formatting changes (extra newline, different bullet style) invalidate the cache. - Using the 1-hour TTL when the workload doesn't justify the 2× write cost.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Batching¶
Q: "When would you use the Batch API vs real-time inference?"¶
Tags: mid · common · scenario · source: standard cost-optimization scenario; Anthropic/OpenAI batch API docs 2026
Answer outline:
- Batch API: send a batch of requests, get results back within 24 hours, pay 50% of normal price. Both Anthropic and OpenAI offer this in 2026.
- Use it for any non-interactive work: overnight evaluation runs, document processing pipelines, embedding generation, offline classification, eval grading, summarization of historical content, fine-tune dataset cleaning.
- Skip it for anything user-facing, anything in an agent's inner loop, and anything where the result blocks downstream work in <24h.
- Batch stacks with caching. Batch + cached prefix on the API provider = up to 95% total savings on the cached portion (50% batch price × 0.1× cache-read price = 0.05× list price).
- The architectural pattern: route requests by SLA at ingress. SLA <1s → real-time. SLA of hours (under 24h) → your own batch queue on the real-time API, because the Batch API only promises completion within 24 hours. SLA ≥24h → Batch API + further offline optimization.
- Numbers to drop: "Batch API: 50% off, 24h SLA. Batch + cache stack: up to 95% savings. Typical use: eval grading at 10k-1M examples."
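A sketch of SLA-based lane selection at ingress, plus the arithmetic behind the "up to 95%" figure. The lane names and the 60-second interactive cutoff are hypothetical choices for illustration; only the 24-hour window, the 50% batch discount, and the 0.1× cache-read multiplier come from the text above.

```python
BATCH_WINDOW_SECONDS = 24 * 3600   # Batch API completion window
INTERACTIVE_CUTOFF_SECONDS = 60    # hypothetical cutoff for "someone is waiting"


def choose_lane(deadline_seconds):
    """Pick a processing lane from how long the caller can wait for the result."""
    if deadline_seconds < INTERACTIVE_CUTOFF_SECONDS:
        return "real-time"
    if deadline_seconds < BATCH_WINDOW_SECONDS:
        return "own-queue"   # async worker calling the real-time API
    return "batch-api"


def stacked_input_multiplier(batch_discount=0.5, cache_read_mult=0.1):
    """Price multiplier on cached input tokens when batch and cache discounts stack."""
    return (1 - batch_discount) * cache_read_mult


print(choose_lane(1), choose_lane(3600), choose_lane(48 * 3600))
print(f"cached prefix billed at {stacked_input_multiplier():.2f}x list price")
```

The multiplier applies only to the cached portion of the input; uncached input and output tokens in a batch still get the 50% discount alone.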
Common follow-ups: - "What's the failure mode of mixing real-time and batch in the same product?" - "How would you redesign a feature to use batch?"
Traps: - Trying to use Batch for user-facing requests. The 24h SLA is hard. - Forgetting to handle batch failures and retries — partial-batch failure is a real operational concern.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "What's continuous batching and how does it differ from static batching?"¶
Tags: senior · common · conceptual · source: vLLM / Anyscale continuous batching blog; Spheron LLM serving optimization 2026; reported in inference-infra interviews
Answer outline: - Static batching: collect N requests, run them all to completion in parallel, return all results at once. The whole batch waits for the longest sequence to finish. GPU utilization tanks on variable-length outputs because most sequences finish early and the GPU compute is wasted on padding. - Continuous batching (in-flight batching, iteration-level scheduling): at every token step, the scheduler evicts finished sequences and slots new ones into their freed slots. No request waits for the slowest one. Used by vLLM, TGI, TensorRT-LLM. - Combined with PagedAttention (KV cache stored in non-contiguous physical blocks like OS paging), vLLM achieves up to 23× throughput improvement over naive serving while reducing p50 latency. - The trade-off: continuous batching prefers many medium-length requests over a few very long ones. Very long-context requests can crowd out the cache and degrade throughput. - Numbers to drop: "vLLM continuous batching + PagedAttention: 23× throughput, lower p50", "speculative decoding stacked on top: 1.5-2.8× additional speedup"
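The scheduling difference can be seen in a toy simulation that counts decode steps. It is a deliberately simplified sketch: one token per active sequence per step, no prefill cost, and no KV cache memory limit, so it illustrates the scheduling idea rather than any engine's actual behavior.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Every step decodes one token per active slot; freed slots are refilled at once."""
    queue = list(output_lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Sequences with one token left finish on this step and release their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [10, 2, 2, 2, 10, 2, 2, 2]  # two long outputs among many short ones
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In the static case both batches are held hostage by their one long sequence; in the continuous case the short sequences drain quickly and the second long sequence starts as soon as a slot frees up.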
Common follow-ups: - "What's PagedAttention?" - "Why is continuous batching memory-bound, not compute-bound?" - "How would you tune max batch size?"
Traps: - Calling continuous batching "the same as dynamic batching" — dynamic batching usually means request-level dynamic batching, which still suffers from the longest-sequence-blocks-all problem. Continuous batching operates at the token level. - Forgetting the KV cache memory pressure. Many concurrent sequences with long contexts will OOM before they max out compute.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "What is PagedAttention?"¶
Tags: senior · common · conceptual · source: vLLM SOSP 2023 paper; Spheron 2026 LLM serving guide; reported in inference-infra interviews
Answer outline: - PagedAttention is the KV cache memory management algorithm introduced by vLLM. It maps each sequence's KV cache through a logical block table to non-contiguous physical blocks in GPU memory, applying the OS virtual-memory paging principle to attention. - Problem it solves: naive KV cache stores per-sequence KV in contiguous GPU memory pre-allocated to the maximum sequence length. Fragmentation is enormous — a sequence of 100 tokens occupies a block sized for 2048. Memory waste up to 60-80% on variable workloads. - Solution: chop the KV cache into fixed-size pages (block_size=16 typical). Each sequence gets a block table mapping logical positions to physical pages. Pages can be shared (for prefix caching across sequences with the same system prompt), garbage-collected, and allocated lazily. - Combined with continuous batching, this is what gives vLLM its order-of-magnitude throughput win over naive HuggingFace generate loops. - Trade-off: implementation complexity. The attention kernel needs to read non-contiguous pages, requiring a custom CUDA kernel. - Numbers to drop: "block_size=16 typical", "KV cache fragmentation cut from 60-80% to <4%", "near-linear scaling with batch size up to KV cache memory limit"
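The block-table idea can be illustrated with a toy allocator. This is a conceptual sketch only: the `BlockAllocator` class is hypothetical, it tracks block ids rather than real GPU memory, and it omits sharing, freeing, and the attention kernel itself.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, matching the typical value quoted above


def wasted_fraction_contiguous(num_tokens, max_seq_len):
    """Share of a max-length contiguous reservation that holds no tokens."""
    return 1 - num_tokens / max_seq_len


def wasted_fraction_paged(num_tokens, block_size=BLOCK_SIZE):
    """With paging, only the unfilled tail of the last block is wasted."""
    allocated = math.ceil(num_tokens / block_size) * block_size
    return 1 - num_tokens / allocated


class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order

    def append_token(self, seq_id, position):
        """Allocate lazily: a new block is taken only when a boundary is crossed."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE


alloc = BlockAllocator(num_physical_blocks=8)
for pos in range(16):
    alloc.append_token("a", pos)       # fills sequence a's first block
alloc.append_token("b", 0)             # sequence b takes the next free block
slot = alloc.append_token("a", 16)     # a's second block is not adjacent to its first
```

After these calls sequence `a` owns two physical blocks that are not neighbors, which is exactly the indirection the block table exists to hide from the rest of the engine.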
Common follow-ups: - "Why is the KV cache so memory-hungry?" - "What other inference engines implement this idea?" - "How does prefix-cache sharing work across requests?"
Traps: - Conflating PagedAttention with continuous batching. They're complementary: continuous batching is the scheduling trick, PagedAttention is the memory trick.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/02_tokens_embeddings_context/
Speculative decoding¶
Q: "What is speculative decoding and how does it speed up inference?"¶
Tags: senior · common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026; NVIDIA Technical Blog; vLLM blog
Answer outline:
- Speculative decoding uses a small, cheap draft model to propose K candidate next tokens, then verifies them in parallel with the large target model in a single forward pass.
- If the target model accepts the first M ≤ K drafted tokens, that single target pass yields M + 1 tokens: the M accepted tokens plus one token the target produces itself at the first rejected position (or after the last draft token if all K are accepted). The cost is 1 target forward pass plus K cheap draft passes.
- The arithmetic: when draft acceptance rate is ~70-80%, you get 1.5-2.5× speedup on the decode phase. vLLM reports up to 2.8× on summarization, up to 2.5× with Eagle 3.
- It works because LLM decode at low batch sizes is memory-bandwidth-bound, not compute-bound. Verifying K candidate tokens in one forward pass costs only marginally more than verifying one, because most of the cost is loading the weights.
- Caveats: speculative decoding helps decode, not prefill. It's not a universal speedup: in high-QPS, compute-bound settings the extra speculation work can actually slow you down. Tune speculation length (K=3-8 typical) per workload.
- Variants: draft-model speculative (use a 1B model as draft for a 70B target), self-speculative (use early layers as the draft), prompt-lookup decoding (use the prompt itself as draft, great for summarization), Medusa / Eagle (lightweight draft heads attached to the target model instead of a separate draft model).
- Numbers to drop: "draft model size: 5-15% of target", "K=3-8 typical", "acceptance rate goal: ≥70%", "speedup: 1.5-2.8× on decode-heavy workloads"
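The relationship between acceptance rate, K, and speedup can be estimated with the standard expected-value formula from the speculative decoding literature. The sketch assumes each drafted token is accepted independently with the same probability, which is a simplification, and `draft_cost_ratio` (cost of one draft pass relative to one target pass) is an input you would have to measure.

```python
def expected_tokens_per_target_pass(acceptance_rate, k):
    """Expected tokens produced per target forward pass when drafting k tokens.

    Counts the accepted draft tokens plus the one token the target always
    contributes, assuming independent per-token acceptance.
    """
    if acceptance_rate == 1.0:
        return k + 1
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate, k, draft_cost_ratio):
    """Tokens per step divided by the relative cost of one speculative step."""
    step_cost = k * draft_cost_ratio + 1  # k draft passes + 1 target pass
    return expected_tokens_per_target_pass(acceptance_rate, k) / step_cost


for k in (2, 4, 8):
    high = estimated_speedup(0.8, k, draft_cost_ratio=0.05)
    low = estimated_speedup(0.3, k, draft_cost_ratio=0.05)
    print(f"K={k}: acceptance 0.8 -> {high:.2f}x, acceptance 0.3 -> {low:.2f}x")
```

The model ignores batch-size effects, so it only describes the low-batch, memory-bound regime; it does show why a longer speculation length pays off at high acceptance and backfires at low acceptance.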
Common follow-ups: - "When does it not help?" - "How do you pick the draft model?" - "Why is acceptance rate so important?"
Traps: - Claiming speculative decoding always helps. In compute-bound (high-batch) regimes it can hurt. - Confusing the draft model with the target. The draft's only job is to predict tokens fast; the target's job is to verify and stay correct. - Forgetting that the output distribution is identical to the target model's at any sampling temperature, provided the rejection sampling is implemented correctly (under greedy decoding the tokens match exactly, up to floating-point nondeterminism). No quality trade-off, only speed.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "Walk me through tuning speculative decoding for a chat application."¶
Tags: staff · occasional · scenario · source: AWS Trainium speculative decoding blog 2026; standard staff-level inference probe
Answer outline:
- Step 1: confirm you're memory-bandwidth-bound. Profile target model decode. If GPU SM utilization is low (<30-40%) while memory bandwidth utilization is high, you're a candidate.
- Step 2: pick the draft. Same model family is best (Llama-1B drafts for Llama-70B target). The draft should be small enough that drafting K tokens costs ≪ one target forward pass.
- Step 3: calibrate K. Start at K=4. A higher K helps when acceptance is high and shrinks the gain when acceptance is low, because rejected draft tokens are wasted work. Sweep K ∈ {2, 4, 8} on a representative eval and pick the elbow.
- Step 4: measure on your workload. Chat with short prompts and short responses sees less benefit than long-document Q&A or summarization (more decode tokens to amortize the draft).
- Step 5: watch the batch-size interaction. At low batch (B=1-4), speculative decoding shines. At high batch (B=64+), the extra draft work can hurt, so disable speculation above your QPS crossover point.
- Step 6: if available, try EAGLE / Medusa as alternatives. They often outperform draft-model speculative on chat workloads because there's no extra model to host.
- Numbers to drop: "acceptance rate target: 70-80%", "K=4 default", "expected speedup: 1.5-2× on B=1 decode-heavy workloads, at most ~1.2× (and sometimes a net slowdown) at high batch"
Common follow-ups: - "What changes for batch >32?" - "How does Eagle differ from draft-model speculative?" - "Where would speculative decoding hurt?"
Traps: - Enabling speculation universally without measuring batch-size crossover. - Picking a draft from a completely different model family — acceptance rate craters.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/02_ai_infrastructure/02_inference_serving_systems/
Latency engineering¶
Q: "What's the difference between TTFT, TPOT, and end-to-end latency?"¶
Tags: mid · very-common · conceptual · source: Anyscale LLM latency metrics docs; BentoML LLM inference handbook 2026
Answer outline:
- TTFT (Time to First Token): from request submission to the first generated token landing on the client. Dominated by prefill, i.e. processing the input prompt through the model. Scales with prompt length.
- TPOT (Time Per Output Token): the average time between successive tokens during streaming. Dominated by decode, one forward pass per token. Memory-bandwidth-bound on most hardware.
- End-to-end (E2E) latency: full request duration. E2E ≈ TTFT + TPOT × output_token_count (strictly output_token_count − 1, since TTFT already covers the first token).
- They optimize differently. TTFT improves with shorter prompts, prompt caching, faster prefill kernels (e.g. FlashAttention), and a better network path; larger batches raise throughput but tend to lengthen TTFT. TPOT improves with smaller models, quantization, speculative decoding, and better serving hardware (H100 → H200 → B200).
- For UX, the rule of thumb: TTFT p95 < 500ms feels instant, TPOT < 50ms feels conversational. Both matter; users perceive TTFT first and TPOT throughout.
- Numbers to drop: "TTFT p95 target: <500ms for chat", "TPOT p95 target: <50ms for streaming chat", "Claude Haiku 4.5: ~600ms TTFT on medium prompts (2026)"
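The E2E relationship is easy to check with numbers. The sketch below uses the slightly stricter form in which TTFT already accounts for the first token, so only the remaining tokens each cost one TPOT interval; the example values are the rule-of-thumb targets from above, not measurements.

```python
def e2e_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """End-to-end latency: first token via TTFT, every later token costs one TPOT."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)


short_answer = e2e_latency_ms(ttft_ms=500, tpot_ms=50, output_tokens=20)
long_answer = e2e_latency_ms(ttft_ms=500, tpot_ms=50, output_tokens=300)
print(short_answer, long_answer)
```

Even with both metrics at target, a 300-token answer takes over 15 seconds end to end, which is why streaming and output-length caps matter as much as TTFT.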
Common follow-ups: - "Which would you optimize first for a customer-support chatbot?" - "How does prompt caching affect TTFT vs TPOT?"
Traps: - Conflating "latency" with end-to-end. Users react to TTFT first; long responses are tolerable if the first token arrives fast. - Forgetting that streaming helps perceived latency without changing E2E.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "Your p99 latency spiked while p50 is flat. What's happening?"¶
Tags: senior · common · debugging · source: Anyscale LLM latency benchmarking; standard senior production-debug probe; Medium 'P99 Problem' post 2026
Answer outline:
- Classic "tail-at-scale" signal. The median user is fine; the worst 1% is suffering. Causes are almost always at the infrastructure layer, not the model layer.
- Top suspects in priority order:
    - Queue contention during traffic bursts: continuous batching has a max batch size; over capacity, requests queue. Fix: autoscale, add a backpressure mechanism, drop the lowest-priority traffic.
    - Long-context outliers: a single 100k-token request can monopolize the KV cache and slow all others. Fix: gate on max input length, isolate long-context to a separate pool.
    - Slow downstream tool calls: in agent workflows, a single tool timeout pushes the whole trajectory to p99. Fix: per-tool timeouts, parallel calls where possible.
    - GC pauses / driver stalls on the serving host. Less common but worth checking.
    - Cache misses on cold tenants: the p50 user has a warm prompt cache; the p99 is hitting cold-start every time. Fix: pre-warm, lengthen TTL.
- Investigation steps: pull the latency histogram, identify the modal failure (length-driven? tenant-driven? time-of-day?), then trace one specific p99 request through the pipeline.
- The rule of thumb: p50 is for dashboards, p95 is for SLAs, p99 is for sleeping at night. A flat p50 with a spiking p99 means your system has a tail problem, not a model problem.
- Numbers to drop: "p99/p50 ratio: 3-5× is normal, 10× means something is broken", "track p99 by tenant, by input length, by time bucket"
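Slicing p99 by tenant (or input length, or time bucket) is the step that turns "the tail is bad" into a specific suspect. A minimal sketch, assuming request logs are available as dictionaries with a `latency_ms` field; the record shape and the synthetic data are hypothetical.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def p99_by(records, key):
    """Group latency records by `key` and compute p99 per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["latency_ms"])
    return {group: percentile(latencies, 99) for group, latencies in groups.items()}


# Synthetic log: tenant "b" has a couple of very slow requests.
records = (
    [{"tenant": "a", "latency_ms": 200} for _ in range(100)]
    + [{"tenant": "b", "latency_ms": 200} for _ in range(98)]
    + [{"tenant": "b", "latency_ms": 5000} for _ in range(2)]
)

overall_p50 = percentile([r["latency_ms"] for r in records], 50)
tail = p99_by(records, "tenant")
print(overall_p50, tail)
```

The overall p50 stays flat at 200ms while the per-tenant view shows the entire tail belongs to one tenant, which is the flat-median, spiking-tail pattern from the question.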
Common follow-ups: - "How would you instrument to catch this faster?" - "What's the first dashboard you'd check?"
Traps: - Blaming "the model" — the model latency distribution is usually narrow; the variance is at the infra layer. - Reaching for fixes before identifying which slice of traffic is in the tail.
Related cross-cutting: Production patterns, Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "Your p95 end-to-end latency is 4s and the target is 1s. What do you change?"¶
Tags: senior · very-common · scenario · source: 2026 latency-design loop
Answer outline:
- Don't guess. Decompose the 4s into stages first: network + auth, retrieval (embed query + ANN + rerank), prefill (TTFT), decode (TPOT × output length), tool calls, post-processing. You can't cut what you haven't attributed. Trace one slow request end to end.
- Then attack the dominant stage:
    - Decode dominates (long outputs): cap/shorten output, stream so perceived latency drops, smaller model on the hot path, speculative decoding.
    - Prefill/TTFT dominates (long prompt): prompt caching for the static prefix, shorten/compress context, dynamic assembly.
    - Retrieval dominates: cut rerank candidate count, tune ANN params, cache embeddings, parallelize retrieve + first model stage.
    - A tool call dominates: parallelize independent calls, timeout + fallback, cache tool results.
- Perceived vs actual: stream the first token. TTFT < 500ms feels instant even if E2E is 2s. Often you ship streaming + a TTFT cut and the "4s" complaint disappears without ever hitting 1s E2E.
- Set the budget explicitly: e.g. retrieval 300ms / TTFT 500ms / decode 1.2s / overhead 200ms, and hold each stage to it. That example totals 2.2s E2E, so it meets a 1s target only on perceived latency (first token at roughly 1s with streaming); a hard 1s E2E target needs a tighter decode budget.
- The senior tell: separate perceived from actual latency, and decompose before optimizing.
- Numbers to drop: "TTFT p95 <500ms feels instant", "streaming changes perception, not E2E", "speculative decoding ~2× decode on accept", "prompt cache cuts TTFT on long static prefixes"
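The decompose-then-attack approach can be sketched as a comparison of one traced request against a per-stage budget. The stage names mirror the example budget above; the traced timings are hypothetical numbers chosen to add up to the 4s in the question.

```python
def overruns(stage_ms, budget_ms):
    """Return (stage, overrun_ms) for stages over budget, worst overrun first."""
    over = {
        stage: ms - budget_ms[stage]
        for stage, ms in stage_ms.items()
        if stage in budget_ms and ms > budget_ms[stage]
    }
    return sorted(over.items(), key=lambda item: item[1], reverse=True)


# Hypothetical trace of one slow request (sums to 4000 ms).
trace = {"retrieval": 600, "ttft": 900, "decode": 2300, "overhead": 200}
budget = {"retrieval": 300, "ttft": 500, "decode": 1200, "overhead": 200}

dominant_stage = max(trace, key=trace.get)
worst_first = overruns(trace, budget)
print(dominant_stage, worst_first)
```

Here decode is both the dominant stage and the largest overrun, so output length, streaming, and decode speed would be the first things to look at, ahead of retrieval tuning.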
Common follow-ups: - "Which stage would you instrument first?" - "Can you hit 1s without swapping to a smaller model?"
Traps: - Reaching for a smaller model before attributing where the 4s actually goes. - Ignoring streaming — the cheapest perceived-latency win there is. - Optimizing p50 when the complaint is the p95/p99 tail.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/02-latency-anatomy.md, learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "How do you cut TTFT for a long-context RAG application?"¶
Tags: senior · common · design · source: standard senior RAG-cost probe; reported across 2026 AI infra loops
Answer outline: - TTFT for long-context is dominated by prefill — the model processing your retrieved documents. The bigger the context, the longer the user waits before seeing anything. - Lever 1 — prompt caching on the static portion. If your system prompt + tool definitions are stable, cache them; that prefix processes once instead of every call. Saves ~20-40% of prefill time depending on what fraction is cacheable. - Lever 2 — retrieval-side compression. Don't dump 20 chunks if 5 will do. Use a reranker to pick top-K, then squeeze K with a cheap summarizer if needed. Cuts prefill roughly linearly with token count. - Lever 3 — streaming the response. TTFT is when the first token arrives. Streaming does not shorten prefill, but the user sees that first token the moment it exists instead of waiting for the full response. - Lever 4 — chunked prefill. Modern engines (vLLM with chunked prefill, TensorRT-LLM) split a long prefill into chunks and interleave them with other requests' decode steps. A request still cannot emit tokens until its own prompt has been processed, but one long prompt no longer stalls the whole batch, which cuts queueing delay and tail TTFT under load. - Lever 5 — smaller / faster model for the answer. If retrieval is the heavy lift, a smaller model can summarize the retrieved context fine. Routing helps here. - Lever 6 — prefetch retrieval in parallel with classification or planning. The product can show "searching..." while retrieval runs concurrently with the classification or planning call. - Numbers to drop: "20-chunk context → 5-chunk: ~75% prefill reduction", "chunked prefill: 30-50% TTFT improvement at long contexts", "prompt cache hit: 20-90% TTFT improvement depending on cache hit fraction"
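To make the prefill arithmetic concrete, the sketch below counts the tokens the model must process before the first output token, under the simplifying assumptions that prefill time scales roughly linearly with token count and that a cached prefix costs nothing to re-process. The prompt and chunk sizes are hypothetical.

```python
def prefill_tokens(static_prefix: int, chunks: int, tokens_per_chunk: int,
                   prefix_cached: bool = False) -> int:
    """Tokens that must be processed before the first output token."""
    static = 0 if prefix_cached else static_prefix
    return static + chunks * tokens_per_chunk


def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical sizes: 1,500-token system prompt + tools, 400 tokens per chunk.
baseline = prefill_tokens(1500, chunks=20, tokens_per_chunk=400)
fewer_chunks = prefill_tokens(1500, chunks=5, tokens_per_chunk=400)
cached_too = prefill_tokens(1500, chunks=5, tokens_per_chunk=400, prefix_cached=True)

print(f"rerank 20 -> 5 chunks: {reduction_pct(baseline, fewer_chunks):.0f}% fewer prefill tokens")
print(f"plus cached prefix:    {reduction_pct(baseline, cached_too):.0f}% fewer prefill tokens")
```

The retrieved context alone shrinks by 75% (8,000 to 2,000 tokens); the total shrinks less until the static prefix is also cached, which is why the two levers are worth combining.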
Common follow-ups: - "What if the user needs all 20 chunks?" - "Trade-off between fewer chunks and answer quality?"
Traps: - Optimizing TPOT when TTFT is the user-perceived problem (or vice versa). - Caching aggressively without thinking about cache key — long-context retrieval results often vary per query.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/08_rag_system_design/
Q: "Walk me through your latency budget for a voice agent."¶
Tags: staff · occasional · design · source: voice/realtime AI engineer loop probe; reported in 2026 voice-AI interviews
Answer outline: - The user-perceived latency budget for natural voice conversation is ~700-1000ms end-to-end from end-of-user-speech to start-of-agent-speech. Above ~1.2s, the interaction feels broken. - Allocate the budget: ~150ms VAD + endpoint detection, ~250ms ASR (speech-to-text), ~300-400ms LLM (TTFT + first chunk of text), ~150-200ms TTS first chunk. Total: ~850-1000ms. - LLM is where most of the budget gets eaten. Tactics: small model on the fast path, prompt caching for stable instructions, speculative decoding, streaming TTS that starts speaking on first token rather than waiting for full response. - Cheat with overlap. Start TTS on the first chunk of LLM output. Start ASR while VAD is still finishing. Latency budgets in voice are about overlap, not sequential addition. - Backchannel during slow turns: emit "let me check that" while the LLM finishes — keeps the human engaged. - Numbers to drop: "E2E voice budget: 700-1000ms", "LLM TTFT budget: 300-400ms", "small models (Haiku 4.5, gpt-4o-mini) hit this; large models often don't without speculative decoding"
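The "overlap, not sequential addition" point can be sketched numerically. The stage durations below are midpoints of the allocation in the outline; the head-start values (how much of a stage's work is already done before the previous stage finishes) are hypothetical and would come from measuring a real pipeline.

```python
# Stage durations in ms, using midpoints of the budget allocation in the outline.
STAGES_MS = {
    "vad_endpoint": 150,
    "asr": 250,
    "llm_first_chunk": 350,
    "tts_first_chunk": 175,
}


def sequential_ms(stages: dict) -> int:
    """Latency if every stage waits for the previous one to finish."""
    return sum(stages.values())


def overlapped_ms(stages: dict, head_start_ms: dict) -> int:
    """Latency when some stages start early and have part of their work done."""
    total = 0
    for name, duration in stages.items():
        head_start = head_start_ms.get(name, 0)
        if not 0 <= head_start <= duration:
            raise ValueError(f"head start for {name} must be within its duration")
        total += duration - head_start
    return total


# Hypothetical overlaps: streaming ASR runs while VAD is still deciding,
# and TTS warms up on partial LLM output.
head_starts = {"asr": 150, "tts_first_chunk": 75}
print(sequential_ms(STAGES_MS), overlapped_ms(STAGES_MS, head_starts))
```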
Common follow-ups: - "What model do you pick for the LLM in a voice agent?" - "How do you handle tool calls that take 2+ seconds?"
Traps: - Designing voice latency as sequential. The win is overlap. - Skipping streaming TTS — the LLM finishing the full response before TTS starts is a 500-1000ms penalty.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/05_ai_specializations/00_realtime_voice_agents/
Smaller models & distillation¶
Q: "Walk me through replacing a frontier API with a smaller fine-tuned model. What are the trade-offs?"¶
Tags: senior · very-common · design · source: standard senior cost-arbitrage probe; reported across 2026 AI engineer loops
Answer outline: - Goal: 50-95% cost cut on a hot path. This works only for narrow tasks — replacing GPT-4o "in general" with a 7B doesn't. - Pipeline: (1) scope the task tightly, (2) generate teacher labels by running the frontier model on 50k-500k production-like prompts, (3) clean + dedupe + spot-check (1-2% manual), (4) fine-tune a 7B or 13B open-weight base (LoRA/QLoRA), (5) eval with side-by-side against teacher on 500-1000 examples, (6) deploy behind a router with confidence-based fallback to the frontier model. - Trade-offs: you lose some generalization (the student is a specialist, not a generalist) and incur ongoing ops (GPU serving, base-model upgrade cycles, re-tuning); in return you gain latency (a small model on H100/H200 is often 5-10× faster than a frontier API round-trip). - Quality gap: on the narrow task, expect 90-97% agreement with teacher. Off-task, the student degrades hard. Route only the in-scope traffic. - Cost gap: frontier per call ~$0.005-0.05; self-hosted 7B per call ~$0.0001-0.001 at decent utilization. Break-even ~50k requests/day for most setups when you account for FTE ops cost. - Numbers to drop: "50k-500k teacher generations", "agreement target: ≥95% on in-scope eval", "$0.005/call → $0.0005/call typical cut"
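The break-even claim can be checked with a few lines of arithmetic. This sketch uses the per-call figures from the outline; the fixed monthly ops cost is a hypothetical input (chosen here so the result lands near the quoted ~50k requests/day) and should be replaced with a real estimate of GPU reservation plus engineering time.

```python
import math


def break_even_requests_per_day(frontier_cost: float, student_cost: float,
                                fixed_ops_per_month: float, days: int = 30) -> float:
    """Daily volume at which per-call savings cover the fixed ops cost."""
    saving_per_call = frontier_cost - student_cost
    if saving_per_call <= 0:
        return math.inf  # the student is never cheaper
    return fixed_ops_per_month / (saving_per_call * days)


# Per-call costs from the outline; fixed ops cost is an assumption.
FRONTIER_PER_CALL = 0.005
STUDENT_PER_CALL = 0.0005
FIXED_OPS_PER_MONTH = 6750.0  # hypothetical: share of an engineer + idle GPU capacity

volume = break_even_requests_per_day(FRONTIER_PER_CALL, STUDENT_PER_CALL, FIXED_OPS_PER_MONTH)
print(f"break-even at ~{volume:,.0f} requests/day")
```

Below that volume the frontier API is cheaper even though each call costs 10× more, which is why forgetting ops cost is listed as a trap.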
Common follow-ups: - "What's the failure mode if the router misclassifies?" - "How do you know the student is still good after the base model upgrades?" - "When would you not do this?"
Traps: - Scoping too broadly. The student must be specialized. - Skipping the fallback. Without an escape hatch, student failures bubble to users. - Forgetting FTE ops cost when computing savings.
Related cross-cutting: Cost & latency, Architecture choices, Fine-tuning vs alternatives
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/00_ai_foundation/06_adaptation_compression/
Q: "Distillation vs quantization for inference cost — which would you do first?"¶
Tags: senior · common · scenario · source: standard cost-comparison probe; reported in 2026 inference loops
Answer outline: - Quantization first. Quantization is cheap (hours to apply, no new training run), preserves model capability mostly intact (INT8 ~lossless, INT4 ~1-3% loss), and works on any model — your existing tuned model included. - Distillation is more powerful (a 5-10× smaller model is a much bigger throughput win than 4-bit weights) but requires a full training run, labeled data, and the student is specialized. - The senior answer combines them: quantize first to get the easy 2-4× speedup; if you still need more, distill to a smaller architecture and quantize that. - Decision tree: throughput need <2× → quantize, done. 2-5× → quantize + try smaller off-the-shelf model. >5× → distillation pipeline. - Numbers to drop: "INT4 quantization: 2-4× throughput, hours of work, ~1-3% quality loss. Distillation: 5-10× throughput, weeks of work, 5-15% quality loss without careful curation."
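The decision tree in the outline is simple enough to write down directly. The thresholds are the ones stated above; the function name and return strings are illustrative only.

```python
def compression_plan(required_speedup: float) -> str:
    """Map a required throughput multiple to the cheapest plan likely to reach it."""
    if required_speedup <= 0:
        raise ValueError("required_speedup must be positive")
    if required_speedup < 2:
        return "quantize"
    if required_speedup <= 5:
        return "quantize + try a smaller off-the-shelf model"
    return "distill to a smaller student, then quantize the student"


for need in (1.5, 3.0, 8.0):
    print(f"{need}x -> {compression_plan(need)}")
```

Whatever branch is taken, the plan still needs an eval gate before rollout.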
Common follow-ups: - "When does quantization saturate?" - "How do you maintain a distilled student over base-model upgrades?"
Traps: - Reaching for distillation when quantization would have sufficed. Distillation is a much bigger commitment. - Quantizing without an eval. Even INT4 can degrade by 5-10% on some models or tasks.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/00_ai_foundation/06_adaptation_compression/
Prompt-level cost control¶
Q: "Your prompt is 8000 tokens. How would you shrink it?"¶
Tags: mid · common · debugging · source: standard senior prompt-ops probe; reported across 2026 prompt-engineering loops
Answer outline: - Audit first. Print the prompt and label each section: system instructions, tool definitions, retrieved context, few-shot examples, user message. Which sections are stable across calls (cache candidates)? Which vary? - Cuts to make: - Remove verbose tool definitions — keep parameter schemas tight; cut prose descriptions to one line each. - Drop redundant few-shot examples — if 5 examples produce similar outputs, 2-3 are enough. Curate, don't accumulate. - Compress retrieved context — rerank to top-K, summarize chunks if needed. - Strip duplicated instructions ("be helpful" appears twice). - Use structured output schemas to avoid asking for the format in prose. - Restructure to maximize caching: stable content first (system prompt, tools, fewshot), variable content last (retrieved context, user message). Set the cache breakpoint right before the variable part. - Measure: track input tokens per call before/after. Run eval to confirm no quality drop. A 30-50% prompt token reduction with stable quality is typical. - For dynamic context — RAG, conversation history — compact aggressively. Conversation summary instead of full history; top-K reranked chunks instead of dump-all. - Numbers to drop: "30-50% prompt token reduction without quality drop is typical", "5 → 2 fewshot examples saves ~1-3k tokens", "verbose tool prose → schema-only: 200-500 tokens per tool"
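The audit and reordering steps can be sketched as below. The token count uses a rough four-characters-per-token heuristic as a stand-in for the provider's real tokenizer, and the `Section` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    text: str
    stable: bool  # True if byte-identical across calls (a cache candidate)


def rough_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return (len(text) + 3) // 4


def audit(sections: list) -> list:
    """(name, tokens, stable) per section, largest first."""
    rows = [(s.name, rough_tokens(s.text), s.stable) for s in sections]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def cache_friendly_order(sections: list) -> list:
    """Stable sections first, variable sections last; relative order is kept."""
    return sorted(sections, key=lambda s: not s.stable)


def cacheable_prefix_tokens(ordered: list) -> int:
    """Tokens before the first variable section, i.e. before the cache breakpoint."""
    total = 0
    for section in ordered:
        if not section.stable:
            break
        total += rough_tokens(section.text)
    return total


# Placeholder texts; only their lengths matter for the illustration.
prompt = [
    Section("system", "s" * 4000, stable=True),
    Section("user_message", "u" * 200, stable=False),
    Section("tools", "t" * 2000, stable=True),
    Section("retrieved_context", "r" * 8000, stable=False),
]
ordered = cache_friendly_order(prompt)
print(audit(prompt)[0], cacheable_prefix_tokens(prompt), cacheable_prefix_tokens(ordered))
```

In the original order only the system prompt sits before the first variable section; after reordering, the tool definitions join the cacheable prefix as well.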
Common follow-ups: - "How do you know the cuts didn't degrade quality?" - "Walk me through your audit process."
Traps: - Cutting blindly without an eval. You optimize for tokens and pay in regressions. - Forgetting that some "long prompts" are just bad. The cut isn't compression — it's removing things that shouldn't have been there.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/00_ai_foundation/07_prompting_fundamentals/
Q: "How would you cut tokens per request by assembling context dynamically instead of shipping a fixed prompt?"¶
Tags: senior · common · design · source: 2026 cost-optimization loop; context-engineering probe
Answer outline: - The lever before compression: compression shrinks what you decided to include; dynamic assembly decides what to include per request. A fixed "full system prompt + top-k chunks on every call" pays the maximum bill on the easy queries that needed almost none of it. - Build a context router: a cheap classifier labels the query (intent, difficulty, needs-tools?) and assembles only the needed pieces — short prompt + 1 chunk for FAQ, full prompt + 5 chunks for multi-hop; tool defs and few-shot loaded only when the route needs them. - Order it right: assemble first (drop whole sections the query never needed), compress second (trim redundancy from the remainder). Run compression on a fixed prompt and you're tuning a number the router could have halved for free. - Same guardrail as model routing: measure quality per route, and keep a fallback that escalates to fuller context when classifier confidence is low — a wrong route starves a hard query of evidence (the same failure as cutting a citation-bearing paragraph). - Caching interaction: keep the static prefix stable across routes so per-route assembly doesn't fragment the prompt cache; vary only the suffix. - Numbers to drop: "40-60% average context reduction with no quality loss", "router cost ~$0.001/req, ~50-500ms", "confidence-gated fallback to full context"
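A context router with a confidence-gated fallback might look like the sketch below. The keyword-based `classify` function is a placeholder for a real cheap classifier, and the route table, threshold, and field names are all hypothetical.

```python
# Hypothetical route table: what each route assembles into the prompt.
ROUTES = {
    "faq": {"prompt": "short", "chunks": 1, "tools": False},
    "multi_hop": {"prompt": "full", "chunks": 5, "tools": True},
}
FULL_CONTEXT = ROUTES["multi_hop"]  # safest option when the router is unsure


def classify(query: str) -> tuple:
    """Stand-in for a small classifier model: returns (label, confidence)."""
    q = query.lower()
    if any(k in q for k in ("opening hours", "reset my password", "pricing")):
        return "faq", 0.95
    if "compare" in q or " and then " in q:
        return "multi_hop", 0.90
    return "faq", 0.50  # low-confidence guess


def assemble(query: str, min_confidence: float = 0.8) -> dict:
    """Pick the context recipe for a query, escalating when confidence is low."""
    label, confidence = classify(query)
    if confidence < min_confidence or label not in ROUTES:
        return FULL_CONTEXT
    return ROUTES[label]


print(assemble("What are your opening hours?"))
print(assemble("Why was I charged twice last month?"))
```

The static prefix shared by both routes should stay byte-identical so that per-route assembly only changes the suffix and the prompt cache keeps hitting.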
Common follow-ups: - "What signal does the router use?" (intent keywords, conversation depth, classifier confidence — not length alone) - "How does this interact with prompt caching?" (keep the prefix static; only the tail varies)
Traps: - Routing on a single signal (request length). Real prompts don't split cleanly by length. - No fallback — a confident wrong route starves the query of evidence. - Letting per-route assembly fragment the cacheable prefix and killing your cache hit rate.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/08-prompt-compression.md, learning/01_ai_engineering/09_advanced_rag_patterns/
Q: "How would you reduce output tokens, not just input?"¶
Tags: senior · common · design · source: standard senior cost probe; output-token cost is 4-5× input in 2026 pricing
Answer outline: - Output tokens are 4-5× more expensive than input on most providers, so trimming output is high-leverage. - Use structured output (JSON schemas, Pydantic, response_format) to enforce concise machine-parseable answers — the model can't waffle in prose. - Set explicit length constraints in the prompt: "Answer in 1-2 sentences" or "Maximum 50 words". Combined with max_tokens, this works most of the time. - Use max_tokens as a hard ceiling. Calibrate to the 99th percentile of needed outputs — anything beyond is almost certainly waste. - For multi-step tasks, have the model emit only the next action instead of a full plan. The plan re-emerges through iterations. - Avoid asking the model to "explain its reasoning" unless you actually use the reasoning — chain-of-thought prose is expensive. - Numbers to drop: "output tokens cost 4-5× input on Claude/OpenAI 2026", "JSON output: typically 30-50% fewer tokens than prose", "max_tokens calibrated to p99 of needed length cuts 10-20% of output spend"
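Calibrating `max_tokens` to the 99th percentile of needed outputs can be sketched from logged completion lengths. The nearest-rank percentile below avoids external dependencies, and the sample lengths are synthetic.

```python
import math


def percentile(values: list, pct: int) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def calibrate_max_tokens(output_lengths: list, pct: int = 99) -> int:
    """Ceiling that covers `pct` percent of observed output lengths."""
    return percentile(output_lengths, pct)


def truncated_share(output_lengths: list, cap: int) -> float:
    """Fraction of logged outputs that would have been cut off by `cap`."""
    return sum(1 for n in output_lengths if n > cap) / len(output_lengths)


# Synthetic log of output token counts: 1, 2, ..., 200.
logged = list(range(1, 201))
cap = calibrate_max_tokens(logged)
print(cap, truncated_share(logged, cap))
```

Checking `truncated_share` on fresh logs after rollout guards against the trap of cutting `max_tokens` too aggressively.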
Common follow-ups: - "What if the model still rambles despite the length constraint?" - "Is chain-of-thought ever worth its cost?"
Traps: - Saving input tokens while output bloats. The cost is on output. - Cutting max_tokens too aggressively and truncating real answers.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/00_ai_foundation/07_prompting_fundamentals/
End-to-end design questions¶
Q: "Design a cost-optimized customer-support chatbot for 100k requests/day."¶
Tags: senior · very-common · design · source: standard senior design round; reported across 2026 AI engineer loops
Answer outline: - Start with a cost target. At 100k req/day with a 1k-input / 200-output call at frontier pricing ($4/M input + $20/M output), each call costs ~$0.008, so you're looking at ~$800/day ≈ $24k/month. Cost target: cut by 70-85% to ~$3.5-7k/month. - Architecture: ingress router → cache check (exact + semantic) → tier routing → LLM serving → response. - Caching layer: exact-match cache for repeat questions (FAQs hit 30-50% of traffic); semantic cache (threshold 0.95) for paraphrases. Add provider-side prompt caching for the system prompt + tool definitions. Expected combined cache hit rate: 40-60%. - Router: small classifier (1-3B, ~$0.001/call) labels intent. Simple intents (FAQ, status check) → fine-tuned 7B self-hosted (or Haiku tier). Complex intents (account dispute, multi-step troubleshoot) → Sonnet/Opus tier. - Serving: vLLM with continuous batching + PagedAttention for self-hosted. AWQ INT4 quantization for the 7B. Add speculative decoding for the long-tail (long response cases). - Async / batch: any nightly summary, eval grading, FAQ-mining work → batch API at 50% off. - Eval gate: side-by-side eval on a held-out 500-example set vs. the un-optimized stack. Roll back any layer that drops quality more than the SLA allows. - Numbers to drop: "Cache hit 40-60%, router cheap-tier 60-70% of remainder, self-hosted at 30% utilization → ~$0.001/call effective", "total spend: ~$3-8k/month at 100k/day"
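The cost target can be derived from the stated prices and traffic mix. This sketch treats cache hits as free and ignores the router's own cost, both simplifying assumptions; the hit rate and cheap-tier share are midpoints of the ranges in the outline.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one call at per-million-token prices."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m


def blended_cost(frontier: float, cheap: float,
                 cache_hit_rate: float, cheap_share: float) -> float:
    """Average cost per request; cache hits are assumed to cost nothing."""
    miss_rate = 1.0 - cache_hit_rate
    return miss_rate * (cheap_share * cheap + (1.0 - cheap_share) * frontier)


REQUESTS_PER_DAY = 100_000
DAYS = 30

frontier = cost_per_call(1000, 200, in_price_per_m=4.0, out_price_per_m=20.0)
baseline_monthly = frontier * REQUESTS_PER_DAY * DAYS

optimized = blended_cost(frontier, cheap=0.001, cache_hit_rate=0.5, cheap_share=0.65)
optimized_monthly = optimized * REQUESTS_PER_DAY * DAYS
saving = 1.0 - optimized / frontier

print(f"baseline:  ${baseline_monthly:,.0f}/month")
print(f"optimized: ${optimized_monthly:,.0f}/month ({saving:.0%} lower)")
```

Re-running the model with the pessimistic end of each range shows which layer the total is most sensitive to, which is a useful way to decide what to measure first.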
Common follow-ups: - "What's the failure mode of each layer?" - "How do you handle a sudden traffic spike?" - "Where would you cut further?"
Traps: - Designing all layers at once without a measurement plan. - Skipping the eval gate — every layer must come with a way to verify it didn't break quality. - Not naming what can't be cached (anything personalized, anything time-sensitive).
Related cross-cutting: Cost & latency, Architecture choices, Production patterns
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/12_model_vendor_strategy/
Q: "Your LLM bill doubled this month. Walk me through the investigation."¶
Tags: senior · common · debugging · source: standard senior cost-incident probe; reported in 2026 AI engineer loops
Answer outline: - Don't reach for fixes. Build the cost story first. - Pull the invoice breakdown by model, by endpoint, by tenant. Where is the spike? One model? One product surface? One customer? - Compare to baseline. Same traffic shape as last month, or did volume change? Same prompt length, or did average input/output tokens grow? Per-call cost or per-day cost rising faster? - Top suspects in priority: - Prompt creep: the system prompt grew because someone added tools / fewshot / instructions without checking token impact. Diff the prompt against last month's version. - Cache eviction: provider TTL changed, or your prompt's stable prefix is no longer byte-identical (someone added a timestamp, a UUID, formatting changed). Cache hit rate dropped. - Tenant spike: a single customer / API key is hammering. Often a runaway agent loop or a misconfigured batch. - Model swap: a deploy switched a route from Haiku to Sonnet. The 10× per-token cost rolled out silently. - Output bloat: max_tokens raised, or the prompt no longer constrains length tightly. - Fix in order: stop the bleeding (rate-limit the noisy tenant, revert the model swap, restore cache breakpoints). Then plan the structural fix. - Postmortem: add a per-endpoint cost dashboard, alarms on per-day cost growth >20% WoW, and a regression test on prompt token count in CI. - Numbers to drop: "expect month-over-month variance ~10-15% from traffic alone — anything >50% is a regression", "track $/request as the leading indicator, total $ as the lagging"
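The "compare to baseline" step amounts to splitting the bill into volume and per-request components. The sketch below does that for two monthly snapshots; the field names and all figures are hypothetical, standing in for whatever the invoice export or usage logs provide.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


def diagnose(prev: dict, curr: dict) -> dict:
    """Split a cost change into volume and per-request token drivers."""
    def avg(snapshot: dict, key: str) -> float:
        return snapshot[key] / snapshot["requests"]

    return {
        "volume_pct": pct_change(prev["requests"], curr["requests"]),
        "cost_per_request_pct": pct_change(avg(prev, "cost"), avg(curr, "cost")),
        "avg_input_tokens_pct": pct_change(avg(prev, "input_tokens"), avg(curr, "input_tokens")),
        "avg_output_tokens_pct": pct_change(avg(prev, "output_tokens"), avg(curr, "output_tokens")),
    }


# Hypothetical monthly snapshots: the bill doubled.
last_month = {"requests": 3_000_000, "cost": 10_000.0,
              "input_tokens": 3_000_000_000, "output_tokens": 600_000_000}
this_month = {"requests": 3_300_000, "cost": 20_000.0,
              "input_tokens": 6_600_000_000, "output_tokens": 660_000_000}

report = diagnose(last_month, this_month)
for name, value in report.items():
    print(f"{name}: {value:+.1f}%")
```

In this made-up case traffic grew only 10% while average input tokens doubled, which points at prompt creep or a lost cache prefix rather than a tenant spike or output bloat.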
Common follow-ups: - "How would you detect this before it doubled the bill?" - "What dashboard would you build?"
Traps: - Jumping to fixes before identifying the cost driver. - Blaming the model provider when the prompt grew silently in your code.
Related cross-cutting: Cost & latency, Production patterns
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/04_ai_product_evals/00_ai_evals_release_gates/, learning/01_ai_engineering/05_ai_incident_operations/
Q: "You have a $10k/month LLM budget and a feature that needs frontier-tier reasoning. How do you make it work?"¶
Tags: senior · common · scenario · source: standard senior cost-constraint probe; reported in 2026 AI startup loops
Answer outline: - $10k/month buys roughly 2M frontier-model calls at ~$0.005/call, enough for a moderate-traffic feature but not unlimited. Plan accordingly. - First, separate "needs frontier reasoning" from "uses frontier model". Most user requests can be screened, summarized, or routed by cheaper models; the frontier handles only the hard core. - Build a router with three tiers: (1) cache returns ~30-50% of repeat traffic for ~free, (2) cheap-tier model handles 60-70% of remaining traffic at ~10× cheaper, (3) frontier model handles the hard 30-40% of the remainder. Effective frontier usage: ~15-25% of total traffic. - Aggressively cache the frontier path. Anthropic prompt caching on the stable instructions saves 90% on the cacheable portion of frontier calls. - For each frontier call, constrain output via structured output and max_tokens. Output is 4-5× more expensive than input — don't let the model ramble. - Build observability: per-feature cost dashboard, per-tenant cost limits, alarms before you hit the budget. Run cost-aware evals (does this prompt cut spend by X% without quality regression?). - Reserve 10-20% of the budget as headroom for traffic spikes — running at 100% utilization means a bad day knocks the feature offline. - Numbers to drop: "$10k → ~2M frontier calls", "route 15-25% of traffic to frontier after router + cache", "headroom: 10-20% of budget"
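The tier math can be turned into a quick capacity estimate: what share of traffic reaches the frontier model, and how many total requests the budget supports once headroom is reserved. The per-call costs assume ~$0.005 for the frontier tier and a 10× cheaper small tier; cache hits are treated as free, which is a simplification.

```python
def frontier_share(cache_hit_rate: float, cheap_share: float) -> float:
    """Fraction of all traffic that ends up on the frontier model."""
    return (1.0 - cache_hit_rate) * (1.0 - cheap_share)


def affordable_requests(budget: float, headroom: float, cache_hit_rate: float,
                        cheap_share: float, frontier_cost: float, cheap_cost: float) -> float:
    """Monthly requests the budget covers after reserving headroom."""
    usable = budget * (1.0 - headroom)
    miss_rate = 1.0 - cache_hit_rate
    blended = miss_rate * (cheap_share * cheap_cost + (1.0 - cheap_share) * frontier_cost)
    return usable / blended


BUDGET = 10_000.0
FRONTIER_COST = 0.005           # assumed $/call on the frontier tier
CHEAP_COST = FRONTIER_COST / 10  # "~10x cheaper" tier

share = frontier_share(cache_hit_rate=0.4, cheap_share=0.65)
all_frontier = BUDGET / FRONTIER_COST
with_tiers = affordable_requests(BUDGET, headroom=0.15, cache_hit_rate=0.4,
                                 cheap_share=0.65, frontier_cost=FRONTIER_COST,
                                 cheap_cost=CHEAP_COST)

print(f"frontier share of traffic: {share:.0%}")
print(f"all-frontier capacity: {all_frontier:,.0f} requests/month")
print(f"tiered capacity:       {with_tiers:,.0f} requests/month")
```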
Common follow-ups: - "What if traffic doubles unexpectedly?" - "When would you push back on the product team?"
Traps: - Designing for full frontier usage and praying. The budget will blow. - No observability. You can't manage a budget you can't see broken down by feature/tenant.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/01_ai_engineering/12_model_vendor_strategy/
Hardware & serving infra¶
Q: "When would you pick H100 over A100 — or H200 over H100 — for LLM serving?"¶
Tags: staff · occasional · conceptual · source: GPU choice question from inference-infra loops 2026
Answer outline: - A100: still the workhorse for medium-sized serving. 80 GB HBM2e, ~2 TB/s memory bandwidth, ~$1-2/hour amortized cloud. - H100: better tensor cores (FP8 support), 80 GB HBM3, ~3.35 TB/s bandwidth — about 1.5-2× A100 throughput on LLM inference. Cost ~$3-4/hour cloud. - H200: same compute as H100 but 141 GB HBM3e, ~4.8 TB/s. The memory bandwidth advantage is huge for LLM decode (memory-bandwidth-bound) — 30-50% throughput uplift on the same model. - B200: next-gen Blackwell. Bigger memory, faster bandwidth, even more FP8/FP4 throughput. Premium pricing. - Decision: pick A100 if you're cost-sensitive and serving a small model with low QPS. Pick H100/H200 for medium-to-large serving where bandwidth matters (which is most LLM inference). Pick B200 for frontier-scale workloads. - The actual right answer often involves more A100s instead of fewer H100s if your workload is well-parallelizable and cost per request is the metric. - Numbers to drop: "A100: ~2 TB/s. H100: 3.35 TB/s. H200: 4.8 TB/s. Bandwidth ratio = throughput ratio on decode."
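The "cost per request, not raw throughput" argument can be illustrated with a bandwidth-bound estimate: if every decoded token requires reading all weights once, single-stream decode speed is at most bandwidth divided by weight size. The sketch below applies that upper bound to a hypothetical 7B FP16 model (~14 GB of weights) using the bandwidth figures above and the midpoints of the quoted hourly prices. It ignores KV-cache traffic, batching, and kernel efficiency, so the absolute numbers are illustrative only.

```python
GPUS = {
    # Bandwidth from the outline; price is the midpoint of the quoted range.
    "A100": {"bandwidth_tb_s": 2.0, "usd_per_hour": 1.5},
    "H100": {"bandwidth_tb_s": 3.35, "usd_per_hour": 3.5},
}
WEIGHT_GB = 14.0  # hypothetical 7B-parameter model in FP16 (2 bytes per parameter)


def decode_tokens_per_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_tb_s * 1000.0 / weight_gb


def usd_per_million_tokens(gpu: dict, weight_gb: float) -> float:
    """Hourly price divided by the bandwidth-bound hourly token output."""
    tokens_per_hour = decode_tokens_per_s(gpu["bandwidth_tb_s"], weight_gb) * 3600.0
    return gpu["usd_per_hour"] / tokens_per_hour * 1e6


for name, gpu in GPUS.items():
    tps = decode_tokens_per_s(gpu["bandwidth_tb_s"], WEIGHT_GB)
    print(f"{name}: <= {tps:.0f} tok/s, ~${usd_per_million_tokens(gpu, WEIGHT_GB):.2f}/M tokens")
```

Under these assumptions the H100 is about 1.7× faster but costs more per token, which is the situation where more A100s can beat fewer H100s. Batching and FP8 can shift the comparison, so real throughput should be measured before deciding.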
Common follow-ups: - "Why does memory bandwidth matter so much for decode?" - "When does FP8 actually help?"
Traps: - Picking the newest GPU automatically. The cost per request is the right metric, not raw throughput.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "Why is LLM inference memory-bandwidth-bound and not compute-bound?"¶
Tags: senior · common · conceptual · source: standard senior inference-perf probe; vLLM / FlashAttention papers
Answer outline: - LLM decode requires loading every weight of the model from HBM into the compute units for each generated token. Compute per token is tiny (one matmul stack); memory traffic is enormous. - Arithmetic intensity (FLOPs / bytes-loaded) for autoregressive decode is very low — typically 1-2 FLOPs per byte loaded. GPUs are designed for ~100-200 FLOPs per byte. So the GPU sits idle on compute while HBM bandwidth saturates. - Prefill is different. All prompt tokens are processed in one pass, so each weight loaded from HBM is reused across every token in the prompt; arithmetic intensity grows with prompt length, and prefill is often compute-bound. - Implications: (1) increasing batch size during decode helps because you reuse loaded weights across more requests, raising arithmetic intensity. This is why continuous batching wins so much. (2) Quantization (smaller weights = less HBM traffic per token) yields almost-linear speedup on decode. (3) Memory bandwidth differences between GPU generations (H100 vs H200) translate near-linearly to decode throughput. - The reasoning behind speculative decoding: verifying K candidate tokens in one pass loads weights once and does K tokens' worth of compute, pushing arithmetic intensity up by K. Speedup is proportional to acceptance rate × K. - Numbers to drop: "arithmetic intensity for decode: ~1-2 FLOPs/byte. GPU ridge: ~100+ FLOPs/byte. You're nowhere near peak compute."
Common follow-ups: - "If decode is memory-bound, do FP8 FLOPs help?" - "How does this change for very small batches vs very large?"
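The arithmetic-intensity argument can be worked through in a few lines. The sketch counts 2 FLOPs (one multiply-add) per weight per sequence in the batch and assumes weight reads dominate memory traffic, ignoring KV cache and activations. The ridge point uses approximate A100 figures (~312 TFLOPS dense BF16, ~2 TB/s), which is an assumption about the hardware rather than a number from the outline.

```python
import math


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs per byte loaded for one decode step over a batch of sequences."""
    flops_per_param = 2.0 * batch_size  # one multiply-add per weight per sequence
    return flops_per_param / bytes_per_param


def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Intensity (FLOPs/byte) above which the GPU becomes compute-bound."""
    return peak_tflops / bandwidth_tb_s


def min_batch_for_compute_bound(ridge: float, bytes_per_param: float) -> int:
    """Smallest batch at which decode intensity reaches the ridge point."""
    return math.ceil(ridge * bytes_per_param / 2.0)


ridge = ridge_point(peak_tflops=312.0, bandwidth_tb_s=2.0)  # approximate A100 numbers
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for batch in (1, 8, 64):
    print(f"batch {batch}: {decode_intensity(batch, bytes_per_param=2.0):.0f} FLOPs/byte (FP16)")
print("batch to reach ridge (FP16):", min_batch_for_compute_bound(ridge, 2.0))
```

At batch size 1 in FP16 the intensity is 1 FLOP/byte, two orders of magnitude below the ridge, which is why batching and smaller weights both translate so directly into decode throughput.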
Traps: - Calling it "compute-bound" because the GPU looks busy. The bottleneck is the memory subsystem; the SMs are starved.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/05_agent_performance_economics/