Inference Serving Engines — Interview Questions

The "you decided to self-host — now what runs the model" round. Different from cost-latency-optimization.md (techniques: caching, batching, speculative decoding) and mlops-deployment.md (rollout patterns). This file is the engine choice and serving internals interview: vLLM vs TensorRT-LLM vs SGLang vs TGI, why TGI fell behind, when TensorRT-LLM is worth the compile pain, how PagedAttention and continuous batching interact at the engine level, what to monitor on a serving node, and how to size capacity.

The senior tell is naming the specific workload shape that justifies each engine — "vLLM by default; TensorRT-LLM only when the model is frozen and we need the last 20-30% of throughput; SGLang when shared-prefix workloads dominate".


Engine selection

Q: "Compare vLLM, TensorRT-LLM, SGLang, and TGI."

Tags: senior · very-common · conceptual · source: Yotta Labs 2026 inference engine comparison; n1n.ai engine survey 2026; standard senior infra probe

Answer outline:

- vLLM: open-source, PyTorch-based, the 2026 default. Continuous batching + PagedAttention as core innovations. Wide model support (Llama, Qwen, Mistral, Mixtral, DeepSeek, Gemma, etc.), no compilation step, fast iteration. Under heavy load, keeps the GPU 85-92% busy. Best for "we have a fleet of models and want to ship now".
- TensorRT-LLM (NVIDIA): compiled inference engine, lowest latency and highest throughput when the model is fixed. Requires a build step (28+ min cold compile typical), tight NVIDIA-only coupling, less flexibility. Best when one model serves at very high QPS and the team can pay the compile/tuning cost.
- SGLang: open-source, focuses on structured generation and prefix caching. Outperforms vLLM by ~29% on smaller models (7B-8B) on H100; gap narrows on 70B+. Particularly strong when workloads share prefixes (RAG with stable system prompt, multi-step agents). Best when shared-prefix is a meaningful share of traffic.
- TGI (Hugging Face Text Generation Inference): now officially in maintenance mode. HF themselves recommend vLLM or SGLang. Skip TGI for new deployments.
- 2026 stance: start with vLLM. Switch to TensorRT-LLM when you've validated the model is stable and the throughput delta is worth the ops cost. Switch to SGLang for shared-prefix-heavy workloads.
- Numbers to drop:
  - "vLLM TTFT p50 at 10 concurrent: ~120ms. SGLang: ~112ms. TensorRT-LLM: ~105ms."
  - "TensorRT-LLM compile time: ~28 min cold; vLLM start ~62s"
  - "SGLang 7B advantage: ~29% throughput on H100; closes to 3-5% at 70B+"

Common follow-ups:

- "Why is TGI dying?"
- "When does TensorRT-LLM actually pay off?"
- "Walk me through SGLang's prefix-caching advantage."

Traps:

- Recommending TGI in 2026. Outdated.
- Defaulting to TensorRT-LLM. The compile cost is real.

Related cross-cutting: Cost & latency, Architecture choices

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Q: "When does TensorRT-LLM win over vLLM?"

Tags: staff · common · scenario · source: Yotta Labs / Spheron benchmarks 2026; standard staff-tier serving probe

Answer outline:

- TensorRT-LLM wins when all of these hold:
  - Model is frozen: not changing for months. Otherwise you re-compile on every change.
  - NVIDIA hardware: TensorRT-LLM is NVIDIA-only. Won't run on AMD MI300X, Trainium, etc.
  - Throughput-critical workload: high sustained QPS where the last 20-30% throughput matters financially.
  - Team has the ops capacity: TensorRT-LLM compilation, version management, and tuning take real engineer time.
- Typical wins: 20-30% throughput uplift, 10-15% TTFT improvement, especially on large models (70B+) at high concurrency.
- The compile cost: 28+ min cold compile per model+precision+max-batch combination. Re-compile on every change. Storage and CI pipeline implications.
- Hybrid in production: vLLM for development and rare-traffic models; TensorRT-LLM for the highest-QPS production model only.
- Decision rule: if your annual GPU savings from the 20-30% efficiency uplift exceed the ops cost of maintaining a TensorRT-LLM build pipeline, go for it. Otherwise stay on vLLM.
- Numbers to drop:
  - "20-30% throughput uplift typical at high concurrency"
  - "compile time: 28+ min"
  - "ops cost: 0.25-0.5 FTE engineer for a serious TensorRT-LLM stack"
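
The decision rule can be sketched as a back-of-envelope calculation. This is a minimal sketch, not a costing model: the function name, the $1M annual GPU spend, and the $300k fully loaded engineer cost are illustrative assumptions. It treats a throughput uplift u as needing 1/(1+u) of the GPUs for the same traffic.

```python
def trtllm_pays_off(annual_gpu_spend_usd, throughput_uplift, fte_fraction, fte_cost_usd):
    """Compare GPU savings from a throughput uplift against pipeline ops cost.

    All inputs are assumptions supplied by the caller; none come from a benchmark.
    """
    # Serving the same traffic with uplift u needs 1/(1+u) of the original GPUs.
    savings = annual_gpu_spend_usd * (1 - 1 / (1 + throughput_uplift))
    ops_cost = fte_fraction * fte_cost_usd
    return savings, ops_cost, savings > ops_cost


# Hypothetical inputs: $1M/year on GPUs, 25% uplift, 0.5 FTE at $300k/year.
savings, ops_cost, worthwhile = trtllm_pays_off(1_000_000, 0.25, 0.5, 300_000)
print(f"savings=${savings:,.0f} ops=${ops_cost:,.0f} worthwhile={worthwhile}")
```

At a fifth of that GPU spend the same uplift no longer covers the ops cost, which is the "stay on vLLM" branch of the rule.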

Common follow-ups:

- "What if the model is going to change in a month?"
- "How does FP8 fit in?"

Traps:

- Compiling TensorRT-LLM for a model that's about to be replaced.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Q: "Why does SGLang win on shared-prefix workloads?"

Tags: staff · common · conceptual · source: SGLang prefix caching docs; n1n.ai 2026 engine comparison

Answer outline:

- SGLang implements aggressive prefix caching (RadixAttention): automatic KV cache sharing across requests that share the same prompt prefix.
- For workloads where the prefix is stable (RAG with one system prompt + retrieved chunks varying, agents with consistent tool definitions, batch evaluation), a cache hit on the prefix skips prefill for those tokens entirely: their K and V tensors are reused instead of recomputed. Big savings.
- vLLM has prefix caching too (added in 2024), but SGLang's implementation is more aggressive and was designed for this from day one; it detects shared prefixes automatically, without explicit hints.
- Result: on shared-prefix workloads, SGLang's TTFT can be 30-50% lower than vLLM and throughput proportionally higher.
- For workloads with diverse prompts (every request has a unique prefix), the advantage shrinks to near zero. The gain is workload-dependent.
- 2026 trend: vLLM is closing the gap. SGLang's edge on prefix caching is shrinking as vLLM matures.
- Numbers to drop:
  - "shared-prefix TTFT improvement: 30-50% over naive serving"
  - "RAG with stable system prompt: 60-80% prefix cache hit rate"
  - "agent workloads: similar"
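
To make the mechanism concrete, here is a toy prefix cache built as a token-level trie. It is only an illustration of the idea: real engines cache KV blocks rather than single tokens and also handle eviction, and the class and function names here are hypothetical.

```python
class PrefixCache:
    """Toy token-level trie. Each node is a dict mapping token id to child node."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node, matched = node[tok], matched + 1
        return matched

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (tokens reused from cache, tokens that still need prefill)."""
    hit = cache.match_len(tokens)
    cache.insert(tokens)
    return hit, len(tokens) - hit


cache = PrefixCache()
system_prompt = list(range(100))                          # 100 shared prefix tokens
first = serve(cache, system_prompt + [1001, 1002])        # cold: nothing cached yet
second = serve(cache, system_prompt + [2001, 2002, 2003]) # warm: prefix is reused
print(first, second)
```

The second request only needs prefill for its three unique tokens, which is where the TTFT saving comes from; with fully unique prompts `match_len` returns 0 and the cache buys nothing.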

Common follow-ups:

- "How does prefix caching interact with continuous batching?"
- "What's a workload where prefix caching doesn't help?"

Traps:

- Claiming SGLang dominates everywhere. Workload-specific.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Engine internals

Q: "How does continuous batching work at the engine level?"

Tags: senior · very-common · conceptual · source: vLLM / Anyscale continuous batching blog; standard senior internals probe

Answer outline:

- The scheduler operates at the token level, not the request level. Each iteration of the generation loop:
  - Pick a batch of in-flight requests to run the next decode step on.
  - Run one forward pass through the model for the entire batch (each sequence contributes its current position's input).
  - Update each sequence with its newly generated token.
  - For any sequence that finished (EOS or max-tokens), evict it; for new requests in the queue, slot them in.
- Each iteration's batch is dynamic: old sequences leave as they finish; new ones join immediately. No request waits for the slowest sequence in its original batch.
- Key constraint: KV cache memory. All concurrent sequences' KV caches must fit. PagedAttention eases this by storing the KV cache in fixed-size, non-contiguous blocks, which avoids fragmentation and over-reservation.
- Throughput improvement vs static batching: 20-30× on workloads with variable output lengths. Static batching wastes most of its GPU time on padding while waiting for the longest sequence.
- Tunable: max_num_batched_tokens (per-iteration token budget), max_num_seqs (concurrent sequence cap). Higher = more throughput but more KV cache pressure.
- Numbers to drop:
  - "vLLM continuous batching + PagedAttention: ~23× throughput vs naive"
  - "max_num_seqs: 50-200 typical per GPU at 7B FP16"
  - "max_num_batched_tokens: 2048-8192 typical"
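
The loop above can be sketched as a toy simulation that counts decode iterations under static and continuous batching. It is a sketch under simplifying assumptions: prefill and KV cache limits are ignored, every step costs the same, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, max_seqs):
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), max_seqs):
        steps += max(output_lens[i:i + max_seqs])
    return steps


def continuous_batching_steps(output_lens, max_seqs):
    """Finished sequences free their slot at once; waiting requests join next step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens to generate per in-flight sequence
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots.
        while waiting and len(running) < max_seqs:
            running.append(waiting.popleft())
        steps += 1  # one forward pass yields one token per running sequence
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 8, 2, 8, 2, 2, 2, 2]  # output lengths in tokens
static_steps = static_batching_steps(lens, max_seqs=2)
continuous_steps = continuous_batching_steps(lens, max_seqs=2)
print(static_steps, continuous_steps)
```

With two slots and 28 tokens in total, 14 steps is the best any scheduler can do, so the continuous schedule keeps both slots full while the static one idles a slot whenever a short sequence is paired with a long one.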

Common follow-ups:

- "What stops you from increasing batch size indefinitely?"
- "How does this interact with speculative decoding?"

Traps:

- Confusing continuous batching with naive dynamic batching. Token-level vs request-level scheduling is the key.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Q: "Explain prefill vs decode and why they're treated differently."

Tags: senior · common · conceptual · source: standard senior inference-internals probe; 2026 LLM serving guides

Answer outline:

- Prefill: processing the user's input prompt. All N input tokens go through the model in a single (or chunked) forward pass. High arithmetic intensity: each weight is loaded once and used across N positions. Compute-bound (FLOPs-limited).
- Decode: generating one output token at a time. Each step reads the full model weights plus the KV cache to produce a single new token per sequence. Low arithmetic intensity. Memory-bandwidth-bound.
- Implications:
  - Prefill attention compute scales O(N²) while the linear layers scale O(N), so long prompts have expensive prefill.
  - Decode latency per token is roughly constant at a given context length. It is set by memory bandwidth (weights plus KV cache reads), so it creeps up as the KV cache grows with total sequence length.
  - Different optimization techniques apply: prefill benefits from FlashAttention and quantization (compute-side wins); decode benefits from KV cache compression, GQA, smaller models, speculative decoding.
- Engine handling:
  - vLLM and SGLang interleave prefill and decode in continuous batching.
  - Chunked prefill (newer): split a long prefill into chunks, interleave with decode tokens. Reduces head-of-line blocking when a new long-context request lands and would otherwise starve the decode pipeline. vLLM and SGLang support it; TensorRT-LLM does too.
  - Disaggregated prefill/decode: separate GPU pools for prefill (compute-optimized) and decode (memory-bandwidth-optimized). Cutting-edge 2026 setups; complex to operate, big throughput wins.
- Numbers to drop:
  - "prefill: O(N²) attention compute, often <1s at moderate context"
  - "decode: ~constant per-token, dominated by memory bandwidth"
  - "chunked prefill: 30-50% TTFT improvement on long-context requests under load"
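
A rough roofline check shows why the two phases land on opposite sides of the hardware balance point. This sketch counts only the linear layers (2 FLOPs per parameter per token, weights read once per pass) and ignores attention and activations. The H100 figures below (about 989 TFLOPS dense BF16 and 3.35 TB/s HBM bandwidth) are approximate and passed in as assumptions.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read: (2 * P * T) / (P * bytes_per_param)."""
    return 2 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Classify a forward pass against the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


H100_PEAK_FLOPS = 989e12     # approximate dense BF16 peak, an assumption
H100_MEM_BANDWIDTH = 3.35e12 # approximate HBM bandwidth in bytes/s, an assumption

prefill = bound(2048, H100_PEAK_FLOPS, H100_MEM_BANDWIDTH)        # one 2048-token prompt
decode_single = bound(1, H100_PEAK_FLOPS, H100_MEM_BANDWIDTH)     # one sequence decoding
decode_batched = bound(64, H100_PEAK_FLOPS, H100_MEM_BANDWIDTH)   # 64 sequences decoding
print(prefill, decode_single, decode_batched)
```

Under these assumptions the ridge sits near 295 FLOPs per byte, so even a decode batch of 64 stays memory-bound, which is why batching more sequences per step is nearly free until the KV cache runs out.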

Common follow-ups:

- "What's chunked prefill?"
- "When would you disaggregate prefill and decode?"

Traps:

- Treating prefill and decode the same. Different bottlenecks, different optimizations.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/02_tokens_embeddings_context/


Q: "What is chunked prefill and when does it help?"

Tags: staff · common · conceptual · source: vLLM chunked prefill docs 2026; Spheron LLM serving optimization 2026

Answer outline:

- Chunked prefill: split a long prompt's prefill phase into multiple smaller chunks; interleave each chunk with decode steps for other in-flight requests.
- Why: under continuous batching, a single 100k-token prefill request would monopolize the GPU for seconds, blocking all decode-phase requests. Latency for everyone else spikes. With chunked prefill, you process the long prefill in pieces, alternating with decode work, so other users keep getting tokens.
- Win: smoother p99 latency under mixed workloads (long-context + short interactive both in flight). TTFT for the long-context request slightly worse; TTFT for the short interactive much better.
- Lose: pure throughput on a single very-long-context request can be slightly lower than one-shot prefill due to overhead.
- Tunable: the prefill chunk size. In vLLM it is governed by the per-step token budget (max_num_batched_tokens); SGLang exposes a dedicated chunked-prefill size setting. Larger = closer to one-shot prefill (lower overhead, more blocking); smaller = better interleaving (higher overhead, smoother latency).
- Modern engines (vLLM, SGLang, TensorRT-LLM) all support it; usually enabled by default in 2026.
- Numbers to drop:
  - "prefill chunk size: 512-2048 tokens typical"
  - "p99 TTFT improvement: 30-50% under mixed workloads"
  - "single-request throughput overhead: 5-10%"
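
The interleaving can be sketched as a per-step token budget that is shared between decoding sequences and a prefill chunk. This is a simplified sketch that assumes decode tokens are scheduled first and the number of decoding sequences stays fixed; the function name is hypothetical and real schedulers track many more constraints.

```python
def chunked_prefill_plan(prompt_len, n_decoding, token_budget):
    """Per-step (decode_tokens, prefill_tokens) until the prompt is fully prefilled.

    Each step spends one token per decoding sequence, then gives the rest of
    the budget to the next chunk of the long prompt.
    """
    if token_budget <= n_decoding:
        raise ValueError("token budget must leave room for prefill work")
    plan, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - n_decoding)
        plan.append((n_decoding, chunk))
        remaining -= chunk
    return plan


# A 5000-token prompt arrives while 48 sequences are decoding, budget 2048 tokens/step.
plan = chunked_prefill_plan(prompt_len=5000, n_decoding=48, token_budget=2048)
print(plan)
```

The 48 decoding sequences receive a token on every one of the three steps instead of stalling behind a single 5000-token pass; shrinking the budget adds steps (overhead) but shortens each one.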

Common follow-ups:

- "Why does this matter for production?"
- "What's a workload where you'd disable chunked prefill?"

Traps:

- Saying chunked prefill is always on. Tunable; sometimes off for pure-throughput workloads.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Q: "What's disaggregated prefill and decode?"

Tags: staff · occasional · conceptual · source: 2026 inference research and emerging production patterns

Answer outline:

- Run prefill on one pool of GPUs (compute-optimized: high FLOPs); run decode on a separate pool (memory-bandwidth-optimized: high HBM bandwidth, possibly different hardware tier).
- Idea: prefill and decode have opposite bottleneck profiles. When the same GPU does both, whichever phase is running leaves the other phase's binding resource underused.
- Architecture:
  - Prefill pool computes the KV cache for a request, then transfers the KV cache to a decode-pool node via fast interconnect (NVLink, InfiniBand).
  - Decode pool serves token-by-token generation against the transferred KV cache.
  - Routing layer manages where each request goes.
- Wins:
  - Better hardware utilization per dollar.
  - Independent scaling: prefill and decode pools are sized separately to their respective traffic shape.
  - Lower p99 latency on mixed workloads.
- Costs:
  - Operational complexity. Two pools to manage.
  - KV cache transfer overhead (mitigated by fast interconnect; meaningful on commodity networking).
  - More moving parts → more failure modes.
- 2026 reality: cutting-edge, used by frontier labs and some hyperscalers. Most production teams use a single pool with vLLM continuous batching. Adopt only when scale clearly justifies the complexity.
- Numbers to drop:
  - "10-30% efficiency gain at the system level when scale justifies"
  - "operational cost: significant, usually 1+ FTE on serving infra"
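
The transfer overhead is easy to estimate from the model shape. The sketch below assumes a Llama-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and two illustrative link speeds; all of these are assumptions, so substitute your own model config and measured bandwidth.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: K and V, per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, link_bytes_per_sec):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / link_bytes_per_sec


# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, BF16 cache.
kv_bytes = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)

fast_link = transfer_seconds(kv_bytes, 50e9)    # roughly a 400 Gb/s fabric
slow_link = transfer_seconds(kv_bytes, 1.25e9)  # roughly 10 Gb/s Ethernet
print(f"{kv_bytes / 2**30:.1f} GiB, fast={fast_link * 1e3:.0f} ms, slow={slow_link * 1e3:.0f} ms")
```

An 8k-token prompt carries about 1 GiB of KV cache under these assumptions: tens of milliseconds on a fast fabric, but most of a second on commodity networking, which is the "meaningful on commodity networking" cost above.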

Common follow-ups:

- "What hardware tier for prefill vs decode?"
- "What's the bottleneck of KV cache transfer?"

Traps:

- Recommending disaggregation for small/medium setups. Overkill.

Related cross-cutting: Cost & latency, Architecture choices

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Tuning & sizing

Q: "Walk me through tuning vLLM for a production workload."

Tags: senior · common · design · source: vLLM tuning docs 2026; standard senior serving probe

Answer outline:

- Process:
  - Profile the workload: average prompt length, output length, QPS, max-concurrency targets. These drive all subsequent decisions.
  - Pick the right hardware: H100/H200 for high-throughput; A100 for budget; consumer cards for low-volume.
  - Pick the precision: BF16 for safety; FP8 on H100+ for ~1.7× throughput; INT4 (AWQ/GPTQ) for cost-critical with small quality hit.
  - Tune the key flags:
    - --max-num-seqs: concurrent sequence cap. Higher = more throughput, more KV cache memory. Start at 64-128, increase until KV cache utilization peaks.
    - --max-num-batched-tokens: per-iteration token budget. Affects prefill/decode balance. 2048-8192 typical.
    - --max-model-len: hard cap on sequence length. Drives KV cache memory allocation.
    - --gpu-memory-utilization: fraction of GPU memory to use (default 0.9). Lower if OOM, higher to push harder.
    - --enable-chunked-prefill: usually on by default in 2026.
    - --enable-prefix-caching: turn on if your workload has shared prefixes.
  - Benchmark: measure TTFT p50/p95/p99 and tokens/sec at target concurrency. Compare against your SLO.
  - Iterate: bump batch size, check OOM, check tail latency. If p99 spikes, lower concurrency cap.
- Special features: AWQ/GPTQ INT4 weights, speculative decoding (provide a draft model), LoRA adapters served per-tenant, KV cache compression.
- Monitoring: vLLM exposes Prometheus metrics (vllm:e2e_request_latency_seconds, vllm:gpu_cache_usage_perc, etc.); scrape these.
- Numbers to drop:
  - "max-num-seqs: 64-200 typical per H100 on 7B FP16"
  - "gpu-memory-utilization: 0.85-0.95"
  - "iteration: 3-5 tuning rounds to converge"
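
One way to keep tuning rounds reproducible is to generate the launch command from a small function, so each benchmark run records exactly which flags it used. The sketch below uses only the flags listed above; the model name is a placeholder, and the default values are starting points taken from the ranges in this answer rather than recommendations.

```python
import shlex


def vllm_serve_command(model, max_num_seqs=128, max_num_batched_tokens=8192,
                       max_model_len=8192, gpu_memory_utilization=0.90,
                       enable_prefix_caching=False):
    """Build a `vllm serve` argv from the tuning flags discussed above."""
    argv = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if enable_prefix_caching:
        argv.append("--enable-prefix-caching")
    return argv


# "my-org/my-7b-model" is a hypothetical model id used for illustration.
argv = vllm_serve_command("my-org/my-7b-model", enable_prefix_caching=True)
print(shlex.join(argv))
```

Logging the printed command next to each benchmark result makes it straightforward to see which flag change moved TTFT or throughput between rounds.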

Common follow-ups:

- "What if you OOM at higher batch sizes?"
- "When do you turn off prefix caching?"

Traps:

- One-shot tuning. Workloads shift; re-tune periodically.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Q: "How do you size a vLLM cluster for a target QPS?"

Tags: senior · common · design · source: standard senior infra-sizing probe; 2026 AI engineer loops

Answer outline:

- Inputs: model size, target QPS, average input/output tokens, p95 TTFT SLO, hardware choice.
- Per-replica throughput estimation:
  - Run a synthetic benchmark on one GPU at your target prompt/response shapes.
  - Measure tokens/sec sustained at acceptable tail latency.
  - Divide target tokens/sec by per-replica throughput → required replica count.
- Headroom: add 30-50% buffer over peak traffic. GPUs are slow to spin up.
- HA: minimum 2-3 replicas for redundancy even at low QPS.
- Multi-region / multi-AZ: replicas spread across zones; load balancer aware.
- Autoscaling: based on queue depth or tokens-in-flight (see mlops-deployment.md), not raw CPU. Scale-up lag is 10-30 min on cold GPU instances; pre-warm a reserve pool.
- Cost check: estimate $/request from GPU $/hour ÷ per-replica throughput. Compare against API-tier costs ($0.001-0.05/call typical). Self-hosting wins meaningfully only above ~50k req/day after accounting for ops cost.
- Numbers to drop:
  - "single H100 on 7B FP16 + GQA: 1000-3000 tokens/sec sustained at concurrency 64"
  - "headroom: 30-50% over peak"
  - "break-even vs API: ~50k req/day typical"
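
The sizing arithmetic above fits in two small functions. This is a sketch under stated assumptions: throughput is counted in output tokens only, the per-replica figure comes from your own benchmark, and the $3/hour GPU price and traffic numbers are illustrative.

```python
import math


def replicas_needed(peak_qps, avg_output_tokens, per_replica_tokens_per_sec,
                    headroom=0.3, min_replicas=2):
    """Replica count for peak load plus headroom, floored at the HA minimum."""
    target_tokens_per_sec = peak_qps * avg_output_tokens
    raw = target_tokens_per_sec * (1 + headroom) / per_replica_tokens_per_sec
    return max(min_replicas, math.ceil(raw))


def cost_per_request(gpu_usd_per_hour, per_replica_tokens_per_sec, avg_output_tokens):
    """GPU $/hour divided by the requests one replica completes per hour."""
    requests_per_hour = per_replica_tokens_per_sec * 3600 / avg_output_tokens
    return gpu_usd_per_hour / requests_per_hour


# Hypothetical workload: 20 QPS at peak, 300 output tokens, 2000 tokens/sec per replica.
replicas = replicas_needed(peak_qps=20, avg_output_tokens=300,
                           per_replica_tokens_per_sec=2000)
unit_cost = cost_per_request(gpu_usd_per_hour=3.0, per_replica_tokens_per_sec=2000,
                             avg_output_tokens=300)
print(replicas, f"${unit_cost:.6f}/request")
```

The unit cost assumes the replica is fully utilized; at real utilization (and with headroom replicas idling) the effective $/request is higher, which is why the break-even against API pricing depends so strongly on volume.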

Common follow-ups:

- "What's your buffer strategy for traffic spikes?"
- "When does self-hosting not pay off?"

Traps:

- No buffer. Spike kills the cluster.
- Sizing for average instead of peak.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/04_ml_platform_operations/


Specialized features

Q: "How does vLLM serve multiple LoRA adapters efficiently?"

Tags: staff · common · conceptual · source: vLLM multi-LoRA docs 2026; standard staff-tier serving probe

Answer outline:

- vLLM supports loading N LoRA adapters atop the same base model and routing requests to the appropriate adapter at runtime. Saves dramatically on hardware vs serving N separate full models.
- How: adapters live in GPU memory; per-request, the engine applies the relevant adapter's deltas during forward pass. Latency overhead is small (~5-15% vs base-only).
- Use cases:
  - Multi-tenant SaaS where each customer has their own fine-tuned adapter on a shared base.
  - A/B testing of adapter variants without deploying separate clusters.
  - Per-task specialization (one adapter for code, another for chat, another for summarization).
- Limits:
  - Concurrent adapter count: typically up to 8-32 active adapters in memory; beyond that, adapter loading from disk becomes a bottleneck.
  - All adapters must target the same base model and fit the engine's configured LoRA limits (for example, the maximum rank).
- Operational pattern: hot adapters always loaded; cold adapters swapped in on first request (cold-start latency 100ms-1s).
- Numbers to drop:
  - "concurrent active adapters: 8-32 typical per node"
  - "per-request overhead vs base-only: 5-15%"
  - "adapter cold-start: 100ms-1s"
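
The hot/cold pattern can be sketched as a small least-recently-used cache over adapter ids. LRU is an assumption chosen for illustration, not a description of how any particular engine evicts adapters, and the class name is hypothetical.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache of loaded adapters, capped at `max_loaded`."""

    def __init__(self, max_loaded):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # adapter_id -> True, ordered oldest to newest use

    def get(self, adapter_id):
        """Return 'hot' on a hit, 'cold' when the adapter had to be loaded."""
        if adapter_id in self.loaded:
            self.loaded.move_to_end(adapter_id)  # mark as most recently used
            return "hot"
        if len(self.loaded) >= self.max_loaded:
            self.loaded.popitem(last=False)      # evict the least recently used
        self.loaded[adapter_id] = True
        return "cold"


cache = AdapterCache(max_loaded=2)
results = [cache.get(a) for a in ["tenant-a", "tenant-b", "tenant-a", "tenant-c", "tenant-b"]]
print(results, list(cache.loaded))
```

Every "cold" result corresponds to the 100ms-1s load penalty, so the hit rate of this cache under real tenant traffic is the number to watch when choosing how many adapters to keep resident.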

Common follow-ups:

- "What's the cold-start cost?"
- "How do you decide which adapters stay hot?"

Traps:

- Loading every adapter on every node. Wasteful.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/06_adaptation_compression/


Q: "How does speculative decoding integrate with the serving engine?"

Tags: senior · common · conceptual · source: vLLM speculative decoding docs 2026; covered also in cost-latency-optimization.md

Answer outline:

- vLLM, TensorRT-LLM, and SGLang all support speculative decoding. The engine configuration takes a draft model + target model + speculation length.
- Engine handles: drafting (sampling K candidate tokens from the draft model), verification (running the target model on those K candidates in parallel), accepting tokens up to the first rejection, and looping.
- Quality: lossless. With correct rejection sampling the output distribution matches the target model's exactly, and under greedy decoding the tokens match up to floating-point noise. It's an exact algorithm, not an approximation.
- Win: 1.5-2.8× decode speedup typical on workloads where the draft model has high agreement with the target.
- Caveats:
  - Only helps decode. Prefill is unchanged.
  - Helps less at high batch sizes (compute becomes the bottleneck, not memory).
  - Requires a compatible draft model, usually a smaller checkpoint from the same model family.
- Alternative variants supported by engines: Medusa (multiple decoding heads in the target model itself), EAGLE-3 (a learned drafter integrated into the target). vLLM supports both. See cost-latency-optimization.md for the algorithm details.
- Numbers to drop:
  - "1.5-2.8× decode speedup at low batch"
  - "diminishing benefit past batch 32"
  - "draft model: 5-15% of target's parameter count typical"
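
The draft/verify/accept loop can be sketched for the greedy case, where "accept" simply means the draft token equals the target's own choice. The toy `target_next` and `draft_next` functions stand in for real models and are purely illustrative; sampling-based decoding needs rejection sampling instead of the equality check, which this sketch does not implement.

```python
def speculative_step(target_next, draft_next, context, k):
    """One draft-then-verify round; returns the tokens accepted this round."""
    # Draft proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Target checks every position (a single batched pass in a real engine).
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all k accepted: the target adds a bonus token
    return accepted


def generate(target_next, draft_next, prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy(next_token, prompt, n_tokens):
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(next_token(out))
    return out[len(prompt):]


def target_next(ctx):
    return (ctx[-1] * 3 + 1) % 17  # toy deterministic "model"


def draft_next(ctx):
    # Agrees with the target except when the context length is a multiple of 4.
    return target_next(ctx) if len(ctx) % 4 else 0


speculative = generate(target_next, draft_next, [1], 20)
reference = greedy(target_next, [1], 20)
print(speculative == reference)
```

Each round emits at least one target-approved token, so a poor draft model only costs speed, never correctness; the speedup comes from rounds where several draft tokens are accepted for the price of one target pass.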

Common follow-ups:

- "When does it not help?"
- "Walk me through Medusa vs draft-model speculative."

Traps:

- Treating speculative decoding as always-on. Batch-size-sensitive.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/05_agent_performance_economics/


Q: "How does FP8 affect inference serving?"

Tags: staff · common · conceptual · source: NVIDIA H100/H200 FP8 docs; 2026 serving guides

Answer outline:

- FP8 = 8-bit floating point. Half the memory of FP16/BF16, faster matmul on H100+ which has FP8 tensor-core support.
- Two FP8 formats: E4M3 (more mantissa bits; the usual choice for weights and activations at inference) and E5M2 (more exponent range; used mainly for gradients in training).
- Speedups: ~1.5-1.7× throughput vs BF16 on the same H100. Comparable to INT8 but with better numerical properties for LLMs.
- Quality: typically near-lossless on most benchmarks (within 0-1%). Better than INT4 (~1-3% loss); slightly behind BF16 on very sensitive tasks.
- Engine support:
  - TensorRT-LLM: best FP8 support, often the deploy target for FP8 production.
  - vLLM: FP8 weight quantization since 2024; FP8 KV cache support in newer versions.
  - SGLang: comparable.
- Use case: production frontier models on H100/H200/B200 where every percent of throughput matters and quality is acceptable.
- Caveats: requires FP8 tensor cores, meaning Hopper (H100) or newer in the data center line (Ada-generation cards also have them); A100 doesn't have native FP8 tensor cores. Mostly a 2026+ deployment story.
- Numbers to drop:
  - "FP8 vs BF16 on H100: 1.5-1.7× throughput"
  - "quality loss: 0-1% typical, model-dependent"
  - "memory: 0.5× BF16"

Common follow-ups:

- "FP8 vs INT8: which?"
- "How does FP8 KV cache work?"

Traps:

- Recommending FP8 on A100. Not supported in hardware.

Related cross-cutting: Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/06_adaptation_compression/


Operational concerns

Q: "What do you monitor on a vLLM serving node?"

Tags: senior · common · design · source: vLLM Prometheus metrics docs; standard senior ops probe

Answer outline:

- vLLM exposes a Prometheus endpoint with ~25 native metrics. Key ones to scrape and dashboard:
  - Request-level: TTFT (vllm:time_to_first_token_seconds), end-to-end (vllm:e2e_request_latency_seconds), generation throughput (tokens/sec).
  - Batch state: number of running requests, waiting queue depth, swapped requests (KV cache evicted to CPU).
  - KV cache: cache hit rate (prefix cache), cache utilization (vllm:gpu_cache_usage_perc).
  - Scheduling: preemptions (when a request is paused to free up KV cache for higher-priority work), recompute events.
  - GPU: utilization, memory used, memory free.
- Alarms:
  - Queue depth growing → scale up.
  - KV cache utilization >95% sustained → either lower max-num-seqs or scale up.
  - p99 TTFT > SLO for 15 min → investigate (load? long-context outlier? regression?).
  - Preemption rate > 5% → KV cache pressure, lower concurrency or upgrade hardware.
- Aggregation: per-replica + cluster-wide dashboards. Per-model if you serve multiple.
- Trace integration: vLLM can emit OpenTelemetry spans for each request when tracing is configured; tie these to the broader observability stack (observability-tracing.md).
- Numbers to drop:
  - "vLLM emits ~25 metrics natively"
  - "KV cache utilization alarm: >95% for 5 min"
  - "preemption rate target: <1-5%"
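
The alarm rules above can be sketched as a pure function over a metrics snapshot. The dictionary keys are hypothetical names for values you would derive from the scraped metrics, and the sketch checks a single snapshot, so the "sustained for N minutes" part is left to the alerting system.

```python
def serving_alarms(snapshot, ttft_slo_s):
    """Return the names of the alarm conditions that hold for this snapshot."""
    alarms = []
    if snapshot["waiting_now"] > snapshot["waiting_5m_ago"]:
        alarms.append("queue_growing")          # scale up
    if snapshot["kv_cache_usage"] > 0.95:
        alarms.append("kv_cache_pressure")      # lower max-num-seqs or scale up
    if snapshot["p99_ttft_s"] > ttft_slo_s:
        alarms.append("ttft_slo_breach")        # investigate load, outliers, regressions
    preemption_rate = snapshot["preemptions"] / max(snapshot["requests"], 1)
    if preemption_rate > 0.05:
        alarms.append("high_preemption_rate")   # KV cache pressure
    return alarms


healthy = {"waiting_now": 2, "waiting_5m_ago": 3, "kv_cache_usage": 0.70,
           "p99_ttft_s": 0.4, "preemptions": 1, "requests": 1000}
stressed = {"waiting_now": 40, "waiting_5m_ago": 5, "kv_cache_usage": 0.97,
            "p99_ttft_s": 2.5, "preemptions": 80, "requests": 1000}
print(serving_alarms(healthy, ttft_slo_s=1.0))
print(serving_alarms(stressed, ttft_slo_s=1.0))
```

Keeping the thresholds in one function makes them easy to unit test and to review when the SLO changes.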

Common follow-ups:

- "What's a preemption?"
- "How do you tie this back to user-perceived latency?"

Traps:

- Only CPU/RAM monitoring. Misses LLM-specific signals.

Related cross-cutting: Production patterns

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/03_agent_observability_debugging/


Q: "How do you handle a node failure in a vLLM cluster?"

Tags: senior · common · scenario · source: standard senior reliability probe; 2026 AI serving loops

Answer outline:

- Failure detection:
  - Health check fails (deep check: model produces a valid response on a synthetic prompt; not just process-alive).
  - Load balancer evicts the node from the active pool.
  - In-flight requests on that node fail; clients retry against other nodes (idempotent if the request is just generation; trickier for stateful agents).
- Capacity impact: if you sized N replicas with headroom for failure, the surviving replicas absorb the load. Without headroom, the cluster degrades and a scale-up is triggered.
- Recovery:
  - Mark node as failed; start replacement (autoscaler spins up new instance).
  - On commodity cloud, GPU instance provisioning is 10-30 min. Pre-warmed reserve pool helps.
  - New node loads the model (~30s-2min depending on size and storage path).
  - Health checks pass → node joins the pool.
- For long-running agent requests: agents with durable state can resume on a different replica. Stateless requests just retry. See agents-debugging-production.md for the agent-side handling.
- Postmortem: every node failure logged. Pattern: is one hardware SKU failing more? Is one region degraded? Is a particular workload triggering crashes?
- Numbers to drop:
  - "GPU instance provisioning: 10-30 min cold; pre-warm reserve to cut to seconds"
  - "replica replacement target: <30 min including model load"
  - "headroom: 30-50% over peak so single-node failure doesn't cascade"
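
The capacity-impact point reduces to a one-line check worth running whenever the replica count changes. The function name and the traffic figures are hypothetical; per-replica capacity should come from your own benchmark at acceptable tail latency.

```python
def survives_failures(replicas, per_replica_capacity, peak_load, failures=1):
    """True if the replicas left after `failures` can still carry peak load."""
    return (replicas - failures) * per_replica_capacity >= peak_load


# Hypothetical: peak of 6000 tokens/sec, 2000 tokens/sec per replica.
with_headroom = survives_failures(replicas=4, per_replica_capacity=2000, peak_load=6000)
without_headroom = survives_failures(replicas=3, per_replica_capacity=2000, peak_load=6000)
print(with_headroom, without_headroom)
```

Three replicas exactly cover peak but leave nothing for a failure, which is the no-headroom case where one lost node turns into degraded service for the whole cluster.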

Common follow-ups:

- "What if many nodes fail at once?"
- "How long are in-flight requests stuck?"

Traps:

- No headroom. Single failure cascades into outage.

Related cross-cutting: Production patterns

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/04_resilient_agent_systems/


Q: "Walk me through serving multiple models on the same cluster."

Tags: senior · common · design · source: standard senior multi-model serving probe; 2026 AI serving loops

Answer outline:

- Two patterns:
  - Per-model pools: one replica set per model. Routing layer dispatches by model ID. Simple, clean, but each model needs its own minimum replica count for HA → expensive at low traffic per model.
  - Shared pool with model swapping: replicas can serve any model; load the requested model on demand (cached if recently used). Throughput-efficient but cold-start cost on first request.
- For 2-5 hot models, per-model pools are usually right. For 50+ infrequent models (e.g., per-customer fine-tunes), shared pool with caching wins.
- Multi-LoRA (covered earlier) is a special case: same base, many adapters, all served from one replica.
- Routing: gateway layer maps (tenant/request_type) → model_id → cluster. Cost attribution: model-level + tenant-level dashboards.
- Storage: model weights in fast shared storage (NFS / S3 with fast local cache). Load times matter for cold-swap.
- Numbers to drop:
  - "per-model pool: minimum 2-3 replicas per model for HA, expensive at scale"
  - "shared pool with model cache: cold-load ~30s-2min, fits 5-20 models in cache"
  - "multi-LoRA: same base, 8-32 adapters per replica"

Common follow-ups:

- "How do you handle a rarely-used model?"
- "When does model swapping cost more than it saves?"

Traps:

- One model per cluster always. Wastes GPU.

Related cross-cutting: Cost & latency, Architecture choices

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/


Scenario / debugging

Q: "Your vLLM cluster TTFT spiked. Walk me through diagnosis."

Tags: senior · common · debugging · source: standard senior serving-incident probe; 2026 AI serving loops

Answer outline:

- Step 1: confirm scope. Per-node or cluster-wide? Per-model? Per-tenant?
- Step 2: check the usual suspects:
  - Queue depth growing: load exceeded capacity. Scale up; meanwhile shed load on the lowest-priority traffic.
  - KV cache pressure: utilization at 95%+; preemptions firing. Lower max-num-seqs or shed long-context traffic.
  - Long-context outlier: one or more requests with 100k+ tokens blocking the pipeline. Chunked prefill should be on; check it is. Consider routing long-context to a separate pool.
  - Slow downstream: a tool the agent calls is slow. Not really a serving problem; check downstream health.
  - Hardware degradation: one GPU running slow (thermal throttling, memory ECC errors). Check GPU health metrics; evict the node.
  - Recent deploy: rolled out new model / new vLLM version / new config. Roll back.
- Step 3: fix the immediate problem (rollback, scale, load shed).
- Step 4: root cause + regression test. Add the failure pattern to the eval suite.
- Numbers to drop:
  - "diagnose-and-mitigate target: 15 min for SEV-1"
  - "rollback as default for any deploy-correlated incident"

Common follow-ups:

- "How do you tell load from infra-side from a regression?"
- "What metric would have caught this sooner?"

Traps:

- Trying to optimize before diagnosing.

Related cross-cutting: Production patterns, Cost & latency

Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/03_agent_observability_debugging/