Inference Serving Engines — Interview Questions
The "you decided to self-host, now what runs the model" round. Different from cost-latency-optimization.md (techniques: caching, batching, speculative decoding) and mlops-deployment.md (rollout patterns). This file covers engine choice and serving internals: vLLM vs TensorRT-LLM vs SGLang vs TGI, why TGI fell behind, when TensorRT-LLM is worth the compile pain, how PagedAttention and continuous batching interact at the engine level, what to monitor on a serving node, and how to size capacity.
The senior tell is naming the specific workload shape that justifies each engine — "vLLM by default; TensorRT-LLM only when the model is frozen and we need the last 20-30% of throughput; SGLang when shared-prefix workloads dominate".
Engine selection
Q: "Compare vLLM, TensorRT-LLM, SGLang, and TGI."
Tags: senior · very-common · conceptual · source: Yotta Labs 2026 inference engine comparison; n1n.ai engine survey 2026; standard senior infra probe
Answer outline:
- vLLM: open-source, PyTorch-based, the 2026 default. Continuous batching + PagedAttention as core innovations. Wide model support (Llama, Qwen, Mistral, Mixtral, DeepSeek, Gemma, etc.), no compilation step, fast iteration. Under heavy load, keeps the GPU 85-92% busy. Best for "we have a fleet of models and want to ship now".
- TensorRT-LLM (NVIDIA): compiled inference engine, lowest latency and highest throughput when the model is fixed. Requires a build step (28+ min cold compile typical), tight NVIDIA-only coupling, less flexibility. Best when one model serves at very high QPS and the team can pay the compile/tuning cost.
- SGLang: open-source, focuses on structured generation and prefix caching. Outperforms vLLM by ~29% on smaller models (7B-8B) on H100; gap narrows on 70B+. Particularly strong when workloads share prefixes (RAG with stable system prompt, multi-step agents). Best when shared-prefix is a meaningful share of traffic.
- TGI (Hugging Face Text Generation Inference): now officially in maintenance mode. HF themselves recommend vLLM or SGLang. Skip TGI for new deployments.
- 2026 stance: start with vLLM. Switch to TensorRT-LLM when you've validated the model is stable and the throughput delta is worth the ops cost. Switch to SGLang for shared-prefix-heavy workloads.
- Numbers to drop: "vLLM TTFT p50 at 10 concurrent: ~120ms. SGLang: ~112ms. TensorRT-LLM: ~105ms.", "TensorRT-LLM compile time: ~28 min cold; vLLM start ~62s", "SGLang 7B advantage: ~29% throughput on H100; closes to 3-5% at 70B+"
Common follow-ups: - "Why is TGI dying?" - "When does TensorRT-LLM actually pay off?" - "Walk me through SGLang's prefix-caching advantage."
Traps: - Recommending TGI in 2026. Outdated. - Defaulting to TensorRT-LLM. The compile cost is real.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "When does TensorRT-LLM win over vLLM?"
Tags: staff · common · scenario · source: Yotta Labs / Spheron benchmarks 2026; standard staff-tier serving probe
Answer outline:
- TensorRT-LLM wins when all of these hold:
- Model is frozen: not changing for months. Otherwise you re-compile on every change.
- NVIDIA hardware: TensorRT-LLM is NVIDIA-only. Won't run on AMD MI300X, Trainium, etc.
- Throughput-critical workload: high sustained QPS where the last 20-30% throughput matters financially.
- Team has the ops capacity: TensorRT-LLM compilation, version management, and tuning take real engineer time.
- Typical wins: 20-30% throughput uplift, 10-15% TTFT improvement, especially on large models (70B+) at high concurrency.
- The compile cost: 28+ min cold compile per model+precision+max-batch combination. Re-compile on every change. Storage and CI pipeline implications.
- Hybrid in production: vLLM for development and rare-traffic models; TensorRT-LLM for the highest-QPS production model only.
- Decision rule: if your annual GPU savings from the 20-30% efficiency uplift exceed the ops cost of maintaining a TensorRT-LLM build pipeline, go for it. Otherwise stay on vLLM.
- Numbers to drop: "20-30% throughput uplift typical at high concurrency", "compile time: 28+ min", "ops cost: 0.25-0.5 FTE engineer for a serious TensorRT-LLM stack"
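The decision rule above can be turned into a back-of-the-envelope check. This is a minimal sketch, not a costing model: the function names, the example GPU spend, and the fully loaded engineer cost are all illustrative assumptions. It relies on the observation that serving the same traffic at (1 + uplift)× throughput needs 1 / (1 + uplift) of the GPUs.

```python
def annual_gpu_savings(annual_gpu_spend_usd: float, throughput_uplift: float) -> float:
    """GPU spend avoided per year when each GPU serves (1 + uplift)x the traffic."""
    # Same traffic on faster replicas needs 1 / (1 + uplift) of the fleet,
    # so the saved fraction is uplift / (1 + uplift), not uplift itself.
    saved_fraction = throughput_uplift / (1.0 + throughput_uplift)
    return annual_gpu_spend_usd * saved_fraction


def tensorrt_llm_pays_off(annual_gpu_spend_usd: float,
                          throughput_uplift: float,
                          ops_fte: float,
                          fte_cost_usd: float) -> bool:
    """True when the GPU savings exceed the cost of running the build pipeline."""
    ops_cost = ops_fte * fte_cost_usd
    return annual_gpu_savings(annual_gpu_spend_usd, throughput_uplift) > ops_cost


# Hypothetical inputs: 25% uplift, 0.5 FTE at an assumed $300k fully loaded cost.
big_fleet = tensorrt_llm_pays_off(2_000_000, 0.25, 0.5, 300_000)
small_fleet = tensorrt_llm_pays_off(200_000, 0.25, 0.5, 300_000)
print(big_fleet, small_fleet)
```

Note that a 25% throughput uplift saves 20% of GPU spend, not 25%, which moves the break-even point noticeably for small fleets.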
Common follow-ups: - "What if the model is going to change in a month?" - "How does FP8 fit in?"
Traps: - Compiling TensorRT-LLM for a model that's about to be replaced.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "Why does SGLang win on shared-prefix workloads?"
Tags: staff · common · conceptual · source: SGLang prefix caching docs; n1n.ai 2026 engine comparison
Answer outline:
- SGLang implements aggressive prefix caching (RadixAttention): KV cache is kept in a radix tree keyed by token sequence and shared automatically across requests that start with the same prompt prefix.
- For workloads where the prefix is stable (RAG with one system prompt + retrieved chunks varying, agents with consistent tool definitions, batch evaluation), a cache hit on the prefix skips the prefill work for those tokens entirely: their K and V entries are reused instead of recomputed. Big savings.
- vLLM has automatic prefix caching too (added during 2024), but SGLang's implementation is more aggressive and was designed for this from day one: it detects shared prefixes automatically, without explicit hints.
- Result: on shared-prefix workloads, SGLang's TTFT can be 30-50% lower than vLLM and throughput proportionally higher.
- For workloads with diverse prompts (every request has a unique prefix), the advantage shrinks to near zero. The gain is workload-dependent.
- 2026 trend: vLLM is closing the gap. SGLang's edge on prefix caching is shrinking as vLLM matures.
- Numbers to drop: "shared-prefix TTFT improvement: 30-50% over naive serving", "RAG with stable system prompt: 60-80% prefix cache hit rate", "agent workloads: similar"
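To make the prefix-reuse idea concrete, here is a toy block-level prefix cache. It is a simplified illustration only, not SGLang's RadixAttention or vLLM's implementation: the class name, the block size of 4, and the use of whole prefix tuples as keys are assumptions chosen for readability.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Tracks which block-aligned token prefixes already have KV computed."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._cached: Set[Tuple[int, ...]] = set()

    def _block_prefixes(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # Only full blocks are cacheable; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens can skip prefill."""
        hit = 0
        for prefix in self._block_prefixes(tokens):
            if prefix not in self._cached:
                break
            hit = len(prefix)
        return hit

    def insert(self, tokens: List[int]) -> None:
        """Record every full block-aligned prefix of a processed prompt."""
        self._cached.update(self._block_prefixes(tokens))


cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(8))               # shared 8-token prefix
first = system_prompt + [100, 101, 102]
second = system_prompt + [200, 201]

miss = cache.lookup(first)                   # nothing cached yet
cache.insert(first)
hit = cache.lookup(second)                   # shared prefix is reused
unrelated = cache.lookup([9] + system_prompt)
print(miss, hit, unrelated)
```

A request whose first token differs gets no reuse at all, which is why the gain disappears on workloads with unique prefixes.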
Common follow-ups: - "How does prefix caching interact with continuous batching?" - "What's a workload where prefix caching doesn't help?"
Traps: - Claiming SGLang dominates everywhere. Workload-specific.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Engine internals
Q: "How does continuous batching work at the engine level?"
Tags: senior · very-common · conceptual · source: vLLM / Anyscale continuous batching blog; standard senior internals probe
Answer outline:
- The scheduler operates at the token level, not the request level. Each iteration of the generation loop:
- Pick a batch of in-flight requests to run the next decode step on.
- Run one forward pass through the model for the entire batch (each sequence contributes its current position's input).
- Update each sequence with its newly-generated token.
- For any sequence that finished (EOS or max-tokens), evict it; for new requests in the queue, slot them in.
- Each iteration's batch is dynamic — old sequences leave as they finish; new ones join immediately. No request waits for the slowest sequence in its original batch.
- Key constraint: KV cache memory. All concurrent sequences' KV caches must fit. PagedAttention eases this by storing the KV cache in fixed-size, non-contiguous blocks, which removes most fragmentation and over-reservation so more sequences fit in the same memory.
- Throughput improvement vs static batching: 20-30× on workloads with variable output lengths. Static batching wastes most of its GPU time on padding while waiting for the longest sequence.
- Tunable: max_num_batched_tokens (per-iteration token budget), max_num_seqs (concurrent sequence cap). Higher = more throughput but more KV cache pressure.
- Numbers to drop: "vLLM continuous batching + PagedAttention: ~23× throughput vs naive", "max_num_seqs: 50-200 typical per GPU at 7B FP16", "max_num_batched_tokens: 2048-8192 typical"
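The scheduling difference can be seen in a toy simulation that counts forward passes. This is a sketch under simplifying assumptions: prefill is ignored, every sequence needs at least one output token, KV memory is unlimited, and the function names are invented for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lens), batch_size):
        steps += max(output_lens[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Finished sequences leave and queued ones join at every iteration."""
    queue = deque(output_lens)
    running: List[int] = []          # remaining tokens per in-flight sequence
    steps = 0
    while queue or running:
        # Fill free slots from the waiting queue before the next forward pass.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One forward pass: every running sequence emits one token.
        steps += 1
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


lens = [2, 10, 2, 2, 2, 2]           # one long request among short ones
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)
```

With these inputs the static schedule needs 14 forward passes and the continuous one needs 10, because the short requests ride along in the slot next to the long one instead of waiting for it.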
Common follow-ups: - "What stops you from increasing batch size indefinitely?" - "How does this interact with speculative decoding?"
Traps: - Confusing continuous batching with naive dynamic batching. Token-level vs request-level scheduling is the key.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "Explain prefill vs decode and why they're treated differently."
Tags: senior · common · conceptual · source: standard senior inference-internals probe; 2026 LLM serving guides
Answer outline:
- Prefill: processing the user's input prompt. All N input tokens go through the model in a single (or chunked) forward pass. High arithmetic intensity: each weight is loaded once and used across N positions. Compute-bound (FLOPs-limited).
- Decode: generating one output token at a time. Each step loads the entire model + KV cache for the single new token. Low arithmetic intensity. Memory-bandwidth-bound.
- Implications:
- Prefill cost has an O(N²) attention term on top of the linear-in-N matmul cost; long prompts have expensive prefill.
- Decode latency per token is roughly constant at short contexts (set by memory bandwidth and model size) and grows gradually with context length, because each step also reads the whole KV cache.
- Different optimization techniques apply: prefill benefits from FlashAttention and quantization (compute-side wins); decode benefits from KV cache compression, GQA, smaller models, speculative decoding.
- Engine handling:
- vLLM and SGLang interleave prefill and decode in continuous batching.
- Chunked prefill (newer): split a long prefill into chunks, interleave with decode tokens. Reduces head-of-line blocking when a new long-context request lands and would otherwise starve the decode pipeline. vLLM and SGLang support it; TensorRT-LLM does too.
- Disaggregated prefill/decode: separate GPU pools for prefill (compute-optimized) and decode (memory-bandwidth-optimized). Cutting-edge 2026 setups; complex to operate, big throughput wins.
- Numbers to drop: "prefill: O(N²) attention compute, often <1s at moderate context", "decode: ~constant per-token, dominated by memory bandwidth", "chunked prefill: 30-50% TTFT improvement on long-context requests under load"
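A rough roofline estimate shows why the two phases hit different limits. This is a sketch with stated assumptions: an 8B-parameter model in BF16, approximate H100 SXM spec figures (about 3.35 TB/s HBM bandwidth, about 989 dense BF16 TFLOPS), about 2 FLOPs per parameter per token, and the attention and KV-cache terms ignored. Real numbers will be lower than these ceilings.

```python
def decode_tokens_per_sec_ceiling(param_count: float,
                                  bytes_per_param: float,
                                  mem_bandwidth_bytes_per_sec: float) -> float:
    """Batch-1 decode ceiling: each step streams every weight from HBM once."""
    return mem_bandwidth_bytes_per_sec / (param_count * bytes_per_param)


def prefill_seconds_floor(param_count: float,
                          prompt_tokens: int,
                          peak_flops: float) -> float:
    """Prefill floor from matmul FLOPs alone (about 2 FLOPs per parameter per token)."""
    return 2 * param_count * prompt_tokens / peak_flops


PARAMS = 8e9            # assumed model size
HBM_BW = 3.35e12        # bytes/s, approximate H100 SXM spec
PEAK_BF16 = 989e12      # FLOP/s, approximate dense BF16 peak

decode_ceiling = decode_tokens_per_sec_ceiling(PARAMS, 2, HBM_BW)
prefill_floor = prefill_seconds_floor(PARAMS, 4096, PEAK_BF16)
print(f"decode ceiling at batch 1: {decode_ceiling:.0f} tokens/s")
print(f"prefill floor for 4096 tokens: {prefill_floor * 1000:.0f} ms")
```

The decode ceiling depends only on bandwidth and model bytes, while the prefill floor depends only on FLOPs, which is the core of the "different bottlenecks, different optimizations" answer.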
Common follow-ups: - "What's chunked prefill?" - "When would you disaggregate prefill and decode?"
Traps: - Treating prefill and decode the same. Different bottlenecks, different optimizations.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/02_tokens_embeddings_context/
Q: "What is chunked prefill and when does it help?"
Tags: staff · common · conceptual · source: vLLM chunked prefill docs 2026; Spheron LLM serving optimization 2026
Answer outline:
- Chunked prefill: split a long prompt's prefill phase into multiple smaller chunks; interleave each chunk with decode steps for other in-flight requests.
- Why: under continuous batching, a single 100k-token prefill request would monopolize the GPU for seconds, blocking all decode-phase requests. Latency for everyone else spikes. With chunked prefill, you process the long prefill in pieces, alternating with decode work, so other users keep getting tokens.
- Win: smoother p99 latency under mixed workloads (long-context + short interactive both in flight). TTFT for the long-context request slightly worse; TTFT for the short interactive much better.
- Lose: pure throughput on a single very-long-context request can be slightly lower than one-shot prefill due to overhead.
- Tunable: the prefill chunk size (in vLLM it follows from max_num_batched_tokens; SGLang exposes --chunked-prefill-size). Larger = closer to one-shot prefill (lower overhead, more blocking); smaller = better interleaving (higher overhead, smoother latency).
- Modern engines (vLLM, SGLang, TensorRT-LLM) all support it; usually enabled by default in 2026.
- Numbers to drop: "prefill chunk size: 512-2048 tokens typical", "p99 TTFT improvement: 30-50% under mixed workloads", "single-request throughput overhead: 5-10%"
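The interleaving can be sketched as a per-iteration token budget: decode requests take one token each, and whatever budget is left goes to the next chunk of a pending prefill. This is a toy planner, not any engine's scheduler; the `PendingPrefill` class, `plan_iteration` function, and decode-first ordering are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PendingPrefill:
    request_id: str
    remaining: int                   # prompt tokens not yet prefilled


def plan_iteration(decode_ids: List[str],
                   prefills: List[PendingPrefill],
                   token_budget: int) -> List[Tuple[str, int]]:
    """Return (request_id, tokens) pairs to run in one forward pass."""
    # Decodes go first so in-flight users keep receiving tokens.
    plan = [(rid, 1) for rid in decode_ids[:token_budget]]
    budget = token_budget - len(plan)
    for prefill in prefills:
        if budget <= 0:
            break
        if prefill.remaining == 0:
            continue
        chunk = min(prefill.remaining, budget)
        prefill.remaining -= chunk
        budget -= chunk
        plan.append((prefill.request_id, chunk))
    return plan


decodes = ["a", "b", "c"]
long_prompt = PendingPrefill("long", remaining=5000)
plans = []
while long_prompt.remaining > 0:
    plans.append(plan_iteration(decodes, [long_prompt], token_budget=2048))
print(len(plans), plans[0][-1], plans[-1][-1])
```

The 5000-token prompt is spread over three iterations, and every one of them still advances the three decode requests by a token.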
Common follow-ups: - "Why does this matter for production?" - "What's a workload where you'd disable chunked prefill?"
Traps: - Saying chunked prefill is always on. Tunable; sometimes off for pure-throughput workloads.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "What's disaggregated prefill and decode?"
Tags: staff · occasional · conceptual · source: 2026 inference research and emerging production patterns
Answer outline:
- Run prefill on one pool of GPUs (compute-optimized: high FLOPs); run decode on a separate pool (memory-bandwidth-optimized: high HBM bandwidth, possibly different hardware tier).
- Idea: prefill and decode have opposite bottleneck profiles. When one GPU does both, whichever phase is running leaves the other phase's binding resource underused.
- Architecture:
- Prefill pool computes the KV cache for a request, then transfers the KV cache to a decode-pool node via fast interconnect (NVLink, InfiniBand).
- Decode pool serves token-by-token generation against the transferred KV cache.
- Routing layer manages where each request goes.
- Wins:
- Better hardware utilization per dollar.
- Independent scaling: prefill and decode pools are sized separately to their respective traffic shape.
- Lower p99 latency on mixed workloads.
- Costs:
- Operational complexity. Two pools to manage.
- KV cache transfer overhead (mitigated by fast interconnect; meaningful on commodity networking).
- More moving parts → more failure modes.
- 2026 reality: cutting-edge, used by frontier labs and some hyperscalers. Most production teams use a single pool with vLLM continuous batching. Adopt only when scale clearly justifies the complexity.
- Numbers to drop: "10-30% efficiency gain at the system level when scale justifies", "operational cost: significant, usually 1+ FTE on serving infra"
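The KV cache transfer overhead can be estimated from the model shape. This is a minimal sketch: the shape below (32 layers, 8 KV heads, head dimension 128, BF16) is an assumed 8B-class GQA configuration, and the two link speeds are illustrative round numbers, not measurements.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache footprint of one token across all layers."""
    # Factor 2: one K vector and one V vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_transfer_seconds(prompt_tokens: int, bytes_per_token: int,
                        link_bytes_per_sec: float) -> float:
    """Time to ship a prompt's KV cache from a prefill node to a decode node."""
    return prompt_tokens * bytes_per_token / link_bytes_per_sec


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_elem=2)
fast_link = kv_transfer_seconds(8192, per_token, 50e9)      # ~400 Gb/s class link
slow_link = kv_transfer_seconds(8192, per_token, 1.25e9)    # ~10 Gb/s Ethernet
print(per_token, round(fast_link, 4), round(slow_link, 3))
```

Under these assumptions an 8192-token prompt carries 1 GiB of KV cache: tens of milliseconds on a fast fabric, close to a second on commodity networking, which is why the interconnect decides whether disaggregation is viable.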
Common follow-ups: - "What hardware tier for prefill vs decode?" - "What's the bottleneck of KV cache transfer?"
Traps: - Recommending disaggregation for small/medium setups. Overkill.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Tuning & sizing
Q: "Walk me through tuning vLLM for a production workload."
Tags: senior · common · design · source: vLLM tuning docs 2026; standard senior serving probe
Answer outline:
- Process:
- Profile the workload: average prompt length, output length, QPS, max-concurrency targets. These drive all subsequent decisions.
- Pick the right hardware: H100/H200 for high-throughput; A100 for budget; consumer cards for low-volume.
- Pick the precision: BF16 for safety; FP8 on H100+ for ~1.5-1.7× throughput; INT4 (AWQ/GPTQ) for cost-critical with small quality hit.
- Tune the key flags:
- --max-num-seqs: concurrent sequence cap. Higher = more throughput, more KV cache memory. Start at 64-128, increase until KV cache utilization peaks.
- --max-num-batched-tokens: per-iteration token budget. Affects prefill/decode balance. 2048-8192 typical.
- --max-model-len: hard cap on sequence length. Drives KV cache memory allocation.
- --gpu-memory-utilization: fraction of GPU memory to use (default 0.9). Lower if OOM, higher to push harder.
- --enable-chunked-prefill: usually on by default in 2026.
- --enable-prefix-caching: turn on if your workload has shared prefixes.
- Benchmark: measure TTFT p50/p95/p99 and tokens/sec at target concurrency. Compare against your SLO.
- Iterate: bump batch size, check OOM, check tail latency. If p99 spikes, lower concurrency cap.
- Special features: AWQ/GPTQ INT4 weights, speculative decoding (provide a draft model), LoRA adapters served per-tenant, KV cache compression.
- Monitoring: vLLM exposes Prometheus metrics (vllm:e2e_request_latency_seconds, vllm:gpu_cache_usage_perc, etc.) — scrape these.
- Numbers to drop: "max-num-seqs: 64-200 typical per H100 on 7B FP16", "gpu-memory-utilization: 0.85-0.95", "iteration: 3-5 tuning rounds to converge"
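One way to keep tuning rounds reproducible is to generate the launch command from code. The sketch below only assembles a `vllm serve` argument list from the flags named above; the model id is a placeholder, the default values are starting points taken from the ranges in this answer rather than recommendations, and flag behavior can differ between vLLM versions, so check the docs for the version you run.

```python
import shlex
from typing import List


def vllm_serve_command(model: str, *,
                       max_num_seqs: int = 128,
                       max_num_batched_tokens: int = 8192,
                       max_model_len: int = 8192,
                       gpu_memory_utilization: float = 0.90,
                       chunked_prefill: bool = True,
                       prefix_caching: bool = False) -> List[str]:
    """Build the argv for one tuning round; does not launch anything."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    cmd = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if chunked_prefill:
        cmd.append("--enable-chunked-prefill")
    if prefix_caching:
        cmd.append("--enable-prefix-caching")
    return cmd


# "my-org/my-7b-model" is a hypothetical model id.
command = vllm_serve_command("my-org/my-7b-model", prefix_caching=True)
print(shlex.join(command))
```

Logging the exact command next to each benchmark result makes it easy to see which flag change moved TTFT or throughput.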
Common follow-ups: - "What if you OOM at higher batch sizes?" - "When do you turn off prefix caching?"
Traps: - One-shot tuning. Workloads shift; re-tune periodically.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Q: "How do you size a vLLM cluster for a target QPS?"
Tags: senior · common · design · source: standard senior infra-sizing probe; 2026 AI engineer loops
Answer outline:
- Inputs: model size, target QPS, average input/output tokens, p95 TTFT SLO, hardware choice.
- Per-replica throughput estimation:
- Run a synthetic benchmark on one GPU at your target prompt/response shapes.
- Measure tokens/sec sustained at acceptable tail latency.
- Divide target tokens/sec by per-replica throughput → required replica count.
- Headroom: add 30-50% buffer over peak traffic. GPUs are slow to spin up.
- HA: minimum 2-3 replicas for redundancy even at low QPS.
- Multi-region / multi-AZ: replicas spread across zones; load balancer aware.
- Autoscaling: based on queue depth or tokens-in-flight (see mlops-deployment.md), not raw CPU. Scale-up lag is 10-30 min on cold GPU instances; pre-warm a reserve pool.
- Cost check: estimate $/request from GPU $/hour ÷ per-replica throughput. Compare against API-tier costs ($0.001-0.05/call typical). Self-hosting wins meaningfully only above ~50k req/day after accounting for ops cost.
- Numbers to drop: "single H100 on 7B FP16 + GQA: 1000-3000 tokens/sec sustained at concurrency 64", "headroom: 30-50% over peak", "break-even vs API: ~50k req/day typical"
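The sizing arithmetic above fits in a few lines. This is a sketch under stated assumptions: throughput is counted in output tokens only, the per-replica figure comes from your own benchmark, and the example QPS, token counts, and $3/hour GPU price are hypothetical.

```python
import math


def required_replicas(peak_qps: float, avg_output_tokens: float,
                      per_replica_tokens_per_sec: float,
                      headroom: float = 0.4, min_replicas: int = 2) -> int:
    """Replica count for peak traffic plus headroom, with an HA floor."""
    target_tokens_per_sec = peak_qps * avg_output_tokens
    needed = math.ceil(target_tokens_per_sec * (1 + headroom)
                       / per_replica_tokens_per_sec)
    return max(needed, min_replicas)


def cost_per_request(gpu_usd_per_hour: float,
                     per_replica_tokens_per_sec: float,
                     avg_output_tokens: float) -> float:
    """GPU cost of one request on a fully utilized replica."""
    requests_per_hour = per_replica_tokens_per_sec / avg_output_tokens * 3600
    return gpu_usd_per_hour / requests_per_hour


replicas = required_replicas(peak_qps=20, avg_output_tokens=300,
                             per_replica_tokens_per_sec=2000)
low_traffic = required_replicas(peak_qps=0.5, avg_output_tokens=300,
                                per_replica_tokens_per_sec=2000)
unit_cost = cost_per_request(3.0, 2000, 300)
print(replicas, low_traffic, f"${unit_cost:.6f}")
```

The per-request cost assumes the replica is saturated; divide by your real utilization to get the figure you would compare against API pricing.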
Common follow-ups: - "What's your buffer strategy for traffic spikes?" - "When does self-hosting not pay off?"
Traps: - No buffer. Spike kills the cluster. - Sizing for average instead of peak.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/04_ml_platform_operations/
Specialized features
Q: "How does vLLM serve multiple LoRA adapters efficiently?"
Tags: staff · common · conceptual · source: vLLM multi-LoRA docs 2026; standard staff-tier serving probe
Answer outline:
- vLLM supports loading N LoRA adapters atop the same base model and routing requests to the appropriate adapter at runtime. Saves dramatically on hardware vs serving N separate full models.
- How: adapters live in GPU memory; per-request, the engine applies the relevant adapter's deltas during the forward pass. Latency overhead is small (~5-15% vs base-only).
- Use cases:
- Multi-tenant SaaS where each customer has their own fine-tuned adapter on a shared base.
- A/B testing of adapter variants without deploying separate clusters.
- Per-task specialization (one adapter for code, another for chat, another for summarization).
- Limits:
- Concurrent adapter count: typically up to 8-32 active adapters in memory; beyond that, adapter loading from disk becomes a bottleneck.
- All adapters must target the same base model and fit within the configured maximum rank (max_lora_rank in vLLM).
- Operational pattern: hot adapters always loaded; cold adapters swapped in on first request (cold-start latency 100ms-1s).
- Numbers to drop: "concurrent active adapters: 8-32 typical per node", "per-request overhead vs base-only: 5-15%", "adapter cold-start: 100ms-1s"
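The hot/cold adapter pattern is essentially an LRU cache in front of a slow loader. The sketch below is a generic illustration, not vLLM's internal adapter manager: `AdapterCache`, its `load_fn` callback, and the capacity of 2 are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable


class AdapterCache:
    """Keeps the most recently used adapters resident; evicts the coldest."""

    def __init__(self, capacity: int, load_fn: Callable[[str], Any]) -> None:
        self.capacity = capacity
        self._load_fn = load_fn
        self._hot: "OrderedDict[str, Any]" = OrderedDict()
        self.cold_loads = 0

    def get(self, adapter_id: str) -> Any:
        if adapter_id in self._hot:
            # Hit: mark as most recently used.
            self._hot.move_to_end(adapter_id)
            return self._hot[adapter_id]
        # Miss: pay the cold-start cost, then evict the least recently used.
        self.cold_loads += 1
        adapter = self._load_fn(adapter_id)
        self._hot[adapter_id] = adapter
        if len(self._hot) > self.capacity:
            self._hot.popitem(last=False)
        return adapter


# Stand-in loader; a real one would read adapter weights from storage.
adapters = AdapterCache(capacity=2, load_fn=lambda name: f"weights:{name}")
for tenant in ["a", "b", "a", "c", "a", "b"]:
    adapters.get(tenant)
print(adapters.cold_loads)
```

Tracking `cold_loads` against total requests gives the adapter miss rate, a natural input when deciding how many adapters to keep hot per node.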
Common follow-ups: - "What's the cold-start cost?" - "How do you decide which adapters stay hot?"
Traps: - Loading every adapter on every node. Wasteful.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/06_adaptation_compression/
Q: "How does speculative decoding integrate with the serving engine?"
Tags: senior · common · conceptual · source: vLLM speculative decoding docs 2026; covered also in cost-latency-optimization.md
Answer outline:
- vLLM, TensorRT-LLM, and SGLang all support speculative decoding. The engine configuration takes a draft model + target model + speculation length.
- Engine handles: drafting (sampling K candidate tokens from the draft model), verification (running the target model on those K candidates in parallel), accepting tokens up to the first rejection, and looping.
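- Quality: lossless. With correct rejection sampling the output distribution matches the target model's exactly (token-identical under greedy decoding, up to floating-point nondeterminism). It's an exact algorithm, not an approximation.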
- Win: 1.5-2.8× decode speedup typical on workloads where the draft model has high agreement with the target.
- Caveats:
- Only helps decode. Prefill is unchanged.
- Helps less at high batch sizes (compute becomes the bottleneck, not memory bandwidth, so the extra verification FLOPs are no longer free).
- Requires a compatible draft model — usually a smaller checkpoint from the same model family.
- Alternative variants supported by engines: Medusa (multiple decoding heads in the target model itself), EAGLE-3 (a lightweight learned draft head that reuses the target model's hidden states). vLLM supports both. See cost-latency-optimization.md for the algorithm details.
- Numbers to drop: "1.5-2.8× decode speedup at low batch", "diminishing benefit past batch 32", "draft model: 5-15% of target's parameter count typical"
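The accept-until-first-rejection loop is easiest to see in the greedy case. This is a minimal sketch, not an engine implementation: `verify_greedy` assumes the target model's greedy choices for all draft positions are already available from one parallel forward pass, and `expected_tokens_per_step` uses the standard simplification that each draft token is accepted independently with the same probability.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Tokens emitted by one verification step under greedy decoding.

    target_tokens[i] is the target's greedy choice given the context plus
    draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    """
    emitted: List[int] = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First rejection: keep the target's token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)
    # Every draft token accepted: the extra target token comes for free.
    emitted.append(target_tokens[len(draft_tokens)])
    return emitted


def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens per verification step for k drafts (accept_rate < 1)."""
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


partial = verify_greedy([5, 6, 7], [5, 6, 9, 1])
full = verify_greedy([5, 6, 7], [5, 6, 7, 8])
print(partial, full, round(expected_tokens_per_step(0.8, 4), 4))
```

Every verification step emits at least one token that the target model would have produced anyway, which is why the output is unchanged and only the number of target forward passes drops.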
Common follow-ups: - "When does it not help?" - "Walk me through Medusa vs draft-model speculative."
Traps: - Treating speculative decoding as always-on. Batch-size-sensitive.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "How does FP8 affect inference serving?"
Tags: staff · common · conceptual · source: NVIDIA H100/H200 FP8 docs; 2026 serving guides
Answer outline:
- FP8 = 8-bit floating point. Half the memory of FP16/BF16, faster matmul on H100+, which has FP8 tensor-core support.
- Two FP8 formats: E4M3 (more mantissa precision; the usual choice for weights and activations in inference) and E5M2 (more dynamic range; used mainly for gradients in training).
- Speedups: ~1.5-1.7× throughput vs BF16 on the same H100. Comparable to INT8 but with better numerical properties for LLMs.
- Quality: typically near-lossless on most benchmarks (within 0-1%). Better than INT4 (~1-3% loss); slightly behind BF16 on very sensitive tasks.
- Engine support:
- TensorRT-LLM: best FP8 support, often the deploy target for FP8 production.
- vLLM: FP8 weight quantization since 2024; FP8 KV cache support in newer versions.
- SGLang: comparable.
- Use case: production frontier models on H100/H200/B200 where every percent of throughput matters and quality is acceptable.
- Caveats: native FP8 tensor cores require Hopper (H100) or newer datacenter GPUs (Ada-generation cards also have them); A100 doesn't have native FP8 tensor cores. Mostly a 2026+ deployment story.
- Numbers to drop: "FP8 vs BF16 on H100: 1.5-1.7× throughput", "quality loss: 0-1% typical, model-dependent", "memory: 0.5× BF16"
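The "memory: 0.5× BF16" point translates directly into KV cache room. This is a simple sketch with assumed inputs: an 8B and a 70B parameter count, an 80 GB GPU, a 0.9 memory-utilization setting, and decimal gigabytes; activation and framework overheads are ignored.

```python
def weight_gb(param_count: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB."""
    return param_count * bytes_per_param / 1e9


def kv_budget_gb(gpu_gb: float, gpu_memory_utilization: float,
                 weights_gb: float) -> float:
    """Memory left for KV cache after weights (negative means it does not fit)."""
    return gpu_gb * gpu_memory_utilization - weights_gb


bf16_8b = weight_gb(8e9, 2)
fp8_8b = weight_gb(8e9, 1)
kv_bf16 = kv_budget_gb(80, 0.9, bf16_8b)
kv_fp8 = kv_budget_gb(80, 0.9, fp8_8b)
fits_70b_bf16 = kv_budget_gb(80, 0.9, weight_gb(70e9, 2)) > 0
print(bf16_8b, fp8_8b, kv_bf16, kv_fp8, fits_70b_bf16)
```

Halving the weight footprint leaves more room for KV cache on the same GPU, which raises the concurrency ceiling in addition to the faster matmuls.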
Common follow-ups: - "FP8 vs INT8 — which?" - "How does FP8 KV cache work?"
Traps: - Recommending FP8 on A100. Not supported in hardware.
Related cross-cutting: Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/00_ai_foundation/06_adaptation_compression/
Operational concerns
Q: "What do you monitor on a vLLM serving node?"
Tags: senior · common · design · source: vLLM Prometheus metrics docs; standard senior ops probe
Answer outline:
- vLLM exposes a Prometheus endpoint with ~25 native metrics. Key ones to scrape and dashboard:
- Request-level: TTFT (vllm:time_to_first_token_seconds), end-to-end (vllm:e2e_request_latency_seconds), generation throughput (tokens/sec).
- Batch state: number of running requests, waiting queue depth, swapped requests (KV cache evicted to CPU).
- KV cache: cache hit rate (prefix cache), cache utilization (vllm:gpu_cache_usage_perc).
- Scheduling: preemptions (when a request is paused to free up KV cache for higher-priority), recompute events.
- GPU: utilization, memory used, memory free.
- Alarms:
- Queue depth growing → scale up.
- KV cache utilization >95% sustained → either lower max-num-seqs or scale up.
- p99 TTFT > SLO for 15 min → investigate (load? long-context outlier? regression?).
- Preemption rate > 5% → KV cache pressure, lower concurrency or upgrade hardware.
- Aggregation: per-replica + cluster-wide dashboards. Per-model if you serve multiple.
- Trace integration: vLLM can emit OpenTelemetry spans for each request when tracing is configured; tie them to the broader observability stack (observability-tracing.md).
- Numbers to drop: "vLLM emits ~25 metrics natively", "KV cache utilization alarm: >95% for 5 min", "preemption rate target: <1-5%"
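The alarm thresholds above can be written down as a small rule set. This is an illustrative sketch, not an alerting system: `NodeSnapshot` and its field names are hypothetical, the values would come from your metrics pipeline, and the "sustained for N minutes" windows are left to the alerting layer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    queue_depth_delta: int       # change in waiting requests over the window
    kv_cache_usage: float        # 0.0-1.0
    preemption_rate: float       # preempted / scheduled requests
    p99_ttft_seconds: float


def evaluate_alerts(snap: NodeSnapshot, ttft_slo_seconds: float) -> List[str]:
    """Map one metrics snapshot to the alarms listed in the outline."""
    alerts: List[str] = []
    if snap.queue_depth_delta > 0:
        alerts.append("queue_growing")          # scale up
    if snap.kv_cache_usage > 0.95:
        alerts.append("kv_cache_pressure")      # lower max-num-seqs or scale up
    if snap.p99_ttft_seconds > ttft_slo_seconds:
        alerts.append("ttft_slo_breach")        # investigate
    if snap.preemption_rate > 0.05:
        alerts.append("high_preemption")        # lower concurrency
    return alerts


stressed = evaluate_alerts(NodeSnapshot(3, 0.97, 0.01, 0.4), ttft_slo_seconds=0.5)
healthy = evaluate_alerts(NodeSnapshot(0, 0.60, 0.00, 0.2), ttft_slo_seconds=0.5)
print(stressed, healthy)
```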
Common follow-ups: - "What's a preemption?" - "How do you tie this back to user-perceived latency?"
Traps: - Only CPU/RAM monitoring. Misses LLM-specific signals.
Related cross-cutting: Production patterns
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "How do you handle a node failure in a vLLM cluster?"
Tags: senior · common · scenario · source: standard senior reliability probe; 2026 AI serving loops
Answer outline:
- Failure detection:
- Health check fails (deep check: model produces a valid response on a synthetic prompt; not just process-alive).
- Load balancer evicts the node from the active pool.
- In-flight requests on that node fail; clients retry against other nodes (idempotent if the request is just generation; trickier for stateful agents).
- Capacity impact: if you sized N replicas with headroom for failure, the surviving replicas absorb the load. Without headroom, the cluster degrades; scale-up triggered.
- Recovery:
- Mark node as failed; start replacement (autoscaler spins up new instance).
- On commodity cloud, GPU instance provisioning is 10-30 min. Pre-warmed reserve pool helps.
- New node loads the model (~30s-2min depending on size and storage path).
- Health checks pass → node joins the pool.
- For long-running agent requests: agents with durable state can resume on a different replica. Stateless requests just retry. See agents-debugging-production.md for the agent-side handling.
- Postmortem: every node failure logged. Pattern: is one hardware SKU failing more? Is one region degraded? Is a particular workload triggering crashes?
- Numbers to drop: "GPU instance provisioning: 10-30 min cold; pre-warm reserve to cut to seconds", "replica replacement target: <30 min including model load", "headroom: 30-50% over peak so single-node failure doesn't cascade"
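Whether headroom actually absorbs a failure depends on the replica count, and a quick calculation makes that visible. This is a sketch under simple assumptions: identical replicas, evenly balanced load, and total capacity sized at peak × (1 + headroom). The function names are illustrative.

```python
def load_fraction_after_failures(replicas: int, headroom: float, failed: int) -> float:
    """Peak load divided by surviving capacity; above 1.0 means overloaded."""
    if failed >= replicas:
        raise ValueError("no surviving replicas")
    # Capacity is expressed in units of peak load.
    surviving_capacity = (1 + headroom) * (replicas - failed) / replicas
    return 1 / surviving_capacity


def min_headroom_for_failures(replicas: int, failed: int) -> float:
    """Smallest headroom that keeps peak load servable after `failed` losses."""
    if failed >= replicas:
        raise ValueError("no surviving replicas")
    return failed / (replicas - failed)


thin = load_fraction_after_failures(replicas=4, headroom=0.3, failed=1)
safe = load_fraction_after_failures(replicas=4, headroom=0.5, failed=1)
needed = min_headroom_for_failures(replicas=4, failed=1)
print(round(thin, 3), round(safe, 3), round(needed, 3))
```

With only four replicas, 30% headroom is not enough to survive one failure at peak; the required headroom falls as the replica count grows, which is one argument for more, smaller replicas.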
Common follow-ups: - "What if many nodes fail at once?" - "How long are in-flight requests stuck?"
Traps: - No headroom. Single failure cascades into outage.
Related cross-cutting: Production patterns
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/04_resilient_agent_systems/
Q: "Walk me through serving multiple models on the same cluster."
Tags: senior · common · design · source: standard senior multi-model serving probe; 2026 AI serving loops
Answer outline:
- Two patterns:
- Per-model pools: one replica set per model. Routing layer dispatches by model ID. Simple, clean, but each model needs its own minimum replica count for HA → expensive at low traffic per model.
- Shared pool with model swapping: replicas can serve any model; load the requested model on demand (cached if recently used). Throughput-efficient but cold-start cost on first request.
- For 2-5 hot models, per-model pools are usually right. For 50+ infrequent models (e.g., per-customer fine-tunes), shared pool with caching wins.
- Multi-LoRA (covered earlier) is a special case: same base, many adapters, all served from one replica.
- Routing: gateway layer maps (tenant/request_type) → model_id → cluster. Cost attribution: model-level + tenant-level dashboards.
- Storage: model weights in fast shared storage (NFS / S3 with fast local cache). Load times matter for cold-swap.
- Numbers to drop: "per-model pool: minimum 2-3 replicas per model for HA, expensive at scale", "shared pool with model cache: cold-load ~30s-2min, fits 5-20 models in cache", "multi-LoRA: same base, 8-32 adapters per replica"
Common follow-ups: - "How do you handle a rarely-used model?" - "When does model swapping cost more than it saves?"
Traps: - One model per cluster always. Wastes GPU.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/
Scenario / debugging
Q: "Your vLLM cluster TTFT spiked. Walk me through diagnosis."
Tags: senior · common · debugging · source: standard senior serving-incident probe; 2026 AI serving loops
Answer outline:
- Step 1: confirm scope. Per-node or cluster-wide? Per-model? Per-tenant?
- Step 2: check the usual suspects:
- Queue depth growing: load exceeded capacity. Scale up; meanwhile shed load on the lowest-priority traffic.
- KV cache pressure: utilization at 95%+; preemptions firing. Lower max-num-seqs or shed long-context traffic.
- Long-context outlier: one or more requests with 100k+ tokens blocking the pipeline. Chunked prefill should be on; check it is. Consider routing long-context to a separate pool.
- Slow downstream: a tool the agent calls is slow. Not really a serving problem; check downstream health.
- Hardware degradation: one GPU running slow (thermal throttling, memory ECC errors). Check GPU health metrics; evict the node.
- Recent deploy: rolled out new model / new vLLM version / new config. Roll back.
- Step 3: fix the immediate problem (rollback, scale, load shed).
- Step 4: root cause + regression test. Add the failure pattern to the eval suite.
- Numbers to drop: "diagnose-and-mitigate target: 15 min for SEV-1", "rollback as default for any deploy-correlated incident"
Common follow-ups: - "How do you tell a load problem from an infra-side problem from a regression?" - "What metric would have caught this sooner?"
Traps: - Trying to optimize before diagnosing.
Related cross-cutting: Production patterns, Cost & latency
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/03_agent_observability_debugging/