05. Serving and inference¶

⏱️ Estimated time: 24 min | Level: advanced

ELI5 callback: In our chain, the kitchen trains, the prep station prepares, the recipe book stores, the serving counter serves, and the quality inspector checks. Same restaurant chain, different platform layer.

Inference is product delivery under a clock¶

Training can be slow. Inference usually cannot. Users feel prediction latency directly inside the product. That makes serving a systems problem first. See. The serving counter is where model quality meets user patience. The kitchen may build a great model that still serves badly. The prep station must keep hot features ready before requests arrive. The recipe book should tell the server exactly which version is approved. The quality inspector later measures latency, errors, and business lift. So what to do? Start with the latency budget and the traffic shape. Know the request rate, payload size, concurrency, and peak bursts. Know whether the path is synchronous, asynchronous, or batch. Simple, no? Serving design begins with workload reality. Then choose the serving stack that matches that reality.

client → gateway → feature fetch → model server → response
   │         │            │              │
   │         └── auth ────┘              │
   │                                    │
   └──────────── latency budget ────────┘
                 │
                 v

Serving exists to satisfy product latency and reliability goals.
Workload shape should choose architecture, not habit.
Keep budgets visible for every hop.
Separate online, async, and batch paths clearly.
Good serving starts before picking TorchServe or Triton.
Now watch. Model servers differ mainly in optimization focus.
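Request rate and latency together tell you how much work is in flight at once. Here is a minimal sketch using Little's law (in-flight requests = arrival rate × time in system). The function name, the 30% headroom default, and the example numbers are assumptions for illustration, not measurements.

```python
import math


def replicas_needed(peak_rps: float, avg_latency_s: float,
                    per_replica_concurrency: int, headroom: float = 0.3) -> int:
    """Rough replica estimate from Little's law (hypothetical helper)."""
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    in_flight = peak_rps * avg_latency_s        # requests being served at once
    padded = in_flight * (1.0 + headroom)       # spare room for bursts
    return max(1, math.ceil(padded / per_replica_concurrency))


# 400 req/s at 50 ms each is about 20 requests in flight. With 30% headroom
# and 8 concurrent requests per replica, the estimate rounds up to 4.
print(replicas_needed(peak_rps=400, avg_latency_s=0.05, per_replica_concurrency=8))
```

Treat the result as a starting point for load tests. Bursty traffic and slow tails usually need more room than the average suggests.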

Model servers package runtimes, batching, and device control¶

TorchServe fits many PyTorch workflows with handlers and management APIs. Triton Inference Server supports multiple framework backends, with dynamic batching and concurrent model execution on GPUs. vLLM focuses on LLM throughput with PagedAttention, which manages KV-cache memory in blocks, and continuous batching. The right choice depends on model type and traffic pattern. See. One stack rarely wins every workload. Classic tabular models may live fine behind simple HTTP services. Vision and deep learning stacks often benefit from Triton optimization. LLM workloads need careful KV-cache and token batching behavior. So what to do? Benchmark with your real prompt sizes and concurrency. Synthetic microbenchmarks can lie politely. Also measure cold-start time and model load time. Those delays matter during autoscaling events. Now watch. Device utilization is often the hidden serving bottleneck.

request queue
    │
    ├── dynamic batching
    ├── runtime handler
    ├── GPU execution
    └── post-process output
             ↓

Match server choice to framework and traffic reality.
Benchmark throughput and tail latency together.
Include startup costs in capacity planning.
Handler code can dominate latency for small models.
Runtime choice is architecture, not branding.
Simple, no? Benchmark the full path, not the headline claim.
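A benchmark report is more honest when throughput, tail latency, and startup cost sit side by side. The sketch below summarizes latency samples you have already collected. The function names, the nearest-rank percentile method, and the sample numbers are assumptions for illustration.

```python
import math


def percentile(sorted_ms: list, pct: float) -> float:
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize(latencies_ms: list, wall_clock_s: float, model_load_s: float) -> dict:
    """Report throughput, tail latency, and model load time together."""
    ordered = sorted(latencies_ms)
    return {
        "throughput_rps": len(ordered) / wall_clock_s,
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "model_load_s": model_load_s,   # cold-start cost, measured separately
    }


# Made-up samples: most requests are fast, a few are very slow.
samples = [12.0] * 90 + [40.0] * 8 + [250.0] * 2
report = summarize(samples, wall_clock_s=2.0, model_load_s=18.5)
print(report)
```

With these samples the median is 12 ms while p99 is 250 ms. The average alone would hide that gap.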

Batching, caching, and autoscaling shape performance¶

Dynamic batching improves throughput by combining nearby requests. But batching adds wait time, so latency budgets must stay explicit. Caching can avoid repeated inference for identical or similar inputs. Feature caching often helps more than prediction caching. See. Tail latency decides user trust during traffic spikes. Autoscaling needs the right signals. CPU usage alone may miss GPU saturation or queue growth. Queue depth, tokens per second, and model load count help more. So what to do? Scale on the bottleneck you actually observe. Also separate scale-out from scale-up decisions. More replicas help only if the model fits and traffic partitions well. Bigger GPUs help only if utilization justifies the cost. Now watch. Latency optimization is budget allocation hop by hop.

incoming traffic
      │
      ├── queue
      ├── batch window
      ├── infer
      └── cache / respond
             ↓

Tune batch windows against tail latency, not average alone.
Scale on queue pain, device pain, or token pain explicitly.
Keep warm capacity for bursty traffic.
Cache where hit rate is meaningful and safe.
Measure p95 and p99, not only p50.
See. Throughput without latency control is a trap.
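The batching trade-off is easy to see in a toy simulation. This is a simplified model, not how any particular server implements it: a batch opens at its first arrival and dispatches when it is full or when its window expires. The function names and arrival times are made up.

```python
def form_batches(arrivals_ms, window_ms, max_batch):
    """Group sorted arrival times into (dispatch_time, members) batches."""
    batches = []
    current = []
    for t in arrivals_ms:
        # Close the open batch if this arrival falls outside its window.
        if current and t > current[0] + window_ms:
            batches.append((current[0] + window_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append((t, current))    # a full batch dispatches right away
            current = []
    if current:
        batches.append((current[0] + window_ms, current))
    return batches


def max_added_wait(batches):
    """Worst extra wait any request paid before its batch dispatched."""
    return max(dispatch - arrival
               for dispatch, members in batches
               for arrival in members)


arrivals = [0, 1, 2, 9, 10, 30]
for window in (5, 20):
    batches = form_batches(arrivals, window_ms=window, max_batch=4)
    print(window, len(batches), max_added_wait(batches))
```

The wider window needs fewer batches (2 instead of 3), but the worst added wait grows from 5 ms to 20 ms. That wait comes straight out of the latency budget.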

LLM inference adds memory pressure and token economics¶

LLM serving is different because sequence length changes compute cost. Prompt size, output size, and concurrency all interact. KV-cache management becomes central to throughput. Continuous batching improves utilization across uneven token streams. See. Tokens are the real unit of work here. Prompt caching can reduce repeated prefix cost for common contexts. Quantization may lower cost, but it can alter quality. Speculative decoding can increase speed when a small draft model's proposals are usually accepted by the larger target model. So what to do? Measure cost per generated token and business outcome. That keeps optimization grounded. Also protect the system from giant prompts and runaway generations. Admission controls and truncation policies matter. Now watch. Guardrails are part of serving, not an afterthought.

prompt → prefill → KV cache
   │         │         │
   ├─────────┴─────────┤
   │ continuous batching
   │ decode tokens
   └── stop / stream
            ↓

Optimize for tokens per second and quality together.
Memory often limits scale before compute does.
Put hard bounds on prompt and output size.
Streaming UX can hide some latency, not all cost.
LLM inference needs traffic and safety policies together.
Simple, no? Long prompts are expensive feelings.
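Why does memory limit scale so early? A back-of-envelope KV-cache estimate shows it. The sketch assumes a standard decoder that stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical and do not describe any specific model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (factor 2 = key plus value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Total KV cache for batch_size sequences of seq_len tokens each."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * batch_size


def max_sequences(free_bytes, per_sequence_bytes):
    """How many full-length sequences fit in the memory left for the cache."""
    return free_bytes // per_sequence_bytes


GIB = 1024 ** 3
# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_seq = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)
print(per_seq / GIB)                      # 0.5 GiB for one 4096-token sequence
print(max_sequences(24 * GIB, per_seq))   # 48 sequences fit in 24 GiB
```

Halving the allowed sequence length doubles how many sequences fit. That is one reason hard bounds on prompt and output size are a capacity decision too.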

Reliability needs rollback, observability, and graceful failure¶

Inference systems fail through timeouts, overload, bad inputs, and bad releases. Design fallback behavior for each failure class. Some products can use stale predictions or rules. Some must fail closed for safety. See. Reliability choices are business choices wearing system clothes. So what to do? Define error budgets and fallback modes explicitly. Add circuit breakers around dependent feature services. Track queue age, device health, model load failures, and output anomalies. Keep canary rollout paths short and reversible. Use one-click traffic shift back to the previous stable model. Also protect downstream systems from model-induced burst patterns. Bad retries can multiply damage. Now the serving stack becomes an actual platform, not a demo.

bad deploy / overload
        │
        ├── fallback path
        ├── alert + dashboard
        ├── rollback traffic
        └── postmortem data
                ↓

Plan graceful degradation before first traffic spike.
Tie every alert to rollback or mitigation steps.
Keep dependent services within the same reliability review.
Observe outputs, not only infrastructure health.
Reliable serving is disciplined operations plus good defaults.
See. Production kindness means predictable failure behavior.
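A circuit breaker around a feature service or model call can be sketched in a few lines. This is a minimal, single-threaded illustration with assumed names and thresholds; a production version would also need locking, metrics, and per-error-class handling.

```python
import time


class CircuitBreaker:
    """Skip a failing dependency for a while and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()       # open: do not touch the dependency
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the circuit again
        return result


def flaky_model():
    raise TimeoutError("model server did not answer in time")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
print([breaker.call(flaky_model, lambda: "rules fallback") for _ in range(4)])
```

After two failures the breaker stops calling the model and answers from the fallback directly. That also stops retries from piling onto an already overloaded service.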

Where this lives in the wild¶

A search team serves rankers behind Triton because GPU batching beats custom glue.
A support assistant team uses vLLM to improve LLM token throughput under bursty traffic.
A fraud service keeps a rules fallback because some paths cannot wait for model recovery.
A marketplace team scales on queue depth instead of CPU because GPU saturation was hidden.
A recommendation stack caches hot features more aggressively than model outputs for better savings.
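Scaling on queue depth, as in the marketplace example above, can follow a simple proportional rule. The sketch uses the same shape as the Kubernetes Horizontal Pod Autoscaler formula, desired = ceil(current × observed / target). The function name, replica limits, and numbers are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, observed_queue_depth, target_queue_depth,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling on average queue depth per replica."""
    if target_queue_depth <= 0:
        raise ValueError("target_queue_depth must be positive")
    ratio = observed_queue_depth / target_queue_depth
    desired = math.ceil(current_replicas * ratio)
    # Clamp so one bad metric reading cannot scale to zero or to the moon.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, observed_queue_depth=30, target_queue_depth=10))   # 12
print(desired_replicas(4, observed_queue_depth=5, target_queue_depth=10))    # 2
print(desired_replicas(10, observed_queue_depth=50, target_queue_depth=10))  # 20 (capped)
```

CPU usage can stay calm while requests pile up in front of a saturated GPU. Queue depth moves with the actual pain.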

Pause and recall¶

Why is inference design driven first by latency budget and traffic shape?
How can dynamic batching help and hurt at the same time?
Why does LLM serving care so much about token and memory economics?
What makes graceful degradation essential for production inference?

Interview Q&A¶

Q: How would you design a model serving system? A: I would start with request patterns, latency SLOs, feature dependencies, model server choice, autoscaling signals, and rollback behavior. Common wrong answer to avoid: I would put the model behind an API and optimize later.

Q: When would you choose Triton, TorchServe, or vLLM? A: Choose based on framework mix, batching needs, GPU optimization features, and whether LLM token scheduling is central. Common wrong answer to avoid: Whichever tool is newest on social media is best.

Q: How do you improve inference throughput safely? A: Use batching, caching, warm pools, right-sized hardware, and bottleneck-based autoscaling while guarding tail latency. Common wrong answer to avoid: Increase concurrency until the graphs look exciting.

Q: What is special about LLM inference? A: Prompt length, output length, KV-cache memory, continuous batching, and token pricing make throughput and cost behavior very different. Common wrong answer to avoid: LLM serving is the same as any REST service with bigger JSON.

Apply now (5 min)¶

Write a 100 ms latency budget for one inference endpoint. Split it into gateway, feature fetch, model compute, and response formatting. Then mark which hop currently has no direct metric. Add one fallback behavior for feature miss and one for model timeout. Finally, choose the autoscaling signal you would trust most. That five-minute exercise reveals real serving priorities.
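If you want to keep the exercise in a checkable form, here is one way to write it down. The split and the observed numbers are invented placeholders; replace them with your own endpoint's values.

```python
BUDGET_MS = 100

# Hypothetical allocation per hop, in milliseconds.
allocations = {
    "gateway": 5,
    "feature_fetch": 20,
    "model_compute": 60,
    "response_formatting": 5,
}

# Hypothetical observed p99 per hop; None means no direct metric exists yet.
observed_p99_ms = {
    "gateway": 4,
    "feature_fetch": 35,
    "model_compute": 55,
    "response_formatting": None,
}


def slack_ms(allocs, total_ms):
    """Budget left unallocated; negative means the plan is already over."""
    return total_ms - sum(allocs.values())


def review(allocs, observed):
    """Return (hops with no metric, hops whose observed p99 exceeds budget)."""
    unmeasured = [hop for hop in allocs if observed.get(hop) is None]
    over = [hop for hop in allocs
            if observed.get(hop) is not None and observed[hop] > allocs[hop]]
    return unmeasured, over


print(slack_ms(allocations, BUDGET_MS))
print(review(allocations, observed_p99_ms))
```

In this made-up case the plan leaves 10 ms of slack, response formatting has no metric, and feature fetch is over its share.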

Bridge. The model is serving. But is the new model actually better? → 06