Inference Serving Systems

The chapters in this module, in reading order.

#	Chapter
00	Inference & Serving Engines — The Five-Year-Old Version
01	Inference bottleneck — why the obvious server disappoints instantly
02	Autoregressive decode cost — why one more token keeps getting more expensive
03	KV cache mechanics — what the prep station stores and why memory becomes the bill
04	Continuous batching — how modern engines keep the burners busy every decode step
05	Paged attention — how block-based KV memory defeats fragmentation
06	Speculative decoding — how a sous chef guesses ahead so the main chef moves in chunks
07	Tensor parallelism — how one giant layer gets sliced across many GPUs
08	Serving frameworks — vLLM, TGI, TensorRT-LLM, and Triton in practical contrast
09	ONNX Runtime optimization — graph fusion and portable serving beyond one narrow stack
10	Quantized serving — running smaller weights without pretending quality stays identical
11	Streaming token delivery — how the plating line makes latency feel shorter than the full answer time
12	Load testing and benchmarking — how to measure serving honestly instead of believing one lucky run
13	Honest admission — the open questions and messy tradeoffs in inference optimization