Inference Serving Systems

The chapters in this module, in reading order.

#   Chapter
00  Inference & Serving Engines — The Five-Year-Old Version
01  Inference bottleneck — why the obvious server disappoints instantly
02  Autoregressive decode cost — why one more token keeps getting more expensive
03  KV cache mechanics — what the prep station stores and why memory becomes the bill
04  Continuous batching — how modern engines keep the burners busy every decode step
05  Paged attention — how block-based KV memory defeats fragmentation
06  Speculative decoding — how a sous chef guesses ahead so the main chef moves in chunks
07  Tensor parallelism — how one giant layer gets sliced across many GPUs
08  Serving frameworks — vLLM, TGI, TensorRT-LLM, and Triton in practical contrast
09  ONNX Runtime optimization — graph fusion and portable serving beyond one narrow stack
10  Quantized serving — running smaller weights without pretending quality stays identical
11  Streaming token delivery — how the plating line makes latency feel shorter than the full answer time
12  Load testing and benchmarking — how to measure serving honestly instead of believing one lucky run
13  Honest admission — the open questions and messy tradeoffs in inference optimization