00. Inference & Serving Engines: The Five-Year-Old Version

You now know what LLMs are. Here we learn how the big kitchen serves them fast enough for real customers.

Imagine a giant restaurant kitchen. The kitchen is the GPU. An order ticket is one user request. The cooks do not make one huge meal at once. They plate one spoon at a time. That is how token generation feels.

Now what is the problem? If every new spoon makes the cooks reread the full order again, service becomes painfully slow. If the prep area is messy, chefs keep bumping into each other. If the kitchen waits for a perfect full tray, small orders sit around getting cold. If one huge banquet blocks the line, short snack orders suffer too.

A good serving engine is really a smart kitchen manager. It remembers chopped ingredients. It groups compatible tickets. It finds empty burners. It lets a fast sous chef prepare a likely next step. It starts the plating line early, so the customer sees movement before the whole meal is done. Simple, no?

This module is about that kitchen discipline. We will see why naive serving is slow. Then we will see how the prep station stores reusable state. Then we will see how the batch window groups many tickets. Then we will study paged memory, speculative decoding, parallelism, frameworks, optimization, streaming, and honest benchmarking.

One picture for the whole module

customer sends order ticket
           │
           ▼
┌─────────────────────────┐
│ kitchen receives prompt │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│ prep station fills up   │  saved K,V state
└────────────┬────────────┘
             │
      batch window opens
             │
   ┌─────────┴─────────┐
   ▼                   ▼
short ticket       long ticket
   │                   │
   └──────┬───────┬────┘
          ▼       ▼
      sous chef  head chef
          │       │
          └──→ plating line ──→ customer sees tokens

Keep this picture in your head. Every later file will call back to it.

The placeholders you will see called back

Placeholder	Meaning
kitchen	The GPU or inference device doing the work.
order ticket	One inference request with a prompt and generated output.
prep station	KV-cache memory holding reusable past context.
batch window	The small scheduling window where compatible requests get grouped.
sous chef	A draft model proposing tokens for speculative decoding.
plating line	The streaming path that sends partial tokens to the client.

A tiny serving example

Suppose one order ticket has a 2,000-token prompt. Suppose the answer will be 200 tokens long. A naive server keeps no cache, so at every step it re-reads the whole prompt plus everything generated so far. That means the work per step grows as the answer grows.

Step 1 looks at 2,000 prompt tokens. Step 2 looks at 2,001 tokens. Step 3 looks at 2,002 tokens. Step 200 looks at 2,199 tokens.

The total number of token positions the naive server re-reads is:

2,000 + 2,001 + 2,002 + ... + 2,199
= 200 steps × 2,099.5 positions per step on average
= 419,900 token positions touched
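
The sum above can be checked in a few lines of Python. This is only a sketch of the arithmetic, not part of any serving engine. The function name `naive_positions_touched` is made up for illustration, and it counts how many token positions a no-cache server re-reads, not real FLOPs or bytes.

```python
def naive_positions_touched(prompt_tokens: int, output_tokens: int) -> int:
    """Count token positions re-read by a server that keeps no cache.

    Step k (counting from 0) re-reads the prompt plus the k tokens
    that were already generated before it.
    """
    return sum(prompt_tokens + step for step in range(output_tokens))


total = naive_positions_touched(2_000, 200)
print(total)        # 419900
print(total / 200)  # 2099.5 positions per step on average
```

Try a 20,000-token prompt with the same 200-token answer and watch the total jump. The prompt length dominates the bill.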

So what to do? Cache old work. Pack memory better. Schedule continuously. Send output on the plating line early.
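
To see why "cache old work" comes first on that list, here is a rough comparison for the same order ticket. It rests on a simplifying assumption: once the prep station holds the saved K,V state, the prompt goes through the model once, and every later step feeds in only the single newest token. Both function names are hypothetical, and the count ignores the cost of reading the cached state, which is still real.

```python
def tokens_processed_naive(prompt_tokens: int, output_tokens: int) -> int:
    # No cache: every step runs the prompt plus all earlier output again.
    return sum(prompt_tokens + step for step in range(output_tokens))


def tokens_processed_cached(prompt_tokens: int, output_tokens: int) -> int:
    # Assumed cache behaviour: the prompt is processed once (this yields the
    # first output token), then each remaining step processes one new token.
    return prompt_tokens + (output_tokens - 1)


naive = tokens_processed_naive(2_000, 200)
cached = tokens_processed_cached(2_000, 200)
print(naive)            # 419900
print(cached)           # 2199
print(naive // cached)  # 190
```

Under these assumptions the cached kitchen pushes roughly 190 times fewer tokens through the model. The price is memory at the prep station, which later files look at closely.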

Top resources

vLLM docs — best practical explanation of continuous batching and paged attention in one place.
Hugging Face TGI docs — useful for understanding production server knobs, routers, and metrics.
NVIDIA TensorRT-LLM docs — best for kernel fusion, quantization, and multi-GPU deployment details.
ONNX Runtime GenAI docs — strong guide to portable graph execution and provider backends.
Anthropic prompt caching docs — clear product-facing explanation of cached prefixes and pricing effects.
Any good roofline-model explainer — helps you see when decode is compute-bound or memory-bound.

What's coming

01-inference-bottleneck.md — why a straightforward server wastes time and bandwidth immediately.
02-autoregressive-decode-cost.md — why token-by-token decoding becomes expensive so fast.
03-kv-cache-mechanics.md — what the cache stores, and why it saves compute but eats memory.
04-continuous-batching.md — how modern engines keep the kitchen busy without fixed batches.
05-paged-attention.md — how paged memory avoids fragmentation and dead space.
06-speculative-decoding.md — how a sous chef can guess tokens and speed up service.
07-tensor-parallelism.md — how one giant model gets split across many GPUs.
08-serving-frameworks.md — how vLLM, TGI, TensorRT-LLM, and Triton differ in practice.
09-onnx-runtime-optimization.md — how exported graphs get fused and deployed broadly.
10-quantized-serving.md — how smaller weight formats cut memory and cost.
11-streaming-token-delivery.md — how the plating line streams useful output early.
12-load-testing-benchmarking.md — how to measure throughput and latency honestly.
13-honest-admission.md — what we still do not fully know about inference optimization.

Bridge. First, watch the kitchen fail in the simplest possible way. One naive server, one set of order tickets, and far too much repeated work. → 01-inference-bottleneck.md