10. PagedAttention and serving: pack memory like a systems engineer
~12 min read. The thing that says: the model can fit, yet the allocator can still waste the site.
Built on the ELI5 in 00-eli5.md. The site constraint (GPU memory at the construction site) is easier to manage when the KV cache is stored in reusable pages instead of demanding one big contiguous lot.
1) The fragmentation problem before the clever fix
Suppose three requests arrive.
One will stop at 800 tokens.
One will stop at 2300.
One will stop at 3900.
Now imagine your system reserves one large contiguous KV block for each.
Maybe it reserves up to 4096 tokens for safety.
Then the reserved capacity is 4096 × 3 = 12,288 token slots.
Actual use is only 800 + 2300 + 3900 = 7000 token slots.
Waste is 12,288 - 7000 = 5288 token slots.
That is ugly.
The model did nothing wrong.
The allocator wasted the site.
See the picture.
contiguous reservation
┌────────────request A────────────┐ used 800, empty tail huge
┌────────────request B────────────┐ used 2300, empty tail medium
┌────────────request C────────────┐ used 3900, empty tail small
waste = empty tails you still reserved
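The arithmetic above is easy to check in a few lines. This is only a sketch: the helper name `contiguous_waste` is made up for illustration, and the 4096-slot reservation is the same assumption the example uses.

```python
def contiguous_waste(actual_lengths, reserved_per_request):
    """Return (reserved, used, wasted) token slots when every request
    gets one fixed-size contiguous reservation."""
    reserved = reserved_per_request * len(actual_lengths)
    used = sum(actual_lengths)
    return reserved, used, reserved - used


reserved, used, wasted = contiguous_waste([800, 2300, 3900], reserved_per_request=4096)
print(reserved, used, wasted)  # 12288 7000 5288
print(f"{wasted / reserved:.0%} of the reservation sits idle")  # 43% of the reservation sits idle
```

Roughly two fifths of the reserved slots never hold a token.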
So what to do? Do not demand one giant open lot per request. Use smaller reusable sections. That is the core intuition behind PagedAttention.
2) PagedAttention as virtual memory for KV cache
Look. Operating systems solved a similar problem long ago. Virtual memory breaks storage into fixed-size pages. The pages need not sit contiguously. PagedAttention applies that mindset to KV cache. A request gets logical blocks. Those blocks point to physical pages scattered across memory. So allocation becomes flexible. Fragmentation drops. Prefix sharing also becomes easier. Here is the mental model.
logical tokens for one request
[page 0]→[page 7]→[page 2]→[page 11]
physical GPU pages
┌────┐ ┌────┐ ┌────┐ ┌────┐
│p 2 │ │p 7 │ │p11 │ │p 0 │ ... anywhere
└────┘ └────┘ └────┘ └────┘
Simple, no?
The request sees a clean sequence.
The allocator sees reusable blocks.
That is why PagedAttention helps mixed workloads.
Different request lengths stop causing such absurd dead space.
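The logical-to-physical mapping can be sketched as a tiny block table. This is a toy model, not how any particular engine implements it: `PagePool`, `BlockTable`, and the 4-token page size are all hypothetical names and values chosen to keep the example small.

```python
class PagePool:
    """Hypothetical allocator that hands out fixed-size physical pages."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))

    def alloc(self):
        if not self.free_pages:
            raise MemoryError("no free KV pages")
        return self.free_pages.pop()

    def release(self, page_id):
        self.free_pages.append(page_id)


class BlockTable:
    """Per-request map from logical page index to physical page id."""

    def __init__(self, pool, page_size):
        self.pool = pool
        self.page_size = page_size
        self.pages = []  # position = logical page, value = physical page
        self.num_tokens = 0

    def append_token(self):
        # A new physical page is needed only when the last one is full.
        if self.num_tokens % self.page_size == 0:
            self.pages.append(self.pool.alloc())
        self.num_tokens += 1

    def locate(self, token_idx):
        """Return (physical_page, offset_in_page) for a logical token index."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        logical_page, offset = divmod(token_idx, self.page_size)
        return self.pages[logical_page], offset


pool = PagePool(num_pages=16)
a = BlockTable(pool, page_size=4)
b = BlockTable(pool, page_size=4)
for _ in range(6):  # two requests growing in lockstep
    a.append_token()
    b.append_token()

print(a.pages)      # [15, 13]  (physical pages are not adjacent)
print(b.pages)      # [14, 12]
print(a.locate(5))  # (13, 1)
```

Each request still sees tokens 0, 1, 2, and so on. Only the table knows that the pages are scattered.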
Now redo the earlier example with 256-token pages.
Request A uses 800 tokens.
It needs 4 pages, or 1024 slots.
Waste is 224.
Request B uses 2300 tokens.
It needs 9 pages, or 2304 slots.
Waste is 4.
Request C uses 3900 tokens.
It needs 16 pages, or 4096 slots.
Waste is 196.
Total waste is 224 + 4 + 196 = 424 token slots.
Compare that with 5288 before.
Much better use of the site constraint.
Notice the bound.
With pages, waste is limited to the last partial page per request: always less than one page, so at most 255 slots with 256-token pages.
Not the whole unused tail.
That turns a wild allocator problem into a bounded one.
Yes?
That is a systems win.
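The paged numbers can be reproduced with integer ceiling division. A minimal sketch, assuming the same three request lengths and 256-token pages; `paged_waste` is an illustrative helper, not a library function.

```python
def paged_waste(actual_lengths, page_size):
    """Return a list of (pages_needed, wasted_slots), one entry per request."""
    result = []
    for n in actual_lengths:
        pages = -(-n // page_size)  # ceiling division without floats
        result.append((pages, pages * page_size - n))
    return result


per_request = paged_waste([800, 2300, 3900], page_size=256)
print(per_request)                       # [(4, 224), (9, 4), (16, 196)]
print(sum(w for _, w in per_request))    # 424

# The bound: no request can waste a full page.
assert all(w < 256 for _, w in per_request)
```

The worst case per request is one token spilling into a fresh page, which wastes `page_size - 1` slots.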
3) The serving mental models engineers must keep straight
First mental model.
Prefill and decode are different beasts.
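Prefill processes the whole prompt in parallel.
It is mostly compute-bound: one read of the weights is amortized over many tokens.
Decode adds one token at a time.
It is mostly memory-bandwidth-bound: every step rereads the weights and the growing KV cache to produce a single token per request.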
So the same model can stress different resources in different phases.
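A back-of-envelope estimate makes the split visible. The sketch below rests on loud assumptions: weights stored in bf16 (2 bytes per parameter), about 2 FLOPs per parameter per token, and attention and KV cache traffic ignored entirely. The ridge point of 300 FLOPs per byte is a hypothetical accelerator, not a spec for any real GPU.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes each parameter is read once per pass and contributes one
    multiply-add (2 FLOPs) per token processed in that pass.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


ASSUMED_RIDGE = 300.0  # hypothetical peak FLOPs / peak bytes-per-second


def regime(tokens_per_pass):
    intensity = arithmetic_intensity(tokens_per_pass)
    return "compute-bound" if intensity >= ASSUMED_RIDGE else "memory-bound"


print(arithmetic_intensity(2048), regime(2048))  # 2048.0 compute-bound  (prefill of a 2048-token prompt)
print(arithmetic_intensity(1), regime(1))        # 1.0 memory-bound      (one decode step, one request)
```

Under these assumptions intensity simply equals the number of tokens per pass. That is also why batching many decode requests together helps: it raises tokens per pass without changing the weight traffic.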
Second mental model.
Quantized weights are not the same as quantized KV cache.
A 4-bit checkpoint may still use bf16 cache pages.
So lower weight memory does not guarantee low runtime memory.
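To see the two budgets side by side, here is a rough sizing sketch. The model shape is hypothetical (7 billion parameters, 32 layers, 32 KV heads, head dimension 128), and the weight estimate ignores quantization overhead such as scales and zero points.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bits_per_weight):
    """Approximate weight storage, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """One K vector and one V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical 7B-parameter model with a bf16 (2-byte) KV cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2)
cache_4096 = per_token * 4096

print(per_token)                                       # 524288 bytes, i.e. 0.5 MiB per token
print(f"{weight_bytes(7e9, 16) / GIB:.2f} GiB")        # 13.04 GiB  (bf16 weights)
print(f"{weight_bytes(7e9, 4) / GIB:.2f} GiB")         # 3.26 GiB   (4-bit weights)
print(f"{cache_4096 / GIB:.2f} GiB")                   # 2.00 GiB   (one 4096-token request)
```

Quantizing the weights shrinks the first number only. In this sketch, two long requests already hold more cache than the 4-bit weights themselves.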
Third mental model.
Concurrency is multiplication.
One request that uses 2.5 GiB of cache is manageable.
Sixteen such requests need 40 GiB.
Traffic multiplies the bill.
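The multiplication is worth writing down, because it also gives the ceiling on concurrency. A small sketch: the 2.5 GiB per request comes from the text above, while the 60 GiB cache budget is an invented figure standing in for whatever is left after weights and activations.

```python
def total_cache_gib(per_request_gib, num_requests):
    """KV cache needed if every live request holds the same amount."""
    return per_request_gib * num_requests


def max_concurrent(cache_budget_gib, per_request_gib):
    """How many such requests fit in the cache budget at once."""
    return int(cache_budget_gib // per_request_gib)


print(total_cache_gib(2.5, 16))   # 40.0
print(max_concurrent(60, 2.5))    # 24  (hypothetical 60 GiB cache budget)
```

Any slot wasted by the allocator comes straight out of that budget, so fragmentation directly lowers the number of requests you can keep live.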
Fourth mental model.
The winning deployment is not always the smartest model.
Sometimes the best product system is the model that packs well, schedules well, and hits latency targets.
That is the boring truth.
And boring truth pays the cloud bill.
Fifth mental model.
Latency SLOs and throughput targets fight each other.
If you chase only single-request speed, GPU utilization can stay poor.
If you chase only packing, tail latency can blow up.
Serving is balance, not slogans.
4) Why PagedAttention improves throughput, not just neatness
Less fragmentation means more live requests fit. More live requests means better scheduler freedom. Better scheduler freedom usually means higher throughput.
Prefix sharing can also avoid repeated work when many requests start the same way. That matters for chat templates, system prompts, and common prefixes.
So PagedAttention is not cosmetic. It changes utilization. It changes tail latency. It changes cost per token served.
And it pairs naturally with continuous batching. New requests can enter without demanding fresh giant contiguous blocks. That makes serving feel much less fragile.
So yes, the site constraint is still the boss. PagedAttention just organizes the site like a good foreman. Neat sections. Reusable space. Less waste. More output per GPU.
Good serving is choreography. Not only raw model IQ. Not only raw FLOPs. Also allocator discipline. Also scheduler discipline. That is where deployments quietly win.
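Prefix sharing falls out of paging almost for free once pages carry a reference count. The sketch below is a toy, with hypothetical names (`RefCountedPool`, `pages_for`) and an assumed 1024-token system prompt that fills exactly four 256-token pages. It shares whole pages only; real engines need extra handling, for example copy-on-write, when a shared page has to be modified.

```python
class RefCountedPool:
    """Hypothetical page pool where several requests can hold the same page."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.refcount = {}

    def alloc(self):
        page = self.free_pages.pop()
        self.refcount[page] = 1
        return page

    def share(self, page):
        self.refcount[page] += 1
        return page

    def release(self, page):
        self.refcount[page] -= 1
        if self.refcount[page] == 0:  # last holder gone: page is reusable
            del self.refcount[page]
            self.free_pages.append(page)

    def used(self):
        return len(self.refcount)


def pages_for(num_tokens, page_size):
    return -(-num_tokens // page_size)  # ceiling division


PAGE = 256
pool = RefCountedPool(num_pages=64)

# The shared system prompt is stored once and kept alive by a prefix cache.
prefix = [pool.alloc() for _ in range(pages_for(1024, PAGE))]

requests = []
for private_tokens in (300, 700, 150):
    table = [pool.share(p) for p in prefix]  # reuse, no new memory
    table += [pool.alloc() for _ in range(pages_for(private_tokens, PAGE))]
    requests.append(table)

print(pool.used())  # 10 pages, versus 18 if each request copied the prefix

for page in requests[0]:  # first request finishes
    pool.release(page)
print(pool.used())               # 8
print(pool.refcount[prefix[0]])  # 3 (prefix cache plus two live requests)
```

The freed pages go straight back to the pool, which is what lets a continuous-batching scheduler admit the next request without waiting for a large contiguous hole.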
Where this lives in the wild
- vLLM PagedAttention — the flagship serving design that uses paged KV blocks for better utilization.
- NVIDIA TensorRT-LLM paged KV cache — brings page-style cache management into high-performance GPU inference stacks.
- SGLang prefix caching — benefits from block-oriented KV management when prompts share long prefixes.
- LMDeploy block-based cache manager — improves memory packing for mixed request lengths during serving.
- Hugging Face Text Generation Inference — serving teams reason about prefill, decode, and cache pressure even when weights are quantized.
Pause and recall
- Why do contiguous KV allocations waste memory under variable request lengths?
- In the worked example, by how much did waste drop after moving to 256-token pages?
- Why is decode more memory-bound than prefill?
- Why do quantized weights not guarantee cheap serving by themselves?
Interview Q&A
Q1. Why PagedAttention, not one large contiguous KV reservation per request?
Because fixed pages slash fragmentation and let the allocator reuse scattered memory efficiently across mixed-length traffic.
Common wrong answer to avoid: "Contiguous allocation is always better because GPUs dislike indirection in every case."
Q2. Why decode, not prefill, as the main memory bottleneck?
Because prefill is usually dominated by prompt-side computation, while decode repeatedly reads and extends cached history.
Common wrong answer to avoid: "Prefill and decode stress the exact same resources in the exact same way."
Q3. Why prefix sharing, not only raw page reuse?
Because shared prefixes also avoid duplicated cache storage and repeated prompt work for common leading tokens.
Common wrong answer to avoid: "Prefix caching helps only tokenizer speed."
Q4. Why a smarter allocator, not a bigger model by default?
Because deployment wins come from usable throughput and stable latency, not only from benchmark intelligence.
Common wrong answer to avoid: "The best model on paper is automatically the best serving choice."
Apply now (5 min)
Quick exercise.
Assume four requests reserve 2048 tokens each contiguously.
Actual usage is 600, 900, 1300, and 1700.
Compute total waste.
Then repeat with 256-token pages.
Estimate the new waste.
Sketch from memory.
Draw one row for logical pages and one row for physical pages.
Mark one shared prefix.
Then write four short labels: prefill, decode, cache, concurrency.
Under them, note which one is compute-heavy and which one is memory-heavy.
Add one final note: fragmentation is waste, not intelligence.
Bridge. Good. The model now fits and serves much better. Next we ask a different question: how do we teach new behavior without retraining the whole blueprint warehouse? → 11-lora.md