05. Paged attention — how block-based KV memory defeats fragmentation¶

~17 min read. This is the memory layout idea that made high-throughput serving practical.

Built on the ELI5 in 00-eli5.md. The prep station cannot be one giant fixed countertop, because live order tickets grow and shrink unpredictably inside the moving batch window.

1) The picture: contiguous slots waste space¶

Suppose each request gets one contiguous KV region.

That sounds neat.

It fails under real traffic.

Some tickets are short.

Some explode into long chats.

Some finish early and leave holes in the middle.

GPU KV memory
┌──────┬────────────┬───┬──────────┬────┐
│ reqA │   hole     │B  │   reqC   │hole│
└──────┴────────────┴───┴──────────┴────┘

See. Total free memory may be large, but not in one clean chunk. A long new request may still fail to fit. That is fragmentation. The kitchen has empty table area, but it is chopped into useless shapes.

2) PagedAttention turns one big region into small blocks¶

So what to do? Store each request’s cache in fixed-size blocks. Then map logical token positions to physical blocks. This feels like virtual memory.

logical tokens for reqX:   [0..15][16..31][32..47][48..63]
                                │      │      │      │
physical blocks on GPU:        [17]   [04]   [29]   [11]

The request thinks it has a continuous sequence. The engine knows the blocks are scattered. Attention uses a page table to gather the right K and V blocks. The prep station becomes modular. Short tickets use few blocks. Long tickets append new blocks as needed. Freed blocks can be reused anywhere. Simple, no?

3) Worked example: fragmentation savings¶

Suppose a naive engine reserves room for 2,048 tokens per request. Suppose average live request length is only 600 tokens. Waste per request is:

2,048 - 600 = 1,448 tokens
waste fraction = 1,448 / 2,048 ≈ 70.7%

Now use 16-token blocks. A 600-token request needs:

ceil(600 / 16) = 38 blocks
physical capacity = 38 × 16 = 608 tokens
waste = 8 tokens
waste fraction = 8 / 608 ≈ 1.3%

Look at the jump. From about 71% waste to about 1% tail waste. That is why paged memory matters so much.

4) How attention still works over scattered blocks¶

Now what is the problem? If blocks are scattered, how can attention read them as one sequence? The answer is an indirection layer. The engine gathers block pointers in sequence order, then launches kernels that walk those pointers.

query q_t
   │
   ├── page table says: block 17, then 04, then 29, then 11
   └── kernel gathers K,V in logical order for attention

This adds lookup complexity. But it avoids massive fragmentation. The extra metadata is tiny compared with saved KV space. That trade is worth it in production.

5) What paged attention solves, and what it does not¶

Paged attention is a memory-management win. It does not change the autoregressive contract. It does not remove KV-cache cost. It does not fix serial generation by itself.

What it does do is let the kitchen host many live tickets without wasting most of the prep station. That made continuous batching much more practical. But one single long decode stream is still serial. So what to do next? Try to predict a few future spoons with a smaller sous chef, then verify them quickly with the main chef. That is speculative decoding.

Where this lives in the wild¶

vLLM — PagedAttention is the signature design that lets mixed-length requests reuse GPU memory blocks efficiently.
Anyscale model endpoints with vLLM — tenant mixes with very different context lengths benefit from lower fragmentation.
Open-source chat hosting on RunPod or Modal — paged KV memory improves concurrency on limited GPU VRAM.
Perplexity-style long-answer clusters — variable follow-up lengths make block-based cache management much safer than fixed reservations.
Character.AI conversation servers — thousands of uneven chat sessions create exactly the holey-memory pattern paged attention targets.

Pause and recall¶

Why can a GPU show plenty of free KV memory and still fail a large request?
What does the page table map in paged attention?
In the worked example, how much waste remained after switching to 16-token blocks?
What important problem does paged attention not solve by itself?

Interview Q&A¶

Q: Why use paged KV storage instead of one contiguous reservation per request?

A: Because request lengths vary wildly in production. Fixed reservations waste memory and fragment badly as requests arrive, grow, and finish.

Common wrong answer to avoid: "Because blocks are mathematically faster to attend over." The main win is memory efficiency, not simpler attention math.

Q: Why does a page table not ruin all the gains with lookup overhead?

A: Because the metadata is small compared with the KV memory saved, and kernels can gather blocks efficiently. The fragmentation savings dominate.

Common wrong answer to avoid: "Indirection always makes it slower overall." Small pointer overhead can be far cheaper than huge wasted memory.

Q: Why does paged attention pair naturally with continuous batching?

A: Because dynamic scheduling constantly adds and removes live requests. Block-based storage handles that churn far better than contiguous layouts.

Common wrong answer to avoid: "They are unrelated ideas." Scheduler churn is exactly what creates fragmentation pressure.

Q: Why is paged attention not the final speed answer for one request?

A: Because the stream is still autoregressive and serial. Memory layout got better, but each next token still depends on previous ones.

Common wrong answer to avoid: "Paged attention makes decoding parallel." It mainly fixes memory utilization.

Apply now (5 min)¶

Assume each request naively reserves 4,096 tokens but averages only 900 tokens. Compute the wasted fraction. Then repeat with 32-token blocks and ceiling division. Compare the leftover tail waste. Sketch from memory:

the holey contiguous memory bar,
the logical-to-physical block mapping,
and the waste calculation.

Bridge. We packed memory better, but token generation is still serial. Next we use a sous chef: a smaller model drafts several likely tokens so the main model can verify chunks instead of stepping one token at a time. → 06-speculative-decoding.md