03. KV cache mechanics: what the prep station stores and why memory becomes the bill
~17 min read. This is the biggest practical trade: less recompute, more memory pressure.
Built on the ELI5 in 00-eli5.md. The prep station — reusable state beside the stove — saves chopping work, but every saved tray occupies precious space in the kitchen.
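1) Picture first: what is actually cached
At each transformer layer, attention creates queries, keys, and values. During decode, old queries are not reused: a past query was needed only to compute its own token's output. Old keys and values are reused, because every new query must attend over all past positions. So we cache K and V for past positions.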
new token x_t
│
├──→ Q_t use once now
├──→ K_t save to cache
└──→ V_t save to cache
past cache = [K_1 ... K_{t-1}], [V_1 ... V_{t-1}]
See. The prep station is not a vague memory blob. It is a stack of past keys and values, layer by layer. When token t arrives, we append K_t and V_t. Then attention reads the whole saved stack. That is the core mechanic.
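To make the append-then-read mechanic concrete, here is a minimal sketch for one layer and one head, in plain Python. The names `attend` and `LayerKVCache` are hypothetical, and the vectors are toy values rather than real projections; a real engine stores tensors per layer and per head on the GPU.

```python
import math

def attend(q, keys, values):
    """Single-head attention of one query over every cached key/value pair."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

class LayerKVCache:
    """Toy cache for one layer and one head: growing stacks of K and V."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q_t, k_t, v_t):
        # Save this token's K and V, then let the fresh query read the stack.
        self.keys.append(k_t)
        self.values.append(v_t)
        # q_t is used once here and never stored.
        return attend(q_t, self.keys, self.values)

cache = LayerKVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [10.0, 0.0])
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [0.0, 10.0])
print(len(cache.keys), len(cache.values))  # 2 2
```

Notice that the cache grows by exactly one K and one V per step, while nothing about the query survives the step.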
2) Per-token memory math with a worked example
Now count the trays. A simple KV-cache formula is:
- bytes per token = 2 × layers × kv_heads × head_dim × bytes_per_value
Why the leading 2? One copy for K. One copy for V. Worked example. Suppose:
- layers = 32
- kv_heads = 8
- head_dim = 128
- precision = fp16 = 2 bytes
Then per-token cache is:
- 2 × 32 × 8 × 128 × 2
- = 131,072 bytes
- ≈ 128 KB per token
For a 4,000-token context:
- 128 KB × 4,000
- = 512,000 KB
- ≈ 500 MB
One order ticket can occupy half a gigabyte. Simple, no?
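The same tray count can be written as a few lines of Python. This is only a sketch of the formula above; the function name `kv_bytes_per_token` is hypothetical, and the numbers are the worked example's, not measurements from a real model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV-cache bytes added by one token: one K and one V per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
per_request = per_token * 4000  # one 4,000-token context

print(per_token)               # 131072
print(per_token / 1024)        # 128.0 (KB, counting 1 KB = 1024 bytes)
print(per_request / 1024**2)   # 500.0 (MB, counting 1 MB = 1024 KB)
```

Swap in the layer count, KV head count, head dimension, and precision of any model you know to get its per-token bill.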
3) Per-layer budget helps you think clearly
Sometimes total numbers feel abstract. Break them per layer. Using the same example, per-layer per-token cache is:
- 2 × 8 × 128 × 2
- = 4,096 bytes
- = 4 KB
For 4,000 tokens, one layer uses:
- 4 KB × 4,000 = 16,000 KB ≈ 15.6 MB
Across 32 layers:
- 15.6 MB × 32 ≈ 500 MB, matching the total from section 2
This per-layer view matters operationally. It helps you see why a model with more layers, more KV heads, or longer context grows the prep station so fast: each of those factors multiplies the total directly. The kitchen often runs out of memory shelves before it runs out of compute.
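A short sketch shows how the per-layer number scales. The function names are hypothetical, and the "more heads" and "longer context" variants are illustrative changes to the worked example, not real model configurations.

```python
def kv_bytes_per_layer_per_token(kv_heads, head_dim, bytes_per_value):
    """K plus V for one token in a single layer."""
    return 2 * kv_heads * head_dim * bytes_per_value

def kv_bytes_total(layers, kv_heads, head_dim, bytes_per_value, tokens):
    """Total cache for one request: per-layer cost times layers times tokens."""
    per_layer = kv_bytes_per_layer_per_token(kv_heads, head_dim, bytes_per_value)
    return layers * tokens * per_layer

per_layer = kv_bytes_per_layer_per_token(8, 128, 2)
base = kv_bytes_total(32, 8, 128, 2, tokens=4000)
more_heads = kv_bytes_total(32, 32, 128, 2, tokens=4000)   # 4x the KV heads
longer = kv_bytes_total(32, 8, 128, 2, tokens=16000)       # 4x the context

print(per_layer)           # 4096
print(more_heads // base)  # 4
print(longer // base)      # 4
```

Every factor is linear, so quadrupling any one of them quadruples the prep station.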
4) Concurrency multiplies the cache bill brutally
Now what is the problem? Production kitchens serve many tickets together. If one request uses 500 MB of prep station memory, then 20 live requests use about 10 GB. That is before counting model weights, activations, fragmentation, and allocator overhead.
request A cache ─┐
request B cache ─┼──→ total prep station load on one GPU
request C cache ─┤
request D cache ─┘
Worked example. Suppose weights take 14 GB on a 24 GB GPU. Available room for cache and runtime is roughly 10 GB. If each long chat uses 500 MB, you fit about 20 such chats only in the ideal case. Real systems fit fewer. This is why long-context concurrency feels expensive.
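The capacity estimate above can be sketched as a small calculation. The function name and the `reserve_percent` parameter are hypothetical; the reserve stands in for activations, fragmentation, and allocator overhead, and the 20 percent used below is an assumed figure, not a measured one.

```python
GB = 1000**3  # decimal units, matching the round numbers in the text
MB = 1000**2

def max_concurrent_requests(gpu_bytes, weight_bytes, per_request_bytes,
                            reserve_percent=0):
    """Whole requests that fit in the memory left after weights and a reserve."""
    free = gpu_bytes - weight_bytes
    usable = free * (100 - reserve_percent) // 100
    return max(0, usable // per_request_bytes)

ideal = max_concurrent_requests(24 * GB, 14 * GB, 500 * MB)
with_reserve = max_concurrent_requests(24 * GB, 14 * GB, 500 * MB,
                                       reserve_percent=20)

print(ideal)         # 20
print(with_reserve)  # 16
```

Even a modest reserve removes several tickets from the ideal count, which is why real systems fit fewer than the back-of-envelope number.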
5) The lifecycle: build, append, evict, and sometimes reuse
A good serving engine manages the prep station actively. During prefill, it builds the initial cache for the whole prompt. During decode, it appends one token at a time. When a request finishes, or must be evicted under memory pressure, the engine frees that request's cache. Some systems also reuse prompt prefixes across similar tickets.
The cache saves recompute. It does not come free. It complicates batching. It complicates memory allocation. It complicates eviction. That is why naive contiguous allocation breaks down quickly. If tickets have different lengths and finish at different times, the prep station becomes fragmented. So what to do? Serve many live requests, but store their cache in smaller movable blocks. That is the path to continuous batching and paged attention.
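Here is a minimal sketch of the block idea, assuming a fixed pool of equal-sized blocks and a per-request table of block ids. The class `BlockPool` and its methods are hypothetical and only track bookkeeping; they are not the API of any real serving engine, and no tensors are stored.

```python
class BlockPool:
    """Toy KV block allocator: requests own lists of fixed-size block ids."""

    def __init__(self, num_blocks, block_tokens):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))
        self.tables = {}   # request id -> block ids, in logical order
        self.lengths = {}  # request id -> tokens stored so far

    def append_tokens(self, request_id, n):
        """Reserve room for n more tokens (prefill: many, decode: one)."""
        table = self.tables.get(request_id, [])
        new_length = self.lengths.get(request_id, 0) + n
        blocks_needed = -(-new_length // self.block_tokens)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.tables[request_id] = table
        self.lengths[request_id] = new_length

    def release(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

pool = BlockPool(num_blocks=8, block_tokens=16)
pool.append_tokens("A", 40)      # prefill: 3 blocks
pool.append_tokens("B", 20)      # prefill: 2 blocks
for _ in range(9):               # decode: A grows to 49 tokens, 4 blocks
    pool.append_tokens("A", 1)
pool.release("B")                # B finishes and frees its 2 blocks
pool.append_tokens("C", 60)      # C fits in 4 scattered blocks
print(len(pool.tables["A"]), len(pool.tables["C"]), len(pool.free_blocks))  # 4 4 0
```

Request C lands in whatever blocks happen to be free, including the ones B just returned. With contiguous allocation, the same 60 tokens would need one unbroken region, which a fragmented prep station may not have.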
Where this lives in the wild
- ChatGPT-style long conversations: every active chat holds KV state so the backend can continue from the existing context instantly.
- GitHub Copilot chat with large code context: big repositories inflate per-request cache size even when output is short.
- vLLM-hosted open models on Anyscale endpoints: cache accounting determines how many concurrent tenants fit per GPU.
- Character.AI multi-turn rooms: many simultaneous sessions turn KV memory into the primary capacity limit.
- Perplexity follow-up questions: retained context creates faster continuation but consumes prep-station budget across sessions.
Pause and recall
- Why are keys and values cached, but not old queries?
- In the worked example, how much KV memory did one 4,000-token request consume?
- Why is the per-layer view helpful for reasoning about memory?
- What new operational problem appears after KV cache removes recompute?
Interview Q&A
Q: Why cache K and V, not the full attention output of every past step? A: Because future attention needs past keys and values to compare against each new query. Full old outputs are not the reusable primitive for later steps. Common wrong answer to avoid: "Because K and V are smaller." The main reason is structural reuse in future attention.
Q: Why can KV cache improve latency while making throughput planning harder? A: Because it removes repeated compute but creates a large per-request memory footprint. Capacity now depends heavily on context length and concurrency mix. Common wrong answer to avoid: "Caching only helps, never hurts." Memory pressure becomes the next bottleneck.
Q: Why think in per-token cache bytes instead of only total model size? A: Because concurrency depends on how much extra memory each live request adds. Model weights are mostly fixed. KV cache grows with user traffic. Common wrong answer to avoid: "If the model fits, serving is easy." Request state can consume the rest of the GPU quickly.
Q: Why does long context reduce concurrency even when answers are short? A: Because prefill builds the cache for the entire prompt, so the prep station is already loaded before generation begins. Short output does not undo the large cache created by the prompt prefix. Common wrong answer to avoid: "Only output length determines cost during serving." Prompt-built cache can dominate capacity.
Apply now (5 min)
Take one model you know and estimate its KV-cache bytes per token. Use layers, KV heads, head dimension, and precision. Then multiply by 2,000 and 8,000 tokens. Notice how quickly the prep station fills.
Sketch from memory:
- the Q, K, V split,
- the per-token formula,
- and the concurrency multiplication idea.
Bridge. A full prep station is useful only if the kitchen keeps reusing it efficiently. Next we study continuous batching, where the batch window reopens every decode step instead of freezing traffic into rigid groups. → 04-continuous-batching.md