08. KV cache memory: the bill that grows with traffic
~12 min read. The thing that says: fitting the weights does not mean the request will fit.
Built on the ELI5 in 00-eli5.md. The site constraint (GPU memory at the construction site) still bites even after the field notes are compressed, because runtime scaffolding keeps growing during generation.
1) Why weights are only half the serving story
Engineers quantize weights. The model finally loads. Everyone feels victorious. Then someone raises context length. Or concurrency. Or both. And the GPU says no. Why? Because autoregressive generation stores past keys and values. That storage is the KV cache. It grows as the conversation grows. So the cost has two personalities. Weights are fixed cost. KV cache is traffic-shaped cost. That sentence should stay in your head. The compressed field notes help the fixed part. They do not remove the runtime growth part. So the true memory picture is:
┌──────────────────────┐
│ fixed weights        │  mostly steady after model load
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ KV cache per request │  grows with tokens and concurrency
└──────────┬───────────┘
           │
           ▼
   total GPU pressure
Look. This is why a model can fit at startup and still OOM later. The startup check only tested weights. Production traffic tests the whole site. That is the site constraint in action.
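Here is a minimal sketch of that picture. It assumes a simplified model where GPU pressure is just weights plus per-request cache; activations, fragmentation, and framework overhead are ignored. The function name and every number below are hypothetical, picked only to show a startup check passing while traffic fails.

```python
GIB = 1024 ** 3

def total_gpu_bytes(weight_bytes, kv_bytes_per_request, concurrent_requests):
    """Fixed weights plus a KV cache that scales with live requests."""
    return weight_bytes + kv_bytes_per_request * concurrent_requests

# Hypothetical numbers, for illustration only.
gpu_bytes = 80 * GIB             # assumed GPU memory
weight_bytes = 35 * GIB          # assumed size of the quantized weights
kv_per_request = 5 * GIB // 2    # assumed 2.5 GiB of cache per live request

# Zero requests is the startup check; the rest is production traffic.
for requests in (0, 8, 16, 32):
    used = total_gpu_bytes(weight_bytes, kv_per_request, requests)
    status = "fits" if used <= gpu_bytes else "OOM"
    print(f"{requests:>2} requests: {used / GIB:6.1f} GiB -> {status}")
```

With these assumed numbers the load-time check passes and the site only fills up once enough requests are alive together.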
2) The formula, one factor at a time
Use this serving formula.
KV bytes = seq_len × batch × layers × kv_heads × head_dim × 2 × bytes
Some teams write batch.
Some say concurrency.
Same practical idea here.
Now read each term slowly.
seq_len is active context length.
batch is how many requests are alive together.
layers matters because every layer keeps its own cache.
kv_heads matters because each KV head stores history.
head_dim is width per head.
The factor 2 is for keys and values.
bytes depends on cache precision.
bf16 and fp16 use 2 bytes.
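The formula can be written as a small function. This is a sketch, not any serving framework's API: the name `kv_cache_bytes` and the parameter `bytes_per_elem` (the formula's `bytes` term) are made up for illustration, and the tiny shape in the usage line is arbitrary.

```python
def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in bytes, following the serving formula term by term."""
    keys_and_values = 2  # every cached token stores one key and one value
    return (seq_len * batch * layers * kv_heads * head_dim
            * keys_and_values * bytes_per_elem)

# Arbitrary tiny shape: one token, one request, 2 layers, 4 KV heads, width 8.
print(kv_cache_bytes(seq_len=1, batch=1, layers=2, kv_heads=4, head_dim=8))  # 256
```

Keeping each factor as a named argument makes it easy to see which ones belong to the model and which ones belong to traffic.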
So what grows when traffic grows?
seq_len grows.
batch grows.
Sometimes both grow together.
That is why long-context high-concurrency serving becomes brutal.
The model weights stay calm.
The runtime memory does not.
See the growth shape.
context tokens ──→  2K          4K                  8K
cache per request   │──         │────               │────────
concurrency 8       │────       │────────           │────────────────
concurrency 16      │────────   │────────────────   │────────────────────────
Simple, no?
More traffic means more scaffolding on the same site.
One shortcut for interviews.
Double seq_len.
You double cache.
Double batch.
You double cache again.
That is why product decisions around context and concurrency are infra decisions too.
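The doubling shortcut is easy to confirm numerically. The sketch below uses a hypothetical model shape, since only the ratios matter, and prints cache size relative to a single request at 2K tokens. The helper names are illustrative.

```python
def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return seq_len * batch * layers * kv_heads * head_dim * 2 * bytes_per_elem

# Hypothetical model shape; it cancels out of every ratio below.
shape = dict(layers=16, kv_heads=4, head_dim=64)
base = kv_cache_bytes(seq_len=2048, batch=1, **shape)

def growth(seq_len, batch):
    """Cache size relative to one request at 2K tokens."""
    return kv_cache_bytes(seq_len=seq_len, batch=batch, **shape) / base

for batch in (1, 8, 16):
    row = [growth(seq_len, batch) for seq_len in (2048, 4096, 8192)]
    print(f"batch {batch:>2}: {row}")
```

Each step right doubles the number, and each doubling of batch doubles it again, because both terms enter the formula as plain multipliers.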
3) Worked example: 70B-style cache math
Take a 70B-style serving setup.
Use these values.
layers = 80
kv_heads = 8
head_dim = 128
seq_len = 8192
batch = 1
bytes = 2 for bf16
Now multiply.
KV bytes = 8192 × 1 × 80 × 8 × 128 × 2 × 2
8192 × 80 = 655,360
655,360 × 8 = 5,242,880
5,242,880 × 128 = 671,088,640
671,088,640 × 2 × 2 = 2,684,354,560 bytes
That is about 2.5 GiB per request.
Now increase concurrency.
At 8 concurrent requests:
2.5 GiB × 8 ≈ 20 GiB
At 16 concurrent requests:
2.5 GiB × 16 ≈ 40 GiB
See how rude that is.
Nothing changed in the weights.
Only traffic changed.
And the site filled up.
That is why people say, "Weights fit, but serving still does not."
They are talking about the site constraint plus KV growth.
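The same arithmetic can be reproduced in a few lines. The values are the ones from the worked example; the function name and `bytes_per_elem` are illustrative stand-ins for the formula's terms.

```python
GIB = 1024 ** 3

def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return seq_len * batch * layers * kv_heads * head_dim * 2 * bytes_per_elem

# Model shape from the worked example, bf16 cache.
model = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

per_request = kv_cache_bytes(seq_len=8192, batch=1, **model)
print(per_request)         # 2684354560
print(per_request / GIB)   # 2.5

for batch in (8, 16):
    total = kv_cache_bytes(seq_len=8192, batch=batch, **model)
    print(batch, total / GIB)   # 8 -> 20.0, 16 -> 40.0
```

Only `batch` changes between the three results, which is the whole point of the example.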
4) What teams do after they learn this lesson
First, they stop treating quantized weights as the whole victory. Second, they model traffic. Third, they reduce one factor in the formula, or the waste around it. How?
- Use GQA or MQA to reduce kv_heads.
- Use PagedAttention to reduce fragmentation around cache allocation.
- Use sliding windows to cap active seq_len.
- Use KV-cache quantization when quality allows it.
- Limit concurrency when SLOs or GPU size demand it.
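Most of these levers map to a single term in the formula, so their effect can be sketched one at a time. The baseline below is hypothetical (a 64-KV-head, bf16, 16-request setup), and the 8-bit cache line ignores the small overhead of quantization scales. PagedAttention is left out on purpose: it targets allocation waste, which this formula does not model.

```python
GIB = 1024 ** 3

def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return seq_len * batch * layers * kv_heads * head_dim * 2 * bytes_per_elem

# Hypothetical baseline: full multi-head attention with 64 KV heads, bf16 cache.
baseline = dict(seq_len=8192, batch=16, layers=80,
                kv_heads=64, head_dim=128, bytes_per_elem=2)

# Each lever overrides exactly one term of the baseline.
levers = {
    "baseline": {},
    "GQA (kv_heads 64 -> 8)": {"kv_heads": 8},
    "sliding window (seq_len capped at 4096)": {"seq_len": min(baseline["seq_len"], 4096)},
    "8-bit KV cache (scale overhead ignored)": {"bytes_per_elem": 1},
    "concurrency limit (batch 16 -> 8)": {"batch": 8},
}

results = {name: kv_cache_bytes(**{**baseline, **change})
           for name, change in levers.items()}

for name, size in results.items():
    print(f"{name:<42} {size / GIB:7.1f} GiB")
```

Under these assumptions the head-count change is the largest single cut, simply because it changes its term by 8x while the other levers each halve theirs.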
Notice the mindset.
Nobody is chanting magic words.
They are changing the memory equation.
That is real engineering.
Also remember one trap.
Quantized weights do not guarantee quantized KV cache.
Many deployments store cache in fp16 or bf16.
So your tiny weight file can sit beside a very expensive runtime cache.
That is why the field notes are not enough.
You must also manage the scaffolding.
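A rough comparison makes the trap concrete. The sketch assumes 70e9 parameters stored at 4 bits each (ignoring quantization scales and zero points), a bf16 cache, and 16 requests at 8K tokens with the 70B-style shape used earlier. The helper names are hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate weight storage; ignores quantization scales and zero points."""
    return num_params * bits_per_weight // 8

def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return seq_len * batch * layers * kv_heads * head_dim * 2 * bytes_per_elem

# Assumptions: 70e9 parameters at 4 bits, cache kept in bf16 (2 bytes).
weights = weight_bytes(num_params=70_000_000_000, bits_per_weight=4)
cache = kv_cache_bytes(seq_len=8192, batch=16, layers=80,
                       kv_heads=8, head_dim=128, bytes_per_elem=2)

print(f"weights:  {weights / 1e9:.1f} GB")   # 35.0 GB
print(f"kv cache: {cache / 1e9:.1f} GB")     # 42.9 GB
```

With these numbers the runtime cache is already larger than the compressed weights, even though nothing about the model file changed.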
And yes, this is why smart model selection matters.
A slightly smaller or more cache-efficient model can win the product.
Not because it is the smartest on paper.
Because it survives real traffic.
Where this lives in the wild
- Anthropic Claude API — long context plus many simultaneous chats makes KV budgeting a first-class serving concern.
- OpenAI ChatGPT-style serving — every active conversation grows runtime cache even when weights are fixed.
- NVIDIA TensorRT-LLM — exposes KV-cache controls because serving pressure often shifts from weights to runtime memory.
- vLLM scheduler — packs requests while tracking growing KV blocks across mixed context lengths.
- Mistral API long-context deployment — benefits from cache-aware architecture choices when concurrency rises.
Pause and recall
- Why can a quantized model still OOM during generation?
- Which terms in the formula scale with traffic directly?
- Why does the factor 2 appear in KV-cache memory?
- What does "weights are fixed cost; KV cache is traffic-shaped cost" mean?
Interview Q&A
Q1. Why is lower weight precision not enough for long-context serving?
Because weights are only the fixed bill; the KV cache still scales with sequence length, layers, KV heads, and concurrency.
Common wrong answer to avoid: "Once weights fit, runtime memory problems are basically solved."

Q2. Why does concurrency matter in cache planning, not only context length?
Because each active request keeps its own keys and values, so concurrent traffic multiplies the cache bill.
Common wrong answer to avoid: "The same cache can be fully reused across unrelated requests."

Q3. Why use GQA or MQA instead of just shrinking batch size forever?
Because architecture-level KV reduction preserves more throughput than permanently throttling product traffic.
Common wrong answer to avoid: "Serving teams should simply process one request at a time."

Q4. Why do quantized weights not imply a quantized KV cache by default?
Because many stacks quantize weights for loading efficiency but still keep runtime KV tensors in fp16 or bf16 for quality and kernel support.
Common wrong answer to avoid: "If the model file is 4-bit, everything in memory becomes 4-bit automatically."
Apply now (5 min)
Quick exercise.
Use layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=4, bytes=2.
Compute the KV cache in bytes.
Then convert roughly to GiB.
After that, double only concurrency.
State which term changed and by how much total memory changed.
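Once you have worked it by hand, a small checker can confirm the multiplication without giving the answer away. This is a sketch; `check` and `my_answer` are hypothetical names, and the placeholder value is meant to be replaced with your own result.

```python
def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return seq_len * batch * layers * kv_heads * head_dim * 2 * bytes_per_elem

def check(my_bytes, **config):
    """Compare a hand-computed byte count against the formula."""
    return my_bytes == kv_cache_bytes(**config)

# Values from the exercise; `doubled` changes only the concurrency term.
exercise = dict(seq_len=4096, batch=4, layers=32,
                kv_heads=8, head_dim=128, bytes_per_elem=2)
doubled = dict(exercise, batch=8)

my_answer = 0  # placeholder: replace with your hand-computed bytes
print("match" if check(my_answer, **exercise) else "recheck your multiplication")
```

Run `check` again with `**doubled` and your second answer to verify the doubled-concurrency case.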
Sketch from memory.
Write the full KV formula.
Box the traffic terms.
Underline the fixed term layers.
Then draw one note saying: "Compressed field notes do not save the site if scaffolding keeps piling up."
Bridge. Good. The cache is too large. So next we attack one major factor directly: what if many query heads share fewer keys and values? → 09-mqa-gqa.md