11. KV cache memory — speed has a storage bill¶

~8 min read. Cache cuts recomputation, but every saved key and value costs bytes.

Built on the ELI5 in 00-eli5.md. The memory shortcut reuses past keys and values across decode steps. The answer sheet length decides how long that stored history becomes.

1) Why cache memory becomes the serving constraint¶

KV cache feels cheap at first.

Each step stores only one new slice.

But that slice appears in every layer.

It appears for keys and values.

It appears for every active sequence.

So memory keeps growing with context length.

That growth can cap batch size.

That growth can cap maximum context.

That growth can decide whether a model fits at all.

The speed trick is real.

The storage bill is real too.

The exam rule stays unchanged.

The projection lens still makes fresh Q, K, and V.

The parallel graders still split channels by head.

The memory shortcut is the new budget item.

2) The memory formula, one factor at a time¶

The full cache formula is short.

memory = 2 × layers × B × H × T × d_head × bytes_per_element

The factor 2 counts keys and values.

layers matters because each layer keeps its own cache.

B matters because each active sequence needs separate history.

H is the number of key-value heads and d_head is the width of each head; together they set how many elements one token stores per layer.

T tracks how long the used answer sheet has become.

bytes_per_element depends on precision.

fp32 uses four bytes.

bf16 and fp16 use two bytes.

Lower-bit cache formats such as int8 or fp8 use one byte, and int4 uses half a byte.

The stored shapes are still familiar.

cache_k : [B, H, T, d_head]

cache_v : [B, H, T, d_head]

Count one tensor first.

Then double it for K and V.

Then multiply by bytes.

That is the whole recipe.
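The recipe above can be sketched as a small function. This is a minimal illustration, not code from any serving library; the name `kv_cache_bytes` and its parameter names are made up here, and the tiny configuration in the usage line is chosen only so the arithmetic is easy to follow by hand.

```python
def kv_cache_bytes(layers, batch, heads, tokens, d_head, bytes_per_element):
    """Return the KV cache size in bytes for the formula in this section."""
    # Count one [B, H, T, d_head] tensor first.
    elements_per_tensor = batch * heads * tokens * d_head
    # Double it: each layer keeps one tensor for keys and one for values.
    elements_per_layer = 2 * elements_per_tensor
    # Every layer has its own cache; then convert elements to bytes.
    return layers * elements_per_layer * bytes_per_element


# Toy configuration (hypothetical): 2 layers, 1 sequence, 4 heads, 8 tokens, d_head 16, 2 bytes.
small = kv_cache_bytes(layers=2, batch=1, heads=4, tokens=8, d_head=16, bytes_per_element=2)
print(small)  # 4096
```

Keeping every factor as a separate argument makes it easy to see which one a given optimization changes.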

context     cache memory (7B-style model, bf16, B = 1)
 512  │██                               0.25 GiB
1024  │████                             0.50 GiB
2048  │████████                         1.00 GiB
4096  │████████████████                 2.00 GiB
8192  │████████████████████████████████ 4.00 GiB
      └────────────────────────────────
       linear growth ──→
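The chart's numbers can be reproduced with a short loop. This sketch assumes the chart uses the same 7B-style configuration as the worked example in section 3 (32 layers, 32 heads, d_head of 128, one active sequence, 2 bytes per element); the constant and function names are illustrative only.

```python
# Assumed 7B-style setup, matching the worked example in section 3.
LAYERS, BATCH, HEADS, D_HEAD, BYTES = 32, 1, 32, 128, 2


def cache_gib(tokens):
    """KV cache size in GiB for a given context length under the assumed setup."""
    total_bytes = 2 * LAYERS * BATCH * HEADS * tokens * D_HEAD * BYTES
    return total_bytes / 2**30


rows = [(tokens, cache_gib(tokens)) for tokens in (512, 1024, 2048, 4096, 8192)]
for tokens, gib in rows:
    print(f"{tokens:>5}  {gib:.2f} GiB")
```

Each doubling of the context doubles the result, which is the linear growth the chart shows.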

3) One worked example with actual numbers¶

Take a 7B-style setup.

layers = 32

B = 1

H = 32

T = 4096

d_head = 128

bytes_per_element = 2 for bf16

Now multiply carefully.

memory = 2 × 32 × 1 × 32 × 4096 × 128 × 2

2 × 32 = 64

64 × 1 × 32 = 2048

2048 × 4096 = 8,388,608

8,388,608 × 128 = 1,073,741,824

1,073,741,824 × 2 = 2,147,483,648 bytes

That is exactly 2 GiB, or about 2.15 GB.

That number is for one active sequence.

It excludes model weights.

It excludes temporary buffers.

It excludes scheduler overhead.

It counts only the memory shortcut.
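The same multiplication can be checked in a few lines, and regrouping it gives a handy per-token figure. This is only a verification sketch using the numbers from this example; the variable names are illustrative.

```python
# Values from the worked example above.
layers, B, H, T, d_head, bytes_per_element = 32, 1, 32, 4096, 128, 2

# Cost of caching one token: keys and values, across all layers and heads.
per_token_bytes = 2 * layers * H * d_head * bytes_per_element

# The full cache is that per-token cost times the tokens held by every sequence.
total_bytes = per_token_bytes * B * T

print(per_token_bytes // 1024, "KiB per token")  # 512 KiB per token
print(total_bytes / 2**30, "GiB total")          # 2.0 GiB total
```

Reading the result per token makes the growth concrete: under these assumptions, every new token in the sequence adds half a MiB of cache.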

The lesson is blunt.

Longer context means heavier cache.

More active sequences mean heavier cache.

More layers mean heavier cache.

4) What scales badly, and how teams respond¶

Cache memory grows linearly with T.

Double context length.

You double cache size.

Double batch size.

You double cache size again.

That is why long-context serving gets tight quickly.

A GPU must hold weights, cache, and working buffers together.

Throughput becomes a packing problem.

Not just a math problem.
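The packing view can be sketched as a budget calculation. Everything numeric here is a hypothetical assumption for illustration: an 80 GiB device, 14 GiB reserved for weights, 4 GiB for working buffers, and the 2 GiB per-sequence cache from the worked example. Real servers also lose space to fragmentation and scheduling overhead, which this sketch ignores.

```python
GIB = 2**30


def max_sequences(gpu_bytes, weight_bytes, buffer_bytes, cache_bytes_per_sequence):
    """How many full-length sequences fit once weights and buffers are reserved."""
    free = gpu_bytes - weight_bytes - buffer_bytes
    if free <= 0:
        return 0  # the model does not fit at all
    return free // cache_bytes_per_sequence


# Cache for one 4096-token sequence in the 7B-style setup (2 GiB).
per_seq = 2 * 32 * 1 * 32 * 4096 * 128 * 2

# Hypothetical budget: 80 GiB device, 14 GiB weights, 4 GiB working buffers.
fit = max_sequences(80 * GIB, 14 * GIB, 4 * GIB, per_seq)
print(fit)  # 31
```

Halving the per-sequence cache, by any of the methods below, would roughly double how many sequences fit in the same budget.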

Teams reduce the bill in a few common ways.

Quantized KV cache: fewer bytes per stored value.
Paged attention: less fragmentation while many requests share memory.
Sliding-window attention: cap the active answer sheet length.
Multi-query attention: all query heads share a single key-value head.
Grouped-query attention: several query heads share each key-value head, so H in the formula becomes the smaller KV-head count.

Those tricks change memory economics directly.

They do not remove the formula.

Most change one factor inside it; paged attention instead cuts the memory wasted around it.
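A rough sketch of how each technique touches the formula, starting from the 7B-style example. The specific settings (an 8-bit cache, a 1024-token window, 8 KV heads) are hypothetical choices for illustration, not the configuration of any particular model. Paged attention is left out because it reduces wasted space rather than changing a factor.

```python
def kv_cache_gib(layers=32, batch=1, kv_heads=32, tokens=4096, d_head=128, bytes_per_element=2):
    """KV cache size in GiB; defaults follow the 7B-style worked example."""
    return 2 * layers * batch * kv_heads * tokens * d_head * bytes_per_element / 2**30


variants = {
    "baseline (bf16, 32 KV heads)": kv_cache_gib(),
    # Quantized KV cache: changes bytes_per_element.
    "8-bit KV cache": kv_cache_gib(bytes_per_element=1),
    # Sliding window: the cached length never exceeds the window (assumed 1024).
    "sliding window of 1024 tokens": kv_cache_gib(tokens=min(4096, 1024)),
    # Grouped-query attention: fewer KV heads (assumed 8) serve all query heads.
    "grouped-query, 8 KV heads": kv_cache_gib(kv_heads=8),
    # Multi-query attention: a single KV head.
    "multi-query, 1 KV head": kv_cache_gib(kv_heads=1),
}

for name, gib in variants.items():
    print(f"{name:<32} {gib:.4f} GiB")
```

The techniques also compose: a grouped-query model with an 8-bit cache multiplies both savings.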

Where this lives in the wild¶

vLLM PagedAttention — paging reduces cache fragmentation while many long requests decode together.
Mistral 7B — sliding-window attention limits active history and therefore limits cache growth.
TensorRT-LLM — lower-precision KV paths help serving teams cut cache memory pressure.
Claude-style long-context systems — huge context windows demand careful cache budgeting, not only faster kernels.
Gemini-style long-context systems — very long prompts imply major storage planning around active KV history.

Pause and recall¶

Why does the formula start with a factor of 2?
Which term tracks the active answer sheet length directly?
Why do H and d_head both matter in the total?
What resource trade are we making when we enable cache?

Interview Q&A¶

Q1. Why can KV cache become the main bottleneck during long-context serving?
Because cache memory grows with layers, heads, tokens, and batch size, and that is before counting weights and other buffers.
Common wrong answer to avoid: Saying long context is only a compute problem.

Q2. What does bytes_per_element represent in the formula?
It represents storage precision such as fp16, bf16, int8, or int4.
Common wrong answer to avoid: Saying it depends on vocabulary size.

Q3. Why do grouped-query and multi-query attention reduce KV memory?
They store fewer key-value heads while keeping the full set of query heads.
Common wrong answer to avoid: Saying those methods only change the FFN.

Apply now (5 min)¶

Quick exercise.

Use layers=24, B=2, H=16, T=2048, d_head=64, and bytes=2.

Compute total cache memory step by step.

Then name the term that doubles when the answer sheet doubles.

Sketch from memory.

Write the full formula.

Underline the factor 2 for keys and values.

Circle the factors tied to the parallel graders.

Box the factor tied to context length.

Add one note explaining why the memory shortcut speeds decode but raises memory pressure.

Bridge. Good. Next we separate prompt processing from token-by-token decoding. → 12-prefill-vs-decode.md