Skip to content

09. MQA and GQA — share the heavy memory parts

~11 min read. The thing that says: many workers can ask questions without carrying separate copies of the same paperwork.

Built on the ELI5 in 00-eli5.md. the site constraint — GPU memory at the construction site — gets lighter when many query heads share fewer key-value records instead of storing one set per head.


1) Why attention heads do not need equal KV storage

Start with standard multi-head attention.

Each query head gets its own K and V head.

Quality is good.

Memory cost is high.

Why high?

Because the KV cache stores history per KV head.

If you have 64 heads, you store 64 K histories and 64 V histories.

That becomes expensive fast.

Now ask a sharper serving question.

Do query heads really need separate key and value storage every time?

Not always.

Sometimes many query heads can share the same K and V information.

That is the idea behind MQA and GQA.

They reduce cache without removing many query heads.

So attention richness stays more intact than you may fear.

And the site constraint gets relief exactly where serving hurts.

2) The sharing patterns: MHA, GQA, and MQA

See the three pictures.

MHA
Q1→K1,V1   Q2→K2,V2   Q3→K3,V3   Q4→K4,V4

GQA
Q1→K1,V1   Q2→K1,V1   Q3→K2,V2   Q4→K2,V2

MQA
Q1→K1,V1   Q2→K1,V1   Q3→K1,V1   Q4→K1,V1

Simple, no?

MHA is the baseline.

No sharing.

GQA is the middle ground.

A small group of query heads shares one KV pair.

MQA is the aggressive version.

All query heads share one KV pair.

So the savings come from one factor.

Fewer KV heads.

That directly shrinks the KV-cache formula from the last topic.

And notice the elegance.

You are not rewriting the field notes here.

You are reducing paperwork stored on the construction site.

That is why this topic sits under the site constraint.

3) One memory savings calculation

Take a model with 64 query heads.

Assume head_dim = 128.

Assume seq_len = 8192.

Assume bytes = 2 for bf16.

For one KV head, cached K and V bytes are:

8192 × 128 × 2 × 2 = 4,194,304 bytes

That is about 4 MiB per layer per request.

Now compare three cases.

MHA uses 64 KV heads.

64 × 4 MiB = 256 MiB per layer per request.

GQA with 8 KV heads uses:

8 × 4 MiB = 32 MiB per layer per request.

That is 1/8 of baseline.

MQA with 1 KV head uses:

1 × 4 MiB = 4 MiB per layer per request.

That is 1/64 of baseline.

Look how direct that is.

If the model has many layers, multiply those savings again.

So GQA is already a huge memory win.

MQA is even smaller.

But quality trade-offs are not identical.

That is why engineers do not always choose the maximum compression.

4) Why GQA became the practical favorite

GQA became a common compromise because it keeps more head diversity than MQA.

At the same time, it still cuts cache aggressively.

So you get a very good quality-memory trade.

That matters in real serving.

A model that is slightly more memory-efficient can serve more users per GPU.

That changes cost.

Latency.

And product capacity.

So what do teams usually prefer?

If quality is fragile, GQA is often the safer middle road.

If memory is desperate, MQA can be very attractive.

If memory is abundant and maximum flexibility matters, standard MHA is still the richest setup.

But for many modern LLM deployments, GQA wins the adult argument.

Good enough quality.

Much smaller cache.

That is why you see it so often.

And remember the bigger lesson.

Not every memory problem is solved by smaller weights from the blueprint.

Sometimes the site fills up because repeated KV copies are wasteful.

Sharing fixes that.


Where this lives in the wild

  • Meta Llama 3 — uses GQA to keep serving memory lower than full multi-head KV storage.
  • Mistral 7B — uses grouped-query attention as a strong quality-memory compromise for deployment.
  • Falcon 40B — uses multi-query attention to shrink KV-cache cost aggressively.
  • NVIDIA TensorRT-LLM — benefits directly when fewer KV heads reduce runtime cache pressure.
  • vLLM serving stacks — can pack more traffic when model architecture already cuts KV head count.

Pause and recall

  1. What changes in the cache formula when moving from MHA to GQA?
  2. Why does MQA give the maximum KV-memory reduction?
  3. Why is GQA often preferred over MQA in practice?
  4. What does 1/8× or 1/64× refer to here exactly?

Interview Q&A

Q1. Why GQA not full MHA for most serving-heavy LLMs? Because GQA preserves much of the quality benefit of multiple query heads while cutting KV-cache memory sharply. Common wrong answer to avoid: "GQA is mainly for making the FFN smaller."

Q2. Why MQA not always the default if 1/64× sounds amazing? Because the most aggressive sharing can cost more quality, so some models prefer the middle ground of grouped sharing. Common wrong answer to avoid: "More compression is always the correct architectural choice."

Q3. Why fewer KV heads not fewer query heads? Because the memory pain comes from stored keys and values, while many query heads can still read shared KV information. Common wrong answer to avoid: "Queries dominate KV-cache memory, not keys and values."

Q4. Why architecture change not only scheduler tricks for cache pressure? Because changing kv_heads attacks the formula itself, while scheduling alone cannot erase per-head storage requirements. Common wrong answer to avoid: "Continuous batching fully removes the need for GQA or MQA."


Apply now (5 min)

Quick exercise.

Assume 32 query heads and head_dim=128.

Compute relative KV-cache size for MHA, GQA with 4 KV heads, and MQA.

Write the ratios only.

Then explain in one line why GQA is called a middle ground.

Sketch from memory.

Draw four query heads.

Show separate KV pairs for MHA.

Show groups for GQA.

Show one shared KV pair for MQA.

Write this note below: "Sharing paperwork reduces the load from the site constraint."


Bridge. Good. Even with fewer KV heads, serving can still waste memory when request lengths vary. So next we study fragmentation and why paged storage changed inference systems. → 10-paged-attention-serving.md