10. Multi-head attention — parallel crews with different habits¶

One spotlight beam cannot watch every clue well. So we split the job.

Built on the ELI5 in 00-eli5.md. The spotlight beam — one token consulting others — now becomes several parallel beams that notice different patterns.

Mental model — one beam is overloaded¶

One token may need many kinds of context. Syntax. Entities. Negation. Long-range references. Timing words. One head has one scorecard. So it must compress many questions into one pattern. That is too much pressure. Consider this sentence:

The horse crossed the road because it was tired.

For the token "it", one clue asks:

which noun does "it" refer to?

Another clue asks:

what attribute matters for "was tired"?

A third clue asks:

which nearby words shape grammar?

So what to do? Do not force one crew to do all jobs. Use multiple heads. Each head gets its own Q, K, V projections. Each head learns its own habit.

same tokens
   |
   +--> head 1 : maybe tracks entity links
   +--> head 2 : maybe tracks attribute links
   +--> head 3 : maybe tracks syntax
   +--> head 4 : maybe tracks negation

That is the point of multi-head attention.

Why simple rescues fail¶

Failed rescue 1 — make one head wider¶

A wider head has more dimensions. Good. But it still produces one scorecard. One attention pattern cannot cleanly separate every job.

Failed rescue 2 — add more layers only¶

More layers help. But each layer still begins with overloaded heads. If the first routing is muddy, later layers inherit the mess.

Failed rescue 3 — hope the FFN fixes it later¶

The FFN mixes features per token. It does not create new cross-token routing. If attention missed the clue, the FFN cannot invent it.

So yes, those tricks help a bit. They do not solve the routing bottleneck.
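The per-token claim about the FFN can be checked with a small sketch. The two weight matrices below are hypothetical toy values, not taken from any trained model. The sketch changes only the middle token of a three-token sequence and compares the FFN outputs at the other positions.

```python
# Minimal sketch: a position-wise FFN sees one token vector at a time.
# W1 and W2 are hypothetical toy weights chosen for illustration.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(x, W):
    # Row vector (len m) times matrix (m rows, p columns) -> vector of len p.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

W1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -0.5]]    # 2 -> 3
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 -> 2

def ffn(x):
    return matvec(relu(matvec(x, W1)), W2)

def ffn_over_sequence(tokens):
    # Applied independently at every position: no token reads another token.
    return [ffn(x) for x in tokens]

seq_a = [[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]]
seq_b = [[0.2, 0.4], [-3.0, 7.0], [0.5, 0.5]]  # only the middle token changed

out_a = ffn_over_sequence(seq_a)
out_b = ffn_over_sequence(seq_b)

print("first token unchanged: ", out_a[0] == out_b[0])
print("middle token unchanged:", out_a[1] == out_b[1])
print("last token unchanged:  ", out_a[2] == out_b[2])
```

Only the position whose own input changed gets a different output. Any information that must travel between positions has to be moved by attention first.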

Picture first — parallel crews over the same input¶

All heads read the same residual stream. But they project it into different subspaces.

input X
  |
  +--> W_Q^(1), W_K^(1), W_V^(1) -> head 1 output
  +--> W_Q^(2), W_K^(2), W_V^(2) -> head 2 output
  +--> W_Q^(3), W_K^(3), W_V^(3) -> head 3 output
  +--> W_Q^(4), W_K^(4), W_V^(4) -> head 4 output
                         |
                         v
                concatenate -> W_O -> mixed output

Same tokens. Different geometric views. That is the whole trick.

Formula — split, attend, concatenate, mix¶

Let d_model = 8. Let number of heads h = 2. Then each head gets:

d_k = d_v = d_model / h = 4
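A quick sanity check on the split, as a sketch under the sizes stated above. It also counts projection weights to show that, with d_k = d_model / h, splitting into heads does not add Q/K/V parameters compared with one full-width head. The variable names are illustrative.

```python
d_model = 8
h = 2

# The split only works cleanly when the head count divides the model width.
assert d_model % h == 0
d_k = d_v = d_model // h

# Each head owns W_Q, W_K, W_V, each of shape (d_model, d_k).
params_per_head = 3 * d_model * d_k
params_all_heads = h * params_per_head

# One full-width head would use three (d_model, d_model) matrices instead.
params_single_wide_head = 3 * d_model * d_model

print("d_k =", d_k)
print("Q/K/V weights, all heads:      ", params_all_heads)
print("Q/K/V weights, single wide head:", params_single_wide_head)
```

Same weight budget, but h separate scorecards instead of one.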

For head i:

Q_i = X W_Q^(i)
K_i = X W_K^(i)
V_i = X W_V^(i)
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
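The per-head formula can be sketched in plain Python. The input X and the three projection matrices below are hypothetical toy values (3 tokens, d_model = 4, d_k = d_v = 2) picked only to keep the numbers readable. The function name `attention_head` is also just a label for this sketch.

```python
import math

def matmul(A, B):
    # Plain-Python matrix product: A is (n, m), B is (m, p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Subtract the max first for numerical stability.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                      # (n, n) raw scores
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)           # scorecard and head output

# Hypothetical toy input: 3 tokens, d_model = 4, d_k = d_v = 2.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

weights, head_out = attention_head(X, W_Q, W_K, W_V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's scorecard over all tokens, and each row sums to 1.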

Then combine heads:

MultiHead(X) = Concat(head_1, head_2) W_O

So the model does two things. First, it lets each head build its own scorecard. Second, it lets W_O mix the head outputs back together.
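The two steps, per-head scorecards and then mixing through W_O, can be sketched end to end with the sizes from this section (d_model = 8, h = 2). All matrices are filled with seeded random placeholders, so the numbers carry no meaning. Only the shapes and the data flow are the point, and the helper names are assumptions of this sketch.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def head(X, W_Q, W_K, W_V):
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    scale = math.sqrt(len(W_K[0]))
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, head_params, W_O):
    outputs = [head(X, *params) for params in head_params]
    # Concatenate per token: (n, d_v) per head -> (n, h * d_v).
    concat = [[v for out in outputs for v in out[t]] for t in range(len(X))]
    return matmul(concat, W_O)

d_model, h, n_tokens = 8, 2, 3
d_k = d_model // h
rng = random.Random(0)  # fixed seed; values are arbitrary placeholders

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

X = rand_matrix(n_tokens, d_model)
head_params = [tuple(rand_matrix(d_model, d_k) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d_model)

Y = multi_head(X, head_params, W_O)
print(len(Y), "tokens x", len(Y[0]), "dims")
```

The output has the same shape as the input, which is what lets the block sit on the residual stream.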

Worked setup — same token, two different views¶

Use a toy context around the token "it".

[horse] [crossed] [road] [because] [it] [was] [tired]

Now focus on the query token "it". Its input vector has 8 dimensions.

x_it = [0.6, 0.2, 0.1, 0.4, 0.7, 0.3, 0.5, 0.2]

Both heads receive the same x_it. But they use different projection matrices. So the resulting query vectors differ.

head 1 query q1_it -> entity-style subspace
head 2 query q2_it -> attribute-style subspace

That is why the same token can ask different questions at once.
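A minimal sketch of "same input, different projections". The vector x_it is the one from the text. The two projection matrices are hypothetical and deliberately crude: each one just copies four of the eight input dimensions. Real W_Q matrices are dense and learned.

```python
# x_it is the 8-d input vector from the text.
x_it = [0.6, 0.2, 0.1, 0.4, 0.7, 0.3, 0.5, 0.2]

def project(x, W):
    # Row vector times matrix: (d_model,) x (d_model, d_k) -> (d_k,)
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def selector(start, d_model=8, d_k=4):
    # Hypothetical projection that copies dims start .. start + d_k - 1.
    return [[1.0 if i == start + j else 0.0 for j in range(d_k)]
            for i in range(d_model)]

W_Q1 = selector(0)   # head 1 reads dims 0..3 (stand-in for an entity-style subspace)
W_Q2 = selector(4)   # head 2 reads dims 4..7 (stand-in for an attribute-style subspace)

q1_it = project(x_it, W_Q1)
q2_it = project(x_it, W_Q2)

print("q1_it =", q1_it)
print("q2_it =", q2_it)
```

One input vector, two different queries. Each query then scores the keys in its own head.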

Worked numerical example — head 1 tracks the likely referent¶

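Suppose head 1 builds this scorecard for "it". The toy lists five of the seven tokens; treat the weights on "crossed" and on "it" itself as zero.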

head 1 weights from it:
horse   : 0.70
road    : 0.05
because : 0.10
was     : 0.05
tired   : 0.10

ASCII picture:

head 1
it
 +--> horse    0.70
 +--> road     0.05
 +--> because  0.10
 +--> was      0.05
 +--> tired    0.10
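A scorecard is the output of a softmax, so it is non-negative and sums to 1. The sketch below works backwards from head 1's weights: it assumes hypothetical already-scaled scores equal to log(weight), which is one of many score sets that would produce this scorecard, and runs them through softmax. The entity token is labelled "horse" to match the sentence.

```python
import math

tokens = ["horse", "road", "because", "was", "tired"]
target = [0.70, 0.05, 0.10, 0.05, 0.10]   # head 1 scorecard from the text

# Hypothetical scaled scores, chosen as log(weight) so softmax recovers the scorecard.
scores = [math.log(w) for w in target]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)
top_token = tokens[weights.index(max(weights))]

print("weights:", [round(w, 2) for w in weights])
print("sum:", round(sum(weights), 6))
print("top token:", top_token)
```

Adding the same constant to every score leaves the weights unchanged. Only score differences matter.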

So head 1 strongly prefers the entity token. Now give those tokens 4-d value vectors.

v_horse   = [0.8, 0.1, 0.0, 0.2]
v_road    = [0.1, 0.7, 0.1, 0.0]
v_because = [0.2, 0.2, 0.6, 0.1]
v_was     = [0.1, 0.1, 0.2, 0.8]
v_tired   = [0.3, 0.6, 0.1, 0.2]

Weighted sum for head 1:

o1 = 0.70*v_horse + 0.05*v_road + 0.10*v_because + 0.05*v_was + 0.10*v_tired
   = [0.620, 0.190, 0.085, 0.210]
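The weighted sum is easy to check by machine. This sketch uses the head 1 weights and the 4-d value vectors from the text, with the entity token keyed as "horse" to match the sentence. The function name `weighted_sum` is just a label for this sketch.

```python
weights = {"horse": 0.70, "road": 0.05, "because": 0.10, "was": 0.05, "tired": 0.10}

values = {
    "horse":   [0.8, 0.1, 0.0, 0.2],
    "road":    [0.1, 0.7, 0.1, 0.0],
    "because": [0.2, 0.2, 0.6, 0.1],
    "was":     [0.1, 0.1, 0.2, 0.8],
    "tired":   [0.3, 0.6, 0.1, 0.2],
}

def weighted_sum(weights, values):
    # Output dim d = sum over tokens of weight[token] * value[token][d].
    dim = len(next(iter(values.values())))
    return [sum(weights[t] * values[t][d] for t in weights) for d in range(dim)]

o1 = weighted_sum(weights, values)
print([round(x, 3) for x in o1])
```

Rounded to three decimals this gives [0.62, 0.19, 0.085, 0.21]. The first dimension is the largest because the horse value vector is heavy there and carries 70% of the weight.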

Head 1 output keeps the referent clue strong.

Worked numerical example — head 2 tracks the attribute clue¶

Suppose head 2 builds a different scorecard.

head 2 weights from it:
horse   : 0.20
tired   : 0.55
road    : 0.05
was     : 0.15
because : 0.05

ASCII picture:

head 2
it
 +--> horse    0.20
 +--> tired    0.55
 +--> road     0.05
 +--> was      0.15
 +--> because  0.05

Using the same value vectors, the weighted sum becomes:

o2 = 0.20*v_horse + 0.55*v_tired + 0.05*v_road + 0.15*v_was + 0.05*v_because
   = [0.355, 0.410, 0.120, 0.275]
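To see which token actually drives head 2's output, the sum can be broken into per-token contributions. This is a sketch using the head 2 weights and the value vectors from the text, with the entity token keyed as "horse" to match the sentence. Ranking by summed contribution is an illustrative choice that works here because every value entry is non-negative.

```python
head2_weights = {"horse": 0.20, "tired": 0.55, "road": 0.05, "was": 0.15, "because": 0.05}

values = {
    "horse":   [0.8, 0.1, 0.0, 0.2],
    "road":    [0.1, 0.7, 0.1, 0.0],
    "because": [0.2, 0.2, 0.6, 0.1],
    "was":     [0.1, 0.1, 0.2, 0.8],
    "tired":   [0.3, 0.6, 0.1, 0.2],
}

# Each token's contribution is its attention weight times its value vector.
contributions = {t: [head2_weights[t] * v for v in values[t]] for t in head2_weights}

# The head output is the element-wise sum of all contributions.
o2 = [sum(c[d] for c in contributions.values()) for d in range(4)]

# Rank tokens by the total mass they add to the output.
ranked = sorted(contributions, key=lambda t: sum(contributions[t]), reverse=True)

print("o2 =", [round(x, 3) for x in o2])
for t in ranked:
    print(f"{t:8s}", [round(x, 3) for x in contributions[t]])
```

Rounded to three decimals, o2 is [0.355, 0.41, 0.12, 0.275], and "tired" tops the ranking.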

Now a different clue dominates. Head 2 says, "the tiredness relation matters."

Concatenate outputs, then apply `W_O`¶

Each head returns a 4-d vector. Concatenate them.

Concat(o1, o2)
= [0.620, 0.190, 0.085, 0.210, 0.355, 0.410, 0.120, 0.275]

For a toy example, let W_O be the identity matrix. Then:

final = Concat(o1, o2) W_O
      = [0.620, 0.190, 0.085, 0.210, 0.355, 0.410, 0.120, 0.275]

In a real model, W_O mixes the heads. But even this toy case shows the key idea. One output now carries both clues.

head 1 kept referent evidence
head 2 kept attribute evidence
final vector carries both
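A sketch of the last step. Here o1 and o2 are the head outputs recomputed from the weights and value vectors given earlier. The identity W_O reproduces the toy case. The second matrix, W_mix, is a hypothetical stand-in for a learned W_O: each output dimension averages one head 1 dimension with the matching head 2 dimension.

```python
o1 = [0.62, 0.19, 0.085, 0.21]     # head 1 output, recomputed from its weighted sum
o2 = [0.355, 0.41, 0.12, 0.275]    # head 2 output, recomputed from its weighted sum

concat = o1 + o2                   # list concatenation -> 8-d vector

def vecmat(x, W):
    # Row vector (len m) times matrix (m rows, p columns) -> vector of len p.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

n = len(concat)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
final_identity = vecmat(concat, identity)

# Hypothetical mixing matrix: output dim j = average of concat[j] and concat[(j + 4) % 8].
W_mix = [[0.5 if (i == j or i == (j + 4) % n) else 0.0 for j in range(n)]
         for i in range(n)]
final_mixed = vecmat(concat, W_mix)

print("identity W_O:", final_identity)
print("mixing W_O:  ", [round(x, 4) for x in final_mixed])
```

With the identity, the two heads sit side by side, untouched. With a mixing matrix, every output dimension can draw on both heads, which is closer to what a learned W_O does.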

What multi-head buys you¶

It does not mean each head gets a human job title. Heads are learned features. They can overlap. They can be messy. Still, the architecture gives the model room to specialize. That is the main win. One beam tries to be a generalist. Many beams can divide the labour.

Where this lives in the wild¶

OpenAI GPT-4 class models are described as Transformer-style; exact head counts are not public, but the design lets many attention patterns coexist in one layer.
GitHub Copilot runs on Transformer language models, where multi-head attention can track syntax, names, and distant code references together.
Google Gemini is built on Transformer decoders, whose attention heads let prompts mix entities, tools, and instructions.
Anthropic Claude: long documents benefit when some heads track local phrasing and others track far links.
Meta Llama model configurations list head counts explicitly; those counts affect both serving cost and model quality.

Interview Q&A¶

Q: Why is one attention head often not enough?
A: Because one scorecard must compress syntax, entities, negation, and long-range links into one pattern.
Common wrong answer to avoid: "Just make the head wider." Width adds capacity, not independent routing patterns.

Q: What changes across heads?
A: Each head has its own W_Q, W_K, and W_V, so the same token is viewed in different subspaces.

Q: What does W_O do after concatenation?
A: It mixes the head outputs back into the model dimension.
Common wrong answer to avoid: "W_O chooses the best head and drops the rest." No. It learns a linear mix.

Q: Does every head learn a neat interpretable role?
A: No. Some heads are interpretable. Many are mixed and distributed.

Apply now (5 min)¶

Take the sentence "The horse crossed the road because it was tired." Invent two heads for the token "it". Make head 1 mostly attend to the referent. Make head 2 mostly attend to the attribute clue. Write one 4-d value vector per important token. Compute two weighted sums. Then concatenate them.

Sketch from memory: Draw the forked diagram input -> head 1/head 2 -> concat -> W_O.

Bridge. Now we stop isolating parts and walk the entire chain end to end in 11-full-pipeline.md.