10. Causal masking — blocking the future¶

If the decoder can peek ahead, training becomes a fake exam. See the cheat clearly.

Builds on the ELI5 in 00-eli5.md. The causal mask, the future-blocking rule, now becomes a matrix you can inspect, debug, and trust.

Mental model — why future peeking breaks training¶

A decoder should predict the next token. So at position t, it may use only tokens at positions <= t. Not t + 1. Not tomorrow's answer.

Training gives the whole sequence at once. That is efficient. But it creates temptation. If token 3 may attend to token 4, the model can read the answer key. Loss looks fantastic. Learning is fake. Then inference arrives. Token 4 does not exist yet. The cheat path disappears. Generation quality falls.

So what to do? Keep parallel training. Insert a visibility rule inside the social bench, where tokens attend to each other. That rule says:

look left: yes
look at self: yes
look right: no

That is causal masking.

positions:   1    2    3    4
1 can see    ✓    X    X    X
2 can see    ✓    ✓    X    X
3 can see    ✓    ✓    ✓    X
4 can see    ✓    ✓    ✓    ✓

Lower triangle allowed. Upper triangle blocked. Simple rule. Massive consequence.
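The visibility table above can be generated rather than drawn by hand. A minimal sketch, assuming plain Python lists and 0-based indices (so the text's position 1 is index 0); the helper name `causal_allow` is made up for illustration:

```python
def causal_allow(n):
    """Return an n x n matrix; entry [i][j] is True if query i may see key j."""
    # Look left and at self (j <= i): allowed. Look right (j > i): blocked.
    return [[j <= i for j in range(n)] for i in range(n)]

allow = causal_allow(4)
for position, row in enumerate(allow, start=1):
    # Print 1 for "can see" and 0 for "blocked", one row per query position.
    print(position, "can see", " ".join("1" if ok else "0" for ok in row))
```

The whole rule is the single comparison `j <= i`. Flip it to `j >= i` and you get the upper triangle instead.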

The formula — lower-triangular masking¶

Start with raw attention logits:

S = QK^T / sqrt(d_k)

Now add a mask matrix M:

A = softmax(S + M)
output = A V

For causal masking:

M[i, j] = 0      if j <= i
M[i, j] = -inf   if j > i

Blocked positions get -inf before softmax. So their attention probability becomes zero. For sequence length 4, the allow matrix is:

[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]

The logit-mask version is:

[ 0   -inf -inf -inf]
[ 0    0   -inf -inf]
[ 0    0    0   -inf]
[ 0    0    0    0  ]

See. The math is tiny. The bug surface is not. One wrong triangle direction and the model trains on lies.
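To see the formula run end to end, here is a small sketch in plain Python. The all-zero scores and the one-dimensional values are illustrative choices, not part of the text. They are picked so the result is easy to read: with equal logits, each position simply averages the values it is allowed to see.

```python
import math

NEG_INF = float("-inf")

def causal_logit_mask(n):
    # M[i][j] = 0 if j <= i, else -inf (0-based indices).
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]   # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(S, M, V):
    # A = softmax(S + M), row by row; output = A V.
    A = [softmax([s + m for s, m in zip(s_row, m_row)]) for s_row, m_row in zip(S, M)]
    dim = len(V[0])
    out = [[sum(a * v[k] for a, v in zip(a_row, V)) for k in range(dim)] for a_row in A]
    return A, out

n = 4
S = [[0.0] * n for _ in range(n)]    # equal logits everywhere (illustrative)
V = [[1.0], [2.0], [3.0], [4.0]]     # toy values (illustrative)
A, out = masked_attention(S, causal_logit_mask(n), V)
print(A[2])   # equal weight on the first three keys, zero on the fourth
print(out)    # each row is the mean of the values seen so far
```

Real implementations do the same thing with tensors. The order of operations is what matters: add the mask to the logits first, then apply softmax.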

Worked example setup — a 4-token sequence¶

Use a tiny sequence:

t1 = I
t2 = love
t3 = samosa
t4 = EOS

Suppose one attention head produces this raw score matrix:

S =
[4.0, 3.0, 2.0, 1.0]
[1.0, 4.0, 3.0, 2.0]
[1.0, 2.0, 4.0, 6.0]
[0.0, 1.0, 2.0, 4.0]

Row i is the query at position i. Column j is the key at position j. Now watch three attempts.

Attempt 1 — no mask, so the model cheats¶

At position 3, the query is samosa. Its raw scores are:

[1.0, 2.0, 4.0, 6.0]

Without a mask, softmax sees all four positions. Approximate exponentials are:

[2.72, 7.39, 54.60, 403.43]

The sum is:

468.14

So the softmax weights are:

≈ [0.006, 0.016, 0.117, 0.862]

query = samosa
  +--> I       0.006
  +--> love    0.016
  +--> samosa  0.117
  +--> EOS     0.862   <-- future leak

See the problem. The third token mostly reads the fourth token. Training loss will look brilliant. Because the answer key is sitting there. But generation time cannot provide that token.
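The arithmetic for this attempt can be checked in a few lines. This is a sketch in plain Python with no mask applied, so softmax sees all four scores. The variable `future_mass` is an illustrative name for the weight that lands on the token that has not been generated yet.

```python
import math

def softmax(row):
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["I", "love", "samosa", "EOS"]
row3 = [1.0, 2.0, 4.0, 6.0]      # raw scores for the query at position 3

weights = softmax(row3)          # no mask applied
for tok, w in zip(tokens, weights):
    print(f"{tok:<7}{w:.3f}")

future_mass = weights[3]         # attention spent on position 4, the future
print("future mass:", round(future_mass, 3))
```

Roughly 86% of the attention goes to a position the model will not have at generation time.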

Attempt 2 — wrong mask orientation, so history disappears¶

Now imagine an implementation bug. You accidentally keep the upper triangle instead of the lower triangle. The buggy allow matrix becomes:

[1 1 1 1]
[0 1 1 1]
[0 0 1 1]
[0 0 0 1]

What happens at position 2? Token love can no longer see I. It may see only itself and the future. That destroys autoregressive meaning.

love query
  X--> I
  +--> love
  +--> samosa
  +--> EOS

So yes, a mask bug can be subtle. The code runs. Loss moves. But the model is learning the wrong game.
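Because the code still runs with a flipped triangle, a cheap structural check can help. The sketch below is one possible check, with a made-up helper name `mask_violations`. It assumes a purely causal allow matrix with no padding and reports every cell that breaks the rule, using the text's 1-based positions.

```python
def mask_violations(allow):
    """List (query, key, problem) entries that break the causal rule (1-based)."""
    problems = []
    for i, row in enumerate(allow):
        for j, ok in enumerate(row):
            if j > i and ok:
                problems.append((i + 1, j + 1, "future visible"))
            if j <= i and not ok:
                problems.append((i + 1, j + 1, "history hidden"))
    return problems

correct = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
buggy = [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]

print(mask_violations(correct))   # no violations
for problem in mask_violations(buggy):
    print(problem)
```

Once padding is combined in, blocked history cells can be legitimate, so a check like this fits the causal mask alone.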

Attempt 3 — correct causal mask, honest next-token learning¶

Return to the correct lower-triangular mask. At position 3, raw scores were:

[1.0, 2.0, 4.0, 6.0]

After masking:

[1.0, 2.0, 4.0, -inf]

Exponentials over allowed positions are:

[2.72, 7.39, 54.60, 0]

The sum is:

64.71

So the softmax weights become:

≈ [0.042, 0.114, 0.844, 0.000]

query = samosa
  +--> I       0.042
  +--> love    0.114
  +--> samosa  0.844
  X--> EOS     0.000 blocked

Now the model must predict EOS from past context. That is honest training. Row 2 also becomes clean. Raw row 2 is:

[1.0, 4.0, 3.0, 2.0]

Masked row 2 is:

[1.0, 4.0, -inf, -inf]

Softmax becomes:

≈ [0.047, 0.953, 0.000, 0.000]

So token 2 attends only to legal history. No cheating. No mismatch between train and generation.
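The same computation can be run over the whole score matrix at once. A sketch in plain Python using the matrix S from the worked example; `causal_softmax` is an illustrative helper name, and indices are 0-based, so the text's row 3 is `A[2]`.

```python
import math

NEG_INF = float("-inf")

S = [
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 4.0, 3.0, 2.0],
    [1.0, 2.0, 4.0, 6.0],
    [0.0, 1.0, 2.0, 4.0],
]

def causal_softmax(scores):
    """Apply the lower-triangular mask to each row, then softmax that row."""
    result = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else NEG_INF for j, s in enumerate(row)]
        exps = [math.exp(x) for x in masked]   # blocked entries become 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

A = causal_softmax(S)
for row in A:
    print([round(w, 3) for w in row])
```

Rows 2 and 3 of the printout should match the hand-computed weights above, and every entry above the diagonal is exactly zero.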

Why unmasked training looks good but generation fails¶

Teacher forcing feeds the full gold sequence during training. Without the causal mask, position t may read token t + 1. So the model solves an easier task than deployment requires. Example:

The answer is 42

If the query at "is" can attend to "42", loss collapses. At generation time, "42" is missing. The hidden state loses that shortcut. The model hesitates, loops, or guesses badly. This is why a broken mask often gives a suspicious pattern:

training loss: excellent
validation loss: suspiciously good
real generation: disappointing

The problem is not optimization. The problem is task leakage.
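Leakage can be tested directly: change something at a future position and see whether any earlier output moves. The sketch below is a simplified version of that idea. It keeps the scores fixed and perturbs only a value, whereas in a real model the scores also depend on the tokens. The helper names `attend` and `leaks_future` and the scalar values in `V` are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def attend(S, V, causal):
    """Single-head attention over scalar values; one output per position."""
    outputs = []
    for i, row in enumerate(S):
        logits = [s if (not causal or j <= i) else NEG_INF for j, s in enumerate(row)]
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        outputs.append(sum(e / total * v for e, v in zip(exps, V)))
    return outputs

def leaks_future(S, V, causal, pos):
    """Perturb the value at index `pos`; report whether any earlier output changes."""
    changed = list(V)
    changed[pos] += 100.0
    before = attend(S, V, causal)
    after = attend(S, changed, causal)
    return any(abs(a - b) > 1e-9 for a, b in zip(before[:pos], after[:pos]))

S = [
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 4.0, 3.0, 2.0],
    [1.0, 2.0, 4.0, 6.0],
    [0.0, 1.0, 2.0, 4.0],
]
V = [1.0, 2.0, 3.0, 4.0]   # toy scalar values (illustrative)

print("unmasked leaks:", leaks_future(S, V, causal=False, pos=3))
print("masked leaks:  ", leaks_future(S, V, causal=True, pos=3))
```

If earlier outputs react to a future change, the task has leaked, no matter how good the loss looks.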

Causal masking is not the only mask a decoder needs. Batches contain sequences of different lengths, so shorter sequences get padded. Padding tokens are fake tokens. They should not receive attention mass. Example batch:

seq A = [I, love, samosa, EOS]
seq B = [I, eat, PAD, PAD]

For sequence B, columns 3 and 4 are padding. All queries must block them. The padding mask is:

valid keys = [1, 1, 0, 0]
logit mask = [0, 0, -inf, -inf]

This mask is not about time. It is about fake tokens. Different purpose. Same implementation style.
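A padding mask is usually derived from sequence lengths rather than written by hand. A minimal sketch, assuming right-padded sequences and a made-up helper name `padding_logit_mask`:

```python
NEG_INF = float("-inf")

def padding_logit_mask(length, max_len):
    """Key-side mask for one sequence: 0 for real tokens, -inf for PAD columns."""
    return [0.0 if j < length else NEG_INF for j in range(max_len)]

batch = [["I", "love", "samosa", "EOS"], ["I", "eat"]]
max_len = max(len(seq) for seq in batch)

for seq in batch:
    padded = seq + ["PAD"] * (max_len - len(seq))   # pad on the right
    print(padded, padding_logit_mask(len(seq), max_len))
```

The mask is one row per sequence because it depends only on the key column. Every query row in that sequence reuses it.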

Combined masks — causal plus padding¶

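In a decoder, both rules may apply together. Rule 1 says do not look right. Rule 2 says do not look at PAD. So the final logit mask is the sum of both logit masks: a position stays open only if both masks are 0 there, which is the same as an elementwise AND of the two allow matrices. For sequence B, the combined allow matrix is: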

[1 0 0 0]
[1 1 0 0]
[1 1 0 0]
[1 1 0 0]

Rows 3 and 4 exist only because the tensor has length 4. But columns 3 and 4 are padding keys, so no row should attend to them. Future positions are still blocked too.

What should you debug in code?

- triangle direction
- diagonal included or excluded
- PAD columns blocked
- safe handling of -inf
- cache-time mask shape

These bugs are boring. They are also production killers.
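Combining the two masks and guarding against a fully blocked row can be sketched as follows. This assumes right padding and plain Python lists. Returning all zeros for a fully masked row is only one possible convention, chosen here for illustration, and the helper names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def combined_mask(n, length):
    """Causal logit mask plus padding logit mask for one sequence of real length `length`."""
    causal = [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]
    padding = [0.0 if j < length else NEG_INF for j in range(n)]
    # Adding logit masks: any -inf term keeps the sum at -inf.
    return [[c + p for c, p in zip(row, padding)] for row in causal]

def safe_softmax(logits):
    """Softmax that returns zeros for a fully masked row instead of dividing by zero."""
    m = max(logits)
    if m == NEG_INF:
        return [0.0] * len(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

M = combined_mask(4, length=2)                 # sequence B: two real tokens, two PAD
allow = [[int(x == 0.0) for x in row] for row in M]
for row in allow:
    print(row)

print(safe_softmax([NEG_INF] * 4))             # fully masked row handled without error
```

With right padding and a causal mask, no row is fully blocked, because every query can at least see position 1. The guard matters for layouts where a row can lose all of its keys.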

Where this lives in the wild¶

GPT-style chat systems depend on causal masks so each generated token uses only the prompt and earlier generated tokens.
GitHub Copilot-style code completion needs the same rule because the assistant must predict from the prefix, not from future file contents.
Hugging Face decoder implementations build causal and padding masks together for GPT-2, LLaMA, and Mistral style models.
Serving stacks like vLLM must keep mask logic correct even with KV cache and paged attention, because one mask bug leaks or hides context at scale.
T5 and BART decoders still use masked target-side self-attention even though they also cross-attend to a fully visible encoder.

Interview Q&A¶

Q: Why do decoders need a causal mask during training?
A: Because training sees the full gold sequence in parallel, and without masking each position can read future tokens. That turns next-token prediction into a cheating problem instead of the real deployment problem.
Common wrong answer to avoid: "Only to make computation faster." The main reason is correctness, not speed.

Q: Where exactly is the mask applied?
A: To the attention logits before softmax. Blocked positions get -inf, so after softmax they receive zero probability.

Q: Difference between a causal mask and a padding mask?
A: Causal mask blocks future positions based on time order. Padding mask blocks fake tokens based on batch padding. One enforces autoregression. The other enforces data validity.
Common wrong answer to avoid: "They are the same thing." They often combine, but they solve different problems.

Q: Why can a model with a broken mask show low training loss but poor generation?
A: Because the model learned with information that will not exist at inference time. The train task and the runtime task silently diverged.

Apply now (5 min)¶

Take a 4-token sequence of your own. For example:

[Data] [beats] [opinions] [today]

Do three things.

1. Draw the 4 x 4 lower-triangular allow matrix from memory.
2. Invent one row of raw scores, like [1, 3, 2, 5], and mask it for position 3.
3. Compute the softmax roughly and verify the future token gets probability zero.

Then sketch from memory:

- the unmasked picture
- the correctly masked picture
- the combined causal-plus-padding picture

If you can explain why the model looks smart in training when unmasked but becomes weak in generation, you own this file.

Bridge. The causal mask keeps decoding honest. The next bottleneck is speed. Without a cache, the model keeps recomputing yesterday's notes at every step. Read 11-kv-cache.md next.