09. Causal masking — blocking the future in decoders¶

If a decoder can peek ahead, the exam is fake. See the cheating rule clearly.

Built on the ELI5 in 00-eli5.md. The scorecard — softmax weights over visible tokens — now needs a hard rule that hides tomorrow's answer.

Mental model — the decoder must read only leftward¶

A decoder predicts the next token. So position t may use tokens <= t. Not token t + 1. Not the future. Training gives the whole sequence at once. That is fast. That is also dangerous. Without a mask, each row of the spotlight beam can read the answer key. Loss drops. But learning is dishonest. Then generation starts. The future token is not present yet. The cheat path vanishes. Quality collapses. So what to do? Keep parallel training. But block rightward visibility inside attention. See the picture:

query at position 1 -> may see [1]
query at position 2 -> may see [1 2]
query at position 3 -> may see [1 2 3]
query at position 4 -> may see [1 2 3 4]

That is causal masking.

Picture first — the lower-triangular rule¶

For four tokens, the allow matrix is lower triangular.

positions      1   2   3   4
row 1 sees     ✓   X   X   X
row 2 sees     ✓   ✓   X   X
row 3 sees     ✓   ✓   ✓   X
row 4 sees     ✓   ✓   ✓   ✓

Same thing as a matrix:

allow =
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]

And as a logit mask:

mask =
[ 0   -inf -inf -inf]
[ 0    0   -inf -inf]
[ 0    0    0   -inf]
[ 0    0    0    0  ]

Lower triangle allowed. Upper triangle blocked. One triangle direction. Huge consequence.
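The two matrices above can be generated rather than typed by hand. This is a minimal sketch in plain Python; the helper names `causal_allow` and `causal_logit_mask` are made up for illustration, and the code uses 0-based indices where the tables above count from 1.

```python
import math

def causal_allow(n):
    """n x n matrix of 0/1 flags: row i may see column j only if j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def causal_logit_mask(n):
    """Additive mask: 0.0 where attention is allowed, -inf where it is blocked."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

allow = causal_allow(4)
mask = causal_logit_mask(4)

for row in allow:
    print(row)
for row in mask:
    print(row)
```

The only thing that decides the triangle direction is the comparison `j <= i`. Flipping it to `j >= i` silently produces the upper triangle.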

Formula — block logits before softmax¶

Start with raw attention scores.

S = QK^T / sqrt(d_k)

Then add the causal mask.

A = softmax(S + M)
output = A V

The rule is simple.

M[i, j] = 0      if j <= i
M[i, j] = -inf   if j > i

Why -inf? Because softmax turns it into zero probability.

exp(-inf) = 0

So blocked columns get exactly zero weight. That means no future leak reaches the weighted sum. See. We do not delete tokens. We delete their visibility.
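The formula can be sketched for a single head in plain Python. This is an illustrative implementation, not a production one: the function names and the toy Q, K, V values are assumptions, and real systems would use batched tensor operations instead of loops.

```python
import math

def softmax(row):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention: A = softmax(S + M), output = A V."""
    n, d_k = len(Q), len(Q[0])
    output = []
    for i in range(n):
        logits = []
        for j in range(n):
            s_ij = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            m_ij = 0.0 if j <= i else -math.inf  # M[i, j] from the rule above
            logits.append(s_ij + m_ij)
        weights = softmax(logits)
        output.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return output

# Toy inputs (hypothetical values, chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_attention(Q, K, V)
print(out[0])  # the first position can only see itself, so this equals V[0]
```

The mask is added to the logits before softmax, never applied to the weights afterwards. Zeroing weights after softmax would leave each row summing to less than 1.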

Worked example setup — four tokens, one head¶

Take this tiny training sequence:

t1 = I
t2 = love
t3 = samosa
t4 = EOS

Suppose one head produces these raw scores.

S =
[4.0, 3.0, 2.0, 1.0]
[1.0, 4.0, 3.0, 2.0]
[1.0, 2.0, 4.0, 6.0]
[0.0, 1.0, 2.0, 4.0]

Row i is one query token. Column j is one key token. Now mask row by row.

Worked numerical example — apply the mask row by row¶

Row 1¶

Raw row:

[4.0, 3.0, 2.0, 1.0]

After causal mask:

[4.0, -inf, -inf, -inf]

Softmax:

[1.000, 0.000, 0.000, 0.000]

ASCII view:

query t1
 +--> t1   1.000
 X--> t2   0.000
 X--> t3   0.000
 X--> t4   0.000

Row 2¶

Raw row:

[1.0, 4.0, 3.0, 2.0]

After causal mask:

[1.0, 4.0, -inf, -inf]

Exponentials over visible positions:

[2.72, 54.60, 0, 0]

Softmax:

[0.047, 0.953, 0.000, 0.000]

ASCII view:

query t2
 +--> t1   0.047
 +--> t2   0.953
 X--> t3   0.000
 X--> t4   0.000

Row 3¶

Raw row:

[1.0, 2.0, 4.0, 6.0]

After causal mask:

[1.0, 2.0, 4.0, -inf]

Exponentials over visible positions:

[2.72, 7.39, 54.60, 0]

Softmax:

[0.042, 0.114, 0.844, 0.000]

ASCII view:

query t3
 +--> t1   0.042
 +--> t2   0.114
 +--> t3   0.844
 X--> t4   0.000

Row 4¶

Raw row:

[0.0, 1.0, 2.0, 4.0]

After causal mask:

[0.0, 1.0, 2.0, 4.0]

Nothing is blocked here. Softmax:

[0.015, 0.041, 0.112, 0.831]

ASCII view:

query t4
 +--> t1   0.015
 +--> t2   0.041
 +--> t3   0.112
 +--> t4   0.831
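The four hand calculations can be checked with a short script. This is a sketch in plain Python using the same S as above; `masked_softmax_row` is a hypothetical helper name, and rows are indexed from 0 in the code.

```python
import math

S = [
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 4.0, 3.0, 2.0],
    [1.0, 2.0, 4.0, 6.0],
    [0.0, 1.0, 2.0, 4.0],
]

def masked_softmax_row(scores, i):
    """Apply the causal mask for query row i, then softmax over the row."""
    masked = [s if j <= i else -math.inf for j, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

A = [masked_softmax_row(row, i) for i, row in enumerate(S)]

for i, row in enumerate(A, start=1):
    print(f"row {i}:", [round(w, 3) for w in row])
```

Rounded to three decimals, the printed weights should agree with the four rows worked out above.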

Three scenarios — no mask, wrong mask, correct mask¶

Scenario 1 — no mask¶

Look at row 3 without masking.

raw row 3 = [1.0, 2.0, 4.0, 6.0]
softmax   = [0.006, 0.016, 0.117, 0.862]

So token samosa mostly reads EOS. That is tomorrow's answer. Training loss looks excellent. Generation later will fail.

Scenario 2 — wrong triangle orientation¶

Suppose a bug keeps the upper triangle.

buggy allow =
[1 1 1 1]
[0 1 1 1]
[0 0 1 1]
[0 0 0 1]

Now row 2 cannot see position 1. History disappears. Only self and future remain. The code still runs. The model learns the wrong game.

Scenario 3 — correct causal mask¶

Keep the lower triangle. Future columns go to -inf. Softmax gives zero there. Now each token learns from available history only. That matches generation-time reality. See the contrast:

no mask      -> cheat path exists
wrong mask   -> history path breaks
right mask   -> training matches decoding
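The three scenarios can be compared side by side on the same score matrix. This is a sketch under the assumption that a mask is expressed as a small predicate `allow(i, j)`; the names `attend` and `future_mass` are invented for this example.

```python
import math

S = [
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 4.0, 3.0, 2.0],
    [1.0, 2.0, 4.0, 6.0],
    [0.0, 1.0, 2.0, 4.0],
]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(S, allow):
    """Softmax each row after blocking every (i, j) where allow(i, j) is False."""
    return [
        softmax([s if allow(i, j) else -math.inf for j, s in enumerate(row)])
        for i, row in enumerate(S)
    ]

def future_mass(A, i):
    """Total attention weight that row i places on columns to its right."""
    return sum(A[i][i + 1:])

no_mask = attend(S, lambda i, j: True)        # scenario 1: cheat path exists
wrong_mask = attend(S, lambda i, j: j >= i)   # scenario 2: upper triangle kept
right_mask = attend(S, lambda i, j: j <= i)   # scenario 3: lower triangle kept

for name, A in [("no mask", no_mask), ("wrong mask", wrong_mask), ("right mask", right_mask)]:
    print(name, "| row 3 future weight:", round(future_mass(A, 2), 3),
          "| row 2 weight on position 1:", round(A[1][0], 3))
```

A check like `future_mass` is cheap to run in a unit test. Any nonzero value under the mask you intended to be causal signals a leak or a flipped triangle.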

Why loss can look good while generation fails¶

Teacher forcing shows the full target sequence during training. Without causal masking, the decoder can copy future clues. That makes next-token prediction too easy. So cross-entropy falls for the wrong reason. At inference, token t+1 is missing. The model must rely on left context only. If training never enforced that rule, generation becomes shaky. You see repetition. You see incoherence. You see brittle continuations. Good training loss. Bad real behavior. Classic cheating symptom.
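One way to see the leak directly is to change a future token and watch whether earlier outputs move. The sketch below is a simplification: it reuses the worked-example scores, holds them fixed, and uses one scalar value per token (the `V` numbers are made up). With the causal mask, positions 1 to 3 should not react to a change at position 4.

```python
import math

S = [
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 4.0, 3.0, 2.0],
    [1.0, 2.0, 4.0, 6.0],
    [0.0, 1.0, 2.0, 4.0],
]

def attend(S, V, causal):
    """Return one scalar output per query row; V holds one scalar per token."""
    out = []
    for i, row in enumerate(S):
        logits = [s if (not causal or j <= i) else -math.inf
                  for j, s in enumerate(row)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        out.append(sum(e / total * v for e, v in zip(exps, V)))
    return out

V_original = [1.0, 2.0, 3.0, 4.0]     # hypothetical value per token
V_perturbed = [1.0, 2.0, 3.0, 400.0]  # only the last (future) token changes

masked_a = attend(S, V_original, causal=True)
masked_b = attend(S, V_perturbed, causal=True)
unmasked_a = attend(S, V_original, causal=False)
unmasked_b = attend(S, V_perturbed, causal=False)

print("masked, rows 1-3 unchanged:  ", masked_a[:3] == masked_b[:3])
print("unmasked, rows 1-3 unchanged:", unmasked_a[:3] == unmasked_b[:3])
```

The same perturbation idea scales to a full model: edit a token at position t + 1 and assert that the outputs at positions up to t stay the same.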

Padding masks — related, but not the same¶

Causal masks block the future. Padding masks block fake tokens. Suppose a batch has:

A: I love chai PAD PAD
B: We ship today PAD PAD

The PAD tokens are not real words. They were added only to make shapes match. So what to do? Hide PAD columns from attention.

real token  -> may see real tokens
real token  -> must not see PAD tokens

In practice, many systems combine both masks. One mask says, "do not look right." The other says, "do not look at padding garbage."
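Combining the two masks amounts to a logical AND of the two rules. This is a minimal sketch, assuming right-padded sequences and a literal "PAD" marker; the function name `combined_mask` is hypothetical.

```python
import math

def combined_mask(tokens, pad="PAD"):
    """Additive mask: allowed only if column j is not in the future AND not padding."""
    n = len(tokens)
    return [
        [0.0 if (j <= i and tokens[j] != pad) else -math.inf for j in range(n)]
        for i in range(n)
    ]

tokens = ["I", "love", "chai", "PAD", "PAD"]
mask = combined_mask(tokens)

for tok, row in zip(tokens, mask):
    print(f"{tok:>5}", ["0" if x == 0.0 else "-inf" for x in row])
```

With right padding, every row still has at least one visible column, so softmax stays well defined. Rows whose query is itself a PAD token still produce an output; those positions are typically excluded from the loss. If padding sat on the left instead, some rows could be fully blocked and would need separate handling.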

Where this lives in the wild¶

OpenAI ChatGPT decoder pretraining needs causal masks so next-token learning stays honest.
GitHub Copilot code completion must not peek at future tokens during training batches.
Anthropic Claude long-form generation still relies on left-to-right masking inside decoder attention.
Meta Llama training stacks combine causal masks with padding masks for padded batches.
Google Gemini decoder-style text generation uses the same future-blocking rule during autoregressive training.

Interview Q&A¶

Q: Why do we set blocked logits to -inf before softmax?
A: Because exp(-inf) = 0, so softmax gives exactly zero probability to blocked positions.
Common wrong answer to avoid: "We remove those tokens from the sequence." No. The tokens stay. Only visibility changes.

Q: Why can training loss improve without causal masking?
A: Because the model reads future tokens and solves an easier, dishonest task.
Common wrong answer to avoid: "Lower loss always means better generation." Not here. The task itself became invalid.

Q: Is a padding mask the same as a causal mask?
A: No. Padding masks hide fake batch filler. Causal masks hide future positions.

Q: Does the last token need masking?
A: Not by the causal mask. No future exists to its right, so the last row sees every position. A padding mask can still hide PAD columns in that row.

Apply now (5 min)¶

Take a 4-token sentence of your own.
Write a 4 x 4 raw score matrix.
Draw the lower-triangular causal mask.
Set blocked scores to -inf.
Compute one masked softmax row by hand.
Then say, in one line, why unmasked training is cheating.

Sketch from memory: Draw the ✓/X visibility table for positions 1 to 4.

Bridge. One masked spotlight is still too limited. Next, we split the work across parallel crews with different habits in 10-multi-head-attention.md.
