02. Causal mask — the autoregressive contract¶
~10 min read. One matrix. One rule. The entire discipline of decoder training.
Built on the ELI5 in 00-eli5.md. The exam rule (each position sees only itself and earlier positions) becomes a lower-triangular visibility matrix in this topic.
1) The contract comes before the formula¶
Picture the exam hall first.
Each student may read only completed answers behind them.
Nobody reads tomorrow’s answer. That classroom rule is our exam rule.
Inside one answer sheet, token t uses tokens <= t only.
So what to do? We encode that rule as visibility.
Rows are queries. Columns are keys. 1 means allowed. 0 means blocked.
For T = 4, with rows and columns indexed from 0, we get:
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
That is the full autoregressive contract.
If query 2 reads column 3, the contract is broken.
Then the parallel graders receive leaked evidence.
Then every projection lens is solving the wrong task.
Later, the memory shortcut will still keep only past material.
So training must follow the same rule now. Simple, no?
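The visibility rule can be sketched straight from the index comparison. This is a minimal sketch, assuming 0-indexed positions; `find_leaks` is a hypothetical helper name introduced only for illustration. It lists every (query, key) pair where a query puts weight on a future key.

```python
import numpy as np

T = 4  # sequence length used in the text

# Rule: query q may read key k only when k <= q (0-indexed positions).
q = np.arange(T)[:, None]   # query index runs down the rows
k = np.arange(T)[None, :]   # key index runs across the columns
visible = (k <= q).astype(int)
print(visible)


def find_leaks(weights):
    """Hypothetical helper: return (query, key) pairs that read the future."""
    w = np.asarray(weights)
    n = w.shape[-1]
    future = np.arange(n)[None, :] > np.arange(n)[:, None]
    rows, cols = np.nonzero((w != 0) & future)
    return list(zip(rows.tolist(), cols.tolist()))


print(find_leaks(visible))          # legal pattern: no leaks
print(find_leaks(np.ones((T, T))))  # fully open pattern: includes (2, 3)
```

An empty list means the contract holds. Any pair in the list is a broken contract.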
2) This matrix is lower triangular¶
Now say it in matrix language. The causal mask is lower triangular.
Lower triangle stays open. Upper triangle gets blocked. NumPy gives it directly: apply np.tril to a T x T matrix of ones.
Look. np.tril is exactly the decoder stencil.
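A minimal sketch of that call for T = 4. The variable names are illustrative, not taken from the original listing.

```python
import numpy as np

T = 4
# np.tril keeps the main diagonal and everything below it, zeroing the rest.
mask = np.tril(np.ones((T, T), dtype=int))
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Row 2 ends in 0. So query 2 cannot read column 3.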
We lay this stencil over the score matrix. Allowed cells keep their scores.
Blocked cells get a huge negative number.
That means the exam rule is not a slogan.
It is a matrix operation. A tiny ASCII picture helps.
scores [T,T]      mask [T,T]      masked scores [T,T]
+ + + +           1 0 0 0         +    -inf -inf -inf
+ + + +     +     1 1 0 0    ->   +    +    -inf -inf
+ + + +           1 1 1 0         +    +    +    -inf
+ + + +           1 1 1 1         +    +    +    +
The parallel graders then compare only legal positions.
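The stencil step can be sketched in a few lines. This assumes random scores as a stand-in for real query-key products and uses `-np.inf` as the "huge negative number"; some implementations use a large finite value instead.

```python
import numpy as np

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))             # stand-in for real attention scores
mask = np.tril(np.ones((T, T), dtype=bool))  # True = allowed, False = blocked

# Allowed cells keep their score. Blocked cells become -inf.
masked_scores = np.where(mask, scores, -np.inf)
print(masked_scores)
```

Nothing in the lower triangle changes. Only the upper triangle is overwritten.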
3) Why we add -inf before softmax¶
Now what is the problem beginners hit? They understand the mask.
But they apply it at the wrong time. Correct order is this:
scores -> mask -> softmax
Not this:
scores -> softmax -> mask
Why? Because softmax normalizes the whole row. If we block after softmax, row sums stop being one.
Then we must renormalize again. Many people forget that step.
So the clean method is simpler.
Add -inf before softmax.
Then forbidden positions become exactly zero probability. See a worked row.
Start with:
scores = [2.0, 1.0, 4.0, 3.0]
Suppose query position 2 must block column 3.
So the mask row is:
mask = [1, 1, 1, 0]
Apply the mask before softmax.
masked = [2.0, 1.0, 4.0, -inf]
Now do stable softmax. Subtract the max value 4.0.
shifted = [-2.0, -3.0, 0.0, -inf]
exp = [0.1353, 0.0498, 1.0000, 0.0000]
sum = 1.1851
softmax = [0.1142, 0.0420, 0.8438, 0.0000]
Now compare the wrong method. Softmax first on the raw row gives:
shifted = [-2.0, -3.0, 0.0, -1.0]
exp = [0.1353, 0.0498, 1.0000, 0.3679]
sum = 1.5530
softmax = [0.0871, 0.0321, 0.6439, 0.2369]
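Block column 3 after that, and the row becomes [0.0871, 0.0321, 0.6439, 0.0000]. It sums to 0.7631, not 1.
Mask logits first. Always.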
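Both orders can be checked numerically. This is a sketch, assuming the raw row is [2.0, 1.0, 4.0, 3.0], the values implied by the shifted rows and the max of 4.0; the `softmax` helper is written here for illustration.

```python
import numpy as np


def softmax(row):
    """Stable softmax over a 1-D row: subtract the max, exponentiate, normalize."""
    shifted = row - np.max(row)
    e = np.exp(shifted)          # exp(-inf) is exactly 0.0
    return e / e.sum()


scores = np.array([2.0, 1.0, 4.0, 3.0])       # assumed raw row
allowed = np.array([1, 1, 1, 0], dtype=bool)  # query 2 blocks column 3

# Right order: mask the logits, then softmax.
right = softmax(np.where(allowed, scores, -np.inf))

# Wrong order: softmax first, then zero the blocked column.
wrong = softmax(scores) * allowed

print(np.round(right, 4), right.sum())  # about [0.1142 0.042 0.8438 0.], sum 1
print(np.round(wrong, 4), wrong.sum())  # about [0.0871 0.0321 0.6439 0.], sum 0.7631
```

Dividing `wrong` by its own sum would recover `right`. That is the renormalization step people forget.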
4) Batched multi-head masking is the same rule¶
Single-head attention uses score shape [T, T].
Real decoder code usually uses [B, H, T, T]. Where:
- B is batch size.
- H is number of heads.
- T is sequence length.
The mask itself need not be copied physically.
We let broadcasting do the work. Build it once, with singleton axes for batch and heads.
Now mask shape is: [1, 1, T, T]
Scores have shape: [B, H, T, T]
Broadcasting stretches the mask over all batches and heads.
That is why one mask can supervise many parallel graders.
And each grader still uses its own projection lens outputs. A shape-flow picture helps.
Q [B,H,T,d] ──┐           ┌── K [B,H,T,d]
              │           │
              │   transpose(-2,-1)
              │           │
              └──────┬────┘
                     ▼
             scores [B,H,T,T]
                     │
             + mask [1,1,T,T]
                     ▼
                  softmax
                     │
             weights [B,H,T,T]
                     │
               @ V [B,H,T,d]
                     ▼
             output [B,H,T,d]
T appears twice in the score grid.
Once for query rows. Once for key columns.
The memory shortcut changes inference efficiency later.
But it does not change this legality rule.
The exam rule remains the same.
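The shape flow can be sketched in NumPy. This is a minimal sketch under stated assumptions: `causal_attention` is a hypothetical function name, the mask is additive (0 for allowed, -inf for blocked) to mirror the picture, and the usual 1/sqrt(d) scaling is left out because the picture omits it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Stable softmax along one axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention(Q, K, V):
    """Q, K, V: [B, H, T, d]. Returns (output [B, H, T, d], weights [B, H, T, T])."""
    T = Q.shape[-2]
    scores = Q @ np.swapaxes(K, -2, -1)              # [B, H, T, T]
    allowed = np.tril(np.ones((T, T), dtype=bool))   # [T, T]
    # Additive mask with two singleton axes: [1, 1, T, T].
    additive = np.where(allowed, 0.0, -np.inf)[None, None, :, :]
    weights = softmax(scores + additive, axis=-1)    # broadcast over B and H
    return weights @ V, weights


rng = np.random.default_rng(0)
B, H, T, d = 2, 3, 4, 8
Q, K, V = (rng.normal(size=(B, H, T, d)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
print(out.shape, weights.shape)  # (2, 3, 4, 8) (2, 3, 4, 4)

# Leak check: change only the last key/value position.
K2, V2 = K.copy(), V.copy()
K2[:, :, -1, :] += 1.0
V2[:, :, -1, :] += 1.0
out2, _ = causal_attention(Q, K2, V2)
# Earlier positions must not notice the change.
print(np.allclose(out[:, :, :-1], out2[:, :, :-1]))  # True
```

The mask is built once at [1, 1, T, T] and never copied. The leak check is a cheap way to test the legality rule: edit the future, and the past must not move.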
Where this lives in the wild¶
- GPT-4 decoder training at OpenAI — masked self-attention enforces next-token prediction instead of future leakage.
- Chinchilla scaling experiments at DeepMind — decoder training quality depends on correct autoregressive masking across huge token budgets.
- LLaMA-3 pretraining at Meta — lower-triangular masks keep decoder layers faithful to left-to-right generation.
- Mixtral decoder layers at Mistral AI — sparse expert routing still sits on top of the same causal masking discipline.
- Codex code generation at OpenAI — code tokens must be predicted without reading unseen future source tokens.
Pause and recall¶
- Why is the causal mask a lower-triangular matrix?
- Why do we add -inf before softmax instead of after?
- Why can mask shape [1, 1, T, T] work with score shape [B, H, T, T]?
- What exact illegal event happens when query row 2 sees key column 3?
Interview Q&A¶
Q1. State the autoregressive contract in one sentence.
A1. Token t may use only tokens up to and including position t.
Common wrong answer to avoid: “Each token may use all tokens with small weights.”
Q2. Why is np.tril the right helper here?
A2. It preserves the lower triangle, which matches past-and-self visibility.
Common wrong answer to avoid: “Because triangular matrices are faster by default.”
Q3. What breaks if you mask after softmax?
A3. Probability rows stop summing to one unless you renormalize again.
Common wrong answer to avoid: “Nothing breaks because zeros are harmless.”
Q4. What is the practical broadcast shape for batched multi-head scores?
A4. We commonly use a mask shaped [1, 1, T, T] over [B, H, T, T] scores.
Common wrong answer to avoid: “The mask must always be copied to [B, H, T, T] first.”
Apply now (5 min)¶
Quick exercise. Write the T = 5 visibility matrix from memory.
Then compute one masked softmax row by hand.
Finally, sketch from memory the shape flow from Q and K to masked scores.
Bridge. Good. We know the rule mathematically. Next we code it carefully, because most failures now come from axes and broadcasting. → 03-masking-in-code.md