
02. Causal mask — the autoregressive contract

~10 min read. One matrix. One rule. The entire discipline of decoder training.

Built on the ELI5 in 00-eli5.md. The exam rule (each position sees only itself and earlier positions) becomes a lower-triangular visibility matrix in this topic.


1) The contract comes before the formula

Picture the exam hall first.

Each student may read only completed answers behind them.

Nobody reads tomorrow’s answer. That classroom rule is our exam rule.

Inside one answer sheet, token t uses tokens <= t only.

So what to do? We encode that rule as visibility.

Rows are queries. Columns are keys. 1 means allowed. 0 means blocked.

For T = 4, we get:

          key →   0 1 2 3
query 0        [ 1 0 0 0 ]
query 1        [ 1 1 0 0 ]
query 2        [ 1 1 1 0 ]
query 3        [ 1 1 1 1 ]
See. Every row may read itself and everything to its left. No row may read to the right of its own diagonal cell.

That is the full autoregressive contract.

If query 2 reads column 3, the contract is broken.

Then the parallel graders receive leaked evidence.

Then every projection lens is solving the wrong task.

Later, the memory shortcut will still keep only past material.

So training must follow the same rule now. Simple, no?
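
As a minimal sketch of the contract above, the visibility matrix can be built directly from the rule "key index <= query index", and any cell that breaks it can be listed. The helper name find_leaks is hypothetical, chosen only for this illustration.

```python
import numpy as np

def find_leaks(visible):
    """Return (query, key) pairs where a query can see a later key.

    Hypothetical helper for illustration; not part of any library.
    """
    visible = np.asarray(visible, dtype=bool)
    # k=1 keeps only cells strictly above the diagonal, i.e. key > query.
    future_cells = np.triu(visible, k=1)
    return [(int(q), int(k)) for q, k in zip(*np.nonzero(future_cells))]

T = 4
idx = np.arange(T)
# Rows are queries, columns are keys: allowed when key index <= query index.
visible = idx[None, :] <= idx[:, None]
print(find_leaks(visible))   # [] : the contract holds

broken = visible.copy()
broken[2, 3] = True          # query 2 now reads column 3
print(find_leaks(broken))    # [(2, 3)] : the contract is broken
```

A check like this is cheap enough to keep as an assertion while debugging mask construction.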

2) This matrix is lower triangular

Now say it in matrix language. The causal mask is lower triangular.

Lower triangle stays open. Upper triangle gets blocked. NumPy gives it directly.

import numpy as np

T = 4
mask = np.tril(np.ones((T, T), dtype=bool))
print(mask)
Output:
[[ True False False False]
 [ True  True False False]
 [ True  True  True False]
 [ True  True  True  True]]
Look. np.tril is exactly the decoder stencil.

We lay this stencil over the score matrix. Allowed cells keep their scores.

Blocked cells get -inf, or in practice a huge negative number.

That means the exam rule is not a slogan.

It is a matrix operation. A tiny ASCII picture helps.

scores [T,T]      mask [T,T]        masked scores [T,T]
s s s s           1 0 0 0           s -inf -inf -inf
s s s s   apply   1 1 0 0     ->    s   s  -inf -inf
s s s s           1 1 1 0           s   s    s  -inf
s s s s           1 1 1 1           s   s    s    s
Here s is any kept score. A 0 in the mask turns that cell into -inf.
The answer sheet stays honest this way.

The parallel graders then compare only legal positions.
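
One way to lay the stencil over a score matrix, sketched with np.where. The random scores are placeholders, not real attention scores.

```python
import numpy as np

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))            # placeholder scores [T, T]

mask = np.tril(np.ones((T, T), dtype=bool))  # True = allowed
# Keep allowed scores, replace blocked cells with -inf.
masked_scores = np.where(mask, scores, -np.inf)
print(masked_scores)
```

np.where is used here because the mask is boolean. An additive mask holding 0 for allowed cells and -inf for blocked cells would give the same masked scores.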

3) Why we add -inf before softmax

Now what is the problem beginners hit? They understand the mask.

But they apply it at the wrong time. Correct order is this:

Q [T,d] ──┐
          ├──→ scores [T,T]
K [T,d] ──┘
         + mask [T,T]
            softmax
            weights
Not this:
Q, K -> scores -> softmax -> zero some probabilities
Why? Because softmax normalizes the whole row.

If we block after softmax, row sums stop being one.

Then we must renormalize again. Many people forget that step.

So the clean method is simpler. Add -inf before softmax.

Then forbidden positions become exactly zero probability. See a worked row.

Start with:

raw scores = [2.0, 1.0, 4.0, 3.0]
Suppose query position 2 must block column 3.

So the mask row is:

mask row   = [1, 1, 1, 0]
Apply the mask before softmax.
masked row = [2.0, 1.0, 4.0, -inf]
Now do stable softmax. Subtract the max value 4.0.
shifted    = [-2.0, -3.0, 0.0, -inf]
exp        = [0.1353, 0.0498, 1.0000, 0.0000]
sum        = 1.1851
softmax    = [0.1142, 0.0420, 0.8438, 0.0000]
Good. The future position got exactly zero. Not a tiny rounding guess. A structural zero.
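
The worked row above can be reproduced in a few lines. This is a sketch that assumes at least one position in the row is allowed, which the causal mask guarantees because the diagonal is always open. The function name masked_softmax_row is made up for this example.

```python
import numpy as np

def masked_softmax_row(scores, allowed):
    """Stable softmax over one row, with blocked positions set to -inf first."""
    scores = np.asarray(scores, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    masked = np.where(allowed, scores, -np.inf)
    shifted = masked - masked.max()   # largest allowed score becomes 0
    exp = np.exp(shifted)             # exp(-inf) is exactly 0.0
    return exp / exp.sum()

row = masked_softmax_row([2.0, 1.0, 4.0, 3.0], [1, 1, 1, 0])
print(np.round(row, 4))   # approximately 0.1142, 0.0420, 0.8438, 0.0
print(row[3] == 0.0)      # True: a structural zero
```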

Now compare the wrong method. Softmax first on the raw row gives:

shifted    = [-2.0, -3.0, 0.0, -1.0]
exp        = [0.1353, 0.0498, 1.0000, 0.3679]
sum        = 1.5530
softmax    = [0.0871, 0.0321, 0.6439, 0.2369]
If we now zero the last entry after softmax, we get:
wrong row  = [0.0871, 0.0321, 0.6439, 0.0000]
row sum    = 0.7631
That is not a valid probability row anymore. So what to do?

Mask logits first. Always.
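
The same comparison in NumPy, as a small sketch using the row from the worked example. It shows the broken row sum, and that renormalizing afterwards recovers the mask-first result, which is exactly the extra step people forget.

```python
import numpy as np

def softmax(x):
    """Stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    exp = np.exp(x - x.max())
    return exp / exp.sum()

raw = np.array([2.0, 1.0, 4.0, 3.0])
allowed = np.array([True, True, True, False])

wrong = np.where(allowed, softmax(raw), 0.0)      # softmax first, zero after
right = softmax(np.where(allowed, raw, -np.inf))  # mask logits first

print(round(float(wrong.sum()), 4))               # about 0.7631, not 1
print(round(float(right.sum()), 4))               # 1.0
# Renormalizing the wrong row gives the same weights as masking first.
print(np.allclose(wrong / wrong.sum(), right))    # True
```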

4) Batched multi-head masking is the same rule

Single-head attention uses score shape [T, T].

Real decoder code usually uses [B, H, T, T], where:

  • B is batch size.
  • H is number of heads.
  • T is sequence length.

The mask itself need not be copied physically.

We let broadcasting do the work. Build it once like this:

mask = np.tril(np.ones((T, T), dtype=bool))[None, None, :, :]
Now mask shape is:
[1, 1, T, T]
Scores have shape:
[B, H, T, T]
Broadcasting stretches the mask over all batches and heads.
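
A quick way to see that no physical copy is made, sketched with small illustrative sizes. The values of B, H, and T below are arbitrary.

```python
import numpy as np

B, H, T = 2, 3, 4                      # arbitrary example sizes
mask = np.tril(np.ones((T, T), dtype=bool))[None, None, :, :]
scores = np.zeros((B, H, T, T))

# Shape that results from broadcasting the two arrays together.
print(np.broadcast(mask, scores).shape)   # (2, 3, 4, 4)

# A broadcast view over the full score shape.
view = np.broadcast_to(mask, scores.shape)
print(view.strides[:2])                   # (0, 0): B and H reuse the same memory
print(np.shares_memory(view, mask))       # True: still the one [T, T] stencil
```

Zero strides on the first two axes mean every batch item and every head reads the same T x T block of memory.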

That is why one mask can supervise many parallel graders. And each grader still uses its own projection lens outputs. A shape-flow picture helps.

Q [B,H,T,d] ──┐           ┌── K [B,H,T,d]
              │           │
              │      transpose(-2,-1)
              │           │
              └──────┬────┘
              scores [B,H,T,T]
             + mask [1,1,T,T]
                  softmax
             weights [B,H,T,T]
               @ V [B,H,T,d]
              output [B,H,T,d]
Look. The answer sheet length T appears twice in the score grid.

Once for query rows. Once for key columns.
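
The shape flow above can be sketched in NumPy as follows. This is an illustration under assumptions: Q, K, and V are random placeholders, and the usual 1/sqrt(d) score scaling is left out so the code matches the diagram step by step.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Q, K, V: [B, H, T, d]. Returns (output [B, H, T, d], weights [B, H, T, T])."""
    T = Q.shape[-2]
    scores = Q @ np.swapaxes(K, -2, -1)                      # [B, H, T, T]
    mask = np.tril(np.ones((T, T), dtype=bool))[None, None]  # [1, 1, T, T]
    scores = np.where(mask, scores, -np.inf)                 # broadcast over B and H
    scores = scores - scores.max(axis=-1, keepdims=True)     # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
B, H, T, d = 2, 3, 4, 8
Q, K, V = (rng.normal(size=(B, H, T, d)) for _ in range(3))

out, weights = causal_attention(Q, K, V)
print(out.shape, weights.shape)   # (2, 3, 4, 8) (2, 3, 4, 4)
```

Every row of weights sums to one, and every cell above the diagonal is exactly zero, for all batches and heads.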

The memory shortcut changes inference efficiency later.

But it does not change this legality rule.

The exam rule remains the same.
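
A small consistency check, assuming the memory shortcut means reusing the keys and values of past positions at inference time. The sketch is single-head and unscaled, and the function names are hypothetical. Row t of masked training attention should match one query attending over positions 0..t only.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Full masked attention for one head. Q, K, V: [T, d]."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, Q @ K.T, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def last_step_only(q_t, K_past, V_past):
    """One query against stored keys/values for positions 0..t. No mask needed."""
    scores = K_past @ q_t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_past

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

full = causal_attention(Q, K, V)
t = 2
step = last_step_only(Q[t], K[: t + 1], V[: t + 1])
print(np.allclose(full[t], step))   # True: same legality rule, less work
```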

Where this lives in the wild

  • GPT-4 decoder training at OpenAI — masked self-attention enforces next-token prediction instead of future leakage.
  • Chinchilla scaling experiments at DeepMind — decoder training quality depends on correct autoregressive masking across huge token budgets.
  • LLaMA-3 pretraining at Meta — lower-triangular masks keep decoder layers faithful to left-to-right generation.
  • Mixtral decoder layers at Mistral AI — sparse expert routing still sits on top of the same causal masking discipline.
  • Codex code generation at OpenAI — code tokens must be predicted without reading unseen future source tokens.

Pause and recall

  1. Why is the causal mask a lower-triangular matrix?
  2. Why do we add -inf before softmax instead of after?
  3. Why can mask shape [1, 1, T, T] work with score shape [B, H, T, T]?
  4. What exact illegal event happens when query row 2 sees key column 3?

Interview Q&A

Q1. State the autoregressive contract in one sentence.
A1. Token t may use only tokens up to and including position t.
Common wrong answer to avoid: “Each token may use all tokens with small weights.”

Q2. Why is np.tril the right helper here?
A2. It preserves the lower triangle, which matches past-and-self visibility.
Common wrong answer to avoid: “Because triangular matrices are faster by default.”

Q3. What breaks if you mask after softmax?
A3. Probability rows stop summing to one unless you renormalize again.
Common wrong answer to avoid: “Nothing breaks because zeros are harmless.”

Q4. What is the practical broadcast shape for batched multi-head scores?
A4. We commonly use a mask shaped [1, 1, T, T] over [B, H, T, T] scores.
Common wrong answer to avoid: “The mask must always be copied to [B, H, T, T] first.”


Apply now (5 min)

Quick exercise. Write the T = 5 visibility matrix from memory.

Then compute one masked softmax row by hand.

Finally, sketch from memory the shape flow from Q and K to masked scores.

Bridge. Good. We know the rule mathematically. Next we code it carefully, because most failures now come from axes and broadcasting. → 03-masking-in-code.md