
03. Masking in code — from formula to NumPy

~12 min read. The mask is short. The bugs are long. Code it once, correctly.

Built on the ELI5 in 00-eli5.md. The exam rule and the answer sheet — legal leftward visibility inside a fixed context window — now become real NumPy code and real tensor shapes.


1) Single-head causal attention in plain NumPy

Look. We first code the smallest honest version. One head. One sequence. One answer sheet of length T.

The exam rule is the only special ingredient. Here is the full code.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max first so np.exp never overflows.
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_mask(T):
    # True on and below the diagonal: position i may see positions 0..i.
    return np.tril(np.ones((T, T), dtype=bool))

def causal_attention(q, k, v):
    # q, k: [T, d_k]
    # v:    [T, d_v]
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # [T, T]
    mask = causal_mask(q.shape[0])         # [T, T]
    # Blocked entries get a huge negative score, so exp() sends them to 0.
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)     # [T, T]
    out = weights @ v                      # [T, d_v]
    return out, weights
See. The logic is short. The discipline is not. q @ k.T builds pairwise compatibility. np.where enforces the exam rule.

Softmax turns legal scores into legal probabilities. Then weights mix values. The projection lens produced q, k, and v earlier.

This function assumes that part already happened. The parallel graders come later when we add heads.

The memory shortcut comes later during decoding. Simple, no?
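A quick way to trust the single-head function is to run it on random inputs and check the three properties the text promises: the shapes, zero weight on the future, and rows that sum to one. The sketch below repeats the function so it runs on its own. The sizes and the seed are arbitrary choices for illustration, not values from the lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_attention(q, k, v):
    d_k = q.shape[-1]
    T = q.shape[0]
    scores = q @ k.T / np.sqrt(d_k)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Illustrative sizes and seed (assumptions, not from the lesson).
rng = np.random.default_rng(0)
T, d_k, d_v = 5, 8, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))

out, weights = causal_attention(q, k, v)
print(out.shape, weights.shape)            # (5, 3) (5, 5)

# Everything strictly above the diagonal is a future position.
future_weight = np.triu(weights, k=1)
print(np.all(future_weight == 0.0))        # True: exp underflows to exactly 0

# Each query row is still a probability distribution.
print(np.allclose(weights.sum(axis=-1), 1.0))

# Query 0 can only see itself, so its output is just v[0].
print(np.allclose(out[0], v[0]))
```

The last check is a handy smoke test: if row 0 of the output is not v[0], the mask is not doing its job.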

2) Batched multi-head code keeps the same idea

Now what changes with batches and heads? Only the axes. The rule stays identical. So what to do?

We let extra dimensions flow through the same steps. Here is the common pattern.

import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

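def causal_attention_batched(q, k, v):
    # q, k, v: [B, H, T, d_head]
    B, H, T, d_head = q.shape
    # Swap the last two axes of k: [B, H, T, d_head] -> [B, H, d_head, T].
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)         # [B, H, T, T]
    # One [1, 1, T, T] mask broadcasts over every batch and every head.
    mask = np.tril(np.ones((T, T), dtype=bool))[None, None, :, :]
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)                             # [B, H, T, T]
    out = weights @ v                                              # [B, H, T, d_head]
    return out, weights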
Look carefully at the transpose. k.transpose(0, 1, 3, 2) changes [B,H,T,d] into [B,H,d,T]. Then matrix multiply produces [B,H,T,T] scores.

The mask shape is [1,1,T,T]. Broadcasting stretches it across all batches. Broadcasting also stretches it across all parallel graders.

So each head gets the same legality rule. But each head still carries different projection lens outputs.

That is why one mask can supervise many heads. And the answer sheet length still controls both score axes.

Yes?
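To see that only the axes changed, we can compare the batched function against a plain loop that runs the single-head recipe on every (batch, head) slice. This is a minimal sketch. The sizes and the seed are arbitrary, and the loop exists only as a slow reference for the check.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_attention_batched(q, k, v):
    B, H, T, d_head = q.shape
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    mask = np.tril(np.ones((T, T), dtype=bool))[None, None, :, :]
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Illustrative sizes and seed (assumptions, not from the lesson).
rng = np.random.default_rng(1)
B, H, T, d_head = 2, 3, 4, 8
q = rng.normal(size=(B, H, T, d_head))
k = rng.normal(size=(B, H, T, d_head))
v = rng.normal(size=(B, H, T, d_head))

out, weights = causal_attention_batched(q, k, v)
print(out.shape, weights.shape)    # (2, 3, 4, 8) (2, 3, 4, 4)

# Slow reference: the single-head recipe, one (batch, head) slice at a time.
tri = np.tril(np.ones((T, T), dtype=bool))
matches = True
for b in range(B):
    for h in range(H):
        s = q[b, h] @ k[b, h].T / np.sqrt(d_head)
        w = softmax(np.where(tri, s, -1e9), axis=-1)
        same_weights = np.allclose(w, weights[b, h])
        same_output = np.allclose(w @ v[b, h], out[b, h])
        matches = matches and bool(same_weights and same_output)
print(matches)                     # True

# Same legality rule, different content: two heads do not share weights.
heads_identical = bool(np.allclose(weights[0, 0], weights[0, 1]))
print(heads_identical)             # False
```

The mask is shared, but each head carries its own q and k, so the weights inside the legal triangle differ from head to head.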

3) Full numerical walkthrough with every intermediate

Now we do one row fully. Let sequence length be T = 4. Let value width be d_v = 2.

Suppose raw scores are already computed.

scores =
[[ 2.0,  1.0,  0.0, -1.0],
 [ 1.0,  3.0,  2.0,  0.0],
 [ 2.0,  1.0,  4.0,  3.0],
 [ 0.0,  1.0,  2.0,  3.0]]
Suppose values are:
V =
[[1.0, 0.0],
 [0.0, 1.0],
 [2.0, 2.0],
 [9.0, 9.0]]
We inspect query position 2. That is the third token on the answer sheet. Its raw score row is:
row_2_raw = [2.0, 1.0, 4.0, 3.0]
Under the exam rule, query 2 may see columns 0,1,2 only. So its mask row is:
mask_2 = [1, 1, 1, 0]
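Apply the mask before softmax. The code fills blocked entries with -1e9. Here we write -inf, because both become exactly 0 after the exponential.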
row_2_masked = [2.0, 1.0, 4.0, -inf]
Now do stable softmax. Step one. Find the max.
max = 4.0
Step two. Subtract the max from each visible score.
shifted = [-2.0, -3.0, 0.0, -inf]
Step three. Exponentiate.
exp = [0.1353, 0.0498, 1.0000, 0.0000]
Step four. Sum the exponentials.
sum = 0.1353 + 0.0498 + 1.0000 + 0.0000
    = 1.1851
Step five. Normalize.
weights_2 = [0.1353, 0.0498, 1.0000, 0.0000] / 1.1851
          = [0.1142, 0.0420, 0.8438, 0.0000]
Good. The future column received zero weight. Not by luck. By the mask. Now compute the context vector.
context_2
= 0.1142 * [1.0, 0.0]
+ 0.0420 * [0.0, 1.0]
+ 0.8438 * [2.0, 2.0]
+ 0.0000 * [9.0, 9.0]
Multiply each term.
term_0 = [0.1142, 0.0000]
term_1 = [0.0000, 0.0420]
term_2 = [1.6876, 1.6876]
term_3 = [0.0000, 0.0000]
Add them.
context_2 = [0.1142, 0.0000]
          + [0.0000, 0.0420]
          + [1.6876, 1.6876]
          + [0.0000, 0.0000]
          = [1.8018, 1.7296]
See. The giant future value [9.0, 9.0] never enters. That is the whole point of the exam rule. And this is why a finite answer sheet remains honest. A quick shape picture helps memory.
Q [T,d_k] ──┐           ┌── K^T [d_k,T]
            │           │
            └──────┬────┘
             scores [T,T]
             + mask [T,T]
        masked scores [T,T]
                softmax
             weights [T,T]
              @ V [T,d_v]
             out [T,d_v]
The projection lens created the inputs. The parallel graders repeat this logic head by head.

The memory shortcut later saves past K and V. But the masking rule itself does not change.
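The hand computation for query position 2 can be replayed in NumPy. The sketch below uses the same scores and V as the walkthrough and fills blocked entries with -inf, matching the hand-written row. The variable names are chosen here for illustration.

```python
import numpy as np

scores = np.array([
    [2.0, 1.0, 0.0, -1.0],
    [1.0, 3.0, 2.0,  0.0],
    [2.0, 1.0, 4.0,  3.0],
    [0.0, 1.0, 2.0,  3.0],
])
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [9.0, 9.0],
])

T = scores.shape[0]
mask = np.tril(np.ones((T, T), dtype=bool))

# Step 0: hide the future before softmax.
masked = np.where(mask, scores, -np.inf)

# Steps 1-2: find each row max and subtract it.
shifted = masked - masked.max(axis=-1, keepdims=True)

# Steps 3-5: exponentiate, sum, normalize.
exp = np.exp(shifted)
weights = exp / exp.sum(axis=-1, keepdims=True)

# Weighted sum over V gives one context vector per query.
context = weights @ V

print(np.round(weights[2], 4))   # approximately 0.1142, 0.0420, 0.8438, 0.0
print(np.round(context[2], 4))   # approximately 1.8018, 1.7296
```

Changing the last row of V to any other finite numbers leaves context[2] untouched, which is the exam rule at work.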

4) Common masking bugs that waste whole afternoons

Now what usually goes wrong? Many things. The mask is short. The bug list is long.

  1. Using np.triu instead of np.tril.
  2. Applying the mask after softmax.
  3. Forgetting k.transpose(0,1,3,2) in the batched case.
  4. Passing a mask whose axes do not line up with the [B,H,T,T] scores, such as a [B,T,T] mask with no head axis. A bare [T,T] mask still broadcasts correctly in NumPy; [1,1,T,T] just makes the intent explicit.
  5. Filling blocked entries with 0.0 instead of a huge negative number.
  6. Accidentally creating an all-masked row and getting NaN (with a -inf fill) or silently uniform weights (with a -1e9 fill).
  7. Mixing d_model and d_head in the scaling term.
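Two of these bugs produce output with perfectly good shapes, so it helps to see the numbers. The sketch below reuses the score matrix from the walkthrough and compares the correct recipe with masking after softmax (without renormalizing) and with a 0.0 fill. The variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

scores = np.array([
    [2.0, 1.0, 0.0, -1.0],
    [1.0, 3.0, 2.0,  0.0],
    [2.0, 1.0, 4.0,  3.0],
    [0.0, 1.0, 2.0,  3.0],
])
T = scores.shape[0]
mask = np.tril(np.ones((T, T), dtype=bool))

# Correct: mask first, then softmax.
correct = softmax(np.where(mask, scores, -1e9))

# Masking after softmax: the future is zeroed, but the probability mass
# it absorbed is gone, so rows no longer sum to 1.
late = softmax(scores) * mask
print(correct.sum(axis=-1))   # every row sums to 1
print(late.sum(axis=-1))      # rows with hidden columns sum to less than 1

# Filling with 0.0: a score of 0 is a legal score, so exp(0) = 1 leaks
# real weight onto future columns.
zero_fill = softmax(np.where(mask, scores, 0.0))
print(correct[0])             # all weight on column 0
print(zero_fill[0])           # columns 1..3 receive nonzero weight
```

In both buggy cases the result is still a [T, T] array of finite numbers, which is why shape checks alone do not catch them.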

Let us slow down on two of them.

Bug one. Wrong triangle orientation. If we use triu, we block the past and reveal the future. Generation then becomes absurd.

Bug two. All-masked rows. Standard causal self-attention avoids this because the diagonal stays visible. Every token may at least see itself. So if we see NaN, we check our mask construction first.

Look. A stable decoder needs all five anchors working together.

The answer sheet sets the context size. The projection lens defines Q, K, and V.

The parallel graders split attention into heads. The memory shortcut speeds up later decoding.

But the exam rule protects correctness underneath everything.
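The two bugs singled out above are easy to reproduce on a toy case. The sketch below uses all-zero scores so that the weights show visibility and nothing else. The strictly lower-triangular mask (k=-1) is a deliberately broken construction, used only to force an all-masked row.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

T = 4
scores = np.zeros((T, T))   # equal scores: weights reveal who can see whom

# Bug one: triangle orientation.
good = np.tril(np.ones((T, T), dtype=bool))
bad = np.triu(np.ones((T, T), dtype=bool))
w_good = softmax(np.where(good, scores, -np.inf))
w_bad = softmax(np.where(bad, scores, -np.inf))
print(w_good[0])   # query 0 sees only itself
print(w_bad[0])    # query 0 spreads weight over every later position

# Bug two: an all-masked row. k=-1 drops the diagonal, so row 0 sees nothing.
strict = np.tril(np.ones((T, T), dtype=bool), k=-1)

# With a -inf fill, row 0 becomes (-inf) - (-inf) = NaN inside softmax.
with np.errstate(invalid="ignore"):
    w_nan = softmax(np.where(strict, scores, -np.inf))
print(np.isnan(w_nan[0]).all())   # True

# With a -1e9 fill there is no NaN. Row 0 quietly becomes uniform instead,
# which includes the future. This failure is harder to notice.
w_silent = softmax(np.where(strict, scores, -1e9))
print(w_silent[0])
```

So a NaN is the friendly version of this bug. With a finite fill value, the only symptom is a row that attends everywhere.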

Where this lives in the wild

  • PyTorch nn.MultiheadAttention reference implementation — masking and score shaping are the backbone of its causal mode.
  • Hugging Face transformers modeling_gpt2.py — decoder blocks apply causal masking before attention probabilities.
  • Karpathy's nanoGPT implementation — small codebase, same exact lower-triangular masking discipline.
  • xFormers memory-efficient attention at Meta — optimized kernels still preserve causal masking semantics.
  • JAX/Flax attention in Google's T5X — high-performance attention code still lives or dies on correct masking axes.

Pause and recall

  1. Why do we mask scores before softmax in code?
  2. What exact transpose do we need for batched K?
  3. In the numerical walkthrough, why did [9.0, 9.0] disappear completely?
  4. Why does the diagonal staying visible prevent all-masked rows?

Interview Q&A

Q1. What are the minimum ingredients of single-head causal attention code?
A1. Scores, lower-triangular mask, stable softmax, and weighted sum over V.
Common wrong answer to avoid: “Only Q @ K^T matters.”

Q2. Why do many masking bugs survive unit tests?
A2. Shapes can look correct even while legality and probability semantics are wrong.
Common wrong answer to avoid: “If dimensions match, attention must be correct.”

Q3. What broadcast shape is most common for decoder masks?
A3. [1, 1, T, T], broadcast over batched multi-head scores.
Common wrong answer to avoid: “We always need a separate mask for every head tensor.”

Q4. What is one quick sign that the triangle orientation may be wrong?
A4. Early positions behave as if they know future tokens.
Common wrong answer to avoid: “Only the last token output changes.”
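Q2 and Q4 suggest a test that checks behavior rather than shapes: change the future and confirm the past does not move. Below is one possible sketch of such a perturbation test. The helper names attention_with_mask and leaks_future are hypothetical, invented for this example. For simplicity it perturbs only the future rows of k and v, which is enough to expose a leak.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_with_mask(q, k, v, mask):
    # Hypothetical helper: same recipe as before, with the mask passed in.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v

def leaks_future(mask, t=2, seed=0):
    # Hypothetical test helper. Returns True if outputs at positions 0..t
    # change when only keys and values after position t are replaced.
    T = mask.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(T, 4))
    k = rng.normal(size=(T, 4))
    v = rng.normal(size=(T, 4))
    before = attention_with_mask(q, k, v, mask)

    k2, v2 = k.copy(), v.copy()
    k2[t + 1:] = rng.normal(size=k2[t + 1:].shape)
    v2[t + 1:] = rng.normal(size=v2[t + 1:].shape)
    after = attention_with_mask(q, k2, v2, mask)

    return not np.allclose(before[: t + 1], after[: t + 1])

T = 6
tril_mask = np.tril(np.ones((T, T), dtype=bool))
triu_mask = np.triu(np.ones((T, T), dtype=bool))

print(leaks_future(tril_mask))   # False: early outputs ignore the future
print(leaks_future(triu_mask))   # True: early outputs react to the future
```

Both masks produce outputs of identical shape, so only a test of this kind separates them.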


Apply now (5 min)

Quick exercise. Write causal_attention from memory in ten lines. Then hand-compute one masked softmax row for query 1 or 2.

Finally, sketch from memory the ASCII shape flow from Q to output. Also mark where the answer sheet length appears twice.


Bridge. Good. We can enforce legal visibility now. Next we ask why one token becomes three vectors before attention even starts. → 04-qkv-projections.md