01. Opening failure: what breaks without a causal mask

~8 min read. The bug that makes training look perfect and generation look like garbage.

Built on the ELI5 in 00-eli5.md. The exam rule — each position sees only itself and earlier positions — is the missing discipline behind this failure.

1) The code looks fine. The learning is fake.

Look. This is the classic trap. We write attention. The shapes pass.

The loss falls. We feel safe. But we forgot the exam rule.

Then the model quietly cheats. Simple, no? Here is the broken code.

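import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis first so np.exp cannot overflow.
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def bad_attention(q, k, v):
    # q, k: shape (T, d_k); v: shape (T, d_v). Single head, no batch axis.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # shape (T, T): row = query, column = key
    # No mask is applied here, so each row is normalized over all T keys.
    weights = softmax(scores, axis=-1)
    out = weights @ v
    return out, weights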

Now what is the problem? Nothing crashes.

That is why this bug is dangerous.

Inside the answer sheet, every token can see future keys.

So token t can read answers meant for t+1 and beyond.

The projection lens can be perfect here.

Still the training game is wrong.

Even smart parallel graders cannot rescue a broken rule.

Later, the memory shortcut will store only past facts.

So train time and test time stop matching. See.

The model did not learn prediction. It learned leakage.

That is a silent correctness bug. Not a syntax bug.

Not a shape bug. A meaning bug.
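One way to make the meaning bug visible is a perturbation check. This is a minimal sketch, not the author's test: it repeats bad_attention from above so it runs on its own, then changes only the last token's value vector. The sizes, the seed, and the names leak and leaks_into_past are assumptions for illustration. If earlier positions were honest, their outputs would not move.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def bad_attention(q, k, v):
    # Same unmasked attention as in the text: single head, no batch axis.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the demo
T, d = 4, 8                     # toy sizes, assumed for illustration
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

out_before, _ = bad_attention(q, k, v)

# Change only the LAST token's value vector. For positions 0..T-2 this
# token lies in the future, so their outputs should stay the same.
v_changed = v.copy()
v_changed[-1] += 1.0
out_after, _ = bad_attention(q, k, v_changed)

# Largest output change per position.
leak = np.abs(out_after - out_before).max(axis=-1)
leaks_into_past = bool(np.any(leak[:-1] > 1e-9))
print("per-position change:", leak)
print("future leaks into earlier positions:", leaks_into_past)
```

Every earlier position moves, because softmax gives the last key a nonzero weight in every row. The shapes never complained.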

2) One toy sequence shows the whole leak

Take a tiny sequence.

input   = [I, love, causal, attention]
target  = [love, causal, attention, <eos>]
index     0    1      2         3

What do we want?

At position 0, we use [I] to predict love.
At position 1, we use [I, love] to predict causal.
At position 2, we use [I, love, causal] to predict attention.
At position 3, we use [I, love, causal, attention] to predict <eos>.

This is the exam rule again. Row means query position.

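Column means key position. Y means visible. N means blocked. In the broken table, ✗ marks a row that sees the future and ✓ marks a row that happens to be legal.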

Broken attention without a mask:

            key ──→   0   1   2   3
┌───────────────┬─────────────────────┐
│ query 0       │  Y   Y   Y   Y  ✗   │
│ query 1       │  Y   Y   Y   Y  ✗   │
│ query 2       │  Y   Y   Y   Y  ✗   │
│ query 3       │  Y   Y   Y   Y  ✓   │
└───────────────┴─────────────────────┘

Correct decoder attention with the exam rule:

            key ──→   0   1   2   3
┌───────────────┬─────────────────────┐
│ query 0       │  Y   N   N   N      │
│ query 1       │  Y   Y   N   N      │
│ query 2       │  Y   Y   Y   N      │
│ query 3       │  Y   Y   Y   Y      │
└───────────────┴─────────────────────┘
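The legal table above is just a lower triangle. A minimal sketch, assuming NumPy and a hypothetical variable name visible, that rebuilds it for T = 4:

```python
import numpy as np

T = 4
# visible[i, j] is True when query i may read key j, that is, when j <= i.
visible = np.tril(np.ones((T, T), dtype=bool))

for i in range(T):
    row = " ".join("Y" if ok else "N" for ok in visible[i])
    print(f"query {i}: {row}")
# query 0: Y N N N
# query 1: Y Y N N
# query 2: Y Y Y N
# query 3: Y Y Y Y
```

Row i has exactly i + 1 visible keys: itself and everything earlier.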

So position 0 must not see love. But broken code allows it.

That is illegal help. Think like an exam hall.

The student at seat 0 looks at answer 1.

Then loss drops for the wrong reason. Now a small numerical slice.

Suppose query 0 gets raw scores:

scores_0 = [1.2, 3.1, 0.4, 0.2]

Stable softmax first subtracts the max.

shifted  = [-1.9, 0.0, -2.7, -2.9]
exp      = [0.1496, 1.0000, 0.0672, 0.0550]
sum      = 1.2718
weights  = [0.1176, 0.7863, 0.0528, 0.0433]

See what happened. Most weight went to column 1.

But column 1 is love. That is the target itself.

So the model copies the future.

The answer sheet became an answer key. Yes?
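The arithmetic above is easy to check by machine. The sketch below reproduces the unmasked weights for query 0, then shows what the same row looks like when the blocked scores are set to negative infinity before softmax. That masking style is one common approach and is an assumption here, not something this section has defined yet.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # stable softmax: subtract the max first
    e = np.exp(x)
    return e / e.sum()

scores_0 = np.array([1.2, 3.1, 0.4, 0.2])

# Unmasked: query 0 spreads weight over all four keys.
unmasked = softmax(scores_0)
print(np.round(unmasked, 4))  # [0.1176 0.7863 0.0528 0.0433]

# With the exam rule, query 0 may read only key 0.
# Blocked scores become -inf, so exp(-inf) = 0 and they get zero weight.
allowed = np.array([True, False, False, False])
masked = softmax(np.where(allowed, scores_0, -np.inf))
print(masked)  # [1. 0. 0. 0.]
```

With the rule in place, the 0.7863 that went to love has nowhere to go. Query 0 can lean only on itself.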

3) Training becomes easier than generation

This is the heart of the mismatch.

During training, the future token is present in the sequence.

Without masking, the model can read it.

During generation, the future token does not exist yet.

So what to do? We must make train time obey inference time.

That contract is causal masking. Look at the two worlds.

TRAINING WITHOUT MASK
query 0 -> can read token 1, 2, 3
problem -> task is secretly easier

GENERATION
query 0 -> future tokens do not exist
problem -> shortcut disappears

So the model solved a different problem.

It solved, “pick from visible future clues.”

We wanted, “predict unseen next token.” These are not the same.

The memory shortcut does not fix this mismatch. A KV cache only stores past keys and values.

It still obeys the exam rule.

The parallel graders also stay honest only if masking is honest. And each projection lens still works under the same contract.

Good architecture cannot compensate for illegal visibility.
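The train versus generation contract can be checked directly. This is a sketch under stated assumptions: attention is a hypothetical single-head helper with no batch axis, the causal branch uses a lower-triangular mask with negative infinity as one common choice, and the sizes and seed are arbitrary. It compares the output at position t computed from the full sequence (training view) with the output computed from the prefix alone (generation view).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v, causal):
    # Hypothetical helper: q, k, v are 2-D arrays of shape (T, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        n = scores.shape[0]
        visible = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(visible, scores, -np.inf)  # block future keys
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

t = 1  # at this generation step only tokens 0..t exist
results = {}
for causal in (False, True):
    # Training view: the whole sequence is present.
    train_out = attention(q, k, v, causal)[t]
    # Generation view: only the prefix 0..t is present.
    gen_out = attention(q[: t + 1], k[: t + 1], v[: t + 1], causal)[t]
    results[causal] = bool(np.allclose(train_out, gen_out))
    print(f"causal={causal}: training output matches generation -> {results[causal]}")
```

Only the masked version should report a match. Without the mask, training computes something that generation can never reproduce.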

This is why generation becomes nonsense. Not always random nonsense.

Sometimes locally fluent nonsense. That is worse.

Because it feels plausible for a while.

4) Why interviewers love this failure case

Interviewers like bugs with layers. This bug has many layers.

First, it checks matrix multiplication intuition. Can we explain QK^T clearly?

Second, it checks shape discipline.

Do we know which axis is query and key?

Third, it checks numerical stability. Do we apply stable softmax?

Fourth, it checks reasoning discipline. Can we explain train and inference mismatch?

Fifth, it checks debugging maturity.

Can we spot a silent bug when metrics look good?

So if someone asks,

“Why can loss look perfect but samples look terrible?”

This is one sharp answer.

The model may have violated the exam rule. See. A strong answer also mentions the whole picture.

The answer sheet is finite.

The projection lens creates Q, K, and V.

The parallel graders are the different heads.

The memory shortcut is for fast decoding later.

But before all that, masking must be right. Simple, no?
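Since metrics will not flag this bug, a cheap habit is to assert on the attention weights themselves. A minimal sketch, assuming 2-D weights of shape (T, T); assert_causal is a hypothetical helper name, and the two weight matrices are hand-made examples, not model output.

```python
import numpy as np

def assert_causal(weights, atol=1e-12):
    """Hypothetical debugging helper: fail if any query attends to a later key."""
    future = np.triu(weights, k=1)  # keeps entries with key index > query index
    worst = np.abs(future).max()
    if worst > atol:
        raise AssertionError(
            f"future leakage: max weight on a later key = {worst:.4f}"
        )

# Uniform weights, as an unmasked softmax over equal scores would give.
leaky = np.full((4, 4), 0.25)
# Causal version: row i spreads its weight evenly over keys 0..i.
legal = np.tril(np.ones((4, 4))) / np.arange(1, 5)[:, None]

assert_causal(legal)  # passes silently
try:
    assert_causal(leaky)
except AssertionError as err:
    print(err)  # future leakage: max weight on a later key = 0.2500
```

A check like this fails loudly on the first forward pass, long before the loss curve has a chance to look impressive.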

Where this lives in the wild

GPT-series decoders at OpenAI — decoder-only training depends on strict causal masking before next-token loss.
GitHub Copilot code completion — the editor suggestion model must not read future code tokens while predicting the next one.
Claude at Anthropic — autoregressive decoding relies on the same masked self-attention contract at every layer.
LLaMA at Meta — decoder pretraining uses lower-triangular attention so each token reads only the past.
Gemini at Google DeepMind — generation stacks still need train-time masking to match step-by-step inference.

Pause and recall

Why can unmasked decoder training show low loss but poor generation?
In the toy sequence, why is position 0 seeing love illegal?
Why is this bug called a silent correctness bug?
Why can good Q, K, V math still fail if masking is absent?

Interview Q&A

Q1. What exactly breaks without a causal mask?
A1. Future tokens become visible during training, so the model cheats instead of predicting.
Common wrong answer to avoid: “Only the gradients become unstable.”

Q2. Why is generation worse than training here?
A2. Training allowed future visibility, but generation has no future tokens available.
Common wrong answer to avoid: “Because inference uses a smaller batch size.”

Q3. What signal in logs can mislead us?
A3. Rapidly falling loss can look excellent even while the task definition is broken.
Common wrong answer to avoid: “If loss falls, the attention code must be correct.”

Q4. How would you explain this bug to a beginner?
A4. The student peeked at tomorrow’s answer during today’s exam.
Common wrong answer to avoid: “Attention is random, so mistakes are expected.”

Apply now (5 min)

Quick exercise. Write a bad_attention function on paper.

Circle the exact line where future leakage becomes possible.

Then sketch the legal visibility matrix for T = 5 from memory. Also sketch from memory how the exam rule changes training into honest next-token prediction.

Bridge. We now need the exact matrix that enforces this rule. One triangle does the job, every time. → 02-causal-mask.md