00. Causal Attention & Coding — The Five-Year-Old Version

Module 03 showed the factory. In this module you build the machine with your own hands.

Imagine a student writing a long exam. She writes one answer at a time. She can re-read her own previous answers. She cannot peek at the next answer. That is the whole spirit of causal attention.

Each new answer depends on the earlier answers. So the student keeps scanning backwards. Never forwards. That backward-only rule is the first big idea. We call it the exam rule.

Her exam paper is only so large. Maybe 128 answers. Maybe 4096. But the paper has edges. That paper is the answer sheet — the context window.
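The edge of the answer sheet can be sketched in a few lines. This is only an illustration. The helper name `crop_to_context` is hypothetical, and dropping the oldest tokens is one common policy that we assume here, not the only option.

```python
def crop_to_context(token_ids, context_length):
    """Keep only the most recent `context_length` tokens (hypothetical helper)."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    # Negative slicing keeps the tail; shorter inputs pass through unchanged.
    return token_ids[-context_length:]


history = list(range(10))           # ten made-up token ids: 0..9
window = crop_to_context(history, 4)
print(window)                       # [6, 7, 8, 9]
```

Anything that falls off the left edge is simply invisible to the model under this policy.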

Now suppose the student has already written three answers. She wants to write answer four. She does not re-read the entire paper from scratch. She keeps sticky notes recording what she already worked out for each earlier answer. Those sticky notes are the memory shortcut — the KV cache.

The student also switches glasses depending on the task. One pair asks, "What am I searching for?" One pair asks, "What clues do I contain?" One pair asks, "What information should I share?" Those are the projection lenses — queries, keys, and values.

Finally, imagine five graders sitting together. One grader checks subject continuity. One checks punctuation. One checks entity names. One checks local grammar. One checks long-range callbacks. Those are the parallel graders — multi-head attention.

So the model is doing something very human-looking. It reads its own past. It keeps useful notes. It uses different lenses. It lets different graders focus on different patterns. Then it writes one more token.

Now the important warning. If you accidentally let the student peek ahead, training becomes an open-book answer-copying contest. Loss looks fantastic. Generation becomes terrible. Because the model learned to cheat. Not to predict.

That is why causal attention is subtle. The math is short. The discipline is not. You must enforce the exam rule everywhere.
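One way to check the exam rule is a perturbation test: change a future token and see whether an earlier position's output moves. The sketch below is a toy, not the module's implementation. It assumes queries, keys, and values are all the raw input vectors (no learned projections), and the names `attention` and `first_row_changed` are made up for this example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, causal):
    """Toy self-attention. Assumption: q = k = v = x, no learned weights."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
        weights = softmax(scores)
        out.append([sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_changed = [row[:] for row in x]
x_changed[2] = [5.0, -3.0]  # alter only the last (future) token


def first_row_changed(causal):
    a = attention(x, causal)[0]
    b = attention(x_changed, causal)[0]
    return any(abs(p - q) > 1e-9 for p, q in zip(a, b))


print("masked, position 0 affected by future:", first_row_changed(True))     # False
print("unmasked, position 0 affected by future:", first_row_changed(False))  # True
```

If the masked case ever prints True, the student is peeking.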

Good. Leave the exam hall picture in your head. We will write the actual code.

One more thing. The peeking bug is silent. Nothing crashes, and the loss curve looks better than ever. That single bug is where this module starts.

A tiny worked example

Three tokens enter the model: [The, cat, sat].

The student (the model) wants to predict token four.

Token 1 (The) looks only at itself. The exam rule forbids peeking ahead.
Token 2 (cat) looks at The and cat. Two answers visible.
Token 3 (sat) looks at The, cat, and sat. Three answers visible.
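The three lines above are exactly a lower-triangular mask. Here is a minimal sketch in plain Python lists. The function name `causal_mask` is our own choice for this example.

```python
def causal_mask(n):
    """n x n mask: True where position i (row) may look at position j (column)."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["The", "cat", "sat"]
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    visible = [tokens[j] for j, allowed in enumerate(row) if allowed]
    print(f"{tokens[i]!r} sees {visible}")
# 'The' sees ['The']
# 'cat' sees ['The', 'cat']
# 'sat' sees ['The', 'cat', 'sat']
```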

Each token puts on three projection lenses. One lens asks, "What am I looking for?" One labels, "What do I contain?" One carries, "What should I pass forward?"
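In code, the three lenses are three weight matrices applied to the same token vector. The sketch below uses a made-up 2-dimensional embedding and hand-picked weights purely for illustration. Real weights are learned, and whether the matrix sits on the left or the right of the vector is a convention we are assuming here.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


x = [1.0, 2.0]  # hypothetical embedding of one token

# Hand-picked illustrative weights (assumption: real ones come from training).
W_q = [[1.0, 0.0], [0.0, 1.0]]  # query lens: "what am I looking for?"
W_k = [[0.0, 1.0], [1.0, 0.0]]  # key lens:   "what do I contain?"
W_v = [[2.0, 0.0], [0.0, 2.0]]  # value lens: "what should I pass forward?"

q = matvec(W_q, x)
k = matvec(W_k, x)
v = matvec(W_v, x)
print(q, k, v)  # [1.0, 2.0] [2.0, 1.0] [2.0, 4.0]
```

Same token in, three different vectors out. That is all a lens does.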

Five parallel graders process the same tokens simultaneously. One grader notices cat is a noun. Another notices sat follows a subject. Another tracks that The modifies cat. Different graders, different patterns.
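Mechanically, giving work to five graders means cutting each vector into five equal slices and gluing the results back together afterwards. A minimal sketch, assuming a 10-dimensional vector and hypothetical helper names `split_heads` and `merge_heads`:

```python
def split_heads(vec, n_heads):
    """Cut one vector into n_heads equal, contiguous slices."""
    d_model = len(vec)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    return [vec[h * d_head:(h + 1) * d_head] for h in range(n_heads)]


def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [value for head in heads for value in head]


q = [float(i) for i in range(10)]  # made-up query vector, d_model = 10
heads = split_heads(q, 5)          # five graders, two numbers each
print(heads)   # [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
print(merge_heads(heads) == q)     # True
```

Each grader would run attention on its own slice. Only the split and merge are shown here.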

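Now the model generates token four. It does not re-read all three tokens from scratch. It consults the memory shortcut — sticky notes from earlier steps. Only the newest token's query, key, and value are computed fresh. The keys and values of earlier tokens come straight from the sticky notes.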
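The sticky notes can be sketched as two growing lists. This toy assumes a single head and identity projections (each token's query, key, and value are just its embedding), and the class name `KVCache` is ours. It checks that the cached path gives the same answer as recomputing everything.

```python
import math


def attend(q, keys, values):
    """One query attending over the given keys and values (single head)."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


class KVCache:
    """Hypothetical minimal cache: stores past keys and values, nothing else."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up vectors for The, cat, sat

cache = KVCache()
cached_outputs = []
for x in embeddings:
    q, k, v = x, x, x        # assumption: identity projections
    cache.append(k, v)       # only the new token's key and value are added
    cached_outputs.append(attend(q, cache.keys, cache.values))

# Reference: the last position recomputed from scratch over all three tokens.
full = attend(embeddings[-1], embeddings, embeddings)
same = all(abs(a - b) < 1e-12 for a, b in zip(cached_outputs[-1], full))
print(same)  # True
```

Notice that no mask appears in the loop. The cache only ever holds the past and the present, so the exam rule holds by construction.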

The prediction emerges: on. The model writes it onto the answer sheet and moves to position five.

token:    The ──── cat ──── sat ──── [predict]
           │       │        │          │
sees:     [self]  [1,2]   [1,2,3]   [1,2,3] + cache
           │       │        │          │
lenses:   Q,K,V   Q,K,V   Q,K,V     Q,K,V
           │       │        │          │
graders:  ├─h1─┤  ├─h1─┤  ├─h1─┤    ├─h1─┤
          ├─h2─┤  ├─h2─┤  ├─h2─┤    ├─h2─┤
          └─...┘  └─...┘  └─...┘    └─...┘

That is the full picture. Exam rule, projection lenses, parallel graders, memory shortcut, answer sheet. Every later file calls back to these names.

The placeholders you will see called back

Placeholder	Meaning
The exam rule	The causal mask — each position sees only itself and earlier positions.
The answer sheet	The context window — the finite sequence the model processes.
The projection lens	Q/K/V projections — three views of the same token for different jobs.
The parallel graders	Multi-head attention — several heads attending in parallel.
The memory shortcut	KV cache — stored past keys and values for fast inference.

Top resources

Let's build GPT: from scratch, in code — Karpathy's live-coding walkthrough. The single best video for this module.
Let's reproduce GPT-2 (124M) — prefill, training, and generation code together.
Understanding and Coding Self-Attention — Sebastian Raschka's practical coding-first guide.
Transformers from Scratch — Peter Bloem's derivation of attention mechanics.
The Illustrated GPT-2 — Jay Alammar's visual guide to decoder generation and attention flow.

What's coming

01-opening-failure.md — the silent bug: future leakage during training.
02-causal-mask.md — the autoregressive rule and the lower-triangular mask.
03-masking-in-code.md — NumPy implementation and a full numerical walkthrough.
04-qkv-projections.md — one token, three jobs: query, key, value.
05-why-separate-projections.md — why three matrices beat one shared one.
06-multi-head-split-merge.md — splitting heads, merging heads, and reshape mechanics.
07-multi-head-coding.md — the full multi-head causal attention implementation.
08-output-projection.md — W_O: mixing information across heads.
09-transformer-block-shapes.md — from embeddings to logits, the complete shape walk.
10-kv-cache.md — why cache, what to cache, and the code.
11-kv-cache-memory.md — memory cost formula and the speed-memory tradeoff.
12-prefill-vs-decode.md — two inference phases and when masking still applies.
13-debugging-attention.md — the failure-fix chain for causal attention code.
14-honest-admission.md — what we glossed over and don't fully understand.

Bridge. The first thing that breaks is obvious in hindsight — attention without a mask lets the model cheat by reading the future. → 01-opening-failure.md