Autoregressive Generation

The chapters in this module, in reading order.

#	Chapter
00	Causal Attention & Coding — The Five-Year-Old Version
01	Opening failure — what breaks without a causal mask
02	Causal mask — the autoregressive contract
03	Masking in code — from formula to NumPy
04	Q/K/V projections — one token, three jobs
05	Why separate projections — routing and payload need freedom
06	Multi-head split and merge — one wide stream becomes many narrow views
07	Multi-head causal attention — one tensor story from X to Y
08	Output projection — concatenation stacks, W_O mixes
09	Transformer block shapes — keep the axes steady
10	KV cache — reuse old keys, skip old work
11	KV cache memory — speed has a storage bill
12	Prefill vs decode — same weights, different workload
13	Debugging causal attention — silent bugs, direct checks
14	Honest admission — the clean story is not the whole story