Autoregressive Generation

The chapters in this module, in reading order.

| #  | Chapter |
|----|---------|
| 00 | Causal Attention & Coding — The Five-Year-Old Version |
| 01 | Opening failure — what breaks without a causal mask |
| 02 | Causal mask — the autoregressive contract |
| 03 | Masking in code — from formula to NumPy |
| 04 | Q/K/V projections — one token, three jobs |
| 05 | Why separate projections — routing and payload need freedom |
| 06 | Multi-head split and merge — one wide stream becomes many narrow views |
| 07 | Multi-head causal attention — one tensor story from X to Y |
| 08 | Output projection — concatenation stacks, W_O mixes |
| 09 | Transformer block shapes — keep the axes steady |
| 10 | KV cache — reuse old keys, skip old work |
| 11 | KV cache memory — speed has a storage bill |
| 12 | Prefill vs decode — same weights, different workload |
| 13 | Debugging causal attention — silent bugs, direct checks |
| 14 | Honest admission — the clean story is not the whole story |