06. Module 04 Review: Causal Attention & Coding
Focus: causal masking, Q/K/V projections, multi-head attention, KV cache, and full transformer-block shape fluency.
Review loop
- Re-answer the self-check questions in 01_weekly_plan.md from memory.
- Re-read the failure-fix chain and shape walk in 02_explainer.md.
- Use 04_daily_recall.md aloud, not silently.
- Review 05_hands_on_lab.md and identify one bug you can now debug confidently.
Reflection
- Which part of causal attention now feels mechanical instead of magical?
- Where do you still hesitate: mask orientation, head reshaping, or cache mechanics?
- Can you explain prefill versus decode without hand-waving?
- What should feel automatic before Module 05 begins?
Foundation-gap audit before Module 05
You should be able to do each of these without notes:
- write scaled dot-product attention from scratch
- draw the causal mask for any T
- explain why masking happens before softmax
- derive Q, K, and V from X [B, T, D]
- split and merge heads while preserving shape correctness
- explain what W_O does after concatenation
- describe KV cache shapes and append logic
- walk through one transformer block from embeddings to logits
If two or more bullets still feel weak, revisit 02_explainer.md before moving on.
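As a reference point for the audit above, here is a minimal NumPy sketch of multi-head causal self-attention. It is an illustration under stated assumptions, not the lab's implementation: the function names (`softmax`, `causal_self_attention`, `split_heads`), the bias-free `[D, D]` weight matrices, and the toy sizes are all hypothetical choices made for this example. It shows the projections from X, the head split, the mask applied before softmax, the merge, and the final W_O projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: [B, T, D]; each weight matrix: [D, D] (assumed, no biases).
    B, T, D = X.shape
    d_head = D // num_heads

    def split_heads(M):
        # [B, T, D] -> [B, H, T, d_head]
        return M.reshape(B, T, num_heads, d_head).transpose(0, 2, 1, 3)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product scores: [B, H, T, T]; row = query, column = key.
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)

    # Lower-triangular mask: True where a query may look (itself and earlier).
    mask = np.tril(np.ones((T, T), dtype=bool))
    # Mask BEFORE softmax so blocked positions receive exactly zero weight
    # and each row still sums to 1. mask[None, None] broadcasts as [1, 1, T, T].
    scores = np.where(mask[None, None], scores, -np.inf)
    weights = softmax(scores, axis=-1)            # [B, H, T, T]

    heads = weights @ V                           # [B, H, T, d_head]
    # Merge heads: [B, H, T, d_head] -> [B, T, D], then mix them with W_o.
    merged = heads.transpose(0, 2, 1, 3).reshape(B, T, D)
    return merged @ W_o, weights

rng = np.random.default_rng(0)
B, T, D, H = 2, 5, 8, 2
X = rng.normal(size=(B, T, D))
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) for _ in range(4))
out, weights = causal_self_attention(X, W_q, W_k, W_v, W_o, H)

# Leakage check: perturbing the last token must leave earlier outputs unchanged.
X_perturbed = X.copy()
X_perturbed[:, -1, :] += 1.0
out_perturbed, _ = causal_self_attention(X_perturbed, W_q, W_k, W_v, W_o, H)
print(np.allclose(out[:, :-1], out_perturbed[:, :-1]))
```

The perturbation check at the end is one simple way to catch future leakage, which otherwise produces no error and no shape mismatch.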
Rapid-fire checkpoint
Conceptual
- Why is future leakage such a dangerous silent bug?
- What is the exact role difference between keys and values?
- Why does multi-head attention help more than one giant head?
- Why is sqrt(d_head) the right scaling term?
- What trade-off does KV cache introduce?
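To check your answer to the scaling question numerically, the sketch below samples random query and key vectors and compares the variance of their dot products before and after dividing by sqrt(d_head). It assumes an idealized setting in which every entry of q and k is independent with zero mean and unit variance; real activations only approximate this. The variable names and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
results = {}

for d_head in (16, 64, 128):
    # Assumption: entries are i.i.d. standard normal (zero mean, unit variance).
    q = rng.normal(size=(n_samples, d_head))
    k = rng.normal(size=(n_samples, d_head))

    raw = (q * k).sum(axis=-1)          # unscaled dot products
    scaled = raw / np.sqrt(d_head)      # scaled as in attention

    results[d_head] = (raw.var(), scaled.var())
    print(f"d_head={d_head:4d}  var(raw)={raw.var():8.2f}  var(scaled)={scaled.var():.3f}")
```

Under these assumptions the unscaled variance grows roughly in proportion to d_head, while the scaled variance stays near 1, which keeps the softmax inputs in a range where it does not saturate.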
Applied
- Write the mask broadcasting shape for batched multi-head attention.
- Given d_model = 768 and num_heads = 12, what is d_head?
- State the shapes of scores, weights, and merged output.
- Explain one reason cached decode and naive decode might disagree.
- Walk through the attention block on a whiteboard.
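For the cached-versus-naive question, a small harness like the one below can help when debugging. It is a sketch under assumptions: a single attention head for brevity, bias-free `[D, D]` projections, and hypothetical names (`attend`, `K_cache`, `V_cache`, prompt length `P`). It prefills the cache with the prompt in one pass, decodes the remaining tokens one at a time, and compares the result against full recomputation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # q: [B, Tq, d]; K, V: [B, Tk, d]. Single head, for brevity.
    scores = q @ K.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # [B, Tq, Tk]
    Tq, Tk = scores.shape[-2:]
    # Query i sits at absolute position (Tk - Tq + i), so shift the diagonal.
    mask = np.tril(np.ones((Tq, Tk), dtype=bool), k=Tk - Tq)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V                                  # [B, Tq, d]

rng = np.random.default_rng(0)
B, T, D = 2, 6, 8
P = 3                                   # prompt length handled by prefill
X = rng.normal(size=(B, T, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

# Naive reference: recompute attention over the whole sequence at once.
naive = attend(X @ W_q, X @ W_k, X @ W_v)                       # [B, T, D]

# Prefill: process the prompt in one pass and populate the cache.
K_cache = X[:, :P] @ W_k                                        # [B, P, D]
V_cache = X[:, :P] @ W_v
outputs = [attend(X[:, :P] @ W_q, K_cache, V_cache)]

# Decode: one new token per step; append its key and value, never recompute.
for t in range(P, T):
    x_t = X[:, t:t + 1]                                         # [B, 1, D]
    K_cache = np.concatenate([K_cache, x_t @ W_k], axis=1)      # [B, t + 1, D]
    V_cache = np.concatenate([V_cache, x_t @ W_v], axis=1)
    outputs.append(attend(x_t @ W_q, K_cache, V_cache))

cached = np.concatenate(outputs, axis=1)                        # [B, T, D]
print(np.allclose(cached, naive))
```

The diagonal offset `k=Tk - Tq` in the mask is one place where the two paths can diverge: with the offset left at 0, a single decoded query would see only the first cached key.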
Completion gate
- [ ] Weekly plan completed
- [ ] Narrative explainer fully read
- [ ] Assignment shipped
- [ ] Shape walk can be done from memory
- [ ] Ready to move to Module 05
Bridge forward
The next module, 05_llm_training_pipeline, covers how these transformer blocks are trained at scale: pretraining, SFT, RLHF, and the full pipeline from raw text to deployed model.