06. Module 04 Review: Causal Attention & Coding

Focus: causal masking, Q/K/V projections, multi-head attention, KV cache, and full transformer-block shape fluency.

Review loop

Re-answer the self-check questions in 01_weekly_plan.md from memory.
Re-read the failure-fix chain and shape walk in 02_explainer.md.
Work through 04_daily_recall.md aloud, not silently.
Review 05_hands_on_lab.md and identify one bug you can now debug confidently.

Reflection

Which part of causal attention now feels mechanical instead of magical?
Where do you still hesitate: mask orientation, head reshaping, or cache mechanics?
Can you explain prefill versus decode without hand-waving?
What should feel automatic before Module 05 begins?
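
For the prefill-versus-decode question above, a small numerical check can make the distinction concrete. The following is a minimal single-head sketch in NumPy, not a production implementation: the names `causal_attention`, `k_cache`, and `v_cache` are illustrative, and the inputs are fixed in advance (teacher-forced) so the cached path can be compared against a full recompute. Prefill processes the whole prompt in one pass and fills the cache; decode then handles one token per step, appending one key row and one value row each time.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """q: [Tq, d]; k, v: [Tk, d]. The queries are the LAST Tq positions of the Tk keys."""
    Tq, d = q.shape
    Tk = k.shape[0]
    scores = q @ k.T / np.sqrt(d)                 # [Tq, Tk]
    q_pos = np.arange(Tk - Tq, Tk)[:, None]       # absolute position of each query
    k_pos = np.arange(Tk)[None, :]                # absolute position of each key
    scores = np.where(k_pos <= q_pos, scores, -np.inf)  # mask before softmax
    return softmax(scores) @ v                    # [Tq, d]

rng = np.random.default_rng(0)
T_prompt, T_new, d = 5, 3, 8                      # illustrative sizes
x = rng.normal(size=(T_prompt + T_new, d))        # all inputs known up front for the check
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Reference: recompute attention over the full sequence in one pass.
full = causal_attention(x @ W_q, x @ W_k, x @ W_v)

# Prefill: one pass over the prompt, storing K and V for every prompt position.
k_cache = x[:T_prompt] @ W_k                      # [T_prompt, d]
v_cache = x[:T_prompt] @ W_v
prefill_out = causal_attention(x[:T_prompt] @ W_q, k_cache, v_cache)

# Decode: one token per step; append its K/V row first, then attend with a single query.
decode_outs = []
for t in range(T_prompt, T_prompt + T_new):
    x_t = x[t:t + 1]                              # [1, d]
    k_cache = np.concatenate([k_cache, x_t @ W_k], axis=0)
    v_cache = np.concatenate([v_cache, x_t @ W_v], axis=0)
    decode_outs.append(causal_attention(x_t @ W_q, k_cache, v_cache))

cached = np.concatenate([prefill_out] + decode_outs, axis=0)
assert np.allclose(cached, full)
```

If a check like this fails in your own code, plausible culprits include appending the new key and value after attending instead of before (so the token cannot see itself), or reusing a square `[T, T]` mask for a single-query decode step.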

Foundation-gap audit before Module 05

You should be able to do each of these without notes:

write scaled dot-product attention from scratch
draw the causal mask for any T
explain why masking happens before softmax
derive Q, K, and V from X [B, T, D]
split and merge heads while preserving shape correctness
explain what W_O does after concatenation
describe KV cache shapes and append logic
walk through a forward pass from embeddings, through one transformer block, to logits

If two or more bullets still feel weak, revisit 02_explainer.md before moving on.
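
As a reference point for the audit items above, here is one way the attention pieces fit together, written as a minimal NumPy sketch. The function names (`split_heads`, `merge_heads`, `causal_self_attention`) and the weight names are assumptions chosen for illustration, and bias terms, dropout, and layer norm are left out. Try writing your own version first and use this only to compare shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    """[B, T, D] -> [B, H, T, d_head]"""
    B, T, D = x.shape
    return x.reshape(B, T, num_heads, D // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """[B, H, T, d_head] -> [B, T, D] (concatenate heads along the feature axis)"""
    B, H, T, d_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, H * d_head)

def causal_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    B, T, D = X.shape
    d_head = D // num_heads
    # Derive Q, K, V from the same input X, then split into heads.
    q = split_heads(X @ W_q, num_heads)                         # [B, H, T, d_head]
    k = split_heads(X @ W_k, num_heads)
    v = split_heads(X @ W_v, num_heads)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)      # [B, H, T, T]
    # Causal mask: row = query position, column = key position, lower triangle allowed.
    mask = np.tril(np.ones((T, T), dtype=bool))
    # Mask BEFORE softmax so disallowed keys receive exactly zero weight
    # and each row still sums to 1 over the allowed keys.
    scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)                          # [B, H, T, T]
    merged = merge_heads(weights @ v)                           # [B, T, D]
    # W_o mixes information across the concatenated heads.
    return merged @ W_o, weights

rng = np.random.default_rng(0)
B, T, D, H = 2, 4, 16, 4                                        # illustrative sizes
X = rng.normal(size=(B, T, D))
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(4))
Y, weights = causal_self_attention(X, W_q, W_k, W_v, W_o, num_heads=H)
print(Y.shape, weights.shape)
```

Masking after softmax would leave the surviving weights summing to less than 1 unless they were renormalized, which is one way to answer the "why before softmax" item.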

Rapid-fire checkpoint

Conceptual

Why is future leakage such a dangerous silent bug?
What is the exact role difference between keys and values?
Why does multi-head attention help more than one giant head?
Why is sqrt(d_head) the right scaling term?
What trade-off does KV cache introduce?
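
Two of the questions above can be probed numerically. The sketch below is an illustration under simplifying assumptions (single head, identity projections, random Gaussian inputs; the helper name `attention` is hypothetical). The first part perturbs only the final token and checks whether earlier outputs move, which is a cheap test for future leakage. The second part estimates the variance of raw dot products to show why dividing by sqrt(d_head) keeps scores at roughly unit scale.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, causal):
    """Single-head self-attention with identity projections. x: [T, d]."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        allowed = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
x_perturbed = x.copy()
x_perturbed[-1] += 10.0          # change ONLY the last (future) token

# Leakage test: outputs at positions 0..T-2 must not depend on the last token.
causal_ok = np.allclose(attention(x, True)[:-1], attention(x_perturbed, True)[:-1])
unmasked_leaks = not np.allclose(attention(x, False)[:-1], attention(x_perturbed, False)[:-1])
print("causal outputs unchanged:", causal_ok)
print("unmasked outputs changed:", unmasked_leaks)

# Scaling check: for components with unit variance, the dot product of two
# d_head-dimensional vectors has variance of about d_head, so dividing by
# sqrt(d_head) brings the variance back to about 1.
d_head = 64
q = rng.normal(size=(10_000, d_head))
k = rng.normal(size=(10_000, d_head))
raw = (q * k).sum(axis=-1)
raw_var = raw.var()
scaled_var = (raw / np.sqrt(d_head)).var()
print("raw variance:", raw_var, "scaled variance:", scaled_var)
```

Leakage is silent because the unmasked model still runs and typically reports a lower training loss; a perturbation test like this one catches it without relying on the loss curve.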

Applied

Write the mask broadcasting shape for batched multi-head attention.
Given d_model = 768 and num_heads = 12, what is d_head?
State the shapes of scores, weights, and merged output.
Explain one reason cached decode and naive decode might disagree.
Walk through the attention block on a whiteboard.
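
To check your answers to the shape questions above, the following sketch runs the attention arithmetic for d_model = 768 and num_heads = 12 and prints every intermediate shape. The batch size and sequence length (`B = 2`, `T = 5`) and the random inputs are arbitrary choices for illustration; projections and W_O are omitted so the focus stays on shapes.

```python
import numpy as np

d_model, num_heads = 768, 12
assert d_model % num_heads == 0, "d_model must divide evenly across heads"
d_head = d_model // num_heads                     # 768 / 12 = 64

B, T = 2, 5                                       # illustrative batch size and length
rng = np.random.default_rng(0)
q = rng.normal(size=(B, num_heads, T, d_head))
k = rng.normal(size=(B, num_heads, T, d_head))
v = rng.normal(size=(B, num_heads, T, d_head))

# A [1, 1, T, T] mask broadcasts over the batch and head axes.
mask = np.tril(np.ones((T, T), dtype=bool))[None, None, :, :]

scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_head)           # [B, H, T, T]
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)         # [B, H, T, T]
per_head = weights @ v                                          # [B, H, T, d_head]
merged = per_head.transpose(0, 2, 1, 3).reshape(B, T, d_model)  # [B, T, d_model]

for name, arr in [("mask", mask), ("scores", scores), ("weights", weights),
                  ("per_head", per_head), ("merged", merged)]:
    print(f"{name:9s}{arr.shape}")
```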

Completion gate

[ ] Weekly plan completed
[ ] Narrative explainer fully read
[ ] Assignment shipped
[ ] Shape walk can be done from memory
[ ] Ready to move to Module 05

Bridge forward

The next module, 05_llm_training_pipeline, covers how these transformer blocks are trained at scale: pretraining, SFT, RLHF, and the full pipeline from raw text to deployed model.