06. Module 04 Review — Causal Attention & Coding

Focus: causal masking, Q/K/V projections, multi-head attention, KV cache, and full transformer-block shape fluency.

Review loop

  1. Re-answer the self-check questions in 01_weekly_plan.md from memory.
  2. Re-read the failure-fix chain and shape walk in 02_explainer.md.
  3. Work through 04_daily_recall.md aloud, not silently.
  4. Review 05_hands_on_lab.md and identify one bug you can now debug confidently.

Reflection

  • Which part of causal attention now feels mechanical instead of magical?
  • Where do you still hesitate: mask orientation, head reshaping, or cache mechanics?
  • Can you explain prefill versus decode without hand-waving?
  • What should feel automatic before Module 05 begins?

Foundation-gap audit before Module 05

You should be able to do each of these without notes:

  • write scaled dot-product attention from scratch
  • draw the causal mask for any T
  • explain why masking happens before softmax
  • derive Q, K, and V from X of shape [B, T, D]
  • split and merge heads while preserving shape correctness
  • explain what W_O does after concatenation
  • describe KV cache shapes and append logic
  • walk through one transformer block from embeddings to logits

If two or more bullets still feel weak, revisit 02_explainer.md before moving on.
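
As a reference point for the first four bullets, here is a minimal single-head sketch in NumPy. It is not the code from the lab: the helper names (`causal_mask`, `softmax`, `causal_attention`), the random weights, and the tiny sizes are all assumptions chosen for illustration. The mask is applied to the scores before softmax, so masked positions receive exactly zero weight and each row still sums to 1.

```python
import numpy as np

def causal_mask(T):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((T, T), dtype=bool))

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; exp(-inf) becomes 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(X, W_q, W_k, W_v):
    # X: [B, T, D]; each projection matrix: [D, d_head] (illustrative sizes).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # each [B, T, d_head]
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # [B, T, T]
    T = X.shape[1]
    # Mask BEFORE softmax: future positions get -inf, hence zero weight.
    scores = np.where(causal_mask(T), scores, -np.inf)
    weights = softmax(scores, axis=-1)             # [B, T, T]
    return weights @ V, weights                    # [B, T, d_head], [B, T, T]

rng = np.random.default_rng(0)
B, T, D, d_head = 2, 4, 8, 8
X = rng.normal(size=(B, T, D))
W_q, W_k, W_v = (rng.normal(size=(D, d_head)) for _ in range(3))
out, weights = causal_attention(X, W_q, W_k, W_v)

# Leakage check: perturbing the last token must not change earlier outputs.
X_perturbed = X.copy()
X_perturbed[:, -1] += 1.0
out_perturbed, _ = causal_attention(X_perturbed, W_q, W_k, W_v)
```

Comparing `out[:, :-1]` with `out_perturbed[:, :-1]` is a quick test for future leakage: if the mask is oriented correctly, the two should match.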

Rapid-fire checkpoint

Conceptual

  1. Why is future leakage such a dangerous silent bug?
  2. What is the exact role difference between keys and values?
  3. Why does multi-head attention help more than one giant head?
  4. Why is sqrt(d_head) the right scaling term?
  5. What trade-off does KV cache introduce?
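
For the KV cache question, a toy sketch can make the mechanics concrete. The `KVCache` class and `attend` helper below are hypothetical names, and the setup is deliberately stripped down: one head, no batch dimension, and no positional information. It prefills the cache with the first few tokens, decodes the rest one token at a time, and compares the result with a full masked recompute.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: [1, d_head]; K, V: [t, d_head]. No mask needed: the cache only
    # ever holds keys and values for positions up to the current one.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

class KVCache:
    """Hypothetical minimal cache: stores K and V rows, grows along time."""
    def __init__(self, d_head):
        self.K = np.zeros((0, d_head))
        self.V = np.zeros((0, d_head))

    def append(self, k, v):
        self.K = np.concatenate([self.K, k], axis=0)
        self.V = np.concatenate([self.V, v], axis=0)
        return self.K, self.V

rng = np.random.default_rng(0)
T, D, d_head = 6, 8, 4
X = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(D, d_head)) for _ in range(3))

# Naive reference: recompute everything with a causal mask.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_head)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
naive = softmax(scores) @ V                      # [T, d_head]

# Cached path: prefill on the first 3 tokens, then decode token by token.
cache = KVCache(d_head)
prefill_len = 3
cache.append(X[:prefill_len] @ W_k, X[:prefill_len] @ W_v)
cached_rows = []
for t in range(prefill_len, T):
    x_t = X[t:t + 1]                             # [1, D]
    K_all, V_all = cache.append(x_t @ W_k, x_t @ W_v)
    cached_rows.append(attend(x_t @ W_q, K_all, V_all))
cached = np.concatenate(cached_rows, axis=0)     # [T - prefill_len, d_head]
```

Each decode step projects only the new token and reuses the stored rows, while `cache.K` and `cache.V` grow by one row per token. That is the compute-for-memory exchange in miniature.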

Applied

  1. Write the mask broadcasting shape for batched multi-head attention.
  2. Given d_model = 768 and num_heads = 12, what is d_head?
  3. State the shapes of scores, weights, and merged output.
  4. Explain one reason cached decode and naive decode might disagree.
  5. Walk through the attention block on a whiteboard.
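
One way to check your answers to the shape questions is to run the numbers. The sketch below uses the d_model = 768 and num_heads = 12 from question 2; the batch size, sequence length, helper names (`split_heads`, `merge_heads`), and random weights are assumptions for illustration only.

```python
import numpy as np

B, T, d_model, num_heads = 2, 5, 768, 12
d_head = d_model // num_heads

def split_heads(x, num_heads):
    # [B, T, D] -> [B, H, T, d_head]
    B, T, D = x.shape
    return x.reshape(B, T, num_heads, D // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    # [B, H, T, d_head] -> [B, T, H * d_head]
    B, H, T, d = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, H * d)

rng = np.random.default_rng(0)
X = rng.normal(size=(B, T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

Q = split_heads(X @ W_q, num_heads)              # [B, H, T, d_head]
K = split_heads(X @ W_k, num_heads)
V = split_heads(X @ W_v, num_heads)

scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # [B, H, T, T]
# Mask of shape [1, 1, T, T] broadcasts over batch and head dimensions.
mask = np.tril(np.ones((T, T), dtype=bool))[None, None]
scores = np.where(mask, scores, -np.inf)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)   # [B, H, T, T]

merged = merge_heads(weights @ V)                # [B, T, d_model]
out = merged @ W_o                               # [B, T, d_model]

for name, arr in [("scores", scores), ("mask", mask), ("weights", weights),
                  ("merged", merged), ("out", out)]:
    print(name, arr.shape)
```

A useful sanity check is that `merge_heads(split_heads(X, num_heads))` reproduces `X`; if it does not, the transpose and reshape are in the wrong order.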

Completion gate

  • [ ] Weekly plan completed
  • [ ] Narrative explainer fully read
  • [ ] Assignment shipped
  • [ ] Shape walk can be done from memory
  • [ ] Ready to move to Module 05

Bridge forward

The next module, 05_llm_training_pipeline, covers how these transformer blocks are trained at scale: pretraining, SFT, RLHF, and the full pipeline from raw text to deployed model.