04. Week 4 — Daily Recall
Monday
- [W3] Write the scaled dot-product attention formula from memory.
- What is the causal mask for T=4? Draw the matrix.
- Why can unmasked training give low loss but poor generation?
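
Self-check for the mask question, to be read only after drawing the matrix by hand. This is a minimal sketch in plain Python. It assumes the common convention that 1 means "query position i may attend to key position j" and 0 means "blocked"; some codebases store the inverse, or use -inf added to the scores. The name causal_mask is a hypothetical helper, not taken from the course code.

```python
def causal_mask(T):
    """Return a T x T lower-triangular mask as nested lists.

    Assumed convention: mask[i][j] == 1 when query position i is allowed
    to attend to key position j, i.e. when j <= i.
    """
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i is the view from token i: it sees itself and everything before it, and nothing after. Without this mask, each position can read the token it is supposed to predict during training, which is one way to approach the third Monday question.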
Tuesday
- What job does each of Q, K, and V do in attention?
- If X has shape [B, T, D], what shapes do Q, K, and V have before head splitting?
- [W2] Why do we divide by sqrt(d_k) in attention?
Wednesday
- If d_model = 512 and num_heads = 8, what is d_head?
- Write the shape flow for multi-head attention: input → split → scores → merge.
- Why is the output projection W_O useful after concatenating heads?
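
Self-check for the shape-flow question, to be read after writing your own answer. The sketch below only does shape arithmetic, with no tensors involved. It assumes the widely used layout in which heads are split into a separate axis as [B, num_heads, T, d_head]; the function name mha_shape_flow and the stage labels are hypothetical, chosen for illustration.

```python
def mha_shape_flow(B, T, d_model, num_heads):
    """Return the tensor shape at each stage of multi-head attention.

    Assumes d_model splits evenly across heads and that heads are moved
    to axis 1 before the score computation.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return {
        "input": (B, T, d_model),
        "q_k_v_projected": (B, T, d_model),         # each of Q, K, V
        "split_heads": (B, num_heads, T, d_head),   # each of Q, K, V
        "scores": (B, num_heads, T, T),             # Q @ K^T per head
        "weighted_values": (B, num_heads, T, d_head),
        "merged": (B, T, d_model),                  # heads concatenated
        "after_W_O": (B, T, d_model),               # output projection
    }


flow = mha_shape_flow(B=2, T=16, d_model=512, num_heads=8)
for stage, shape in flow.items():
    print(f"{stage:>16}: {shape}")
print("d_head =", flow["split_heads"][-1])  # d_head = 64
```

Note that merging only places the head outputs side by side; each slice of the last axis still comes from a single head, which is a useful hint for the W_O question.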
Thursday
- What exactly does the KV cache store?
- Why does decoding with a KV cache get faster while memory usage rises?
- What is the difference between prefill and decode in GPT inference?
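
Self-check for the cache questions, to be read after answering them. This is a rough counting sketch, not a profiler: it counts stored elements and attention-score entries rather than bytes or FLOPs. It assumes one K tensor and one V tensor are kept per layer, each covering every head and every token seen so far. Both function names are hypothetical.

```python
def kv_cache_elements(num_layers, batch, num_heads, seq_len, d_head):
    """Number of elements held in the cache after seq_len tokens.

    The factor 2 covers the two cached tensors per layer (K and V).
    """
    return 2 * num_layers * batch * num_heads * seq_len * d_head


def score_entries_per_step(seq_len, use_cache):
    """Attention-score entries computed per head for one decode step.

    With a cache, only the newest query is scored against seq_len keys.
    Without one, the full seq_len x seq_len score matrix is recomputed.
    """
    return seq_len if use_cache else seq_len * seq_len


# Illustrative sizes only; they are not taken from any specific model.
for t in (1, 10, 100):
    print(
        t,
        kv_cache_elements(num_layers=2, batch=1, num_heads=4, seq_len=t, d_head=8),
        score_entries_per_step(t, use_cache=True),
        score_entries_per_step(t, use_cache=False),
    )
```

The cache grows linearly with sequence length while the per-step score work stays linear instead of quadratic, which is the trade the second Thursday question asks about. Prefill corresponds to filling the cache for the whole prompt in one pass; decode then appends one token's K and V per step.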
Friday
- Walk through one GPT block from embeddings to logits using shapes only.
- Name three silent bugs that can happen in causal attention code.
- Bridge question: what training topics sit immediately after this module in Week 5?