04. Week 4 — Daily Recall

Monday

  1. [W3] Write the scaled dot-product attention formula from memory.
  2. What is the causal mask for T=4? Draw the matrix.
  3. Why can unmasked training give low loss but poor generation?
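After attempting question 2 on paper, one way to check the drawing is to generate the matrix. This is a minimal sketch in plain Python; the helper name `causal_mask` and the convention that 1 means "may attend" are assumptions made for illustration, and some codebases use the opposite convention or store negative infinity in the blocked cells instead.

```python
def causal_mask(T):
    """Return a T x T matrix where entry [i][j] is 1 if position i may attend to j."""
    # Row i is the query position, column j is the key position.
    # A position may see itself and everything before it, so j <= i.
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The zeros above the diagonal are the future positions. If they are left visible during training, the model can read the token it is supposed to predict, which is the situation question 3 asks about.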

Tuesday

  1. What job does each of Q, K, and V do?
  2. If X has shape [B, T, D], what shapes do Q, K, and V have before head splitting?
  3. [W2] Why do we divide by sqrt(d_k) in attention?
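For question 3, a small experiment can back up the written answer. The sketch below assumes query and key entries are independent with zero mean and unit variance, and the helper name `dot_product_variance` is hypothetical. Under that assumption the variance of a raw dot product grows roughly in proportion to d_k, and dividing the scores by sqrt(d_k) brings it back to about 1.

```python
import random


def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate the variance of q . k for random unit-variance vectors of size d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials


for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    # Dividing each score by sqrt(d_k) divides its variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k}: raw variance ~ {raw:.2f}, after scaling ~ {scaled:.2f}")
```

Large-variance scores push softmax toward a near one-hot output, where gradients become very small; keeping the variance near 1 avoids that regime.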

Wednesday

  1. If d_model = 512 and num_heads = 8, what is d_head?
  2. Write the shape flow for multi-head attention: input → split → scores → merge.
  3. Why is the output projection W_O useful after concatenating heads?
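To check the answers to questions 1 and 2, the shape bookkeeping can be written out as a tiny function. This is a sketch only: the name `mha_shape_flow` is hypothetical, and the `[B, num_heads, T, d_head]` layout after the split is one common convention, not the only one.

```python
def mha_shape_flow(B, T, d_model, num_heads):
    """List the tensor shape at each stage of multi-head self-attention."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        ("input", (B, T, d_model)),
        ("q/k/v projections", (B, T, d_model)),
        ("split into heads", (B, num_heads, T, d_head)),
        ("scores", (B, num_heads, T, T)),
        ("weighted values", (B, num_heads, T, d_head)),
        ("merge heads", (B, T, d_model)),
        ("output projection W_O", (B, T, d_model)),
    ]


for stage, shape in mha_shape_flow(B=2, T=16, d_model=512, num_heads=8):
    print(f"{stage:24s} {shape}")
```

With d_model = 512 and num_heads = 8, the split stage shows d_head = 64. The merge step only places the head outputs side by side, so W_O is the first point where information from different heads is mixed.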

Thursday

  1. What exactly does the KV cache store?
  2. Why does decoding get faster with a KV cache while memory usage rises?
  3. What is the difference between prefill and decode in GPT inference?
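The memory side of question 2 can be checked with back-of-envelope arithmetic. The sketch below assumes standard multi-head attention in which every layer keeps one key vector and one value vector per head for each cached token; the function name, the parameter names, and the example model dimensions are all hypothetical.

```python
def kv_cache_bytes(n_layers, batch, seq_len, n_heads, d_head, bytes_per_elem=2):
    """Approximate KV cache size in bytes (bytes_per_elem=2 assumes fp16/bf16)."""
    # The leading 2 counts the two cached tensors per layer: K and V.
    return 2 * n_layers * batch * seq_len * n_heads * d_head * bytes_per_elem


small = kv_cache_bytes(n_layers=12, batch=1, seq_len=1024, n_heads=12, d_head=64)
print(small, "bytes =", small / 2**20, "MiB")
# 37748736 bytes = 36.0 MiB
```

The size grows linearly with seq_len, so memory rises with every generated token. In exchange, each decode step computes K and V only for the newest token instead of recomputing them for the whole prefix.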

Friday

  1. Walk through a GPT forward pass, from embeddings through one block to logits, using shapes only.
  2. Name three silent bugs that can happen in causal attention code.
  3. Bridge question: what training topics sit immediately after this module in Week 5?
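For question 2, one check that tends to catch mask-related silent bugs is a leakage test: change only the later tokens and confirm that earlier outputs do not move. The sketch below is a plain-Python, single-head toy and not a reference implementation; `attention` and `leaks_future` are hypothetical names, and using the same vectors for Q, K, and V is a simplification.

```python
import math


def attention(q, k, v, causal=True):
    """Single-head attention over lists of vectors; optionally causal."""
    T, d_k = len(q), len(q[0])
    out = []
    for i in range(T):
        # With causal=True, position i only sees keys 0..i.
        last = i + 1 if causal else T
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
                  for j in range(last)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j][c] for j, e in enumerate(exps))
                    for c in range(len(v[0]))])
    return out


def leaks_future(attn_fn, x, x_changed, position):
    """True if edits after `position` change any output at or before it."""
    a = attn_fn(x, x, x)
    b = attn_fn(x_changed, x_changed, x_changed)
    return any(abs(p - r) > 1e-9
               for i in range(position + 1)
               for p, r in zip(a[i], b[i]))


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
x_changed = x[:3] + [[2.0, 3.0]]  # only the last token differs

print(leaks_future(attention, x, x_changed, position=2))
print(leaks_future(lambda q, k, v: attention(q, k, v, causal=False),
                   x, x_changed, position=2))
```

The first call should report no leak and the second should report one. A loss curve alone would not reveal the difference, which is why bugs of this kind are easy to miss.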