04. Daily Recall — Tokenization & Attention

Spaced practice. Answer from memory. If you get stuck, jump to the explainer section cited in parentheses.

Monday — Chapters 1-2

  1. Why does naive word tokenization fail on "ChatGPT-4o" and "₹0.15"? (02_explainer.md §1.1)
  2. Why is character-level tokenization expensive for attention-based models? (02_explainer.md §2.2)
  3. Why is [UNK] a weak patch for word-level tokenization? (02_explainer.md §2.3)
  4. What problem is subword tokenization trying to balance? (02_explainer.md §2.4)
  5. Replay the first four BPE merges from the toy token corpus. (02_explainer.md §2.5)
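
Once question 5 has been attempted by hand, a short script can confirm the merge order. What follows is a minimal sketch of the usual BPE training loop (count adjacent pairs, merge the most frequent pair, repeat). The corpus `toy_corpus` and the helper names are hypothetical and are not taken from the explainer, so substitute the words and counts from §2.5 before comparing. Breaking ties in favor of the pair seen first is also an assumption; the explainer may break ties differently.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_counts, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    vocab = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the pair seen first wins
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges, vocab

# Hypothetical corpus (word -> count); replace with the one from §2.5.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_merges(toy_corpus, 4)
for step, pair in enumerate(merges, start=1):
    print(step, pair)
```

If the printed merges disagree with the hand replay, check the pair counts first, then the tie-breaking rule.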

Tuesday — Chapter 3

  1. Why are token IDs only addresses? (02_explainer.md §3.1)
  2. Explain embedding lookup as matrix indexing. (02_explainer.md §3.2)
  3. Why do "dog bites man" and "man bites dog" expose the weakness of bag-of-words? (02_explainer.md §3.3)
  4. What is the simple formula for combining token and position information? (02_explainer.md §3.4)
  5. Give the geometric intuition for sinusoidal encoding without using the formula first. (02_explainer.md §3.5)
  6. What practical discomfort leads people from absolute positions toward RoPE or ALiBi? (02_explainer.md §3.6-§3.7)
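
For a self-check on questions 2 to 5, the sketch below treats embedding lookup as row indexing and adds a sinusoidal position vector to each token vector. It assumes the formula asked for in §3.4 is the common additive one (token vector plus position vector); the vocabulary, the table `E`, and the function names are hypothetical placeholders rather than the explainer's values.

```python
import math

def sinusoidal(pos, d_model):
    """Sinusoidal position vector: sin on even dims, cos on odd dims."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

# Hypothetical toy vocabulary and embedding table (one row per token ID).
vocab = {"dog": 0, "bites": 1, "man": 2}
E = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

def embed(tokens, with_position):
    """Look up each token's row; optionally add its position vector."""
    rows = []
    for pos, tok in enumerate(tokens):
        vec = E[vocab[tok]]  # the token ID is just a row address
        if with_position:
            vec = [e + p for e, p in zip(vec, sinusoidal(pos, len(vec)))]
        rows.append(vec)
    return rows

def bag(rows):
    """Bag-of-words summary: sum the vectors, discarding order."""
    return [sum(col) for col in zip(*rows)]

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, the two sentences collapse to the same bag.
print(bag(embed(a, False)) == bag(embed(b, False)))

# "dog" at position 0 versus position 2: identical without positions,
# distinguishable once the position vector is added.
print(embed(a, False)[0] == embed(b, False)[2])
print(embed(a, True)[0] == embed(b, True)[2])
```

Position information only helps when the per-token vectors are kept separate. Summing them back into a single bag would discard the order again.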

Wednesday — Chapter 4

  1. What is the RNN bottleneck in one clear sentence? (02_explainer.md §4.1)
  2. Explain attention as a soft lookup using the office analogy. (02_explainer.md §4.2)
  3. What does "each word queries every other word" mean concretely? (02_explainer.md §4.3)
  4. Write the scaled dot-product attention formula from memory. (02_explainer.md §4.4)
  5. Why do we divide by √d_k? (02_explainer.md §4.5)
  6. What does the causal mask prevent? Draw the allowed attention triangle. (02_explainer.md §4.6)
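
After answering questions 4 to 6 from memory, a hand computation can be checked against a small script. The sketch below implements the standard scaled dot-product attention, softmax(QKᵀ / √d_k) V, in plain Python with an optional causal mask. The `Q`, `K`, and `V` values are placeholders, not the explainer's tiny example, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Raw scores: dot product of this query with every key, then scale.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Token i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Placeholder inputs: 3 tokens, d_k = 2 (not the explainer's numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full_out, full_w = attention(Q, K, V)
causal_out, causal_w = attention(Q, K, V, causal=True)
for row in causal_w:
    print([round(x, 3) for x in row])
```

With `causal=True`, the printed weight matrix is lower-triangular, which is the allowed attention triangle from question 6.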

Thursday — Chapter 5

  1. Why can one attention head be insufficient? (02_explainer.md §5.1)
  2. What different jobs might two heads learn on the same sentence? (02_explainer.md §5.2)
  3. Walk the full pipeline from raw text to contextual vectors. (02_explainer.md §5.3)
  4. Why does prompt length hurt production latency so quickly? (02_explainer.md §5.4)
  5. What is the role of the output projection after concatenating heads? (02_explainer.md §5.2)

Friday — Cumulative

  1. Draw the five placeholders and label what each one means. (02_explainer.md ELI5)
  2. Sketch the Chapter 6 failure-fix chain from memory. (02_explainer.md §6.1)
  3. Recompute the tiny attention example with raw scores, scaling, and weighted sum. (02_explainer.md §4.4)
  4. Draw how the word "tokenizers" is split by the toy BPE merges. (02_explainer.md §2.6)

Weekend — Pre-hands_on_lab

  1. If your chunking pipeline and deployment tokenizer differ, what breaks first? (02_explainer.md §2.7, §6.4)
  2. Which chapter would you revisit before implementing BPE from scratch, and why? (02_explainer.md §2.5-§2.6)
  3. Which part of attention still feels least intuitive to you: Q/K/V, softmax weighting, or multi-head motivation? Re-read that section before coding. (02_explainer.md §4.2-§5.2)