04. Daily Recall — Tokenization & Attention
Spaced practice. Answer from memory. If stuck, jump to the explainer chapter referenced in parens.
Monday — Chapters 1-2
- Why does naive word tokenization fail on `ChatGPT-4o` and `₹0.15`? (02_explainer.md §1.1)
- Why is character-level tokenization expensive for attention-based models? (02_explainer.md §2.2)
- Why is `[UNK]` a weak patch for word-level tokenization? (02_explainer.md §2.3)
- What problem is subword tokenization trying to balance? (02_explainer.md §2.4)
- Replay the first four BPE merges from the toy `token` corpus. (02_explainer.md §2.5)
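
To check a BPE replay after answering from memory, a minimal sketch such as the following may help. The corpus, its frequencies, and the tie-breaking rule are all assumptions made for illustration; they are not taken from the explainer, so the merges printed here can differ from the ones in §2.5.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(word_freqs, num_merges):
    """Return the first `num_merges` BPE merges for a word-frequency dict."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Assumed tie-break: highest count first, then the alphabetically
        # smallest pair. Real implementations may break ties differently.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_words = {}
        for symbols, freq in words.items():
            key = merge_symbols(symbols, best)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return merges


def segment(word, merges):
    """Split a word into characters, then apply the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"token": 5, "tokens": 3, "tokenizers": 2}
merges = learn_merges(toy_corpus, 4)
print(merges)   # [('e', 'n'), ('k', 'en'), ('o', 'ken'), ('t', 'oken')]
print(segment("tokenizers", merges))   # ['token', 'i', 'z', 'e', 'r', 's']
```

With this corpus the first four pairs are tied on count, so the order of the merges depends entirely on the assumed tie-break; the final segmentation is what to compare against your own drawing.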
Tuesday — Chapter 3
- Why are token IDs only addresses? (02_explainer.md §3.1)
- Explain embedding lookup as matrix indexing. (02_explainer.md §3.2)
- Why do `dog bites man` and `man bites dog` expose the weakness of bag-of-words? (02_explainer.md §3.3)
- What is the simple formula for combining token and position information? (02_explainer.md §3.4)
- Give the geometric intuition for sinusoidal encoding without using the formula first. (02_explainer.md §3.5)
- What practical discomfort leads people from absolute positions toward RoPE or ALiBi? (02_explainer.md §3.6-§3.7)
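
Once you have answered the Tuesday questions, the sketch below can serve as a self-check. It treats embedding lookup as row indexing, adds a sinusoidal position vector to each token vector, and compares the two word orders. The vocabulary, the hand-picked embedding values, and the function names are hypothetical; the encoding follows the usual sin/cos scheme with base 10000.

```python
import math

# Hypothetical vocabulary and embedding table (values chosen by hand).
vocab = {"dog": 0, "bites": 1, "man": 2}
embedding_table = [
    [1.0, 0.0, 0.5, 0.0],  # row 0: "dog"
    [0.0, 1.0, 0.0, 0.5],  # row 1: "bites"
    [0.5, 0.5, 1.0, 0.0],  # row 2: "man"
]


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(tokens, with_position=True):
    """Look up each token's row, optionally adding its position vector."""
    rows = []
    for pos, tok in enumerate(tokens):
        vec = embedding_table[vocab[tok]]  # the token ID is only a row address
        if with_position:
            pe = positional_encoding(pos, len(vec))
            vec = [t + p for t, p in zip(vec, pe)]
        rows.append(vec)
    return rows


def bag_of_words(tokens):
    """Sum the token vectors, which discards word order."""
    rows = embed(tokens, with_position=False)
    return [sum(col) for col in zip(*rows)]


a = "dog bites man".split()
b = "man bites dog".split()
print(bag_of_words(a) == bag_of_words(b))  # True: order is lost
print(embed(a)[0] == embed(b)[2])          # False: "dog" at position 0 vs 2
```

The second comparison is the point: the same token receives a different vector at a different position, which is what the bag-of-words sum cannot express.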
Wednesday — Chapter 4
- What is the RNN bottleneck in one clear sentence? (02_explainer.md §4.1)
- Explain attention as a soft lookup using the office analogy. (02_explainer.md §4.2)
- What does "each word queries every other word" mean concretely? (02_explainer.md §4.3)
- Write the scaled dot-product attention formula from memory. (02_explainer.md §4.4)
- Why do we divide by `√d_k`? (02_explainer.md §4.5)
- What does the causal mask prevent? Draw the allowed attention triangle. (02_explainer.md §4.6)
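
For checking the Wednesday answers, here is a minimal sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) V, with an optional causal mask. The Q, K, and V values are made up for illustration and are not the tiny example from the explainer. Dividing by √d_k keeps the dot products from growing with the key dimension, which would otherwise push softmax toward near one-hot weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V one query row at a time."""
    d_k = len(K[0])
    weights, outputs = [], []
    for i, q in enumerate(Q):
        # Raw scores: dot product of this query with every key, then scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, V))
                        for d in range(len(V[0]))])
    return weights, outputs


# Hypothetical three-token example with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

weights, outputs = attention(Q, K, V, causal=True)
for row in weights:
    print([round(w, 3) for w in row])  # zeros above the diagonal
```

The printed weight rows form the lower triangle asked for in the causal-mask question: every entry to the right of the diagonal is zero, so the first token can only attend to itself.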
Thursday — Chapter 5
- Why can one attention head be insufficient? (02_explainer.md §5.1)
- What different jobs might two heads learn on the same sentence? (02_explainer.md §5.2)
- Walk the full pipeline from raw text to contextual vectors. (02_explainer.md §5.3)
- Why does prompt length hurt production latency so quickly? (02_explainer.md §5.4)
- What is the role of the output projection after concatenating heads? (02_explainer.md §5.2)
Friday — Cumulative
- Draw the five placeholders and label what each one means. (02_explainer.md ELI5)
- Sketch the Chapter 6 failure-fix chain from memory. (02_explainer.md §6.1)
- Recompute the tiny attention example with raw scores, scaling, and weighted sum. (02_explainer.md §4.4)
- Draw how `tokenizers` is split by the toy BPE merges. (02_explainer.md §2.6)
Weekend — Pre-hands_on_lab
- If your chunking pipeline and deployment tokenizer differ, what breaks first? (02_explainer.md §2.7, §6.4)
- Which chapter would you revisit before implementing BPE from scratch, and why? (02_explainer.md §2.5-§2.6)
- Which part of attention still feels least intuitive to you: Q/K/V, softmax weighting, or multi-head motivation? Re-read that section before coding. (02_explainer.md §4.2-§5.2)