04. Daily Recall — Tokenization & Attention

Spaced practice. Answer from memory. If you get stuck, jump to the explainer section cited in parentheses.

Monday — Chapters 1-2

Why does naive word tokenization fail on "ChatGPT-4o" and "₹0.15"? (02_explainer.md §1.1)
Why is character-level tokenization expensive for attention-based models? (02_explainer.md §2.2)
Why is [UNK] a weak patch for word-level tokenization? (02_explainer.md §2.3)
What problem is subword tokenization trying to balance? (02_explainer.md §2.4)
Replay the first four BPE merges from the toy token corpus. (02_explainer.md §2.5)
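
Self-check for the BPE replay question above, to run only after answering from memory. This is a minimal sketch of byte-pair merging, not the explainer's own code: the corpus below is hypothetical (it is not the toy corpus from §2.5), and the tie-break rule (highest count, then the lexicographically smallest pair) is an assumption, since implementations differ on ties.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the ordered merge list and the final segmented corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumed tie-break: highest count first, then smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


def segment(word, merges):
    """Split a new word by replaying the learned merges in order."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))


# Hypothetical word-frequency corpus, for illustration only.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(corpus, 4)
print(merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", merges))  # ['low', 'est']
```

To check your own replay, swap in the explainer's toy corpus and compare the printed merge order with the one you wrote down.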

Why are token IDs only addresses? (02_explainer.md §3.1)
Explain embedding lookup as matrix indexing. (02_explainer.md §3.2)
Why do "dog bites man" and "man bites dog" expose the weakness of bag-of-words? (02_explainer.md §3.3)
What is the simple formula for combining token and position information? (02_explainer.md §3.4)
Give the geometric intuition for sinusoidal encoding without using the formula first. (02_explainer.md §3.5)
What practical discomfort leads people from absolute positions toward RoPE or ALiBi? (02_explainer.md §3.6-§3.7)
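
Self-check for the embedding and position questions above. The sketch assumes the "simple formula" is plain addition of a token vector and a position vector, and it uses the sinusoidal scheme with base 10000 from the original Transformer paper. The vocabulary, the embedding values, and the function names are hypothetical.

```python
import math


def sinusoidal_position(pos, d_model):
    """Even dims use sin, odd dims use cos; each pair shares one frequency."""
    vec = []
    for i in range(d_model):
        freq = 1.0 / (10000 ** ((i - i % 2) / d_model))
        angle = pos * freq
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


# Hypothetical vocabulary and embedding table (one row per token ID).
vocab = {"dog": 0, "bites": 1, "man": 2}
embedding = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]


def encode(tokens, use_positions=True):
    """Look up each token's row, then optionally add its position vector."""
    rows = []
    for pos, tok in enumerate(tokens):
        e = embedding[vocab[tok]]  # the token ID is just a row index
        if use_positions:
            p = sinusoidal_position(pos, len(e))
        else:
            p = [0.0] * len(e)
        rows.append([a + b for a, b in zip(e, p)])
    return rows


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions the two sentences contain the same set of vectors.
print(sorted(encode(a, False)) == sorted(encode(b, False)))  # True
# With positions, "dog" at index 0 differs from "dog" at index 2.
print(encode(a)[0] != encode(b)[2])  # True
```

Note that summing all rows would still give the same total for both sentences; the position signal only helps because later layers see each row separately.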

What is the RNN bottleneck in one clear sentence? (02_explainer.md §4.1)
Explain attention as a soft lookup using the office analogy. (02_explainer.md §4.2)
What does "each word queries every other word" mean concretely? (02_explainer.md §4.3)
Write the scaled dot-product attention formula from memory. (02_explainer.md §4.4)
Why do we divide by √d_k? (02_explainer.md §4.5)
What does the causal mask prevent? Draw the allowed attention triangle. (02_explainer.md §4.6)
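
Self-check for the attention questions above. This is a plain-Python sketch of scaled dot-product attention, softmax(Q Kᵀ / √d_k) V, with an optional causal mask. The Q, K, and V values are hypothetical and are not the explainer's tiny example from §4.4.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Raw dot products, scaled by sqrt(d_k) to keep softmax from saturating.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Hypothetical 3-token example with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, w = attention(Q, K, V, causal=True)
for row in w:
    print([round(x, 3) for x in row])  # zeros above the diagonal
```

The printed weight matrix is lower triangular: that is the allowed attention triangle, and the first token can only attend to itself.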

Why can one attention head be insufficient? (02_explainer.md §5.1)
What different jobs might two heads learn on the same sentence? (02_explainer.md §5.2)
Walk the full pipeline from raw text to contextual vectors. (02_explainer.md §5.3)
Why does prompt length hurt production latency so quickly? (02_explainer.md §5.4)
What is the role of the output projection after concatenating heads? (02_explainer.md §5.2)
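
Self-check for the multi-head questions above. The sketch runs two heads over the same input, concatenates their outputs, and applies an output projection that mixes the heads back into one vector per token. All weight matrices are hypothetical. The `score_count` helper assumes the latency question is about the number of pairwise attention scores, which grows with the square of the prompt length.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(Q, K, V):
    """Unmasked scaled dot-product attention for a single head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    per_head = [
        attend(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate head outputs per token, then mix them with W_o.
    concat = [[x for h in per_head for x in h[t]] for t in range(len(X))]
    return matmul(concat, W_o)


def score_count(n_tokens, n_heads):
    """Pairwise attention scores computed per layer (assumed cost model)."""
    return n_heads * n_tokens ** 2


# Hypothetical weights: head A reads dims 0-1, head B reads dims 2-3.
W_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
W_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(W_a, W_a, W_a), (W_b, W_b, W_b)]
# Hypothetical output projection that sums and differences the two heads.
W_o = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]

X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head(X, heads, W_o)
print(len(out), len(out[0]))                             # 3 4
print(score_count(2000, 8) // score_count(1000, 8))      # 4
```

Without `W_o`, each head's result would sit in its own slice of the concatenated vector; the projection is what lets later layers use information from both heads in every dimension.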

Draw the five placeholders and label what each one means. (02_explainer.md ELI5)
Sketch the Chapter 6 failure-fix chain from memory. (02_explainer.md §6.1)
Recompute the tiny attention example with raw scores, scaling, and weighted sum. (02_explainer.md §4.4)
Draw how the word "tokenizers" is split by the toy BPE merges. (02_explainer.md §2.6)

If your chunking pipeline and deployment tokenizer differ, what breaks first? (02_explainer.md §2.7, §6.4)
Which chapter would you revisit before implementing BPE from scratch, and why? (02_explainer.md §2.5-§2.6)
Which part of attention still feels least intuitive to you: Q/K/V, softmax weighting, or multi-head motivation? Re-read that section before coding. (02_explainer.md §4.2-§5.2)