04. Week 1: Daily Recall
Spaced practice. Answer from memory. If you get stuck on a question, jump to the explainer section referenced in parentheses.
Monday (after explainer chapters 1-2)
- Why does one perceptron fail at XOR? Walk the geometric reason. (§1.1)
- Universal approximation — what does it guarantee, what doesn't it tell you? (§2.2)
- Why do two stacked linear layers collapse to one? (§2.1)
- Write the forward pass for a 2-layer MLP with ReLU. (§2.2)
- The hand-picked XOR-solving network uses `y = h₁ − 2·h₂`. What does each `h` represent in plain language? (§2.2)
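Self-check for the last two questions, once you have answered from memory. This is a minimal sketch of a 2-layer ReLU forward pass that uses the output combination y = h₁ − 2·h₂ from the question above. The hidden-layer weights and biases are an assumption (one choice that reproduces XOR); the explainer's network may pick different ones.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: these weights and biases are an assumed choice,
    # not taken from the explainer.
    h1 = relu(x1 + x2)        # positive when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # positive only when both inputs are on
    # Output layer: the combination quoted in the question.
    return h1 - 2.0 * h2

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```

Under these assumed weights the four input pairs reproduce the XOR truth table: 0 for (0, 0) and (1, 1), 1 for (0, 1) and (1, 0).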
Tuesday (after explainer chapter 3)
- He init: what's the variance formula? Why does ReLU need `2/fan_in` and not `1/fan_in`? (§3.1)
- What goes wrong if init is too large? Too small? All zeros? (§3.1)
- Explain backprop in one sentence using the chain rule. (§3.2)
- Difference between an iteration, a step, and an epoch? (§3.3)
- Why does mini-batch noise help generalization? (§3.3)
- Why is the cross-entropy + softmax gradient just `p − y`? Why is that clean? (§3.4)
- What does softmax guarantee about its output? (§3.4)
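Self-check for the two softmax questions. This sketch compares the claimed gradient p − y against a finite-difference estimate and checks the softmax output. The logits, the target class, and the helper names are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # target is the index of the true class (the 1 in the one-hot y)
    return -math.log(softmax(logits)[target])

def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences of the loss with respect to each logit.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]  # arbitrary example values
target = 0
p = softmax(logits)
y = [1.0 if i == target else 0.0 for i in range(len(logits))]

analytic = [p_i - y_i for p_i, y_i in zip(p, y)]  # the claimed p - y
numeric = numeric_grad(logits, target)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(sum(p), max_gap)
```

The probabilities should sum to 1 (up to floating-point error) and the gap between the two gradients should be tiny.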
Wednesday (after explainer chapter 4)
- Sigmoid's max derivative is 0.25. Why does that cause vanishing gradient through 10 layers? (§4.1)
- ReLU's "dead neuron" problem — what causes it, how do you detect it in production? (§4.1, §6.4)
- Why does the canyon-shaped loss surface trip plain SGD? (§4.2)
- AdamW vs Adam — what does decoupled weight decay actually fix? (§4.2)
- What is learning-rate warmup, and why do transformers need it? (§6.3)
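Self-check for the sigmoid question. This rough sketch multiplies the best-case sigmoid derivative across 10 layers. It deliberately ignores the weight matrices, which also scale the gradient, so treat the number as an illustration of the activation's contribution only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)    # derivative is largest at z = 0
depth = 10
best_case = peak ** depth   # every layer sitting exactly at its peak
print(peak, best_case)
```

Even in this best case the activation factor is about 9.5e-07 after 10 layers, and any pre-activation away from zero shrinks it further.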
Thursday (after explainer chapter 5)
- Why is a power law a straight line on log-log paper? (§5.1)
- Chinchilla's key finding in one sentence. (§5.1)
- Llama 3 trained an 8B model on 15T tokens (~1875:1 ratio). Why ignore Chinchilla? (§5.1)
- Three signs you are overfitting. Three fixes. (§5.2)
- Dropout — what does it do mechanically and why does that help? (§5.2)
- What is double descent? Why is it a puzzle? (§5.2)
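Self-check for the power-law and Llama 3 questions. This sketch assumes a loss curve of the form L(N) = a · N^(−b) with made-up constants, then measures the slope between consecutive points in log-log space. It also recomputes the tokens-per-parameter ratio quoted above.

```python
import math

a, b = 5.0, 0.3  # made-up constants for L(N) = a * N**(-b)
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [a * n ** (-b) for n in sizes]

log_n = [math.log10(n) for n in sizes]
log_l = [math.log10(l) for l in losses]

# Slope between each pair of neighbouring points on log-log axes.
slopes = [
    (log_l[i + 1] - log_l[i]) / (log_n[i + 1] - log_n[i])
    for i in range(len(sizes) - 1)
]
print(slopes)

# Ratio from the Llama 3 question: 15T tokens over 8B parameters.
tokens_per_param = 15e12 / 8e9
print(tokens_per_param)
```

Every slope should come out at about −b, because log L = log a − b · log N is linear in log N. The ratio comes out at 1875.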
Friday (cumulative)
- From memory: draw the full pipeline (input → init → forward → loss → backward → update).
- Self-supervised vs supervised — give one example of each. Which one trains LLMs? (§9 of study_material)
- Three things we still do not fully understand about deep networks. (§5.3)
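After drawing the pipeline from memory, compare it against this toy sketch. It walks through input → init → forward → loss → backward → update for a single linear neuron trained with per-sample SGD. The target function, learning rate, and epoch count are assumptions chosen so that the loop converges quickly; nothing here comes from the explainer.

```python
import random

random.seed(0)

# input: noiseless samples of y = 3x + 1 (a made-up toy target)
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

# init: small random weight, zero bias
w = random.uniform(-0.1, 0.1)
b = 0.0
lr = 0.1

def mean_loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_loss(w, b)

for epoch in range(50):          # one epoch = one pass over the data
    random.shuffle(data)
    for x, y in data:            # one step per sample (batch size 1)
        pred = w * x + b         # forward
        err = pred - y
        loss = err ** 2          # loss (squared error)
        grad_w = 2.0 * err * x   # backward: chain rule through pred
        grad_b = 2.0 * err
        w -= lr * grad_w         # update
        b -= lr * grad_b

final_loss = mean_loss(w, b)
print(w, b, initial_loss, final_loss)
```

With these settings `w` and `b` should end up close to 3 and 1, and the final loss should be far below the initial one.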
Weekend (pre-hands_on_lab)
- Shapes of W, x, b in a layer with 784 inputs and 128 outputs?
- Why does MNIST use softmax + cross-entropy and not sigmoid + MSE? (§3.4)
- List all 11 failures and 11 fixes from explainer §6.1 — sketch the table from memory.
- For each row in the table, name the corresponding ELI5 placeholder (rule pile, bend, nudge, etc.) where it applies.
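Self-check for the shapes question. This sketch assumes the column-vector convention y = W·x + b, so W has one row per output unit. Under the row-vector convention (x·W) the shape of W is transposed. The fill values are arbitrary placeholders.

```python
n_in, n_out = 784, 128  # sizes from the question above

# Assumed convention: y = W x + b with x as a column vector.
W = [[0.01] * n_in for _ in range(n_out)]  # shape (128, 784)
x = [1.0] * n_in                           # shape (784,)
b = [0.0] * n_out                          # shape (128,)

# One dot product per output unit, plus that unit's bias.
y = [
    sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
    for row, b_i in zip(W, b)
]

shape_W = (len(W), len(W[0]))
print(shape_W, len(x), len(b), len(y))
```

The output `y` has one entry per output unit, which is why `b` must match the number of outputs and not the number of inputs.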