04. Week 1 — Daily Recall

Spaced practice. Answer from memory. If you get stuck on a question, jump to the explainer section referenced in parentheses.

Monday (after explainer chapters 1-2)

Why does one perceptron fail at XOR? Walk the geometric reason. (§1.1)
Universal approximation — what does it guarantee, what doesn't it tell you? (§2.2)
Why do two stacked linear layers collapse to one? (§2.1)
Write the forward pass for a 2-layer MLP with ReLU. (§2.2)
The hand-picked XOR-solving network uses y = h₁ − 2·h₂. What does each h represent in plain language? (§2.2)
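
Once you have answered from memory, the sketch below is one way to check the forward-pass and XOR questions. It assumes the classic hand-picked weights h₁ = ReLU(x₁ + x₂) and h₂ = ReLU(x₁ + x₂ − 1), which may differ from the explainer's exact choice. The names `W1`, `b1`, `w2`, `b2` and `mlp_forward` are illustrative, not taken from the explainer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Assumed hand-picked weights (an illustration, not the explainer's values):
# h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])   # output layer: y = h1 - 2*h2
b2 = 0.0

def mlp_forward(x):
    """Forward pass of a 2-layer MLP: linear -> ReLU -> linear."""
    h = relu(W1 @ x + b1)
    return w2 @ h + b2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [float(mlp_forward(np.array(x, dtype=float))) for x in inputs]

# Without the ReLU, two stacked linear maps are a single linear map:
# B(Ax) == (BA)x for any x.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(1, 3))
x = rng.normal(size=2)
collapsed = np.allclose(B @ (A @ x), (B @ A) @ x)
```

Under these assumed weights, `outputs` matches the XOR truth table for the four inputs, and `collapsed` is true because matrix multiplication is associative.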

He init: what's the variance formula? Why does ReLU need 2/fan_in and not 1/fan_in? (§3.1)
What goes wrong if init is too large? Too small? All zeros? (§3.1)
Explain backprop in one sentence using the chain rule. (§3.2)
Difference between an iteration, a step, and an epoch? (§3.3)
Why does mini-batch noise help generalization? (§3.3)
Why is the cross-entropy + softmax gradient just p − y? Why is that clean? (§3.4)
What does softmax guarantee about its output? (§3.4)
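
For the softmax and cross-entropy questions above, a finite-difference comparison is one way to confirm your derivation after you have attempted it. This is a minimal sketch; the logits `z` and the one-hot target `y` are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    """Cross-entropy of softmax(z) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

p = softmax(z)
analytic = p - y                 # claimed gradient of the loss w.r.t. the logits

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)
```

The two gradients should agree to within finite-difference error, and `p` should be strictly positive and sum to 1.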

Sigmoid's max derivative is 0.25. Why does that cause vanishing gradients through 10 layers? (§4.1)
ReLU's "dead neuron" problem — what causes it, how do you detect it in production? (§4.1, §6.4)
Why does the canyon-shaped loss surface trip plain SGD? (§4.2)
AdamW vs Adam — what does decoupled weight decay actually fix? (§4.2)
What is learning-rate warmup, and why do transformers need it? (§6.3)
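
The arithmetic behind the sigmoid question above can be checked with a few lines. This sketch considers only the activation derivatives and assumes the best case, where every layer sits at the derivative's peak. Real networks also multiply by weight terms, which this ignores.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)            # the derivative peaks at z = 0
best_case_10_layers = max_grad ** 10    # product of ten best-case factors
```

Here `max_grad` is 0.25 and `best_case_10_layers` is 0.25¹⁰ = 2⁻²⁰, roughly 9.5e-7, so even in the best case the activation factors alone shrink the gradient by about six orders of magnitude.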

Why is a power law a straight line on log-log paper? (§5.1)
Chinchilla's key finding in one sentence. (§5.1)
Llama 3 trained an 8B model on 15T tokens (~1875:1 tokens per parameter). Why go so far past the Chinchilla-optimal ratio? (§5.1)
Three signs you are overfitting. Three fixes. (§5.2)
Dropout — what does it do mechanically and why does that help? (§5.2)
What is double descent? Why is it a puzzle? (§5.2)
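
For the power-law question above, a quick numerical check: if y = a·xᵇ, then log y = log a + b·log x, which is a line in log-log coordinates. The constants `a` and `b` below are illustrative, not fitted scaling-law values. The last line reproduces the tokens-per-parameter ratio from the Llama 3 question.

```python
import numpy as np

a, b = 5.0, -0.3                       # illustrative constants only
x = np.logspace(6, 12, num=20)
y = a * x ** b                         # power law

# Fit a straight line to (log10 x, log10 y)
slope, intercept = np.polyfit(np.log10(x), np.log10(y), deg=1)

tokens_per_param = 15e12 / 8e9         # 15T tokens over 8B parameters
```

The fitted `slope` recovers the exponent `b`, the `intercept` recovers log₁₀(a), and `tokens_per_param` comes out to 1875.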

From memory: draw the full pipeline (input → init → forward → loss → backward → update).
Self-supervised vs supervised — give one example of each. Which one trains LLMs? (§9 of study_material)
Three things we still do not fully understand about deep networks. (§5.3)
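
After drawing the pipeline from memory, you can compare it against a minimal sketch in code. To stay short, this assumes a single linear layer with softmax and cross-entropy rather than a deep network, and uses random toy data. All names, sizes, and the learning rate are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# input: toy batch (32 examples, 4 features, 3 classes; arbitrary sizes)
X = rng.normal(size=(32, 4))
labels = rng.integers(0, 3, size=32)
Y = np.eye(3)[labels]                          # one-hot targets

# init: small random weights scaled by 1/fan_in, zero biases
W = rng.normal(scale=np.sqrt(1.0 / 4), size=(4, 3))
b = np.zeros(3)

def forward(X, W, b):
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # stability shift
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_fn(P, Y):
    return -np.mean(np.sum(Y * np.log(P), axis=1))

lr = 0.1
losses = []
for step in range(50):
    P = forward(X, W, b)                       # forward
    losses.append(loss_fn(P, Y))               # loss
    dlogits = (P - Y) / len(X)                 # backward: softmax + CE gives p - y
    dW = X.T @ dlogits
    db = dlogits.sum(axis=0)
    W -= lr * dW                               # update
    b -= lr * db
```

With random labels the loss cannot reach zero, but it should still fall from its starting value, which is enough to see each stage of the pipeline doing its job.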

Shapes of W, x, b in a layer with 784 inputs and 128 outputs?
Why does MNIST use softmax + cross-entropy and not sigmoid + MSE? (§3.4)
List all 11 failures and 11 fixes from explainer §6.1 — sketch the table from memory.
For each row in the table, name the corresponding ELI5 placeholder (rule pile, bend, nudge, etc.) where it applies.
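
For the shapes question above, a shape check is a quick way to confirm your answer. The sketch shows two common conventions, since the expected shape of W depends on which one the explainer uses. The batch size of 32 is an arbitrary choice.

```python
import numpy as np

fan_in, fan_out = 784, 128

# Column-vector convention: y = W x + b
W = np.zeros((fan_out, fan_in))    # (128, 784)
x = np.zeros(fan_in)               # (784,)
b = np.zeros(fan_out)              # (128,)
y = W @ x + b                      # (128,)

# Row-batch convention: Y = X W^T + b, one example per row
X = np.zeros((32, fan_in))         # (32, 784)
Y = X @ W.T + b                    # (32, 128)
```

If the explainer stores the weight matrix as (784, 128) instead, the same layer is written `X @ W + b` with no transpose.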