01. Week 1 — Overview of LLMs, Neural Network Foundations
Key concepts to master
- Forward pass: input → linear transform → activation → output
- Weight initialization: Xavier (sigmoid/tanh), He (ReLU); why init scale matters
- Backprop: chain rule applied backward through layers
- Mini-batch SGD: batches, epochs, why noise helps generalization
- Loss functions: cross-entropy + softmax for classification (clean gradient p - y); MSE for regression
- Activation functions: ReLU/GELU; dead-neuron problem; vanishing gradient diagnosis
- Optimizers: SGD with momentum → Adam → AdamW (decoupled weight decay)
- Regularization: dropout, weight decay, early stopping; double descent
- Self-supervised vs supervised vs RL learning paradigms
- Scaling laws (power laws, which plot as straight lines on log-log axes) + Chinchilla ratio (~20 training tokens per parameter)
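The "clean gradient p - y" for softmax plus cross-entropy can be checked numerically. The sketch below is a minimal, standard-library-only illustration, not a training implementation; the helper names (`softmax`, `cross_entropy`, `analytic_grad`, `numeric_grad`) and the example logits are assumptions introduced here.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the true class index.
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    # Gradient of the loss w.r.t. the logits: p - y, with y one-hot.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, one logit at a time.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]  # hypothetical example values
target = 0
g_a = analytic_grad(logits, target)
g_n = numeric_grad(logits, target)
max_diff = max(abs(a - n) for a, n in zip(g_a, g_n))
print("analytic:", g_a)
print("numeric: ", g_n)
print("max abs difference:", max_diff)
```

The two gradients should agree up to finite-difference error, and the entries of p - y sum to zero because both p and y sum to one.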
🧠 Mental models
- Forward pass: "An assembly line that keeps transforming raw input into a decision."
- Weight initialization: "Set the starting loudness so signals neither explode nor whisper away."
- Backpropagation: "A blame-assignment chain that sends error responsibility backward."
- Activation functions: "Gates that bend straight lines into expressive shapes."
- Optimizers: "SGD follows the slope; momentum adds a flywheel; Adam adds per-parameter speed limits; AdamW applies weight decay separately from them."
- Scaling laws: "Bigger models and datasets follow diminishing-returns power curves, not magic step changes."
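The "starting loudness" picture for weight initialization can be made concrete by pushing a random vector through a stack of ReLU layers and watching its size. This is a rough sketch under stated assumptions: Gaussian weights, no biases, a hypothetical width of 128 and depth of 20, and a deliberately too-small standard deviation of 0.01 for contrast. The function name `relu_forward_rms` is introduced here for illustration.

```python
import math
import random

def relu_forward_rms(depth, width, weight_std, seed=0):
    """Return the RMS activation after `depth` random ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh Gaussian weight matrix for each layer (no bias term).
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        # Linear transform followed by ReLU.
        x = [max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
             for row in w]
    return math.sqrt(sum(v * v for v in x) / width)

width, depth = 128, 20              # hypothetical sizes
he_std = math.sqrt(2.0 / width)     # He init: variance 2 / fan_in
small_std = 0.01                    # assumed "too small" scale

rms_he = relu_forward_rms(depth, width, he_std)
rms_small = relu_forward_rms(depth, width, small_std)
print(f"He init RMS after {depth} layers:    {rms_he:.3e}")
print(f"Small init RMS after {depth} layers: {rms_small:.3e}")
```

With variance 2/fan_in the signal should stay at roughly its input scale, because ReLU zeroes about half of the pre-activations and the factor of 2 compensates. With the smaller scale the signal shrinks by a constant factor per layer and effectively vanishes.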
⚠️ Common traps
- Initializing all weights the same or too small, so neurons learn identical or vanishing features.
- Using MSE for classification and getting weaker gradients than cross-entropy with softmax.
- Picking a learning rate that is technically stable but too small to make real progress.
- Killing ReLUs with aggressive learning rates or poor initialization, then blaming model capacity.
- Treating Adam and AdamW as interchangeable even though weight decay is handled differently.
- Ignoring the batch-size/noise tradeoff when comparing optimization runs.
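The Adam versus AdamW trap above comes down to where the weight-decay term enters the update. The following is a scalar sketch, assuming the common formulation in which "Adam with L2" adds `wd * w` to the gradient and AdamW subtracts `lr * wd * w` directly from the weight. The function names and hyperparameter values are illustrative assumptions, not any library's API.

```python
import math

def _adam_moments(g, m, v, t, b1, b2):
    # Exponential moving averages of the gradient and squared gradient,
    # plus their bias-corrected versions.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return m, v, m / (1 - b1 ** t), v / (1 - b2 ** t)

def adam_l2_step(w, grad, m, v, t, lr=1e-3, wd=1e-2,
                 b1=0.9, b2=0.999, eps=1e-8):
    # L2 penalty folded into the gradient, so it gets rescaled by Adam.
    g = grad + wd * w
    m, v, m_hat, v_hat = _adam_moments(g, m, v, t, b1, b2)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, wd=1e-2,
               b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled decay: the moments only see the loss gradient.
    m, v, m_hat, v_hat = _adam_moments(grad, m, v, t, b1, b2)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# First step with zero loss gradient, so only weight decay acts.
shrink = {}
for w0 in (10.0, 0.1):
    w_l2, _, _ = adam_l2_step(w0, 0.0, 0.0, 0.0, t=1)
    w_dw, _, _ = adamw_step(w0, 0.0, 0.0, 0.0, t=1)
    shrink[w0] = (w0 - w_l2, w0 - w_dw)
    print(f"w0={w0}: Adam+L2 shrinks by {shrink[w0][0]:.2e}, "
          f"AdamW shrinks by {shrink[w0][1]:.2e}")
```

In this toy setting, Adam with L2 shrinks the large and the small weight by about the same amount (roughly `lr`), because the decay term is normalized away by the adaptive scaling. AdamW shrinks each weight in proportion to its size, which is the behavior weight decay is meant to have.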
🔗 Prerequisites & connections
- Builds on: Module 00 ideas about loss minimization, regularization, train/validation discipline, and linear models as weighted-sum predictors.
- Feeds into: embedding learning, attention blocks, transformer optimization, and scaling-law reasoning in later LLM modules.
💬 Interview phrasing
- "Why can't we just stack linear layers without activations?"
- "Walk me through backprop on a two-layer network — what is the chain rule doing?"
- "Why is He initialization paired with ReLU, and Xavier with tanh or sigmoid?"
- "AdamW vs Adam — what changed mathematically, and why do people care?"
- "Why do scaling-law papers talk about compute-optimal token/parameter ratios?"
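For the token/parameter question, a back-of-envelope calculation helps anchor the numbers. The sketch below uses the ~20 tokens per parameter ratio from the concepts list. The compute estimate relies on the widely used rule of thumb C ≈ 6 · N · D (FLOPs ≈ 6 × parameters × tokens), which is an assumption added here and not something this guide states. Function names and model sizes are hypothetical.

```python
def chinchilla_token_budget(n_params, tokens_per_param=20.0):
    # Rough compute-optimal token count under the ~20 tokens/parameter heuristic.
    return n_params * tokens_per_param

def approx_training_flops(n_params, n_tokens):
    # Assumption: C ~= 6 * N * D, a common approximation for dense transformers.
    return 6.0 * n_params * n_tokens

for n_params in (1e9, 7e9, 70e9):  # hypothetical model sizes
    tokens = chinchilla_token_budget(n_params)
    flops = approx_training_flops(n_params, tokens)
    print(f"{n_params:.0e} params -> {tokens:.1e} tokens, ~{flops:.2e} FLOPs")
```

Under these assumptions a 70B-parameter model pairs with about 1.4 trillion training tokens. Treat the outputs as order-of-magnitude estimates, since both the ratio and the FLOPs formula are approximations.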
⏱️ Difficulty markers
- 🟢 forward pass
- 🟢 loss functions
- 🟡 mini-batch SGD
- 🟡 activation behavior
- 🔴 backpropagation
- 🔴 weight initialization
- 🔴 scaling laws
Self-check questions
For the full Q&A bank with common-wrong-answer notes, see explainer §6.3.
- Why do we need non-linear activation functions? (explainer §2.2)
- Why does He init use 2/fan_in for ReLU? (§3.1)
- Cross-entropy vs MSE — when each, and why? (§3.4)
- Why mini-batch instead of full-batch or single-example? (§3.3)
- ReLU vs GELU — what's the difference? (§4.1, §4.2)
- AdamW vs Adam — what changed? (§4.2)
- Dropout — what does it do mechanically? (§5.2)
- Chinchilla scaling — what did it correct from Kaplan? (§5.1)
Health check
- [ ] Read all 6 chapters of explainer
- [ ] MNIST repo public with >95% accuracy
- [ ] LinkedIn post #1 published
- [ ] All daily-recall questions answered from memory
- [ ] Failure-fix table from §6.1 sketched without looking
- [ ] Applications: 5+ to top S-tier