01. Week 1 — Overview of LLMs, Neural Networks Foundations

Key concepts to master

Forward pass: input → linear transform → activation → output
Weight initialization: Xavier (sigmoid/tanh), He (ReLU); why init scale matters
Backprop: chain rule applied backward through layers
Mini-batch SGD: batches, epochs, why noise helps generalization
Loss functions: cross-entropy + softmax for classification (clean gradient p - y); MSE for regression
Activation functions: ReLU/GELU; dead-neuron problem; vanishing gradient diagnosis
Optimizers: SGD with momentum → Adam → AdamW (decoupled weight decay)
Regularization: dropout, weight decay, early stopping; double descent
Learning paradigms: self-supervised vs supervised vs reinforcement learning (RL)
Scaling laws (power laws, which plot as straight lines on log-log axes) + Chinchilla ratio (~20 tokens/parameter)
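
The "clean gradient p - y" noted for cross-entropy with softmax can be checked numerically. The following is a minimal sketch in plain Python, assuming a single example with hypothetical logits and a one-hot target; the function names are illustrative and not taken from the course material.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the true class index `target`.
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    # Gradient of the loss w.r.t. the logits: p - y, with y one-hot.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, used only as an independent check.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grad

logits, target = [2.0, -1.0, 0.5], 0  # hypothetical values
ga = analytic_grad(logits, target)
gn = numeric_grad(logits, target)
max_gap = max(abs(a - n) for a, n in zip(ga, gn))
print("analytic:", ga)
print("numeric: ", gn)
print("max gap: ", max_gap)
```

The two gradients should agree to within finite-difference error, and the entries of p - y sum to zero because both p and y sum to one.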

Forward pass: "An assembly line that keeps transforming raw input into a decision."
Weight initialization: "Set the starting loudness so signals neither explode nor whisper away."
Backpropagation: "A blame-assignment chain that sends error responsibility backward."
Activation functions: "Gates that bend straight lines into expressive shapes."
Optimizers: "SGD follows the slope; momentum adds a flywheel; Adam sets per-parameter speed limits, and AdamW applies weight decay separately from them."
Scaling laws: "Bigger models and datasets follow diminishing-returns power curves, not magic step changes."
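
The "diminishing-returns power curve" picture can be made concrete with a toy power law. This is a sketch only: the constants `a` and `alpha` are hypothetical, and published fits typically also include an irreducible-loss term that is omitted here.

```python
import math

def power_law_loss(n_params, a=10.0, alpha=0.1):
    # Toy scaling law L(N) = a * N^(-alpha); a and alpha are made-up constants.
    return a * n_params ** (-alpha)

sizes = [1e7, 1e8, 1e9, 1e10]
losses = [power_law_loss(n) for n in sizes]
points = list(zip(sizes, losses))

# Slope between consecutive points in log-log space: constant and equal to -alpha.
slopes = [
    (math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))
    for (n1, l1), (n2, l2) in zip(points, points[1:])
]

# Absolute improvement per 10x of parameters keeps shrinking.
gains = [l1 - l2 for l1, l2 in zip(losses, losses[1:])]

print("losses:", losses)
print("log-log slopes:", slopes)
print("gain per 10x:", gains)
```

Every tenfold increase in size multiplies the loss by the same factor, so the curve is a straight line on log-log axes while the absolute gain per step gets smaller.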

Initializing all weights the same or too small, so neurons learn identical or vanishing features.
Using MSE for classification and getting weaker gradients than cross-entropy with softmax.
Picking a learning rate that is technically stable but too small to make real progress.
Killing ReLUs with aggressive learning rates or poor initialization, then blaming model capacity.
Treating Adam and AdamW as interchangeable even though weight decay is handled differently.
Ignoring the batch-size/noise tradeoff when comparing optimization runs.
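
The MSE-for-classification pitfall can be seen in the gradients themselves. The sketch below assumes the binary case with a sigmoid output (the two-class analogue of softmax) and compares the gradient with respect to the logit under each loss; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_bce_wrt_logit(z, y):
    # d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z) simplifies to p - y.
    return sigmoid(z) - y

def grad_mse_wrt_logit(z, y):
    # d/dz of 0.5 * (p - y)^2 picks up the extra sigmoid derivative p * (1 - p).
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

y = 1.0  # true label
for z in (-6.0, -2.0, 0.0):
    print(f"z={z:+.1f}  cross-entropy grad={grad_bce_wrt_logit(z, y):+.5f}  "
          f"MSE grad={grad_mse_wrt_logit(z, y):+.5f}")
```

When the model is confidently wrong (large negative z with y = 1), the cross-entropy gradient stays near -1 while the MSE gradient is scaled down by p(1 - p), which is close to zero exactly where a strong correction is needed.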

Builds on: Module 00 ideas about loss minimization, regularization, train/validation discipline, and linear models as weighted-sum predictors.
Feeds into: embedding learning, attention blocks, transformer optimization, and scaling-law reasoning in later LLM modules.

"Why can't we just stack linear layers without activations?"
"Walk me through backprop on a two-layer network — what is the chain rule doing?"
"Why is He initialization paired with ReLU, and Xavier with tanh or sigmoid?"
"AdamW vs Adam — what changed mathematically, and why do people care?"
"Why do scaling-law papers talk about compute-optimal token/parameter ratios?"

For the full Q&A bank with common-wrong-answer notes, see explainer §6.3.
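
For the question about stacking linear layers without activations, a small numeric check may help. The sketch below uses hypothetical weights and omits biases (with biases the stack is still a single affine map); the helper names are illustrative.

```python
def matmul(a, b):
    # Plain-Python matrix product of two matrices stored as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(w, x):
    # Matrix-vector product.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, v_i) for v_i in v]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]  # 3x2, hypothetical weights
W2 = [[1.0, 1.0, -1.0]]                      # 1x3, hypothetical weights
x = [1.0, 2.0]

two_linear = matvec(W2, matvec(W1, x))       # W2 (W1 x)
collapsed = matvec(matmul(W2, W1), x)        # (W2 W1) x: one equivalent layer
with_relu = matvec(W2, relu(matvec(W1, x)))  # the nonlinearity breaks the collapse

print(two_linear, collapsed, with_relu)
```

The first two results match because matrix multiplication is associative, so depth adds no expressive power without an activation in between; inserting ReLU changes the output and prevents the two layers from merging into one.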

Why do we need non-linear activation functions? (explainer §2.2)
Why does He init use 2/fan_in for ReLU? (§3.1)
Cross-entropy vs MSE — when each, and why? (§3.4)
Why mini-batch instead of full-batch or single-example? (§3.3)
ReLU vs GELU — what's the difference? (§4.1, §4.2)
AdamW vs Adam — what changed? (§4.2)
Dropout — what does it do mechanically? (§5.2)
Chinchilla scaling — what did it correct from Kaplan? (§5.1)
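
For the AdamW vs Adam question, a single scalar update step shows where the weight decay term enters. This is a minimal sketch, not a library implementation: `adam_step` and its `decoupled` flag are hypothetical names, and the hyperparameter values are arbitrary.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    # Adam with L2 regularization folds the decay term into the gradient,
    # so it is rescaled by the adaptive denominator like any other gradient.
    if not decoupled:
        grad = grad + weight_decay * w
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    # AdamW applies the decay directly to the weight, outside the adaptive rescaling.
    if decoupled:
        w_new = w_new - lr * weight_decay * w
    return w_new, m, v

w0, g = 2.0, 0.5  # hypothetical weight and gradient
w_adam, _, _ = adam_step(w0, g, 0.0, 0.0, t=1, weight_decay=0.1)
w_adamw, _, _ = adam_step(w0, g, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=True)
print("Adam + L2:", w_adam)
print("AdamW:    ", w_adamw)
```

On the first step the adaptive normalization reduces the Adam update to roughly lr times the sign of the gradient, so the L2 term folded into the gradient barely changes the step, whereas the decoupled decay in AdamW shrinks the weight by an additional lr * weight_decay * w.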