03. Week 1 — Neural Networks Foundations
For deep understanding, see 02_explainer.md: a narrative with worked examples, diagrams, and retrieval prompts. This file is the quick-reference glossary: formulas, definitions, lookup tables.
Section 1 — From perceptron to multi-layer
A perceptron is the simplest neural unit:
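A minimal sketch of a single perceptron, assuming the classic form: weighted sum plus bias, then a hard step. The function name `perceptron` and the AND-gate weights are illustrative choices, not taken from the explainer.

```python
def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


# Hypothetical weights that make the unit compute logical AND.
AND_W, AND_B = [1.0, 1.0], -1.5

and_outputs = [perceptron([x1, x2], AND_W, AND_B) for x1 in (0, 1) for x2 in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```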
Stack many in layers → multi-layer perceptron (MLP). Each layer is:
- Linear transform: Wx + b
- Non-linear activation: σ(Wx + b)
Without activation, the whole network collapses to one linear function. With it, a network with enough hidden units can approximate any continuous function on a bounded input range to arbitrary accuracy (universal approximation theorem).
See explainer chapter 2 for the full failure-fix story (XOR cannot be cut by one line).
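The collapse of stacked linear layers can be checked numerically. The sketch below assumes two small layers with made-up weights and shows that they equal one layer with W = W2·W1 and b = W2·b1 + b2. The helper names (`matvec`, `matmul`, `add`) are illustrative.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Illustrative weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 1.
W1 = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]]
b1 = [0.0, 1.0, -1.0]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.25]

x = [1.5, -0.5]

# Two linear layers in sequence, with no activation in between.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # both give the same output
```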
Section 2 — Forward pass
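A minimal sketch of a forward pass, assuming an MLP stored as a list of (W, b) pairs with ReLU after every layer except the last. The layout and the weights are assumptions for illustration; the explainer may use a different convention.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(W, b, x):
    """Compute Wx + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, layers):
    """Push x through each (W, b) layer; ReLU on all but the final layer."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = linear(W, b, a)
        a = z if i == len(layers) - 1 else relu(z)
    return a


# Illustrative 2 -> 2 -> 1 network.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.5]),                      # output layer
]

out = forward([2.0, 0.5], layers)
print(out)  # [2.0]
```

With these weights the hidden layer computes ReLU(x1 − x2) and ReLU(x2 − x1), so the output is |x1 − x2| + 0.5, a function no single linear layer can produce.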
Section 3 — Weight initialization
Weights start as small random numbers. Scale matters.
| Failure mode | Cause |
|---|---|
| NaN at step 1 | init too large; forward values explode |
| Loss does not decrease | init too small; gradient vanishes from step 0 |
| All neurons learn the same thing | init all-zero or all-constant; symmetry not broken |
Two rules:
| Init | Variance | When |
|---|---|---|
| Xavier / Glorot | 2 / (fan_in + fan_out) | sigmoid / tanh |
| He | 2 / fan_in | ReLU (compensates for half the activations being zero) |
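PyTorch nn.Linear defaults to a He (Kaiming) uniform variant: kaiming_uniform_ with a = √5, which works out to weights drawn from U(−1/√fan_in, +1/√fan_in). See explainer §3.1.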
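The two variance rules in the table can be sketched as follows. This assumes zero-mean Gaussian draws (the table fixes only the variance); `init_weights` and `sample_variance` are hypothetical helper names.

```python
import math
import random


def init_weights(fan_in, fan_out, scheme, rng):
    """Return a fan_out x fan_in matrix with the variance given by the scheme."""
    if scheme == "xavier":
        var = 2.0 / (fan_in + fan_out)
    elif scheme == "he":
        var = 2.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    std = math.sqrt(var)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sample_variance(W):
    flat = [w for row in W for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)


rng = random.Random(0)
W_he = init_weights(256, 128, "he", rng)
W_xavier = init_weights(256, 128, "xavier", rng)

print(sample_variance(W_he), 2.0 / 256)              # close to each other
print(sample_variance(W_xavier), 2.0 / (256 + 128))  # close to each other
```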
Section 4 — Loss functions
How wrong is the prediction?
- Cross-entropy (classification): L = -Σ y_i · log(ŷ_i)
- Mean squared error (regression): L = (1/n) · Σ (y_i - ŷ_i)²
- Negative log-likelihood: L = -log(p(y|x))
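The three losses above can be sketched directly from their formulas. The function names and the sample values are illustrative. Note that with a one-hot label, cross-entropy reduces to the negative log-likelihood of the true class.

```python
import math


def cross_entropy(y, y_hat):
    """L = -sum(y_i * log(y_hat_i)); terms with y_i = 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)


def mse(y, y_hat):
    """L = (1/n) * sum((y_i - y_hat_i)^2)."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y)


def nll(p_true):
    """L = -log(p(y|x)) for the probability assigned to the true outcome."""
    return -math.log(p_true)


y = [0, 1, 0]              # one-hot label: class 1 is correct
y_hat = [0.2, 0.7, 0.1]    # hypothetical predicted probabilities

ce = cross_entropy(y, y_hat)
print(ce, nll(0.7))        # identical: both are -log(0.7)
print(mse([1.0, 2.0], [1.5, 1.0]))
```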
Softmax + cross-entropy — the canonical pair
For K classes, logits → softmax → probabilities → cross-entropy: p_i = e^(z_i) / Σ_j e^(z_j), then L = −Σ_i y_i · log(p_i).
Clean gradient: ∂L/∂z_i = p_i − y_i (predicted minus one-hot true). No saturation.
For LLMs: same math, K = vocab size (~50K). See explainer §3.4.
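The gradient identity ∂L/∂z_i = p_i − y_i can be checked against finite differences. This is a sketch with made-up logits; subtracting the max logit before exponentiating is a common numerical-stability step, added here as an assumption rather than something this file specifies.

```python
import math


def softmax(z):
    m = max(z)  # shift by the max logit so exp() cannot overflow
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(z, target):
    """Cross-entropy with a one-hot label at index `target`."""
    return -math.log(softmax(z)[target])


def analytic_grad(z, target):
    """p_i - y_i, with y one-hot at `target`."""
    p = softmax(z)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]


def numeric_grad(z, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(z)):
        up, down = z[:], z[:]
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy_from_logits(up, target) - cross_entropy_from_logits(down, target)
        grads.append(diff / (2 * eps))
    return grads


logits = [2.0, -1.0, 0.5]  # hypothetical logits for K = 3
target = 0

g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_gap = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(max_gap)  # tiny: the two gradients agree
```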
Section 5 — Backpropagation
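Chain rule applied backward through the network: ∂L/∂W = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂W, where z = Wx + b. That is one factor per step between the loss and the weight; earlier layers add more factors.
Update each weight in the opposite direction of its gradient: W ← W − α · ∂L/∂W
α = learning rate. Too high → unstable. Too low → slow.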
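A minimal sketch of backprop plus one gradient-descent step, assuming a toy scalar network y = w2 · tanh(w1 · x) with squared-error loss. The network, the values, and the names are illustrative only.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    return h, w2 * h       # (hidden, output)


def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return (y - t) ** 2


def backward(w1, w2, x, t):
    """Chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - t)
    dL_dw2 = dL_dy * h                   # y = w2 * h
    dL_dh = dL_dy * w2
    dL_dw1 = dL_dh * (1.0 - h * h) * x   # d tanh(u)/du = 1 - tanh(u)^2
    return dL_dw1, dL_dw2


w1, w2, x, t = 0.5, -1.0, 2.0, 1.0
alpha = 0.1  # learning rate

g1, g2 = backward(w1, w2, x, t)
new_w1, new_w2 = w1 - alpha * g1, w2 - alpha * g2

loss_before = loss(w1, w2, x, t)
loss_after = loss(new_w1, new_w2, x, t)
print(loss_before, loss_after)  # the loss drops after one step
```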
Mini-batch SGD
| Method | Inputs per step | Notes |
|---|---|---|
| Full-batch GD | all | Slow, no noise |
| Stochastic GD | 1 | Fast, very noisy |
| Mini-batch GD | 32–256 | Default. Balance speed and noise |
Vocabulary:
- Step / iteration = one gradient update on one batch
- Epoch = one full pass over the training set
- Noise from mini-batches helps escape shallow minima and prefer flat ones
See explainer §3.3.
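The step/epoch vocabulary can be made concrete with a small batching sketch. The dataset size (1000) and batch size (32) are arbitrary; `minibatches` is a hypothetical helper that yields index lists rather than real data.

```python
import random


def minibatches(n_examples, batch_size, rng):
    """Shuffle example indices, then yield them one batch at a time."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)
epoch = list(minibatches(1000, 32, rng))  # one epoch = one full pass

steps_per_epoch = len(epoch)  # one step per batch
print(steps_per_epoch)        # 32
print(len(epoch[-1]))         # 8 (the last batch is smaller)
```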
Section 6 — Activation functions
| Function | Formula | When |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers; cheap; suffers dead-neuron problem |
| GELU | x · Φ(x) | Smoother ReLU; default in transformers |
| SiLU / Swish | x · σ(x) | Used in Llama family |
| sigmoid | 1 / (1 + e^-x) | Output for binary classification; saturates at extremes |
| tanh | (e^x - e^-x) / (e^x + e^-x) | RNNs; bounded -1 to 1 |
| softmax | e^x_i / Σe^x_j | Multi-class output; turns logits into probabilities |
GELU is the default in modern transformers — smooth (better gradients) and slightly outperforms ReLU empirically.
Vanishing gradient. Sigmoid's max derivative is 0.25; multiplied through 10 layers → 0.25¹⁰ ≈ 10⁻⁶. ReLU's derivative is 1 in active region → 1¹⁰ = 1. See explainer §4.1.
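The table's formulas and the vanishing-gradient arithmetic can be sketched as follows. GELU uses the exact Gaussian CDF via `math.erf`. The chain products look only at activation derivatives and ignore the weight matrices, which is a simplification.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    return x * sigmoid(x)


def gelu(x):
    # Phi(x) is the standard normal CDF: 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case: 0.25 per layer
relu_chain = relu_grad(1.0) ** depth        # active region: 1 per layer

print(sigmoid_chain)  # about 9.5e-07
print(relu_chain)     # 1.0
```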
Section 7 — Optimizers (beyond vanilla GD)
| Optimizer | What it adds |
|---|---|
| SGD with momentum | Running avg of past gradients; smooths updates |
| Adam | Per-parameter adaptive learning rates from first/second moment estimates |
| AdamW | Adam + decoupled weight decay; default for transformers |
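The decoupling matters: in plain Adam, weight decay is added to the gradient as an L2 term, so it gets rescaled by the per-parameter adaptive step and no longer shrinks every weight in proportion to its size. AdamW applies the decay directly to the weights, outside the adaptive update. See explainer §4.2.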
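A sketch of one update step on a single scalar weight, contrasting L2-in-the-gradient Adam with decoupled decay. The function `adam_step` and its `decoupled` flag are hypothetical; the hyperparameters are illustrative. The loss gradient is set to zero so that only the decay acts.

```python
import math


def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update. decoupled=True applies the decay AdamW-style."""
    if weight_decay and not decoupled:
        grad = grad + weight_decay * w  # L2 term enters the adaptive machinery
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    if weight_decay and decoupled:
        w = w - lr * weight_decay * w   # decay applied straight to the weight
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


w_adam_l2 = adam_step(1.0, 0.0, fresh_state(), weight_decay=0.01)
w_adamw = adam_step(1.0, 0.0, fresh_state(), weight_decay=0.01, decoupled=True)

print(w_adam_l2)  # about 0.9: the step is roughly lr, whatever the decay strength
print(w_adamw)    # 0.999: the weight shrinks by lr * weight_decay * w
```

On the first step Adam normalizes the update to roughly lr in magnitude, so the L2 term moves the weight by about 0.1 whether the decay coefficient is 0.01 or 0.0001. The decoupled version shrinks it by exactly lr · λ · w.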
Section 8 — Overfitting and regularization
| Train loss | Val loss | Diagnosis |
|---|---|---|
| high | high | Underfitting (model too small) |
| low | low | Good |
| low | high | Overfitting (memorizing) |
Regularization toolkit:
- More data — preferred fix when possible
- Dropout — randomly zero 10–50% of neurons each training step; forces redundancy
- Weight decay — add λ·Σ W² to loss; prefer small weights (AdamW does this cleanly)
- Early stopping — stop when validation loss starts rising
- Data augmentation — random crops, flips, noise (vision/audio)
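Dropout from the list above can be sketched as follows. This assumes the common "inverted" form, where surviving activations are scaled by 1/(1 − p) during training so that nothing needs to change at inference time. That scaling is an assumption; the list states only the zeroing.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000

dropped = dropout(acts, 0.5, rng)
mean_after = sum(dropped) / len(dropped)
print(mean_after)  # close to 1.0: the expected activation is preserved

unchanged = dropout(acts, 0.5, rng, training=False)  # no-op at inference
```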
Double descent. Classical theory says more params than data → overfit. Empirically, past a critical point, generalization improves again. Modern LLMs are in this second regime. Open problem. See explainer §5.2.
Section 9 — Learning paradigms
| Paradigm | Data | Examples |
|---|---|---|
| Supervised | Input + ground-truth label | Image classification, sentiment analysis |
| Self-supervised | Input only; create labels from input itself | Next-token prediction (LLMs), masked LM (BERT) |
| Contrastive | Pairs (similar, dissimilar) | CLIP, SimCLR — learns embeddings |
| Reinforcement | Rewards from environment | Game playing, RLHF for LLMs |
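LLMs are pre-trained with self-supervised learning (next-token prediction), then post-trained with supervised learning (instruction tuning) and preference optimization (RLHF, which uses reinforcement learning, or DPO, which trains on the same preference data without an RL loop).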
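"Create labels from the input itself" can be shown in a few lines: for next-token prediction, the targets are the input sequence shifted by one position. The token IDs are made up and `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted left."""
    return token_ids[:-1], token_ids[1:]


tokens = [12, 7, 99, 3, 42]  # hypothetical token IDs for one training sequence
inputs, targets = next_token_pairs(tokens)

for seen, nxt in zip(inputs, targets):
    print(f"after token {seen}, predict token {nxt}")
```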
Section 10 — Scaling laws
Loss decreases as a power law in model size N, data D, compute C:
L(N) ≈ a · N^(−α) (similar for D, C; α typically 0.05–0.1)
Plotted on log-log axes, this is a straight line. Predictable improvement per doubling.
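The "predictable improvement per doubling" follows directly from the power law: doubling N multiplies the loss by 2^(−α). In this sketch the constant `a` is an arbitrary placeholder and `alpha` is an illustrative value inside the 0.05–0.1 range quoted above.

```python
import math


def power_law_loss(n, a, alpha):
    """L(N) = a * N^(-alpha)."""
    return a * n ** (-alpha)


a, alpha = 10.0, 0.076  # placeholder constant and illustrative exponent

n = 1e9
ratio_per_doubling = power_law_loss(2 * n, a, alpha) / power_law_loss(n, a, alpha)
print(ratio_per_doubling)  # about 0.949, i.e. roughly 5% lower loss per doubling

# On log-log axes the slope between any two points is -alpha.
slope = (math.log(power_law_loss(1e10, a, alpha)) - math.log(power_law_loss(1e8, a, alpha))) / (
    math.log(1e10) - math.log(1e8)
)
print(slope)
```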
Chinchilla (Hoffmann 2022) corrected Kaplan: for compute-optimal training, model and data scale together. Ratio ~20 tokens per parameter. GPT-3 had ~1.7. Chinchilla has ~20 and beat GPT-3.
Modern frontier models (Llama 3, etc.) push past Chinchilla — over-train small models because inference cost dominates. See explainer §5.1.
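The tokens-per-parameter ratios above are simple arithmetic. The model sizes and token counts are the commonly cited figures from the GPT-3 and Chinchilla papers, not values taken from this file; `compute_optimal_tokens` is a hypothetical helper that applies the ~20:1 rule of thumb.

```python
def tokens_per_param(tokens, params):
    return tokens / params


def compute_optimal_tokens(params, ratio=20.0):
    """Rule-of-thumb training tokens for a given parameter count."""
    return ratio * params


gpt3_ratio = tokens_per_param(300e9, 175e9)        # 175B params, ~300B tokens
chinchilla_ratio = tokens_per_param(1.4e12, 70e9)  # 70B params, 1.4T tokens

print(round(gpt3_ratio, 1))       # 1.7
print(round(chinchilla_ratio, 1)) # 20.0
print(compute_optimal_tokens(7e9))  # rule-of-thumb tokens for a 7B-parameter model
```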
Reading list
- Karpathy "Neural Networks: Zero to Hero" — parts 1-2
- "Scaling Laws for Neural Language Models" (Kaplan 2020)
- "Training Compute-Optimal Large Language Models" (Chinchilla, Hoffmann 2022)
- 3Blue1Brown "Neural Networks" (3-part series)
Reference material
YouTube
- But what is a neural network? | Deep learning chapter 1 — Builds intuition for neurons, weights, and activation functions through animated visualizations of a digit-recognition network.
- The spelled-out intro to neural networks and backpropagation: building micrograd — Implements a scalar-valued autograd engine and a neural net from scratch so every step of backpropagation is explicit.
Blogs
- Neural Networks and Deep Learning - Chapter 1 — Derives gradient descent and backprop from first principles through a worked MNIST classifier.
- CS231n: Neural Networks Part 1 - Setting up the Architecture — Covers activation functions, layer sizing, and representational power.
Self-check
For full Q&A see explainer §6.3.
- Why do we need non-linear activation functions? (§2.2)
- Why does He init use 2/fan_in for ReLU? (§3.1)
- Cross-entropy vs MSE — when each, and why? (§3.4)
- Mini-batch vs full-batch vs single-example — tradeoffs? (§3.3)
- ReLU vs GELU — what's the difference, when each? (§4.1)
- AdamW vs Adam — what changed? (§4.2)
- Dropout — what does it actually do? (§5.2)
- Self-supervised learning: how do LLMs create their own labels? (§9)
- Chinchilla scaling: what did it correct from Kaplan? (§5.1)