03. Week 1 — Neural Networks Foundations¶

For deep understanding see 02_explainer.md — narrative with worked examples, diagrams, retrieval prompts. This file is the quick-reference glossary: formulas, definitions, lookup tables.

Section 1 — From perceptron to multi-layer¶

A perceptron is the simplest neural unit:

output = activation(W·input + b)

Stack many in layers → multi-layer perceptron (MLP). Each layer is:

- Linear transform: Wx + b
- Non-linear activation: σ(Wx + b)

Without activation, the whole network collapses to one linear function. With it, a network with enough hidden units can approximate any continuous function on a bounded input domain to arbitrary accuracy (universal approximation theorem).
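The collapse claim can be checked numerically. The sketch below uses plain Python lists and arbitrary illustrative weights (the helper names `matvec`, `matmul`, and `add` are not from the source) to show that two stacked linear layers equal one linear layer with W = W2·W1 and b = W2·b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

# Two layers with no activation in between (values chosen only for illustration).
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]
x = [1.5, -0.5]

# Layer by layer: W2 (W1 x + b1) + b2
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)
```

Inserting a non-linearity between the two layers breaks this identity, which is what lets depth add expressive power.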

See explainer chapter 2 for the full failure-fix story (XOR cannot be cut by one line).

Section 2 — Forward pass¶

h_1 = relu(W_1 @ x + b_1)
h_2 = relu(W_2 @ h_1 + b_2)
y_hat = softmax(W_3 @ h_2 + b_3)
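A runnable version of the three lines above, as a minimal sketch in plain Python. The layer sizes (3 → 4 → 4 → 2), the weight scale, and the helper names `linear` and `init_layer` are assumptions made for illustration only.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, x):
    """Compute W @ x + b for W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(rng, fan_out, fan_in):
    """Small random weights and zero biases (scale chosen only for illustration)."""
    W = [[rng.gauss(0.0, 0.5) for _ in range(fan_in)] for _ in range(fan_out)]
    return W, [0.0] * fan_out

rng = random.Random(0)
W_1, b_1 = init_layer(rng, 4, 3)  # 3 inputs -> 4 hidden units
W_2, b_2 = init_layer(rng, 4, 4)  # 4 hidden -> 4 hidden units
W_3, b_3 = init_layer(rng, 2, 4)  # 4 hidden -> 2 classes

x = [0.2, -1.0, 0.7]
h_1 = relu(linear(W_1, b_1, x))
h_2 = relu(linear(W_2, b_2, h_1))
y_hat = softmax(linear(W_3, b_3, h_2))

print(y_hat)  # two class probabilities that sum to 1
```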

Section 3 — Weight initialization¶

Weights start as small random numbers. Scale matters.

Failure mode	Cause
NaN at step 1	init too large; forward values explode
Loss does not decrease	init too small; gradient vanishes from step 0
All neurons learn the same thing	init all-zero or all-constant; symmetry not broken

Two rules:

Init	Variance	When
Xavier / Glorot	`2 / (fan_in + fan_out)`	sigmoid / tanh
He	`2 / fan_in`	ReLU (compensates for half the activations being zero)

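PyTorch nn.Linear defaults to a Kaiming-uniform variant: it calls kaiming_uniform_ with a=√5, which works out to U(−1/√fan_in, 1/√fan_in), i.e. variance 1/(3·fan_in). That is smaller than He's 2/fan_in, so apply He init explicitly when a ReLU network needs it. See explainer §3.1.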
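The two variance rules in the table can be checked empirically. The sketch below draws Gaussian weights with the prescribed standard deviation and measures the sample variance; the layer sizes and helper names are illustrative assumptions, and real frameworks offer both normal and uniform variants of each rule.

```python
import math
import random
import statistics

def he_std(fan_in):
    """Standard deviation for He init: variance = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot init: variance = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def sample_weights(rng, fan_out, fan_in, std):
    """Draw a fan_out x fan_in weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 256, 128

W_he = sample_weights(rng, fan_out, fan_in, he_std(fan_in))
W_xavier = sample_weights(rng, fan_out, fan_in, xavier_std(fan_in, fan_out))

var_he = statistics.pvariance([w for row in W_he for w in row])
var_xavier = statistics.pvariance([w for row in W_xavier for w in row])

print(var_he, 2.0 / fan_in)                  # sample vs target variance (He)
print(var_xavier, 2.0 / (fan_in + fan_out))  # sample vs target variance (Xavier)
```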

Section 4 — Loss functions¶

How wrong is the prediction?

Cross-entropy (classification): L = -Σ y_i · log(ŷ_i)
Mean squared error (regression): L = (1/n) · Σ (y_i - ŷ_i)²
Negative log-likelihood: L = -log(p(y|x))
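The three formulas, written out as a small sketch in plain Python (function names are illustrative). With a one-hot target, cross-entropy reduces to the negative log-likelihood of the true class, which the example checks.

```python
import math

def cross_entropy(y, y_hat):
    """L = -sum_i y_i * log(y_hat_i); terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

def mse(y, y_hat):
    """L = (1/n) * sum_i (y_i - y_hat_i)^2."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y)

def nll(p_true):
    """L = -log p(y|x), given the probability assigned to the true outcome."""
    return -math.log(p_true)

y = [0.0, 1.0, 0.0]      # one-hot target, true class is index 1
y_hat = [0.1, 0.7, 0.2]  # predicted probabilities (illustrative)

ce = cross_entropy(y, y_hat)
print(ce, nll(0.7))                   # identical for a one-hot target
print(mse([1.0, 2.0], [1.5, 2.5]))    # mean of two squared errors of 0.25 each
```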

Softmax + cross-entropy — the canonical pair¶

For K classes, logits → softmax → probabilities → cross-entropy:

p_i = exp(z_i) / Σ_j exp(z_j)
L   = -log(p_c)            where c is the true class

Clean gradient: ∂L/∂z_i = p_i − y_i (predicted minus one-hot true). No saturation: the gradient stays large for as long as the prediction is wrong.
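The gradient formula can be verified against a finite-difference estimate. This is a sketch with illustrative logits; `analytic_grad` and `numeric_grad` are hypothetical helper names, not part of any library.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, c):
    """Cross-entropy for logits z when the true class is c: L = -log(p_c)."""
    return -math.log(softmax(z)[c])

def analytic_grad(z, c):
    """dL/dz_i = p_i - y_i, with y the one-hot vector for class c."""
    p = softmax(z)
    return [p_i - (1.0 if i == c else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(z, c, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = z[:], z[:]
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((loss(z_plus, c) - loss(z_minus, c)) / (2.0 * eps))
    return grad

z = [2.0, -1.0, 0.5]  # illustrative logits
c = 0                 # true class
g_analytic = analytic_grad(z, c)
g_numeric = numeric_grad(z, c)
print(g_analytic)
print(g_numeric)
```

The entries of the gradient sum to zero, because the probabilities sum to one and so does the one-hot target.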

For LLMs: same math, K = vocab size (~50K for GPT-2; often 100K or more in newer models). See explainer §3.4.

Section 5 — Backpropagation¶

Chain rule applied backward through the network:

∂L/∂W_1 = ∂L/∂ŷ · ∂ŷ/∂h_2 · ∂h_2/∂h_1 · ∂h_1/∂W_1

Update each weight in the direction opposite to its gradient:

W := W - α · ∂L/∂W

α = learning rate. Too high → unstable. Too low → slow.
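The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes L(w) = w² (gradient 2w), a loss chosen here purely for illustration, with three values of α.

```python
def gradient_descent(w, lr, steps):
    """Repeatedly apply W := W - alpha * dL/dW for L(w) = w**2."""
    for _ in range(steps):
        grad = 2.0 * w     # dL/dw
        w = w - lr * grad
    return w

w_good = gradient_descent(1.0, lr=0.1, steps=50)      # shrinks toward the minimum at 0
w_slow = gradient_descent(1.0, lr=0.001, steps=50)    # barely moves in 50 steps
w_unstable = gradient_descent(1.0, lr=1.1, steps=50)  # overshoots and grows each step

print(w_good, w_slow, w_unstable)
```

For this loss each step multiplies w by (1 − 2α), so any α above 1 makes the magnitude grow instead of shrink.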

Mini-batch SGD¶

Method	Inputs per step	Notes
Full-batch GD	all	Slow, no noise
Stochastic GD	1	Fast, very noisy
Mini-batch GD	32–256	Default. Balance speed and noise

Vocabulary:

- Step / iteration = one gradient update on one batch
- Epoch = one full pass over the training set
- Noise from mini-batches helps escape shallow minima and prefer flat ones
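The step and epoch vocabulary maps onto a training loop as follows. This is a minimal sketch that fits a single weight to toy data y = 3x; the data, learning rate, and the `minibatches` helper are assumptions made for illustration.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that cover the data exactly once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

# 100 toy examples of y = 3x (illustrative only)
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-50, 50)]
rng = random.Random(0)
w, lr, batch_size = 0.0, 0.02, 32
steps = 0

for epoch in range(20):                       # 20 full passes over the data
    for batch in minibatches(data, batch_size, rng):
        # gradient of the batch-mean squared error with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                        # one step = one update on one batch
        steps += 1

print(w, steps)  # w approaches 3; 4 steps per epoch (32 + 32 + 32 + 4 examples)
```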

See explainer §3.3.

Section 6 — Activation functions¶

Function	Formula	When
ReLU	max(0, x)	Default for hidden layers; cheap; suffers dead-neuron problem
GELU	x · Φ(x), Φ = standard normal CDF	Smoother ReLU; default in BERT/GPT-style transformers
SiLU / Swish	x · σ(x), σ = sigmoid	Used in Llama family (inside the gated SwiGLU feed-forward layer)
sigmoid	1 / (1 + e^-x)	Output for binary classification; saturates at extremes
tanh	(e^x - e^-x) / (e^x + e^-x)	RNNs; bounded -1 to 1
softmax	e^(x_i) / Σ_j e^(x_j)	Multi-class output; turns logits into probabilities
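The formulas in the table translate directly to code. A minimal sketch using only the standard library; function names are illustrative, and Φ is written via `math.erf`.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # Phi(x), the standard normal CDF, expressed with the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def silu(x):
    return x * sigmoid(x)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), gelu(x), silu(x), sigmoid(x), tanh(x))
print(softmax([1.0, 2.0, 3.0]))
```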

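GELU is the default in BERT- and GPT-style transformers: it is smooth (better-behaved gradients near zero) and slightly outperforms ReLU empirically. Llama-family models use SiLU inside a gated feed-forward layer (SwiGLU) instead.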

Vanishing gradient. Sigmoid's max derivative is 0.25; multiplied through 10 layers → at most 0.25¹⁰ ≈ 10⁻⁶. ReLU's derivative is 1 in the active region → 1¹⁰ = 1. See explainer §4.1.
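The arithmetic above, spelled out. This sketch tracks only the activation-derivative factor of the chain rule and ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), maximal at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

layers = 10
best_case_sigmoid = sigmoid_grad(0.0) ** layers  # 0.25 ** 10, the largest possible product
saturated_sigmoid = sigmoid_grad(4.0) ** layers  # far smaller once units saturate
relu_active = 1.0 ** layers                      # ReLU passes the gradient through unchanged

print(best_case_sigmoid, saturated_sigmoid, relu_active)
```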

Section 7 — Optimizers (beyond vanilla GD)¶

Optimizer	What it adds
SGD with momentum	Running avg of past gradients; smooths updates
Adam	Per-parameter adaptive learning rates from first/second moment estimates
AdamW	Adam + decoupled weight decay; default for transformers

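The decoupling matters: in plain Adam, weight decay implemented as an L2 term in the gradient is divided by the same adaptive denominator as the loss gradient, so weights with a large gradient history are decayed less than intended. AdamW applies the decay directly to the weights, outside the adaptive scaling. See explainer §4.2.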
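A scalar sketch of the difference, following the published update rules rather than any particular library's implementation. The function `adam_step` and its `l2` / `decoupled_wd` arguments are hypothetical names introduced for this example. With zero loss gradient, the L2 variant moves the weight by about lr no matter how large λ is, because the adaptive normalization cancels it; the decoupled variant moves it by lr·λ·w.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, decoupled_wd=0.0):
    """One Adam update on a scalar weight.

    l2:           L2 penalty folded into the gradient (Adam with L2 regularization)
    decoupled_wd: decay applied directly to the weight (AdamW style)
    """
    g = grad + l2 * w                        # L2 term enters the adaptive statistics
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * decoupled_wd * w       # not divided by sqrt(v_hat)
    return w - adaptive_step - decay_step, m, v

# Zero loss gradient, so only the decay term acts on w = 1.0.
w_l2_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
w_l2_large, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.01)

print(w_l2_small, w_l2_large)  # both close to 1 - lr: lambda was normalized away
print(w_adamw)                 # 1 - lr * 0.01: decay strength follows lambda
```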

Section 8 — Overfitting and regularization¶

Train loss	Val loss	Diagnosis
high	high	Underfitting (model too small or undertrained)
low	low	Good
low	high	Overfitting (memorizing)

Regularization toolkit:

More data — preferred fix when possible
Dropout — randomly zero 10–50% of neurons each training step; forces redundancy
Weight decay — penalize large weights, either by adding λ·Σ W² to the loss (L2 regularization) or by shrinking the weights directly each step (decoupled decay, as in AdamW)
Early stopping — stop when validation loss starts rising
Data augmentation — random crops, flips, noise (vision/audio)
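As one concrete item from the toolkit, dropout can be sketched in a few lines. This version uses "inverted" dropout, which rescales the surviving activations by 1/(1 − p) during training so that nothing changes at evaluation time; that scaling convention is an assumption here, since the list above only describes the zeroing.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each unit with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)  # evaluation: pass activations through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [1.0] * 10000

h_train = dropout(h, p=0.3, rng=rng)
h_eval = dropout(h, p=0.3, rng=rng, training=False)

dropped_fraction = sum(1 for a in h_train if a == 0.0) / len(h_train)
mean_train = sum(h_train) / len(h_train)
print(dropped_fraction, mean_train)  # roughly 0.3 dropped, mean stays near 1.0
```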

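Double descent. Classical theory says more params than data → overfit. Empirically, test error peaks near the interpolation threshold (where the model can just fit the training set) and then falls again as the model keeps growing. Large modern networks are often described as operating in this second regime. Why this happens is still an open problem. See explainer §5.2.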

Section 9 — Learning paradigms¶

Paradigm	Data	Examples
Supervised	Input + ground-truth label	Image classification, sentiment analysis
Self-supervised	Input only; create labels from input itself	Next-token prediction (LLMs), masked LM (BERT)
Contrastive	Pairs (similar, dissimilar)	CLIP, SimCLR — learns embeddings
Reinforcement	Rewards from environment	Game playing, RLHF for LLMs

LLMs are pre-trained with self-supervised learning (next-token prediction), then post-trained with supervised fine-tuning (instruction tuning) and preference optimization (RLHF, which uses reinforcement learning, or DPO, which does not).
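"Create labels from the input itself" means, for next-token prediction, that the target sequence is the input sequence shifted by one position. A minimal sketch with made-up token ids; `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(token_ids, context_len):
    """Build (input, target) pairs where each target is the input shifted by one token."""
    pairs = []
    for start in range(len(token_ids) - context_len):
        inputs = token_ids[start:start + context_len]
        targets = token_ids[start + 1:start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs

tokens = [11, 42, 7, 99, 3, 58]  # hypothetical token ids for one text
pairs = next_token_pairs(tokens, context_len=4)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

No human annotation is involved: every position in the text supplies its own label.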

Section 10 — Scaling laws¶

Loss decreases as a power law in model size N, data D, compute C:

L(N) ≈ a · N^(−α) (similar for D, C; α typically 0.05–0.1)

Plotted on log-log axes, this is a straight line. Predictable improvement per doubling.
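A short numeric check of both statements. The constant a = 1 and the exponent α = 0.076 are illustrative values inside the range quoted above, not fitted results.

```python
import math

def power_law_loss(n, a, alpha):
    """L(N) = a * N^(-alpha)."""
    return a * n ** (-alpha)

a = 1.0        # arbitrary scale constant (illustrative)
alpha = 0.076  # illustrative exponent within the 0.05-0.1 range

# Doubling N multiplies the loss by 2^(-alpha), whatever the starting size.
ratio_small = power_law_loss(2e8, a, alpha) / power_law_loss(1e8, a, alpha)
ratio_large = power_law_loss(2e11, a, alpha) / power_law_loss(1e11, a, alpha)

# On log-log axes the slope between any two points is -alpha.
slope = (math.log(power_law_loss(1e10, a, alpha)) - math.log(power_law_loss(1e8, a, alpha))) \
        / (math.log(1e10) - math.log(1e8))

print(ratio_small, ratio_large, slope)
```

With this α, each doubling of N lowers the loss by about 5%.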

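Chinchilla (Hoffmann et al., 2022) corrected Kaplan et al.: Kaplan's fits favored growing model size much faster than data, while Chinchilla found that for compute-optimal training, model size and training tokens should scale together, at roughly 20 tokens per parameter. GPT-3 had ~1.7 (300B tokens, 175B parameters). Chinchilla has ~20 (1.4T tokens, 70B parameters) and beat GPT-3 despite being smaller.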
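The ratios follow from the publicly reported training figures (GPT-3: 175B parameters, 300B tokens; Chinchilla: 70B parameters, 1.4T tokens). The helper `chinchilla_optimal_tokens` is a hypothetical name that simply applies the ~20 tokens per parameter rule of thumb.

```python
def tokens_per_param(tokens, params):
    return tokens / params

def chinchilla_optimal_tokens(params, ratio=20.0):
    """Rule-of-thumb token budget: about 20 training tokens per parameter."""
    return ratio * params

gpt3_ratio = tokens_per_param(300e9, 175e9)        # about 1.7
chinchilla_ratio = tokens_per_param(1.4e12, 70e9)  # about 20

# Token budget the rule of thumb would suggest for a GPT-3-sized model.
gpt3_sized_budget = chinchilla_optimal_tokens(175e9)

print(round(gpt3_ratio, 2), round(chinchilla_ratio, 2))
print(gpt3_sized_budget / 1e12, "trillion tokens")
```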

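Modern frontier models (Llama 3, etc.) push well past the Chinchilla ratio: they over-train small models, because a smaller model is cheaper to serve and inference cost dominates over the model's lifetime. Llama 3 8B, for example, was trained on roughly 15T tokens, close to 1,900 tokens per parameter. See explainer §5.1.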

Reading list¶

Karpathy "Neural Networks: Zero to Hero" — parts 1-2
"Scaling Laws for Neural Language Models" (Kaplan et al., 2020)
"Training Compute-Optimal Large Language Models" (Chinchilla; Hoffmann et al., 2022)
3Blue1Brown "Neural Networks" (3-part series)

Reference material¶

YouTube¶

But what is a neural network? | Deep learning chapter 1 — Builds intuition for neurons, weights, and activation functions through animated visualizations of a digit-recognition network.
The spelled-out intro to neural networks and backpropagation: building micrograd — Implements a scalar-valued autograd engine and a neural net from scratch so every step of backpropagation is explicit.

Blogs¶

Neural Networks and Deep Learning - Chapter 1 — Derives gradient descent and backprop from first principles through a worked MNIST classifier.
CS231n: Neural Networks Part 1 - Setting up the Architecture — Covers activation functions, layer sizing, and representational power.

Self-check¶

For full Q&A see explainer §6.3.

Why do we need non-linear activation functions? (§2.2)
Why does He init use 2/fan_in for ReLU? (§3.1)
Cross-entropy vs MSE — when each, and why? (§3.4)
Mini-batch vs full-batch vs single-example — tradeoffs? (§3.3)
ReLU vs GELU — what's the difference, when each? (§4.1)
AdamW vs Adam — what changed? (§4.2)
Dropout — what does it actually do? (§5.2)
Self-supervised learning: how do LLMs create their own labels? (§9)
Chinchilla scaling: what did it correct from Kaplan? (§5.1)