03. Week 1 — Neural Networks Foundations

For deep understanding see 02_explainer.md — narrative with worked examples, diagrams, retrieval prompts. This file is the quick-reference glossary: formulas, definitions, lookup tables.

Section 1 — From perceptron to multi-layer

A perceptron is the simplest neural unit:

output = activation(W·input + b)

Stack many in layers → multi-layer perceptron (MLP). Each layer is:

  • Linear transform: Wx + b
  • Non-linear activation: σ(Wx + b)

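Without an activation, the whole network collapses to one linear function, because a composition of linear maps is itself linear. With one, a network with enough hidden units can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy (universal approximation theorem).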
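The collapse can be checked numerically. The sketch below uses made-up 2×2 weights (every value and helper name is an illustrative assumption) and compares two stacked linear layers against a single merged layer with W = W2·W1 and b = W2·b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative weights, not taken from any real model.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, -2.0]

# Two linear layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent linear layer.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_merged, x), b_merged)

print(two_layers, one_layer)
```

Both expressions give the same vector for any input, so the extra layer adds no expressive power until a non-linearity sits between the two.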

See explainer chapter 2 for the full failure-fix story (XOR cannot be cut by one line).

Section 2 — Forward pass

h_1 = relu(W_1 @ x + b_1)
h_2 = relu(W_2 @ h_1 + b_2)
y_hat = softmax(W_3 @ h_2 + b_3)
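As a rough illustration, the three lines above can be run in plain Python. The layer sizes (4 → 5 → 5 → 3), the input vector, the random seed, and the helper names `linear` and `random_layer` are all assumptions chosen for the sketch.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def random_layer(rng, n_out, n_in):
    """Hypothetical helper: Gaussian weights with variance 2 / n_in, zero bias."""
    std = math.sqrt(2.0 / n_in)
    W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
W_1, b_1 = random_layer(rng, 5, 4)
W_2, b_2 = random_layer(rng, 5, 5)
W_3, b_3 = random_layer(rng, 3, 5)

x = [0.5, -1.0, 2.0, 0.1]
h_1 = relu(linear(W_1, x, b_1))
h_2 = relu(linear(W_2, h_1, b_2))
y_hat = softmax(linear(W_3, h_2, b_3))

print(y_hat, sum(y_hat))
```

The output `y_hat` has one entry per class, and its entries sum to 1 because of the softmax.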

Section 3 — Weight initialization

Weights start as small random numbers. Scale matters.

| Failure mode | Cause |
| --- | --- |
| NaN at step 1 | init too large; forward values explode |
| Loss does not decrease | init too small; gradient vanishes from step 0 |
| All neurons learn the same thing | init all-zero or all-constant; symmetry not broken |

Two rules:

| Init | Variance | When |
| --- | --- | --- |
| Xavier / Glorot | 2 / (fan_in + fan_out) | sigmoid / tanh |
| He | 2 / fan_in | ReLU (compensates for half the activations being zero) |

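PyTorch nn.Linear defaults to a He variant: kaiming_uniform_ with a = √5, which works out to weights drawn from U(−1/√fan_in, 1/√fan_in), i.e. variance 1 / (3 · fan_in), smaller than the He value in the table. See explainer §3.1.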
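The two variance rules translate directly into standard deviations for sampling. The following is a minimal sketch, assuming Gaussian sampling and an arbitrary 256 → 128 layer; the function names are hypothetical and this is not how any particular library implements its initializers.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier / Glorot: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def sample_weights(rng, fan_out, fan_in, std):
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
fan_in, fan_out = 256, 128                      # illustrative layer shape
W_he = sample_weights(rng, fan_out, fan_in, he_std(fan_in))

empirical_var = variance([w for row in W_he for w in row])
target_var = 2.0 / fan_in

print(empirical_var, target_var)
```

With a few tens of thousands of weights, the empirical variance lands close to the target, which is a quick sanity check when writing a custom initializer.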

Section 4 — Loss functions

How wrong is the prediction?

  • Cross-entropy (classification): L = -Σ y_i · log(ŷ_i)
  • Mean squared error (regression): L = (1/n) · Σ (y_i - ŷ_i)²
  • Negative log-likelihood: L = -log(p(y|x))

Softmax + cross-entropy — the canonical pair

For K classes, logits → softmax → probabilities → cross-entropy:

p_i = exp(z_i) / Σ_j exp(z_j)
L   = -log(p_c)            where c is the true class

Clean gradient: ∂L/∂z_i = p_i − y_i (predicted minus one-hot true). No saturation.
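The gradient identity can be verified with a finite-difference check. The sketch below uses arbitrary logits and an arbitrary true class (both assumptions) and compares p − y against a numerical derivative of the loss.

```python
import math

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, c):
    """L = -log(p_c) for logits z and true class index c."""
    return -math.log(softmax(z)[c])

def analytic_grad(z, c):
    """dL/dz_i = p_i - y_i, with y the one-hot vector for class c."""
    return [p_i - (1.0 if i == c else 0.0) for i, p_i in enumerate(softmax(z))]

def numeric_grad(z, c, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((cross_entropy(up, c) - cross_entropy(down, c)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # illustrative logits
c = 0                  # illustrative true class
g_analytic = analytic_grad(z, c)
g_numeric = numeric_grad(z, c)

print(g_analytic)
print(g_numeric)
```

The two gradients agree to within finite-difference error, and the analytic gradient sums to zero because both p and y sum to 1.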

For LLMs: same math, K = vocabulary size (~50K for GPT-2; often 100K or more in recent models). See explainer §3.4.

Section 5 — Backpropagation

Chain rule applied backward through the network:

∂L/∂W_1 = ∂L/∂ŷ · ∂ŷ/∂h_2 · ∂h_2/∂h_1 · ∂h_1/∂W_1

Update each weight in the direction opposite to its gradient:

W := W - α · ∂L/∂W

α = learning rate. Too high → unstable. Too low → slow.
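A toy example makes the learning-rate trade-off concrete. This sketch assumes the one-dimensional loss L(w) = w², whose gradient is 2w; the three α values are picked only to show each regime.

```python
def gradient_descent(alpha, w0=1.0, steps=50):
    """Minimize L(w) = w**2 with the update w := w - alpha * dL/dw."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # dL/dw for L(w) = w**2
        w = w - alpha * grad
    return w

w_good = gradient_descent(alpha=0.1)     # shrinks toward the minimum at 0
w_high = gradient_descent(alpha=1.1)     # overshoots further on every step
w_low = gradient_descent(alpha=0.001)    # barely moves in 50 steps

print(w_good, w_high, w_low)
```

For this loss each step multiplies w by (1 − 2α), so any α above 1 makes |w| grow instead of shrink. Real loss surfaces have no such closed form, but the same three regimes show up in training curves.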

Mini-batch SGD

| Method | Inputs per step | Notes |
| --- | --- | --- |
| Full-batch GD | all | Slow, no noise |
| Stochastic GD | 1 | Fast, very noisy |
| Mini-batch GD | 32–256 | Default. Balance speed and noise |

Vocabulary:

  • Step / iteration = one gradient update on one batch
  • Epoch = one full pass over the training set
  • Mini-batch noise helps the optimizer escape shallow minima and settle in flat ones
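The step and epoch vocabulary maps onto a simple loop. The sketch below counts steps for an assumed dataset of 1,000 examples and a batch size of 64; `minibatches` is a hypothetical helper and the gradient update itself is left out.

```python
import math
import random

def minibatches(n_examples, batch_size, rng):
    """Yield shuffled index batches that together cover the dataset once."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
n_examples, batch_size, n_epochs = 1000, 64, 3   # illustrative sizes

steps = 0
seen_in_last_epoch = []
for epoch in range(n_epochs):
    seen_in_last_epoch = []
    for batch in minibatches(n_examples, batch_size, rng):
        steps += 1                       # one step = one update on one batch
        seen_in_last_epoch.extend(batch)

steps_per_epoch = math.ceil(n_examples / batch_size)
print(steps_per_epoch, steps)
```

Each epoch takes ceil(1000 / 64) = 16 steps, with the last batch smaller than the rest, so three epochs give 48 steps in total.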

See explainer §3.3.

Section 6 — Activation functions

| Function | Formula | When |
| --- | --- | --- |
| ReLU | max(0, x) | Default for hidden layers; cheap; suffers dead-neuron problem |
| GELU | x · Φ(x) | Smoother ReLU; default in transformers |
| SiLU / Swish | x · σ(x) | Used in Llama family |
| sigmoid | 1 / (1 + e^(−x)) | Output for binary classification; saturates at extremes |
| tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | RNNs; bounded −1 to 1 |
| softmax | e^(x_i) / Σ_j e^(x_j) | Multi-class output; turns logits into probabilities |

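GELU is the default in BERT- and GPT-style transformers: it is smooth (better-behaved gradients near zero) and slightly outperforms ReLU empirically. Llama-family models use SiLU instead, inside a gated (SwiGLU) feed-forward block.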

Vanishing gradient. Sigmoid's max derivative is 0.25; multiplied through 10 layers → 0.25¹⁰ ≈ 10⁻⁶. ReLU's derivative is 1 in active region → 1¹⁰ = 1. See explainer §4.1.
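The arithmetic can be reproduced in a few lines. This sketch multiplies only the activation derivatives across 10 layers and ignores the weight matrices, which is a simplifying assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 in the active region, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 10
sigmoid_best_case = sigmoid_grad(0.0) ** depth   # every layer at its maximum slope
relu_active = relu_grad(1.0) ** depth            # every layer in the active region

print(sigmoid_grad(0.0), sigmoid_best_case, relu_active)
```

Even in the best case for sigmoid, with every unit sitting exactly at x = 0, the product is already about a millionth; away from zero the derivative is smaller still.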

Section 7 — Optimizers (beyond vanilla GD)

| Optimizer | What it adds |
| --- | --- |
| SGD with momentum | Running avg of past gradients; smooths updates |
| Adam | Per-parameter adaptive learning rates from first/second moment estimates |
| AdamW | Adam + decoupled weight decay; default for transformers |

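The decoupling matters: when weight decay is implemented in Adam as an L2 term added to the gradient, the decay gets divided by the same adaptive per-parameter scale as the gradient, so weights with large gradients are barely regularized. AdamW applies the decay directly to the weights, outside the adaptive update. See explainer §4.2.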
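A single scalar update shows the difference. The sketch below is a simplified one-parameter version of the Adam update, with the hyperparameters, the gradient value, and the `decoupled` flag all chosen for illustration; it measures how much the weight-decay setting changes the first step under each scheme.

```python
import math

def adam_step(w, grad, m, v, t, lr, wd, decoupled,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight; returns (new_w, m, v)."""
    if not decoupled:
        grad = grad + wd * w              # L2 style: decay enters the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w              # AdamW style: decay applied to the weight
    return w_new, m, v

def decay_effect(decoupled, grad, w=1.0, lr=0.01, wd=0.1):
    """How far weight decay moves the weight on the first step."""
    with_decay, _, _ = adam_step(w, grad, 0.0, 0.0, 1, lr, wd, decoupled)
    without, _, _ = adam_step(w, grad, 0.0, 0.0, 1, lr, 0.0, decoupled)
    return without - with_decay

l2_effect = decay_effect(decoupled=False, grad=10.0)
adamw_effect = decay_effect(decoupled=True, grad=10.0)

print(l2_effect, adamw_effect)
```

On this first step the adaptive normalization cancels the gradient's magnitude, so the L2 term changes almost nothing, while the decoupled version shrinks the weight by lr · wd · w regardless of the gradient. Over many steps the L2 effect is not exactly zero, but it stays tied to the gradient scale in the same way.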

Section 8 — Overfitting and regularization

| Train loss | Val loss | Diagnosis |
| --- | --- | --- |
| high | high | Underfitting (model too small) |
| low | low | Good |
| low | high | Overfitting (memorizing) |

Regularization toolkit:

  • More data — preferred fix when possible
  • Dropout — randomly zero 10–50% of neurons each training step; forces redundancy
  • Weight decay — add λ·Σ W² to loss; prefer small weights (AdamW does this cleanly)
  • Early stopping — stop when validation loss starts rising
  • Data augmentation — random crops, flips, noise (vision/audio)
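Of the tools above, dropout is the easiest to show in code. The sketch below assumes the common "inverted dropout" convention, in which surviving activations are rescaled by 1 / (1 − p) during training so that nothing needs to change at evaluation time; the rescaling is an addition to the one-line description in the list, and the function signature is hypothetical.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop; rescale the survivors."""
    if not training or p_drop == 0.0:
        return list(activations)          # evaluation: pass through unchanged
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [1.0] * 10000                         # illustrative layer output

h_train = dropout(h, p_drop=0.3, rng=rng)
h_eval = dropout(h, p_drop=0.3, rng=rng, training=False)

zero_fraction = h_train.count(0.0) / len(h_train)
mean_train = sum(h_train) / len(h_train)

print(zero_fraction, mean_train)
```

Roughly 30% of the units are zeroed on each call, and the rescaling keeps the expected activation the same in training and evaluation.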

Double descent. Classical theory says more params than data → overfit. Empirically, past a critical point, generalization improves again. Modern LLMs are in this second regime. Open problem. See explainer §5.2.

Section 9 — Learning paradigms

| Paradigm | Data | Examples |
| --- | --- | --- |
| Supervised | Input + ground-truth label | Image classification, sentiment analysis |
| Self-supervised | Input only; create labels from input itself | Next-token prediction (LLMs), masked LM (BERT) |
| Contrastive | Pairs (similar, dissimilar) | CLIP, SimCLR (learns embeddings) |
| Reinforcement | Rewards from environment | Game playing, RLHF for LLMs |

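LLMs are pre-trained with self-supervised learning (next-token prediction). They are then post-trained with supervised learning (instruction tuning) and preference optimization: RLHF, which uses reinforcement learning, or DPO, which reaches a similar goal with a supervised-style loss.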

Section 10 — Scaling laws

Loss decreases as a power law in model size N, data D, compute C:

L(N) ≈ a · N^(−α) (similar for D, C; α typically 0.05–0.1)

Plotted on log-log axes, this is a straight line with slope −α. Improvement per doubling is predictable: each doubling of N multiplies the loss by the same factor, 2^(−α).
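The per-doubling behaviour follows directly from the formula. In this sketch the constant a = 10 and exponent α = 0.076 are illustrative assumptions within the quoted range, not fitted values.

```python
import math

def power_law_loss(n, a, alpha):
    """L(N) = a * N**(-alpha)."""
    return a * n ** (-alpha)

a, alpha = 10.0, 0.076                    # illustrative constants

loss_1b = power_law_loss(1e9, a, alpha)
loss_2b = power_law_loss(2e9, a, alpha)

ratio_per_doubling = loss_2b / loss_1b    # equals 2**(-alpha), independent of a and N
slope = (math.log(loss_2b) - math.log(loss_1b)) / (math.log(2e9) - math.log(1e9))

print(ratio_per_doubling, slope)
```

With α = 0.076, each doubling of N lowers the loss by about 5%, and the log-log slope recovers −α.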

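Chinchilla (Hoffmann 2022) corrected Kaplan, which had favored growing model size faster than data: for compute-optimal training, model size and training tokens should scale together, at roughly 20 tokens per parameter. GPT-3 had ~1.7 tokens per parameter. Chinchilla has ~20 and beat GPT-3.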
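The ratios are simple division. The sketch below uses the commonly reported sizes (GPT-3: 175B parameters and about 300B training tokens; Chinchilla: 70B parameters and about 1.4T tokens); these figures and the helper names are assumptions not stated in this glossary.

```python
def tokens_per_param(tokens, params):
    return tokens / params

def compute_optimal_tokens(params, ratio=20.0):
    """Rule-of-thumb token budget at roughly 20 tokens per parameter."""
    return ratio * params

gpt3_ratio = tokens_per_param(300e9, 175e9)          # commonly reported figures
chinchilla_ratio = tokens_per_param(1.4e12, 70e9)    # commonly reported figures

# Token budget the rule of thumb would suggest for a GPT-3-sized model.
gpt3_sized_budget = compute_optimal_tokens(175e9)

print(gpt3_ratio, chinchilla_ratio, gpt3_sized_budget)
```

By this rule of thumb a 175B-parameter model would call for about 3.5T tokens, more than ten times what GPT-3 is reported to have seen.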

Modern frontier models (Llama 3, etc.) push past Chinchilla — over-train small models because inference cost dominates. See explainer §5.1.

Reading list

  1. Karpathy "Neural Networks: Zero to Hero" — parts 1-2
  2. "Scaling Laws for Neural Language Models" (Kaplan 2020)
  3. "Training Compute-Optimal Large Language Models" (Chinchilla, Hoffmann 2022)
  4. 3Blue1Brown "Neural Networks" (3-part series)

Self-check

For full Q&A see explainer §6.3.

  1. Why do we need non-linear activation functions? (§2.2)
  2. Why does He init use 2/fan_in for ReLU? (§3.1)
  3. Cross-entropy vs MSE — when each, and why? (§3.4)
  4. Mini-batch vs full-batch vs single-example — tradeoffs? (§3.3)
  5. ReLU vs GELU — what's the difference, when each? (§4.1)
  6. AdamW vs Adam — what changed? (§4.2)
  7. Dropout — what does it actually do? (§5.2)
  8. Self-supervised learning: how do LLMs create their own labels? (§9)
  9. Chinchilla scaling: what did it correct from Kaplan? (§5.1)