01. Week 1 — Overview of LLMs, Neural Networks Foundations

Key concepts to master

  • Forward pass: input → linear transform → activation → output
  • Weight initialization: Xavier (sigmoid/tanh), He (ReLU); why init scale matters
  • Backprop: chain rule applied backward through layers
  • Mini-batch SGD: batches, epochs, why noise helps generalization
  • Loss functions: cross-entropy + softmax for classification (clean gradient p - y with respect to the logits); MSE for regression
  • Activation functions: ReLU/GELU; dead-neuron problem; vanishing gradient diagnosis
  • Optimizers: SGD with momentum → Adam → AdamW (decoupled weight decay)
  • Regularization: dropout, weight decay, early stopping; double descent as a related generalization phenomenon
  • Learning paradigms: self-supervised vs supervised vs reinforcement learning (RL)
  • Scaling laws (power laws, which plot as straight lines on log-log axes) + Chinchilla ratio (~20 tokens/parameter)
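
The "clean gradient p - y" claim in the loss-functions bullet can be checked numerically. The sketch below is illustrative only: the logits and label are made-up values, and the helper names `softmax` and `cross_entropy` are hypothetical, not taken from the explainer. It compares the analytic gradient with central finite differences.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # z holds raw logits, y is a one-hot target vector.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])   # hypothetical logits
y = np.array([0.0, 0.0, 1.0])    # hypothetical one-hot label

analytic = softmax(z) - y        # the claimed gradient p - y

# Central finite differences as an independent check on each logit.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two gradients should agree to within floating-point noise, which is why the softmax and cross-entropy pair is usually differentiated as a single unit.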

🧠 Mental models

  • Forward pass: "An assembly line that keeps transforming raw input into a decision."
  • Weight initialization: "Set the starting loudness so signals neither explode nor whisper away."
  • Backpropagation: "A blame-assignment chain that sends error responsibility backward."
  • Activation functions: "Gates that bend straight lines into expressive shapes."
  • Optimizers: "SGD follows the slope; momentum adds a flywheel; Adam adds per-parameter speed limits; AdamW applies weight decay separately from them."
  • Scaling laws: "Bigger models and datasets follow diminishing-returns power curves, not magic step changes."
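
To make the optimizer mental model concrete, here is a minimal sketch of single update steps for SGD with momentum and for AdamW. The function names and default hyperparameters are assumptions chosen for illustration; the update rules follow the commonly published forms, with the weight-decay term applied directly to the weights rather than folded into the gradient.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # The velocity is a running sum of gradients: the "flywheel".
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Running averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction for the zero-initialized averages (t starts at 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay: shrink the weights directly, outside the adaptive step.
    w = w - lr * weight_decay * w
    # Adaptive step: each parameter is scaled by its own gradient history.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w0 = np.array([1.0, 1.0])
zeros = np.zeros(2)

# Zero gradient: the adaptive part does nothing, but decay still shrinks the weights.
w_decay_only, _, _ = adamw_step(w0, zeros, zeros, zeros, t=1)

# Very different gradient magnitudes, no decay: the first step is about lr for both.
w_first, _, _ = adamw_step(w0, np.array([100.0, 0.01]), zeros, zeros, t=1,
                           weight_decay=0.0)

w_mom, vel = sgd_momentum_step(0.0, 1.0, 0.0)
print(w_decay_only, w_first, w_mom)
```

The second call shows the "speed limit" idea: gradients of 100 and 0.01 produce nearly the same step size, because each is normalized by its own magnitude.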

⚠️ Common traps

  • Initializing all weights the same or too small, so neurons learn identical or vanishing features.
  • Using MSE for classification and getting weaker gradients than cross-entropy with softmax.
  • Picking a learning rate that is technically stable but too small to make real progress.
  • Killing ReLUs with aggressive learning rates or poor initialization, then blaming model capacity.
  • Treating Adam and AdamW as interchangeable even though weight decay is handled differently.
  • Ignoring the batch-size/noise tradeoff when comparing optimization runs.
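
The first trap (identical initial weights) can be seen directly in the gradients. This is a small sketch on made-up random data, with a hypothetical two-layer tanh network and MSE loss chosen only to keep the backward pass short. It checks whether every hidden unit receives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # 8 hypothetical examples, 4 features
y = rng.normal(size=(8, 1))          # hypothetical regression targets

def hidden_weight_grad(W1, W2):
    # Two-layer net: h = tanh(x W1), pred = h W2, loss = mean squared error.
    h = np.tanh(x @ W1)
    pred = h @ W2
    d_pred = 2 * (pred - y) / len(x)
    d_h = d_pred @ W2.T
    d_pre = d_h * (1 - h**2)         # derivative of tanh
    return x.T @ d_pre               # gradient w.r.t. W1, one column per hidden unit

# Every weight set to the same constant: all hidden units are clones.
g_const = hidden_weight_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5))

# Random init breaks the symmetry.
g_rand = hidden_weight_grad(rng.normal(scale=0.5, size=(4, 3)),
                            rng.normal(scale=0.5, size=(3, 1)))

columns_identical_const = bool(np.allclose(g_const, g_const[:, :1]))
columns_identical_rand = bool(np.allclose(g_rand, g_rand[:, :1]))
print(columns_identical_const, columns_identical_rand)
```

With constant initialization every column of the gradient matches, so the hidden units stay identical after each update and the layer behaves like a single neuron.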

🔗 Prerequisites & connections

  • Builds on: Module 00 ideas about loss minimization, regularization, train/validation discipline, and linear models as weighted-sum predictors.
  • Feeds into: embedding learning, attention blocks, transformer optimization, and scaling-law reasoning in later LLM modules.

💬 Interview phrasing

  • "Why can't we just stack linear layers without activations?"
  • "Walk me through backprop on a two-layer network — what is the chain rule doing?"
  • "Why is He initialization paired with ReLU, and Xavier with tanh or sigmoid?"
  • "AdamW vs Adam — what changed mathematically, and why do people care?"
  • "Why do scaling-law papers talk about compute-optimal token/parameter ratios?"
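
For the first interview question, a short numerical check can back up the verbal answer. The sketch below uses arbitrary random matrices (all shapes are assumptions) to show that two stacked linear layers equal one linear layer, and that inserting a ReLU breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two linear layers applied in sequence, no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

collapses = bool(np.allclose(two_layers, one_layer))

# With a ReLU between the layers, no single (W, b) reproduces the output in general.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)
print(collapses, relu_differs)
```

The algebra behind it is (x W1 + b1) W2 + b2 = x (W1 W2) + (b1 W2 + b2), so depth without non-linearity adds parameters but no expressive power.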

⏱️ Difficulty markers

  • 🟢 forward pass
  • 🟢 loss functions
  • 🟡 mini-batch SGD
  • 🟡 activation behavior
  • 🔴 backpropagation
  • 🔴 weight initialization
  • 🔴 scaling laws

Self-check questions

For full Q&A bank with common-wrong-answer notes, see explainer §6.3.

  1. Why do we need non-linear activation functions? (explainer §2.2)
  2. Why does He init use weight variance 2/fan_in for ReLU? (§3.1)
  3. Cross-entropy vs MSE — when each, and why? (§3.4)
  4. Why mini-batch instead of full-batch or single-example? (§3.3)
  5. ReLU vs GELU — what's the difference? (§4.1, §4.2)
  6. AdamW vs Adam — what changed? (§4.2)
  7. Dropout — what does it do mechanically? (§5.2)
  8. Chinchilla scaling — what did it correct from Kaplan? (§5.1)
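
Question 2 can be explored empirically. The following sketch is an illustration under stated assumptions: a stack of equal-width ReLU layers, no biases, Gaussian weights, and a hypothetical helper `second_moment_after`. It compares weight variance 2/fan_in with 1/fan_in and reports the mean squared activation after ten layers.

```python
import numpy as np

def second_moment_after(depth, width, variance_numerator, seed=0):
    # Push random inputs through `depth` ReLU layers and return mean(a**2).
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(256, width))
    for _ in range(depth):
        std = np.sqrt(variance_numerator / width)   # fan_in equals width here
        W = rng.normal(scale=std, size=(width, width))
        a = np.maximum(a @ W, 0.0)
    return float(np.mean(a**2))

# Variance 2/fan_in: ReLU zeroes about half the signal, and the factor 2 compensates.
he_scale = second_moment_after(depth=10, width=512, variance_numerator=2.0)

# Variance 1/fan_in: the signal roughly halves at every layer.
small_scale = second_moment_after(depth=10, width=512, variance_numerator=1.0)
print(he_scale, small_scale)
```

Under these assumptions the 2/fan_in run should stay near its starting scale while the 1/fan_in run shrinks by roughly a factor of two per layer, which is the "whisper away" failure from the mental models.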

Health check

  • [ ] Read all 6 chapters of explainer
  • [ ] MNIST repo public with >95% accuracy
  • [ ] LinkedIn post #1 published
  • [ ] All daily-recall questions answered from memory
  • [ ] Failure-fix table from §6.1 sketched without looking
  • [ ] Applications: 5+ to top S-tier