00. Neural networks in kid words — the cat-spotting robot

Read this first. The whole module calls back to this picture. Five minutes.

The setup

You want a robot that spots cats in photos. You give it pixels in. You want a yes/no out.

How? You write rules.

One rule is not enough

You write one rule. "If this pixel looks fluffy, say cat." This is one perceptron.

Too dumb. Many things are fluffy. A blanket is fluffy. A dog is fluffy. The robot says "cat" for everything fluffy. Mostly wrong.
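Here is one rule as a tiny sketch. The fluffiness scores, the weights, and the names `one_rule`, `weights`, and `bias` are all made up for illustration. The point is only that a blanket passes the same test a cat does.

```python
def one_rule(pixels, weights, bias):
    """One perceptron: weighted sum of the inputs, then a yes/no threshold."""
    total = sum(p * w for p, w in zip(pixels, weights)) + bias
    return total > 0  # True means "cat"

# Hypothetical fluffiness scores for three pixels (0 = smooth, 1 = very fluffy).
cat = [0.9, 0.8, 0.9]
blanket = [0.9, 0.9, 0.8]
brick = [0.1, 0.0, 0.1]

weights = [1.0, 1.0, 1.0]  # assumed: every pixel votes equally
bias = -1.5                # assumed: needs enough total fluff to say "cat"

cat_answer = one_rule(cat, weights, bias)
blanket_answer = one_rule(blanket, weights, bias)
brick_answer = one_rule(brick, weights, bias)
print(cat_answer, blanket_answer, brick_answer)
```

The rule says "cat" for the cat and for the blanket. No choice of weights fixes this if fluff is all the rule can see.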

So you write many rules and stack them up.

The rule pile (and why it collapses)

Stack one rule on another. Call this the rule pile.

But here is the catch. If every rule is straight (just adding pixels with weights), the whole rule pile collapses into one big straight rule. A weighted sum of weighted sums is still a weighted sum. No improvement, even with a thousand layers. Algebra eats your work.

This is the first failure. We come back to it constantly. Whenever the doc says "the rule pile collapses", remember — straight rules cannot stack into curves on their own.
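The collapse can be checked by hand. The sketch below stacks two straight rules, then folds them into one, and compares the answers. The numbers are arbitrary and the helper names `straight_rule` and `merge` are illustrative only.

```python
def straight_rule(weights, bias, inputs):
    """One straight layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def merge(w2, b2, w1, b1):
    """Fold layer 1 followed by layer 2 into a single equivalent straight layer."""
    merged_w = [[sum(w2[i][k] * w1[k][j] for k in range(len(w1)))
                 for j in range(len(w1[0]))]
                for i in range(len(w2))]
    merged_b = [sum(w2[i][k] * b1[k] for k in range(len(b1))) + b2[i]
                for i in range(len(w2))]
    return merged_w, merged_b

# Arbitrary example numbers, picked as exact binary fractions so floats compare exactly.
w1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, 0.0]
w2, b2 = [[2.0, -1.0]], [0.25]
pixels = [3.0, 4.0]

stacked = straight_rule(w2, b2, straight_rule(w1, b1, pixels))
merged_w, merged_b = merge(w2, b2, w1, b1)
single = straight_rule(merged_w, merged_b, pixels)
print(stacked, single)
```

Both give the same number. The second layer bought nothing that one layer with the merged weights could not already do.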

The bend

So we put a tiny bend between each rule. A bend is just a small non-linear curve. Now the rules cannot collapse into one. They build up into shapes — curves, corners, blobs.

The robot can now see shapes, not just lines.
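A small sketch of what the bend buys. XOR (say yes when exactly one of two inputs is on) is the classic job one straight rule cannot do. Two rules with a ReLU bend, plus one rule on top, can. The weights here are hand-picked for illustration, not learned, and `bent_pile` is a made-up name.

```python
def relu(x):
    return max(0.0, x)

def bent_pile(x1, x2):
    """Two straight rules, a ReLU bend on each, then one straight rule on top."""
    h1 = relu(x1 + x2)        # grows with how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2      # cancel out the "both on" case

xor_table = {(a, b): bent_pile(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

Remove the two `relu` calls and the output becomes `-(x1 + x2) + 2`, a straight rule again, which cannot match the XOR table.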

The bend has many flavors. ReLU is a sharp elbow. Sigmoid is a smooth S. Tanh is a sigmoid stretched to run from -1 to 1, so it is centered at zero. GELU is ReLU with the elbow smoothed out. Different bends behave differently when we train, but the role is the same: break the collapse.
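The four bends, written out as a sketch using only the Python standard library. The GELU here is the exact form (input times the standard normal CDF). Some libraries use a faster approximation instead.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Print each bend at a negative, zero, and positive input to compare shapes.
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3), round(tanh(x), 3), round(gelu(x), 3))
```

Note how ReLU and GELU pass large positive inputs through almost unchanged, while sigmoid and tanh flatten out. That difference matters once training starts.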

The nudge

Where do the rule numbers (weights) come from? The robot does not guess them all at once.

It guesses. Checks the answer. Then nudges every rule a tiny bit toward the right answer. Do this a million times. The robot gets very good.

Working out who to blame has a formal name, backpropagation, and it works by chain-rule calculus. Blame flows backward from the wrong answer to every rule that contributed. Then each rule shifts in the direction that would have made the answer less wrong. That shifting step is gradient descent.

Whenever the doc says "the nudge", remember — gradient flowing back, weights shifting toward less wrong.
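The nudge in its smallest form: one rule, one weight, one example. This is a sketch, assuming squared error as the measure of wrongness and a made-up learning rate of 0.1. The name `train_one_rule` is illustrative.

```python
def train_one_rule(x, target, weight=0.0, learning_rate=0.1, steps=50):
    """Fit prediction = weight * x to the target by repeated small nudges."""
    for _ in range(steps):
        prediction = weight * x      # guess
        error = prediction - target  # check the answer
        # Chain rule: d(error**2)/d(weight) = 2 * error * d(prediction)/d(weight)
        #                                   = 2 * error * x
        gradient = 2.0 * error * x
        weight -= learning_rate * gradient  # nudge toward less wrong
    return weight

learned = train_one_rule(x=2.0, target=6.0)
print(learned)  # settles near 3.0, since 3.0 * 2.0 = 6.0
```

A real network does the same thing for every weight at once, with the chain rule carrying the blame back through each layer.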

When the nudge dies

But "nudge a tiny bit" is harder than it sounds.

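If the rule pile is too deep, the nudges shrink to nothing before reaching the early rules. Each bend multiplies the backward-flowing nudge by its slope. A sigmoid's slope is never more than 0.25, so stacking twenty of them can shrink the nudge about a trillion-fold. The first layers stop learning. The deep robot becomes a shallow robot in disguise. This is the vanishing gradient.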

The fix — pick bends that keep nudges alive (ReLU, GELU). Add residual connections that let nudges skip past dead layers. Architectures stop being a wall and become highways.

Whenever the doc says "the nudge dies", remember — deep pile, sigmoid bend, no signal reaching the bottom. Ship a different bend.
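A rough sketch of the dying nudge. It assumes every weight is 1 and every layer sees the same input value, so the only thing shrinking the backward signal is the slope of the bend. Real networks are messier, but the multiplication is the same. The function names are illustrative.

```python
import math

def sigmoid_slope(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z is 0

def relu_slope(z):
    return 1.0 if z > 0 else 0.0

def nudge_at_first_layer(slope, depth, z=1.0):
    """Multiply the backward signal by one bend slope per layer (weights assumed to be 1)."""
    signal = 1.0
    for _ in range(depth):
        signal *= slope(z)
    return signal

sigmoid_signal = nudge_at_first_layer(sigmoid_slope, depth=20)
relu_signal = nudge_at_first_layer(relu_slope, depth=20)
print(sigmoid_signal, relu_signal)
```

After twenty sigmoid layers almost nothing is left. With ReLU on a positive input the signal arrives at full strength.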

Smart nudging

Plain nudging is also dumb. It treats every rule the same. It forgets which way it was just heading. It walks straight into walls.

Smart nudging — remember past direction (momentum), use different speeds for different rules (per-parameter learning rates), normalize by recent gradient size — is much better. This smart nudging has names: Adam, AdamW, Lion. They are not magical. They are just plain nudging plus memory.

Whenever the doc says "smart nudging", remember — Adam-family optimizer, momentum + per-parameter scaling.
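One Adam step for a single weight, as a sketch. The default numbers are the ones from the original Adam paper. The `state` dictionary and the name `adam_step` are illustrative, not any library's API.

```python
import math

def adam_step(weight, gradient, state, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single weight. `state` holds the running memory."""
    state["t"] += 1
    # Memory of past direction (momentum).
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * gradient
    # Memory of recent gradient size.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * gradient * gradient
    # Bias correction: both memories start at zero, so early values read too small.
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)

# Two weights whose gradients differ by a factor of a million.
state_big = {"t": 0, "m": 0.0, "v": 0.0}
state_small = {"t": 0, "m": 0.0, "v": 0.0}
step_big = 0.0 - adam_step(0.0, 1000.0, state_big)
step_small = 0.0 - adam_step(0.0, 0.001, state_small)
print(step_big, step_small)
```

Both weights move by about the learning rate, even though one gradient is a million times larger. That is the per-rule speed control. Plain nudging would have moved the first weight a million times further.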

Bigger pile, more photos

Finally — bigger rule pile + more cat photos = smarter robot.

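There is even a rule of thumb for the best ratio when compute is fixed. Roughly twenty photos per rule (Chinchilla scaling, which was measured on language models as about twenty training tokens per parameter). Fewer photos and the pile is too big for its data, so the robot leans on memorizing. More photos than the pile can absorb and a bigger pile would have used them better.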
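The ratio rule as arithmetic. This is a sketch of the rule of thumb only. The factor of 20 is approximate, and the function name is made up.

```python
TOKENS_PER_PARAMETER = 20  # rough Chinchilla rule of thumb, not an exact constant

def compute_optimal_tokens(parameters):
    """Approximate training tokens for a model with this many parameters."""
    return TOKENS_PER_PARAMETER * parameters

tokens_for_70b = compute_optimal_tokens(70_000_000_000)
print(f"{tokens_for_70b:,}")  # 1,400,000,000,000
```

A 70-billion-parameter pile wants about 1.4 trillion tokens, which matches how the Chinchilla model itself was trained.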

This is why frontier model labs treat data as a budget item alongside GPUs. Grow the pile without growing the photo set and much of the extra compute is wasted.

The whole chain in one breath

straight rule    → too dumb (XOR)
straight pile    → collapses into one straight rule
bent pile        → can shape curves, corners, blobs
nudge            → shifts every rule a tiny bit toward less wrong
deep nudge       → dies before reaching early rules
fix the bend     → ReLU keeps nudge alive
plain nudge      → walks into walls
smart nudge      → Adam remembers direction, scales per rule
big pile + data  → smarter robot, with a ratio rule

That is a neural network. Everything else is engineering on top.

The placeholders you will see called back

Whenever a later file in this module says one of these words, picture this robot:

Placeholder	Picture
Rule pile	Stack of perceptrons. Each one is a weighted sum of the layer below.
Bend	The non-linear function applied between layers — ReLU, sigmoid, tanh, GELU.
Nudge	The gradient update from backprop. Shifts a weight a tiny bit toward less wrong.
Smart nudge	Optimizer with memory (Adam, AdamW). Plain SGD plus momentum and per-parameter scaling.
Cat-robot	The whole robot. Inputs → rule pile with bends → output.

Every technical chapter in this module calls back to these by name. If a sentence feels abstract, return here and reread the cat-robot picture.

What's coming

The rest of the module is the failure-fix chain:

One rule fails XOR. Why a single perceptron cannot decide diagonals. → 01-xor-problem.md
Bends fix the collapse. ReLU, sigmoid, tanh, GELU. When to pick which.
The forward pass. How a stacked network actually computes.
Weight initialization. Where the rule pile starts. Zero fails. Random fails differently.
Loss functions. How wrongness is measured. MSE vs cross-entropy.
Backpropagation. The nudge, in detail. Chain rule made physical.
SGD, batches, epochs. Training mechanics that work in practice.
Vanishing gradients. The nudge dying in deep nets.
Optimizers. Smart nudging — momentum, Adam, AdamW.
Regularization. Dropout, weight decay, batch norm — keeping the pile honest.
Scaling laws. Why bigger works, and the data-to-parameters ratio.
Batch norm vs layer norm. Two normalization strategies — which axis, which use case, and a common interview gotcha.
Honest admission. What is still mysterious.

Each file is one piece. Each piece exists because the previous piece broke at something specific. Read in order.

Bridge. The first thing that breaks is XOR. One straight rule cannot decide it — and the geometry tells you why immediately. Read 01-xor-problem.md next.