03. The forward pass — one prediction, end to end¶

The cat-robot, computing. Every shape, every number, in the order they happen.

Built on 02-activation-functions.md. The bend is now in your hand. Let us see the rule pile run.

The mental model first — pipeline, not magic¶

See. A neural network is just a pipeline. Numbers go in one end. They flow through stages. A number comes out the other end.

Each stage is one layer. Inside a layer, many neurons compute in parallel — same input, different weights. That is one whole "rule pile slice" of the cat-robot.

Then a bend. Then the next slice.

   inputs              hidden layer              output
                       (parallel neurons)

   x ─┐              ┌──── neuron 1 ────┐
      │              │                  │
      ├──→ stretch ──┼──── neuron 2 ────┼──→ bend ──→ stretch ──→ ŷ
      │              │                  │
      └─             └──── neuron k ────┘

Stretch = matrix multiply. Bend = activation. Stretch again = next layer. That is it.

So one forward pass is one input vector entering and one prediction leaving. Inside, though, many neurons fire in parallel at every layer. The cat-robot is wide, not just deep.
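The pipeline can be sketched in a few lines of plain Python. This is a minimal sketch, not any library's API: the names `stretch`, `bend`, and `forward` are made up here, nested lists stand in for tensors, and the 2 → 3 → 1 weights are invented purely to show the flow.

```python
def stretch(W, b, x):
    """Linear stage: one weighted sum plus bias per neuron (one row of W each)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def bend(z):
    """Non-linear stage: ReLU applied element by element."""
    return [max(0.0, v) for v in z]


def forward(layers, x):
    """Stretch then bend for every layer except the last, which only stretches."""
    *hidden, last = layers
    for W, b in hidden:
        x = bend(stretch(W, b, x))
    W, b = last
    return stretch(W, b, x)


# Hypothetical 2 -> 3 -> 1 network with made-up weights.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]),  # hidden: 3 neurons
    ([[1.0, 1.0, -2.0]], [0.0]),                               # output: 1 neuron
]
print(forward(layers, [1.0, 1.0]))   # [0.0]
```

Each row of `W` is one neuron, so widening a layer means adding rows, and deepening the network means adding entries to `layers`.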

The toy network we will hand-compute¶

Smallest network that solves XOR. Two inputs. Two hidden neurons with ReLU. One output.

       x₁ ─────●─────────●─── h₁ ─────●
                ╲       ╱              ╲
                 ╲     ╱                ╲
                  ╲   ╱                  ●─── ŷ
                   ╲ ╱                  ╱
                    ╳                  ╱
                   ╱ ╲                ╱
                  ╱   ╲              ╱
                 ╱     ╲            ╱
       x₂ ─────●─────────●─── h₂ ──●

       input          hidden          output
       layer          layer           layer
       (2 nodes)      (2 ReLU)        (1 node)

Both inputs feed both hidden neurons. Both hidden neurons feed the output. This is a fully connected network.

We use the weights from chapter 2 of the explainer — the ones that solve XOR:

h₁ = ReLU(x₁ + x₂) — fires on OR
h₂ = ReLU(x₁ + x₂ − 1) — fires on AND
ŷ = h₁ − 2·h₂ — OR minus twice AND = XOR

In matrix form:

W₁ = [[1, 1],     b₁ = [0, −1]
      [1, 1]]     

W₂ = [1, −2]      b₂ = 0
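The three formulas translate almost word for word into code. A minimal sketch in plain Python; the function name `xor_net` is an illustration, not something from the explainer.

```python
def relu(v):
    return max(0, v)


def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # row 0 of W1 is [1, 1], b1[0] = 0   (OR-like)
    h2 = relu(x1 + x2 - 1)    # row 1 of W1 is [1, 1], b1[1] = -1  (AND)
    return h1 - 2 * h2        # W2 = [1, -2], b2 = 0


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))
```

The comments show where each number in W₁, b₁, W₂, b₂ ends up in the scalar formulas.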

Tensor shapes — say them out loud¶

Before any number, fix the shapes in your head. Yes?

| Symbol | Shape | Meaning |
|---|---|---|
| `x` | `(2,)` | one input vector, two features |
| `W₁` | `(2, 2)` | hidden weights, `(out_dim, in_dim)` |
| `b₁` | `(2,)` | hidden bias, one per hidden neuron |
| `z₁` | `(2,)` | hidden pre-activation = `W₁·x + b₁` |
| `a₁` | `(2,)` | hidden post-activation = `ReLU(z₁)` |
| `W₂` | `(1, 2)` | output weights |
| `b₂` | `(1,)` | output bias |
| `ŷ` | `(1,)` | final scalar prediction |

The shape rule for W·x: (out, in) · (in,) → (out,). Inner dimensions cancel.

Bias matches the output dimension of the layer, not the input. One bias per neuron, not per input.
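The shape rule can be checked mechanically before any arithmetic happens. A small sketch using nested lists in place of tensors; `shape` and `matvec_shape` are helper names invented for this example.

```python
def shape(t):
    """Shape of a nested list as a tuple, e.g. [[1, -2]] -> (1, 2)."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def matvec_shape(w_shape, x_shape):
    """Shape rule for W·x: (out, in) · (in,) -> (out,)."""
    out_dim, in_dim = w_shape
    if x_shape != (in_dim,):
        raise ValueError(f"inner dimensions differ: {w_shape} vs {x_shape}")
    return (out_dim,)


x = [0, 1]
W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]

z1_shape = matvec_shape(shape(W1), shape(x))   # (2,)
assert shape(b1) == z1_shape                   # bias matches the output dimension
y_shape = matvec_shape(shape(W2), z1_shape)    # (1,)
assert shape(b2) == y_shape
print(z1_shape, y_shape)                       # (2,) (1,)
```

If the inner dimensions disagree, `matvec_shape` raises instead of silently producing a wrong-sized result.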

Two views of one layer — neuron and matrix¶

The same computation, two ways to picture it. Pick whichever feels native.

Per-neuron view (algebra-friendly)¶

Each hidden neuron is its own dot product:

z₁[0] = w₁[0,0]·x[0] + w₁[0,1]·x[1] + b₁[0]
z₁[1] = w₁[1,0]·x[0] + w₁[1,1]·x[1] + b₁[1]

Then apply the bend, neuron by neuron:

a₁[0] = ReLU(z₁[0])
a₁[1] = ReLU(z₁[1])

This is the rule pile from the ELI5 — each rule is one weighted sum, then a bend.
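Written as code, the per-neuron view is just those four lines with numbers plugged in. A minimal sketch for the input x = (1, 1), using the toy weights; the variable names are chosen only to mirror the algebra.

```python
def relu(v):
    return max(0, v)


W1 = [[1, 1], [1, 1]]
b1 = [0, -1]
x = [1, 1]

# One dot product per hidden neuron, written out term by term.
z1_0 = W1[0][0] * x[0] + W1[0][1] * x[1] + b1[0]
z1_1 = W1[1][0] * x[0] + W1[1][1] * x[1] + b1[1]

# The bend, neuron by neuron.
a1_0 = relu(z1_0)
a1_1 = relu(z1_1)

print([z1_0, z1_1], [a1_0, a1_1])   # [2, 1] [2, 1]
```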

Matrix view (GPU-friendly)¶

Same thing, all neurons at once:

z₁ = W₁ · x + b₁         # (2,2)·(2,) + (2,) = (2,)
a₁ = ReLU(z₁)            # element-wise bend, shape unchanged

One matmul replaces the whole loop. GPUs love this. PyTorch's nn.Linear(2, 2) computes x·Wᵀ + b with W stored as (out, in), which for a single input vector is the same W·x + b. The library does not care which view you imagine; the hardware sees only the matmul.
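The matrix view can be sketched with one generic `matvec` that works for any (out, in) shape. Plain Python lists are used here so the sketch runs without dependencies; a real library would hand the same operation to an optimised matmul routine. The helper names are illustrative.

```python
def matvec(W, x):
    """W·x for W of shape (out, in) and x of shape (in,)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def relu(z):
    return [max(0, v) for v in z]


W1 = [[1, 1], [1, 1]]
b1 = [0, -1]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    z1 = add(matvec(W1, x), b1)    # all hidden neurons at once
    a1 = relu(z1)                  # element-wise bend, shape unchanged
    print(x, z1, a1)
```

No neuron index appears in the loop body: the per-neuron bookkeeping has moved inside `matvec`.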

Walk all four XOR inputs through the pipe¶

Now we run the cat-robot four times. Same weights, four different inputs. Show every intermediate.

Input `x = (0, 0)`¶

z₁ = W₁·x + b₁
   = [[1,1],[1,1]] · [0,0] + [0,−1]
   = [0, 0] + [0, −1]
   = [0, −1]

a₁ = ReLU([0, −1])  = [0, 0]

ŷ  = W₂·a₁ + b₂
   = [1, −2]·[0, 0] + 0
   = 0

Input `x = (0, 1)`¶

z₁ = [[1,1],[1,1]]·[0,1] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0])               = [1, 0]
ŷ  = [1, −2]·[1, 0] + 0         = 1

Input `x = (1, 0)`¶

z₁ = [[1,1],[1,1]]·[1,0] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0])               = [1, 0]
ŷ  = [1, −2]·[1, 0] + 0         = 1

Input `x = (1, 1)`¶

z₁ = [[1,1],[1,1]]·[1,1] + [0,−1] = [2, 1]
a₁ = ReLU([2, 1])               = [2, 1]
ŷ  = [1, −2]·[2, 1] + 0         = 2 − 2 = 0

The full table¶

| `x₁` | `x₂` | `z₁` | `a₁` | `ŷ` | XOR target |
|---|---|---|---|---|---|
| 0 | 0 | (0, −1) | (0, 0) | 0 | 0 ✓ |
| 0 | 1 | (1, 0) | (1, 0) | 1 | 1 ✓ |
| 1 | 0 | (1, 0) | (1, 0) | 1 | 1 ✓ |
| 1 | 1 | (2, 1) | (2, 1) | 0 | 0 ✓ |
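The table can be reproduced by running all four inputs through the same two layers. A minimal sketch in plain Python with the toy weights; `layer` and `forward` are names chosen for this example, and Python's `^` operator serves as the XOR reference.

```python
def layer(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(z):
    return [max(0, v) for v in z]


W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]


def forward(x):
    z1 = layer(W1, b1, x)        # hidden pre-activation
    a1 = relu(z1)                # hidden post-activation
    y = layer(W2, b2, a1)[0]     # scalar prediction
    return z1, a1, y


rows = []
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    z1, a1, y = forward(x)
    target = x[0] ^ x[1]
    rows.append((x, z1, a1, y, target))
    print(x, z1, a1, y, target, "ok" if y == target else "MISMATCH")
```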

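Four for four. The bend does its work at (0,0): h₂'s pre-activation is −1 there, and ReLU clamps it to 0, so h₂ stays silent until both inputs are on. Without the bend, the rule pile collapses into a single linear map, like chapter 1. With these weights that map outputs 2 at (0,0) instead of 0, and no single linear map can hit all four XOR targets.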
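The collapse can be made concrete by merging the two stretches by hand. With no bend, W₂·(W₁·x + b₁) + b₂ equals (W₂·W₁)·x + (W₂·b₁ + b₂), which is a single linear layer. A sketch with the toy weights; `w_eff` and `b_eff` are names introduced here.

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [1, -2], 0

# Merge the two linear layers into one.
w_eff = [sum(W2[k] * W1[k][j] for k in range(2)) for j in range(2)]
b_eff = sum(W2[k] * b1[k] for k in range(2)) + b2


def linear_only(x):
    return sum(w * xi for w, xi in zip(w_eff, x)) + b_eff


print(w_eff, b_eff)   # [-1, -1] 2
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, linear_only(x), "target", x[0] ^ x[1])
```

With these particular weights the merged map is ŷ = 2 − x₁ − x₂, which misses the target at (0,0). Other weights would move the miss elsewhere, but a single linear map can never match all four XOR targets.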

This is the cat-robot solving its first non-trivial problem. Wide enough to detect both OR and AND. Bent enough to subtract them cleanly.

Pause and recall. Without scrolling — what are the tensor shapes at each layer? Where does the bend go?

Where this lives in the wild¶

The forward pass is the most-shipped operation in machine learning. Some specific places:

PyTorch nn.Sequential. Sequential(Linear(2,2), ReLU(), Linear(2,1)) literally encodes our toy network. Each __call__ runs one forward pass, layer by layer, in registration order.
JAX vmap. Wraps a single-input forward function and vectorises it across a batch dimension automatically. No code change to handle (B, 2) instead of (2,). Same matmul, bigger leading axis.
ONNX inference graph. Exports the forward pass as a frozen DAG of ops (for our toy network, something like MatMul, Add, Relu, MatMul, Add). ONNX Runtime then schedules these on CPU, CUDA, or NPU without any framework code.
vLLM forward kernels. For each generated token, vLLM runs one transformer forward pass. Attention runs in custom CUDA kernels (PagedAttention, FlashAttention), and the same shape arithmetic you just did is repeated in every layer for every token.
llama.cpp matmul kernels. Hand-tuned CPU/Metal matmuls for W·x + b. The whole library is essentially "do the forward pass fast enough on a laptop". Same (out, in)·(in,) → (out,) rule, written in SIMD.
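The "layer by layer, in registration order" idea behind a sequential container can be imitated in plain Python. This is a toy stand-in, not PyTorch's implementation: the class names `Linear`, `ReLU`, and `Sequential` only echo the first bullet above, and the constructors here take explicit weights rather than dimensions.

```python
class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b          # W: (out, in), b: (out,)

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + bi
                for row, bi in zip(self.W, self.b)]


class ReLU:
    def __call__(self, z):
        return [max(0, v) for v in z]


class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:      # run in registration order
            x = layer(x)
        return x


net = Sequential(
    Linear([[1, 1], [1, 1]], [0, -1]),
    ReLU(),
    Linear([[1, -2]], [0]),
)
print([net(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])])   # [[0], [1], [1], [0]]
```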

Different stacks. One operation. Stretch, bend, stretch.

Q&A¶

Q: Where exactly does the bend go — before or after the matmul?
A: After. The order is matmul → add bias → activation. The pre-activation z is the linear part. The post-activation a is what the next layer sees as input. If you put ReLU before the matmul, you have just ReLU'd the raw input. For inputs in {0, 1} that is a no-op, so nothing non-linear sits between the two stretches and they merge into one linear map.
Common wrong answer to avoid: "doesn't matter, math is symmetric". It is not. The bend is the only non-linear step. Without a bend between the stretches, the network is as limited as a single linear layer, and no single linear layer can solve XOR.
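The difference in order is easy to test on the toy network. A minimal sketch comparing the two placements; `correct_order` and `bend_first` are names invented for this comparison.

```python
def stretch(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(z):
    return [max(0, v) for v in z]


W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]


def correct_order(x):
    return stretch(W2, b2, relu(stretch(W1, b1, x)))[0]   # stretch -> bend -> stretch


def bend_first(x):
    return stretch(W2, b2, stretch(W1, b1, relu(x)))[0]   # bend -> stretch -> stretch


inputs = ([0, 0], [0, 1], [1, 0], [1, 1])
print([correct_order(x) for x in inputs])   # [0, 1, 1, 0]
print([bend_first(x) for x in inputs])      # [2, 1, 1, 0]
```

Same weights, same operations, different order. Only the first version gets (0,0) right.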

Q: What is the shape of the bias term and why?
A: One scalar per output neuron, shape (out_dim,). Each neuron has its own learnable offset. Bias does not depend on the input; it shifts the neuron's threshold up or down. It is added to W·x, which has shape (out_dim,), so a bias of shape (in_dim,) would not even line up unless the two dimensions happened to be equal.
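The toy network hides this because in_dim and out_dim are both 2. A hypothetical layer with 3 inputs and 2 neurons makes the mismatch visible. The weights below are made up, and the explicit length checks are added for illustration.

```python
def linear(W, b, x):
    """W: (out, in), x: (in,), b: (out,). Returns z of shape (out,)."""
    out_dim, in_dim = len(W), len(W[0])
    if len(x) != in_dim:
        raise ValueError(f"x has {len(x)} entries, expected in_dim={in_dim}")
    if len(b) != out_dim:
        raise ValueError(f"b has {len(b)} entries, expected out_dim={out_dim}")
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


# Hypothetical layer: 3 inputs, 2 neurons, so in_dim != out_dim.
W = [[1, 0, -1],
     [2, 1, 0]]
x = [1, 2, 3]

print(linear(W, [10, 20], x))     # one offset per neuron: [8, 24]

try:
    linear(W, [10, 20, 30], x)    # bias sized like the input
except ValueError as err:
    print("rejected:", err)
```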

Q: Why is batched forward a matrix-matrix multiply, not matrix-vector?
A: Stack B input vectors into a matrix X of shape (B, in_dim). Then Z = X · Wᵀ + b is shape (B, out_dim), with b broadcast across the rows: one matrix-matrix multiply that processes all B examples in parallel. Same arithmetic per example, but GPUs run one large matmul far more efficiently than many small ones, so batching costs little extra time until memory or compute saturates.
Common wrong answer to avoid: "batching changes what the network computes". It does not. Each row of Z is identical to what you would get from B separate forward passes. Only the layout changed.
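That claim can be checked directly: compute Z = X · Wᵀ + b for the four XOR inputs stacked as rows, then compare against four separate matrix-vector passes. A sketch in plain Python; `forward_one`, `forward_batch`, and `transpose` are helper names for this example.

```python
def forward_one(W, b, x):
    """Matrix-vector: (out, in) · (in,) + (out,) -> (out,)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def transpose(M):
    return [list(col) for col in zip(*M)]


def forward_batch(W, b, X):
    """Matrix-matrix: Z = X · Wᵀ + b, (B, in) · (in, out) + (out,) -> (B, out)."""
    Wt = transpose(W)                      # (in, out)
    out_dim = len(W)
    return [
        [sum(x_row[k] * Wt[k][j] for k in range(len(x_row))) + b[j]
         for j in range(out_dim)]
        for x_row in X
    ]


W1, b1 = [[1, 1], [1, 1]], [0, -1]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]       # B = 4 examples stacked as rows

Z = forward_batch(W1, b1, X)
print(Z)                                    # [[0, -1], [1, 0], [1, 0], [2, 1]]
assert Z == [forward_one(W1, b1, x) for x in X]
```

The bias `b[j]` is added to every row, which is what broadcasting does in array libraries.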

Q: If a₁ for input (0,0) is (0, 0), are both hidden neurons "dead"?
A: For this one input, yes — both contributed zero. Across the four inputs they are not dead; h₁ fires on three of them and h₂ fires on one. A truly dead neuron is one whose pre-activation is negative for every training input, so ReLU outputs zero always and gradient never flows. That is the dying-ReLU problem from chapter 4 of the explainer.
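Counting how often each hidden neuron fires across the dataset gives a quick dead-neuron check. A minimal sketch over the four XOR inputs; the names `fire_counts` and `dead` are illustrative, and "dead" here means only "never fires on these inputs".

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]

fire_counts = [0, 0]
for x in inputs:
    for n in range(2):
        z = sum(w * xi for w, xi in zip(W1[n], x)) + b1[n]
        if z > 0:                 # ReLU outputs a non-zero value only when z > 0
            fire_counts[n] += 1

print(fire_counts)                # [3, 1]
dead = [count == 0 for count in fire_counts]
print(dead)                       # [False, False]
```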

Apply now (5 min)¶

Take a piece of paper. Pick x = (1, 1). Without looking back, compute z₁, a₁, and ŷ using the toy weights. Write each shape next to each value. If your ŷ is not 0, find the slip.

Then, still without looking: sketch the network. Two input nodes. Two hidden ReLU nodes. One output node. Arrows showing every connection. Label W₁ and W₂ on the edges they weight, and b₁ and b₂ on the neurons they shift.

If you can do both in under 4 minutes, the forward pass is in your hands.

Bridge. We picked the weights by hand. Real networks find them by training. But training needs a starting point — and zero weights collapse the rule pile, large weights explode the forward pass on layer one. Where does the rule pile actually start? See 04-weight-initialization.md.