03. The forward pass: one prediction, end to end
The cat-robot, computing. Every shape, every number, in the order they happen.
Built on 02-activation-functions.md. The bend is now in your hand. Let us see the rule pile run.
The mental model first: pipeline, not magic
See. A neural network is just a pipeline. Numbers go in one end. They flow through stages. A number comes out the other end.
Each stage is one layer. Inside a layer, many neurons compute in parallel — same input, different weights. That is one whole "rule pile slice" of the cat-robot.
Then a bend. Then the next slice.
inputs               hidden layer                         output
                  (parallel neurons)
x ─┐              ┌──── neuron 1 ────┐
   │              │                  │
   ├──→ stretch ──┼──── neuron 2 ────┼──→ bend ──→ stretch ──→ ŷ
   │              │                  │
   └─             └──── neuron k ────┘
Stretch = matrix multiply. Bend = activation. Stretch again = next layer. That is it.
So one forward pass is one input vector entering and one prediction leaving. Inside, though, many parallel neurons fire at every layer. The cat-robot is wide, not just deep.
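The stretch, bend, stretch pipeline can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the helper names `stretch`, `bend` and `forward` are made up for this sketch, and the weights are the XOR weights used in the hand computation later in this chapter.

```python
def stretch(W, b, x):
    """Linear stage: W·x + b, one weighted sum per neuron (one row of W)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def bend(z):
    """Activation stage: ReLU applied to every entry."""
    return [max(0.0, v) for v in z]

def forward(x, layers):
    """Run x through (W, b) pairs, bending after every layer except the last."""
    for i, (W, b) in enumerate(layers):
        x = stretch(W, b, x)
        if i < len(layers) - 1:
            x = bend(x)
    return x

# XOR weights from this chapter: hidden layer first, then output layer.
layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.0]),
]
print(forward([1.0, 0.0], layers))  # [1.0]
```

Leaving the last layer unbent is a choice made for this toy network, where the output is read off directly.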
The toy network we will hand-compute
Smallest network that solves XOR. Two inputs. Two hidden neurons with ReLU. One output.
x₁ ─────●─────────●─── h₁ ─────●
         ╲       ╱              ╲
          ╲     ╱                ╲
           ╲   ╱                  ●─── ŷ
            ╲ ╱                  ╱
             ╳                  ╱
            ╱ ╲                ╱
           ╱   ╲              ╱
          ╱     ╲            ╱
x₂ ─────●─────────●─── h₂ ──●
      input     hidden     output
      layer     layer      layer
    (2 nodes)  (2 ReLU)   (1 node)
Both inputs feed both hidden neurons. Both hidden neurons feed the output. This is a fully connected network.
We use the weights from chapter 2 of the explainer — the ones that solve XOR:
- h₁ = ReLU(x₁ + x₂): fires on OR
- h₂ = ReLU(x₁ + x₂ − 1): fires on AND
- ŷ = h₁ − 2·h₂: OR minus twice AND = XOR
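The three rules above translate almost word for word into code. The sketch below is only an illustration; the function names `relu` and `xor_net` are chosen here and do not come from the explainer.

```python
def relu(v):
    return max(0, v)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # positive whenever at least one input is 1 (OR)
    h2 = relu(x1 + x2 - 1)    # positive only when both inputs are 1 (AND)
    return h1 - 2 * h2        # OR minus twice AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```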
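In matrix form:

$$
W_1 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad
b_1 = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad
W_2 = \begin{bmatrix} 1 & -2 \end{bmatrix}, \quad
b_2 = 0
$$

$$
\hat{y} = W_2 \, \mathrm{ReLU}(W_1 x + b_1) + b_2
$$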
Tensor shapes: say them out loud
Before any number, fix the shapes in your head. Yes?
| Symbol | Shape | Meaning |
|---|---|---|
| x | (2,) | one input vector, two features |
| W₁ | (2, 2) | hidden weights, laid out as (out_dim, in_dim) |
| b₁ | (2,) | hidden bias, one per hidden neuron |
| z₁ | (2,) | hidden pre-activation = W₁·x + b₁ |
| a₁ | (2,) | hidden post-activation = ReLU(z₁) |
| W₂ | (1, 2) | output weights |
| b₂ | (1,) | output bias |
| ŷ | (1,) | final scalar prediction |
The shape rule for W·x: (out, in) · (in,) → (out,). Inner dimensions cancel.
Bias matches the output dimension of the layer, not the input. One bias per neuron, not per input.
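One way to make the shape rule concrete is to check it in code. The sketch below uses nested Python lists in place of tensors; `shape` and `linear` are hypothetical helpers written for this illustration, not functions from any library.

```python
def shape(t):
    """Return the shape of a nested list as a tuple, e.g. [[1, 2]] -> (1, 2)."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

def linear(W, b, x):
    """W·x + b with an explicit (out, in)·(in,) -> (out,) shape check."""
    out_dim, in_dim = shape(W)
    assert shape(x) == (in_dim,), "inner dimensions must match"
    assert shape(b) == (out_dim,), "one bias per output neuron"
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

x = [1, 1]
W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]

z1 = linear(W1, b1, x)
a1 = [max(0, v) for v in z1]
y_hat = linear(W2, b2, a1)

for name, t in [("x", x), ("W1", W1), ("b1", b1), ("z1", z1),
                ("a1", a1), ("W2", W2), ("b2", b2), ("y_hat", y_hat)]:
    print(name, shape(t))
```

Passing a bias of the wrong length, or an input whose length does not match the second dimension of W, trips one of the two assertions instead of silently producing a wrong answer.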
Two views of one layer: neuron and matrix
The same computation, two ways to picture it. Pick whichever feels native.
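Per-neuron view (algebra-friendly)
Each hidden neuron is its own dot product (writing $z_{1,j}$ for the j-th entry of $z_1$):

$$
z_{1,1} = 1 \cdot x_1 + 1 \cdot x_2 + 0, \qquad
z_{1,2} = 1 \cdot x_1 + 1 \cdot x_2 - 1
$$

Then apply the bend, neuron by neuron:

$$
h_1 = \mathrm{ReLU}(z_{1,1}), \qquad
h_2 = \mathrm{ReLU}(z_{1,2})
$$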
This is the rule pile from the ELI5 — each rule is one weighted sum, then a bend.
Matrix view (GPU-friendly)
Same thing, all neurons at once:

$$
z_1 = W_1 x + b_1, \qquad a_1 = \mathrm{ReLU}(z_1) = (h_1, h_2)
$$
One matmul replaces the whole loop. GPUs love this. PyTorch's nn.Linear(2, 2) is exactly W·x + b under the hood. The library does not care which view you imagine — the hardware sees only the matmul.
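To see that the two views agree, the sketch below computes the hidden pre-activation both ways and compares them. It is plain Python, so both versions are loops underneath; the point is only that the explicit per-neuron loop and the single W₁·x + b₁ expression give the same numbers. The function names are invented for this example.

```python
W1 = [[1, 1], [1, 1]]
b1 = [0, -1]

def hidden_per_neuron(x):
    """Per-neuron view: one explicit dot product per hidden neuron."""
    z = []
    for j in range(len(W1)):              # loop over neurons
        total = b1[j]
        for i in range(len(x)):           # loop over inputs
            total += W1[j][i] * x[i]
        z.append(total)
    return z

def matvec(W, x):
    """Matrix-vector product: (out, in)·(in,) -> (out,)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def hidden_matrix(x):
    """Matrix view: the whole layer as W1·x + b1."""
    return [wx + b for wx, b in zip(matvec(W1, x), b1)]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    assert hidden_per_neuron(x) == hidden_matrix(x)
    print(x, hidden_matrix(x))
```

In a real framework the matrix form is handed to an optimized matmul routine, which is where the speed difference comes from.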
Walk all four XOR inputs through the pipe
Now we run the cat-robot four times. Same weights, four different inputs. Show every intermediate.
Input x = (0, 0)
z₁ = W₁·x + b₁
= [[1,1],[1,1]] · [0,0] + [0,−1]
= [0, 0] + [0, −1]
= [0, −1]
a₁ = ReLU([0, −1]) = [0, 0]
ŷ = W₂·a₁ + b₂
= [1, −2]·[0, 0] + 0
= 0
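Input x = (0, 1)
z₁ = [[1,1],[1,1]]·[0,1] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0]) = [1, 0]
ŷ = [1, −2]·[1, 0] + 0 = 1 − 0 = 1
Input x = (1, 0)
z₁ = [[1,1],[1,1]]·[1,0] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0]) = [1, 0]
ŷ = [1, −2]·[1, 0] + 0 = 1 − 0 = 1
Input x = (1, 1)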
z₁ = [[1,1],[1,1]]·[1,1] + [0,−1] = [2, 1]
a₁ = ReLU([2, 1]) = [2, 1]
ŷ = [1, −2]·[2, 1] + 0 = 2 − 2 = 0
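The four hand computations can be replayed in code to check each intermediate. This is a small sketch in plain Python; `forward_trace` is a name introduced here for illustration.

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(z):
    return [max(0, v) for v in z]

def forward_trace(x):
    """Return every intermediate of one forward pass."""
    z1 = linear(W1, b1, x)
    a1 = relu(z1)
    y_hat = linear(W2, b2, a1)[0]   # unwrap the (1,) output to a scalar
    return z1, a1, y_hat

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    z1, a1, y_hat = forward_trace(x)
    target = x[0] ^ x[1]            # XOR of the two input bits
    print(x, z1, a1, y_hat, "ok" if y_hat == target else "MISMATCH")
```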
The full table
| x₁ | x₂ | z₁ | a₁ | ŷ | XOR target |
|---|---|---|---|---|---|
| 0 | 0 | (0, −1) | (0, 0) | 0 | 0 ✓ |
| 0 | 1 | (1, 0) | (1, 0) | 1 | 1 ✓ |
| 1 | 0 | (1, 0) | (1, 0) | 1 | 1 ✓ |
| 1 | 1 | (2, 1) | (2, 1) | 0 | 0 ✓ |
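Four for four. The bend at zero is what keeps h₂ silent at (0,0), where its pre-activation is −1. At (0,1) and (1,0) the pre-activation is exactly 0, and h₂ fires only at (1,1). Without the bend, the rule pile collapses to a single line, as in chapter 1: with these weights ŷ = 2 − x₁ − x₂, which outputs 2 at (0,0), and no single line can match all four XOR targets.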
This is the cat-robot solving its first non-trivial problem. Wide enough to detect both OR and AND. Bent enough to subtract them cleanly.
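What the bend buys can be checked by switching it off. The sketch below runs the same weights with and without ReLU; the `bend` flag is a hypothetical switch added for this experiment. With these particular weights the no-bend network reduces to ŷ = 2 − x₁ − x₂, so it outputs 2 instead of 0 at (0, 0).

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]
W2, b2 = [[1, -2]], [0]

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, bend=True):
    z1 = linear(W1, b1, x)
    a1 = [max(0, v) for v in z1] if bend else z1   # skip ReLU when bend=False
    return linear(W2, b2, a1)[0]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    no_bend = forward(x, bend=False)
    # With the bend removed, the two layers fold into one line: 2 - x1 - x2.
    assert no_bend == 2 - x[0] - x[1]
    print(x, "with bend:", forward(x), "without:", no_bend)
```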
Pause and recall. Without scrolling — what are the tensor shapes at each layer? Where does the bend go?
Where this lives in the wild
The forward pass is the most-shipped operation in machine learning. Some specific places:
- PyTorch nn.Sequential. Sequential(Linear(2,2), ReLU(), Linear(2,1)) literally encodes our toy network. Each __call__ runs one forward pass, layer by layer, in registration order.
- JAX vmap. Wraps a single-input forward function and vectorises it across a batch dimension automatically. No code change to handle (B, 2) instead of (2,). Same matmul, bigger leading axis.
- ONNX inference graph. Exports the forward pass as a frozen DAG of ops such as MatMul, Add and Relu. ONNX Runtime then schedules these on CPU, CUDA, or NPU without any framework code.
- vLLM forward kernels. For each generated token, vLLM runs one transformer forward pass. The heavy steps run in custom CUDA kernels (PagedAttention, FlashAttention), so the same shape arithmetic you just did is repeated for every token at far larger dimensions.
- llama.cpp matmul kernels. Hand-tuned CPU/Metal matmuls for W·x + b. The whole library is essentially "do the forward pass fast enough on a laptop". Same (out, in)·(in,) → (out,) rule, written in SIMD.
Different stacks. One operation. Stretch, bend, stretch.
Q&A
Q: Where exactly does the bend go — before or after the matmul?
A: After. The order is matmul → add bias → activation. The pre-activation z is the linear part. The post-activation a is what the next layer sees as input. If you put ReLU before the first matmul, you have only ReLU'd the raw input. For inputs that are already 0 or 1 that changes nothing, and the two stretches that follow compose into a single linear map.
Common wrong answer to avoid: "doesn't matter, math is symmetric". It is not. The bend is the only non-linear step. It has to sit between the two stretches; otherwise they merge into one linear map, which is the same limitation that makes single-layer perceptrons fail XOR.
Q: What is the shape of the bias term and why?
A: One scalar per output neuron, shape (out_dim,). Each neuron has its own learnable offset. Bias does not depend on the input; it shifts the neuron's threshold up or down. A bias of shape (in_dim,) would shift the inputs rather than the outputs, and in general it could not even be added to W·x, which has shape (out_dim,).
Q: Why is batched forward a matrix-matrix multiply, not matrix-vector?
A: Stack B input vectors into a matrix X of shape (B, in_dim). Then Z = X · Wᵀ + b has shape (B, out_dim): one matrix-matrix multiply that processes all B examples in parallel, with b added to every row. The arithmetic is the same, but GPUs run one large matrix-matrix multiply far more efficiently than B small matrix-vector ones, so batching costs much less than B separate passes, up to memory and compute limits.
Common wrong answer to avoid: "batching changes what the network computes". It does not. Each row of Z is identical to what you would get from B separate forward passes. Only the layout changed.
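The claim that batching only changes the layout can be checked directly. The sketch below computes the hidden pre-activations for all four XOR inputs twice, once as a single X · Wᵀ + b and once as four separate W·x + b calls, and asserts that the results match. It uses plain Python lists; `matvec`, `transpose` and `matmul` are small helpers written for this example.

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]   # W1 is (out_dim, in_dim) = (2, 2)

def matvec(W, x):
    """(out, in)·(in,) -> (out,)"""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def matmul(A, B):
    """(n, k)·(k, m) -> (n, m)"""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # batch of B = 4 inputs, shape (4, 2)

# Batched: one matrix-matrix multiply, bias added to every row.
Z_batched = [[v + b for v, b in zip(row, b1)] for row in matmul(X, transpose(W1))]

# Unbatched: B separate matrix-vector multiplies.
Z_looped = [[v + b for v, b in zip(matvec(W1, x), b1)] for x in X]

assert Z_batched == Z_looped
print(Z_batched)  # [[0, -1], [1, 0], [1, 0], [2, 1]]
```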
Q: If a₁ for input (0,0) is (0, 0), are both hidden neurons "dead"?
A: For this one input both are inactive: each contributes zero. Across the four inputs they are not dead; h₁ fires on three of them and h₂ fires on one. A truly dead neuron is one whose pre-activation is zero or negative for every training input, so ReLU always outputs zero and no gradient flows through it. That is the dying-ReLU problem from chapter 4 of the explainer.
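A quick way to tell "silent on one input" from "dead" is to count how often each hidden neuron fires across the whole dataset. The sketch below does that for the four XOR inputs; treating a neuron as dead when its count is zero is the working definition assumed here, and the variable names are illustrative.

```python
W1, b1 = [[1, 1], [1, 1]], [0, -1]
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]

def hidden_activations(x):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    return [max(0, v) for v in z1]

# fire_counts[j] = number of inputs on which hidden neuron j outputs > 0
fire_counts = [0, 0]
for x in inputs:
    for j, a in enumerate(hidden_activations(x)):
        fire_counts[j] += int(a > 0)

dead = [count == 0 for count in fire_counts]
print(fire_counts)  # [3, 1]
print(dead)         # [False, False]
```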
Apply now (5 min)
Take a piece of paper. Pick x = (1, 1). Without looking back, compute z₁, a₁, and ŷ using the toy weights. Write each shape next to each value. If your ŷ is not 0, find the slip.
Then, still without looking — sketch the network. Two input nodes. Two hidden ReLU nodes. One output node. Arrows showing every connection. Label W₁, b₁, W₂, b₂ on the right edges.
If you can do both in under 4 minutes, the forward pass is in your hands.
Bridge. We picked the weights by hand. Real networks find them by training. But training needs a starting point: all-zero weights collapse the rule pile, and overly large weights blow up the activations from the first layer onward. Where does the rule pile actually start? See 04-weight-initialization.md.