03. The forward pass — one prediction, end to end

The cat-robot, computing. Every shape, every number, in the order they happen.

Built on 02-activation-functions.md. The bend is now in your hand. Let us see the rule pile run.


The mental model first — pipeline, not magic

See. A neural network is just a pipeline. Numbers go in one end. They flow through stages. A number comes out the other end.

Each stage is one layer. Inside a layer, many neurons compute in parallel — same input, different weights. That is one whole "rule pile slice" of the cat-robot.

Then a bend. Then the next slice.

   inputs              hidden layer              output
                       (parallel neurons)

   x ─┐              ┌──── neuron 1 ────┐
      │              │                  │
      ├──→ stretch ──┼──── neuron 2 ────┼──→ bend ──→ stretch ──→ ŷ
      │              │                  │
      └─             └──── neuron k ────┘

Stretch = matrix multiply. Bend = activation. Stretch again = next layer. That is it.

So one forward pass is one input vector entering and one prediction leaving. Inside, though, many neurons fire in parallel at every layer. The cat-robot is wide, not just deep.
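
The pipeline idea can be sketched in a few lines of plain Python. This is an illustration only: the names stretch, bend and forward are made up for this sketch, and the weights are placeholders chosen to make the arithmetic easy, not trained values.

```python
def stretch(W, b):
    """Build a linear stage: one weighted sum plus bias per output neuron."""
    def stage(x):
        return [sum(w * xi for w, xi in zip(row, x)) + bi
                for row, bi in zip(W, b)]
    return stage

def bend(x):
    """ReLU, applied element-wise."""
    return [max(0.0, v) for v in x]

def forward(x, stages):
    """Push x through every stage in order."""
    for stage in stages:
        x = stage(x)
    return x

# Placeholder weights (assumption): 2 inputs -> 3 hidden neurons -> 1 output.
pipeline = [
    stretch([[0.5, -1.0], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.0, 0.0]),
    bend,
    stretch([[1.0, 1.0, 1.0]], [0.0]),
]

y_hat = forward([2.0, 1.0], pipeline)
print(y_hat)
```

Each stage only needs to accept a list and return a list. That is why adding a layer is just appending to the pipeline.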


The toy network we will hand-compute

Smallest network that solves XOR. Two inputs. Two hidden neurons with ReLU. One output.

       x₁ ─────●─────────●─── h₁ ─────●
                ╲       ╱              ╲
                 ╲     ╱                ╲
                  ╲   ╱                  ●─── ŷ
                   ╲ ╱                  ╱
                    ╳                  ╱
                   ╱ ╲                ╱
                  ╱   ╲              ╱
                 ╱     ╲            ╱
       x₂ ─────●─────────●─── h₂ ──●

       input          hidden          output
       layer          layer           layer
       (2 nodes)      (2 ReLU)        (1 node)

Both inputs feed both hidden neurons. Both hidden neurons feed the output. This is a fully connected network.

We use the weights from chapter 2 of the explainer — the ones that solve XOR:

  • h₁ = ReLU(x₁ + x₂) — fires on OR
  • h₂ = ReLU(x₁ + x₂ − 1) — fires on AND
  • ŷ = h₁ − 2·h₂ — OR minus twice AND = XOR
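
As a quick check, the three rules above can be written as plain Python and run against the XOR truth table. The function names relu and xor_net are illustrative choices for this sketch.

```python
def relu(z):
    """The bend: keep positives, clamp negatives to zero."""
    return max(0.0, z)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # positive when at least one input is on (OR)
    h2 = relu(x1 + x2 - 1)    # positive only when both inputs are on (AND)
    return h1 - 2 * h2        # OR minus twice AND

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, y_hat in outputs.items():
    print(inputs, "->", y_hat)
```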

In matrix form:

W₁ = [[1, 1],     b₁ = [0, −1]
      [1, 1]]     

W₂ = [1, −2]      b₂ = 0

Tensor shapes — say them out loud

Before any number, fix the shapes in your head. Yes?

Symbol   Shape    Meaning
x        (2,)     one input vector, two features
W₁       (2, 2)   hidden weights, laid out as (out_dim, in_dim)
b₁       (2,)     hidden bias, one per hidden neuron
z₁       (2,)     hidden pre-activation = W₁·x + b₁
a₁       (2,)     hidden post-activation = ReLU(z₁)
W₂       (1, 2)   output weights
b₂       (1,)     output bias
ŷ        (1,)     final scalar prediction

The shape rule for W·x: (out, in) · (in,) → (out,). Inner dimensions cancel.

Bias matches the output dimension of the layer, not the input. One bias per neuron, not per input.
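
One way to say the shapes out loud is to let NumPy say them. The sketch below assumes NumPy as the array library and uses the toy weights from this chapter. The variable names are just ASCII spellings of the symbols in the table.

```python
import numpy as np

x  = np.array([0.0, 1.0])                  # (2,)   one input vector
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])    # (2, 2) (out_dim, in_dim)
b1 = np.array([0.0, -1.0])                 # (2,)   one bias per hidden neuron
W2 = np.array([[1.0, -2.0]])               # (1, 2) output weights
b2 = np.array([0.0])                       # (1,)   output bias

z1 = W1 @ x + b1            # (2, 2) · (2,) -> (2,)
a1 = np.maximum(z1, 0.0)    # ReLU keeps the shape
y_hat = W2 @ a1 + b2        # (1, 2) · (2,) -> (1,)

for name, t in [("x", x), ("W1", W1), ("b1", b1), ("z1", z1),
                ("a1", a1), ("W2", W2), ("b2", b2), ("y_hat", y_hat)]:
    print(f"{name:>5}: {t.shape}")
```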


Two views of one layer — neuron and matrix

The same computation, two ways to picture it. Pick whichever feels native.

Per-neuron view (algebra-friendly)

Each hidden neuron is its own dot product:

z₁[0] = w₁[0,0]·x[0] + w₁[0,1]·x[1] + b₁[0]
z₁[1] = w₁[1,0]·x[0] + w₁[1,1]·x[1] + b₁[1]

Then apply the bend, neuron by neuron:

a₁[0] = ReLU(z₁[0])
a₁[1] = ReLU(z₁[1])

This is the rule pile from the ELI5 — each rule is one weighted sum, then a bend.
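
The per-neuron view maps directly onto two nested loops. A minimal plain-Python sketch, using the toy hidden-layer weights; hidden_layer is a hypothetical helper name.

```python
def relu(z):
    return max(0.0, z)

W1 = [[1.0, 1.0],    # weights of hidden neuron 0
      [1.0, 1.0]]    # weights of hidden neuron 1
b1 = [0.0, -1.0]

def hidden_layer(x):
    z1, a1 = [], []
    for j in range(len(W1)):          # one pass per hidden neuron
        z = b1[j]
        for i in range(len(x)):       # weighted sum over the inputs
            z += W1[j][i] * x[i]
        z1.append(z)
        a1.append(relu(z))            # bend, neuron by neuron
    return z1, a1

z1, a1 = hidden_layer([1.0, 1.0])
print(z1, a1)
```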

Matrix view (GPU-friendly)

Same thing, all neurons at once:

z₁ = W₁ · x + b₁         # (2,2)·(2,) + (2,) = (2,)
a₁ = ReLU(z₁)            # element-wise bend, shape unchanged

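One matmul replaces the whole loop. GPUs love this. PyTorch's nn.Linear(2, 2) does exactly this under the hood: it stores its weight as (out_features, in_features) and computes x·Wᵀ + b, which for a single input vector is the same as W·x + b. The library does not care which view you imagine. The hardware sees only the matmul.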
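
To see that the two views agree, here is a small NumPy sketch (NumPy is an assumption of this example, not something the chapter requires). It computes the hidden layer once as a single matmul and once neuron by neuron, then compares.

```python
import numpy as np

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
x = np.array([1.0, 1.0])

# Matrix view: all hidden neurons at once.
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0.0)

# Per-neuron view: one dot product per row of W1.
z1_loop = np.array([W1[j] @ x + b1[j] for j in range(W1.shape[0])])

same = np.allclose(z1, z1_loop)
print(z1, a1, same)
```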


Walk all four XOR inputs through the pipe

Now we run the cat-robot four times. Same weights, four different inputs. Show every intermediate.

Input x = (0, 0)

z₁ = W₁·x + b₁
   = [[1,1],[1,1]] · [0,0] + [0,−1]
   = [0, 0] + [0, −1]
   = [0, −1]

a₁ = ReLU([0, −1])  = [0, 0]

ŷ  = W₂·a₁ + b₂
   = [1, −2]·[0, 0] + 0
   = 0

Input x = (0, 1)

z₁ = [[1,1],[1,1]]·[0,1] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0])               = [1, 0]
ŷ  = [1, −2]·[1, 0] + 0         = 1

Input x = (1, 0)

z₁ = [[1,1],[1,1]]·[1,0] + [0,−1] = [1, 0]
a₁ = ReLU([1, 0])               = [1, 0]
ŷ  = [1, −2]·[1, 0] + 0         = 1

Input x = (1, 1)

z₁ = [[1,1],[1,1]]·[1,1] + [0,−1] = [2, 1]
a₁ = ReLU([2, 1])               = [2, 1]
ŷ  = [1, −2]·[2, 1] + 0         = 2 − 2 = 0

The full table

x₁   x₂   z₁        a₁       ŷ   XOR target
0    0    (0, −1)   (0, 0)   0   0 ✓
0    1    (1, 0)    (1, 0)   1   1 ✓
1    0    (1, 0)    (1, 0)   1   1 ✓
1    1    (2, 1)    (2, 1)   0   0 ✓

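Four for four. The bend at zero is what keeps h₂ silent at (0,0), where its pre-activation is −1, and lets it fire only at (1,1). Without the bend, the rule pile collapses to a single linear map, ŷ = 2 − x₁ − x₂: the same single line as chapter 1. It outputs 2 at (0,0), and no single line can hit all four XOR targets.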

This is the cat-robot solving its first non-trivial problem. Wide enough to detect both OR and AND. Bent enough to subtract them cleanly.
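
The hand computation can be replayed in code. The sketch below assumes NumPy and returns every intermediate, so each row can be compared against the table. forward is an illustrative name.

```python
import numpy as np

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def forward(x):
    """Return every intermediate, not just the prediction."""
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)
    y_hat = W2 @ a1 + b2
    return z1, a1, y_hat

rows = []
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1, a1, y_hat = forward(np.array([x1, x2], dtype=float))
    # Last column is the XOR target, computed with Python's ^ operator.
    rows.append((x1, x2, z1.tolist(), a1.tolist(), float(y_hat), x1 ^ x2))

for row in rows:
    print(row)
```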


Pause and recall. Without scrolling — what are the tensor shapes at each layer? Where does the bend go?


Where this lives in the wild

The forward pass is the most-shipped operation in machine learning. Some specific places:

  • PyTorch nn.Sequential. Sequential(Linear(2,2), ReLU(), Linear(2,1)) encodes the architecture of our toy network; the weights start random until you train or load them. Each __call__ runs one forward pass, layer by layer, in registration order.
  • JAX vmap. Wraps a single-input forward function and vectorises it across a batch dimension automatically. No code change to handle (B, 2) instead of (2,). Same matmul, bigger leading axis.
  • ONNX inference graph. Exports the forward pass as a frozen DAG of ops (for our toy network, something like MatMul, Add, Relu, MatMul, Add). ONNX Runtime then schedules these on CPU, CUDA, or NPU without any framework code.
  • vLLM forward kernels. For each generated token, vLLM runs one transformer forward pass. Attention runs in custom CUDA kernels (PagedAttention, FlashAttention), and the same shape arithmetic you just did repeats at every layer, for every token.
  • llama.cpp matmul kernels. Hand-tuned CPU/Metal matmuls for W·x + b. The whole library is essentially "do the forward pass fast enough on a laptop". Same (out, in)·(in,) → (out,) rule, written in SIMD.

Different stacks. One operation. Stretch, bend, stretch.


Q&A

Q: Where exactly does the bend go — before or after the matmul?
A: After. The order is matmul → add bias → activation. The pre-activation z is the linear part. The post-activation a is what the next layer sees as input. If you put ReLU before the matmul, you have just ReLU'd the raw input. On 0/1 data that is a no-op, so the whole network stays linear.
Common wrong answer to avoid: "doesn't matter, math is symmetric". It is not. The bend is the only non-linear step. With no bend between the two stretches, they collapse into one linear map, and a single linear map cannot solve XOR. That is why single-layer perceptrons fail.
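
A small experiment makes the point concrete. The sketch below (NumPy assumed; bend_after and bend_before are hypothetical names) runs the toy weights with ReLU in the usual place and with ReLU moved in front of the first matmul.

```python
import numpy as np

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def relu(v):
    return np.maximum(v, 0.0)

def bend_after(x):
    """Usual order: matmul, add bias, then ReLU."""
    return W2 @ relu(W1 @ x + b1) + b2

def bend_before(x):
    """ReLU applied to the raw input instead; everything after is linear."""
    return W2 @ (W1 @ relu(x) + b1) + b2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
after = [float(bend_after(np.array(p, dtype=float))) for p in inputs]
before = [float(bend_before(np.array(p, dtype=float))) for p in inputs]

print("bend after :", after)
print("bend before:", before)
```

With the bend in front, the output at (0, 0) no longer matches the XOR target. The negative pre-activation of h₂ is never clipped.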

Q: What is the shape of the bias term and why?
A: One scalar per output neuron, shape (out_dim,). Each neuron has its own learnable offset. Bias does not depend on the input; it shifts the neuron's threshold up or down. It is added to z = W·x, which already has shape (out_dim,). A bias of shape (in_dim,) would not even line up with z unless in_dim happened to equal out_dim, and it would confuse "shift the output" with "shift the input".
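
The toy network hides this because in_dim and out_dim are both 2. The sketch below uses a hypothetical layer with 2 inputs and 3 neurons (made-up values, NumPy assumed) so the two candidate bias shapes differ.

```python
import numpy as np

W = np.ones((3, 2))                  # hypothetical layer: in_dim=2, out_dim=3
x = np.array([1.0, 2.0])             # (2,)

b_out = np.array([0.1, 0.2, 0.3])    # (out_dim,): one offset per neuron
z = W @ x + b_out                    # (3,) + (3,) lines up

b_in = np.array([0.1, 0.2])          # (in_dim,): the wrong shape
try:
    W @ x + b_in                     # (3,) + (2,) cannot broadcast
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(z.shape, mismatch_raised)
```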

Q: Why is batched forward a matrix-matrix multiply, not matrix-vector?
A: Stack B input vectors into a matrix X of shape (B, in_dim). Then Z = X · Wᵀ + b has shape (B, out_dim), with b broadcast across the B rows. That is one matrix-matrix multiply processing all B examples in parallel. Same arithmetic, but GPUs run closest to peak FLOPS on large matmuls, so the cost per example drops as B grows, until compute or memory saturates.
Common wrong answer to avoid: "batching changes what the network computes". It does not. Each row of Z is identical to what you would get from B separate forward passes. Only the layout changed.
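
A minimal NumPy sketch of the batched pass, using the four XOR inputs as a batch of B = 4. It also runs the four examples one at a time and checks that the rows match. NumPy and the variable names are choices made for this example.

```python
import numpy as np

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])   # (2, 2)
b1 = np.array([0.0, -1.0])                # (2,)
W2 = np.array([[1.0, -2.0]])              # (1, 2)
b2 = np.array([0.0])                      # (1,)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # (B, in_dim)

# Batched: one matrix-matrix multiply per layer.
Z1 = X @ W1.T + b1          # (4, 2), bias broadcast over rows
A1 = np.maximum(Z1, 0.0)
Y = A1 @ W2.T + b2          # (4, 1)

# Unbatched: four separate matrix-vector passes.
per_example = np.stack(
    [W2 @ np.maximum(W1 @ x + b1, 0.0) + b2 for x in X]
)

print(Y.ravel(), np.allclose(Y, per_example))
```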

Q: If a₁ for input (0,0) is (0, 0), are both hidden neurons "dead"?
A: No. For this one input both are inactive, since both output zero, but inactive is not dead. Across the four inputs h₁ fires on three and h₂ fires on one. A truly dead neuron is one whose pre-activation is zero or negative for every training input, so ReLU always outputs zero and no gradient flows through it. That is the dying-ReLU problem from chapter 4 of the explainer.
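
One rough way to check for dead neurons is to count, per neuron, how many inputs give a positive pre-activation. The sketch below does this for the toy network over the four XOR inputs. It assumes NumPy, and the four inputs stand in for a training set.

```python
import numpy as np

W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

Z1 = X @ W1.T + b1                      # (4, 2) pre-activations
active = Z1 > 0                         # True where ReLU passes the value

fires_per_neuron = active.sum(axis=0)   # how many inputs each neuron fires on
dead = ~active.any(axis=0)              # True only if a neuron never fires

print(fires_per_neuron, dead)
```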


Apply now (5 min)

Take a piece of paper. Pick x = (1, 1). Without looking back, compute z₁, a₁, and ŷ using the toy weights. Write each shape next to each value. If your ŷ is not 0, find the slip.

Then, still without looking, sketch the network. Two input nodes. Two hidden ReLU nodes. One output node. Arrows showing every connection. Label W₁ and W₂ on the edges of their layers, and b₁ and b₂ on the neurons they shift.

If you can do both in under 4 minutes, the forward pass is in your hands.


Bridge. We picked the weights by hand. Real networks find them by training. But training needs a starting point: zero weights collapse the rule pile, and large weights explode the forward pass on layer one. Where does the rule pile actually start? See 04-weight-initialization.md.