Backprop By Hand — Analysis¶

The architecture¶

A 2-layer MLP for 2D binary classification (XOR-like dataset):

x (n, 2)
   ↓ z1 = x @ W1 + b1     (n, 8)
   ↓ a1 = ReLU(z1)        (n, 8)
   ↓ z2 = a1 @ W2 + b2    (n, 2)
   ↓ loss = cross_entropy(z2, y)

The gradient chain¶

The key derivation: softmax + cross-entropy combines into a clean form.

Given z2 ∈ ℝ^{N×C} and integer labels y ∈ {0..C-1}^N:

dL/dz2 = (softmax(z2) - one_hot(y)) / N

The (probs - one_hot)/N simplification is the most-asked whiteboard derivation. From there the chain is mechanical:

dL/dW2 = a1.T @ dz2
dL/db2 = dz2.sum(axis=0)
dL/da1 = dz2 @ W2.T
dL/dz1 = dL/da1 * (z1 > 0)        # ReLU derivative is the indicator
dL/dW1 = x.T @ dz1
dL/db1 = dz1.sum(axis=0)

Memorise this chain; every interview probes one of its links.

Validation¶

Two layers of verification:

Finite-difference gradient check. test_finite_difference_check perturbs each parameter by ±ε and compares the numerical gradient (L(p+ε) - L(p-ε)) / 2ε against the analytic gradient. Tolerance 1e-4. If the chain is wrong anywhere, this fails.
Training convergence. test_xor_converges trains on a 200-point XOR-shaped dataset for 3000 iterations. Final accuracy ≥ 90% (typically 97%). If forward, backward, or step is wrong, the loss either doesn't decrease or oscillates.

Run output¶

iter    0  loss 0.6397  acc 0.450
iter  300  loss 0.4415  acc 0.835
iter  600  loss 0.2516  acc 0.955
iter  900  loss 0.1787  acc 0.970
iter 1200  loss 0.1437  acc 0.970
iter 1500  loss 0.1236  acc 0.970
iter 1800  loss 0.1102  acc 0.970
iter 2100  loss 0.1006  acc 0.970
iter 2400  loss 0.0933  acc 0.970
iter 2700  loss 0.0875  acc 0.970
final         loss 0.0828  acc 0.970

Loss decreases monotonically (averaging over the noise of mini-batches; here we do full-batch). Accuracy climbs to 97% in ~900 iterations and plateaus — the network has learned the XOR pattern.

Common bugs the tests catch¶

Forgetting /N in dz2. Gradients are then N times too large; training overshoots, loss explodes.
Off-by-one in one_hot. Subtracting from the wrong column.
Wrong ReLU derivative. Using (z1 >= 0) vs. (z1 > 0) — both work in practice; the strict inequality matches the conventional derivation.
Forgetting to transpose in matrix products. a1 @ dz2 instead of a1.T @ dz2 — shapes mismatch; numpy errors.
Mutating params during backward. The probs[arange(n), y] -= 1 line modifies the softmax output in place; if you reused the cache after, the gradient computation would be wrong.

Why these patterns recur in interviews¶

This exercise is the smallest faithful version of every neural network's backward pass. The same chain (with more layers and fancier activations) is what PyTorch's autograd builds for any model. Demonstrating you can derive and implement it from scratch — without torch.nn — proves you understand:

The chain rule applied to matrix calculus.
The softmax + cross-entropy simplification.
The role of ReLU's derivative being the input's sign.
The shape juggling of T and @.

Most "AI infrastructure" interviews include a backprop question because it tests whether you understand the substrate. Library users without this knowledge struggle to debug autograd issues (NaN gradients, no_grad context, gradient checkpointing).

What this exercise teaches¶

Gradient computation is mechanical; the discipline is in the chain and the shapes.
Finite-difference checks are the structural defence against off-by-one and transposition bugs.
ReLU's gradient is the indicator function — simple and crucial.
Cross-entropy + softmax simplifies to probs - one_hot divided by batch size. Memorise this.

The whiteboard answer to "derive the gradient of softmax + cross-entropy" takes 3 minutes if you've internalised it; 15 if you haven't.