Backprop By Hand — Analysis¶
The architecture¶
A 2-layer MLP for 2D binary classification (XOR-like dataset):
x (n, 2)
↓ z1 = x @ W1 + b1 (n, 8)
↓ a1 = ReLU(z1) (n, 8)
↓ z2 = a1 @ W2 + b2 (n, 2)
↓ loss = cross_entropy(z2, y)
The gradient chain¶
The key derivation: softmax + cross-entropy combines into a clean form.
Given z2 ∈ ℝ^{N×C} and integer labels y ∈ {0..C-1}^N:
The (probs - one_hot)/N simplification is the most-asked whiteboard derivation. From there the chain is mechanical:
dL/dW2 = a1.T @ dz2
dL/db2 = dz2.sum(axis=0)
dL/da1 = dz2 @ W2.T
dL/dz1 = dL/da1 * (z1 > 0) # ReLU derivative is the indicator
dL/dW1 = x.T @ dz1
dL/db1 = dz1.sum(axis=0)
Memorise this chain; every interview probes one of its links.
Validation¶
Two layers of verification:
-
Finite-difference gradient check.
test_finite_difference_checkperturbs each parameter by ±ε and compares the numerical gradient(L(p+ε) - L(p-ε)) / 2εagainst the analytic gradient. Tolerance 1e-4. If the chain is wrong anywhere, this fails. -
Training convergence.
test_xor_convergestrains on a 200-point XOR-shaped dataset for 3000 iterations. Final accuracy ≥ 90% (typically 97%). If forward, backward, or step is wrong, the loss either doesn't decrease or oscillates.
Run output¶
iter 0 loss 0.6397 acc 0.450
iter 300 loss 0.4415 acc 0.835
iter 600 loss 0.2516 acc 0.955
iter 900 loss 0.1787 acc 0.970
iter 1200 loss 0.1437 acc 0.970
iter 1500 loss 0.1236 acc 0.970
iter 1800 loss 0.1102 acc 0.970
iter 2100 loss 0.1006 acc 0.970
iter 2400 loss 0.0933 acc 0.970
iter 2700 loss 0.0875 acc 0.970
final loss 0.0828 acc 0.970
Loss decreases monotonically (averaging over the noise of mini-batches; here we do full-batch). Accuracy climbs to 97% in ~900 iterations and plateaus — the network has learned the XOR pattern.
Common bugs the tests catch¶
- Forgetting
/Nindz2. Gradients are then N times too large; training overshoots, loss explodes. - Off-by-one in
one_hot. Subtracting from the wrong column. - Wrong ReLU derivative. Using
(z1 >= 0)vs.(z1 > 0)— both work in practice; the strict inequality matches the conventional derivation. - Forgetting to transpose in matrix products.
a1 @ dz2instead ofa1.T @ dz2— shapes mismatch; numpy errors. - Mutating params during backward. The
probs[arange(n), y] -= 1line modifies the softmax output in place; if you reused the cache after, the gradient computation would be wrong.
Why these patterns recur in interviews¶
This exercise is the smallest faithful version of every neural network's backward pass. The same chain (with more layers and fancier activations) is what PyTorch's autograd builds for any model. Demonstrating you can derive and implement it from scratch — without torch.nn — proves you understand:
- The chain rule applied to matrix calculus.
- The softmax + cross-entropy simplification.
- The role of ReLU's derivative being the input's sign.
- The shape juggling of
Tand@.
Most "AI infrastructure" interviews include a backprop question because it tests whether you understand the substrate. Library users without this knowledge struggle to debug autograd issues (NaN gradients, no_grad context, gradient checkpointing).
What this exercise teaches¶
- Gradient computation is mechanical; the discipline is in the chain and the shapes.
- Finite-difference checks are the structural defence against off-by-one and transposition bugs.
- ReLU's gradient is the indicator function — simple and crucial.
- Cross-entropy + softmax simplifies to
probs - one_hotdivided by batch size. Memorise this.
The whiteboard answer to "derive the gradient of softmax + cross-entropy" takes 3 minutes if you've internalised it; 15 if you haven't.