
05. Loss functions — measuring wrongness

Five minutes. The number that tells the cat-robot how wrong it just was — and how loud the nudge should be.

Built on the ELI5 in 00-eli5.md. The rule pile outputs a guess. The loss number says how wrong that guess was. The nudge is computed from that number. Wrong loss = wrong nudge = stuck robot.


The mental model first

See. The cat-robot guesses. The truth is sitting there. We need one number that says: how wrong was that guess?

That number is the loss. Big = very wrong. Zero = perfect.

The whole training loop hangs on this number. Forward pass produces a guess. Loss compares guess to truth. Backprop turns the loss into a nudge. The nudge shifts every weight in the rule pile.

So if the loss is mis-shaped — if it sends weak nudges when the robot is very wrong, or strong nudges when the robot is correct — training crawls or thrashes. The loss is not a scoring decoration. It is the shape of the slope the rule pile rolls down.

Two main flavors. One for predicting numbers. One for predicting categories. Pick wrong, pay forever.


Regression — predict a number, use MSE

Predicting house price. Predicting tomorrow's temperature. The output is a real number.

Mean squared error. For one example:

L = (ŷ − y)²

Picture a parabola. Bottom at ŷ = y. Climbs fast on either side. Far-off guesses get punished quadratically. Near-misses get punished gently.

The gradient w.r.t. the prediction:

∂L/∂ŷ = 2(ŷ − y)

Big error → big gradient → loud nudge. Small error → small gradient → soft nudge. Exactly the smart-nudging shape we want for a continuous output.
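A minimal sketch of the parabola and its slope, assuming a single example. The house price of 300 (in thousands) is a made-up number for illustration.

```python
def mse_loss(y_hat, y):
    """Squared error for one example: (y_hat - y)^2."""
    return (y_hat - y) ** 2


def mse_grad(y_hat, y):
    """Derivative of the squared error w.r.t. the prediction: 2 * (y_hat - y)."""
    return 2.0 * (y_hat - y)


y = 300.0  # hypothetical true house price, in thousands
for y_hat in (301.0, 310.0, 400.0):
    # The further the guess, the larger both the loss and the nudge.
    print(y_hat, mse_loss(y_hat, y), mse_grad(y_hat, y))
```

A guess that is 100 off gets a gradient 100 times larger than a guess that is 1 off. That is the "loud nudge" in numbers.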

This is fine when ŷ is unbounded. It is not fine when ŷ is a probability stuck between 0 and 1, because then a sigmoid or softmax sits in the way and breaks the slope. Hold this thought.


Classification — predict a category, use cross-entropy

Predicting cat / dog / bird. The output is one of K classes. The network produces K raw scores called logits. Softmax turns them into probabilities.

p_i = exp(z_i) / Σ_j exp(z_j)

All positive. Sum to 1. Largest logit wins.
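A small sketch of softmax in plain Python. The logits [2.0, 1.0, 0.1] are hypothetical, and subtracting the maximum is a common stability trick assumed here, not something the formula above requires.

```python
import math


def softmax(z):
    """Turn a list of logits into probabilities.

    Subtracting max(z) does not change the result (softmax ignores a
    constant added to every logit) but keeps exp() from overflowing.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([2.0, 1.0, 0.1])  # hypothetical logits for cat / dog / bird
print([round(v, 3) for v in p])  # [0.659, 0.242, 0.099]
```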

Now we compare the predicted distribution p to the true one-hot y. The loss is cross-entropy:

L = −Σ y_i log(p_i) = −log(p_c)        # where c is the true class

Picture. The robot pays attention only to the probability it assigned to the correct class. If it said 0.99 → loss ≈ 0.01. If it said 0.01 → loss ≈ 4.6. If it said 0 → loss = ∞. The penalty for confident wrongness is brutal.

p_c (true class)    loss = −log(p_c)
1.0                 0.00
0.5                 0.69
0.1                 2.30
0.01                4.61

Right answer with high confidence → tiny loss. Wrong answer with high confidence → huge loss. The slope explodes where it should — exactly where the cat-robot needs the loudest nudge.


Why MSE on classification is wrong — three honest attempts

The claim. MSE on softmax probabilities gives weak gradients, and weakest exactly when the model is most wrong. Let us check it with numbers.

Setup. Three classes. True label = class 0, so y = [1, 0, 0]. Softmax output p = [0.2, 0.5, 0.3]. The robot is confidently wrong — it thinks class 1 is most likely.

We compare two losses on the same prediction. We want the gradient that says push class 0 up, push class 1 down hard.

Attempt 1 — MSE on the probabilities

L_MSE = Σ (p_i − y_i)²
      = (0.2−1)² + (0.5−0)² + (0.3−0)²
      = 0.64 + 0.25 + 0.09
      = 0.98

Gradient w.r.t. each p_i: 2(p_i − y_i) = [-1.6, +1.0, +0.6].

Looks reasonable on p. But the network optimizes logits z, not p. To get ∂L/∂z, multiply by the softmax Jacobian. Keeping only its diagonal terms (an approximation that drops the coupling between classes, good enough for intuition), each component picks up a p_i(1 − p_i) factor. For our numbers:

∂L/∂z_0 ≈ 2(p_0 − y_0)·p_0(1−p_0) = 2·(−0.8)·0.2·0.8 = −0.256
∂L/∂z_1 ≈ 2(p_1 − y_1)·p_1(1−p_1) = 2·(+0.5)·0.5·0.5 = +0.250
∂L/∂z_2 ≈ 2(p_2 − y_2)·p_2(1−p_2) = 2·(+0.3)·0.3·0.7 = +0.126

The signs are right. The magnitudes are tiny. The robot is very wrong but the nudge is a whisper. Worse — when the robot is catastrophically wrong (say p_0 = 0.01), the factor p_0(1−p_0) = 0.0099 shrinks the gradient even more. More wrong → smaller nudge. Backwards.
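The numbers above can be reproduced, and compared against the full softmax Jacobian, with a short sketch. The function name `mse_softmax_grads` is made up for this example; the full-Jacobian line uses the standard identity ∂p_i/∂z_k = p_i(δ_ik − p_k).

```python
def mse_softmax_grads(p, y):
    """Gradients of sum((p_i - y_i)^2) w.r.t. the logits, given p = softmax(z).

    Returns (diagonal_approximation, full_jacobian_result).
    """
    g = [2.0 * (pi - yi) for pi, yi in zip(p, y)]          # dL/dp
    diag = [gi * pi * (1.0 - pi) for gi, pi in zip(g, p)]  # keeps only dp_i/dz_i
    s = sum(gi * pi for gi, pi in zip(g, p))               # shared cross-class term
    full = [pi * (gi - s) for gi, pi in zip(g, p)]         # full Jacobian applied
    return diag, full


diag, full = mse_softmax_grads([0.2, 0.5, 0.3], [1.0, 0.0, 0.0])
print([round(v, 3) for v in diag])  # [-0.256, 0.25, 0.126]
print([round(v, 3) for v in full])  # [-0.392, 0.32, 0.072]
```

The exact values differ from the diagonal approximation, but the story holds: same signs, and every component is still smaller than the matching p_i − y_i.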

Attempt 2 — MSE on the logits directly

Skip softmax. Use L = Σ (z_i − y_i)² with z = [0.5, 1.5, 0.9] (logits whose softmax is roughly the same p, about [0.19, 0.52, 0.29]).

L = (0.5−1)² + (1.5−0)² + (0.9−0)² = 0.25 + 2.25 + 0.81 = 3.31
∂L/∂z = 2(z − y) = [-1.0, +3.0, +1.8]

Stronger gradient, yes. But now we are pushing logits toward [1, 0, 0], not probabilities. That target caps the robot's confidence: logits of exactly [1, 0, 0] give softmax ≈ [0.58, 0.21, 0.21], so hitting the target perfectly still leaves the robot unsure. And worse: logits have no natural scale. Softmax ignores a constant added to every logit, so two logit sets producing the same p get different MSE losses. The training signal is tied to an arbitrary offset, not to the prediction.
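A quick sketch of the scale problem, assuming we shift every logit by the same constant (10.0 here, an arbitrary choice). The helper names are made up for this example.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logit_mse(z, y):
    """MSE taken directly on logits, with no softmax in between."""
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y))


y = [1.0, 0.0, 0.0]
z_a = [0.5, 1.5, 0.9]
z_b = [v + 10.0 for v in z_a]  # same logits, shifted by a constant

print(softmax(z_a), softmax(z_b))            # same probabilities, up to rounding
print(logit_mse(z_a, y), logit_mse(z_b, y))  # very different losses
print(softmax(y))  # logits equal to the target give only about 0.58 on the true class
```

Same prediction, wildly different loss. The loss is reacting to the offset, not to the guess.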

Attempt 3 — Cross-entropy on the same prediction

L_CE = −log(p_0) = −log(0.2) = 1.61

Now the magic. The gradient w.r.t. the logits, after the calculus collapses:

∂L_CE/∂z_i = p_i − y_i

For our prediction:

∂L/∂z_0 = 0.2 − 1 = −0.80
∂L/∂z_1 = 0.5 − 0 = +0.50
∂L/∂z_2 = 0.3 − 0 = +0.30

Compare to attempt 1: about 3× louder on the right class (−0.80 against −0.256). Cleaner. And it scales with wrongness: if p_0 = 0.01, the gradient on z_0 is −0.99, almost full strength. More wrong → louder nudge. Exactly the slope we want.
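One way to trust p − y without redoing the calculus is a finite-difference check. This is a sketch under one assumption: the logits are set to log(p), which is one valid choice that reproduces p = [0.2, 0.5, 0.3] exactly.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, c):
    """-log of the softmax probability assigned to true class c."""
    return -math.log(softmax(z)[c])


def ce_grad_analytic(z, c):
    """The claimed gradient: p_i - y_i for a one-hot target at index c."""
    p = softmax(z)
    return [pi - (1.0 if i == c else 0.0) for i, pi in enumerate(p)]


def ce_grad_numeric(z, c, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, c) - cross_entropy(down, c)) / (2 * eps))
    return grads


z = [math.log(0.2), math.log(0.5), math.log(0.3)]  # softmax(z) == [0.2, 0.5, 0.3]
print([round(g, 4) for g in ce_grad_analytic(z, 0)])  # [-0.8, 0.5, 0.3]
print([round(g, 4) for g in ce_grad_numeric(z, 0)])
```

Both lines should agree to several decimal places.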

That is the contrast. MSE through softmax mutes the cat-robot when it is most wrong. Cross-entropy roars in proportion to the wrongness.


The clean-gradient miracle, in one picture

The derivation collapses because the log inside cross-entropy undoes the exp inside softmax. After the dust:

∂L/∂z_i = p_i − y_i

That is it. The gradient on each logit is just the predicted probability minus the true probability. No saturation. No vanishing factor. No σ'(z) term hiding in the corner.

Picture. The gradient simply tells each class: here is how wrong your probability was. Move that much.

This is one of the few places in deep learning where the math becomes simpler when you compose two pieces. Softmax + cross-entropy is the canonical pairing. Sigmoid + binary-cross-entropy is the same trick for two classes.
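For readers who want the missing step, here is a short derivation sketch in LaTeX, with c the true class and y one-hot. It only uses the softmax and cross-entropy definitions given earlier.

```latex
\begin{aligned}
L &= -\log p_c
   = -\log \frac{e^{z_c}}{\sum_j e^{z_j}}
   = -z_c + \log \sum_j e^{z_j} \\[4pt]
\frac{\partial L}{\partial z_i}
  &= -\mathbb{1}[i = c] + \frac{e^{z_i}}{\sum_j e^{z_j}}
   = p_i - y_i
\end{aligned}
```

The first line is where the simplification happens: once the log meets the exp, the true-class term is just −z_c, and differentiating the log-sum-exp gives softmax back.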


ASCII — MSE vs cross-entropy as a function of p_c

X-axis: probability assigned to the true class. Y-axis: loss. We want the loss to climb hard as p_c → 0.

loss
 5 |  *                                     CE = −log(p_c)
   |   *                                    MSE = (1−p_c)²
 4 |    *
   |     *
 3 |      *
   |       *
 2 |        **
   |          **
 1 |   #        ***
   |   ##          ****
 0 |   ###############****  → CE
   |   ###############      → MSE
   +────────────────────────────→ p_c
     0.0    0.25    0.5    0.75   1.0

   * = cross-entropy        # = MSE

Read it. Near p_c = 1 (correct, confident), both losses are tiny. Near p_c = 0 (catastrophically wrong, confident), MSE caps at 1. Cross-entropy goes to infinity. That cliff is what gives cross-entropy the loud nudge when the rule pile is most wrong.
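The two curves can be tabulated with a short sketch. It assumes the same single-term MSE used in the plot, (1 − p_c)², and the sample points are arbitrary.

```python
import math


def ce(p_c):
    """Cross-entropy on the true-class probability."""
    return -math.log(p_c)


def mse(p_c):
    """Single-term MSE on the true-class probability, as in the plot."""
    return (1.0 - p_c) ** 2


for p_c in (0.99, 0.5, 0.1, 0.01, 0.001):
    print(f"p_c={p_c:<6} CE={ce(p_c):6.2f}  MSE={mse(p_c):5.2f}")
```

As p_c shrinks, the MSE column creeps toward 1 and stops. The CE column keeps climbing.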


Losses you'll hear about

  • Label smoothing. Instead of hard [0, 1], use softer targets like [0.05, 0.95]. This reduces absurdly overconfident logits.
  • Focal loss. It down-weights easy examples, so the model spends more attention on hard ones. Very common in detection.
  • KL divergence. It measures the gap between two full distributions. Common in distillation when a student copies a teacher's probabilities.
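A rough sketch of all three, in their simplest textbook forms. Assumptions: smoothing strength eps = 0.1, focal exponent gamma = 2.0, no class-balancing weight in the focal loss, and made-up teacher and student distributions.

```python
import math


def smooth_labels(y, eps=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    k = len(y)
    return [(1.0 - eps) * yi + eps / k for yi in y]


def focal_loss(p_c, gamma=2.0):
    """Cross-entropy on the true class, scaled by (1 - p_c)^gamma."""
    return -((1.0 - p_c) ** gamma) * math.log(p_c)


def kl_divergence(q, p):
    """KL(q || p) = sum q_i * log(q_i / p_i); terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


print([round(v, 2) for v in smooth_labels([0.0, 1.0])])  # [0.05, 0.95]
print(focal_loss(0.9), focal_loss(0.1))  # the easy example shrinks far more
print(kl_divergence([0.7, 0.2, 0.1], [0.2, 0.5, 0.3]))  # teacher q vs student p
```

With gamma = 2, an easy example at p_c = 0.9 keeps only 1% of its cross-entropy, while a hard one at p_c = 0.1 keeps 81%.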

Pause and recall. Without scrolling — what does softmax + cross-entropy collapse to in the gradient? Why does MSE on softmax probabilities mute the cat-robot when it is most wrong? Where did the p_c(1−p_c) factor come from in attempt 1?


Where this lives in the wild

Loss choice ships everywhere. Five named systems:

  • GPT-4 / LLaMA / Claude — next-token cross-entropy. Every step, the model produces logits over the full vocabulary (e.g., 128K tokens for LLaMA-3). Softmax → probability per token. Cross-entropy on the actual next token in the corpus. Sum across all positions in the sequence. The entire pretraining bill is paid against this one loss.
  • CLIP — symmetric contrastive cross-entropy. Image and text encoders produce embeddings. For each image-text pair in a batch, similarity scores against all other texts become logits. Softmax + cross-entropy treats it as a classification: "which text in this batch matches this image?" Symmetric loss in the other direction. The loss alignment is what makes zero-shot classification work.
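  • Whisper — token-level cross-entropy, no CTC. Whisper is an encoder-decoder Transformer: the encoder reads the audio spectrogram and the decoder predicts text tokens one at a time. Training uses standard next-token cross-entropy on the decoder output; there is no separate CTC loss on the encoder. CTC, which handles unaligned audio-token sequences, is the loss behind other speech models such as DeepSpeech and fine-tuned wav2vec 2.0.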
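  • Meta DETR — down-weighted cross-entropy, with focal loss in later variants. Most prediction slots match no object. The original DETR keeps cross-entropy but scales down the "no object" class term; follow-ups such as Deformable DETR switch to a focal-loss variant. Either way, easy background negatives are down-weighted so learning stays focused on hard objects.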
  • Stable Diffusion (v-prediction) — MSE on velocity vectors. Each denoising step predicts a continuous-valued velocity vector. Target is unbounded. Loss is plain MSE between predicted and true velocity. The clean-gradient miracle is irrelevant here — the output is not a probability. MSE is correct precisely because we are predicting a number, not a category.

The pattern. Predicting a category → cross-entropy on softmax. Predicting a continuous value → MSE. Mismatch the loss to the output and training stalls or diverges.


Interview Q&A

Q: Why not just use MSE for classification — it still penalizes wrongness, no?
A: Because through softmax, MSE picks up a p_i(1 − p_i) factor in the logit gradient. When the model is very wrong, p_c is near 0, that factor collapses, and the gradient becomes tiny. Cross-entropy cancels that factor exactly — gradient is just p − y, full strength on the most-wrong class.
Common wrong answer to avoid: "MSE doesn't understand probabilities." It does — it just sends weaker nudges where you need the loudest. The problem is gradient shape, not semantics.

Q: What does softmax + cross-entropy buy you that softmax + MSE doesn't?
A: A non-saturating gradient. The ∂L/∂z = p − y form means the cat-robot's nudge is proportional to its wrongness, with no shrinking factor in front. With MSE through softmax, the most-wrong predictions get the smallest gradients — backwards from what training needs.
Common wrong answer to avoid: "cross-entropy is the right loss because it's information-theoretic." True but useless. The actual reason it ships is the gradient shape.

Q: When IS MSE the right loss for a model output?
A: When the output is a continuous unbounded number — house price, pixel intensity, velocity vector in a diffusion model, embedding offset in a regression head. No softmax in the way, so no saturation factor. The parabola gradient 2(ŷ − y) is exactly what you want for a real-valued target.
Common wrong answer to avoid: "Any numeric-looking output can use MSE." No — probabilities behind sigmoid or softmax are numeric too, but their gradient shape changes the story.

Q: Why is binary cross-entropy with sigmoid cleaner than MSE with sigmoid?
A: Same trick as softmax + CE. The derivative of sigmoid σ'(z) = σ(z)(1 − σ(z)) cancels exactly against the cross-entropy denominator, leaving ∂L/∂z = ŷ − y. With MSE + sigmoid, that σ'(z) factor stays in the gradient and goes to zero at the saturating ends — confidently-wrong predictions train glacially.
Common wrong answer to avoid: "Because BCE punishes mistakes more." The real win is not emotional harshness. It is the clean cancellation in the gradient.
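The sigmoid case is easy to see in numbers. A minimal sketch, assuming a true label y = 1 and logits that are increasingly confident in the wrong answer; the function names are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad(z, y):
    """d/dz of -[y log s + (1 - y) log(1 - s)], s = sigmoid(z). Simplifies to s - y."""
    return sigmoid(z) - y


def mse_grad(z, y):
    """d/dz of (s - y)^2, s = sigmoid(z). The s * (1 - s) factor survives."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)


y = 1.0
for z in (0.0, -2.0, -6.0):  # more and more confidently wrong
    print(z, round(bce_grad(z, y), 4), round(mse_grad(z, y), 4))
```

At z = −6 the BCE gradient is close to −1 while the MSE gradient is below 0.01 in size. Same wrongness, very different nudge.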


Apply now (5 min)

Take the example: y = [1, 0, 0], p = [0.2, 0.5, 0.3]. By hand:

  1. Compute the MSE loss Σ(p_i − y_i)². Compute the cross-entropy −log(p_0).
  2. Compute the cross-entropy gradient on logits: p − y = [−0.8, +0.5, +0.3]. Feel the size.
  3. Compute the MSE-through-softmax gradient on logits: 2(p_i − y_i)·p_i(1−p_i). Feel how much smaller it is on the wrong class.
  4. Now sketch from memory the loss-curve picture: MSE capped at 1 as p_c → 0, cross-entropy shooting to infinity. One axis. Two curves. No notes.

If you can produce both numbers and the picture in five minutes, you own this idea.


Bridge. We have a number that says how wrong the cat-robot was. Now we need to turn that number into a nudge for every weight in the rule pile — backwards through every layer. That is backpropagation. Read 06-backpropagation.md next.