
05. Training objective — what exactly the denoiser is asked to predict

~12 min read. The thing that turns denoising from a good story into a trainable loss.

Built on the ELI5 in 00-eli5.md. The sculptor's training — the learned reverse process — becomes practical only when we choose a clear target: predict the noise, the clean image, or an equivalent score signal.


1) Three equivalent stories, one learning problem

Before writing a loss, get the picture straight.

The model sees a noisy sample.

Then we can tell the same task in three voices.

Voice one says,

"predict the exact noise I added."

Voice two says,

"recover the clean image hiding underneath."

Voice three says,

"estimate the score, the direction that points back toward higher density."

Different language.

Same denoising geometry.

           predict epsilon
        ┌──────────────────┐
        │                  ▼
noisy x_t ─────────────→ clean x0_hat
        │                  ▲
        └──────→ score ────┘

This is why diffusion papers can feel like they are arguing while actually agreeing.

The viewpoints are mathematically linked.
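To make "mathematically linked" concrete, here is a sketch of the standard identities, assuming the usual forward process in which x_t mixes x0 and Gaussian noise using alpha_bar_t. The symbols epsilon_theta (the network's noise prediction) and s_theta (its implied score estimate) are notation chosen for this sketch, not names from the lesson.

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}

s_\theta(x_t, t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}
```

Given x_t and alpha_bar_t, any one of the three quantities determines the other two, which is why the arrows in the diagram can all be drawn.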

Stable Diffusion code usually trains the noise view.

ELBO derivations often explain the probabilistic foundation.

Score matching explains the vector-field intuition.

One job.

Three stories.

2) Noise prediction with a full worked example

Start with a toy clean value x0 = 0.70.

Let alpha_bar_t = 0.64.

Let the sampled noise be epsilon = 0.50.

Then sqrt(alpha_bar_t) = 0.80 and sqrt(1 - alpha_bar_t) = 0.60.

So the noisy input is:

x_t = 0.80 × 0.70 + 0.60 × 0.50
    = 0.56 + 0.30
    = 0.86

Suppose the model predicts epsilon_hat = 0.45.

The simple MSE loss is:

L = (epsilon_hat - epsilon)^2
  = (0.45 - 0.50)^2
  = (-0.05)^2
  = 0.0025

Good.

That 0.0025 is the number the recall section below asks for.
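The arithmetic above can be checked with a few lines of Python. This is a minimal sketch for the scalar toy case only; the function names noise_sample and mse_loss are made up for illustration and do not come from any library.

```python
import math

def noise_sample(x0, eps, alpha_bar_t):
    """Forward noising: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def mse_loss(eps_hat, eps):
    """Squared error between predicted and true noise (scalar case)."""
    return (eps_hat - eps) ** 2

# Toy values from the worked example.
x0, alpha_bar_t, eps = 0.70, 0.64, 0.50
eps_hat = 0.45

x_t = noise_sample(x0, eps, alpha_bar_t)
loss = mse_loss(eps_hat, eps)

print(f"x_t  = {x_t:.4f}")   # x_t  = 0.8600
print(f"loss = {loss:.4f}")  # loss = 0.0025
```

In real training the same two steps run on whole image tensors, and the loss is averaged over all pixels and all samples in the batch.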

And because we predicted noise, we can also recover a clean-image estimate.

x0_hat = (x_t - 0.60 × epsilon_hat) / 0.80
       = (0.86 - 0.27) / 0.80
       = 0.59 / 0.80
       = 0.7375

So noise prediction is not a dead end.

It quietly gives you x0_hat too.
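The recovery step is just the forward equation solved for x0. A small sketch, again for the scalar toy case, with predict_x0 as a hypothetical helper name:

```python
import math

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Invert the forward noising equation using a noise estimate."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)

# With the model's imperfect guess we get an imperfect clean estimate.
x0_hat = predict_x0(x_t=0.86, eps_hat=0.45, alpha_bar_t=0.64)
print(f"{x0_hat:.4f}")    # 0.7375

# With the true noise we get the original clean value back.
x0_exact = predict_x0(x_t=0.86, eps_hat=0.50, alpha_bar_t=0.64)
print(f"{x0_exact:.4f}")  # 0.7000
```

Note how the 0.05 error in the noise estimate turns into a 0.0375 error in x0_hat: the noise error is scaled by 0.60 / 0.80.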

That is why the three stories stay connected.

3) Score matching and ELBO are two more views of the same training

Now zoom out.

In score language, the model learns a vector field that points from noisy regions toward probable image regions.

In ELBO language, we are maximizing a variational lower bound on the data log-likelihood of the full latent-variable model.

In code, we often optimize a weighted MSE.

These are not three unrelated religions.

They are three camera angles.

score view : learn where higher data density lies at each noise level
ELBO view  : justify the probabilistic model end to end
MSE view   : train a practical denoiser that actually converges on GPUs
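The score view and the noise view can be compared numerically on the toy values from section 2. The sketch below assumes the standard Gaussian forward process, where the score of the noising distribution for a known x0 equals the negative noise divided by sqrt(1 - alpha_bar_t). Both function names are hypothetical. A trained model only approximates the score of the full noisy data distribution; the exact match shown here holds for the single-sample conditional case.

```python
import math

def score_from_eps(eps, alpha_bar_t):
    """Convert a noise value into a score value: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps / math.sqrt(1.0 - alpha_bar_t)

def conditional_score(x_t, x0, alpha_bar_t):
    """Gradient w.r.t. x_t of log N(x_t; sqrt(alpha_bar_t) * x0, 1 - alpha_bar_t)."""
    return -(x_t - math.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

x0, alpha_bar_t, eps = 0.70, 0.64, 0.50
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

via_noise = score_from_eps(eps, alpha_bar_t)
via_density = conditional_score(x_t, x0, alpha_bar_t)

print(f"{via_noise:.4f}")    # -0.8333
print(f"{via_density:.4f}")  # -0.8333
```

The negative sign matches the geometry: x_t = 0.86 sits above the scaled clean value 0.56, so the score points back down toward it.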

Researchers keep the ELBO because it explains the model family cleanly.

Engineers keep the MSE because it is stable and simple.

Both are useful.
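One way to write the bridge, as a sketch: the ELBO-derived objective and the simple objective have the same form and differ only in a per-timestep weight. Here w_t is a generic placeholder for that weight (its exact ELBO value depends on the noise schedule and is not spelled out in this lesson), and epsilon_theta is notation for the network's noise prediction.

```latex
L_{\text{weighted}} = \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left[ w_t \,\bigl\lVert \epsilon - \epsilon_\theta(x_t, t) \bigr\rVert^2 \right],
\qquad
L_{\text{simple}} = L_{\text{weighted}} \ \text{with}\ w_t = 1
```

Setting the weight to 1 is what turns the variational bound into the plain MSE used in section 2.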

If you understand the bridge, interview questions stop sounding scary.

4) Why noise prediction became the practical favorite

Noise is statistically simple.

Images are not.

That one sentence explains a lot.

Predicting raw pixels means chasing texture, color, semantics, and dataset bias all at once.

Predicting Gaussian noise gives a cleaner target distribution.

It also behaves well across timesteps.

At every timestep, early or late, the target is the same kind of object: a sample of standard Gaussian noise.

This is why Stable Diffusion, Imagen, and many production U-Nets default to epsilon prediction or close cousins like v prediction.
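For reference, here is a sketch of the v-prediction target mentioned above, using the common definition v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x0. The helper names are hypothetical, and the numbers reuse the toy example from section 2.

```python
import math

def v_target(x0, eps, alpha_bar_t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    return math.sqrt(alpha_bar_t) * eps - math.sqrt(1.0 - alpha_bar_t) * x0

def x0_from_v(x_t, v, alpha_bar_t):
    """Recover the clean value from x_t and v."""
    return math.sqrt(alpha_bar_t) * x_t - math.sqrt(1.0 - alpha_bar_t) * v

def eps_from_v(x_t, v, alpha_bar_t):
    """Recover the noise from x_t and v."""
    return math.sqrt(1.0 - alpha_bar_t) * x_t + math.sqrt(alpha_bar_t) * v

x0, alpha_bar_t, eps = 0.70, 0.64, 0.50
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

v = v_target(x0, eps, alpha_bar_t)
print(f"{v:.4f}")                                # -0.0200
print(f"{x0_from_v(x_t, v, alpha_bar_t):.4f}")   # 0.7000
print(f"{eps_from_v(x_t, v, alpha_bar_t):.4f}")  # 0.5000
```

So v is a "close cousin" in a literal sense: it is a fixed linear mix of epsilon and x0, and either one can be read back out of it.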

practical recipe:
build x_t from known x0 and epsilon
train U-Net to predict epsilon
recover x0_hat when needed
sample with the same denoiser repeatedly
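The first two lines of the recipe can be sketched in pure Python. Everything here is illustrative: the alpha_bars values are made up, not a real schedule, and placeholder_denoiser is a hypothetical stand-in for a U-Net that always predicts zero noise. A real training loop would use tensors, a neural network, and an optimizer step.

```python
import math
import random

def make_training_example(x0, alpha_bars, rng):
    """Build one (x_t, t, eps) triple; eps is the regression target."""
    t = rng.randrange(len(alpha_bars))        # random timestep index
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise we sample and therefore know
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    x_t = [a * x + s * e for x, e in zip(x0, eps)]
    return x_t, t, eps

def mse(pred, target):
    """Mean squared error over all elements."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def placeholder_denoiser(x_t, t):
    """Hypothetical stand-in for a U-Net: predicts zero noise everywhere."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
alpha_bars = [0.99, 0.64, 0.10]   # illustrative values only
x0 = [0.70, -0.20, 0.10]          # a tiny three-"pixel" image

x_t, t, eps = make_training_example(x0, alpha_bars, rng)
loss = mse(placeholder_denoiser(x_t, t), eps)
print(f"t = {t}, loss = {loss:.4f}")
```

The label eps comes for free because we generated it ourselves, which is what makes the objective supervised.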

Elegant theory matters.

But when training budgets are real and deadlines are rude, people choose the target that optimizes cleanly.

Noise prediction earned that trust.

Where this lives in the wild

  • Stable Diffusion training scripts — usually optimize noise prediction because it is simple and stable across huge datasets.

  • SDXL variants with v-prediction — alternative parameterizations are chosen to behave better under certain scheduler settings.

  • Hugging Face Diffusers fine-tuning examples — losses are written in the practical epsilon-prediction form.

  • Imagen-style research code — the same denoising objective can be explained through score matching and variational bounds.

  • Internal product ablations — teams often compare epsilon, x0, and v objectives by prompt adherence and sampler stability.


Pause and recall

  • What three related targets can the denoiser be trained to predict?

  • In the worked example, what was the MSE loss when epsilon_hat = 0.45 and epsilon = 0.50?

  • Why does predicting noise let us recover a clean-image estimate too?

  • Why do engineers often use the simple noise-prediction loss even when the ELBO view is richer?


Interview Q&A

Q: Why is diffusion training supervised even though generation looks unsupervised at inference time? A: Because during training we know the clean image, the timestep, and the exact sampled noise that produced the noisy input. Common wrong answer to avoid: "The model has to invent its own targets because no labels exist."

Q: Why can predicting noise be easier than directly predicting the clean image? A: Because Gaussian noise has simple, known statistics, the target is more regular across the dataset and across timesteps. Common wrong answer to avoid: "Noise prediction is only used because researchers dislike probability theory."

Q: Why does score matching appear in diffusion explanations? A: Because denoising can be understood as learning the gradient that points toward higher-density image regions at each noise level. Common wrong answer to avoid: "Score matching is a completely unrelated objective with no link to denoising."

Q: Why do we keep the ELBO interpretation if we train with MSE? A: Because the ELBO explains the probabilistic foundation, while the simplified MSE gives a practical implementation that works well. Common wrong answer to avoid: "Once we use MSE, the probabilistic story becomes irrelevant."


Apply now (5 min)

Quick exercise. Pick your own x0, alpha_bar_t, and epsilon, then form x_t and recover x0_hat from a guessed epsilon_hat.

Compute every intermediate value by hand once.

Sketch from memory the triangle noise prediction ↔ x0 prediction ↔ score matching.

Under the sketch, write one line on why the sculptor's training works only because we know the corruption we injected.


Bridge. Good. We have the learning target now. So how do we actually generate an image with the classic diffusion loop? We start at pure noise and walk all the way back. → 06-ddpm-sampling.md