02. Forward process — teaching a clean image to disappear politely

~11 min read. How a crisp picture gets turned into noise in a way the math can still track.

Built on the ELI5 in 00-eli5.md. The marble block — the pure Gaussian-noise starting point — does not appear by magic; we create it by gradually corrupting real images in a known way.

1) Why we deliberately ruin a clean image

Picture a studio exercise first.

A teacher gives you a clean passport photo.

Then the teacher sprays a little static on it.

Then a little more.

Then so much that only rough structure is left.

Why do this on purpose?

Because learning to undo known damage is easier than learning to hallucinate from nothing.

The forward process creates the exam sheet and the answer key together.

Low-noise steps teach edge repair.

Mid-noise steps teach shape repair.

High-noise steps teach the model what a plausible image family even looks like.

┌──────────┐  small noise  ┌──────────┐  more noise  ┌──────────┐  repeat  ┌──────────┐
│ clean x0 │ ────────────→ │   x1     │ ───────────→ │   x2     │ ───────→ │   xT     │
└──────────┘               └──────────┘              └──────────┘          └──────────┘
     │                                                                                 │
     └──────────────────────────── known corruption recipe ─────────────────────────────┘

This is not random vandalism.

It is curriculum design.

Stable Diffusion XL, Adobe Firefly, and internal research trainers all rely on this idea.

One clean image becomes many supervised cases.

Pick timestep t.

Sample fresh Gaussian noise.

Now you have a noisy input and a known target.

Here the corruption recipe gives labels for free.

If the final state looks like plain Gaussian noise, sampling can start from a simple distribution.

That simple start is the whole trick.

2) The one-step rule and a worked number example

Keep the picture in mind.

Each forward step keeps most of yesterday's image and mixes in one fresh hiss term.

Then the formula feels natural.

x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) epsilon_t

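Here beta_t is the variance of the noise added at step t, and epsilon_t is a fresh standard Gaussian sample.

Here is a tiny scalar example.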

Start with a clean value x0 = 0.80.

Let beta1 = 0.20 and sampled noise epsilon1 = -0.20.

Then sqrt(1 - beta1) = sqrt(0.80) = 0.8944.

And sqrt(beta1) = sqrt(0.20) = 0.4472.

So the first corruption step is:

x1 = 0.8944 × 0.80 + 0.4472 × (-0.20)
   = 0.7155 - 0.0894
   = 0.6261
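
The arithmetic above can be reproduced in a few lines of Python. This is a minimal scalar sketch, not a training implementation: the function name `forward_step` is made up for illustration, and a real pipeline would apply the same rule elementwise to image or latent tensors.

```python
import math
import random


def forward_step(x_prev, beta, eps=None):
    """One forward noising step on a single scalar value.

    Implements x_t = sqrt(1 - beta) * x_prev + sqrt(beta) * eps.
    If eps is not given, a fresh standard Gaussian sample is drawn.
    """
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta) * x_prev + math.sqrt(beta) * eps


# Reproduce the first step of the toy example with its fixed noise value.
x1 = forward_step(0.80, beta=0.20, eps=-0.20)
print(round(x1, 4))  # 0.6261
```

Passing `eps` explicitly keeps the toy example reproducible; leaving it out gives the random behaviour a trainer would actually want.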

Now do one more step.

Use beta2 = 0.20 again.

Use fresh noise epsilon2 = -0.4181.

Then:

x2 = 0.8944 × 0.6261 + 0.4472 × (-0.4181)
   = 0.5600 - 0.1870
   = 0.3730

Good.

That is the exact toy number used in the recall question.

See what the rule did.

The clean signal did not vanish in one slap.

It faded politely.

That politeness is what makes the reverse job learnable.

If beta_t were huge, one step would erase too much.

If beta_t were tiny forever, the terminal state would never become simple enough.

So even this little rule carries a design trade-off.
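
One way to put numbers on this trade-off is to track how much of the original value survives. Applying the one-step rule repeatedly multiplies x0 by sqrt(1 - beta) each time, so with a constant beta (a simplifying assumption; real schedules vary beta over time) the weight on x0 after n steps is (1 - beta)^(n/2). The helper name `signal_kept` is hypothetical.

```python
def signal_kept(beta, steps):
    """Weight multiplying x0 after `steps` forward steps at constant beta."""
    return (1.0 - beta) ** (steps / 2.0)


# Toy setting from the worked example: two steps at beta = 0.20.
print(round(signal_kept(0.20, 2), 4))      # 0.8

# A huge beta wipes out most of the signal in a single step.
print(round(signal_kept(0.90, 1), 4))      # 0.3162

# A tiny beta leaves the value almost untouched even after 1000 steps.
print(round(signal_kept(1e-5, 1000), 4))   # 0.995
```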

3) Why the direct formula is such a gift

Now the engineering shortcut.

We do not have to replay every earlier step just to build x_t during training.

All those little Gaussian steps collapse into one direct formula.

x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) epsilon

Here alpha_bar_t means the product of alpha_i = 1 - beta_i over every step i from 1 up to and including t.

For the two-step toy above, alpha_bar_2 = 0.80 × 0.80 = 0.64.

So we can also write:

x2 = sqrt(0.64) × 0.80 + sqrt(0.36) × epsilon
   = 0.80 × 0.80 + 0.60 × epsilon
   = 0.64 + 0.60 × epsilon

To hit the same x2 = 0.3730, one equivalent cumulative noise draw is epsilon = -0.4450.

Check it.

0.64 + 0.60 × (-0.4450) = 0.3730.
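
For readers who want to see why the two steps collapse into one, here is the substitution written out. It uses alpha_i = 1 - beta_i and assumes epsilon_1 and epsilon_2 are independent standard Gaussians, which is the usual DDPM setup rather than something stated explicitly above. The last line holds in distribution: the two noise terms merge into a single Gaussian term.

```latex
\begin{aligned}
x_2 &= \sqrt{\alpha_2}\,x_1 + \sqrt{\beta_2}\,\epsilon_2 \\
    &= \sqrt{\alpha_2}\left(\sqrt{\alpha_1}\,x_0 + \sqrt{\beta_1}\,\epsilon_1\right)
       + \sqrt{\beta_2}\,\epsilon_2 \\
    &= \sqrt{\alpha_1 \alpha_2}\,x_0
       + \underbrace{\sqrt{\alpha_2 \beta_1}\,\epsilon_1 + \sqrt{\beta_2}\,\epsilon_2}_{\text{Gaussian with variance } \alpha_2 \beta_1 + \beta_2} \\
    &= \sqrt{\bar\alpha_2}\,x_0 + \sqrt{1 - \bar\alpha_2}\,\epsilon,
\qquad \text{since } \alpha_2 \beta_1 + \beta_2
       = \alpha_2 (1 - \alpha_1) + (1 - \alpha_2)
       = 1 - \alpha_1 \alpha_2 .
\end{aligned}
```

With the toy values, the merged variance is 0.80 × 0.20 + 0.20 = 0.36, which matches 1 - 0.64.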

This is a gift because training becomes cheap and clean.

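Pick a timestep, sample one noise value, and construct x_t directly.

Hugging Face Diffusers schedulers precompute the cumulative alpha_bar values as buffers (for example, alphas_cumprod on DDPMScheduler), so each lookup is cheap.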

Without the closed form, every minibatch would waste work replaying earlier steps.

With it, one photo can instantly become a timestep 50 case or a timestep 800 case.
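
Below is a minimal sketch of how a training loop might use the closed form, in plain Python on a single scalar pixel. The names `make_alpha_bars`, `q_sample`, and `make_training_pair` are hypothetical, and the linear beta range (1e-4 to 0.02 over 1000 steps) is an assumed schedule in the style of the original DDPM setup, not something this file specifies.

```python
import math
import random


def make_alpha_bars(betas):
    """Precompute alpha_bar_t = product of (1 - beta_i) for i = 1..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, eps):
    """Closed-form x_t for a 1-indexed timestep t and a given noise value."""
    a_bar = alpha_bars[t - 1]
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps


def make_training_pair(x0, alpha_bars, rng=random):
    """Pick a random timestep and fresh noise; return (x_t, t, eps)."""
    t = rng.randint(1, len(alpha_bars))  # inclusive on both ends
    eps = rng.gauss(0.0, 1.0)
    return q_sample(x0, t, alpha_bars, eps), t, eps


# Check against the toy example: two steps at beta = 0.20.
toy_bars = make_alpha_bars([0.20, 0.20])
print(round(toy_bars[1], 2))                               # 0.64
print(round(q_sample(0.80, 2, toy_bars, eps=-0.4450), 4))  # 0.373

# Assumed linear schedule with 1000 steps, computed once and reused.
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
alpha_bars = make_alpha_bars(betas)

# x_t is the noisy input; eps (or x0 itself) can serve as the known target.
x_t, t, eps = make_training_pair(0.80, alpha_bars)
```

The table of alpha_bar values is built once, so producing a timestep 50 case or a timestep 800 case costs the same single lookup and two multiplications.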

4) What the forward process must guarantee

A good forward process must promise four things.

First, the final state should be close to a simple Gaussian start.

Second, the information should disappear smoothly, not fall off a cliff.

Third, the coefficients should stay numerically stable and easy to precompute.

Fourth, neighboring timesteps should still feel related so the reverse model learns a gentle ladder.

Think like a product engineer.

If the corruption is too harsh, mid steps become useless mush.

If it is too weak, x_T still carries image signal, so starting generation from plain Gaussian noise no longer matches what the model saw in training.

If the schedule is jagged, training difficulty jumps around.

Then loss curves wobble and sample quality turns moody.

good forward path:
clean detail ──→ rough layout ──→ vague structure ──→ Gaussian-like start

bad forward path:
clean detail ──→ sudden mush ──→ still mush ──→ hard-to-learn reverse jump
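
A rough way to sanity-check a schedule against the first two promises is to follow the weight on x0, sqrt(alpha_bar_t), along the path. The sketch below is illustrative only: `schedule_report` is a made-up helper, the three example schedules are assumptions chosen to exaggerate each failure mode, and what counts as "small enough" is left to the reader.

```python
import math


def schedule_report(betas):
    """Return (terminal_signal, largest_single_step_drop) for a beta list.

    terminal_signal is sqrt(alpha_bar_T), the weight left on x0 at the end.
    largest_single_step_drop is the biggest fall in that weight in one step.
    """
    signal, largest_drop = 1.0, 0.0
    for beta in betas:
        next_signal = signal * math.sqrt(1.0 - beta)
        largest_drop = max(largest_drop, signal - next_signal)
        signal = next_signal
    return signal, largest_drop


steps = 1000
gentle = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
harsh = [0.90] * 4      # reaches noise quickly, but with one violent jump
weak = [1e-5] * steps   # smooth, but never gets close to pure noise

for name, betas in [("gentle", gentle), ("harsh", harsh), ("weak", weak)]:
    terminal, drop = schedule_report(betas)
    print(f"{name}: terminal signal {terminal:.4f}, largest drop {drop:.4f}")
```

A schedule on the good path shows a terminal signal near zero and no single step that removes a large share of it; the harsh and weak lists each break one of those two conditions.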

This is why file 03 matters so much.

The forward rule is not only "add noise".

It is also "add noise with a tempo that keeps the inverse problem solvable".

That tempo is the noise schedule.

Where this lives in the wild

Stable Diffusion training — random timesteps are sampled directly with the closed-form corruption rule instead of replaying all earlier steps.
Hugging Face Diffusers schedulers — alpha_bar and related buffers are precomputed to sample noisy latents efficiently.
Imagen-style training loops — the forward process provides clean supervision pairs at many noise levels.
Adobe Firefly image models — the same corruption logic is used so the denoiser can train on varied noise strengths.
Research U-Nets for medical imaging diffusion — controlled Gaussian corruption creates a known inverse problem for the network to solve.

Pause and recall

Why do we intentionally destroy a clean image before learning to generate one?
In the toy example, how did x0 = 0.80 become x2 = 0.3730?
What does alpha_bar_t tell us in one sentence?
Why is direct sampling of x_t from x_0 such a big practical win?

Interview Q&A

Q: Why is the forward process chosen by us instead of learned from data? A: Because we want a simple, known corruption distribution so the difficult part of the learning problem stays in the reverse process. Common wrong answer to avoid: "The model should learn both corruption and denoising from scratch."

Q: Why is Gaussian noise used so often? A: Because Gaussian transitions compose cleanly, making the math tractable and allowing direct sampling at arbitrary timesteps. Common wrong answer to avoid: "Because Gaussian noise looks most realistic to the human eye."

Q: Why do we care that the final state becomes almost pure noise? A: Because sampling starts there, so the reverse model needs a simple known starting distribution. Common wrong answer to avoid: "Because the forward process only exists for data augmentation."

Q: Why is gradual corruption better than one giant corruption step? A: Because the reverse model then only has to learn many local repairs instead of one impossible global recovery jump. Common wrong answer to avoid: "One big noising step is equivalent and always easier."

Apply now (5 min)

Quick exercise. Pick one grayscale value like 0.3 or 0.8 and run two noising steps with your own beta values.

Write down every intermediate number: sqrt(1 - beta), sqrt(beta), the sampled noise, and the new pixel value.

Sketch from memory the chain clean image → slightly noisy → more noisy → pure noise.

Under the sketch, write one line on why the marble block is something we manufacture during training.

Bridge. Good. We can now corrupt images in a controlled way. But how much noise should we add early, and how much should we save for later? That is the scheduler question. → 03-noise-schedules.md