00. Diffusion Models: The Five-Year-Old Version

A statue seems to appear out of fog, but really it is carved one careful tap at a time.

Look. Imagine a sculptor staring at a rough stone block. Nothing is visible yet. Just dust and possibility. That rough stone is the marble block. In diffusion, that is pure Gaussian noise. It is the ugly starting point.

Now the sculptor does not smash the whole thing at once. They make one careful tap. Then another. Then another. Each tap removes a little confusion. Each tap reveals a little shape. That one tap is the chisel stroke. In diffusion, that is one denoising step.

But a random person cannot do this. The sculptor has practiced. They have learned how real statues should look. That practice is the sculptor's training. In diffusion, that is the learned reverse process. We take real images, add noise ourselves, and show the noisy versions to the model. Because we added the noise, we know the right answer, so the model can learn how to undo it. Simple, no?
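
Here is a minimal sketch of how one practice example could be made. It assumes the standard DDPM closed-form noising rule and a linear beta schedule from 1e-4 to 0.02 over 1000 steps. The names `make_alpha_bars` and `add_noise` are made up for illustration, and the "image" is just three numbers. It is not a real training loop.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Mix a clean sample with Gaussian noise at step t.

    Returns (x_t, eps). The pair (x_t, t) is what the model sees,
    and eps is the answer it is asked to predict.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
alpha_bars = make_alpha_bars()
clean = [0.5, -1.0, 2.0]                  # toy stand-in for a real image
t = rng.randrange(len(alpha_bars))        # pick a random noise level
noisy, target_noise = add_noise(clean, t, alpha_bars, rng)
print(t, noisy, target_noise)
```

Small `t` keeps most of the statue. Large `t` leaves almost only dust. Training repeats this over many images and many noise levels.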

Now suppose the customer says, "Make a flying tiger made of paper." That instruction is the blueprint. In diffusion, that is text conditioning or a guidance signal. It tells the sculptor what to reveal. It does not do the carving by itself. The carving still happens through many denoising moves.
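
A rough sketch of how the blueprint can steer each tap. It assumes classifier-free guidance in the form many implementations use: the model predicts the noise once without the prompt and once with it, and we move from the first toward the second, possibly past it. `guided_noise` is a hypothetical helper name. The two predictions are hand-typed numbers standing in for real model outputs.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Blend the unprompted and prompted noise predictions.

    guidance_scale = 0 ignores the prompt, 1 uses the prompted
    prediction as-is, and larger values push further in the
    direction the prompt suggests.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_without_prompt = [0.0, 1.0]   # made-up model output, no blueprint
eps_with_prompt = [1.0, 1.0]      # made-up model output, with blueprint

for scale in (0.0, 1.0, 2.0):
    print(scale, guided_noise(eps_without_prompt, eps_with_prompt, scale))
```

Notice that the blueprint only changes which noise gets removed. The carving loop around it stays the same.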

Sometimes we want the same statue faster. So the sculptor skips some tiny taps. Or learns a shortcut from an expert. That shortcut is the speed shortcut. In diffusion, that means fewer sampling steps or distillation. The whole module is just this story in engineering form. We start from noise. We learn tiny repairs. We end with an image.
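
The whole story fits in one small loop. The sketch below assumes a deterministic DDIM-style update and the same linear beta schedule as DDPM. The "sculptor" here is `oracle_noise_prediction`, a hypothetical stand-in that cheats by already knowing the finished statue. A real model has to learn that prediction. All function names are made up for illustration.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def pick_timesteps(total_steps, num_taps):
    """Evenly strided timesteps, noisiest first. Fewer taps means bigger strides."""
    stride = total_steps // num_taps
    return list(range(total_steps - 1, -1, -stride))[:num_taps]


def oracle_noise_prediction(x_t, alpha_bar, statue):
    """Hypothetical stand-in for a trained network: it knows the target."""
    return [(x - math.sqrt(alpha_bar) * s) / math.sqrt(1.0 - alpha_bar)
            for x, s in zip(x_t, statue)]


def sample(statue, alpha_bars, num_taps, rng):
    timesteps = pick_timesteps(len(alpha_bars), num_taps)
    x = [rng.gauss(0.0, 1.0) for _ in statue]      # the marble block: pure noise
    for i, t in enumerate(timesteps):              # each pass is one chisel stroke
        a_t = alpha_bars[t]
        # After the last tap we jump to the fully clean level (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = oracle_noise_prediction(x, a_t, statue)
        # Estimate the clean sample, then re-noise it to the next, lower level.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x


statue = [0.5, -1.0, 2.0]
alpha_bars = make_alpha_bars()
print(sample(statue, alpha_bars, num_taps=1000, rng=random.Random(0)))
print(sample(statue, alpha_bars, num_taps=10, rng=random.Random(0)))
```

Because the stand-in cheats, even a handful of taps lands on the statue. A learned model makes mistakes at every tap, which is why cutting the number of steps usually costs some quality.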

The placeholders you will see called back

Placeholder	Meaning
the marble block	pure Gaussian noise / starting point
the chisel stroke	one denoising step
the sculptor's training	learned reverse process
the blueprint	text conditioning / guidance signal
the speed shortcut	fewer sampling steps / distillation

Top resources

Lil'Log on diffusion models — the cleanest intuition-to-math bridge.
Hugging Face annotated diffusion blog — practical DDPM walkthrough with code.
Hugging Face Diffusers docs — scheduler, pipeline, and training implementation map.
The Illustrated Stable Diffusion — very good end-to-end visuals for text-to-image.
Denoising Diffusion Probabilistic Models — the baseline DDPM paper for the core loop.
Denoising Diffusion Implicit Models — the key idea behind faster deterministic sampling.
Latent Diffusion Models — why modern image diffusion works in compressed latent space.
Classifier-Free Diffusion Guidance — how prompt steering works without a separate classifier.

What's coming

01-opening-failure.md — why CLIP plus noise plus gradient ascent gives weird pictures.
02-forward-process.md — how we add noise step by step in a controlled way.
03-noise-schedules.md — why the rate of corruption matters.
04-reverse-process.md — how a model learns to undo noise.
05-training-objective.md — what the denoiser is actually trained to predict.
06-ddpm-sampling.md — the original slow generation loop.
07-ddim-accelerated-sampling.md — fewer steps with a more direct path.
08-classifier-free-guidance.md — steering the image toward the prompt.
09-latent-diffusion.md — why we usually diffuse in compressed latent space.
10-text-to-image-architecture.md — Stable Diffusion as a full production stack.
11-controlnet-image-to-image.md — adding edges, depth, and pose as extra rails.
12-consistency-models-distillation.md — how we chase near-instant generation.
13-honest-admission.md — what still breaks and what we still do not know.

Bridge. Good. We know the statue metaphor now. But if the blueprint is so helpful, why not just optimize a noisy image against CLIP and finish the job? → 01-opening-failure.md