03. Week 14 — Diffusion Models: Study Material
Companion files: Weekly Plan · Explainer · Daily Recall · Assignment · Revision
How to use this file: This is a reference companion. Read 02_explainer.md first for narrative understanding. Use this file for concise definitions, formulas, and structured comparisons when reviewing.
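Section 1 — The forward process (adding noise)
Start with a clean image x_0. Gradually add Gaussian noise over T steps:
q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t · I)
After T steps (T ~ 1000), x_T is approximately pure Gaussian noise. The noise schedule (β_1, ..., β_T) is fixed in advance, not learned.
Closed-form: Skip directly from x_0 to any x_t:
x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε, where ε ~ N(0, I), α_t = 1 − β_t, and ᾱ_t = α_1 · α_2 · ... · α_t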
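The closed-form jump can be sketched in a few lines of NumPy. This is a minimal illustration, not the explainer's code: the linear schedule from 1e-4 to 0.02 is the one commonly associated with DDPM and is an assumption here, and `make_schedule` and `q_sample` are illustrative names.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed values) plus the cumulative products alpha_bar."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Jump from x_0 straight to x_t; t is a 1-based step index."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))      # stand-in for a tiny image
x_early, _ = q_sample(x0, 10, alpha_bars, rng)    # still mostly image
x_late, _ = q_sample(x0, 1000, alpha_bars, rng)   # almost entirely noise
signal_left = np.sqrt(alpha_bars[-1])             # weight on x_0 at t = T
```

With this schedule the weight on x_0 at t = T is below 0.01, which is why x_T can be treated as pure noise.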
Section 2 — The reverse process (denoising)
Train a model εθ to predict the noise added at each step:
- Sample (x_0, t) and ε ~ N(0, I), then add noise to get x_t
- Train the model to predict ε (simplified DDPM loss):
L_simple = E over (x_0, t, ε) of ‖ε − εθ(x_t, t)‖²
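One evaluation of the simplified loss might look like the sketch below. It assumes a linear beta schedule (1e-4 to 0.02, T = 1000) and uses `zero_model`, a hypothetical placeholder that always predicts zero noise, where a trained network would normally go.

```python
import numpy as np

def simple_ddpm_loss(eps_model, x0_batch, alpha_bars, rng):
    """Mean squared error between the true and predicted noise for one batch."""
    T = len(alpha_bars)
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)            # one uniform timestep per example
    eps = rng.standard_normal(x0_batch.shape)         # the noise the model must recover
    # Reshape alpha_bar so it broadcasts over the non-batch dimensions.
    a_bar = alpha_bars[t - 1].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    eps_pred = eps_model(x_t, t)
    return np.mean((eps - eps_pred) ** 2)

def zero_model(x_t, t):
    """Hypothetical placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0_batch = rng.uniform(-1.0, 1.0, size=(256, 8, 8))
loss = simple_ddpm_loss(zero_model, x0_batch, alpha_bars, rng)
```

A model that predicts zero scores a loss near 1.0 (the variance of ε), which is a useful baseline: a trained denoiser should land well below it.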
At inference: start from pure noise, run T reverse steps, get a sample.
This is the core of DDPM (Ho et al., 2020). See worked numerical examples in 02_explainer.md Chapters 2 and 3.
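The inference loop can be sketched as standard DDPM ancestral sampling. Several things here are assumptions for illustration: the linear beta schedule, the choice σ_t² = β_t, and the toy noise predictor, which is exact only when the data distribution is N(0, 1) and merely stands in for a trained U-Net.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Run T reverse steps, starting from pure noise."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)                    # x_T
    for t in range(T, 0, -1):                         # t = T, ..., 1
        b, a, a_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps_pred = eps_model(x, t)
        mean = (x - b / np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a)
        noise = rng.standard_normal(shape) if t > 1 else 0.0  # no noise on the last step
        x = mean + np.sqrt(b) * noise                 # assumes sigma_t^2 = beta_t
    return x

def make_toy_eps_model(alpha_bars):
    """Exact noise predictor for N(0, 1) data; a stand-in for a trained network."""
    def eps_model(x_t, t):
        return np.sqrt(1.0 - alpha_bars[t - 1]) * x_t
    return eps_model

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)                 # assumed schedule
toy_model = make_toy_eps_model(np.cumprod(1.0 - betas))
samples = ddpm_sample(toy_model, (10_000,), betas, rng)
sample_var = samples.var()
```

Because the toy data distribution is a unit Gaussian, the generated samples should also have variance close to 1, which gives a quick sanity check on the update rule.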
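Section 3 — Score-based interpretation
Equivalent formulation: predict the score, the gradient of the log probability density with respect to the noisy sample:
∇_{x_t} log q(x_t) ≈ −εθ(x_t, t) / √(1 − ᾱ_t)
Same math; different intuition. The score and the noise prediction differ only by a known scale factor, so both lead to the same model in practice.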
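The link between noise prediction and the score, score ≈ −ε_pred / √(1 − ᾱ_t), can be checked numerically on a toy case where everything is Gaussian. The data distribution N(2, 0.5²) and the noise level ᾱ_t = 0.5 below are arbitrary choices for illustration; a least-squares line plays the role of the noise predictor, which is optimal here because the conditional mean is linear for jointly Gaussian variables.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a_bar = 2.0, 0.5, 0.5      # toy data N(mu, sigma^2) and one noise level
n = 200_000
x0 = mu + sigma * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Noise predictor eps_hat(x_t) = slope * x_t + intercept, fitted by least squares.
slope, intercept = np.polyfit(x_t, eps, 1)

# Analytic score of the Gaussian marginal q(x_t) = N(m, v).
m = np.sqrt(a_bar) * mu
v = a_bar * sigma**2 + (1.0 - a_bar)

x_query = np.array([0.0, 1.0, 2.0, 3.0])
score_from_eps = -(slope * x_query + intercept) / np.sqrt(1.0 - a_bar)
score_exact = -(x_query - m) / v
```

The two score arrays agree up to sampling error, which is the sense in which the two formulations are the same model.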
Section 4 — U-Net architecture
The model εθ is a U-Net. Encoder-decoder with skip connections:
Input → downsampling blocks → bottleneck → upsampling blocks → output
(skips between matching encoder and decoder levels)
Why U-Net? Multi-scale: large-scale structure at bottleneck, fine details from skips.
In conditional diffusion: cross-attention layers at each resolution inject text embeddings.
Timestep t is encoded as a sinusoidal embedding and added to every residual block.
See full ASCII diagram in 02_explainer.md Chapter 3.
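A sinusoidal timestep embedding can be sketched as below. The dimension 128 and `max_period` of 10000 are assumed values, and in real models the embedding is usually passed through a small MLP before being added inside each residual block; that projection is omitted here.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps; `dim` must be even."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding([0, 1, 500, 999], dim=128)   # one row per timestep
```

Each timestep maps to a distinct vector with entries in [−1, 1], so the network can tell noise levels apart without a learned lookup table.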
Section 5 — Latent diffusion (Stable Diffusion)
Pixel-space diffusion is expensive (512×512×3 = 786,432 values, about 786K, processed at every step).
Latent diffusion (Rombach 2022):
1. VAE encoder E: image (512×512×3) → latent z (64×64×4). 8x downsampling per side, so ~64x fewer spatial positions (48x fewer values overall).
2. Diffuse in latent space: same DDPM algorithm, smaller tensors.
3. VAE decoder D: z (64×64×4) → image (512×512×3).
The denoiser runs on ~64x fewer spatial positions, which is where most of the compute saving comes from. This is Stable Diffusion's core architecture.
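The compression figures follow from simple arithmetic on the two tensor shapes quoted in this section, shown here as a short check:

```python
import math

pixel_shape = (512, 512, 3)    # H, W, channels in pixel space
latent_shape = (64, 64, 4)     # H, W, channels in the VAE latent space

pixel_values = math.prod(pixel_shape)
latent_values = math.prod(latent_shape)

per_side_factor = pixel_shape[0] // latent_shape[0]
spatial_ratio = (pixel_shape[0] * pixel_shape[1]) // (latent_shape[0] * latent_shape[1])
value_ratio = pixel_values // latent_values
```

The 64x figure counts spatial positions (8x per side); counting every stored value, including the extra latent channel, the reduction is 48x.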
Section 6 — Text conditioning
Text prompts → CLIP/T5 text encoder → token embeddings.
Conditioning is injected into the U-Net via cross-attention at every resolution level:
Q = W_Q(image_features)
K = W_K(text_embeddings)
V = W_V(text_embeddings)
output = softmax(QK^T / √d) · V
See 02_explainer.md Chapter 4 for a full explanation of how spatial regions attend to text tokens.
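The four cross-attention lines above translate almost directly into NumPy. This is a single-head sketch with random weights; the sizes (a 4×4 feature map flattened to 16 positions, 5 text tokens, projection width 8) are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_features, text_embeddings, W_Q, W_K, W_V):
    """Queries come from image positions; keys and values come from text tokens."""
    Q = image_features @ W_Q
    K = text_embeddings @ W_K
    V = text_embeddings @ W_V
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (positions, tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_pos, n_tokens, c_img, c_txt, d = 16, 5, 32, 24, 8   # illustrative sizes
image_features = rng.standard_normal((n_pos, c_img))
text_embeddings = rng.standard_normal((n_tokens, c_txt))
W_Q = rng.standard_normal((c_img, d))
W_K = rng.standard_normal((c_txt, d))
W_V = rng.standard_normal((c_txt, d))
out, weights = cross_attention(image_features, text_embeddings, W_Q, W_K, W_V)
```

Each row of `weights` sums to 1 and says how strongly one spatial position attends to each text token.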
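Section 7 — Classifier-free guidance
Train one model for both conditional and unconditional denoising (randomly drop the text prompt during training).
At inference, blend the two noise predictions:
ε̂ = εθ(x_t, t, ∅) + w · (εθ(x_t, t, c) − εθ(x_t, t, ∅))
where c is the text condition and ∅ is the empty (dropped) prompt.
w (guidance scale) controls fidelity to prompt:
- w = 0 → ignores prompt
- w = 1 → plain conditional prediction, no extra guidance
- w = 7-10 → strong prompt adherence (typical default)
- w > 15 → over-saturated, artifacts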
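The blend is a one-line extrapolation from the unconditional prediction toward the conditional one. The sketch assumes the common convention ε̂ = ε_uncond + w · (ε_cond − ε_uncond), which matches "w = 0 ignores the prompt"; the arrays are made-up stand-ins for the two U-Net outputs.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # prediction with the prompt dropped
eps_cond = np.array([1.0, 1.0, 0.0])      # prediction with the prompt

ignore_prompt = guided_eps(eps_uncond, eps_cond, 0.0)   # equals eps_uncond
plain_cond = guided_eps(eps_uncond, eps_cond, 1.0)      # equals eps_cond
strong = guided_eps(eps_uncond, eps_cond, 7.5)          # extrapolates past eps_cond
```

For w > 1 the result lies beyond the conditional prediction, which is why very large scales push values out of range and produce over-saturated images.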
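Section 8 — ControlNet
Add spatial conditioning beyond text:
- Edge maps (Canny)
- Depth maps
- Pose skeletons
- Segmentation maps
Architecture: frozen U-Net + trainable adapter that takes the spatial conditioning input. In the ControlNet paper the adapter is a trainable copy of the U-Net encoder blocks, joined to the frozen network through zero-initialized convolutions. Allows precise control: "this layout, that style."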
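One detail worth seeing in code is how the adapter's output can be merged without disturbing the frozen model at the start of training. The sketch assumes the zero-initialized 1x1 convolution described in the ControlNet paper; the feature shapes and random tensors are placeholders, not real U-Net activations.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution over an (H, W, C_in) map, i.e. a per-position linear layer."""
    return features @ weight + bias

rng = np.random.default_rng(0)
frozen_out = rng.standard_normal((8, 8, 16))     # features from a frozen U-Net block
control_feat = rng.standard_normal((8, 8, 16))   # features from the trainable branch

# At initialization the connecting convolution is all zeros: the adapter adds nothing.
zero_w, zero_b = np.zeros((16, 16)), np.zeros(16)
combined_at_init = frozen_out + conv1x1(control_feat, zero_w, zero_b)

# After some training the weights are nonzero and the control signal shifts the features.
trained_w = 0.1 * rng.standard_normal((16, 16))  # stand-in for learned weights
combined_later = frozen_out + conv1x1(control_feat, trained_w, zero_b)
```

Starting from an exact no-op means fine-tuning begins at the pretrained model's behavior and only gradually learns to use the edge, depth, or pose input.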
Section 9 — Modern architectures
| Architecture / method | Key idea | Used in |
|---|---|---|
| DDPM U-Net | Original; convolutional U-Net | SD 1.x, SD 2.x |
| Latent Diffusion | Diffuse in VAE latent space | All SD variants |
| DiT | Transformer replaces U-Net; adaLN conditioning | SD3, FLUX |
| Flow Matching | Linear interpolation paths; velocity prediction | SD3, FLUX |
| DDIM | Deterministic ODE sampler; 20-50 steps | Universal sampler |
| LCM | Distillation to 4-8 steps | SDXL-LCM |
| Consistency Models | Single-step generation | Research + production |
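To make the DDIM row concrete, here is a sketch of one deterministic update (the η = 0 case). It assumes the same linear beta schedule as DDPM and feeds in the true noise in place of a model prediction, so the result can be compared against the closed-form forward process at the earlier timestep.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update (eta = 0) from noise level a_bar_t to a_bar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    # Re-noise that estimate to the earlier level, reusing the same predicted noise.
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal((4, 4))

t, t_prev = 800, 780                              # skip 20 steps in one update
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
x_prev = ddim_step(x_t, eps, alpha_bars[t - 1], alpha_bars[t_prev - 1])
expected = np.sqrt(alpha_bars[t_prev - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t_prev - 1]) * eps
```

With a perfect noise estimate the step lands exactly on the forward-process sample at the earlier timestep, which is why large strides (20-50 total steps) remain usable when the model's prediction is good.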
Section 10 — Diffusion vs LLMs
| Aspect | LLMs | Diffusion |
|---|---|---|
| Generation | Sequential (token-by-token) | Iterative (refining whole sample) |
| Architecture | Transformer | U-Net or DiT |
| Math | Maximum likelihood / CE loss | Score matching / MSE loss |
| Output | Discrete (tokens) | Continuous (pixels / latents) |
| Latency | Linear in output length | ~O(T) fixed sampling steps |
| Modalities | Text mostly | Image, video, audio |
Recent text diffusion models (Diffusion-LM, others) exist but are research-stage.
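Reading list
- "Denoising Diffusion Probabilistic Models" (DDPM; Ho et al., 2020)
- "High-Resolution Image Synthesis with Latent Diffusion Models" (Latent Diffusion / Stable Diffusion; Rombach et al., 2022)
- Lilian Weng's blog on diffusion (lilianweng.github.io)
- "Adding Conditional Control to Text-to-Image Diffusion Models" (ControlNet; Zhang et al., 2023)
- "Scalable Diffusion Models with Transformers" (DiT; Peebles & Xie, 2023)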
- HF diffusers library tutorials
Reference material
YouTube
- DDPM - Diffusion Models Beat GANs on Image Synthesis (Paper Explained) — Rigorous walkthrough of the DDPM paper covering the forward noising process, reverse denoising, and why diffusion replaced GANs.
- Coding Stable Diffusion from Scratch in PyTorch — Implements the full Stable Diffusion stack from first principles, including CLIP, the VAE, the UNet, and samplers.
Blogs
- What are Diffusion Models? — Strong theoretical reference on DDPMs, score-based models, and latent diffusion with derivations and intuition.
- The Annotated Diffusion Model — Pairs DDPM math directly with runnable PyTorch code, making it easy to bridge theory and implementation.
Self-check
- Forward vs reverse diffusion — what does each do?
- Write the closed-form x_t expression. What is ᾱ_t?
- Why latent diffusion vs pixel diffusion? What does the VAE contribute?
- Classifier-free guidance: what is trained differently? What is computed at inference?
- What does guidance scale w = 15 produce and why?
- DiT vs U-Net: architectural difference and scaling advantage?