03. Week 14 — Diffusion Models: Study Material

Companion files: Weekly Plan · Explainer · Daily Recall · Assignment · Revision

How to use this file: This is a reference companion. Read 02_explainer.md first for narrative understanding. Use this file for concise definitions, formulas, and structured comparisons when reviewing.


Section 1 — The forward process (adding noise)

Start with a clean image x_0. Gradually add Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) x_{t-1}, β_t I)

After T steps (typically T ≈ 1000), x_T is nearly indistinguishable from pure Gaussian noise. The variance schedule (β_1, ..., β_T) is fixed in advance rather than learned.

Closed form: jump directly from x_0 to any x_t:

x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε,    ε ~ N(0, I)

where α_i = 1 - β_i and ᾱ_t = α_1 · α_2 · ... · α_t (the product of (1 - β_i) for i = 1..t).
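The closed form above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule from 1e-4 to 0.02 is an assumed choice (it matches the commonly cited DDPM setting), and `make_schedule` and `q_sample` are hypothetical helper names.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed values) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)   # alpha_bars[t-1] is alpha-bar at step t
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from x_0 in one shot; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))     # stand-in for a tiny image
x_500, eps_500 = q_sample(x0, 500, alpha_bars, rng)
x_T, eps_T = q_sample(x0, 1000, alpha_bars, rng)

print("alpha_bar at t = 1, 500, 1000:", alpha_bars[0], alpha_bars[499], alpha_bars[-1])
print("largest gap between x_T and its noise:", np.abs(x_T - eps_T).max())
```

With this schedule ᾱ_T is very close to zero, so x_T is almost exactly the sampled noise, which is the sense in which the forward process destroys the image.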

Section 2 — The reverse process (denoising)

Train a model to predict the noise added at each step:

  • Sample (x_0, t) and add noise to get x_t.
  • Train the model to predict ε (simplified DDPM loss):

L = E[ ‖ε - εθ(x_t, t)‖² ]
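A rough sketch of one Monte Carlo estimate of this loss is shown below. The linear schedule is an assumption, `simplified_ddpm_loss` and `zero_model` are hypothetical names, and the placeholder model simply predicts zeros so the example runs without a trained network. The code averages over elements rather than summing, which differs from the squared norm only by a constant factor.

```python
import numpy as np

# Assumed linear schedule; any fixed schedule would work here.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def simplified_ddpm_loss(eps_model, x0, alpha_bars, rng):
    """Estimate E[(eps - eps_theta(x_t, t))^2] on one batch."""
    batch = x0.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=batch)            # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    # Reshape so each example's alpha-bar broadcasts over its pixels.
    a_bar = alpha_bars[t - 1].reshape(batch, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Placeholder for eps_theta: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(64, 3, 8, 8))
loss = simplified_ddpm_loss(zero_model, x0, alpha_bars, rng)
print("loss of the zero predictor:", loss)
```

A model that always predicts zero scores roughly 1, the variance of ε, so a trained network should land well below that.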

At inference: start from pure noise x_T ~ N(0, I), apply the learned denoising step T times (t = T, ..., 1), and return the final x_0 as the sample.

This is the core of DDPM (Ho et al., 2020). See worked numerical examples in 02_explainer.md Chapters 2 and 3.
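The reverse loop might look like the sketch below. It uses the standard DDPM ancestral update with the variance choice σ_t² = β_t, which is one common option and is not spelled out in these notes. To keep the example runnable without training, `toy_eps_model` is a hypothetical stand-in: it is the exact noise predictor for the toy case where the data itself is standard normal.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: x_T ~ N(0, I), then T reverse steps down to x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):                 # t = T, ..., 1
        i = t - 1
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        # No fresh noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * noise           # sigma_t^2 = beta_t (assumed)
    return x

betas = np.linspace(1e-4, 0.02, 1000)                  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def toy_eps_model(x_t, t):
    """Exact E[eps | x_t] when x_0 ~ N(0, I); stands in for a trained network."""
    return np.sqrt(1.0 - alpha_bars[t - 1]) * x_t

rng = np.random.default_rng(0)
samples = ddpm_sample(toy_eps_model, (4096,), betas, rng)
print("sample mean and std:", samples.mean(), samples.std())
```

Because the toy data distribution is N(0, I), the samples should come out with mean near 0 and standard deviation near 1.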

Section 3 — Score-based interpretation

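Equivalent formulation: the model predicts the score, i.e. the gradient of the log-density of the noised data with respect to x_t.

∇_{x_t} log p_t(x_t) ≈ s_θ(x_t, t) = -ε_θ(x_t, t) / sqrt(1 - ᾱ_t)

Same math, different intuition. Noise prediction and score prediction differ only by a known per-timestep scale factor, so both lead to the same model in practice.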
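The equivalence can be checked numerically on a toy case, using the standard relation score = -ε / sqrt(1 - ᾱ_t). The sketch assumes zero-mean Gaussian data with variance `data_var`, for which both the true score and the optimal noise prediction have closed forms. The function names are hypothetical.

```python
import numpy as np

def eps_to_score(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

def score_to_eps(score, alpha_bar_t):
    """Convert a score estimate back into a noise prediction."""
    return -np.sqrt(1.0 - alpha_bar_t) * score

# Toy setting: x_0 ~ N(0, data_var), so x_t ~ N(0, var_t).
alpha_bar_t = 0.3
data_var = 4.0
var_t = alpha_bar_t * data_var + (1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(5) * np.sqrt(var_t)

true_score = -x_t / var_t                                # gradient of log N(0, var_t)
optimal_eps = np.sqrt(1.0 - alpha_bar_t) * x_t / var_t   # E[eps | x_t] for jointly Gaussian (eps, x_t)

print(np.allclose(eps_to_score(optimal_eps, alpha_bar_t), true_score))
```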

Section 4 — U-Net architecture

The model εθ is a U-Net. Encoder-decoder with skip connections:

Input → downsampling blocks → bottleneck → upsampling blocks → output
        (skips between matching encoder and decoder levels)

Why U-Net? Multi-scale: large-scale structure at bottleneck, fine details from skips.
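To make the data flow concrete, here is a shape-only sketch with no learned layers: average pooling stands in for the downsampling blocks, nearest-neighbour repetition for the upsampling blocks, and skips are concatenated along the channel axis. A real U-Net has convolutions and attention at every level that mix and reduce channels; `toy_unet` is a hypothetical name.

```python
import numpy as np

def down(x):
    """2x average pooling over height and width; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up(x):
    """2x nearest-neighbour upsampling over height and width."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet(x, levels=3):
    skips = []
    h = x
    for _ in range(levels):                 # encoder: save a skip, then shrink
        skips.append(h)
        h = down(h)
    bottleneck_shape = h.shape              # coarsest, most global representation
    for _ in range(levels):                 # decoder: grow, then merge the matching skip
        h = up(h)
        h = np.concatenate([h, skips.pop()], axis=0)
    return h, bottleneck_shape

x = np.random.default_rng(0).standard_normal((4, 64, 64))
out, bottleneck_shape = toy_unet(x)
print("bottleneck:", bottleneck_shape, "output:", out.shape)
```

The output returns to the input resolution, while the bottleneck sees each feature map at 1/8 of it. The growing channel count is an artifact of skipping the convolutions that would normally project back down.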

In conditional diffusion: cross-attention layers at each resolution inject text embeddings.

Timestep t is encoded as a sinusoidal embedding and added to every residual block.
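A sinusoidal timestep embedding can be sketched as follows. The geometric frequency spacing follows the Transformer positional-encoding convention; `max_period = 10000` and an even `dim` are assumptions, and implementations differ in details such as sin/cos ordering. In practice the embedding usually passes through a small MLP before being added inside each residual block.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps to (len(t), dim) sinusoidal features; dim must be even."""
    t = np.asarray(t, dtype=np.float64).reshape(-1)
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding([0, 10, 999], dim=128)
print(emb.shape)
```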

See full ASCII diagram in 02_explainer.md Chapter 3.

Section 5 — Latent diffusion (Stable Diffusion)

Pixel-space diffusion is expensive (512×512×3 = 786K values per step).

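Latent diffusion (Rombach et al., 2022):

  1. VAE encoder E: image (512×512×3) → latent z (64×64×4). That is 8x downsampling per side, so ~64x fewer spatial positions (48x fewer values overall).
  2. Diffuse in latent space: same DDPM algorithm, smaller tensors.
  3. VAE decoder D: z (64×64×4) → image (512×512×3).

The denoiser processes ~64x fewer spatial positions per step, which is where most of the savings come from. This is Stable Diffusion's core architecture.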
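The compression figures follow directly from the tensor shapes quoted above. A quick arithmetic check, with `numel` as a hypothetical helper:

```python
def numel(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)

pixel_values = numel(pixel_shape)
latent_values = numel(latent_shape)
spatial_ratio = (512 * 512) // (64 * 64)     # fewer spatial positions
value_ratio = pixel_values / latent_values   # fewer values overall

print(pixel_values, latent_values, spatial_ratio, value_ratio)
```

The two ratios differ because the latent has 4 channels where the image has 3.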

Section 6 — Text conditioning

Text prompts → CLIP/T5 text encoder → token embeddings.

Conditioning is injected into the U-Net via cross-attention at every resolution level:

Q = W_Q(image_features)
K = W_K(text_embeddings)
V = W_V(text_embeddings)
output = softmax(QK^T / √d) · V

See 02_explainer.md Chapter 4 for a full explanation of how spatial regions attend to text tokens.
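The four formulas above translate almost line for line into NumPy. This is a single-head sketch with random weights. The sizes are illustrative assumptions (256 image positions with 320 channels, 77 text tokens of width 768, head dimension 64), not values taken from these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_features, text_embeddings, W_Q, W_K, W_V):
    Q = image_features @ W_Q                   # (N_img, d): queries come from the image
    K = text_embeddings @ W_K                  # (N_txt, d): keys come from the text
    V = text_embeddings @ W_V                  # (N_txt, d): values come from the text
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))    # (N_img, N_txt)
    return weights @ V, weights

rng = np.random.default_rng(0)
image_features = rng.standard_normal((256, 320))
text_embeddings = rng.standard_normal((77, 768))
W_Q = rng.standard_normal((320, 64)) / np.sqrt(320)
W_K = rng.standard_normal((768, 64)) / np.sqrt(768)
W_V = rng.standard_normal((768, 64)) / np.sqrt(768)

out, weights = cross_attention(image_features, text_embeddings, W_Q, W_K, W_V)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over text tokens for one spatial position, which is what lets different image regions attend to different words.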

Section 7 — Classifier-free guidance

Train one model for both conditional and unconditional denoising (randomly drop text during training).

At inference, blend:

ε_pred = ε_uncond + w × (ε_cond - ε_uncond)

w (guidance scale) controls fidelity to the prompt:

  • w = 0 → ignores the prompt (purely unconditional)
  • w = 1 → plain conditional prediction, no extra guidance
  • w = 7-10 → strong prompt adherence (typical default)
  • w > 15 → over-saturated images, artifacts
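The blend is a one-line extrapolation from the unconditional prediction through the conditional one. A minimal sketch, with small hand-picked vectors standing in for the two model outputs (`cfg_blend` is a hypothetical name):

```python
import numpy as np

def cfg_blend(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from eps_uncond toward (and past) eps_cond."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])

for w in (0.0, 1.0, 7.5):
    print(w, cfg_blend(eps_uncond, eps_cond, w))
```

For w above 1 the result lies outside the segment between the two predictions, which is why large scales can push samples toward over-saturated, artifact-prone outputs.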

Section 8 — ControlNet

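Add spatial conditioning beyond text:

  • Edge maps (Canny)
  • Depth maps
  • Pose skeletons
  • Segmentation maps

Architecture: the pretrained U-Net stays frozen, and a trainable copy of its encoder blocks takes the spatial conditioning input. The copy's outputs are added back into the frozen network through zero-initialized convolutions, so training starts from the unmodified base model. This allows precise control: "this layout, that style."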
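A heavily simplified, single-block sketch of the adapter idea follows. It assumes the zero-initialized 1×1 convolutions described in the ControlNet paper; the real model wires a copy of the whole encoder into the frozen decoder, and every name here is hypothetical. The point is that with zero-initialized connections, the controlled block starts out identical to the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8
W_frozen = rng.standard_normal((C, C)) / np.sqrt(C)   # stand-in for pretrained weights
W_copy = W_frozen.copy()                              # trainable copy, same initial values

def conv1x1(weight, x):
    """1x1 convolution over a (C, H, W) tensor."""
    return np.einsum("oc,chw->ohw", weight, x)

def frozen_block(x):
    return np.tanh(conv1x1(W_frozen, x))

def controlled_block(x, control, w_in, w_out):
    base = frozen_block(x)                                        # frozen path, never updated
    branch = np.tanh(conv1x1(W_copy, x + conv1x1(w_in, control)))  # trainable path sees the control
    return base + conv1x1(w_out, branch)

x = rng.standard_normal((C, 16, 16))
edge_map = rng.standard_normal((C, 16, 16))           # stand-in for an encoded edge map

zeros = np.zeros((C, C))
at_init = controlled_block(x, edge_map, zeros, zeros)
after_training = controlled_block(x, edge_map, 0.1 * np.eye(C), 0.1 * np.eye(C))

print(np.allclose(at_init, frozen_block(x)))
```

Once the connecting weights move away from zero, the control signal starts to shift the output while the frozen weights remain untouched.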

Section 9 — Modern architectures

| Architecture | Key idea | Used in |
|---|---|---|
| DDPM U-Net | Original; convolutional U-Net | SD 1.x, SD 2.x |
| Latent Diffusion | Diffuse in VAE latent space | All SD variants |
| DiT | Transformer replaces U-Net; adaLN conditioning | SD3, FLUX |
| Flow Matching | Linear interpolation paths; velocity prediction | SD3, FLUX, Transfusion |
| DDIM | Deterministic ODE sampler; 20-50 steps | Universal sampler |
| LCM | Distillation to 4-8 steps | SDXL-LCM |
| Consistency Models | Single-step generation | Research + production |
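As one concrete entry from the table, the deterministic DDIM update (η = 0) can be sketched as below. It reuses the ᾱ notation from Section 1 with an assumed linear schedule, and `ddim_step` is a hypothetical name. The check feeds in the true noise instead of a trained model: in that case a single jump from t = 800 to t = 400 lands exactly on the forward-process sample at t = 400, which is why large strides are possible.

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    """Deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)   # implied clean sample
    return np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_hat

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed linear schedule

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)

a_800, a_400 = alpha_bars[799], alpha_bars[399]
x_800 = np.sqrt(a_800) * x0 + np.sqrt(1.0 - a_800) * eps
x_400 = ddim_step(x_800, eps, a_800, a_400)
expected = np.sqrt(a_400) * x0 + np.sqrt(1.0 - a_400) * eps
print(np.allclose(x_400, expected))

# A 50-step sampler would visit a strided subset of the 1000 training timesteps.
steps = np.linspace(1000, 1, 50).round().astype(int)
print(steps[:5], "...", steps[-1])
```

With a learned ε_θ the prediction is only approximate, so quality depends on the number of steps.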

Section 10 — Diffusion vs LLMs

| Aspect | LLMs | Diffusion |
|---|---|---|
| Generation | Sequential (token-by-token) | Iterative (refining whole sample) |
| Architecture | Transformer | U-Net or DiT |
| Math | Maximum likelihood / CE loss | Score matching / MSE loss |
| Output | Discrete (tokens) | Continuous (pixels / latents) |
| Latency | Linear in output length | ~O(T) fixed sampling steps |
| Modalities | Text mostly | Image, video, audio |

Recent text diffusion models (Diffusion-LM, others) exist but are research-stage.

Reading list

  1. "DDPM" (Ho et al., 2020)
  2. "Latent Diffusion / Stable Diffusion" (Rombach et al., 2022)
  3. Lilian Weng's blog on diffusion (lilianweng.github.io)
  4. ControlNet paper (Zhang et al., 2023)
  5. DiT paper (Peebles & Xie, 2023)
  6. HF diffusers library tutorials

Reference material

Blogs

  • What are Diffusion Models? — Strong theoretical reference on DDPMs, score-based models, and latent diffusion with derivations and intuition.
  • The Annotated Diffusion Model — Pairs DDPM math directly with runnable PyTorch code, making it easy to bridge theory and implementation.

Self-check

  1. Forward vs reverse diffusion — what does each do?
  2. Write the closed-form x_t expression. What is ᾱt?
  3. Why latent diffusion vs pixel diffusion? What does the VAE contribute?
  4. Classifier-free guidance: what is trained differently? What is computed at inference?
  5. What does guidance scale w = 15 produce and why?
  6. DiT vs U-Net: architectural difference and scaling advantage?