03. Week 14 — Diffusion Models: Study Material

Companion files: Weekly Plan · Explainer · Daily Recall · Assignment · Revision

How to use this file: This is a reference companion. Read 02_explainer.md first for narrative understanding. Use this file for concise definitions, formulas, and structured comparisons when reviewing.

Section 1 — The forward process (adding noise)

Start with a clean image x_0. Gradually add Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) x_{t-1}, β_t I)

After T steps (T ≈ 1000), x_T is approximately pure Gaussian noise. The schedule (β_1, ..., β_T) is fixed in advance, not learned.

Closed form: sample x_t directly from x_0 for any t, without iterating through the intermediate steps:

x_t = sqrt(ᾱ_t) · x_0 + sqrt(1-ᾱ_t) · ε,    ε ~ N(0, I)
where ᾱ_t = product of (1-β_i) for i = 1..t
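The closed form can be checked numerically. The sketch below is a minimal plain-Python illustration for a scalar x_0; it assumes a linear schedule from 1e-4 to 0.02 over T = 1000 steps (the values reported in the DDPM paper), and the helper name `q_sample` is hypothetical. It also iterates the one-step kernel's mean scale and variance to confirm they match sqrt(ᾱ_t) and 1-ᾱ_t.

```python
import math
import random

T = 1000
# Assumed linear schedule from 1e-4 to 0.02 (values reported in the DDPM paper).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar[t - 1] holds the product of (1 - beta_i) for i = 1..t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot for a scalar x0 (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps, eps


# Cross-check: compose the one-step kernel T times and track how it scales
# the mean of x_0 and how much noise variance it accumulates.
mean_scale, var = 1.0, 0.0
for beta in betas:
    mean_scale *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

rng = random.Random(0)
x_t, eps = q_sample(x0=1.0, t=500, rng=rng)
```

After the loop, `mean_scale` agrees with `sqrt(alpha_bar[-1])` and `var` with `1 - alpha_bar[-1]`, and `alpha_bar[-1]` is close to zero, which is why x_T is almost indistinguishable from pure noise under this schedule.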

Section 2 — The reverse process (denoising)

Train a model to predict the noise added at each step:
- Sample (x_0, t) and add noise to get x_t.
- Train the model to predict ε (simplified DDPM loss):

L = E[ ‖ε - εθ(x_t, t)‖² ]
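As a rough illustration of one training sample of this loss, the sketch below uses plain Python lists in place of tensors. The schedule values, the toy 4-value "image", and the names `ddpm_loss` and `zero_model` are assumptions made for the example; a real model would be a neural network updated by gradient descent on this quantity.

```python
import math
import random

T = 1000
# Assumed linear schedule; any fixed schedule works the same way.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def ddpm_loss(eps_model, x0, rng):
    """One Monte Carlo sample of ||eps - eps_model(x_t, t)||^2, averaged over dims."""
    t = rng.randint(1, T)                     # uniform timestep, 1-indexed
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise the model must recover
    a = alpha_bar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)


def zero_model(x_t, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
x0 = [0.5, -0.2, 0.1, 0.9]   # toy "image" with four values
loss = ddpm_loss(zero_model, x0, rng)
```

With the zero predictor the loss is simply the mean squared noise; a predictor that recovers ε exactly would drive it to zero.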

At inference: start from pure noise, run T reverse steps, get a sample.
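The reverse loop can be sketched as follows. This is a minimal scalar version of ancestral DDPM sampling, assuming the variance choice σ_t² = β_t and a linear schedule; `oracle_eps` is a hypothetical stand-in for a trained network, built for a toy dataset whose only data point is 0.7, so that the expected result is known.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def ddpm_sample(eps_model, rng):
    """Run T reverse steps starting from x_T ~ N(0, 1) (scalar case)."""
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        beta = betas[t - 1]
        a_bar = alpha_bar[t - 1]
        eps_hat = eps_model(x, t)
        # Posterior mean written in terms of the predicted noise.
        mean = (x - beta / math.sqrt(1.0 - a_bar) * eps_hat) / math.sqrt(1.0 - beta)
        # Fresh noise at every step except the last one.
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        x = mean + math.sqrt(beta) * noise   # assumes sigma_t^2 = beta_t
    return x


DATA_POINT = 0.7  # toy dataset containing a single value


def oracle_eps(x_t, t):
    """Exact noise for the single-point dataset; stands in for a trained model."""
    a_bar = alpha_bar[t - 1]
    return (x_t - math.sqrt(a_bar) * DATA_POINT) / math.sqrt(1.0 - a_bar)


sample = ddpm_sample(oracle_eps, random.Random(0))
```

Because the oracle knows the data distribution exactly, the final step lands on the data point; with a learned εθ the same loop produces approximate samples instead.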

This is the core of DDPM (Ho et al., 2020). See the worked numerical examples in 02_explainer.md, Chapters 2 and 3.

Section 3 — Score-based interpretation

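Equivalent formulation: the model predicts the score, i.e. the gradient of the log-density of the noised data with respect to x_t.

s_θ(x_t, t) ≈ ∇_{x_t} log p_t(x_t)

The two parameterizations differ only by a scaling: s_θ(x_t, t) = -εθ(x_t, t) / sqrt(1-ᾱ_t). Same math, different intuition; both lead to the same model in practice.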
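The link between noise and score can be verified on the Gaussian forward kernel, whose score with respect to x_t is -ε / sqrt(1-ᾱ_t). The sketch below compares that expression with a finite-difference gradient of log q(x_t | x_0) in the scalar case; the numeric values and function names are arbitrary choices for illustration.

```python
import math


def log_q(x_t, x0, a_bar):
    """Log density of q(x_t | x_0) = N(sqrt(a_bar) * x0, 1 - a_bar), scalar case."""
    var = 1.0 - a_bar
    diff = x_t - math.sqrt(a_bar) * x0
    return -0.5 * math.log(2.0 * math.pi * var) - diff ** 2 / (2.0 * var)


def score_from_eps(eps, a_bar):
    """Convert a noise value into the score of q(x_t | x_0)."""
    return -eps / math.sqrt(1.0 - a_bar)


# Arbitrary illustrative values.
x0, a_bar, eps = 0.5, 0.6, 1.3
x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

# Central finite difference of the log density with respect to x_t.
h = 1e-5
numeric_score = (log_q(x_t + h, x0, a_bar) - log_q(x_t - h, x0, a_bar)) / (2.0 * h)
analytic_score = score_from_eps(eps, a_bar)
```

The two values agree to numerical precision, which is the sense in which predicting ε and predicting the score are the same task up to a known scale factor.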

Section 4 — U-Net architecture

The model εθ is a U-Net. Encoder-decoder with skip connections:

Input → downsampling blocks → bottleneck → upsampling blocks → output
        (skips between matching encoder and decoder levels)

Why U-Net? Multi-scale: large-scale structure at bottleneck, fine details from skips.

In conditional diffusion: cross-attention layers at several resolutions inject the text embeddings.

Timestep t is encoded as a sinusoidal embedding (usually passed through a small MLP) and added inside every residual block.
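A sinusoidal timestep embedding can be sketched in a few lines. The version below assumes an even embedding size, a maximum period of 10000, and a layout with all sine terms followed by all cosine terms; these are common conventions, not details taken from this file.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Return a list of `dim` values: sines first, then cosines (dim must be even)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb = timestep_embedding(t=250, dim=8)
```

Each timestep gets a distinct, bounded vector, so the same network weights can behave differently at high-noise and low-noise steps.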

See full ASCII diagram in 02_explainer.md Chapter 3.

Section 5 — Latent diffusion (Stable Diffusion)

Pixel-space diffusion is expensive (512×512×3 = 786K values per step).

Latent diffusion (Rombach et al., 2022):
1. VAE encoder E: image (512×512×3) → latent z (64×64×4). 8x downsampling per side, so ~64x fewer spatial positions.
2. Diffuse in latent space: same DDPM algorithm, smaller tensors.
3. VAE decoder D: z (64×64×4) → image (512×512×3).

The denoiser works on ~64x fewer spatial positions (48x fewer values overall), which makes every step far cheaper. This is Stable Diffusion's core architecture.
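The tensor sizes above can be checked with a few lines of arithmetic. Note that the 64x figure counts spatial positions only; once channels are included, the latent holds 48x fewer values than the image.

```python
import math

pixel_shape = (512, 512, 3)    # height, width, RGB channels
latent_shape = (64, 64, 4)     # height, width, latent channels

pixel_values = math.prod(pixel_shape)      # 786432
latent_values = math.prod(latent_shape)    # 16384

# Ratio of spatial positions (height * width) versus ratio of total values.
spatial_ratio = (pixel_shape[0] * pixel_shape[1]) // (latent_shape[0] * latent_shape[1])
value_ratio = pixel_values / latent_values
```

Here `spatial_ratio` is 64 and `value_ratio` is 48.0. The actual wall-clock saving depends on the network, since attention cost grows faster than linearly with the number of spatial positions.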

Section 6 — Text conditioning

Text prompts → CLIP/T5 text encoder → token embeddings.

Conditioning is injected into the U-Net via cross-attention at multiple resolution levels, with image features as queries and text embeddings as keys and values:

Q = W_Q(image_features)
K = W_K(text_embeddings)
V = W_V(text_embeddings)
output = softmax(QK^T / √d) · V
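The four lines above can be turned into a small single-head example. The sketch below uses nested Python lists instead of tensors, and the sizes (6 spatial positions, 3 text tokens) and random projection weights are made up for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_features, text_embeddings, w_q, w_k, w_v):
    q = matmul(image_features, w_q)       # (n_positions, d)
    k = matmul(text_embeddings, w_k)      # (n_tokens, d)
    v = matmul(text_embeddings, w_v)      # (n_tokens, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # one distribution over tokens per position
    return matmul(weights, v), weights


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes chosen only to keep the example small.
n_positions, n_tokens, c_img, c_txt, d = 6, 3, 4, 5, 2
image_features = rand_matrix(n_positions, c_img)
text_embeddings = rand_matrix(n_tokens, c_txt)
w_q, w_k, w_v = rand_matrix(c_img, d), rand_matrix(c_txt, d), rand_matrix(c_txt, d)

out, weights = cross_attention(image_features, text_embeddings, w_q, w_k, w_v)
```

The output has one row per spatial position, and each row of `weights` sums to 1 over the text tokens: every image location decides how much of each token's value vector to absorb.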

See 02_explainer.md Chapter 4 for a full explanation of how spatial regions attend to text tokens.

Section 7 — Classifier-free guidance

Train one model for both conditional and unconditional denoising (randomly drop text during training).

At inference, blend:

ε_pred = ε_uncond + w × (ε_cond - ε_uncond)

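w (guidance scale) controls fidelity to the prompt:
- w = 0 → ignores the prompt (purely unconditional)
- w = 1 → plain conditional prediction, no extra guidance
- w = 7-10 → strong prompt adherence (typical default range)
- w > 15 → over-saturated images, artifacts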
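Both halves of classifier-free guidance fit in a short sketch: the prompt dropping used during training and the blend used at inference. The drop probability of 0.1 and the use of an empty string as the "no prompt" input are assumptions for illustration, as are the function names.

```python
import random


def maybe_drop_prompt(prompt, rng, p_drop=0.1):
    """Training time: swap the prompt for an empty one with probability p_drop (assumed value)."""
    return "" if rng.random() < p_drop else prompt


def cfg_blend(eps_uncond, eps_cond, w):
    """Inference time: eps_pred = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-value noise predictions.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

ignore_prompt = cfg_blend(eps_uncond, eps_cond, w=0.0)   # equals eps_uncond
plain_cond = cfg_blend(eps_uncond, eps_cond, w=1.0)      # equals eps_cond
guided = cfg_blend(eps_uncond, eps_cond, w=7.5)          # extrapolates past eps_cond
```

For w > 1 the result moves beyond the conditional prediction along the direction (ε_cond - ε_uncond), which explains both the stronger prompt adherence and the artifacts at very large w. Each sampling step needs two model evaluations, one with and one without the prompt.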

Section 8 — ControlNet

Add spatial conditioning beyond text:
- Edge maps (Canny)
- Depth maps
- Pose skeletons
- Segmentation maps

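Architecture: the pretrained U-Net stays frozen; a trainable adapter (in the ControlNet paper, a copy of the U-Net encoder blocks attached through zero-initialized convolutions) takes the spatial conditioning input. This allows precise control: "this layout, that style."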
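The frozen-plus-adapter idea can be illustrated with a toy block. The sketch below is heavily simplified: scalars stand in for layer weights, and the zero-initialized output projection is an assumption drawn from the ControlNet paper rather than from these notes. All names are hypothetical.

```python
def frozen_block(x):
    """Stand-in for a frozen pretrained U-Net block (its weights are never updated)."""
    return [2.0 * v + 1.0 for v in x]


class ControlBranch:
    """Trainable copy of the block followed by a zero-initialized output projection."""

    def __init__(self):
        self.copy_scale = 2.0   # initialized from the frozen block's weights
        self.copy_bias = 1.0
        self.zero_proj = 0.0    # starts at exactly zero, learned during training

    def __call__(self, x, control):
        # The branch sees the features plus the spatial conditioning signal.
        hidden = [self.copy_scale * (v + c) + self.copy_bias for v, c in zip(x, control)]
        return [self.zero_proj * h for h in hidden]


def controlled_block(x, control, branch):
    """Frozen output plus the adapter's residual contribution."""
    base = frozen_block(x)
    residual = branch(x, control)
    return [b + r for b, r in zip(base, residual)]


branch = ControlBranch()
x = [0.1, -0.4]
control = [1.0, 1.0]    # e.g. two values taken from an edge map
at_init = controlled_block(x, control, branch)

branch.zero_proj = 0.5  # pretend training has moved the projection away from zero
after_training = controlled_block(x, control, branch)
```

At initialization the adapter contributes nothing, so the combined model reproduces the pretrained one exactly; the conditioning only starts to matter as the projection is trained away from zero.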

Section 9 — Modern architectures

Architecture	Key idea	Used in
DDPM U-Net	Original; convolutional U-Net	SD 1.x, SD 2.x
Latent Diffusion	Diffuse in VAE latent space	All SD variants
DiT	Transformer replaces U-Net; adaLN conditioning	SD3, FLUX
Flow Matching	Linear interpolation paths; velocity prediction	SD3, FLUX
DDIM	Deterministic ODE sampler; 20-50 steps	Universal sampler
LCM	Distillation to 4-8 steps	SDXL-LCM
Consistency Models	One- or few-step generation	Research + production
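To make the DDIM row concrete, here is a minimal scalar sketch of deterministic DDIM sampling (η = 0) over 20 evenly spaced timesteps. It assumes a linear schedule and that the step count divides T; `oracle_eps` is a hypothetical stand-in for a trained network on a toy dataset containing the single value 0.7.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def ddim_sample(eps_model, x_T, num_steps=20):
    """Deterministic DDIM (eta = 0) on an evenly spaced subset of timesteps."""
    stride = T // num_steps                       # assumes num_steps divides T
    timesteps = list(range(T, 0, -stride))        # 1000, 950, ..., 50
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t - 1]
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_prev = alpha_bar[t_prev - 1] if t_prev > 0 else 1.0   # alpha_bar at t = 0 is 1
        eps_hat = eps_model(x, t)
        # Estimate the clean sample, then re-noise it to the earlier timestep.
        x0_hat = (x - math.sqrt(1.0 - a_t) * eps_hat) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return x


DATA_POINT = 0.7  # toy dataset containing a single value


def oracle_eps(x_t, t):
    """Exact noise for the single-point dataset; stands in for a trained model."""
    a_bar = alpha_bar[t - 1]
    return (x_t - math.sqrt(a_bar) * DATA_POINT) / math.sqrt(1.0 - a_bar)


sample = ddim_sample(oracle_eps, x_T=1.5)
```

No random numbers are drawn inside the loop, so the same x_T always maps to the same output. That determinism is what allows large step sizes and makes latent-space interpolation meaningful.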

Section 10 — Diffusion vs LLMs

Aspect	LLMs	Diffusion
Generation	Sequential (token-by-token)	Iterative (refining whole sample)
Architecture	Transformer	U-Net or DiT
Math	Maximum likelihood / CE loss	Score matching / MSE loss
Output	Discrete (tokens)	Continuous (pixels / latents)
Latency	Linear in output length	~O(T) fixed sampling steps
Modalities	Text mostly	Image, video, audio

Recent text diffusion models (Diffusion-LM, others) exist but are research-stage.

Reading list

DDPM paper (Ho et al., 2020)
Latent Diffusion / Stable Diffusion paper (Rombach et al., 2022)
Lilian Weng's blog on diffusion (lilianweng.github.io)
ControlNet paper (Zhang et al., 2023)
DiT paper (Peebles & Xie, 2023)
HF diffusers library tutorials

Reference material

YouTube

DDPM - Diffusion Models Beat GANs on Image Synthesis (Paper Explained) — Rigorous walkthrough of the DDPM paper covering the forward noising process, reverse denoising, and why diffusion replaced GANs.
Coding Stable Diffusion from Scratch in PyTorch — Implements the full Stable Diffusion stack from first principles, including CLIP, the VAE, the UNet, and samplers.

Blogs

What are Diffusion Models? — Strong theoretical reference on DDPMs, score-based models, and latent diffusion with derivations and intuition.
The Annotated Diffusion Model — Pairs DDPM math directly with runnable PyTorch code, making it easy to bridge theory and implementation.

Self-check

Forward vs reverse diffusion — what does each do?
Write the closed-form x_t expression. What is ᾱ_t?
Why latent diffusion vs pixel diffusion? What does the VAE contribute?
Classifier-free guidance: what is trained differently? What is computed at inference?
What does guidance scale w = 15 produce and why?
DiT vs U-Net: architectural difference and scaling advantage?