02. Diffusion Models — Narrative Explainer¶
Module 14 · Companion files: Weekly Plan · Study Material · Daily Recall · Assignment · Revision
Table of Contents¶
- ELI5 — The Sculptor Analogy
- Chapter 1 — The Opening Failure
- Chapter 2 — The Forward Process
- Chapter 3 — The Reverse Process
- Chapter 4 — Conditioning and Guidance
- Chapter 5 — Latent Diffusion and Modern Architectures
- Honest Admission
- Chapter 6 — Recap, Drills, and Production Notes
- Foundation-Gap Audit
- Retrieval Prompts
- What Comes Next
ELI5 — The Sculptor Analogy¶
Imagine a sculptor standing in front of a block of marble. That block is shapeless, undifferentiated, full of possibility. It is also, for our purposes, pure noise — the marble block.
The sculptor does not carve the finished statue in a single heroic stroke. He works with a chisel. One careful tap at a time. One tap removes a thin layer. Another tap refines an edge. Each individual tap is the chisel stroke — a single denoising step in the reverse diffusion process.
How does the sculptor know where to tap, in which direction, and how hard? He did not guess. He trained for years. He studied thousands of finished sculptures. He looked at rough marble and imagined the form inside. He learned, stroke by stroke, how rough stone becomes a face, a torso, a wing. That accumulated knowledge — the pattern of what strokes produce what results — is the sculptor's training: learning the reverse process from noise to signal.
Sometimes a client gives the sculptor a sketch on paper before he starts. "Make it look like a roaring lion, head down, mid-leap." That sketch constrains every chisel stroke. It is the blueprint — the text conditioning signal that guides generation toward a specific target.
A master sculptor, working on a deadline, can also work faster. Instead of twenty thousand strokes, he uses four thousand, skipping the redundant micro-passes he knows are unnecessary. That efficiency is the speed shortcut: fewer sampling steps with a faster sampler such as DDIM, or a distilled model such as LCM.
Now, diffusion models work exactly this way:
- Start with the marble block: pure Gaussian noise sampled from N(0, I).
- Apply the chisel stroke repeatedly: a denoising neural network that removes a little noise each step.
- Use the sculptor's training: the reverse process learned from millions of real images.
- Follow the blueprint: a text prompt encoded into embeddings and injected via cross-attention.
- Use the speed shortcut when needed: fewer steps with DDIM, or a distilled consistency model.
That is diffusion. Every equation and architectural diagram you will see below is just a formal description of these five things.
Chapter 1 — The Opening Failure¶
Story: the naïve approach that always breaks¶
Suppose you want to generate an image from a text prompt. You have no prior knowledge of diffusion. Your first instinct: start from random pixels and optimize.
You sample a 512×512 image at random from a uniform distribution. You score it against your text prompt using a CLIP similarity function. You run gradient ascent. Pixels shift. CLIP score improves marginally. You run a thousand more steps.
The result is a blurry, incoherent mess. It does not look like anything. It has no edges, no objects, no structure. Sometimes it resembles a hallucinogenic smear. CLIP sees it as vaguely relevant. A human sees noise.
Why does this fail so completely?
Reason one: the search space is vast. A 512×512 RGB image has 786,432 independent dimensions. Random initialization places you in a region of that space that has essentially zero probability under the distribution of real images. Local gradients point in contradictory directions. There is no coherent path from here to a valid image.
Reason two: CLIP is not a generator. CLIP knows which images match which texts. It does not know how to build an image that matches a text. Optimizing toward CLIP from random noise finds adversarial perturbations, not natural images.
Reason three: you have no prior. You are not using any knowledge of what real images look like. A model that has seen millions of images knows that faces have two eyes, skies have gradients, fur has texture. You are ignoring all of that.
Diffusion solves all three problems. It does not optimize from scratch. It learns the distribution of real images during training. It starts from noise but follows a learned path back to the distribution of real images. The path is learned, step by step, from data.
This is why diffusion works where direct optimization fails.
Stakes¶
Diffusion models power most major image and video generation systems in production today. Stable Diffusion, DALL-E 3, Midjourney, Imagen, Adobe Firefly, and Sora (video) are all diffusion-based or diffusion-inspired, as far as their makers have disclosed. This is not a niche research topic. Understanding diffusion is table-stakes knowledge for AI engineering in 2024 and beyond.
Chapter 2 — The Forward Process¶
What it is¶
The forward process is a recipe for destroying an image. Take a clean photograph x₀. Apply small, incremental doses of Gaussian noise over T steps. After T steps (typically T = 1000), you are left with pure Gaussian noise x_T. The image is completely unrecognizable.
This process is not learned. It is a fixed mathematical recipe, defined before training begins. We define it precisely because the reverse process must undo it exactly.
The step-by-step formula¶
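At each step t, given the previous noisy image x_{t-1}:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

In plain English:
- Multiply the current image by √(1-βt), a value slightly below 1, which shrinks the signal.
- Add Gaussian noise with variance βt (that is, standard-normal noise scaled by √βt).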
βt is the noise schedule parameter at step t. It controls how much noise enters at each step.
The noise schedule¶
Three schedules are worth knowing:
Linear schedule (original DDPM, Ho 2020):
β₁ = 0.0001, ..., β₁₀₀₀ = 0.02
Simple. Destroys information too quickly on low-resolution images; the last steps are largely redundant.
Cosine schedule (improved DDPM, Nichol & Dhariwal 2021):
ᾱt follows a cosine curve.
ᾱt changes slowly near t=0 and t=T, with a near-linear drop in between.
Preserves structure longer. Better for images.
Scaled linear (used in Stable Diffusion):
√β linearly spaced from √0.00085 to √0.012, then squared.
Empirical compromise. Works well in latent space.
The cosine schedule was a significant improvement. It allows the model to see more of the intermediate structure during training. This leads to better samples.
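The three schedules can be compared numerically. The sketch below is a minimal pure-Python illustration, not any library's implementation: the function names are made up for this example, and the cosine offset `s = 0.008` is assumed from the improved-DDPM formulation.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """beta_1..beta_T spaced linearly, as in the linear schedule above."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def scaled_linear_betas(T=1000, beta_start=0.00085, beta_end=0.012):
    """sqrt(beta) spaced linearly, then squared (the scaled-linear variant)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (T - 1)) ** 2 for i in range(T)]

def alpha_bars_from_betas(betas):
    """Running product of (1 - beta); entry i holds alpha_bar at step i + 1."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t for t = 0..T on a squared-cosine curve, with alpha_bar_0 = 1."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

linear = alpha_bars_from_betas(linear_betas())
scaled = alpha_bars_from_betas(scaled_linear_betas())
cosine = cosine_alpha_bars()

for t in (100, 500, 900):
    # linear/scaled are 0-indexed from step 1; cosine is indexed from step 0.
    print(t, f"{linear[t - 1]:.4f}", f"{scaled[t - 1]:.4f}", f"{cosine[t]:.4f}")
```

At the midpoint the cosine curve retains far more signal than the linear one, which is the "preserves structure longer" behaviour described above.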
The variance-preserving closed form¶
The most useful property of the forward process: we can skip directly from x₀ to any x_t in closed form. No need to run t sequential steps.
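Define:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

Then the distribution of x_t given x₀ is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right)$$

Which means we can sample x_t directly:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$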
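This is variance-preserving: as t increases, the signal component (√ᾱt · x₀) shrinks and the noise component (√(1-ᾱt) · ε) grows, but the squared coefficients always sum to ᾱt + (1-ᾱt) = 1. If x₀ is normalized to roughly unit variance, x_t stays at roughly unit variance for every t.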
At t=0: ᾱ₀ = 1, so x₀ = x₀. Perfect signal. At t=T: ᾱ_T ≈ 0, so x_T ≈ ε. Pure noise.
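A quick way to see the variance-preserving property is to sample x_t in one shot and measure its variance. This is a minimal sketch under stated assumptions: the "image" is a stand-in list of unit-variance Gaussian values, and `q_sample` and `variance` are illustrative names rather than functions from any library.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
# Stand-in "image": 20,000 unit-variance values (real pixels are normalised, not Gaussian).
x0 = [rng.gauss(0.0, 1.0) for _ in range(20000)]

for alpha_bar_t in (0.9, 0.5, 0.01):
    x_t, _ = q_sample(x0, alpha_bar_t, rng)
    print(alpha_bar_t, round(variance(x_t), 3))
```

The printed variances should all sit close to 1, whatever the noise level.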
ASCII diagram: forward process¶
x₀ (clean image)
┌─────────┐
│ Photo │ ──→ x₁ ──→ x₂ ──→ ... ──→ x_T
│ sharp │ +tiny +more pure
│ clear │ noise noise noise
└─────────┘
ᾱ_t = 1 ᾱ_t ≈ 0.9 ᾱ_t ≈ 0.5 ᾱ_t ≈ 0
SNR: high SNR: high SNR: low SNR: ~0
Numerical mini-example: one forward step¶
Let's do one forward step by hand on a tiny 2D image.
Given: x₀ = [0.8, -0.3] (two-pixel "image" for simplicity)
β₁ = 0.02
α₁ = 1 - 0.02 = 0.98
√α₁ = √0.98 ≈ 0.9899
Sample: ε = [0.5, -0.7] ~ N(0, I)
x₁ = √α₁ · x₀ + √β₁ · ε
= 0.9899 × [0.8, -0.3] + √0.02 × [0.5, -0.7]
= [0.792, -0.297] + 0.1414 × [0.5, -0.7]
= [0.792, -0.297] + [0.071, -0.099]
= [0.863, -0.396]
Signal is slightly corrupted. After 1000 such steps, x₁₀₀₀ ≈ pure noise.
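The hand calculation above can be checked in a few lines. This is only a sketch: `forward_step` is an illustrative name, and the noise is passed in explicitly so the result matches the worked numbers instead of being random.

```python
import math

def forward_step(x_prev, beta_t, eps):
    """One step of q(x_t | x_{t-1}) with the noise eps supplied explicitly."""
    signal_scale = math.sqrt(1.0 - beta_t)   # slightly below 1: shrinks the signal
    noise_scale = math.sqrt(beta_t)          # standard deviation of the added noise
    return [signal_scale * x + noise_scale * e for x, e in zip(x_prev, eps)]

x1 = forward_step([0.8, -0.3], beta_t=0.02, eps=[0.5, -0.7])
print([round(v, 3) for v in x1])             # prints [0.863, -0.396]
```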
This closed-form shortcut — from x₀ directly to any x_t — makes training efficient. We sample random t values during training without needing to run t sequential steps.
Chapter 3 — The Reverse Process¶
The goal¶
The forward process destroys images in a controlled way. The reverse process is what we learn: starting from x_T (pure noise), produce a sample that looks like a real image from the training distribution.
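The true reverse process is the posterior of the forward chain, which is intractable because it depends on the full data distribution:

$$q(x_{t-1} \mid x_t)$$

We approximate it with a neural network:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$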
The key: the network takes the noisy image x_t and the timestep t as input, and outputs the parameters of the reverse Gaussian step.
What the network predicts¶
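Predict the noise (ε-prediction): The network estimates which noise ε was added to get from x₀ to x_t.

$$\varepsilon_\theta(x_t, t) \approx \varepsilon$$

From the predicted noise, we can recover x₀ and then compute the reverse step.

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$

Predict the clean image (x₀-prediction): The network estimates the clean image directly.

$$\hat{x}_\theta(x_t, t) \approx x_0$$

The two are mathematically equivalent up to a timestep-dependent reweighting of the loss, since each prediction can be computed from the other given x_t. ε-prediction dominates in practice because it tends to train more stably. Some modern models use v-prediction, a timestep-dependent linear combination of the two that is numerically better conditioned.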
The simplified training loss¶
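The full variational lower bound simplifies (after dropping the per-timestep weights) to:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \varepsilon}\left[\,\left\| \varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,\ t\right) \right\|^2\right]$$

Steps:
1. Sample a real image x₀ from training data.
2. Sample a random timestep t ~ Uniform(1, T).
3. Sample noise ε ~ N(0, I).
4. Compute x_t = √ᾱt · x₀ + √(1-ᾱt) · ε.
5. Pass x_t and t to the network. Get predicted noise εθ.
6. Compute L = ‖ε - εθ‖². Backpropagate.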
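The six steps map almost line for line onto code. The sketch below is a pure-Python illustration with no real network: `zero_predictor` and `oracle_predictor` are hypothetical stand-ins for εθ, the linear schedule values are assumed, and the loss is averaged over dimensions, not summed. Backpropagation is omitted.

```python
import math
import random

def alpha_bars_linear(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for a linear beta schedule (assumed values)."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(prod)
    return out

def training_loss(x0, alpha_bars, predict_noise, rng):
    """One Monte Carlo sample of the simplified loss for a single x0."""
    t = rng.randint(1, len(alpha_bars))               # step 2: t ~ Uniform{1..T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]           # step 3: eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]                  # step 4: closed-form x_t
    eps_hat = predict_noise(x_t, t)                   # step 5: "network" forward pass
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)  # step 6: MSE

alpha_bars = alpha_bars_linear()
x0 = [0.8, -0.3, 0.1, 0.5]                            # step 1: a toy "training image"

def zero_predictor(x_t, t):
    """Untrained stand-in: always predicts zero noise."""
    return [0.0 for _ in x_t]

def oracle_predictor(x_t, t):
    """Cheating stand-in that knows x0 and inverts the closed form exactly."""
    a_bar = alpha_bars[t - 1]
    return [(xt - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
            for xt, x in zip(x_t, x0)]

rng = random.Random(0)
n = 2000
print("zero  :", sum(training_loss(x0, alpha_bars, zero_predictor, rng) for _ in range(n)) / n)
print("oracle:", sum(training_loss(x0, alpha_bars, oracle_predictor, rng) for _ in range(n)) / n)
```

The zero predictor scores a loss near 1 (the variance of ε), while the oracle scores essentially 0. A trained network lands somewhere in between, closer to the oracle at low noise levels.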
This is one of the cleanest loss functions in deep learning. No adversarial training, no perceptual loss, no GAN discriminator. Just MSE between the true noise and the predicted noise.
The U-Net architecture¶
The function εθ is implemented as a U-Net. The U-Net was originally designed for biomedical image segmentation. It suits diffusion well because the denoiser must:
- Accept a noisy image and output a same-shaped noise estimate.
- Understand both coarse structure and fine texture simultaneously.
INPUT: noisy image x_t (H × W × C)
│
▼
┌───────────────────────────────────────────────┐
│ ENCODER (Downsampling path) │
│ res=H: [Conv, Attn] → feat_1 (H×W×64) │
│ res=H/2: [Conv, Attn] → feat_2 (H/2×W/2×128)│
│ res=H/4: [Conv, Attn] → feat_3 (H/4×W/4×256)│
│ res=H/8: [Conv, Attn] → feat_4 (H/8×W/8×512)│
└──────────────────┬────────────────────────────┘
│ bottleneck (H/8×W/8×512)
┌──────────────────┴────────────────────────────┐
│ DECODER (Upsampling path) │
│ res=H/4: Upsample + cat(feat_3) → [Conv,Attn]│
│ res=H/2: Upsample + cat(feat_2) → [Conv,Attn]│
│ res=H: Upsample + cat(feat_1) → [Conv,Attn]│
└───────────────────────────────────────────────┘
│
▼
OUTPUT: predicted noise εθ (H × W × C) ← same shape as input
SKIP CONNECTIONS (cat = concatenate):
feat_1 ────────────────────────────────→ decoder level 1
feat_2 ─────────────────────────→ decoder level 2
feat_3 ──────────────────→ decoder level 3
Why skip connections matter: The encoder compresses spatial information to understand what the image is. The skip connections pass spatial detail directly to the decoder so it can reconstruct fine texture. Without skips, outputs are blurry.
Timestep conditioning: The timestep t is encoded as a sinusoidal embedding (like transformer positional encoding) and added to the activations at every residual block. This tells the model how noisy the current input is.
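A sinusoidal timestep embedding can be sketched as follows. The layout used here (all sines, then all cosines, with `max_period = 10000`) is one common convention and is an assumption, not something this module specifies. Implementations differ in ordering and frequency spacing.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep; dim is assumed even."""
    half = dim // 2
    # Frequencies fall geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_10 = timestep_embedding(10, dim=8)
emb_500 = timestep_embedding(500, dim=8)
print([round(v, 3) for v in emb_10])
print([round(v, 3) for v in emb_500])
```

In a U-Net this vector would typically pass through a small MLP before being added to each residual block. That projection is omitted here.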
Cross-attention for text: In conditional models, every resolution level includes cross-attention layers that attend to text embeddings. This is where text influences the spatial generation.
Reverse step formula¶
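Given x_t and predicted noise εθ, the reverse step is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

σt is the standard deviation of the added noise (usually σt² = βt, or a learned interpolation). At the last step (t=1 → t=0), we typically set σt = 0 for a deterministic final output.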
Numerical mini-example: reverse step¶
Given: x₁ = [0.863, -0.396] (from the forward example)
α₁ = 0.98, β₁ = 0.02, ᾱ₁ = 0.98
εθ(x₁, t=1) = [0.5, -0.7] ← model predicts the noise correctly
μθ = (1/√0.98) · ([0.863, -0.396] - 0.02/√(1-0.98) × [0.5, -0.7])
= 1.0102 · ([0.863, -0.396] - 0.02/0.1414 × [0.5, -0.7])
= 1.0102 · ([0.863, -0.396] - 0.1414 × [0.5, -0.7])
= 1.0102 · ([0.863, -0.396] - [0.071, -0.099])
= 1.0102 · [0.792, -0.297]
= [0.800, -0.300]
Matches x₀ = [0.8, -0.3]. The reverse step recovers the original.
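The same arithmetic in code, as a sketch: `reverse_step` is an illustrative name, and it takes the noise prediction as an argument in place of a call to a trained network. With `sigma_t = 0` it returns only the mean, which is what the worked example computes.

```python
import math
import random

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t=0.0, rng=None):
    """One ancestral sampling step x_t -> x_{t-1} from a noise prediction."""
    beta_t = 1.0 - alpha_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]
    if sigma_t == 0.0:
        return mean                       # deterministic, as at the final step
    rng = rng or random.Random()
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]

x0_hat = reverse_step([0.863, -0.396], [0.5, -0.7], alpha_t=0.98, alpha_bar_t=0.98)
print([round(v, 3) for v in x0_hat])      # prints [0.8, -0.3]
```

The exact recovery is special to t = 1 with a perfect noise prediction. At larger t the step only moves part of the way towards x₀.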
In practice, T = 1000 steps of this — each imperfect, but correct in aggregate — produces clean, realistic samples.
Chapter 4 — Conditioning and Guidance¶
Why we need conditioning¶
An unconditional diffusion model generates random images from the learned distribution. You get dog photos, landscapes, portraits — whatever was in training data, at random. That is not useful.
We want: "generate a photorealistic image of a red cat sitting on a blue sofa, golden-hour lighting." We need conditioning.
Text conditioning via CLIP¶
The standard approach: encode the text prompt with a CLIP text encoder (or T5 for newer models like Imagen and SD3).
"a red cat on a blue sofa"
│
▼ CLIP Text Encoder (ViT-L/14)
│
[v₁, v₂, ..., v₇₇] ← 77 token embeddings, each dim=768 (CLIP) or 4096 (T5-XXL)
CLIP was trained on 400M image-text pairs with a contrastive objective. Its text encoder understands how language relates to visual content. This is why CLIP embeddings are a strong conditioning signal.
Cross-attention in the U-Net¶
Text embeddings are injected into the U-Net at every resolution level via cross-attention:
Image features → Q = W_Q(features) [spatial queries]
Text embeddings → K = W_K(text) [text keys]
V = W_V(text) [text values]
Attention = softmax(QKᵀ / √d_k) · V
The image features query the text embeddings. A spatial region that needs to look "red" will attend strongly to the "red" token. A region that must contain "cat" will attend to "cat". This attention mechanism is how text shapes the image, one spatial location at a time.
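A toy version of this lookup makes the mechanism concrete. The sketch below is single-head attention in pure Python with hand-picked vectors. The learned projections W_Q, W_K, W_V are skipped (queries, keys, and values are given directly), and all numbers are invented for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)                    # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per image location; keys/values: one vector per text token."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

tokens = ["red", "cat"]
keys = [[4.0, 0.0], [0.0, 4.0]]           # toy keys, one per token
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
queries = [[4.0, 0.0],                    # a location whose features align with "red"
           [0.0, 4.0],                    # a location whose features align with "cat"
           [1.0, 1.0]]                    # an ambiguous location

_, weights = cross_attention(queries, keys, values)
for w in weights:
    print({tok: round(p, 3) for tok, p in zip(tokens, w)})
```

The first two locations put nearly all their attention on one token, while the ambiguous location splits it evenly.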
Classifier-free guidance (CFG)¶
Text conditioning alone is not enough. The model often hedges — it does not follow the text as strongly as needed. This produces acceptable images that do not precisely match the prompt.
Classifier-free guidance (Ho & Salimans, 2022) solves this without a separate classifier.
During training:
- With probability p (typically 0.1–0.2), replace the text conditioning with a null token ("" or a learned unconditional embedding).
- The same model learns both conditional generation (with text) and unconditional generation (without text).

During inference, run two forward passes per step:
1. Conditional: εθ(x_t, t, text)
2. Unconditional: εθ(x_t, t, "")

Then blend:

$$\varepsilon_{\text{guided}} = \varepsilon_{\text{uncond}} + w \cdot \left(\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}}\right)$$

What this does:
- The difference (ε_cond - ε_uncond) is the direction in noise space that makes the output more text-consistent.
- Multiplying by w amplifies this direction.
- You are steering the denoising trajectory more aggressively toward text-consistent outputs.
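Both halves of classifier-free guidance fit in a few lines. In this sketch the two noise predictions are hard-coded lists standing in for the conditional and unconditional network outputs, and `maybe_drop_condition` and `cfg_combine` are illustrative names, not library functions.

```python
import random

def maybe_drop_condition(text, p_drop, rng, null_text=""):
    """Training side: swap the prompt for the null prompt with probability p_drop."""
    return null_text if rng.random() < p_drop else text

def cfg_combine(eps_uncond, eps_cond, w):
    """Inference side: start at the unconditional prediction, move w times towards the conditional."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20]
eps_cond = [0.30, 0.00]
for w in (0.0, 1.0, 7.5):
    print(w, [round(v, 3) for v in cfg_combine(eps_uncond, eps_cond, w)])

rng = random.Random(0)
dropped = sum(maybe_drop_condition("a red cat", 0.1, rng) == "" for _ in range(10000))
print("fraction dropped:", dropped / 10000)
```

At w = 0 the result is the unconditional prediction, at w = 1 the conditional one, and at w = 7.5 it is pushed well past the conditional prediction along the same direction.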
Guidance scale w in practice:
w = 0: Unconditional. Ignores text. Raw diversity.
w = 1: Conditional without amplification. Often blurry.
w = 3-5: Soft guidance. More creative, less literal.
w = 7-10: Standard setting. Strong text adherence.
w = 15-20: Over-saturated. Faces distort. Artifacts appear.
Diversity ◄──────────────────────────────► Fidelity
│ │
w = 0 w = 5 w = 10 w = 20
(ignores text) (creative) (balanced) (text-locked,
artifacts)
Chapter 5 — Latent Diffusion and Modern Architectures¶
Why pixel-space diffusion is too expensive¶
A 512×512 RGB image has 786,432 values. Running 1000 U-Net forward passes over all those values requires enormous compute.
DDPM trained on 256×256 images takes days on multi-GPU clusters. Inference takes minutes per image. This is not production-viable. Something had to change.
The VAE trick: compress first, diffuse later¶
Latent diffusion (Rombach et al., 2022 — the paper behind Stable Diffusion) is the key insight:
The image distribution is highly redundant. Adjacent pixels are correlated. We do not need to run diffusion on every pixel. Run it on a compressed representation instead.
The solution:
1. Train a VAE on images separately.
   - Encoder E: image (512×512×3) → latent z (64×64×4). 8× spatial compression.
   - Decoder D: latent z (64×64×4) → image (512×512×3).
2. Freeze the VAE.
3. Train the diffusion model to denoise in latent space.
TEXT: "a red cat on a blue sofa"
│
▼ CLIP / T5 encoder
Text embeddings
│
▼ (cross-attention into U-Net)
┌──────────────────────────────────────────────┐
│ Diffusion U-Net (in latent space) │
│ Input: z_T (64×64×4, pure latent noise) │
│ Output: z_0 (64×64×4, clean latent) │
└──────────────────────────────────────────────┘
│
▼ VAE Decoder D
FINAL IMAGE (512×512×3)
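The speedup is quadratic in the per-side compression factor: 64×64 vs 512×512 means 64x fewer spatial positions per step (about 48x fewer values once the 4 latent channels are counted against 3 RGB channels). Combined with fewer required steps, latent diffusion is commonly quoted as roughly two orders of magnitude faster than pixel-space DDPM; the exact factor depends on the model and the sampler.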
This is why Stable Diffusion runs on a consumer GPU in seconds. Pixel-space diffusion cannot.
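The size arithmetic is easy to verify. A small sketch, assuming the 8× per-side downsampling and 4 latent channels stated above (`latent_shape` and `count` are illustrative helpers):

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent (H, W, C) for a VAE that shrinks each spatial side by `downsample`."""
    return (height // downsample, width // downsample, channels)

def count(shape):
    h, w, c = shape
    return h * w * c

pixel = (512, 512, 3)
latent = latent_shape(512, 512)

print(latent)                                               # prints (64, 64, 4)
print((pixel[0] * pixel[1]) // (latent[0] * latent[1]))     # prints 64
print(count(pixel) // count(latent))                        # prints 48
```

The first ratio counts spatial positions and the second counts raw values. Both are useful, but they are not the same number.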
What the VAE latent space looks like¶
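The 4-channel latent z is not random: it keeps the spatial layout of the image at lower resolution. Individual channels are not trained to be interpretable, but visualizations often suggest a loose division of labor:
- Channel 0: broad luminance/structure.
- Channel 1: mid-frequency edges.
- Channels 2-3: color and texture information.

Treat this as an informal reading that can vary between VAE checkpoints, not a guaranteed property.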
The VAE latent is the space in which diffusion happens. The model never "sees" pixels directly. It denoises latents, and only at the very end does the VAE decoder render them to pixels.
Pixel space: [H × W × 3] → VAE Encoder → Latent: [H/8 × W/8 × 4]
(or /4 depending on VAE)
Latent space is where noise is added, denoised, and generated.
Pixel space is only touched at encode (training) and decode (inference) time.
DiT — Diffusion Transformers¶
U-Net was the default architecture through 2022. Then DiT (Peebles & Xie, 2023) proposed replacing it entirely with a pure Transformer.
Why? Transformers scale much more reliably than U-Nets. More parameters → better FID, predictably.
How it works:
Latent z_t (H/8 × W/8 × C)
│
▼ Patchify into tokens (2×2 patches)
Sequence of N tokens (N = (H/8 × W/8) / 4)
│
▼ Transformer blocks with adaLN
│ (adaLN: scale + shift conditioned on timestep + class embeddings)
│
▼ Unpatchify
Predicted noise (H/8 × W/8 × C)
Adaptive Layer Norm (adaLN): Instead of fixed LayerNorm, the scale and shift parameters are computed from the conditioning signal (timestep + class label or text). This is cleaner than cross-attention for some conditioning signals.
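Two pieces of the DiT pipeline are simple enough to sketch in pure Python: the token count after patchifying, and the adaLN modulation itself. The `(1 + scale)` form below follows the common DiT-style modulation and is an assumption here. In a real model, scale and shift are regressed from the conditioning embedding by a small learned layer, which this sketch replaces with hand-set values.

```python
import math

def num_tokens(image_h, image_w, vae_downsample=8, patch=2):
    """Sequence length after VAE encoding and patchifying the latent."""
    return (image_h // vae_downsample // patch) * (image_w // vae_downsample // patch)

def layer_norm(x, eps=1e-6):
    """LayerNorm over one token's features, without a learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_layer_norm(x, scale, shift):
    """Normalise, then modulate with scale/shift derived from the conditioning."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

print(num_tokens(256, 256))   # prints 256
print(num_tokens(512, 512))   # prints 1024

token = [0.5, -1.0, 2.0, 0.0]
print([round(v, 3) for v in ada_layer_norm(token, scale=[0.0] * 4, shift=[0.0] * 4)])
print([round(v, 3) for v in ada_layer_norm(token, scale=[1.0] * 4, shift=[0.5] * 4)])
```

With zero scale and shift the layer reduces to a plain LayerNorm. Non-zero values let the timestep or class signal stretch and offset every token's features.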
DiT models scale as DiT-S, DiT-B, DiT-L, DiT-XL. Each larger model produces significantly better FID. This scaling law was the key claim of the paper.
Modern usage: FLUX (Black Forest Labs) and Stable Diffusion 3 use DiT-style backbones, and OpenAI's technical report describes Sora as a diffusion transformer. The U-Net era is fading.
Flow matching — a cleaner framework¶
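Flow matching replaces the noise-schedule-based forward process with a simpler linear interpolation (written here in the rectified-flow convention, with t ∈ [0, 1], data at t = 0 and noise at t = 1):

$$x_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

The model learns to predict the velocity (the direction of flow at each point):

$$v_\theta(x_t, t) \approx \frac{dx_t}{dt} = \varepsilon - x_0$$

Why better?
- The training paths between noise and data are straight lines. DDPM trajectories are curved.
- Straighter trajectories need fewer solver steps.
- Cleaner theoretical framework (continuous normalizing flows).
- Used in Stable Diffusion 3 and FLUX.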
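The "straight trajectories need fewer steps" claim can be seen in a toy setting. The sketch assumes one common convention (t = 0 is data, t = 1 is noise, velocity target ε − x₀) and feeds the sampler the exact velocity in place of a trained network, so it shows the best case only.

```python
def interpolate(x0, eps, t):
    """Point on the straight path: t = 0 is data, t = 1 is noise (assumed convention)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]

def velocity_target(x0, eps):
    """Regression target for the network: d x_t / d t, constant along the path."""
    return [e - a for a, e in zip(x0, eps)]

def euler_sample(x_noise, velocity_fn, n_steps):
    """Integrate from t = 1 back to t = 0 with fixed-size Euler steps."""
    x, dt = list(x_noise), 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

x0, eps = [0.8, -0.3], [0.5, -0.7]
true_velocity = velocity_target(x0, eps)

# With the exact (constant) velocity, a single Euler step already lands on x0.
one_step = euler_sample(eps, lambda x, t: true_velocity, n_steps=1)
print([round(v, 3) for v in one_step])    # prints [0.8, -0.3]
```

A trained network's velocity field is only approximately constant along each path, so real samplers still use several steps, just fewer than a curved trajectory would need.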
Consistency models and distillation — the speed shortcut¶
Standard DDPM inference: 1000 steps. DDIM: 50 steps. Still slow.
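Consistency models (Song et al., 2023) train the model to map any point on the trajectory directly back to x₀:

$$f_\theta(x_t, t) = f_\theta(x_{t'}, t') \approx x_0 \quad \text{for all } t, t' \text{ on the same trajectory}$$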
This "consistency" property means: one step is enough. Or two steps if you want slightly better quality.
Other speed techniques:
| Method | Steps | Quality loss | Notes |
|---|---|---|---|
| DDPM (original) | 1000 | Baseline | Too slow for production |
| DDIM | 20-50 | Minimal | Deterministic; most common |
| DPM-Solver++ | 15-25 | Minimal | Better ODE solver |
| LCM (distillation) | 4-8 | Small | Distilled from SD |
| SDXL-Turbo | 1-4 | Moderate | Adversarial distillation |
| Consistency model | 1-4 | Moderate | Single-step possible |
| FLUX-Schnell | 4 | Small | Flow matching + distillation |
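The speed shortcut matters enormously in production. Going from 50 steps to 4 steps cuts denoiser calls by about 12x, with acceptable quality for most applications. End-to-end speedup is somewhat smaller because text encoding and VAE decoding are unchanged.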
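To make the DDIM row of the table concrete, here is a sketch of the deterministic (η = 0) update and a strided timestep schedule. The function names are illustrative, the noise prediction is passed in and not produced by a network, and the even-stride schedule is one simple choice among several.

```python
import math

def strided_timesteps(T, n_steps):
    """Evenly spaced subsequence of timesteps, from T down towards 0."""
    stride = T // n_steps
    return list(range(T, 0, -stride))[:n_steps]

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from noise level t to an earlier level."""
    # First estimate the clean sample, then re-noise it to the earlier level.
    x0_hat = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
              for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]

print(strided_timesteps(1000, 4))         # prints [1000, 750, 500, 250]

x0, eps = [0.8, -0.3], [0.5, -0.7]
alpha_bar_t = 0.5
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e for a, e in zip(x0, eps)]
# With a perfect noise prediction, jumping straight to alpha_bar = 1 recovers x0.
print([round(v, 3) for v in ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)])  # prints [0.8, -0.3]
```

Because the update has no random term, the same starting noise always yields the same image, which is what makes large jumps between timesteps workable.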
Honest Admission¶
Mode collapse: subtler than GANs, but real¶
Diffusion models do not collapse the way GANs do. GANs could get stuck generating the same face forever. Diffusion models produce diverse outputs. But they have their own biases.
If training data over-represents certain aesthetics, ethnicities, compositional styles, or object classes, the model's "random" outputs will reflect those biases. Western beauty standards, photographic lighting conventions, urban environments — all are over-represented in web-scraped data. The generated image looks diverse in texture and color. It is subtly homogeneous in content and representation.
Fine-tuned models (DreamBooth, LoRA) collapse intentionally toward the fine-tune target. When this goes wrong — small dataset, too many training steps, learning rate too high — you get a model that can only generate variants of the same image. This is fine-tune-induced mode collapse. It is common and fixable (use regularization images, stop earlier).
Evaluation is genuinely hard¶
FID (Fréchet Inception Distance) is the standard benchmark. It compares the distribution of InceptionNet features between real and generated images. Low FID = similar distributions = good.
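For reference, FID fits a Gaussian to each set of Inception features and measures the Fréchet distance between the two Gaussians. In the usual notation (the subscripts r and g, for real and generated, are a labeling choice made here):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Here μ and Σ are the mean and covariance of the feature vectors. The score depends only on these two statistics per set.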
Problems with FID:
- Rewards InceptionNet-style realism, not prompt adherence.
- Sensitive to the number of samples (use 50K minimum).
- Can be gamed by memorizing training images.
- Two models with identical FID can produce wildly different user experiences.
CLIP score measures text-image alignment. It rewards CLIP-compatible aesthetics. Models conditioned on CLIP text embeddings are already shaped by CLIP's notion of text-image similarity, so this metric is partially circular.
Human evaluation is the only honest metric. It is expensive, slow, and inconsistent across evaluators. There is no single good automated metric for "this image is exactly what the user wanted." GenAI research continues to use imperfect proxies.
Copyright concerns¶
Diffusion models trained on web-scraped data have seen copyrighted artwork. The model is not a lookup table and does not store most images directly, although research has shown that some heavily duplicated training images can be reproduced nearly verbatim. It has also learned the style, composition, and content of those images. The legal status of outputs is unresolved in most jurisdictions as of mid-2024.
For production work:
- Read the license of the model you are using. FLUX.1-schnell: Apache 2.0. FLUX.1-dev: non-commercial license. SD3: custom license. SDXL: CreativeML Open RAIL++-M. Licenses change between releases, so check the current model card.
- Do not train on a dataset without understanding its provenance.
- Do not generate content mimicking a specific living artist's style for commercial use without legal review.
- Consider using datasets with explicit consent (e.g., Adobe Firefly is trained on licensed stock).
This problem will not be resolved soon. It is legal and ethical, not technical.
Chapter 6 — Recap, Drills, and Production Notes¶
Failure-fix table¶
| Failure | Symptom | Root cause | Fix |
|---|---|---|---|
| Blurry outputs | Soft edges, no detail | Noise schedule too aggressive early | Use cosine schedule; try DDIM |
| Prompt ignored | Output unrelated to text | CFG too low; weak text encoder | Increase guidance scale (w to 7–10) |
| Over-saturated artifacts | Burned colors, distorted faces | CFG too high | Lower guidance scale (try w=7) |
| Fine-tune collapse | Every output looks identical | Too many steps, no regularization | Use regularization images; cut training |
| OOM during inference | CUDA out of memory | fp32 or large batch | Use fp16; enable attention slicing; xformers |
| Slow inference (>10s/img) | Each image unusably slow | Too many steps; pixel-space model | Switch to DDIM or LCM; use latent diffusion |
| VAE grid artifacts | Regular tiling pattern | VAE fp16 numerical instability | Decode in fp32; update VAE checkpoint |
| ControlNet not working | Pose/edge not respected | Wrong preprocessor; scale too low | Check ControlNet scale (>= 0.7); verify preprocessor |
| Inconsistent style across batch | Wild variation in aesthetics | No seed control | Fix seed; use style LoRA |
| NSFW outputs in production | Inappropriate generations | No safety filter | Enable NSFW classifier; use negative prompts |
Interview questions — with model answers¶
Q1. Explain the forward process. What is the noise schedule and why does it matter?
The forward process adds Gaussian noise to a clean image over T steps (T ≈ 1000). The noise schedule {β_t} controls how fast noise accumulates. A linear schedule increases β_t linearly from step to step; a cosine schedule adds noise more gently at first, preserving structure longer. The schedule matters because it determines how the model learns the reverse process: a schedule that destroys information too fast forces the model to learn from very noisy inputs, which is harder.
Q2. What does the U-Net predict — noise (ε) or clean image (x₀)?
Typically noise (ε-prediction). The loss is L = ‖ε - εθ(x_t, t)‖². x₀-prediction is equivalent mathematically but sometimes less stable. V-prediction (a timestep-dependent linear blend of the two) is used in some modern models such as Imagen Video and Stable Diffusion 2.x for better numerical conditioning at extreme timesteps.
Q3. What is classifier-free guidance and what does the guidance scale control?
CFG trains a single model to do both conditional and unconditional denoising (by randomly dropping text during training). At inference, the final noise prediction is: ε_guided = ε_uncond + w × (ε_cond - ε_uncond). The guidance scale w controls the tradeoff: higher w → stronger text adherence, less diversity, eventually artifacts. Typical production value: w = 7.
Q4. Why latent diffusion instead of pixel-space diffusion?
Pixel space (512×512×3) has 786K values. Running 1000 U-Net passes is computationally prohibitive. A VAE compresses images to latent (64×64×4) — 64x fewer spatial elements. The diffusion model denoises latents. At the end, the VAE decoder renders latents to pixels. The VAE contributes both compression efficiency and perceptual quality (it is trained to preserve semantically important features, not just every pixel).
Q5. What is DDIM and when would you use it?
DDIM (Denoising Diffusion Implicit Models) reformulates the reverse process as a non-Markovian sampler that becomes fully deterministic when its noise parameter η is 0; in that setting it can be read as an ODE solver. This allows 20-50 steps with quality close to 1000-step DDPM. Use DDIM when you need fast inference with minimal quality loss. Because the map from initial noise to image is deterministic, it also enables meaningful interpolation between two initial noise latents.
Q6. How does cross-attention connect text to the image generation?
At each resolution level of the U-Net, cross-attention computes Q from image features and K, V from text embeddings. The attention output is added back to the image features through a residual connection, giving a text-informed representation. Regions of the image that must match specific words (e.g., "red") will produce high attention scores for those tokens. This is how text semantics become spatial signals.
Q7. What is the difference between a U-Net and a DiT?
U-Net: encoder-decoder with skip connections, convolutional blocks, cross-attention layers injected at multiple resolutions. DiT: patchify the latent, run through a pure Transformer with adaptive LayerNorm for conditioning. DiT scales more predictably with model size and is the current preferred architecture for large-scale models (FLUX, SD3).
Production notes: GPU memory and latency¶
Running Stable Diffusion 1.5 (SD1.5) on A10G (24 GB VRAM):
| Setting | VRAM | Time/image | Notes |
|---|---|---|---|
| fp32, 50 DDIM steps, 512×512 | ~12 GB | ~8s | Reference baseline |
| fp16, 50 DDIM steps | ~6 GB | ~4s | Default for most deployments |
| fp16 + xformers | ~4.5 GB | ~2.5s | Memory-efficient attention |
| fp16 + LCM (8 steps) | ~4.5 GB | ~0.5s | Quality tradeoff acceptable |
Running SDXL on A10G (1024×1024):
| Setting | VRAM | Time/image |
|---|---|---|
| fp16, 30 DPM++ steps | ~12 GB | ~10s |
| fp16 + LCM (4 steps) | ~12 GB | ~2s |
| fp16, attention slicing | ~8 GB | ~14s |
Latency bottleneck order (most to least impactful):
1. Number of sampling steps (latency is linear in step count; quality gains flatten as steps increase)
2. Image resolution (cost is quadratic in side length, and self-attention grows faster still)
3. Precision (fp16 vs fp32: about half the memory and ~1.5x speed)
4. VAE decode (10–20% of total time; use fp32 decode to avoid artifacts)
5. CLIP encode (<5% of total time; cacheable for repeated prompts)
For a production API targeting <2s latency at 512×512:
- Use SDXL-Turbo, LCM-LoRA, or FLUX-Schnell (few-step models).
- Compile with torch.compile() for 20–30% additional speedup.
- Cache CLIP embeddings for recurring prompts.
- Batch requests: 4 images generated concurrently are often faster than 4 sequential runs.
- Enable xformers or PyTorch scaled dot-product attention (SDPA).
Exercises¶
1. Noise schedule plot. Compute ᾱ_t for t ∈ [0, 1000] under the linear schedule. Plot it. At what step does ᾱ_t drop below 0.01 (effectively pure noise)?
2. Manual reverse step. Given x_T = [0.1, 0.9, -0.5], β_T = 0.02, ᾱ_T = 0.001, and predicted noise εθ = [0.3, -0.1, 0.2], compute the reverse-step mean for x_{T-1} (ignore the σ_T z noise term). Show your work.
3. CFG ablation. Using the HuggingFace diffusers library, generate the same prompt 5 times with w ∈ {1, 3, 7, 12, 20}. Document how the image changes with each scale.
4. Latent inspection. Encode a photograph with the SD1.5 VAE. Print the shape of the latent. Visualize each of the 4 channels. Describe what each channel appears to represent.
5. Speed benchmark. Time SD1.5 inference with 50, 20, 10, and 4 DDIM steps. Plot quality vs latency. At what step count does quality degrade noticeably for a portrait prompt?
6. Flow matching vs DDPM. Read the flow matching paper (Lipman et al., 2023). Describe in two paragraphs why straight flow trajectories require fewer steps than curved DDPM trajectories.
Foundation-Gap Audit¶
Module 15 (capstone project) treats diffusion models as a building block. You are expected to use them, not re-derive them. This section audits what you must be solid on before proceeding.
What module 15 capstone assumes¶
| Concept | What you must know | Self-check |
|---|---|---|
| How diffusion generates images | Forward process destroys signal by adding noise. Reverse process denoises in T steps. Start from noise, end at a sample that looks like training data. | ☐ |
| Latent space concept | VAE encodes image to compact latent. Diffusion runs entirely in latent space. VAE decoder renders latent to final pixels. Image is never directly touched by diffusion. | ☐ |
| Conditioning mechanism | Text → CLIP/T5 encoder → token embeddings → cross-attention layers in U-Net at each resolution level → spatial regions attend to relevant text tokens. | ☐ |
| Speed vs quality trade-offs | More steps = slower, better quality. Fewer steps = faster, slightly lower quality. DDIM for deterministic fast sampling. LCM/consistency for 1-8 step generation. | ☐ |
Gaps that will cost you in module 15¶
Gap 1: Fuzzy understanding of ᾱ_t. If you cannot state that ᾱ_T ≈ 0 means x_T ≈ pure noise, the forward-process logic will not transfer. Reread the variance-preserving formulation above.
Gap 2: Not knowing what guidance scale does in practice. The capstone involves generating content to spec. Prompt engineering without guidance-scale intuition produces unpredictable results. Do the CFG ablation exercise.
Gap 3: Confusing pixel-space and latent-space operations. When you use the diffusers Pipeline, it returns a PIL image. Internally, all the denoising happened in 64×64×4 space. Knowing this helps debug shape errors and understand why latent interpolation works.
Gap 4: No hands-on time. If you have only read about diffusion and never run a single inference, you will struggle in the capstone. Install diffusers, run the quickstart, generate one image. This takes 15 minutes.
Retrieval Prompts¶
Test yourself without re-reading this document. These are representative exam-style questions.
Retrieval prompt 1:
Write the closed-form expression for x_t given x_0. Define every variable. Explain what variance-preserving means in this context and why it matters for training stability.
Retrieval prompt 2:
Explain classifier-free guidance from scratch. What is trained differently? What computation happens at inference time that does not happen during training? What does the guidance scale w control, and what goes wrong at w = 20?
Retrieval prompt 3:
You are deploying Stable Diffusion XL to a production API. Target: fewer than 2 seconds per 1024×1024 image on a single A10G GPU. Walk through your optimization decisions in priority order. Name specific techniques and their expected impact.
Retrieval prompt 4:
What is the difference between a U-Net and a DiT for diffusion? In what way does DiT scale better? Name two production models that use DiT. When would you still choose a U-Net?
What Comes Next¶
This module closes the reasoning-and-multimodal phase of the curriculum. You now have the theoretical foundation and practical intuition for the most important generative architecture in use today.
Next module — 33_capstone_project — brings everything together. You will build a complete AI system using multiple techniques from all prior modules.
Diffusion will be relevant when the capstone includes any image generation, image-to-image transformation, or latent-space reasoning component. The VAE/latent-space intuition also applies to any compressed representation task. The conditioning mechanism (cross-attention, CLIP embeddings) directly transfers to multimodal retrieval work from earlier modules.
Before moving to module 15:
- Complete the foundation-gap audit above. Tick all four checkboxes.
- Do at least exercises 1 and 3 from Chapter 6.
- Finish the end-of-phase reflection in 06_revision.md.
- Ensure the hands_on_lab in 05_hands_on_lab.md is shipped.
Companion files: 01_weekly_plan.md · 03_study_material.md · 04_daily_recall.md · 05_hands_on_lab.md · 06_revision.md