01. Week 14: Diffusion Models

Key concepts to master

Forward diffusion
Add Gaussian noise over T steps
Use a fixed noise schedule
Know the closed-form expression for x_t
Reverse diffusion
Learn to predict noise from a noisy sample
Use a U-Net or DiT to denoise iteratively
Understand why the simplified MSE loss works
Conditioning
Text prompt → CLIP/T5 embeddings
Cross-attention injects text into image generation
Classifier-free guidance (CFG) trades diversity for prompt fidelity
Latent diffusion
VAE encoder compresses images
Diffusion happens in latent space, not pixel space
VAE decoder renders the final image
Modern speedups
DDIM and DPM-Solver reduce the number of sampling steps
LCM and consistency models use distillation to enable few-step generation
Diffusion Transformers (DiT) and flow matching are the current path to scaling
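The forward-diffusion and reverse-diffusion items above can be made concrete with a small sketch. It assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps (a common DDPM default, not something this module specifies) and uses plain Python lists in place of image tensors. The function names are illustrative, not from any library.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Fixed (not learned) noise schedule: T evenly spaced beta values.
    The start/end values are an assumed, commonly used default."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative product of alpha_t = 1 - beta_t (signal kept up to step t)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    Returns the noisy sample and the noise that was added."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    xt = [signal * x + noise * e for x, e in zip(x0, eps)]
    return xt, eps

def simplified_loss(eps_pred, eps):
    """Simplified objective: mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

betas = linear_beta_schedule(T=1000)
abar = alpha_bars(betas)
rng = random.Random(0)

x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" with four pixels
xt, eps = q_sample(x0, t=999, abar=abar, rng=rng)

# A stand-in "model" that always predicts zero noise, just to exercise the loss.
zero_prediction = [0.0 for _ in eps]
print(f"alpha_bar at the final step: {abar[-1]:.2e}")
print(f"loss of the zero predictor: {simplified_loss(zero_prediction, eps):.3f}")
```

Only the denoiser that produces `eps_pred` would have trainable parameters. The schedule (`betas`, `abar`) stays fixed, which is the distinction the first common trap below points at.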

🧠 Mental models

Forward diffusion: "slowly fog a photo until only static remains"
Reverse diffusion: "restore the scene by removing fog one careful step at a time"
DDPM: "take many tiny denoising stair steps instead of one giant leap"
Latent diffusion: "edit the compressed sketch, then render the full poster"
Guidance scale: "the prompt-obedience dial that can be turned too far"
ControlNet: "guide rails that keep generation attached to a pose, edge map, or depth map"
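The "prompt-obedience dial" can be sketched in a few lines. This assumes one common convention for classifier-free guidance, where the guided noise is the unconditional prediction plus a scaled difference toward the conditional one. Some papers and libraries parameterize the scale differently, and the toy numbers below are made up for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Guided noise: eps_uncond + s * (eps_cond - eps_uncond).
    s = 0 ignores the prompt, s = 1 is the plain conditional prediction,
    s > 1 extrapolates past it."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Hypothetical noise predictions for a three-element sample.
# In a real sampler both come from the same model at each step:
# one pass with the prompt embedding, one with an empty prompt.
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]

for scale in (0.0, 1.0, 7.5, 20.0):
    guided = cfg_combine(eps_uncond, eps_cond, scale)
    print(scale, [round(v, 3) for v in guided])
```

At large scales the guided prediction moves far outside the range of either input, which is one way to see how the dial "can be turned too far".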

⚠️ Common traps

Mixing up the fixed noise schedule with the learned denoiser parameters.
Cranking classifier-free guidance too high and causing oversaturation, artifacts, or reduced diversity.
Assuming more sampling steps always help, even after the sampler has hit diminishing returns.
Forgetting that latent diffusion quality depends on the VAE bottleneck and decoder, not just the denoiser.
Treating DDIM or solver-based sampling as training tricks when they mainly change the inference-time sampling trajectory and speed.
Expecting ControlNet to rescue weak conditioning inputs or preserve identity perfectly on its own.
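To make the DDIM trap concrete, here is a minimal sketch of one deterministic DDIM update (the eta = 0 case). It only consumes a noise prediction, so it works with an already-trained denoiser and changes nothing about training. The `abar` values and the "oracle" noise are assumptions chosen for illustration. In practice `eps_pred` would come from the model.

```python
import math

def ddim_step(xt, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update from noise level abar_t to abar_prev.
    1) Estimate the clean sample x_0 from x_t and the predicted noise.
    2) Re-noise that estimate to the earlier (less noisy) level."""
    x0_pred = [
        (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        for x, e in zip(xt, eps_pred)
    ]
    return [
        math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]

# Toy setup: build x_t from a known x_0 and known noise, then step back.
x0 = [0.5, -0.25, 1.0]
true_eps = [0.3, -1.2, 0.7]
abar_t, abar_prev = 0.2, 0.6  # illustrative values; the gap can span many schedule steps

xt = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, true_eps)]
x_prev = ddim_step(xt, true_eps, abar_t, abar_prev)
print([round(v, 4) for v in x_prev])
```

Because the update is written in terms of `abar_t` and `abar_prev` rather than adjacent step indices, the sampler can jump across many training steps at once, which is where the step reduction comes from.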

🔗 Prerequisites & connections

Builds on: Module 13 ideas about CLIP/text conditioning, latent spaces, and the difference between signal and noise in visual modeling.

Feeds into: Module 15 capstone decisions about image features, controlled generation, latency-quality trade-offs, and demo-worthy multimodal products.

💬 Interview phrasing

What does a DDPM actually learn during training?
Why is latent diffusion so much faster than pixel-space diffusion?
What does classifier-free guidance do, and why can it fail when the scale is too high?
When would you add ControlNet instead of relying on prompt engineering alone?
If inference is too slow, what levers would you pull first without destroying quality?
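For the latent-diffusion speed question, a back-of-envelope count is a useful starting point. The sketch below assumes a Stable Diffusion v1-style setup, with a 512x512 RGB image, a VAE that downsamples each spatial side by 8, and 4 latent channels. Those numbers are assumptions for illustration, not something this module states.

```python
def tensor_elements(height, width, channels):
    """Number of values the denoiser must process per sample."""
    return height * width * channels

# Assumed configuration (Stable Diffusion v1-style); adjust for other models.
image_side = 512
downsample_factor = 8
latent_channels = 4

pixel_elements = tensor_elements(image_side, image_side, 3)
latent_side = image_side // downsample_factor
latent_elements = tensor_elements(latent_side, latent_side, latent_channels)

print("pixel-space elements: ", pixel_elements)   # 786432
print("latent-space elements:", latent_elements)  # 16384
print("reduction factor:     ", pixel_elements / latent_elements)  # 48.0
```

Element count is only a proxy for cost. Attention layers scale faster than linearly with the number of spatial positions, so the practical saving can differ from this ratio, and the VAE encode/decode passes add a fixed overhead.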

⏱️ Difficulty markers

🟢 forward diffusion intuition
🟡 reverse denoising objective
🟡 latent diffusion pipeline
🟡 DDIM / DPM-Solver speedups
🔴 classifier-free guidance tuning
🔴 ControlNet conditioning

Foundation-gap audit

Before moving to Module 15, make sure all four are true:

[ ] I can explain how diffusion generates an image from pure noise.
[ ] I can explain what latent space is and why Stable Diffusion uses it.
[ ] I can explain how text conditioning enters the model.
[ ] I can explain the main speed-vs-quality tradeoffs at inference time.

If any box is shaky, revisit 02_explainer.md Chapter 6 before moving on.

Self-check questions

Write the closed-form formula for x_t given x_0. Name every variable.
What does the U-Net predict and what loss do we minimize?
Why does classifier-free guidance need two forward passes per sampling step at inference?
Why is latent diffusion dramatically faster than pixel-space diffusion?
What breaks when guidance scale becomes too large?
DiT vs U-Net — what changes architecturally?
Name two practical ways to reduce latency with limited quality loss.

Completion gate

[ ] 02_explainer.md read end-to-end
[ ] 04_daily_recall.md answered aloud across the week
[ ] 05_hands_on_lab.md first version shipped
[ ] Foundation-gap audit complete
[ ] 06_revision.md completed honestly

Bridge forward

The next module, 33_capstone_project, brings everything together: you will build a complete AI system that combines techniques from the prior modules.