01. Week 14 — Diffusion Models
Key concepts to master
- Forward diffusion
  - Add Gaussian noise over T steps
  - Use a fixed noise schedule
  - Know the closed-form expression for x_t
- Reverse diffusion
  - Learn to predict noise from a noisy sample
  - Use a U-Net or DiT to denoise iteratively
  - Understand why the simplified MSE loss works
- Conditioning
  - Text prompt → CLIP/T5 embeddings
  - Cross-attention injects text into image generation
  - CFG trades diversity for prompt fidelity
- Latent diffusion
  - VAE encoder compresses images
  - Diffusion happens in latent space, not pixel space
  - VAE decoder renders the final image
- Modern speedups
  - DDIM and DPM-Solver reduce steps
  - LCM and consistency models distill fast generation
  - DiT and flow matching are the modern scaling path
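The forward-diffusion and training-objective items above can be made concrete with a minimal NumPy sketch. It assumes a linear beta schedule with the commonly used endpoints 1e-4 and 0.02 over T = 1000 steps. The function names (`make_alpha_bar`, `q_sample`, `simple_loss`) are illustrative and are not taken from this module's lab code. The closed form implemented here is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_s) up to step t and eps is standard Gaussian noise.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for a fixed linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight from x_0 to x_t using the closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def simple_loss(eps_pred, eps):
    """Simplified training objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy stand-in for an image
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)

# A trivial baseline "model" that always predicts zero noise, for comparison only.
baseline_loss = simple_loss(np.zeros_like(eps), eps)
print("signal kept at t=500:", np.sqrt(alpha_bar[500]))
print("baseline loss:", baseline_loss)
```

In real training, `eps_pred` would come from the U-Net or DiT given `x_t` and `t`. The schedule itself contains no learned parameters, which is the distinction the first common trap points at.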
🧠 Mental models
- Forward diffusion: "slowly fog a photo until only static remains"
- Reverse diffusion: "restore the scene by removing fog one careful step at a time"
- DDPM: "take many tiny denoising stair steps instead of one giant leap"
- Latent diffusion: "edit the compressed sketch, then render the full poster"
- Guidance scale: "the prompt-obedience dial that can be turned too far"
- ControlNet: "guide rails that keep generation attached to a pose, edge map, or depth map"
⚠️ Common traps
- Mixing up the fixed noise schedule with the learned denoiser parameters.
- Cranking classifier-free guidance too high and causing oversaturation, artifacts, or reduced diversity.
- Assuming more diffusion steps always help, even when samplers hit diminishing returns.
- Forgetting that latent diffusion quality depends on the VAE bottleneck and decoder, not just the denoiser.
- Treating DDIM or solver-based sampling as training tricks, when they mainly change the inference-time sampling trajectory and step count rather than the trained weights.
- Expecting ControlNet to rescue weak conditioning inputs or preserve identity perfectly on its own.
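The guidance-scale trap is easier to see with the combination rule written out. The sketch below assumes the standard classifier-free guidance formula: the final noise estimate is the unconditional prediction plus the guidance scale times the difference between the conditional and unconditional predictions. The vectors are made-up stand-ins for real denoiser outputs, and `cfg_combine` is an illustrative name.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Start at the unconditional prediction and move guidance_scale
    times along the (conditional - unconditional) direction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions a denoiser would return for the
# same noisy sample: one with the prompt, one with an empty prompt.
eps_uncond = np.array([0.2, -0.1, 0.4])
eps_cond = np.array([0.5, 0.1, 0.3])

for scale in (0.0, 1.0, 7.5, 20.0):
    guided = cfg_combine(eps_uncond, eps_cond, scale)
    print(f"scale={scale:>4}: norm of guided prediction = {np.linalg.norm(guided):.3f}")
```

A scale of 0 ignores the prompt, 1 reproduces the conditional prediction, and larger values extrapolate past it. That extrapolation is one way to understand why very high settings can push samples toward oversaturation and artifacts.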
🔗 Prerequisites & connections
Builds on: Module 13 ideas about CLIP/text conditioning, latent spaces, and the difference between signal and noise in visual modeling.
Feeds into: Module 15 capstone decisions about image features, controlled generation, latency-quality trade-offs, and demo-worthy multimodal products.
💬 Interview phrasing
- What does a DDPM actually learn during training?
- Why is latent diffusion so much faster than pixel-space diffusion?
- What does classifier-free guidance do, and why can it fail when the scale is too high?
- When would you add ControlNet instead of relying on prompt engineering alone?
- If inference is too slow, what levers would you pull first without destroying quality?
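For the latency question, one commonly cited lever is running fewer sampler steps. The sketch below shows a deterministic DDIM-style update (eta = 0) over an evenly spaced subset of timesteps, assuming a linear beta schedule and `num_steps` no larger than the schedule length. The `oracle` denoiser is a hypothetical stand-in that already knows the clean sample and exists only to check the algebra. `ddim_step` and `sample` are illustrative names, not a library API.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the (less noisy) target level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def sample(x_T, denoiser, alpha_bar, num_steps):
    """Run num_steps DDIM updates over an evenly spaced subset of timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        # After the last step there is no noise left, so alpha_bar is 1.
        abar_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, denoiser(x, t), alpha_bar[t], abar_prev)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))

def oracle(x, t):
    """Hypothetical perfect denoiser: returns the exact noise separating x from x0."""
    return (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

x_T = rng.standard_normal(x0.shape)
out_50 = sample(x_T, oracle, alpha_bar, num_steps=50)
out_10 = sample(x_T, oracle, alpha_bar, num_steps=10)
print(np.allclose(out_50, x0), np.allclose(out_10, x0))
```

With a perfect noise prediction, any number of steps lands on the same clean sample. A trained model's predictions are imperfect, so cutting steps generally trades some quality for speed, and the training procedure is unchanged either way.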
⏱️ Difficulty markers
- 🟢 forward diffusion intuition
- 🟡 reverse denoising objective
- 🟡 latent diffusion pipeline
- 🟡 DDIM / DPM-Solver speedups
- 🔴 classifier-free guidance tuning
- 🔴 ControlNet conditioning
Foundation-gap audit
Before moving to Module 15, make sure all four are true:
- [ ] I can explain how diffusion generates an image from pure noise.
- [ ] I can explain what latent space is and why Stable Diffusion uses it.
- [ ] I can explain how text conditioning enters the model.
- [ ] I can explain the main speed-vs-quality tradeoffs at inference time.
If any box is shaky, revisit 02_explainer.md Chapter 6 before moving on.
Self-check questions
- Write the closed-form formula for x_t given x_0. Name every variable.
- What does the U-Net predict and what loss do we minimize?
- Why does classifier-free guidance need two denoiser evaluations (conditional and unconditional) per sampling step at inference?
- Why is latent diffusion dramatically faster than pixel-space diffusion?
- What breaks when guidance scale becomes too large?
- DiT vs U-Net — what changes architecturally?
- Name two practical ways to reduce latency with limited quality loss.
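As a rough check on the latent-diffusion speed question, the arithmetic below compares how many values the denoiser processes per step in pixel space versus latent space. The shapes are assumptions based on the commonly cited Stable Diffusion v1 setup (512×512 RGB input, 8× spatial downsampling, 4 latent channels). Other models use different factors.

```python
from math import prod

pixel_shape = (3, 512, 512)  # channels x height x width of an RGB image
downsample = 8               # assumed VAE spatial downsampling factor
latent_channels = 4          # assumed number of latent channels

latent_shape = (
    latent_channels,
    pixel_shape[1] // downsample,
    pixel_shape[2] // downsample,
)

pixel_values = prod(pixel_shape)
latent_values = prod(latent_shape)
ratio = pixel_values / latent_values

print(latent_shape, pixel_values, latent_values, ratio)
# (4, 64, 64) 786432 16384 48.0
```

Under these assumptions, each denoising step touches about 48 times fewer values. The spatial grid also shrinks from 262,144 positions to 4,096, which matters because self-attention cost grows quadratically with the number of positions.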
Completion gate
- [ ] 02_explainer.md read end-to-end
- [ ] 04_daily_recall.md answered aloud across the week
- [ ] 05_hands_on_lab.md first version shipped
- [ ] Foundation-gap audit complete
- [ ] 06_revision.md completed honestly
Bridge forward
Next module — 33_capstone_project — brings everything together. You will build a complete AI system using multiple techniques from all prior modules.