01. Week 14 — Diffusion Models

Key concepts to master

  • Forward diffusion
      • Add Gaussian noise over T steps
      • Use a fixed noise schedule
      • Know the closed-form expression for x_t
  • Reverse diffusion
      • Learn to predict noise from a noisy sample
      • Use a U-Net or DiT to denoise iteratively
      • Understand why the simplified MSE loss works
  • Conditioning
      • Text prompt → CLIP/T5 embeddings
      • Cross-attention injects text into image generation
      • CFG trades diversity for prompt fidelity
  • Latent diffusion
      • VAE encoder compresses images
      • Diffusion happens in latent space, not pixel space
      • VAE decoder renders the final image
  • Modern speedups
      • DDIM and DPM-Solver reduce steps
      • LCM and consistency models distill fast generation
      • DiT and flow matching are the modern scaling path
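
The forward-diffusion and training-loss bullets above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the linear schedule endpoints (1e-4 to 0.02), T = 1000, and every function name here are assumptions chosen for the example. The closed form it uses is x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, where ᾱ_t is the running product of (1 − β_s) up to step t and ε is standard Gaussian noise.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Fixed (not learned) noise schedule: T evenly spaced beta values."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [a * xi + b * ei for xi, ei in zip(x0, eps)]
    return xt, eps

def mse(pred, target):
    """Simplified training loss: mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
betas = linear_beta_schedule(1000)
abar = alpha_bars(betas)

x0 = [0.5, -0.25, 1.0, 0.0]          # toy "image" with four values
xt, eps = q_sample(x0, t=500, abar=abar, rng=rng)

# In training, a network would predict eps from (xt, t); a perfect
# prediction equals eps exactly, so its loss is zero.
loss = mse(eps, eps)
```

Note that nothing in the schedule is trained: only the noise predictor that would replace the "perfect prediction" stand-in has learnable parameters.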

🧠 Mental models

  • Forward diffusion: "slowly fog a photo until only static remains"
  • Reverse diffusion: "restore the scene by removing fog one careful step at a time"
  • DDPM: "take many tiny denoising stair steps instead of one giant leap"
  • Latent diffusion: "edit the compressed sketch, then render the full poster"
  • Guidance scale: "the prompt-obedience dial that can be turned too far"
  • ControlNet: "guide rails that keep generation attached to a pose, edge map, or depth map"

⚠️ Common traps

  • Mixing up the fixed noise schedule with the learned denoiser parameters.
  • Cranking classifier-free guidance too high and causing oversaturation, artifacts, or reduced diversity.
  • Assuming more diffusion steps always help, even when samplers hit diminishing returns.
  • Forgetting that latent diffusion quality depends on the VAE bottleneck and decoder, not just the denoiser.
  • Treating DDIM or solver-based sampling as training tricks when they mainly change the inference-time sampling trajectory and step count while reusing the same trained denoiser.
  • Expecting ControlNet to rescue weak conditioning inputs or preserve identity perfectly on its own.
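
The guidance trap above is easier to see with numbers. The sketch below assumes the common convention in which the guided noise estimate is the unconditional prediction plus the scale times its gap to the conditional prediction, so a scale of 1 is plain conditional sampling. The function name `cfg_combine` and the toy values are hypothetical; in a real sampler the two lists would come from two denoiser passes, one with a null prompt and one with the real prompt.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional noise estimate and
    move toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy stand-ins for the two denoiser passes at a single sampling step.
eps_uncond = [0.0, 1.0]   # pass 1: null prompt
eps_cond = [1.0, 1.0]     # pass 2: real prompt

for scale in (0.0, 1.0, 7.5, 30.0):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

Where the two passes agree (the second value), the scale has no effect. Where they disagree, the gap is multiplied by the scale, so a very large scale pushes the estimate far beyond anything either pass produced, which is one way to think about the oversaturation and artifacts listed above.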

🔗 Prerequisites & connections

Builds on: Module 13 ideas about CLIP/text conditioning, latent spaces, and the difference between signal and noise in visual modeling.

Feeds into: Module 15 capstone decisions about image features, controlled generation, latency-quality trade-offs, and demo-worthy multimodal products.

💬 Interview phrasing

  • What does a DDPM actually learn during training?
  • Why is latent diffusion so much faster than pixel-space diffusion?
  • What does classifier-free guidance do, and why can it fail when the scale is too high?
  • When would you add ControlNet instead of relying on prompt engineering alone?
  • If inference is too slow, what levers would you pull first without destroying quality?
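
For the latent-diffusion question above, a back-of-the-envelope count is a useful anchor. The numbers below assume a 512×512 RGB image and a VAE that downsamples each side by 8 into 4 latent channels, which is the commonly cited Stable Diffusion v1 setup; other models use different factors, and `num_values` is just a helper name for this example.

```python
def num_values(height, width, channels):
    """Number of scalar values the denoiser has to process per sample."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)              # pixel-space tensor
latent_values = num_values(512 // 8, 512 // 8, 4)   # assumed 8x downsample, 4 channels

ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)
```

The denoiser runs once per sampling step, so shrinking the tensor it sees cuts the cost of every step, while the VAE encoder and decoder each run only once per image.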

⏱️ Difficulty markers

  • 🟢 forward diffusion intuition
  • 🟡 reverse denoising objective
  • 🟡 latent diffusion pipeline
  • 🟡 DDIM / DPM-Solver speedups
  • 🔴 classifier-free guidance tuning
  • 🔴 ControlNet conditioning

Foundation-gap audit

Before moving to Module 15, make sure all four are true:

  • [ ] I can explain how diffusion generates an image from pure noise.
  • [ ] I can explain what latent space is and why Stable Diffusion uses it.
  • [ ] I can explain how text conditioning enters the model.
  • [ ] I can explain the main speed-vs-quality trade-offs at inference time.

If any box is shaky, revisit 02_explainer.md Chapter 6 before moving on.

Self-check questions

  1. Write the closed-form formula for x_t given x_0. Name every variable.
  2. What does the U-Net predict and what loss do we minimize?
  3. Why does classifier-free guidance need two passes at inference?
  4. Why is latent diffusion dramatically faster than pixel-space diffusion?
  5. What breaks when guidance scale becomes too large?
  6. DiT vs U-Net — what changes architecturally?
  7. Name two practical ways to reduce latency with limited quality loss.
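
One latency lever behind question 7 is sampling on a strided subset of timesteps with a deterministic DDIM-style update. The sketch below is a toy under stated assumptions: a linear schedule with T = 1000, 20 sampling steps, and a hypothetical `oracle_eps` that stands in for a trained denoiser by being told the true x_0. It shows the mechanics of the update (estimate x_0 from the predicted noise, then re-noise to the next, lower noise level), not real generation quality.

```python
import math
import random

def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    step = (beta_end - beta_start) / (T - 1)
    abar, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + i * step)
        abar.append(running)
    return abar

def ddim_step(xt, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    a_t, s_t = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    x0_pred = [(x - s_t * e) / a_t for x, e in zip(xt, eps_pred)]
    a_p, s_p = math.sqrt(abar_prev), math.sqrt(1.0 - abar_prev)
    return [a_p * x0 + s_p * e for x0, e in zip(x0_pred, eps_pred)]

def oracle_eps(xt, x0, abar_t):
    """Hypothetical perfect denoiser: recovers the exact noise because it is given x0."""
    a_t, s_t = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [(x - a_t * c) / s_t for x, c in zip(xt, x0)]

T, num_steps = 1000, 20
abar = make_alpha_bars(T)
stride = T // num_steps
timesteps = list(range(T - 1, -1, -stride))   # 20 timesteps instead of 1000

rng = random.Random(0)
x0_true = [0.5, -0.25, 1.0]
a, s = math.sqrt(abar[-1]), math.sqrt(1.0 - abar[-1])
x = [a * c + s * rng.gauss(0.0, 1.0) for c in x0_true]   # start from near-pure noise

for i, t in enumerate(timesteps):
    # At the last step, abar_prev = 1.0 means "no noise left".
    abar_prev = abar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    eps = oracle_eps(x, x0_true, abar[t])
    x = ddim_step(x, eps, abar[t], abar_prev)
```

With a perfect noise estimate the loop lands on `x0_true` regardless of the step count; with a real network, fewer steps mean larger jumps between noise levels and more accumulated error, which is the speed-vs-quality trade-off.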

Bridge forward

Next module: 33_capstone_project brings everything together. You will build a complete AI system that combines techniques from the prior modules.