01. Week 13 — Image & Video Models

Key concepts to master

  • Patch tokenization: how a 2D image becomes a 1D token sequence
  • Vision Transformer (ViT): patch embeddings, positional embeddings, self-attention over patches
  • CLIP: contrastive training that aligns image and text in one embedding space
  • Vision-language bridge: projection layer / adapter connecting vision features to an LLM
  • LLaVA pattern: frozen or lightly tuned vision encoder + connector + LLM instruction tuning
  • Generation families: GANs, VAEs, diffusion; where each fits historically
  • Latent space intuition: compressed meaning space, not raw pixels
  • Noise vs signal: the key prerequisite for the diffusion math in Module 14
  • Video modeling: spatiotemporal tokens, temporal attention, frame consistency
  • Multimodal failure modes: hallucination, weak grounding, poor counting, brittle spatial reasoning
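
To make the first concept concrete, here is a minimal sketch of patch tokenization in plain Python. It assumes a single-channel image stored as nested lists and non-overlapping square patches; the function name `patchify` is illustrative, not taken from any library. A real ViT would follow this step with a learned linear projection of each flattened patch and add positional embeddings.

```python
def patchify(image, patch_size):
    """Split an H x W grid of pixel values into flattened, non-overlapping patches.

    Returns tokens in row-major patch order; each token is a flat list of
    patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# Toy 4x4 single-channel "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

With the same function, a 224x224 input and 16x16 patches yields 14 x 14 = 196 tokens, which is the sequence length self-attention then operates on.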

🧠 Mental models

  • Patch tokenization: "cut the image into word-sized tiles before reading it"
  • Vision Transformer (ViT): "treat a picture like a sentence made of patches"
  • CLIP: "train an image reader and text reader to meet in the same coordinate system"
  • Vision-language bridge: "an adapter cable between the vision encoder and the LLM"
  • Latent space: "a lossy zip file of visual meaning rather than raw pixels"
  • Video modeling: "image understanding plus memory of what changed over time"
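
The CLIP mental model ("meet in the same coordinate system") can be sketched as a symmetric contrastive loss over a batch of matched image/text pairs. This is a simplified illustration, not CLIP's actual training code: the embeddings are hand-written toy vectors, the temperature is fixed at 0.07 for illustration (CLIP treats it as a learned parameter), and the helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one softmax row against the index of the correct match."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i is the correct match for text i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Matched pairs point the same way; shuffled pairs point at the wrong partner.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

The loss is low only when each image is closer to its own caption than to every other caption in the batch, and vice versa. Note that nothing in this objective rewards counting or spatial precision directly.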

⚠️ Common traps

  • Assuming strong CLIP similarity means the model is grounded well enough for counting or spatial reasoning.
  • Forgetting that patch size and resolution choices can erase small objects before attention even starts.
  • Treating the projection layer as plumbing only; dimension mismatch and weak alignment can bottleneck the whole VLM.
  • Expecting image-generation intuitions to transfer directly to video, where temporal consistency becomes a separate problem.
  • Overtrusting VLMs on OCR, counting, left-right relations, or fine-grained localization.
  • Evaluating on clean benchmarks only and missing domain shift from real cameras, lighting, or motion blur.
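
The patch-size trap is easy to check with arithmetic. The sketch below estimates how much of a single patch a small object occupies once the image is resized to the model's input resolution. It assumes a square image resized uniformly; the function name and the example numbers are hypothetical.

```python
def object_footprint(original_side, object_side, input_side, patch_size):
    """Estimate a small square object's size after resizing, and its share of one patch.

    Assumes a square image resized uniformly to input_side x input_side.
    """
    scale = input_side / original_side
    resized_object_side = object_side * scale
    fraction_of_patch_area = (resized_object_side / patch_size) ** 2
    return resized_object_side, fraction_of_patch_area


# Hypothetical case: a 40 px defect in a 4000 px photo, fed to a 224 px model with 16 px patches.
side, fraction = object_footprint(4000, 40, 224, 16)
print(round(side, 2))      # 2.24
print(round(fraction, 4))  # 0.0196
```

In this example the defect shrinks to roughly 2 pixels across and covers about 2% of one patch, so it is largely averaged away before any attention layer sees it. Tiling or cropping at higher resolution is one common way to compensate.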

🔗 Prerequisites & connections

Builds on: Module 12 reasoning-model judgment about when extra compute, verification, or routing is worth paying for.

Feeds into: Module 14 diffusion models through latent-space intuition, conditioning, and the signal-vs-noise view of image generation.

💬 Interview phrasing

  • How does a Vision Transformer convert an image into something attention can process?
  • What exactly does CLIP learn, and why is it useful beyond image search demos?
  • In a LLaVA-style architecture, why do you need a projection layer between vision features and the LLM?
  • Why is video understanding or generation harder than single-image modeling?
  • What multimodal failure modes would you expect first in production?
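
For the projection-layer question, a small sketch can help anchor the answer: vision features and LLM token embeddings usually have different widths, so something must map one into the other. The example below uses a single randomly initialised linear map in plain Python as a stand-in for a learned connector. All dimensions and names are toy assumptions; real connectors are trained and may be MLPs or other adapter designs.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Build a random weight matrix and zero bias (stand-in for a learned connector)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(patch_features, weights, bias):
    """Map each vision feature vector into the LLM embedding width."""
    return [
        [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, bias)]
        for feature in patch_features
    ]


VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 12, 4  # toy sizes, chosen only for illustration
vision_features = [[0.1 * (p + d) for d in range(VISION_DIM)] for p in range(NUM_PATCHES)]
weights, bias = make_linear(VISION_DIM, LLM_DIM)
visual_tokens = project(vision_features, weights, bias)

text_tokens = [[0.0] * LLM_DIM for _ in range(3)]  # placeholder text embeddings
llm_input = visual_tokens + text_tokens            # visual tokens placed before the prompt
print(len(llm_input), len(llm_input[0]))           # 7 12
```

After projection, the visual tokens have the same width as text embeddings and can sit in the same input sequence. Matching the width is only the mechanical part; whether the LLM can actually use those tokens depends on how well the connector was trained.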

⏱️ Difficulty markers

  • 🟢 patch tokenization
  • 🟡 ViT self-attention over patches
  • 🟡 CLIP contrastive alignment
  • 🟡 vision-language bridge / LLaVA connector
  • 🔴 video spatiotemporal modeling
  • 🔴 multimodal grounding failures

Self-check questions

For fuller explanations, see 02_explainer.md chapter 6.

  1. Why can a VLM miss an obvious defect in an image? (§1.1, §3.5)
  2. How does ViT turn an image into tokens? (§2.2)
  3. What does CLIP optimize during training? (§2.4)
  4. Why do modern VLMs need a projection layer? (§3.2)
  5. What is the LLaVA training recipe in plain English? (§3.3)
  6. GANs vs VAEs vs diffusion — what is the core trade-off? (§4.1-§4.3)
  7. Why is video generation harder than image generation? (§5.2, §5.4)
  8. What four concepts from this module does Module 14 assume? (§6.5)
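
For question 7, one part of the answer is cost, and it can be estimated directly. The sketch below counts patch tokens for an image versus a short clip and compares the number of token pairs that full self-attention would score. It assumes every frame is patchified independently with no temporal compression, which real video models often avoid; the function names are illustrative.

```python
def token_count(height, width, patch_size, frames=1):
    """Number of patch tokens when every frame is split into square patches."""
    patches_per_frame = (height // patch_size) * (width // patch_size)
    return patches_per_frame * frames


def attention_pairs(num_tokens):
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


image_tokens = token_count(224, 224, 16)             # one frame
video_tokens = token_count(224, 224, 16, frames=16)  # a 16-frame clip
print(image_tokens, video_tokens)                    # 196 3136
print(attention_pairs(video_tokens) // attention_pairs(image_tokens))  # 256
```

Sixteen frames give 16 times the tokens but 256 times the attention pairs, which is one reason video models use factorized temporal attention or compressed spatiotemporal tokens. Temporal consistency is a separate difficulty that this count does not capture.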

Health check

  • [ ] Read all 6 explainer chapters
  • [ ] Can draw ViT patching and CLIP training from memory
  • [ ] Can explain vision encoder → bridge → LLM without notes
  • [ ] Assignment shipped with retrieval metrics and failure analysis
  • [ ] Honest limitations list written in your own words
  • [ ] Foundation-gap audit completed before Module 14
  • [ ] Ready to move to ../02_diffusion_media_generation/