01. Week 13: Image & Video Models
Key concepts to master
- Patch tokenization: how a 2D image becomes a 1D token sequence
- Vision Transformer (ViT): patch embeddings, positional embeddings, self-attention over patches
- CLIP: contrastive training that aligns image and text in one embedding space
- Vision-language bridge: projection layer / adapter connecting vision features to an LLM
- LLaVA pattern: frozen or lightly-tuned vision encoder + connector + LLM instruction tuning
- Generation families: GANs, VAEs, diffusion; where each fits historically
- Latent space intuition: compressed meaning space, not raw pixels
- Noise vs signal: a prerequisite for the diffusion math in Module 14
- Video modeling: spatiotemporal tokens, temporal attention, frame consistency
- Multimodal failure modes: hallucination, weak grounding, poor counting, brittle spatial reasoning
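The patch tokenization step in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration, not a production implementation: the `patchify` helper and the toy 4x4 image are hypothetical, and a real ViT would follow this step with a learned linear patch embedding and positional embeddings, both omitted here.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns (H / patch) * (W / patch) tokens in row-major order, each a flat
    list of patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append every channel value
            tokens.append(token)
    return tokens


# Toy 4x4 single-channel image whose pixel value encodes its position.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), len(tokens[0]))  # 4 tokens, 4 values each
print(tokens[0])                    # top-left patch: [0, 1, 4, 5]
```

The 2D grid becomes a 1D sequence purely through the row-major loop order, which is why positional embeddings are needed afterwards to tell the model where each patch came from.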
🧠 Mental models
- Patch tokenization: "cut the image into word-sized tiles before reading it"
- Vision Transformer (ViT): "treat a picture like a sentence made of patches"
- CLIP: "train an image reader and text reader to meet in the same coordinate system"
- Vision-language bridge: "an adapter cable between the vision encoder and the LLM"
- Latent space: "a zip file of visual meaning rather than raw pixels"
- Video modeling: "image understanding plus memory of what changed over time"
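The CLIP mental model ("meet in the same coordinate system") corresponds to a symmetric contrastive loss over a batch of matched image-text pairs. The sketch below is a toy version under stated assumptions: the embeddings are hand-written 3D vectors rather than encoder outputs, the helper names are hypothetical, and the temperature is fixed at 0.07 for illustration, whereas CLIP learns it during training.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned_text = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text = aligned_text[1:] + aligned_text[:1]  # captions paired with wrong images

aligned = clip_loss(image_embs, aligned_text)
shuffled = clip_loss(image_embs, shuffled_text)
print(f"aligned={aligned:.3f} shuffled={shuffled:.3f}")
```

The loss is low only when each image is closest to its own caption and each caption is closest to its own image, which is the sense in which the two encoders are pulled into one shared space.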
⚠️ Common traps
- Assuming strong CLIP similarity means the model is grounded well enough for counting or spatial reasoning.
- Forgetting that patch size and resolution choices can erase small objects before attention even starts.
- Treating the projection layer as plumbing only; dimension mismatch and weak alignment can bottleneck the whole VLM.
- Expecting image-generation intuitions to transfer directly to video, where temporal consistency becomes a separate problem.
- Overtrusting VLMs on OCR, counting, left-right relations, or fine-grained localization.
- Evaluating on clean benchmarks only and missing domain shift from real cameras, lighting, or motion blur.
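The patch-size and resolution trap is easy to check with arithmetic. The sketch below estimates how large a small object remains after the encoder resizes the image, relative to the area of one patch. The function name and the inspection-photo numbers are hypothetical, and it assumes a square image resized uniformly.

```python
def object_footprint(source_res, model_res, patch, object_px):
    """Estimate a small square object's size after resizing.

    source_res: side length of the original square image, in pixels
    model_res:  side length the encoder resizes the image to
    patch:      patch side length in pixels at model resolution
    object_px:  side length of the object in the original image
    Returns (object side in resized pixels, object area as a fraction of one patch).
    """
    scale = model_res / source_res
    resized_px = object_px * scale
    patch_fraction = min(1.0, (resized_px / patch) ** 2)
    return resized_px, patch_fraction


# Hypothetical case: a 20 px scratch in a 2048 px inspection photo.
for model_res, patch in [(224, 16), (224, 32), (448, 14)]:
    resized_px, fraction = object_footprint(2048, model_res, patch, 20)
    print(f"res={model_res} patch={patch}: object is {resized_px:.1f} px wide, "
          f"about {fraction:.1%} of one patch")
```

When the object covers only a few percent of a single patch, its signal is averaged with the background inside that patch embedding before any attention layer runs.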
🔗 Prerequisites & connections
Builds on: Module 12 reasoning-model judgment about when extra compute, verification, or routing is worth paying for.
Feeds into: Module 14 diffusion models through latent-space intuition, conditioning, and the signal-vs-noise view of image generation.
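One way to build the signal-vs-noise intuition ahead of Module 14 is to blend a clean signal with Gaussian noise at different ratios. The sketch below is an assumption, not the Module 14 formulation: it uses a variance-preserving blend, treats both signal and noise as roughly unit variance when reporting the ratio, and uses a sine wave as a stand-in for a row of image pixels.

```python
import math
import random


def mix(signal, noise, keep):
    """Blend signal and noise; keep is the variance fraction given to the signal."""
    a, b = math.sqrt(keep), math.sqrt(1.0 - keep)
    return [a * s + b * n for s, n in zip(signal, noise)]


def snr(keep):
    """Signal-to-noise ratio of the blend, assuming unit-variance inputs."""
    return keep / (1.0 - keep)


rng = random.Random(0)
signal = [math.sin(i / 4.0) for i in range(32)]  # stand-in for a clean image row
noise = [rng.gauss(0.0, 1.0) for _ in signal]

for keep in (0.99, 0.5, 0.01):
    noisy = mix(signal, noise, keep)
    print(f"keep={keep:<4} snr={snr(keep):8.2f} first value={noisy[0]:+.3f}")
```

At `keep=0.99` the output is almost the original signal; at `keep=0.01` it is almost pure noise. Generation can be viewed as walking that range in reverse.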
💬 Interview phrasing
- How does a Vision Transformer convert an image into something attention can process?
- What exactly does CLIP learn, and why is it useful beyond image search demos?
- In a LLaVA-style architecture, why do you need a projection layer between vision features and the LLM?
- Why is video understanding or generation harder than single-image modeling?
- What multimodal failure modes would you expect first in production?
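For the projection-layer question, it can help to show the shape problem concretely: vision features have one width, LLM token embeddings have another, and the connector maps between them so both can sit in one sequence. The sketch below is illustrative only. The dimensions are toy values, the weights are random stand-ins for a trained connector, and a single linear map is just one simple connector choice.

```python
import random


def make_projection(vision_dim, llm_dim, seed=0):
    """Random weights standing in for a trained linear connector (assumption)."""
    rng = random.Random(seed)
    scale = vision_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(llm_dim)] for _ in range(vision_dim)]


def project(features, weights):
    """Map each vision feature (length vision_dim) to length llm_dim."""
    llm_dim = len(weights[0])
    return [[sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(llm_dim)]
            for f in features]


VISION_DIM, LLM_DIM = 8, 12  # toy sizes; real models use hundreds to thousands
patch_features = [[float(p + d) for d in range(VISION_DIM)] for p in range(4)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]  # placeholder prompt tokens

weights = make_projection(VISION_DIM, LLM_DIM)
visual_tokens = project(patch_features, weights)
llm_input = visual_tokens + text_embeddings  # one sequence the LLM can attend over

print(len(llm_input), len(llm_input[0]))  # 7 positions, each of width 12
```

Without the projection, the 8-wide patch features could not be concatenated with the 12-wide text embeddings at all. With a poorly trained projection they can be concatenated but carry little usable meaning for the LLM.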
⏱️ Difficulty markers
- 🟢 patch tokenization
- 🟡 ViT self-attention over patches
- 🟡 CLIP contrastive alignment
- 🟡 vision-language bridge / LLaVA connector
- 🔴 video spatiotemporal modeling
- 🔴 multimodal grounding failures
Self-check questions
For fuller explanations, see 02_explainer.md chapter 6.
- Why can a VLM miss an obvious defect in an image? (§1.1, §3.5)
- How does ViT turn an image into tokens? (§2.2)
- What does CLIP optimize during training? (§2.4)
- Why do modern VLMs need a projection layer? (§3.2)
- What is the LLaVA training recipe in plain English? (§3.3)
- GANs vs VAEs vs diffusion — what is the core trade-off? (§4.1-§4.3)
- Why is video generation harder than image generation? (§5.2, §5.4)
- What four concepts from this module does Module 14 assume? (§6.5)
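Part of the answer to the video question is plain token arithmetic. The sketch below counts query-key pairs for a clip under two attention layouts: joint attention over all spatiotemporal tokens, and a factorized layout that attends within each frame and then across frames at each patch position. The function is hypothetical and covers only compute cost; temporal consistency is a separate difficulty that it does not model.

```python
def attention_pairs(frames, patches_per_frame):
    """Count query-key pairs for two attention layouts over a video clip."""
    tokens = frames * patches_per_frame
    joint = tokens ** 2  # every token attends to every token in the clip
    # Factorized: spatial attention inside each frame, then temporal attention
    # across frames at each patch position.
    spatial = frames * patches_per_frame ** 2
    temporal = patches_per_frame * frames ** 2
    return tokens, joint, spatial + temporal


patches = (224 // 16) ** 2  # 196 patches for a 224 px frame with 16 px patches
for frames in (1, 8, 32):
    tokens, joint, factorized = attention_pairs(frames, patches)
    print(f"frames={frames:>2} tokens={tokens:>5} "
          f"joint={joint:>11,} factorized={factorized:>10,}")
```

Joint attention grows with the square of the frame count, which is why video models often restrict or factorize attention along the time axis.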
Health check
- [ ] Read all 6 explainer chapters
- [ ] Can draw ViT patching and CLIP training from memory
- [ ] Can explain vision encoder → bridge → LLM without notes
- [ ] Assignment shipped with retrieval metrics and failure analysis
- [ ] Honest limitations list written in your own words
- [ ] Foundation-gap audit completed before Module 14
- [ ] Ready to move to ../02_diffusion_media_generation/