01. Week 13: Image & Video Models

Key concepts to master

Patch tokenization: how a 2D image becomes a 1D token sequence
Vision Transformer (ViT): patch embeddings, positional embeddings, self-attention over patches
CLIP: contrastive training that aligns image and text in one embedding space
Vision-language bridge: projection layer / adapter connecting vision features to an LLM
LLaVA pattern: frozen or lightly-tuned vision encoder + connector + LLM instruction tuning
Generation families: GANs, VAEs, diffusion; where each fits historically
Latent space intuition: compressed meaning space, not raw pixels
Noise vs signal: a crucial prerequisite for the diffusion math in Module 14
Video modeling: spatiotemporal tokens, temporal attention, frame consistency
Multimodal failure modes: hallucination, weak grounding, poor counting, brittle spatial reasoning
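
To make the first concept concrete, here is a minimal sketch of patch tokenization in plain Python. It assumes a single-channel image stored as a list of rows and a patch size that divides both sides evenly; the function name `patchify` and the toy 4x4 image are illustrative, not taken from any library.

```python
def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):        # walk patch rows, top to bottom
        for left in range(0, width, patch):    # then patch columns, left to right
            tokens.append([
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


# Toy 4x4 "image" whose pixel values are simply their raster index (assumed data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens))   # 4 patches, so a sequence of 4 tokens
print(tokens[0])     # [0, 1, 4, 5]: the top-left 2x2 tile, flattened
```

In a ViT, each flattened patch would then pass through a learned patch embedding and receive a positional embedding before self-attention; this sketch stops at the 2D-to-1D step.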

Patch tokenization: "cut the image into word-sized tiles before reading it"
Vision Transformer (ViT): "treat a picture like a sentence made of patches"
CLIP: "train an image reader and text reader to meet in the same coordinate system"
Vision-language bridge: "an adapter cable between the vision encoder and the LLM"
Latent space: "a zip file of visual meaning rather than raw pixels"
Video modeling: "image understanding plus memory of what changed over time"
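
The "adapter cable" picture can be sketched as a single linear projection that changes the width of each vision feature vector to the width the LLM expects. This is only an illustration under assumed sizes (4 in, 6 out) with random, untrained weights; `make_projection` and `project` are hypothetical helper names, and real connectors may use an MLP or another adapter design instead.

```python
import random


def make_projection(d_vision, d_llm, seed=0):
    """Create small random weights of shape (d_vision, d_llm) and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(d_llm)] for _ in range(d_vision)]
    bias = [0.0] * d_llm
    return weights, bias


def project(features, weights, bias):
    """Map each vision feature vector (length d_vision) to length d_llm."""
    d_vision = len(weights)
    projected = []
    for vec in features:
        if len(vec) != d_vision:
            # The dimension mismatch case: the cable does not fit the socket.
            raise ValueError("feature width does not match projection input size")
        projected.append([
            sum(vec[i] * weights[i][j] for i in range(d_vision)) + bias[j]
            for j in range(len(bias))
        ])
    return projected


# Three patch features of width 4 (assumed toy values).
patch_features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
weights, bias = make_projection(d_vision=4, d_llm=6)
visual_tokens = project(patch_features, weights, bias)
print(len(visual_tokens), len(visual_tokens[0]))   # 3 6
```

The number of tokens is unchanged; only their width changes. Matching shapes is the easy part, though: the weights still have to be trained so the projected vectors land somewhere the LLM can interpret.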

Assuming strong CLIP similarity means the model is grounded well enough for counting or spatial reasoning.
Forgetting that patch size and resolution choices can erase small objects before attention even starts.
Treating the projection layer as plumbing only; dimension mismatch and weak alignment can bottleneck the whole VLM.
Expecting image-generation intuitions to transfer directly to video, where temporal consistency becomes a separate problem.
Overtrusting VLMs on OCR, counting, left-right relations, or fine-grained localization.
Evaluating on clean benchmarks only and missing domain shift from real cameras, lighting, or motion blur.
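
The patch-size pitfall is easy to check with arithmetic. The sketch below assumes a square image resized uniformly to the model's input side (no cropping or aspect-ratio handling), and every number in it is hypothetical: a 24-pixel defect in a 1920-pixel frame, a 224-pixel model input, and 16-pixel patches.

```python
def object_footprint(object_px, source_side, model_side, patch):
    """Return the object's side length after resizing and the share of one patch it fills."""
    scale = model_side / source_side
    resized_px = object_px * scale
    patch_fraction = (resized_px / patch) ** 2   # area relative to a single patch
    return resized_px, patch_fraction


resized_px, patch_fraction = object_footprint(
    object_px=24, source_side=1920, model_side=224, patch=16
)
print(f"{resized_px:.1f} px wide after resize")         # 2.8 px wide after resize
print(f"{patch_fraction:.1%} of one patch's area")      # 3.1% of one patch's area
```

Under these assumptions the defect contributes roughly 3% of one token's pixels, so most of its signal is gone before attention runs. Higher input resolution or tiling are the usual levers to test.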

Builds on: Module 12 reasoning-model judgment about when extra compute, verification, or routing is worth paying for.

Feeds into: Module 14 diffusion models through latent-space intuition, conditioning, and the signal-vs-noise view of image generation.

How does a Vision Transformer convert an image into something attention can process?
What exactly does CLIP learn, and why is it useful beyond image search demos?
In a LLaVA-style architecture, why do you need a projection layer between vision features and the LLM?
Why is video understanding or generation harder than single-image modeling?
What multimodal failure modes would you expect first in production?
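
For the CLIP question, it can help to see the shape of the objective. The following is a rough sketch of a symmetric contrastive loss over a batch of matched image and text embeddings, written in plain Python. The embeddings are toy values, the temperature is fixed at 0.07 as an assumption (in practice it is typically learned), and the function names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy; pair k matches pair k."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(images)
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[row][k] for row in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch of two pairs: aligned embeddings vs the same embeddings with captions swapped.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)   # True
```

The loss rewards each image for being closer to its own caption than to the other captions in the batch, and vice versa. Nothing in it asks for counting or localization, which is one way to see why high similarity does not imply strong grounding.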

For fuller explanations, see 02_explainer.md chapter 6.

Why can a VLM miss an obvious defect in an image? (§1.1, §3.5)
How does ViT turn an image into tokens? (§2.2)
What does CLIP optimize during training? (§2.4)
Why do modern VLMs need a projection layer? (§3.2)
What is the LLaVA training recipe in plain English? (§3.3)
GANs vs VAEs vs diffusion — what is the core trade-off? (§4.1-§4.3)
Why is video generation harder than image generation? (§5.2, §5.4)
What four concepts from this module does Module 14 assume? (§6.5)
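
One quick way to build intuition for the video question is to count tokens. This sketch assumes every frame is patchified independently and that all tokens attend to each other jointly, which is the most expensive layout; real video models often reduce the cost, for example by separating spatial and temporal attention. The sizes (224-pixel frames, 16-pixel patches, 16 frames) are assumed for illustration.

```python
def token_count(side, patch, frames=1):
    """Tokens for a square input: patches per frame times number of frames."""
    per_frame = (side // patch) ** 2
    return per_frame * frames


image_tokens = token_count(224, 16)              # 14 * 14 = 196
video_tokens = token_count(224, 16, frames=16)   # 196 * 16 = 3136

# Full self-attention compares every token with every token.
pair_ratio = (video_tokens ** 2) / (image_tokens ** 2)
print(image_tokens, video_tokens, pair_ratio)    # 196 3136 256.0
```

Sixteen frames mean 16 times the tokens but 256 times the attention pairs under this layout, and that is before the separate problem of keeping content consistent from frame to frame.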