03. Week 13 — Image & Video Models¶
For deep understanding, see 02_explainer.md: a narrative with the burned-capacitor failure, ViT patch diagrams, VLM architecture, retrieval prompts, and the Module 14 bridge. This file is the quick-reference glossary.
Section 1 — Why vision models need tokenization¶
Text already arrives as discrete tokens. Images do not.
A model sees an image first as a giant grid of numbers. Those numbers are pixel intensities, not meanings.
So the first job is representation. We must convert pixels into units a transformer can process.
That unit is usually the patch token. See explainer §2.1-§2.2.
Section 2 — Vision Transformer (ViT)¶
Core idea¶
Split the image into fixed-size patches. Flatten each patch. Project each patch into an embedding vector. Add positional embeddings. Run standard transformer layers.
Canonical pipeline¶
Image 224×224×3
→ split into 16×16 patches
→ 14×14 = 196 patches
→ flatten each patch
→ linear projection
→ add [CLS] token + positions
→ transformer encoder
→ pooled representation / patch features
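The pipeline above, up to the transformer encoder's input, can be sketched in NumPy. This is a minimal illustration, not a trained model: the weights are random, and `embed_dim = 768` is an assumed embedding width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: a random "image" and randomly initialised (untrained) weights.
image = rng.random((224, 224, 3))           # H x W x C
patch = 16                                  # patch side length
embed_dim = 768                             # assumed embedding width

h, w, c = image.shape
gh, gw = h // patch, w // patch             # 14 x 14 grid of patches

# Split into non-overlapping patches, then flatten each patch to a vector.
patches = image.reshape(gh, patch, gw, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
patches = patches.reshape(gh * gw, patch * patch * c)   # (196, 768)

# Linear projection of each flattened patch into the embedding space.
w_proj = rng.normal(0, 0.02, (patch * patch * c, embed_dim))
tokens = patches @ w_proj                   # (196, embed_dim)

# Prepend a [CLS] token and add positional embeddings.
cls_token = np.zeros((1, embed_dim))
pos_embed = rng.normal(0, 0.02, (gh * gw + 1, embed_dim))
sequence = np.concatenate([cls_token, tokens], axis=0) + pos_embed

print(patches.shape, sequence.shape)        # (196, 768) (197, 768)
```

The resulting `sequence` of 197 vectors is what the transformer encoder would consume.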
Why this mattered¶
- Reused transformer scaling behavior from NLP
- Reduced reliance on the hand-designed inductive biases of CNNs (locality, translation equivariance)
- Worked especially well at scale
Trade-off¶
- Needs lots of data or strong pretraining
- CNNs are typically more data-efficient on smaller datasets, where their built-in priors help
Section 3 — CLIP¶
CLIP = Contrastive Language-Image Pre-training.
Train two encoders together:
- image encoder
- text encoder
Objective:
- matched image-caption pairs should land close together
- mismatched pairs should land far apart
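One common way to write this objective is a symmetric cross-entropy over the image-text similarity matrix. The sketch below uses toy embeddings in place of encoder outputs, and the fixed `temperature=0.07` is an assumption for illustration (in practice the temperature is usually learned).

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature   # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                    # toy caption embeddings
aligned = text + 0.01 * rng.normal(size=(4, 8))   # images close to their captions
shuffled = aligned[[1, 2, 3, 0]]                  # every pair mismatched

loss_aligned = contrastive_loss(aligned, text)
loss_shuffled = contrastive_loss(shuffled, text)
print(loss_aligned < loss_shuffled)
```

Matched pairs sit on the diagonal of `logits`, so pulling them together and pushing the rest apart lowers the loss.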
Result¶
A shared embedding space. That shared space enables:
- zero-shot classification
- image search by text
- image-to-image semantic search
- strong visual backbones for VLMs
Zero-shot classification workflow¶
- Encode image.
- Encode label prompts like "a photo of a cat".
- Compare cosine similarity.
- Highest score wins.
See explainer §2.4.
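The four workflow steps reduce to a few lines once the embeddings exist. The vectors below are hypothetical stand-ins; in a real system they would come from the image and text encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for encoder outputs.
image_embedding = np.array([0.9, 0.1, 0.2])
prompt_embeddings = {
    "a photo of a cat": np.array([1.0, 0.0, 0.1]),
    "a photo of a dog": np.array([0.1, 1.0, 0.0]),
    "a photo of a car": np.array([0.0, 0.2, 1.0]),
}

# Compare the image against every label prompt; the highest score wins.
scores = {
    prompt: cosine_similarity(image_embedding, emb)
    for prompt, emb in prompt_embeddings.items()
}
prediction = max(scores, key=scores.get)
print(prediction)  # a photo of a cat
```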
Section 4 — How images become token sequences¶
Two levels matter:
Level A — encoder input tokens¶
The image becomes patch embeddings. These are consumed by the vision encoder.
Level B — language-side visual tokens¶
The encoder's output features are compressed or projected into the LLM's embedding space. Then they become tokens the LLM can attend to.
Do not confuse these two. A common interview mistake is collapsing them together.
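To make the difference in token counts concrete, here is a sketch of one possible compression step: averaging each 2×2 block of neighbouring patch features. The pooling choice and the dimensions are assumptions for illustration; real systems use a variety of compression schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Level A output: one feature vector per patch (toy values, 14 x 14 grid).
grid, vision_dim = 14, 32
patch_features = rng.normal(size=(grid * grid, vision_dim))   # (196, 32)

# Level B, compression: average each 2 x 2 block of neighbouring patches.
pool = 2
g = grid // pool
pooled = patch_features.reshape(g, pool, g, pool, vision_dim).mean(axis=(1, 3))
visual_tokens = pooled.reshape(g * g, vision_dim)             # (49, 32)

print(patch_features.shape, "->", visual_tokens.shape)        # (196, 32) -> (49, 32)
```

The encoder worked with 196 patch tokens; the LLM would only see 49 visual tokens after this step.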
Section 5 — Vision-Language Models (VLMs)¶
Generic architecture¶
Image
↓
Vision encoder (ViT / CLIP / SigLIP / EVA)
↓
Projection layer / adapter / Q-former
↓
LLM token space
↓
Decoder generates text
Why the bridge exists¶
Vision embeddings and text embeddings live in different spaces. The bridge aligns dimensions and semantics.
Common bridge choices¶
- linear projection
- MLP adapter
- Q-Former style learned queries
- resampler modules
See explainer §3.1-§3.2.
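The two simplest bridge choices, a linear projection and an MLP adapter, can be sketched as plain matrix operations. Weights are random and the sizes are toy values, not the widths of any real model.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def linear_bridge(features, w, b):
    """Single linear projection: vision_dim -> llm_dim."""
    return features @ w + b

def mlp_bridge(features, w1, b1, w2, b2):
    """Two-layer MLP adapter with a GELU in between."""
    return gelu(features @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n_tokens, vision_dim, llm_dim = 196, 64, 128     # toy sizes, not real model widths
features = rng.normal(size=(n_tokens, vision_dim))

w = rng.normal(0, 0.02, (vision_dim, llm_dim))
b = np.zeros(llm_dim)
w1 = rng.normal(0, 0.02, (vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(0, 0.02, (llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

linear_out = linear_bridge(features, w, b)
mlp_out = mlp_bridge(features, w1, b1, w2, b2)
print(linear_out.shape, mlp_out.shape)           # (196, 128) (196, 128)
```

Either way, the output has the LLM's embedding width, so each row can sit in the same sequence as text token embeddings.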
Section 6 — LLaVA recipe¶
LLaVA became a canonical open VLM recipe:
- Start with a pretrained vision encoder.
- Start with a pretrained LLM.
- Learn a connector between them.
- Train the connector on image-caption data so visual features align with the LLM.
- Fine-tune on image-instruction data for multimodal chat.
Key lesson: You do not train everything from scratch every time. Reuse pretrained parts aggressively.
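The reuse idea can be written down as a training plan: which parts are updated at each stage and which stay frozen. The split below follows the two-stage setup as I understand it from the original LLaVA paper; treat the exact assignment as an assumption, since later variants differ. All names are hypothetical labels, not a library API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset

COMPONENTS = frozenset({"vision_encoder", "connector", "llm"})

# Assumed two-stage plan; the component names are illustrative labels.
STAGES = [
    Stage("feature alignment", "image-caption pairs", frozenset({"connector"})),
    Stage("instruction tuning", "image-instruction conversations",
          frozenset({"connector", "llm"})),
]

for stage in STAGES:
    frozen = sorted(COMPONENTS - stage.trainable)
    print(f"{stage.name}: train {sorted(stage.trainable)}, freeze {frozen}")
```

In both stages the pretrained vision encoder stays frozen, which is the "reuse pretrained parts" lesson in concrete form.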
Section 7 — Frontier multimodal systems¶
Representative systems:
- GPT-4V / GPT-4o vision
- Claude vision models
- Gemini multimodal models
- LLaVA / Pixtral / Qwen-VL / Molmo open families
Shared pattern¶
All of them combine:
- a strong visual front-end
- a bridge module
- a powerful language decoder
- multimodal instruction tuning
Common weaknesses¶
- OCR can still fail on tiny text
- counting is brittle
- fine-grained spatial reasoning is inconsistent
- domain-specific diagnostics require special data
Section 8 — Image generation overview¶
GANs¶
Generator vs discriminator. Fast generation. Historically produced sharp, realistic images. Training instability was the tax.
VAEs¶
Encoder compresses. Decoder reconstructs. Useful for latent-space intuition. Often blurrier outputs.
Diffusion¶
Gradually corrupt data with noise. Learn to reverse the corruption. Dominant modern image-generation family. Deep dive happens in Module 14.
See explainer chapter 4.
Section 9 — Latent space intuition¶
A latent space is a compressed representation space. Nearby points often mean similar semantics.
Example:
- cat photo A and cat photo B sit nearby
- dog photos form another region
- style variations move locally
In generation systems:
- VAE latents compress images
- CLIP embeddings align text and images
- diffusion often operates in latent space for speed
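A quick count shows why working in latent space is faster. The shapes below are assumptions, typical of Stable Diffusion v1-style latent diffusion (8× spatial downsampling, 4 latent channels); other systems use different ratios.

```python
def num_values(height, width, channels):
    """Total number of scalar values in a height x width x channels array."""
    return height * width * channels

# Assumed shapes, typical of Stable Diffusion v1-style latent diffusion.
pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)    # 8x spatial downsampling, 4 latent channels

pixels = num_values(*pixel_shape)
latents = num_values(*latent_shape)
ratio = pixels / latents
print(pixels, latents, ratio)  # 786432 16384 48.0
```

Under these assumptions the denoiser processes 48 times fewer values per step than it would in pixel space.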
Section 10 — Noise vs signal¶
This matters for Module 14.
- Signal = meaningful structure you care about
- Noise = random corruption hiding that structure
For images, signal includes shape, texture, layout, and objects. Noise destroys those patterns gradually.
If this idea is fuzzy now, diffusion will feel mystical later. Fix that before Module 14.
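A small experiment makes the idea tangible: blend a clean toy image with Gaussian noise and measure how much of the original survives. The blend below mirrors the usual diffusion forward step, where `keep` plays the role of the cumulative signal coefficient; the toy image and the chosen `keep` values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "signal": a smooth left-to-right gradient with values in [-1, 1].
x0 = np.tile(np.linspace(-1.0, 1.0, 32), (32, 1))

def add_noise(x0, keep):
    """Blend signal and Gaussian noise; keep=1 is pure signal, keep=0 is pure noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(keep) * x0 + np.sqrt(1.0 - keep) * noise

correlations = {}
for keep in (1.0, 0.5, 0.1, 0.0):
    xt = add_noise(x0, keep)
    correlations[keep] = np.corrcoef(x0.ravel(), xt.ravel())[0, 1]
    print(f"keep={keep:.1f}  correlation with original={correlations[keep]:.2f}")
```

As `keep` falls, the correlation with the original gradient drops toward zero: the structure is still in the formula, but the noise increasingly hides it.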
Section 11 — Video models¶
What changes from image to video¶
Images have spatial structure. Videos add time.
So a video model must preserve:
- what is in each frame
- how objects move across frames
- consistency of identity, lighting, and geometry
Common mechanisms¶
- spatiotemporal patches
- temporal attention
- 3D latent grids
- frame interpolation or latent consistency tricks
Why it is harder¶
- sequence length explodes
- compute and memory explode
- coherence errors accumulate over time
See explainer chapter 5.
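The blow-up is easy to quantify with back-of-the-envelope arithmetic. The sketch assumes per-frame 16×16 patches on 224×224 frames, a hypothetical 16-frame clip, and full self-attention over all tokens; real video models usually reduce this with spatiotemporal patches or factorized attention.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention."""
    return num_tokens ** 2

patches_per_frame = 14 * 14      # 196, as in the 224x224 / 16x16 ViT example
num_frames = 16                  # hypothetical short clip

image_tokens = patches_per_frame
video_tokens = patches_per_frame * num_frames
cost_ratio = attention_pairs(video_tokens) // attention_pairs(image_tokens)

print(video_tokens)   # 3136
print(cost_ratio)     # 256
```

Sixteen times more tokens means 256 times more attention pairs, which is why naive full attention over video does not scale.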
Section 12 — Evaluation and failure modes¶
Retrieval¶
- Recall@K
- Precision@K
- mean reciprocal rank
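The three retrieval metrics can be sketched in plain Python. The item IDs are made up; `ranked` is a system's ordered result list and `relevant` is the ground-truth set for that query.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

ranked = ["img7", "img2", "img9", "img4"]
relevant = {"img2", "img4"}

print(recall_at_k(ranked, relevant, k=2))      # 0.5
print(precision_at_k(ranked, relevant, k=2))   # 0.5
print(mean_reciprocal_rank([(ranked, relevant), (["img1", "img3"], {"img1"})]))  # 0.75
```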
Captioning / VQA¶
- accuracy of object recognition
- grounding to visible evidence
- hallucination rate
- OCR accuracy
Generation¶
- prompt faithfulness
- image quality
- diversity
- human preference
Video generation¶
- temporal consistency
- motion realism
- object persistence
- editability
Section 13 — Honest failure modes to memorize¶
- "Looks fine" when a defect is visible but tiny
- wrong object count in cluttered scenes
- confident hallucination about hidden regions
- left/right confusion in complex layouts
- text reading failures on rotated or low-resolution labels
- video identity drift across frames
Section 14 — Module 14 handoff¶
Before starting diffusion, make sure four ideas feel automatic:
- How images become tokens
- What a vision encoder does
- What latent space means
- What signal versus noise means
If you cannot explain those cleanly, revisit:
- explainer §2.2
- explainer §2.4
- explainer §4.2
- explainer §6.5
Reading list¶
- ViT paper (Dosovitskiy et al., 2020)
- CLIP paper (Radford et al., 2021)
- LLaVA paper (Liu et al., 2023)
- GPT-4V system card or equivalent multimodal report
- Recent text-to-video overview or Sora report
Reference material¶
YouTube¶
- Vision Transformers Explained Visually — visual intuition for patching, self-attention, and why ViT differs from CNNs.
- CLIP, Embeddings, and Multimodal Search — practical explanation of contrastive learning and shared embedding spaces.
Blogs¶
- An Image is Worth 16x16 Words — short overview of the ViT idea from the original work.
- The Illustrated Stable Diffusion — especially useful for connecting CLIP, latents, and image generation before Module 14.
Self-check¶
- Why does vision need patch tokenization at all?
- ViT vs CNN — what changed conceptually?
- What does CLIP optimize during training?
- Why do VLMs need a projection or adapter layer?
- What does LLaVA fine-tune, in broad terms?
- GAN vs VAE vs diffusion — one-line difference each?
- Why is video generation computationally brutal?
- What four concepts does Module 14 assume from this week?