03. Week 13 — Image & Video Models¶

For deep understanding see 02_explainer.md — narrative with the burned-capacitor failure, ViT patch diagrams, VLM architecture, retrieval prompts, and Module 14 bridge. This file is the quick-reference glossary.

Section 1 — Why vision models need tokenization¶

Text already arrives as discrete tokens. Images do not.

A model sees an image first as a giant grid of numbers. Those numbers are pixel intensities, not meanings.

So the first job is representation. We must convert pixels into units a transformer can process.

That unit is usually the patch token. See explainer §2.1-§2.2.

Section 2 — Vision Transformer (ViT)¶

Core idea¶

Split the image into fixed-size patches. Flatten each patch. Project each patch into an embedding vector. Add positional embeddings. Run standard transformer layers.

Canonical pipeline¶

Image 224×224×3
→ split into 16×16 patches
→ 14×14 = 196 patches
→ flatten each patch
→ linear projection
→ add [CLS] token + positions
→ transformer encoder
→ pooled representation / patch features
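The patch arithmetic above can be checked with a short sketch. This is a minimal illustration in plain Python lists (no libraries), not how a real ViT is implemented; the helper name `patchify` is hypothetical, and the linear projection, positional embeddings, and encoder are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: every channel of every pixel in the window.
            flat = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


# A dummy 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=16)

num_patches = len(patches)         # 14 * 14 = 196
patch_dim = len(patches[0])        # 16 * 16 * 3 = 768 values per flattened patch
sequence_length = num_patches + 1  # one extra position for the [CLS] token
```

Each flattened patch has 768 values before projection, and the encoder sees 197 positions once the [CLS] token is prepended.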

Why this mattered¶

Reused transformer scaling behavior from NLP
Reduced reliance on handcrafted CNN inductive biases (locality, translation equivariance)
Worked especially well at scale

Trade-off¶

Needs lots of data or strong pretraining
CNNs are typically more data-efficient on smaller datasets

Section 3 — CLIP¶

CLIP = Contrastive Language-Image Pre-training.

Train two encoders together:
- image encoder
- text encoder

Objective:
- matched image-caption pairs should land close together
- mismatched pairs should land far apart
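The objective can be sketched as a symmetric contrastive loss over a batch. This is a minimal plain-Python illustration under stated assumptions: the function name `clip_style_loss` is hypothetical, the embeddings are made-up 2-d vectors, and the temperature is fixed at 0.07 for illustration (CLIP learns it during training).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one row of logits (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to match text i."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j] = temperature-scaled cosine similarity of image i and text j.
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text direction: each row should peak on its own caption.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled_loss = clip_style_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

When captions line up with their images the loss is near zero; swapping the captions makes it large, which is the pressure that pulls matched pairs together and pushes mismatched pairs apart.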

Result¶

A shared embedding space. That shared space enables:
- zero-shot classification
- image search by text
- image-to-image semantic search
- strong visual backbones for VLMs

Zero-shot classification workflow¶

Encode image.
Encode label prompts like "a photo of a cat".
Compare cosine similarity.
Highest score wins.
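The four steps can be sketched as follows. The vectors here are made-up 3-d stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name; a real pipeline would obtain the embeddings from a trained image encoder and text encoder.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return (best_prompt, scores) for one image vector and a prompt -> vector dict."""
    scores = {
        prompt: cosine_similarity(image_embedding, vector)
        for prompt, vector in prompt_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Illustrative embeddings only; not produced by any real model.
prompt_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(image_embedding, prompt_embeddings)
```

No classifier head is trained here: the label set is just a list of prompts, so changing the classes means changing the strings.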

See explainer §2.4.

Section 4 — How images become token sequences¶

Two levels matter:

Level A — encoder input tokens¶

The image becomes patch embeddings. These are consumed by the vision encoder.

Level B — language-side visual tokens¶

The output features are compressed or projected. Then they become tokens the LLM can attend to.

Do not confuse these two. A common interview mistake is collapsing them together.

Section 5 — Vision-Language Models (VLMs)¶

Generic architecture¶

Image
  ↓
Vision encoder (ViT / CLIP / SigLIP / EVA)
  ↓
Projection layer / adapter / Q-former
  ↓
LLM token space
  ↓
Decoder generates text

Why the bridge exists¶

Vision embeddings and text embeddings live in different spaces. The bridge aligns dimensions and semantics.
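The dimension-alignment half of that job can be sketched with the simplest bridge, a single linear projection. This is a plain-Python illustration with randomly initialised weights; the dimensions and helper names are hypothetical and far smaller than in real models. Semantic alignment only appears once the projection is trained.

```python
import random


def make_linear_projection(in_dim, out_dim, seed=0):
    """Randomly initialised weight matrix (in_dim x out_dim) and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, weights, bias):
    """Map each feature vector from in_dim to out_dim: y = x W + b."""
    return [
        [
            sum(x * w_row[j] for x, w_row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))
        ]
        for vector in features
    ]


VISION_DIM = 8     # hypothetical; real vision encoders are much wider
LLM_DIM = 16       # hypothetical; real LLM embedding widths are in the thousands
NUM_PATCHES = 196  # patch features from a 224x224 image with 16x16 patches

vision_features = [[1.0] * VISION_DIM for _ in range(NUM_PATCHES)]
weights, bias = make_linear_projection(VISION_DIM, LLM_DIM)
visual_tokens = project(vision_features, weights, bias)
```

The output has one vector per patch, each with the LLM's embedding width, so the decoder can attend to them like ordinary token embeddings.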

Common bridge choices¶

linear projection
MLP adapter
Q-Former style learned queries
resampler modules

See explainer §3.1-§3.2.

Section 6 — LLaVA recipe¶

LLaVA became a canonical open VLM recipe:

Start with a pretrained vision encoder.
Start with a pretrained LLM.
Learn a connector between them.
Train the connector on image-caption data while the vision encoder and LLM stay frozen.
Fine-tune the connector and the LLM on image-instruction data for multimodal chat.

Key lesson: You do not train everything from scratch every time. Reuse pretrained parts aggressively.

Section 7 — Frontier multimodal systems¶

Representative systems:
- GPT-4V / GPT-4o vision
- Claude vision models
- Gemini multimodal models
- LLaVA / Pixtral / Qwen-VL / Molmo open families

Shared pattern¶

Openly documented systems combine:
- a strong visual front-end
- a bridge module
- a powerful language decoder
- multimodal instruction tuning

Common weaknesses¶

OCR can still fail on tiny text
counting is brittle
fine-grained spatial reasoning is inconsistent
domain-specific diagnostics require special data

Section 8 — Image generation overview¶

GANs¶

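A generator and a discriminator train against each other. Generation is fast: one forward pass. Historically produced sharp, high-fidelity images. Training instability (for example, mode collapse) was the tax.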

VAEs¶

Encoder compresses. Decoder reconstructs. Useful for latent-space intuition. Often blurrier outputs.

Diffusion¶

Gradually corrupt data with noise. Learn to reverse the corruption. Dominant modern image-generation family. Deep dive happens in Module 14.

See explainer chapter 4.

Section 9 — Latent space intuition¶

A latent space is a compressed representation space. Nearby points often mean similar semantics.

Example:
- cat photo A and cat photo B sit nearby
- dog photos form another region
- style variations move locally

In generation systems:
- VAE latents compress images
- CLIP embeddings align text and images
- diffusion often operates in latent space for speed
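A quick calculation shows why latent space buys speed. The defaults below (8x spatial downsampling, 4 latent channels) follow the Stable Diffusion v1 VAE and are assumptions for illustration; other systems use different factors, and `latent_shape` is a hypothetical helper.

```python
import math


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent grid for an image of the given size."""
    return (height // downsample, width // downsample, latent_channels)


pixel_shape = (512, 512, 3)
latent = latent_shape(512, 512)

pixel_values = math.prod(pixel_shape)  # 786432 numbers in pixel space
latent_values = math.prod(latent)      # 16384 numbers in latent space
compression_ratio = pixel_values / latent_values
```

Under these assumptions the denoiser works on 48 times fewer values than it would in pixel space.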

Section 10 — Noise vs signal¶

This matters for Module 14.

Signal = meaningful structure you care about
Noise = random corruption hiding that structure

For images, signal includes shape, texture, layout, and objects. Noise destroys those patterns gradually.
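To make "gradually" concrete, here is a minimal sketch that mixes a clean 1-d signal with Gaussian noise and measures how much structure survives. The square-root mixing mirrors the form commonly used for diffusion forward processes, but the sine-wave signal, the two mixing levels, and the use of correlation as a "signal left" score are all illustrative assumptions.

```python
import math
import random


def add_noise(signal, signal_keep, rng):
    """Blend signal with Gaussian noise: sqrt(keep) * x + sqrt(1 - keep) * eps."""
    return [
        math.sqrt(signal_keep) * x + math.sqrt(1.0 - signal_keep) * rng.gauss(0.0, 1.0)
        for x in signal
    ]


def correlation(u, v):
    """Pearson correlation, used as a rough measure of surviving structure."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(0)
clean = [math.sin(2 * math.pi * i / 50) for i in range(2000)]  # the "signal"

lightly_noised = add_noise(clean, signal_keep=0.99, rng=rng)
heavily_noised = add_noise(clean, signal_keep=0.01, rng=rng)

corr_light = correlation(clean, lightly_noised)
corr_heavy = correlation(clean, heavily_noised)
```

With most of the signal kept, the noised version still tracks the clean one closely; with almost none kept, the correlation collapses toward zero and the pattern is effectively gone.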

If this idea is fuzzy now, diffusion will feel mystical later. Fix that before Module 14.

Section 11 — Video models¶

What changes from image to video¶

Images have spatial structure. Videos add time on top of it.

So a video model must preserve:
- what is in each frame
- how objects move across frames
- consistency of identity, lighting, and geometry

Common mechanisms¶

spatiotemporal patches
temporal attention
3D latent grids
frame interpolation or latent consistency tricks

Why it is harder¶

sequence length explodes
compute and memory explode
coherence errors accumulate over time
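The explosion is easy to quantify. This is a rough sketch that assumes full self-attention over all spatiotemporal tokens, with cost proportional to the number of token pairs; the helper names, the 16-frame clip, and the `tubelet` parameter (frames grouped per patch) are illustrative assumptions.

```python
def video_token_count(frames, height, width, patch=16, tubelet=1):
    """Tokens when each patch covers patch x patch pixels across `tubelet` frames."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens


image_tokens = video_token_count(frames=1, height=224, width=224)
clip_tokens = video_token_count(frames=16, height=224, width=224)
tubelet_tokens = video_token_count(frames=16, height=224, width=224, tubelet=2)

cost_ratio = attention_pairs(clip_tokens) / attention_pairs(image_tokens)
```

Sixteen frames give 16 times the tokens of one image but 256 times the attention pairs, which is why spatiotemporal patches and factorised temporal attention are used to cut the sequence down.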

See explainer chapter 5.

Section 12 — Evaluation and failure modes¶

Retrieval¶

Recall@K
Precision@K
mean reciprocal rank
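A minimal sketch of the three retrieval metrics, using common definitions. The function names and the image ids are hypothetical; `ranked_ids` is the system's result list, best first, and `relevant_ids` is the ground-truth set for that query.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


query_a = (["img7", "img2", "img9", "img4"], {"img2", "img4"})
query_b = (["img1", "img3"], {"img1"})

recall_a = recall_at_k(*query_a, k=2)        # finds 1 of 2 relevant images
precision_a = precision_at_k(*query_a, k=2)  # 1 of the top 2 is relevant
mrr = mean_reciprocal_rank([query_a, query_b])
```

Recall@K asks how much of the relevant set was found, Precision@K asks how clean the top K is, and MRR only cares how early the first correct hit appears.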

Captioning / VQA¶

object recognition accuracy
grounding to visible evidence
hallucination rate
OCR accuracy

Generation¶

prompt faithfulness
image quality
diversity
human preference

Video generation¶

temporal consistency
motion realism
object persistence
editability

Section 13 — Honest failure modes to memorize¶

"Looks fine" when a defect is visible but tiny
wrong object count in cluttered scenes
confident hallucination about hidden regions
left/right confusion in complex layouts
text reading failures on rotated or low-resolution labels
video identity drift across frames

Section 14 — Module 14 handoff¶

Before starting diffusion, make sure four ideas feel automatic:

How images become tokens
What a vision encoder does
What latent space means
What signal versus noise means

If you cannot explain those cleanly, revisit:
- explainer §2.2
- explainer §2.4
- explainer §4.2
- explainer §6.5

Reading list¶

ViT paper (Dosovitskiy et al., 2020)
CLIP paper (Radford et al., 2021)
LLaVA paper (Liu et al., 2023)
GPT-4V system card or equivalent multimodal report
Recent text-to-video overview or Sora report

Reference material¶

YouTube¶

Vision Transformers Explained Visually — visual intuition for patching, self-attention, and why ViT differs from CNNs.
CLIP, Embeddings, and Multimodal Search — practical explanation of contrastive learning and shared embedding spaces.

Blogs¶

An Image is Worth 16x16 Words — short overview of the ViT idea from the original work.
The Illustrated Stable Diffusion — especially useful for connecting CLIP, latents, and image generation before Module 14.

Self-check¶

Why does vision need patch tokenization at all?
ViT vs CNN — what changed conceptually?
What does CLIP optimize during training?
Why do VLMs need a projection or adapter layer?
What does LLaVA fine-tune, in broad terms?
GAN vs VAE vs diffusion — one-line difference each?
Why is video generation computationally brutal?
What four concepts does Module 14 assume from this week?