03. Week 13 — Image & Video Models

For a deeper treatment, see 02_explainer.md, which covers the burned-capacitor failure narrative, ViT patch diagrams, VLM architecture, retrieval prompts, and the Module 14 bridge. This file is the quick-reference glossary.

Section 1 — Why vision models need tokenization

Text already arrives as discrete tokens. Images do not.

A model sees an image first as a giant grid of numbers. Those numbers are pixel intensities, not meanings.

So the first job is representation. We must convert pixels into units a transformer can process.

That unit is usually the patch token. See explainer §2.1-§2.2.

Section 2 — Vision Transformer (ViT)

Core idea

Split the image into fixed-size patches. Flatten each patch. Project each patch into an embedding vector. Add positional embeddings. Run standard transformer layers.

Canonical pipeline

Image 224×224×3
→ split into 16×16-pixel patches
→ 14×14 grid = 196 patches
→ flatten each patch (16×16×3 = 768 values)
→ linear projection to the model width
→ prepend [CLS] token (197 tokens total) + add positional embeddings
→ transformer encoder
→ pooled representation / patch features
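
The shape bookkeeping in the pipeline above can be checked with a minimal NumPy sketch. It stops just before the transformer encoder. The weights are random stand-ins for learned parameters, and `patchify`, `D_MODEL`, and the initialization scale are illustrative assumptions rather than values taken from any particular implementation.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real RGB image
D_MODEL = 512                          # assumed model width, chosen for illustration

patches = patchify(image)                              # (196, 768)
w_proj = rng.normal(0.0, 0.02, (patches.shape[1], D_MODEL))
patch_tokens = patches @ w_proj                        # (196, D_MODEL)

cls_token = np.zeros((1, D_MODEL))                     # learned in a real model
tokens = np.concatenate([cls_token, patch_tokens])     # (197, D_MODEL)
pos_emb = rng.normal(0.0, 0.02, tokens.shape)          # learned in a real model
encoder_input = tokens + pos_emb                       # what the encoder consumes

print(patches.shape, encoder_input.shape)
```

The only non-obvious step is the reshape/transpose pair, which groups pixels by patch before flattening, so row `i` of `patches` holds one contiguous 16×16×3 block.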

Why this mattered

  • Reused transformer scaling behavior from NLP
  • Reduced reliance on hand-designed CNN inductive biases (locality, translation equivariance)
  • Worked especially well at scale

Trade-off

  • Needs lots of data or strong pretraining
  • CNNs are typically more data-efficient on smaller datasets, because locality and translation equivariance are built in

Section 3 — CLIP

CLIP = Contrastive Language-Image Pre-training.

Train two encoders together:

  • image encoder
  • text encoder

Objective:

  • matched image-caption pairs should land close together
  • mismatched pairs should land far apart
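
One common way to write this objective is a symmetric cross-entropy over a batch similarity matrix. The sketch below assumes that form. The function names are hypothetical and the temperature is fixed at 0.07 here, whereas CLIP treats it as a learned parameter. The embeddings are random stand-ins for encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature        # (N, N), diagonal = matched pairs
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> correct text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> correct image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
image_emb = rng.standard_normal((8, 32))
matched_text = image_emb + 0.05 * rng.standard_normal((8, 32))   # captions near their images
mismatched_text = np.roll(matched_text, shift=1, axis=0)         # every caption off by one

loss_matched = clip_style_loss(image_emb, matched_text)
loss_mismatched = clip_style_loss(image_emb, mismatched_text)
print(f"matched: {loss_matched:.4f}  mismatched: {loss_mismatched:.4f}")
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the "close together / far apart" behavior described above.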

Result

A shared embedding space. That shared space enables:

  • zero-shot classification
  • image search by text
  • image-to-image semantic search
  • strong visual backbones for VLMs

Zero-shot classification workflow

  1. Encode image.
  2. Encode label prompts like "a photo of a cat".
  3. Compare cosine similarity.
  4. Highest score wins.

See explainer §2.4.
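
The four steps above reduce to a normalized dot product and an argmax. The sketch below uses random vectors in place of real encoder outputs, so `label_embs` and `image_emb` are hypothetical. In practice they would come from the text and image encoders of a CLIP-style model.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose prompt embedding has the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = txt @ img                      # cosine similarity per label prompt
    return labels[int(np.argmax(scores))], scores

rng = np.random.default_rng(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Stand-ins for encoder outputs: the "image" is built to sit near the cat prompt.
label_embs = rng.standard_normal((3, 64))
image_emb = label_embs[0] + 0.1 * rng.standard_normal(64)

best, scores = zero_shot_classify(image_emb, label_embs, labels)
print(best, np.round(scores, 3))
```

No classifier head is trained here. Changing the label set only means encoding a different list of prompts.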

Section 4 — How images become token sequences

Two levels matter:

Level A — encoder input tokens

The image becomes patch embeddings. These are consumed by the vision encoder.

Level B — language-side visual tokens

The vision encoder's output features are compressed or projected. Then they become tokens the LLM can attend to.

Do not confuse these two. A common interview mistake is collapsing them together.

Section 5 — Vision-Language Models (VLMs)

Generic architecture

Image
→ Vision encoder (ViT / CLIP / SigLIP / EVA)
→ Projection layer / adapter / Q-Former
→ LLM token space
→ Decoder generates text

Why the bridge exists

Vision embeddings and text embeddings live in different spaces. The bridge aligns dimensions and semantics.

Common bridge choices

  • linear projection
  • MLP adapter
  • Q-Former style learned queries
  • resampler modules

See explainer §3.1-§3.2.
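
As a rough illustration of the MLP-adapter option, the sketch below maps vision-encoder patch features into the LLM's embedding width and places them in front of the text token embeddings. All dimensions are deliberately tiny and hypothetical, the weights are random rather than trained, and `mlp_adapter` is an illustrative name.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_adapter(vision_feats, w1, b1, w2, b2):
    """Two-layer MLP: (num_patches, vision_dim) -> (num_patches, llm_dim)."""
    return gelu(vision_feats @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
VISION_DIM, LLM_DIM = 64, 128           # toy sizes; real models are far wider
NUM_PATCHES, NUM_TEXT_TOKENS = 196, 12

vision_feats = rng.standard_normal((NUM_PATCHES, VISION_DIM))    # vision encoder output
text_embs = rng.standard_normal((NUM_TEXT_TOKENS, LLM_DIM))      # LLM token embeddings

w1 = rng.normal(0.0, 0.02, (VISION_DIM, LLM_DIM)); b1 = np.zeros(LLM_DIM)
w2 = rng.normal(0.0, 0.02, (LLM_DIM, LLM_DIM));    b2 = np.zeros(LLM_DIM)

visual_tokens = mlp_adapter(vision_feats, w1, b1, w2, b2)        # (196, LLM_DIM)
llm_input = np.concatenate([visual_tokens, text_embs], axis=0)   # (208, LLM_DIM)
print(visual_tokens.shape, llm_input.shape)
```

This adapter only fixes the dimension mismatch. The semantic alignment comes from training it on paired image-text data. Q-Former and resampler designs additionally reduce the number of visual tokens, which this sketch does not do.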

Section 6 — LLaVA recipe

LLaVA became a canonical open VLM recipe:

  1. Start with a pretrained vision encoder.
  2. Start with a pretrained LLM.
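  3. Learn a connector between them.
  4. Align features: train the connector on image-caption data while the vision encoder and LLM stay frozen.
  5. Fine-tune the connector and the LLM on image-instruction data for multimodal chat.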

Key lesson: You do not train everything from scratch every time. Reuse pretrained parts aggressively.

Section 7 — Frontier multimodal systems

Representative systems:

  • GPT-4V / GPT-4o vision
  • Claude vision models
  • Gemini multimodal models
  • LLaVA / Pixtral / Qwen-VL / Molmo open families

Shared pattern

Broadly, they combine:

  • a strong visual front-end
  • a bridge module
  • a powerful language decoder
  • multimodal instruction tuning

Common weaknesses

  • OCR can still fail on tiny text
  • counting is brittle
  • fine-grained spatial reasoning is inconsistent
  • domain-specific diagnostics require special data

Section 8 — Image generation overview

GANs

Generator vs discriminator. Fast generation in a single forward pass. Historically produced sharp, high-fidelity images. The cost was training instability, such as mode collapse.

VAEs

Encoder compresses. Decoder reconstructs. Useful for latent-space intuition. Often blurrier outputs.

Diffusion

Gradually corrupt data with noise. Learn to reverse the corruption. Dominant modern image-generation family. Deep dive happens in Module 14.

See explainer chapter 4.

Section 9 — Latent space intuition

A latent space is a compressed representation space. Nearby points often mean similar semantics.

Example:

  • cat photo A and cat photo B sit nearby
  • dog photos form another region
  • style variations move locally

In generation systems:

  • VAE latents compress images
  • CLIP embeddings align text and images
  • diffusion often operates in latent space for speed

Section 10 — Noise vs signal

This matters for Module 14.

  • Signal = meaningful structure you care about
  • Noise = random corruption hiding that structure

For images, signal includes shape, texture, layout, and objects. Noise destroys those patterns gradually.
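
To make "gradually" concrete, the sketch below mixes a structured toy image with Gaussian noise at several levels and measures how much of the original pattern survives. The mixing rule `sqrt(keep) * signal + sqrt(1 - keep) * noise` is the variance-preserving form commonly used in diffusion models. The `keep_levels` values and the sinusoidal "image" are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a smooth 64x64 pattern standing in for real structure.
ys, xs = np.mgrid[0:64, 0:64]
signal = np.sin(xs / 6.0) + np.cos(ys / 9.0)
signal = (signal - signal.mean()) / signal.std()     # zero mean, unit variance

def add_noise(x0, keep, rng):
    """Mix signal with Gaussian noise; `keep` in [0, 1] is the share of variance left to the signal."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(keep) * x0 + np.sqrt(1.0 - keep) * noise

keep_levels = [1.0, 0.9, 0.5, 0.1, 0.01]   # arbitrary illustrative schedule
correlations = []
for keep in keep_levels:
    noisy = add_noise(signal, keep, rng)
    corr = np.corrcoef(signal.ravel(), noisy.ravel())[0, 1]
    correlations.append(corr)
    print(f"signal kept = {keep:<4}  correlation with original = {corr:.3f}")
```

The correlation falls steadily as `keep` shrinks. At the last level the result is almost indistinguishable from pure noise, which is the starting point a diffusion model learns to reverse.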

If this idea is fuzzy now, diffusion will feel mystical later. Fix that before Module 14.

Section 11 — Video models

What changes from image to video

Images add spatial structure. Videos add time.

So a video model must preserve:

  • what is in each frame
  • how objects move across frames
  • consistency of identity, lighting, and geometry

Common mechanisms

  • spatiotemporal patches
  • temporal attention
  • 3D latent grids
  • frame interpolation or latent consistency tricks

Why it is harder

  • sequence length explodes
  • compute and memory explode
  • coherence errors accumulate over time

See explainer chapter 5.
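
A back-of-the-envelope calculation shows why sequence length and compute explode. The sketch assumes ViT-style 16×16 patches per frame, an optional temporal patch that groups consecutive frames, and full self-attention whose pairwise cost grows with the square of the token count. `video_token_count` and `attention_pairs` are hypothetical helpers, not part of any library.

```python
def video_token_count(frames, height, width, patch=16, temporal_patch=1):
    """Number of spatiotemporal patch tokens for a clip (assumes even divisibility)."""
    return (frames // temporal_patch) * (height // patch) * (width // patch)

def attention_pairs(num_tokens):
    """Token pairs scored by one full self-attention layer."""
    return num_tokens ** 2

single_image = video_token_count(frames=1, height=224, width=224)
five_seconds = video_token_count(frames=120, height=224, width=224)   # 5 s at 24 fps
grouped = video_token_count(frames=120, height=224, width=224, temporal_patch=2)

for name, n in [("single image", single_image),
                ("5 s clip", five_seconds),
                ("5 s clip, temporal patch 2", grouped)]:
    print(f"{name:<28} tokens = {n:>6}  attention pairs = {attention_pairs(n):>12,}")
```

Going from one frame to 120 frames multiplies tokens by 120 but multiplies full-attention pairs by 120², which is one reason video models lean on temporal patching, factorized attention, and compressed latents.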

Section 12 — Evaluation and failure modes

Retrieval

  • Recall@K
  • Precision@K
  • mean reciprocal rank
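
The three retrieval metrics can be written in a few lines. This is a minimal sketch using plain Python lists and sets. The function names and the image IDs in the example are made up for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k

def mean_reciprocal_rank(rankings, relevants):
    """Average of 1 / rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two hypothetical text-to-image queries with their ranked results and ground truth.
ranked_q1, relevant_q1 = ["img7", "img2", "img9", "img4"], {"img2", "img4"}
ranked_q2, relevant_q2 = ["img1", "img3"], {"img1"}

print(recall_at_k(ranked_q1, relevant_q1, k=2))       # 0.5
print(precision_at_k(ranked_q1, relevant_q1, k=2))    # 0.5
print(mean_reciprocal_rank([ranked_q1, ranked_q2], [relevant_q1, relevant_q2]))  # 0.75
```

Recall@K rewards finding every relevant item, Precision@K penalizes junk in the top results, and MRR only cares how early the first correct hit appears.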

Captioning / VQA

  • object recognition accuracy
  • grounding to visible evidence
  • hallucination rate
  • OCR accuracy

Generation

  • prompt faithfulness
  • image quality
  • diversity
  • human preference

Video generation

  • temporal consistency
  • motion realism
  • object persistence
  • editability

Section 13 — Honest failure modes to memorize

  • "Looks fine" when a defect is visible but tiny
  • wrong object count in cluttered scenes
  • confident hallucination about hidden regions
  • left/right confusion in complex layouts
  • text reading failures on rotated or low-resolution labels
  • video identity drift across frames

Section 14 — Module 14 handoff

Before starting diffusion, make sure four ideas feel automatic:

  1. How images become tokens
  2. What a vision encoder does
  3. What latent space means
  4. What signal versus noise means

If you cannot explain those cleanly, revisit:

  • explainer §2.2
  • explainer §2.4
  • explainer §4.2
  • explainer §6.5

Reading list

  1. ViT paper (Dosovitskiy et al., 2020)
  2. CLIP paper (Radford et al., 2021)
  3. LLaVA paper (Liu et al., 2023)
  4. GPT-4V system card or equivalent multimodal report
  5. Recent text-to-video overview or Sora report

Self-check

  1. Why does vision need patch tokenization at all?
  2. ViT vs CNN — what changed conceptually?
  3. What does CLIP optimize during training?
  4. Why do VLMs need a projection or adapter layer?
  5. What does LLaVA fine-tune, in broad terms?
  6. GAN vs VAE vs diffusion — one-line difference each?
  7. Why is video generation computationally brutal?
  8. What four concepts does Module 14 assume from this week?