00. Image & Video Models: Plain English First
Read this before you look at any diagram.
Imagine an art-school painter standing in front of a photo. The teacher first asks only one thing. “What is in front of you?” That is recognition. In our module, the helper doing this is the eye. The eye does not see magic. It sees many small squares. Each square is the patch. Many patches together become a usable picture.
Then the teacher changes the task. Now the painter must describe the photo in words. “A red scooter near a tea stall.” Seeing alone is not enough now. The visual signals must cross into language. That bridge is the translator. If the eye is sharp but the translator is weak, the answer sounds fluent but misses the picture. Simple, no?
Then the teacher removes the photo. Only the prompt remains. “Paint monsoon clouds over a city road.” Now the painter must create, not just recognize. This is where the canvas matters. The canvas is the learned space where the model paints a possible image before decoding it back to pixels. A weak canvas gives blur, broken hands, and confused geometry. A strong canvas keeps structure steady.
Finally the teacher asks for motion. “Show the bicycle turning and the scarf moving in the wind.” One frame is not enough now. Many frames must agree with each other. That timeline is the frame tape. The frame tape is the temporal glue that keeps identity, motion, and lighting aligned across a clip. So the ladder is clear. Recognition comes first. Description comes next. Generation comes after that. Video is the hardest because time multiplies every mistake.
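One concrete way to feel how time multiplies things is to count patches. The sketch below is a rough illustration, not how any particular video model works. It assumes every frame is cut into non-overlapping square patches with no temporal compression, and the function name `count_patches` and the example sizes are hypothetical.

```python
def count_patches(height, width, patch_size, num_frames=1):
    """Count the patches the eye would produce for an image or a clip.

    Assumption: each frame is split independently into non-overlapping
    square patches, and frames are not compressed along time.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    patches_per_frame = (height // patch_size) * (width // patch_size)
    return patches_per_frame * num_frames


# One 224x224 image with 16x16 patches.
single_image = count_patches(224, 224, 16)

# The same frame size repeated over a 16-frame clip.
short_clip = count_patches(224, 224, 16, num_frames=16)

print(single_image, short_clip)
```

Under these assumptions one image gives 196 patches and a 16-frame clip gives 3136. Every extra frame adds a full grid of patches that must stay consistent with the others.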
Look. We will keep using five placeholders. They are memory anchors. They stop the math from floating away. Use them again and again. The eye sees. The translator explains. The canvas paints. The patch is the small unit. The frame tape keeps the movie together.
| Placeholder | What it means |
|---|---|
| the eye | vision encoder — sees raw pixels, outputs numbers |
| the translator | vision-language bridge — turns those numbers into words |
| the canvas | generation space — the mathematical space where images are painted |
| the patch | image token — a small square chunk of the image, like a brush stroke |
| the frame tape | temporal dimension — the timeline that links video frames |
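To make the patch concrete, here is a minimal sketch of cutting a picture into small squares. It uses a tiny grid of numbers in place of real pixels and plain Python lists in place of tensors. The function name `split_into_patches` is hypothetical, and a real vision encoder would also flatten and project each patch, which is left out here.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D grid of pixel values into non-overlapping square patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Take patch_size rows, and from each row patch_size columns.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A toy 4x4 "image" whose pixel values are just 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)

for patch in patches:
    print(patch)
```

The 4x4 grid becomes four 2x2 patches. The first one holds the top-left corner, `[[0, 1], [4, 5]]`, and the last one holds the bottom-right corner.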
Top Resources
- Vision Transformer paper — the cleanest starting point for patch-based vision.
- CLIP paper — shows shared image-text embedding space clearly.
- LLaVA project page — practical recipe for connecting vision to a language model.
- GPT-4V system card — useful for grounding and failure modes.
- Gemini technical report — good for thinking about native multimodal training choices.
- The Illustrated Stable Diffusion — end-to-end text-to-image picture first.
- Annotated Diffusion blog — practical diffusion mechanics in plain language.
- ControlNet paper — best starting point for controllable image generation.
- Sora technical report — good mental picture for modern video generation.
What's coming
- 01-opening-failure.md — why raw pixels plus a dense network is the wrong opening move.
- 02-vision-encoders-vit.md — how the patch becomes the basic unit inside the eye.
- 03-clip-contrastive-alignment.md — how images and text are pulled into one shared space.
- 04-vision-language-models.md — how the eye and the translator connect to a language model.
- 05-llava-and-frontier-vlms.md — how modern assistants are tuned to answer about images.
- 06-training-vlms-failure-points.md — what still goes wrong even after training looks strong.
- 07-image-generation-landscape.md — the big families that try to paint on the canvas.
- 08-text-to-image-pipeline.md — the full prompt-to-image loop in latent diffusion.
- 09-image-editing-and-control.md — how masks, depth, and pose steer generation.
- 10-video-tokenization-temporal.md — why the frame tape changes the token game.
- 11-text-to-video-systems.md — how modern systems generate moving clips.
- 12-production-multimodal-systems.md — what latency, batching, and cost do to real products.
- 13-honest-admission.md — what still feels unresolved and uncertain.
Bridge. Failure teaches the shape of a system faster than a shiny demo. → 01-opening-failure.md