
02. Image & Video Models — Narrative Explainer

Companion to 03_study_material.md. That file gives you definitions and lookup tables. This file gives you the picture in your head.


Table of contents

  • ELI5 — the whole ladder in kid words

  • Chapter 1: The burned capacitor failure

  • 1.1 The first miss

  • 1.2 Why the model said “looks fine”

  • 1.3 Why this matters for Lead roles

  • 1.4 The debugging lens

  • 1.5 What a strong engineer would do next

  • Chapter 2: Vision encoders — teaching the eye to see

  • 2.1 Pixels are not meanings yet

  • 2.2 ViT — how patches become tokens

  • 2.3 Why patching works

  • 2.4 CLIP — aligning images and language

  • 2.5 How images become token sequences

  • 2.6 Where the eye still fails

  • Chapter 3: Vision-Language Models — teaching the translator to speak

  • 3.1 From seeing to answering

  • 3.2 Generic VLM architecture

  • 3.3 The LLaVA recipe

  • 3.4 Frontier multimodal systems

  • 3.5 Training process and failure points

  • 3.6 Retrieval prompts for memory

  • Chapter 4: Image generation — teaching the canvas to paint

  • 4.1 GANs — the duel

  • 4.2 VAEs — the compress-and-rebuild story

  • 4.3 Diffusion — the dominant modern path

  • 4.4 Current SOTA map

  • 4.5 The minimum concepts Module 14 assumes

  • Chapter 5: Video models — teaching the canvas to move

  • 5.1 From image to frame tape

  • 5.2 Video tokenization and temporal attention

  • 5.3 Text-to-video systems now

  • 5.4 Why video is brutally expensive

  • 5.5 Honest admission

  • Chapter 6: Recap and application

  • 6.1 Failure-fix chain

  • 6.2 Key points to remember

  • 6.3 Important interview questions

  • 6.4 Production experience

  • 6.5 Foundation-gap audit for Module 14

  • 6.6 Apply now — exercises and bridge


ELI5 — the whole ladder in kid words

Imagine a young painter in an art school. On day one, the teacher shows a cat photo. The teacher asks, “What is this?” The painter says, “Cat.” Good. That is classification. The painter is not painting anything yet. The painter is simply learning to recognize patterns. In our story, the painter has five helpers. The first helper is the eye. The eye looks at the picture carefully. Technically, this is the vision encoder. The second helper is the translator. The translator turns what the eye noticed into language-friendly signals. Technically, this is the vision-language bridge. The third helper is the canvas. The canvas is where new pictures can be imagined and generated. Technically, this is the generation space.

The fourth helper is the patch. A patch is one little square cut from the image. Technically, this is the image token. The fifth helper is the frame tape. A frame tape is a strip of frames laid through time. Technically, this is the temporal dimension. Very important. Do not mix these helpers up. The eye sees. The translator connects seeing to speaking. The canvas creates. The patch is the small unit of seeing. The frame tape adds time and motion. Now watch the painter grow.

Step 1 — learning to recognize

First, the painter learns recognition. The teacher shows many photos. Cat. Dog. Car. Tree. Capacitor. Burn mark. The painter slowly learns visual patterns. Whiskers often mean cat. Wheels often mean car. A dark swollen cylinder on a board may mean damaged capacitor. This is the first skill. The painter is learning what visual things look like. This is what image classification models did very well. They answer questions like, “What is in this image?” They are not chatting yet. They are just recognizing.

Step 2 — learning to describe

Then the teacher upgrades the task. Now the teacher asks, “Describe what you see.” The painter says, “A small orange cat sleeping on a blue chair.” Now we moved beyond recognition. Now vision and language are touching each other. The eye still looks first. But now the translator becomes essential. The translator takes visual understanding and connects it to words. This is captioning, visual question answering, and VLM behavior. The painter can now answer, “What color is the cat?” “Where is it sitting?” “Is it indoors or outdoors?” This is much harder than classification. Because language needs detail. Language also needs grounding. The answer must stay tied to actual pixels. Otherwise the painter starts bluffing.

Step 3 — learning to paint from words

Next, the teacher gives no photo. Only words. “Paint a red bicycle near a tea stall in rain.” Now the painter must imagine. The canvas becomes central. The painter must turn text into a visual plan. Then turn that plan into a picture. This is image generation. Historically, people tried several ways. Some used rival networks. Those were GANs. Some compressed and rebuilt images. Those were VAEs. Then diffusion became dominant. Do not worry. This module only introduces the landscape. Module 14 will go deep on diffusion.

Step 4 — learning to paint moving scenes

Finally, the teacher raises the difficulty again. “Paint a child running through puddles for six seconds.” Now a single picture is not enough. Now the painter must keep motion consistent. The shoes should stay the same shoes. The child should not gain a third arm midway. The puddle should splash naturally over time. Here the frame tape matters. A video is not just one image. A video is many images tied together through time. That tying-together is hard. Very hard. This is why video models are expensive and fragile.

The whole ladder in one picture

Recognition   →   Description             →   Generation    →   Video
↓                 ↓                           ↓                 ↓
"cat"             "orange cat sleeping"       "paint this"      "make it move"
eye               eye + translator            canvas            canvas + frame tape

Each skill builds on the previous one. First, the system must recognize visual structure. Then it must link structure to language. Then it must create new structure. Then it must keep that structure stable over time. That is the ladder for this module.

Why the patch matters so much

You may ask, “Why are we talking about little squares?” Because transformers need tokens. Words already come chopped into tokens. Images do not. So we cut images into patches. Each patch is like one little visual word. No patch alone tells the whole story. But many patches together form the image sentence. The eye reads those patch tokens. Then the eye builds a richer representation. That representation becomes useful for search, answers, and generation.

Why the translator matters so much

Suppose the eye notices a burned capacitor. Good. But if the translator is weak, the language model may still answer badly. It may say, “The board appears normal.” So seeing and speaking are different skills. A model can be decent at one and weak at the other. This is one of the biggest practical lessons. Never assume a fluent answer means accurate seeing.

Why the canvas matters so much

Generation feels magical. Actually, it is learned structure plus a decoding process. The canvas is the space where the model can propose images. Good canvases preserve useful structure. Bad canvases create blurry or incoherent outputs. That is why latent spaces matter. A good latent space makes image creation easier. A bad latent space makes everything unstable.

Why the frame tape matters so much

A single perfect frame is easier. A hundred consistent frames are much harder. Video models must remember identity, geometry, motion, lighting, and timing. Across many frames. Under heavy compute limits. That is why text-to-video is still an active frontier.

ELI5 recap

The eye learns to see. The translator learns to explain. The canvas learns to paint. The patch is the small visual piece. The frame tape makes pictures move through time. Keep this analogy alive. Every chapter will map back to it.


Chapter 1: The burned capacitor failure

1.1 The first miss

You send an image to GPT-4V. You ask, “What’s wrong with this circuit board?” The model answers, “Looks fine.” But the board clearly has a burned capacitor. Any competent technician sees it immediately. So what happened? Did the model fail to see? Did it see but fail to describe? Did it over-trust common patterns? Did the image get downsampled too aggressively? All of these are possible. This is the correct opening question for this module. Not, “What shiny multimodal demo can I build?” But, “Why does the system fail on an obvious real-world case?”

That mindset is senior.

1.2 Why the model said “looks fine”

Let us unpack the failure carefully.

Reason 1 — the defect was too small relative to the full image

Many models first resize the image. Fine detail can shrink or blur away. If the capacitor occupies few effective pixels, trouble begins early. The eye may never get a strong signal.

Reason 2 — patching can dilute local evidence

Suppose one patch contains both normal board traces and a defect edge. That patch becomes one mixed token. If the defect is tiny, its evidence gets diluted. The eye now sees a weak clue, not a clear object.
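Reasons 1 and 2 can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a square photo, a resize to 224×224, and 16×16 patches, and the burn-mark size is a made-up number. Real systems use different resolutions and may tile the image.

```python
def effective_pixels(defect_px, image_side_px, model_side_px=224):
    """Side length of the defect, in pixels, after the whole image is resized."""
    return defect_px * model_side_px / image_side_px


def patch_share(defect_side_px, patch_px=16):
    """Fraction of one patch's area covered by a square defect (capped at 1)."""
    return min(1.0, (defect_side_px ** 2) / (patch_px ** 2))


# Hypothetical numbers: a 60-pixel burn mark on a 3000-pixel-wide square photo.
side = effective_pixels(60, 3000)
share = patch_share(side)
print(f"defect side after resize: {side:.2f} px")
print(f"share of one 16x16 patch: {share:.1%}")
```

Under these assumptions the burn mark shrinks to under five pixels across and fills less than a tenth of one patch. The rest of that patch is ordinary board.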

Reason 3 — training data may not cover this domain well

Web-scale image data contains many natural photos. It contains fewer annotated industrial defects. A burned capacitor is not a common consumer caption. So the model may lack strong domain grounding.

Reason 4 — the question demands diagnosis, not captioning

“Looks fine” is not merely a recognition failure. It is also a reasoning and diagnosis failure. The model must know what “wrong” means for circuit boards. That requires domain knowledge plus grounded inspection. Captioning skill alone is insufficient.

Reason 5 — language priors can overpower visual evidence

This is subtle and very important. The language side has learned that most boards in its training examples are normal. If the visual evidence is weak, the language model may default to a high-probability bland answer. So the answer sounds fluent. But the grounding is weak.

Reason 6 — no zoom, crop, or tool use

A careful human inspector zooms in. They compare suspicious components. They ask for another angle. A plain multimodal prompt may do none of that. One shot in, one answer out. That is often not enough.

Reason 7 — hallucination by omission

We think of hallucination as “seeing things that are not there.” There is another kind. Failing to mention what is clearly there. That is also dangerous. Especially in inspection, medical, legal, or safety settings.

1.3 Why this matters for Lead roles

Multimodal AI is not a side quest anymore. It is becoming a product layer. Search. Support. Inspection. Robotics. Retail. Design. Safety. Healthcare. Education. Media. Lead AI Engineers will increasingly be asked, “Can we use image or video models here?” If your mental model is shallow, you will over-promise. Then the demo works. Then production fails.

That pattern is common. A Lead role requires three upgrades. First, you must know what these systems can do. Second, you must know exactly where they break. Third, you must design workflows around those breaks. That is the real job. Not just calling an API.

1.4 The debugging lens

When a multimodal system fails, ask four questions.

Question A — did the eye fail?

Meaning: Did the vision encoder fail to represent the relevant visual signal? Tiny object. Low contrast. Weird domain. Poor crop. Bad resolution. If yes, the problem starts before language even enters.

Question B — did the translator fail?

Meaning: Did the bridge from vision to language lose or distort information? The eye may have a decent feature vector. But the language model receives only a compressed summary. Important local detail may disappear there.

Question C — did the language model reason badly?

Meaning: Given visual evidence, did it still draw the wrong conclusion? This happens when commonsense priors overwhelm visual grounding. Or when the prompt is too vague. Or when the task needs domain expertise.

Question D — did the workflow fail?

Meaning: Did we ask a one-shot system to do a multi-step job? Inspection often needs zoom, crop, compare, and re-check. A safer workflow may solve what a single prompt cannot.

This four-part lens is worth memorizing. Eye. Translator. Reasoner. Workflow. We will use it again in Chapter 6.

1.5 What a strong engineer would do next

A strong engineer does not stop at disappointment. They run controlled follow-ups. Try a crop focused on the suspicious component. Try a higher-resolution image. Ask a narrower question. Compare answers across models. Force grounded output. For example: “List five visible components. Then state which one appears damaged. Quote the exact visual evidence.” This prompt is better. Because it forces a visible-evidence chain. Still imperfect. But better. The deeper lesson is this. Multimodal models are not magical eyes. They are learned pipelines with bottlenecks.
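Going back to the crop follow-up: one simple way to automate it is to slide an overlapping window over the full-resolution photo and send each crop separately. This is a minimal sketch, not a recommended pipeline. The function name, tile size, and overlap are all illustrative assumptions.

```python
def tile_boxes(width, height, tile, overlap):
    """Return (left, top, right, bottom) boxes that cover the image with overlapping tiles."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    stride = tile - overlap

    def starts(length):
        # Step across one axis; add a final tile flush with the far edge if needed.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile + 1, stride))
        if positions[-1] != length - tile:
            positions.append(length - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Hypothetical board photo of 2000x1500 pixels, cut into 512-pixel crops.
boxes = tile_boxes(width=2000, height=1500, tile=512, overlap=64)
print(len(boxes), "crops; first:", boxes[0], "last:", boxes[-1])
```

The overlap matters. Without it, a defect sitting on a tile border could be split across two crops and look weak in both.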

We now study those bottlenecks.


Chapter 2: Vision encoders — teaching the eye to see

2.1 Pixels are not meanings yet

A raw image is a grid. Nothing more, initially. For a color image, each location has numbers for red, green, and blue. Useful for storage. Not yet useful for semantics. A model must transform pixels into features. Features are patterns the system can use. Edges. Corners. Textures. Shapes. Parts. Objects. Relations. This transformation job belongs to the vision encoder. Historically, CNNs dominated this role. Then vision transformers changed the story.

For this module, the key idea is not historical loyalty. The key idea is representation. How do we convert image structure into tokens and features?

2.2 ViT — how patches become tokens

ViT means Vision Transformer. The clever move was surprisingly simple. Take the transformer idea from text. Then ask, “What is the image equivalent of a word token?” Answer: Use a patch. A patch is a small square crop. For example, 16×16 pixels. Now let us do one concrete case. Suppose the input image is 224×224. Suppose patch size is 16×16. How many patches? Along width: 224 / 16 = 14. Along height: 224 / 16 = 14. Total patches: 14 × 14 = 196. So one image becomes 196 visual pieces. That is the first conversion.

ASCII patch grid

+----+----+----+----+
| p1 | p2 | p3 | p4 |
+----+----+----+----+
| p5 | p6 | p7 | p8 |
+----+----+----+----+
| p9 |p10 |p11 |p12 |
+----+----+----+----+
|p13 |p14 |p15 |p16 |
+----+----+----+----+

Real ViT uses many more patches.
This tiny grid only shows the idea.

Each patch is flattened into a vector. Then a learned linear layer projects it into embedding space. Now each patch has a dense representation. This is analogous to token embeddings in NLP. Then we add positional information. Why? Because p1 and p16 are not interchangeable. Location matters in vision. A wheel at the bottom matters differently than a sun at the top. So patch embedding plus position embedding gives us an ordered sequence. Then standard transformer blocks process the sequence. Self-attention lets one patch look at another patch. A capacitor patch can attend to neighboring solder patches. A face patch can attend to eye and mouth patches. That global interaction is powerful.

Slightly more formal pipeline

Image
→ split into fixed patches
→ flatten patches
→ linear projection
→ add position embeddings
→ transformer encoder blocks
→ pooled output or patchwise features

That is ViT in one breath.
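The first four arrows of that pipeline fit in a few lines of plain Python. Treat this as a toy sketch under loud assumptions: the image is a nested list, the sizes are tiny, and random numbers stand in for the learned projection and position embeddings. The transformer blocks themselves are left out.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors, row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append this pixel's channel values
            patches.append(vec)
    return patches


def embed(patches, weight, pos):
    """Linear projection of each patch, plus that patch's position embedding."""
    tokens = []
    for i, vec in enumerate(patches):
        projected = [sum(v * w for v, w in zip(vec, row)) for row in weight]
        tokens.append([p + q for p, q in zip(projected, pos[i])])
    return tokens


rng = random.Random(0)
SIDE, PATCH, CHANNELS, DIM = 32, 16, 3, 8  # toy sizes, not real ViT sizes
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(SIDE)] for _ in range(SIDE)]

patches = patchify(image, PATCH)
# Random stand-ins for learned parameters: weight is DIM x (PATCH*PATCH*CHANNELS).
weight = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH * CHANNELS)] for _ in range(DIM)]
pos = [[rng.gauss(0, 0.02) for _ in range(DIM)] for _ in patches]

tokens = embed(patches, weight, pos)
print(len(tokens), "tokens, each of width", len(tokens[0]))
print("patches for a 224x224 image with 16x16 patches:", (224 // 16) ** 2)
```

The 32×32 toy image gives 4 patches. The same arithmetic on 224×224 gives the 196 from section 2.2.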

2.3 Why patching works

At first, patching sounds crude. Why should chopping an image into squares help? Because transformers are good at reasoning over token sequences. The patch trick turns an image into exactly such a sequence. There are trade-offs, of course.

Strength 1 — global attention

CNNs build in locality from the start. ViTs can connect distant regions more directly. A patch in the top-left can attend to one in the bottom-right from the very first layer. That helps with long-range structure.

Strength 2 — scaling behavior

Transformers have shown excellent scaling patterns in language. ViTs benefited from similar scaling logic. Given enough data and compute, they became very strong.

Strength 3 — modular reuse

Once everything is token-like, it becomes easier to connect vision with language models. That matters enormously for VLMs.

Weakness 1 — data hunger

Early ViTs needed large datasets or strong pretraining. CNNs could be more efficient on smaller labeled datasets.

Weakness 2 — local detail can blur in coarse patching

If the patch size is large, a tiny defect may get swallowed. This links directly back to our burned capacitor story. A defect smaller than the patch granularity may weaken the signal.

Weakness 3 — computation grows with token count

More patches mean finer detail. But more patches also mean more attention cost. So there is always a resolution-versus-compute trade-off.
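The trade-off is easy to put numbers on. This sketch counts patch tokens and the pairwise scores that full self-attention computes per head per layer. It assumes a 224×224 input and ignores extras such as a class token.

```python
def num_patches(image_side, patch_side):
    """Number of non-overlapping square patches in a square image."""
    return (image_side // patch_side) ** 2


def attention_pairs(tokens):
    """Full self-attention scores every token against every token."""
    return tokens * tokens


for patch in (32, 16, 8):
    n = num_patches(224, patch)
    print(f"patch {patch:>2}px -> {n:>4} tokens -> {attention_pairs(n):>7} attention scores")
```

Halving the patch side quadruples the token count and multiplies the attention scores by sixteen. Finer detail is never free.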

2.4 CLIP — aligning images and language

Now we add a second big idea. ViT teaches the eye to encode images. CLIP teaches the eye to agree with language. CLIP stands for Contrastive Language-Image Pre-training. The setup is elegant. Take many image-caption pairs. Feed images to an image encoder. Feed captions to a text encoder. Learn embeddings such that matched pairs sit close together. Mismatched pairs should sit farther apart. That is the whole spirit.

ASCII CLIP training picture

      image i1  ──► image encoder ──► v1 ──┐
      text  t1  ──►  text encoder ──► u1 ──┼─ maximize similarity
      image i2  ──► image encoder ──► v2 ──┤
      text  t2  ──►  text encoder ──► u2 ──┘

Matched pairs:    (v1,u1), (v2,u2) should be close.
Mismatched pairs: (v1,u2), (v2,u1) should be far.

In practice, training uses batches. Every image competes against many text candidates. Every text competes against many image candidates. The model learns a shared semantic space. This shared space is extremely useful.
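The batch competition can be written as a symmetric contrastive loss. The sketch below is a plain-Python illustration of that idea, not CLIP's actual training code. The two-dimensional embeddings are invented, and the fixed temperature of 0.07 is an assumption for the demo.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over one batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(u) for u in text_embs]
    logits = [[sum(a * b for a, b in zip(v, u)) / temperature for u in txts] for v in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])  # captions match their images
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # captions are crossed
print(f"aligned batch loss: {aligned:.4f}")
print(f"swapped batch loss: {swapped:.4f}")
```

The loss is near zero when each image sits next to its own caption and large when the captions are crossed. Training pushes the encoders toward the first situation.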

Why CLIP mattered so much

Before CLIP, zero-shot image classification was much weaker. After CLIP, you could classify an image without task-specific retraining. You just compare the image embedding against text label prompts. For example: “a photo of a cat”, “a photo of a dog”, “a photo of a circuit board with damage”. Whichever text embedding sits closest wins. This is conceptually beautiful. And very practical.
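Here is the comparison step on its own. The three-number vectors are invented stand-ins for what a real image encoder and text encoder would return. Only the "closest text wins" logic is the point.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot(image_emb, label_embs):
    """Return the best label prompt plus the similarity score of every prompt."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores


# Hypothetical embeddings; a real system would get these from the two encoders.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a circuit board with damage": [0.0, 0.1, 0.9],
}
image_emb = [0.1, 0.2, 0.8]

best, scores = zero_shot(image_emb, label_embs)
print("prediction:", best)
for label, score in scores.items():
    print(f"  {score:.3f}  {label}")
```

No classifier was trained here. Changing the task means changing the label prompts, nothing more.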

Retrieval use case

Suppose you embed one million images. Now a user types, “red vintage scooter near rain-wet street.” Encode the query. Find nearest image embeddings. Done. That is multimodal retrieval.
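A brute-force version of that lookup is short. This sketch assumes the image embeddings were computed ahead of time, and the file names and vectors are made up. At one million images you would normally swap the linear scan for an approximate nearest-neighbor index.

```python
import heapq
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(embeddings):
    """Normalize once at indexing time so search only needs dot products."""
    return {image_id: unit(vec) for image_id, vec in embeddings.items()}


def search(index, query_emb, k=2):
    """Return the k (score, image_id) pairs most similar to the query embedding."""
    q = unit(query_emb)
    scored = ((sum(a * b for a, b in zip(q, vec)), image_id) for image_id, vec in index.items())
    return heapq.nlargest(k, scored)


# Hypothetical image embeddings and a hypothetical embedding of the user's text query.
index = build_index({
    "scooter_rain.jpg": [0.9, 0.4, 0.1],
    "cat_sofa.jpg": [0.1, 0.2, 0.9],
    "bike_street.jpg": [0.7, 0.6, 0.2],
})
hits = search(index, query_emb=[1.0, 0.5, 0.0], k=2)
for score, image_id in hits:
    print(f"{score:.3f}  {image_id}")
```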

Why CLIP became foundational for later systems

CLIP does not generate prose answers by itself. But it learns strong aligned vision representations. Those representations became valuable building blocks. Image search. Reranking. Filtering. Conditioning. VLM front-ends. Diffusion text-image alignment. You will see CLIP again in Module 14.

2.5 How images become token sequences

This section is small but critical. People often say, “The image becomes tokens.” That sentence hides two separate stages.

Stage A — image into patch tokens for the eye

The raw image is split into patches. These become patch embeddings. The vision encoder processes them. This is the eye’s tokenization.

Stage B — visual features into language-side tokens

After the eye encodes the image, we often have a feature grid or pooled representation. That output may be compressed, resampled, projected, or queried. Then it becomes something the language model can attend to. This is not identical to raw patching. It is a second interface step. The translator usually lives here. If you blur these two stages, your architectural understanding remains fuzzy. Keep them separate.

One clean mental picture

Raw image
→ patch tokens for vision encoder
→ visual features
→ projected visual tokens for language model
→ textual answer

That is the full chain.

2.6 Where the eye still fails

Vision encoders are strong. They are not perfect. Here are failure categories worth remembering.

Tiny-object failure

Small signals vanish under resizing or coarse patching. Burn marks. Hairline cracks. Small text. Remote objects.

Domain-gap failure

Industrial images differ from web photos. Medical scans differ from product photos. If training data mismatches deployment data, representation quality falls.

Occlusion failure

If relevant structure is hidden, the encoder sees incomplete evidence. Then downstream reasoning becomes shaky.

Counting and exact geometry failure

Vision encoders often capture gist well. Exact counts and spatial relations remain harder. “Three screws” versus “four screws.” “Left of” versus “slightly behind.” Humans underestimate how brittle this can be.

Text-in-image failure

OCR-like tasks demand sharp local detail. Tiny rotated text is still annoying. That matters in documents, packaging, and dashboards.

Attention-distraction failure

Busy scenes contain many plausible objects. The encoder may represent salient but irrelevant regions strongly. Then the model answers the wrong question confidently. This is why prompting and cropping matter.

The eye is not isolated. It is the first stage of the whole system. Next we study how the eye gets connected to language.


Chapter 3: Vision-Language Models — teaching the translator to speak

3.1 From seeing to answering

A good vision encoder can represent an image. A good LLM can generate language. A VLM joins those two powers. That sounds simple. In practice, the join is the interesting part. Because the eye and the language model were usually trained differently. They use different embedding spaces. They encode different biases. They may even expect different sequence structures. So we need a bridge. That bridge is the translator.

3.2 Generic VLM architecture

Here is the common blueprint.

             +-------------------+
Image ─────► |  Vision encoder   | ── visual features ──┐
             +-------------------+                      │
                                              +-------------------+
                                              | Projection /      |
                                              | adapter / bridge  |
                                              +-------------------+
                                                        │ visual tokens
                                                        ▼
                                              +-------------------+
Text prompt ────────────────────────────────► |        LLM        |
                                              +-------------------+
                                                        │
                                                        ▼
                                                   Text answer

Sometimes the bridge is a simple linear layer. Sometimes it is an MLP. Sometimes it is a Q-Former or resampler. But conceptually, the same job remains. Take visual features. Convert them into language-usable tokens.
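The simplest bridge, a single linear layer, looks like this in miniature. Everything here is a toy assumption: the dimensions are tiny, the weights are random rather than trained, and placing the visual tokens before the text tokens is just one common ordering.

```python
import random


def linear(features, weight, bias):
    """Apply y = W x + b to each feature vector; weight is out_dim x in_dim."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 4  # toy widths; real models use hundreds or thousands

# Stand-ins for the eye's output (3 visual features) and the prompt's token embeddings (5 tokens).
visual_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

# The bridge: random here, learned in a real system.
weight = [[rng.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

visual_tokens = linear(visual_features, weight, bias)
llm_input = visual_tokens + text_embeddings  # one sequence the LLM can attend over
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

After projection, visual tokens and text tokens have the same width. From the LLM's point of view they are just positions in one sequence.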

Why raw image vectors are not enough

You cannot just dump raw pixels into an LLM. The sequence would be huge and meaningless. The language model also was not trained on pixel statistics. It expects compact embeddings with semantic structure. So the eye must work first. Then the translator compresses and aligns. Then the LLM reasons and speaks.

A senior-level distinction

The vision encoder is not “the multimodal model.” The LLM is not “the multimodal model” either. The multimodal system is the entire chain. Eye. Translator. LLM. Prompt. Workflow. Keep that systems view.

3.3 The LLaVA recipe

LLaVA became a famous open recipe. Why did it matter? Because it showed a workable pattern for building VLMs cheaply. Here is the broad idea. Take a pretrained vision encoder. Take a pretrained LLM. Train a connector between them. Then instruction-tune the joint system. That is it in plain English.

Stage 1 — learn the connector

At first, you mainly want the two modules to communicate. So you use image-caption style data. The connector learns how visual features should enter language space.

Stage 2 — instruction tuning

Now train on image-instruction-response data. Examples: “What is the person holding?” “Describe the scene.” “Read the text on the sign.” “Which object seems damaged?” This stage teaches conversation behavior. Not just alignment. The model becomes useful for chat-like multimodal tasks.

Why this is engineering-friendly

Training everything from scratch is expensive. Reusing pretrained parts is practical. This mirrors many real AI systems. We combine strong modules. Then fine-tune the interfaces and behaviors that matter.

3.4 Frontier multimodal systems

Now think of systems like GPT-4V, Claude, and Gemini. Public details differ. Exact architectures are partially private. But the broad pattern is familiar. A strong vision front-end. A bridge or fusion mechanism. A powerful language model. Multimodal post-training and safety layers.

What these systems do well

  • general image description

  • chart and document understanding

  • question answering over visible content

  • broad OCR

  • screenshot and UI understanding

  • multimodal reasoning on common scenes

What they still do inconsistently

  • exact counts in cluttered scenes

  • small hidden defects

  • left-right and spatial precision

  • specialist diagnosis without domain tuning

  • fine-grained measurement from images

  • long videos with consistent reasoning

This dual truth matters. The capabilities are impressive. The weaknesses are also real.

3.5 Training process and failure points

Let us simplify the training stack.

Step 1 — pretrain the eye

Use supervised or contrastive objectives. Learn strong visual features. CLIP-style pretraining is especially common and useful.

Step 2 — align vision outputs to language space

Train the bridge. Maybe freeze most of the eye and LLM initially. This keeps costs manageable.

Step 3 — multimodal instruction tuning

Give the joint model paired examples. Image plus question. Image plus dialogue. Image plus desired answer. Now the model learns usable interaction behavior.

Step 4 — safety and preference tuning

Reduce obvious harmful outputs. Improve helpfulness. Improve refusal behavior. Possibly improve groundedness through curated data.

That is the broad training story. Now the failure points.

Failure point A — weak visual grounding

The answer sounds smooth. But the answer is not tightly linked to the pixels. This is common.

Failure point B — bridge bottleneck

The visual encoder may output rich detail. The bridge passes only a compressed subset. Important local cues may vanish.

Failure point C — language prior override

The model answers what is statistically plausible. Not what is visually present. This causes confident nonsense.

Failure point D — poor instruction data for rare tasks

If the model rarely saw industrial inspection prompts, it may generalize badly there.

Failure point E — evaluation mismatch

Benchmarks may reward broad correctness. Production may demand exactness. “Mostly right” is unacceptable in diagnosis. This is where many teams get trapped.
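A tiny made-up example shows how this trap looks in numbers. Assume 100 boards, 5 of them defective, and a model that always answers “looks fine.” The data is invented for illustration.

```python
def accuracy(labels, preds):
    """Fraction of all items predicted correctly."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def recall(labels, preds, positive="defect"):
    """Fraction of truly positive items that the model actually flagged."""
    positives = [p for l, p in zip(labels, preds) if l == positive]
    if not positives:
        return 0.0
    return sum(p == positive for p in positives) / len(positives)


# Hypothetical inspection set: 95 good boards, 5 defective ones.
labels = ["ok"] * 95 + ["defect"] * 5
preds = ["ok"] * 100  # the model says "looks fine" every time

print(f"accuracy:      {accuracy(labels, preds):.0%}")
print(f"defect recall: {recall(labels, preds):.0%}")
```

Headline accuracy looks strong. Every defect was missed. Pick the metric that matches what production actually needs.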

3.6 Retrieval prompts for memory

Use these retrieval prompts to test whether the ideas stuck.

Prompt 1

“Explain the burned-capacitor failure using the eye, translator, reasoner, and workflow lens. Give one mitigation for each layer.”

Prompt 2

“From memory, draw the generic VLM architecture and label where information can be lost.”

Prompt 3

“Contrast CLIP and LLaVA in one paragraph. Which one aligns embeddings? Which one answers questions?”

Prompt 4

“Give one example where the eye succeeds but the translator fails. Give one example where the translator succeeds but the eye fails.”

If you can answer these cleanly, you understand the chapter well.


Chapter 4: Image generation — teaching the canvas to paint

4.1 GANs — the duel

Before diffusion dominated, GANs were the glamour models of image generation. GAN means Generative Adversarial Network. Two networks compete. A generator creates fake images. A discriminator tries to detect fakes. The generator improves by fooling the discriminator. The discriminator improves by catching it. This duel can create beautiful samples. It also creates training headaches.

Why GANs felt exciting

Generation is fast once trained. Results could look sharp and realistic. StyleGAN especially pushed image quality impressively.

Why GANs were painful

Training instability. Mode collapse. Sensitive balancing between the two networks. Difficult objective dynamics. So GANs mattered historically. But many teams found them hard to control and scale.

4.2 VAEs — the compress-and-rebuild story

VAE means Variational Autoencoder. A VAE has two main parts. An encoder compresses the image into a latent code. Strictly, it outputs a distribution over codes, and one code is sampled from it. A decoder reconstructs the image from that latent. Why is this useful? Because it teaches a compressed representation space. That is the latent space idea. Nearby latent points often represent similar images. Smooth movement in latent space gives controlled variation.

The practical intuition

A VAE says, “Do not store every pixel directly. Store a compressed meaning-like code. Then rebuild the image from that code.” That code is the canvas-friendly representation.

The trade-off

VAEs are elegant and useful. But classic VAE outputs could look blurry. They optimized reconstruction plus a distributional regularizer. Not always crisp realism. Still, the concept is vital. Latent space will matter heavily in diffusion. So do not skip this intuition.

4.3 Diffusion — the dominant modern path

We will keep this brief here. Because Module 14 owns the deep dive. But you need the big picture now. A diffusion model learns to reverse noise. Start with a real image. Gradually add noise. Eventually the image becomes mostly noise. Then train a model to undo that process. At generation time, start from noise. Remove noise step by step. A coherent image emerges. That is the core story.
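The “gradually add noise” half can be sketched now, with the details left to Module 14. The code below assumes one common formulation: a DDPM-style linear noise schedule with 1000 steps, where the schedule values are typical defaults and not something this module prescribes. The four-number “image” is a toy.

```python
import math
import random


def alpha_bars(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative fraction of signal kept after each step (assumes steps >= 2)."""
    bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noisy version: sqrt(a) * signal + sqrt(1 - a) * Gaussian noise."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


rng = random.Random(0)
bars = alpha_bars(1000)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy stand-in for pixel values

for t in (0, 499, 999):
    noisy = add_noise(x0, bars[t], rng)
    print(f"step {t:>3}: signal weight {math.sqrt(bars[t]):.3f}  sample {[round(v, 2) for v in noisy]}")
```

Early steps barely touch the image. By the last step almost no signal is left. The model's job is to learn the way back.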

Why diffusion won

Excellent sample quality. Stable training compared with GANs. Flexible conditioning with text and other controls. Strong controllability and ecosystem growth. So when people say modern text-to-image, they usually mean diffusion-style systems.

Why you still care about GANs and VAEs

Because they teach the conceptual landscape. GANs teach adversarial generation. VAEs teach latent compression. Diffusion teaches iterative denoising. All three sharpen your generative intuition.

4.4 Current SOTA map

Treat this as a moving landscape, not eternal truth.

Closed or hosted leaders

  • DALL-E 3

  • Midjourney

  • OpenAI image generation in GPT-4o-style stacks

  • Google Imagen family

Open or semi-open strong families

  • Stable Diffusion 3 / 3.5 lineage

  • Flux family

  • other rapidly changing open ecosystems

Why this matters strategically

As an engineer, you rarely pick “best model” in the abstract. You pick based on trade-offs. Quality. Latency. Cost. License. Control. Deployment style. Safety needs. Customization. That same engineering mindset from LLM selection applies here too.

4.5 The minimum concepts Module 14 assumes

This is the handoff section. Module 14 will assume four ideas from this chapter and earlier chapters.

Assumption 1 — how images become tokens

If you cannot explain patching, you will struggle with modern multimodal pipelines.

Assumption 2 — what a vision encoder does

You need a stable picture of the eye. The eye turns pixels into useful features.

Assumption 3 — what latent space means

You need a stable picture of the canvas. Compressed representations make generation efficient and structured.

Assumption 4 — what noise versus signal means

Diffusion literally lives on this distinction. Signal is meaningful image structure. Noise is random corruption hiding that structure. If this feels vague, fix it now. We will formalize it again in Chapter 6.


Chapter 5: Video models — teaching the canvas to move

5.1 From image to frame tape

An image model creates one frame. A video model must create many linked frames. That linking is the frame tape idea. Imagine sticking image patches onto a strip of time. Now each patch also belongs to a frame index. So the model must reason in space and time. A person in frame one should still be the same person later. A car turning left should continue the motion sensibly. Light should not flicker randomly unless the scene justifies it. Very simple to say. Very hard to do.

5.2 Video tokenization and temporal attention

There are many architectural variants. But the core idea is understandable.

Option A — treat each frame like an image, then connect frames

Encode frame one. Encode frame two. And so on. Then use temporal modules to connect them.

Option B — use spatiotemporal patches directly

Instead of 2D patches, use 3D chunks. Width. Height. Time. Now one token may represent a little cube of video. That cube carries both appearance and motion clues.
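A small sketch of the cube idea, under assumed toy sizes (16 frames, 224x224 pixels, cubes of 2 frames by 16x16 pixels). The helper names `tubelet_grid` and `token_of` are hypothetical. Real systems usually cut cubes in a learned latent space, not raw pixels.

```python
def tubelet_grid(frames, height, width, pt, ph, pw):
    """Return how many cubes fit along time, height, and width."""
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the cube size")
    return frames // pt, height // ph, width // pw

def token_of(t, y, x, grid, pt, ph, pw):
    """Map a pixel location (frame t, row y, column x) to its token index."""
    _, nh, nw = grid
    return (t // pt) * nh * nw + (y // ph) * nw + (x // pw)

grid = tubelet_grid(frames=16, height=224, width=224, pt=2, ph=16, pw=16)
print(grid)                          # (8, 14, 14)
print(grid[0] * grid[1] * grid[2])   # 1568 tokens for the whole clip

# Frames 0 and 1 at nearby pixels land in the same cube, so one token
# sees both appearance and a little motion.
print(token_of(0, 0, 0, grid, 2, 16, 16))   # 0
print(token_of(1, 5, 9, grid, 2, 16, 16))   # 0
print(token_of(2, 0, 16, grid, 2, 16, 16))  # 197
```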

What temporal attention does

Temporal attention lets one frame attend to other frames. A face now can connect with the same face later. A moving ball can connect across time. An object identity can persist. Without some mechanism like this, videos become flickery nonsense.

Tiny ASCII picture

Frame 1: [p1][p2][p3]
Frame 2: [p1][p2][p3]
Frame 3: [p1][p2][p3]

Time links:
[p1]─[p1]─[p1]
[p2]─[p2]─[p2]
[p3]─[p3]─[p3]

Real systems are much richer. But this shows the frame tape idea.
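The time links in the picture can be sketched as attention along one row of that diagram. This is a minimal sketch, assuming single-head scaled dot-product attention with no learned query, key, or value projections. Real models add those projections and many heads. `temporal_attention` is an illustrative name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(track):
    """Attend across time for ONE patch position.

    track: list over frames, each entry the feature vector of the same
    patch position in that frame. Returns one mixed vector per frame.
    """
    dim = len(track[0])
    mixed = []
    for query in track:
        # Similarity of this frame's patch to the same patch in every frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in track]
        weights = softmax(scores)
        # Weighted average of the patch features over all frames.
        mixed.append([sum(weights[f] * track[f][d] for f in range(len(track)))
                      for d in range(dim)])
    return mixed

# The [p1] track over three frames: similar in frames 1 and 2, different in frame 3.
p1_track = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for row in temporal_attention(p1_track):
    print([round(v, 3) for v in row])
```

Each output vector is a blend of the same patch across frames, weighted toward frames that look alike. That blending is what lets identity persist.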

5.3 Text-to-video systems now

Systems like Sora, Runway, Veo, and others drew wide attention to this area. They showed that high-quality short videos are possible. They also revealed how hard consistency remains.

What users want

  • faithful prompt following

  • realistic motion

  • stable identity

  • editable scenes

  • long duration

  • controllable camera movement

What makes this difficult

Each extra second adds dozens of frames of state. Every frame must agree with previous frames. Physics errors become visible quickly. Object transformations must stay coherent. Hands, reflections, shadows, and contact points all matter. Humans notice motion mistakes sharply.

5.4 Why video is brutally expensive

This section is interview gold. You should answer it crisply.

Reason 1 — token count explodes

One image already contains many tokens. A video contains many images. Total tokens are roughly tokens per frame times number of frames, unless the model compresses across time. Memory and compute climb with that total.

Reason 2 — attention cost grows badly

Full self-attention compares every token with every other token, so its cost grows quadratically with the number of space-time tokens. Even optimized variants struggle at long durations or high resolution.
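Reasons 1 and 2 are easy to check with arithmetic. This is a back-of-envelope sketch, assuming 16-pixel patches on 224x224 frames, one token set per frame, no temporal compression, and full self-attention. All sizes are illustrative.

```python
def tokens_per_frame(height, width, patch):
    """Number of non-overlapping square patches in one frame."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Full self-attention scores every token against every token."""
    return n_tokens * n_tokens

per_frame = tokens_per_frame(224, 224, 16)
video = per_frame * 24 * 5          # 5 seconds at 24 frames per second

print(per_frame)                    # 196
print(video)                        # 23520
print(attention_pairs(video) // attention_pairs(per_frame))  # 14400
```

The clip has 120 times the tokens of one image, but 14,400 times the attention pairs. That gap is why production systems lean on latent compression and restricted attention patterns.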

Reason 3 — error accumulation

A small inconsistency in one frame can amplify later. Identity drift is common. Background morphing is common. Physics weirdness is common.

Reason 4 — evaluation is harder

A nice single image is easy to judge visually. A full video must be judged for motion realism and temporal consistency too. That is a richer failure surface.

Reason 5 — editing and control remain hard

Users do not only want generation. They want revision. “Keep the same actor, but change the jacket.” “Keep the camera static, but increase rain.” Such editing demands tight control over latent structure through time. That is still hard.

5.5 Honest admission

Let us be very honest. This module should not make you overconfident.

Honest admission on image understanding

Vision models still hallucinate. They can describe plausible objects not actually present. They can omit visible defects. They can answer with confidence under weak evidence.

Honest admission on spatial reasoning

Left-right relations are still brittle. Counting is still surprisingly weak. Precise geometry from casual prompts is unreliable. Fine-grained grounding is not guaranteed.

Honest admission on video generation

Current video results can look astonishing. They can also fall apart under scrutiny. Identity drift. Melting objects. Inconsistent contact physics. Broken text rendering. Temporal flicker. Long-horizon story weakness.

Honest admission on production reality

A polished demo can hide these issues. Production traffic will expose them. Users upload blurry, rotated, dark, cluttered, domain-specific images. They ask ambiguous questions. They expect reliability anyway. That is why a Lead engineer needs skepticism. Respect the capability. Respect the failure modes more.


Chapter 6: Recap and application

6.1 Failure-fix chain

Every important concept in this module exists because something breaks. Use this table like a memory anchor.

| Failure | What breaks | Fix / idea | Placeholder |
|---|---|---|---|
| Raw pixels have no direct semantics | Model sees numbers, not objects | vision encoder learns features | the eye |
| Transformers need sequences, not grids | 2D image is not tokenized | patch tokenization / ViT | the patch |
| Image and text embeddings mismatch | language model cannot use visual features | projection / adapter / bridge | the translator |
| Web captions are weak for retrieval | text and image do not align well | CLIP contrastive training | eye + translator |
| Fluent answers ignore actual pixels | language prior overrides weak evidence | grounded prompts, better data, evaluation | translator + workflow |
| Tiny defects vanish in full images | detail diluted by resize or coarse patches | crop, zoom, higher resolution, task-specific data | the eye |
| One-shot captioning is not diagnosis | workflow lacks inspection steps | multi-step prompting and tools | workflow |
| GANs produce unstable training | generator and discriminator mismatch | diffusion-style denoising became dominant | the canvas |
| Raw image generation is expensive | pixel space too large | latent-space generation | the canvas |
| Video frames drift over time | no temporal consistency | temporal attention / spatiotemporal modeling | frame tape |
| Video compute explodes | too many space-time tokens | compression, latent video spaces, better samplers | frame tape + canvas |

Read the table slowly. Each row is a failure-to-method mapping. That is the senior way to remember techniques.

6.2 Key points to remember

Point 1

Vision begins with representation. Before language, before generation, before video, the system must convert pixels into useful features.

Point 2

Patch tokenization is the bridge from images to transformers. This is why ViT matters conceptually. It turns images into sequence-like inputs.

Point 3

CLIP taught the eye and language to share a space. That shared space unlocked retrieval and zero-shot behavior.

Point 4

A VLM is not just “an LLM with an image input.” It is a pipeline with a vision encoder and a bridge.

Point 5

Fluent multimodal output is not the same as grounded seeing. Never confuse language confidence with visual correctness.

Point 6

Generation families differ in training dynamics and representation assumptions. GANs duel. VAEs compress and rebuild. Diffusion denoises.

Point 7

Video adds time, not just more pixels. Time multiplies compute and multiplies consistency demands.

6.3 Important interview questions

Here are strong interview prompts. Answer them aloud.

Question 1

“How does a Vision Transformer turn an image into tokens?” Senior answer: Split into fixed patches, flatten, project to embeddings, add positional embeddings, run transformer blocks. Mention the resolution-compute trade-off too.
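The first two steps of that answer, split and flatten, fit in a few lines. This is a minimal sketch on a toy 4x4 grayscale grid with 2x2 patches. The learned projection and positional embeddings are left out on purpose. `patchify` is an illustrative name.

```python
def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch: read its rows left to right, top to bottom.
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens

# Toy image whose pixel values are 0..15 so the patches are easy to follow.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, 2)

print(len(tokens))   # 4
print(tokens[0])     # [0, 1, 4, 5]
print(tokens[3])     # [10, 11, 14, 15]
```

Halving the patch size quadruples the token count. That is the resolution-compute trade-off in one line.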

Question 2

“What exactly does CLIP optimize?” Senior answer: Matched image-text pairs get similar embeddings. Mismatched pairs are pushed apart. The result is a shared semantic embedding space.
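A sketch of that objective, assuming the symmetric cross-entropy over a cosine-similarity matrix described for CLIP. The temperature is fixed at 0.07 here for illustration, whereas CLIP learns it. The toy embeddings and the name `clip_style_loss` are assumptions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Pair k of images matches pair k of texts; every other pairing is a mismatch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick its own column.
    i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick its own row.
    t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return (i2t + t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)   # True
```

Training pushes the loss toward the aligned case. The shared embedding space is what is left behind.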

Question 3

“Why do VLMs need a projection layer?” Senior answer: Vision features and language token embeddings are not naturally aligned. The projection or adapter maps visual features into a space the LLM can consume.
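The shape mismatch is the easiest part to see. This is a minimal sketch, assuming a single linear projection with random weights standing in for learned ones. Toy sizes only. Real bridges may be an MLP or a more elaborate adapter, and all names here are hypothetical.

```python
import random

def make_projection(in_dim, out_dim, rng):
    """Random weights standing in for a learned projection matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(visual_tokens, weights):
    """Map each visual token from the vision width to the LLM width."""
    return [[sum(w * x for w, x in zip(row, token)) for row in weights]
            for token in visual_tokens]

rng = random.Random(0)
vision_dim, llm_dim = 4, 6   # toy widths; real models use far larger ones

# Three patch features as they might leave the vision encoder.
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(vision_dim)]
                 for _ in range(3)]
weights = make_projection(vision_dim, llm_dim, rng)
llm_ready = project(visual_tokens, weights)

print(len(llm_ready), len(llm_ready[0]))   # 3 6
```

Token count stays the same. Only the width changes, so the LLM can read visual tokens alongside text token embeddings.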

Question 4

“Why can a multimodal model miss an obvious defect?” Senior answer: Small object or low resolution, patch dilution, domain gap, weak grounding, language prior override, and missing inspection workflow.

Question 5

“GANs vs VAEs vs diffusion?” Senior answer: GANs generate via adversarial competition, VAEs via latent compression and reconstruction, diffusion via iterative denoising. Trade-offs are stability, fidelity, speed, and control.

Question 6

“Why is video generation harder than image generation?” Senior answer: Space-time token explosion, temporal consistency demands, higher attention cost, and harder evaluation.

Question 7

“What does this module give you before diffusion?” Senior answer: A foundation in image tokenization, vision encoders, latent-space intuition, and the signal-noise distinction. That is exactly the required handoff.

6.4 Production experience

This section is practical. It is not theory-only.

Production lesson 1 — crop before you trust

If the task depends on small details, crop likely regions first. Do not ask a giant full-scene question blindly.
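When you do not yet know the likely regions, one common fallback is overlapping tiles. That is a swapped-in technique, not something this module prescribes. The sketch below only computes crop boxes. The tile size, overlap, and the name `tile_boxes` are assumptions.

```python
def tile_boxes(width, height, tile, overlap):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighbouring tiles share `overlap` pixels so a small defect on a
    tile border still appears whole in at least one crop.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    step = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)   # last tile sits flush with the edge
        return positions

    return [(left, top, min(left + tile, width), min(top + tile, height))
            for top in starts(height)
            for left in starts(width)]

boxes = tile_boxes(width=2048, height=1024, tile=1024, overlap=128)
for box in boxes:
    print(box)
# (0, 0, 1024, 1024)
# (896, 0, 1920, 1024)
# (1024, 0, 2048, 1024)
```

Each box can be passed to whatever crop function your image library provides, then sent to the model as its own question.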

Production lesson 2 — separate retrieval from reasoning

CLIP-style retrieval can be strong. Free-form explanation may still be weak. Measure them separately. Do not merge scores lazily.

Production lesson 3 — log uncertainty and failure slices

Track where the system fails. Tiny text. Low light. Occlusion. Manufacturing defects. Motion blur. This is how models improve in practice.
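A sketch of slice tracking, assuming each logged record carries a list of slice tags and a correctness flag. The record layout and tag names are hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Compute accuracy per failure slice from logged evaluation records."""
    totals = defaultdict(lambda: [0, 0])   # tag -> [correct, seen]
    for record in records:
        for tag in record["slices"]:
            totals[tag][0] += int(record["correct"])
            totals[tag][1] += 1
    return {tag: correct / seen for tag, (correct, seen) in totals.items()}

records = [
    {"slices": ["low_light"], "correct": False},
    {"slices": ["low_light", "tiny_text"], "correct": True},
    {"slices": ["tiny_text"], "correct": False},
    {"slices": ["clean"], "correct": True},
]

report = accuracy_by_slice(records)
for tag in sorted(report, key=report.get):
    print(tag, report[tag])
```

An overall score would hide that the clean slice is fine while low light and tiny text are not. The per-slice view tells you where to collect data next.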

Production lesson 4 — compare workflows, not just models

A weaker model with crop-and-rerank may beat a stronger model one-shot. Workflow design is often the hidden multiplier.

Production lesson 5 — multimodal evals need visible evidence checks

Ask whether the answer cited actual visible clues. Not just whether the prose sounded reasonable. Grounding matters.

Production lesson 6 — humans remain essential in high-stakes settings

Inspection, medical, legal, and safety tasks need review loops. Multimodal assistance is useful. Blind automation is risky.

6.5 Foundation-gap audit for Module 14

This is the audit you must pass. If any answer feels vague, do not rush ahead.

Audit item 1 — how images become tokens

Can you explain patch tokenization cleanly? Can you compute patch counts for a simple example? Can you say why positional embeddings matter? If not, revisit chapter 2.2.

Audit item 2 — vision encoder concept

Can you explain what the eye does? Not implementation trivia. The conceptual job. Pixels in. Features out. Useful semantic representation formed. If not, revisit chapter 2.1 and 2.6.

Audit item 3 — latent space idea

Can you explain why compressed representation spaces help generation? Can you explain why nearby latent points often mean similar samples? If not, revisit chapter 4.2.

Audit item 4 — noise versus signal distinction

Can you describe signal as meaningful image structure? Can you describe noise as corruption hiding that structure? Can you see why denoising could become a generation strategy? If not, revisit chapter 4.3. This audit is not decorative. It is the bridge to diffusion.

6.6 Apply now — exercises and bridge

Exercise 1 — easy

Take any product photo. Describe how it becomes patch tokens. Then explain what the eye outputs conceptually.

Exercise 2 — easy

Explain CLIP to a PM in five sentences. Do not use the word “contrastive.”

Exercise 3 — medium

Design a multimodal support agent for e-commerce returns. Separate the eye, translator, and workflow explicitly. List three failure modes.

Exercise 4 — medium

Design an industrial inspection assistant for circuit boards. State what you would never automate fully. State what you would crop first. State what metric you would track weekly.

Exercise 5 — hard

Compare two architectures for the same image catalog: a pure CLIP retrieval pipeline and a VLM chat pipeline. Which parts should be trusted for retrieval? Which parts should be trusted for explanation? Why?

Exercise 6 — hard

Take a short generated video. List every temporal failure you observe. Identity drift. Lighting drift. Broken motion. Object melting. Camera inconsistency. This trains your eye for video evaluation.

Final bridge

You now have the minimum ladder. Images become tokens. The eye encodes them. The translator connects them to language. The canvas creates images from structured representations. The frame tape extends that creation through time. Next module — 02_diffusion_media_generation — dives deep into the dominant image generation technique: how adding and removing noise creates photorealistic images from text. That next module will feel much easier now. Because the painter already learned to see.