04. Vision-Language Models: where seeing meets speaking
12 min. Hold one picture in mind: a camera hands notes to a chatty student.
Built on the ELI5 in 00-eli5.md. The translator (the vision-language bridge) matters here because it lets visual features enter a language model cleanly.
1) Picture first: two specialists, one handoff
See. A VLM is usually two smart parts with a handshake. The first part is the eye. It looks at pixels and produces dense vectors. The second part is the language model. It predicts the next word, not the next pixel. Between them sits the translator. That bridge converts visual features into something the LLM can read. Simple, no?

If you skip the bridge, the stacks mishear each other. The image side speaks one geometry. The text side speaks another. Both use vectors, yes. But their coordinates, scales, and habits differ. So what to do? We add a learned adapter in the middle. That adapter can be tiny or fancy. But it must align the two spaces well.

A good mental model is this. The eye is a surveyor measuring a room. The translator rewrites those measurements as instructions. The LLM then answers in fluent language.

Look. The LLM never sees raw RGB values directly. It sees tokens or token-like embeddings after the bridge. That difference matters in interviews.

Here is the full pipeline picture.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ image pixels │ ──→│ patching     │ ──→│ vision       │ ──→│ projector or │
│ H×W×3        │    │ 14×14 grid   │    │ encoder      │    │ adapter      │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                                     │
                                     ▼
                            ┌──────────────────┐
                            │ visual tokens in │
                            │ LLM hidden space │
                            └──────────────────┘
                                     │
                                     ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ user prompt  │ ──→│ text tokens  │ ──→│ decoder-only │ ──→│ answer words │
│ "What?"      │    │ + visual     │    │ LLM          │    │ or actions   │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Read the top row first. The projector output becomes visual tokens in the LLM hidden space. Those visual tokens then sit alongside the text tokens in the bottom row and feed the decoder-only LLM.
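The first two boxes of the pipeline can be sketched in a few lines. This is a minimal NumPy sketch, assuming a 224×224 RGB image and 16×16-pixel patches (which gives the 14×14 grid in the diagram). `patchify` is a hypothetical helper name, and real vision encoders often fold this step into a strided convolution instead of an explicit reshape.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch                # grid height and width
    x = image.reshape(gh, patch, gw, patch, c)     # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)   # one row per patch

# Assumed sizes for illustration: 224x224 RGB image, 16x16-pixel patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image, patch=16)
print(patches.shape)  # (196, 768): a 14x14 grid, 16*16*3 raw values per patch
```

Those 768 raw values per patch are what the eye turns into dense vectors. The LLM never receives them directly.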
2) Why the bridge exists at all
Now what is the problem? The encoder outputs vectors shaped for vision pretraining. The LLM expects embeddings shaped for next-token prediction. Those are not automatically compatible. Sometimes even the width is different. Often the meaning axes are different too.

Suppose the vision encoder emits 1024 numbers per patch. Suppose the LLM hidden size is 4096. Even before meaning, dimensions already mismatch. A 1024-wide row cannot be dropped into a 4096-wide slot. That is the easy part.

The harder part is semantic mismatch. A visual direction like edge texture may not match a text direction like plural nouns. This is why the translator is learned, not hard-coded. It discovers a useful mapping during training. It learns what the LLM can exploit. It also learns ordering and compression choices. Yes, ordering matters. The encoder may keep 196 patch outputs. The LLM may only want 32 or 64 visual tokens. A resampler can shrink them. A Q-Former can query them. An MLP can reshape them.

One more interview trap sits here. Encoder-side tokens are not automatically LLM-side visual tokens. Please separate them clearly. The encoder may produce one token per patch. The bridge may compress, mix, or reweight them. So the LLM can receive fewer, denser visual tokens. Same image. Different token set after adaptation.

Think of it like this. A camera sensor produces many measurements. A briefing memo contains fewer, cleaner bullets. The memo is not the sensor itself. Likewise, LLM-side visual tokens are adapted summaries.
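The width mismatch is easy to see in code. This is a minimal NumPy sketch, assuming 196 patch outputs of width 1024, a 5-token text prompt (an arbitrary illustrative count), and an LLM hidden size of 4096. The zero-filled arrays and the random, untrained matrix `W` are placeholders, not real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)

vision_out = np.zeros((196, 1024), dtype=np.float32)   # encoder-side tokens
text_embeds = np.zeros((5, 4096), dtype=np.float32)    # LLM-side text embeddings

# Direct concatenation fails: the rows do not have the same width.
try:
    np.concatenate([vision_out, text_embeds], axis=0)
    direct_ok = True
except ValueError:
    direct_ok = False
print(direct_ok)  # False

# A placeholder bridge: one matrix from vision width to LLM width.
# In a real system W is learned; here it is random and untrained.
W = rng.normal(scale=0.02, size=(1024, 4096)).astype(np.float32)
visual_tokens = vision_out @ W                          # (196, 4096)

sequence = np.concatenate([visual_tokens, text_embeds], axis=0)
print(sequence.shape)  # (201, 4096)
```

Matching the width only solves the easy part. The semantic alignment comes from training the bridge, which this sketch does not do.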
3) Common bridge choices and why people pick them
Look at the design menu. There are four common bridge families. Each solves the same handoff with different cost and flexibility.
Linear projection
This is the simplest bridge. Take each vision vector and multiply by one matrix. Cheap. Fast. Easy to train. LLaVA-style systems often start here. If the vision encoder is already strong, this can work surprisingly well. The weakness is limited expressiveness. One matrix cannot do rich token selection.
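One way to see the "limited expressiveness" point: a linear projection treats every token independently, so it can neither drop nor merge tokens. This is a minimal NumPy sketch with toy sizes (6 tokens, width 8 projected to 16). All sizes and the random weights are assumptions for illustration, and `linear_bridge` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_bridge(E: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Project each row of E independently: (n, d_vis) -> (n, d_llm)."""
    return E @ W + b

n_tokens, d_vis, d_llm = 6, 8, 16          # toy sizes, not real model widths
E = rng.normal(size=(n_tokens, d_vis))     # stand-in encoder outputs
W = rng.normal(size=(d_vis, d_llm))        # stand-in for learned weights
b = np.zeros(d_llm)

V = linear_bridge(E, W, b)
print(V.shape)  # (6, 16): same token count in, same token count out

# Shuffling the input tokens just shuffles the output tokens the same way,
# because no token ever looks at another token.
perm = rng.permutation(n_tokens)
same = np.allclose(linear_bridge(E[perm], W, b), V[perm])
print(same)  # True
```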
MLP adapter
Now add one or two nonlinear layers. You get more capacity. The adapter can bend the space more. It can learn sharper feature mixes. Cost rises a bit. Training stays manageable. This is a practical middle road. Simple, no?
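A minimal NumPy sketch of such an adapter, assuming two linear layers with a GELU in between. The layer count, the tanh-based GELU approximation, and the toy widths are illustrative choices, not a specific model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_bridge(E, W1, b1, W2, b2):
    """Two-layer adapter: (n, d_vis) -> (n, d_llm), applied per token."""
    hidden = gelu(E @ W1 + b1)
    return hidden @ W2 + b2

n_tokens, d_vis, d_llm = 6, 8, 16            # toy sizes for illustration
E = rng.normal(size=(n_tokens, d_vis))       # stand-in encoder outputs
W1, b1 = rng.normal(size=(d_vis, d_llm)), np.zeros(d_llm)
W2, b2 = rng.normal(size=(d_llm, d_llm)), np.zeros(d_llm)

V = mlp_bridge(E, W1, b1, W2, b2)
print(V.shape)  # (6, 16)

# Unlike a single matrix, the adapter is not linear: scaling the input
# by 2 does not simply scale the output by 2.
is_linear = np.allclose(mlp_bridge(2 * E, W1, b1, W2, b2), 2 * V)
print(is_linear)  # False
```

The adapter still works token by token, so the token count is unchanged. Only the per-token mapping gets richer.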
Q-Former
Now the bridge becomes query-based. A small transformer sends learned queries into vision features. Those queries pull out the parts useful for language. This reduces token count smartly. It also separates perception from summarization. BLIP-2 made this pattern easy to remember. The tradeoff is more moving parts.
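The core move, learned queries attending over vision features, can be sketched with a single cross-attention step. This is a deliberately reduced NumPy sketch, not BLIP-2's actual Q-Former, which stacks transformer layers with more components. The query count of 32, the toy widths, and all random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, feats, Wq, Wk, Wv):
    """Single-head cross-attention: queries read from vision features."""
    Q = queries @ Wq                            # (n_q, d)
    K = feats @ Wk                              # (n_patches, d)
    V = feats @ Wv                              # (n_patches, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (n_q, n_patches)
    attn = softmax(scores, axis=-1)             # each query's weights over patches
    return attn @ V, attn                       # (n_q, d), (n_q, n_patches)

n_patches, n_queries, d_vis, d = 196, 32, 64, 48    # toy widths
feats = rng.normal(size=(n_patches, d_vis))         # stand-in encoder outputs
queries = rng.normal(size=(n_queries, d))           # learned in a real system
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d_vis, d)) / np.sqrt(d_vis)
Wv = rng.normal(size=(d_vis, d)) / np.sqrt(d_vis)

summary, attn = query_cross_attention(queries, feats, Wq, Wk, Wv)
print(summary.shape)  # (32, 48): 196 patch vectors summarized by 32 queries
print(attn.shape)     # (32, 196)
```

The output token count is set by the number of queries, not by the number of patches. A further projection to the LLM width would typically follow.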
Resampler
A resampler compresses many vision tokens into fewer latent tokens. Perceiver-style modules do this well. They are helpful when images are large. They are also helpful for video, where tokens explode. The LLM sees a fixed token budget. The resampler decides what survives.

So which one is best? There is no universal winner. If you want speed and a clean recipe, pick linear or MLP. If you need token compression with attention, pick Q-Former or resampler. If the LLM context is tight, compression matters more. If latency is tight, simpler bridges matter more.
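To see why the token budget matters for large images and video, it helps to count. This is a small sketch in plain Python, assuming 16×16-pixel patches, a fixed budget of 64 latent tokens, and an 8-frame clip. All numbers are illustrative, and `encoder_token_count` is a hypothetical helper.

```python
def encoder_token_count(height: int, width: int, patch: int, frames: int = 1) -> int:
    """Patch tokens the vision encoder emits, one per patch per frame."""
    return (height // patch) * (width // patch) * frames

LATENT_BUDGET = 64  # assumed fixed number of tokens a resampler hands to the LLM

cases = {
    "224x224 image": encoder_token_count(224, 224, 16),
    "448x448 image": encoder_token_count(448, 448, 16),
    "8 frames of 224x224": encoder_token_count(224, 224, 16, frames=8),
}

for name, n_tokens in cases.items():
    # A per-token bridge (linear or MLP) passes all n_tokens to the LLM;
    # a resampler or Q-Former passes LATENT_BUDGET regardless of input size.
    print(f"{name}: {n_tokens} encoder tokens -> {LATENT_BUDGET} after resampling")
```

With these assumed sizes the encoder emits 196, 784, and 1568 tokens respectively, while the compressed budget stays at 64.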
4) Worked example: 196 patch vectors enter a 4096-wide LLM
See the concrete path. Take an image of size 224×224. Use patch size 16×16.

patches per side = 224 ÷ 16 = 14
total patches = 14 × 14 = 196

So the eye outputs 196 patch embeddings. Assume the vision encoder width is 1024. So the encoder output matrix is:

E ∈ R^(196 × 1024)

Count the numbers inside E: 196 × 1024 = 200,704 values.

Now assume the LLM hidden size is 4096. Use a linear projection matrix:

W ∈ R^(1024 × 4096)

Count the weights inside W: 1024 × 4096 = 4,194,304 weights.

Add a bias too if you like:

b ∈ R^(4096)

Now multiply:

V = EW + b

Shapes: (196 × 1024) × (1024 × 4096) = (196 × 4096)

So the projected visual token matrix is:

V ∈ R^(196 × 4096)

Count the output numbers: 196 × 4096 = 802,816 values. That is the real dimension story.

Now let us show one tiny numeric row, fully worked. Pretend one encoder token is only 1×4. Pretend the projector maps 4 dimensions to 3.

x = [2, 1, 0, 3]

W_toy =
[  1, 0,  2
   2, 1, -1
   0, 3,  1
  -1, 2,  0 ]

Compute each output coordinate.

y1 = 2×1 + 1×2 + 0×0 + 3×(-1) = 2 + 2 + 0 - 3 = 1
y2 = 2×0 + 1×1 + 0×3 + 3×2 = 0 + 1 + 0 + 6 = 7
y3 = 2×2 + 1×(-1) + 0×1 + 3×0 = 4 - 1 + 0 + 0 = 3

So: y = [1, 7, 3]

Real projectors do the same idea. They just do it at much larger width. And they do it for every patch token row.

One more subtlety. The 196 encoder outputs are encoder-side tokens. After projection, they become LLM-side visual embeddings. If a resampler reduces 196 to 32, then the LLM sees 32 visual tokens. Do not say the LLM saw 196 patches directly. That answer is sloppy.
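The arithmetic in this worked example is easy to check mechanically. This is a minimal plain-Python sketch that recomputes the counts and the toy row. Variable names are illustrative.

```python
# Shape bookkeeping for the full-size example.
image_side, patch_side = 224, 16
d_vis, d_llm = 1024, 4096

patches_per_side = image_side // patch_side      # 14
n_patches = patches_per_side ** 2                # 196

values_in_E = n_patches * d_vis                  # 200,704
weights_in_W = d_vis * d_llm                     # 4,194,304
values_in_V = n_patches * d_llm                  # 802,816
print(n_patches, values_in_E, weights_in_W, values_in_V)

# Toy row: one 1x4 encoder token projected to 3 dimensions (no bias).
x = [2, 1, 0, 3]
W_toy = [
    [1, 0, 2],
    [2, 1, -1],
    [0, 3, 1],
    [-1, 2, 0],
]
# y[j] = sum over i of x[i] * W_toy[i][j]
y = [sum(x[i] * W_toy[i][j] for i in range(len(x))) for j in range(len(W_toy[0]))]
print(y)  # [1, 7, 3]
```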
Where this lives in the wild
- ChatGPT image upload uses a vision stack so GPT can answer about photos and screenshots.
- Claude 3 in claude.ai turns document pages and images into visual context before text generation.
- Google Gemini inside Workspace can read slides or screenshots because visual features enter the dialogue stream.
- Alibaba Qwen-VL chat demos use a connector so the language model can talk about image regions.
- Amazon Rufus product search uses image understanding signals to ground retail answers from uploaded photos.
Pause and recall
- Why can we not feed 1024-wide vision features straight into a 4096-wide LLM?
- What is the difference between encoder-side tokens and LLM-side visual tokens?
- When would you prefer a Q-Former over a plain linear projection?
- In the worked example, how many patch embeddings came from a 224×224 image?
Interview Q&A
Q1. Why use a learned bridge and not just copy vision features into the prompt?
A. Because the LLM expects embeddings in its own hidden space. Raw vision outputs may differ in width, scale, and semantics. The bridge learns a usable mapping and sometimes token compression.
Common wrong answer to avoid: "They are both vectors, so direct concatenation is fine."

Q2. Why might a resampler beat a linear layer on large images?
A. Large images create many encoder tokens. A resampler can compress them into a fixed token budget. That protects LLM context length and latency.
Common wrong answer to avoid: "Resamplers are always more accurate because they are deeper."

Q3. Why is saying 'the LLM reads patch tokens' often imprecise?
A. The encoder reads patches first. The bridge may mix, pool, or compress those outputs. So the LLM usually reads adapted visual tokens, not raw patch embeddings.
Common wrong answer to avoid: "Patch tokens and visual tokens are the same thing."

Q4. Why keep the vision encoder separate from the decoder-only LLM at all?
A. Vision encoders are optimized for spatial structure. Decoder-only LLMs are optimized for causal text generation. Specializing the parts often gives better reuse and cheaper training.
Common wrong answer to avoid: "Because nobody knows how to train one joint model."
Apply now (5 min)
Quick exercise: Take a 336×336 image with 14×14-pixel patches. Compute patches per side, total patch count, and the projector output shape for an LLM width of 4096. Say each step aloud.

Sketch from memory: Draw the pipeline from pixels to the eye, then the translator, then the LLM answer. Mark clearly where encoder-side tokens stop and LLM-side visual tokens begin.
Bridge. The architecture is clear. But how do we actually train this stack? LLaVA showed the simplest recipe. → 05-llava-and-frontier-vlms.md