05. LLaVA and Frontier VLMs — the simple recipe behind smart image assistants

14 min. Think of a strong camera expert sitting beside a fluent tutor.

Built on the ELI5 in 00-eli5.md. The translator (the vision-language bridge) is central here because LLaVA learns it, then teaches the full stack to follow image-grounded instructions.


1) Picture first: why LLaVA felt so clean

See. Many multimodal systems sound mysterious from outside. LLaVA is easier to picture. Take a pretrained vision encoder. Take a pretrained LLM. Insert a learned connector between them. Now train that connector first. Then tune the combined system on instruction data. Simple, no?

This works because both big parts already know something useful. The eye (the vision encoder) already knows visual structure. The LLM already knows grammar, world facts, and dialogue. The new work is mostly the handshake. That handshake is the translator. It learns how image features should enter the text stream.

The LLM still remains a next-token machine. It does not suddenly become a camera. It receives visual embeddings as if they were prefix context. Then it continues with ordinary autoregressive decoding. Look. That is why people say the LLM "sees" the image. Strictly speaking, it sees adapted embeddings, not raw pixels.

A practical benefit appears here. You do not need to train both towers from scratch. You reuse expensive pretrained models. You add a relatively small connector. Then you collect paired data and instruction examples. For many teams, this is the first workable path.
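The handshake can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not LLaVA's actual code: the function names `project` and `build_prefix`, the tiny widths, and the numbers are all made up for illustration. It only shows that projected visual vectors are placed in front of the text embeddings, and that the decoder conditions on the combined sequence.

```python
# Toy sketch: patch features are projected to the LLM width, then used as prefix context.
# All names and sizes here are hypothetical and chosen only for illustration.

def project(patch, weights):
    """Map one patch vector (length d_in) to LLM width using a d_in x d_out matrix."""
    d_out = len(weights[0])
    return [sum(patch[i] * weights[i][j] for i in range(len(patch))) for j in range(d_out)]

def build_prefix(patch_embeddings, weights, text_embeddings):
    """Return the sequence the decoder conditions on: visual positions, then text positions."""
    visual = [project(p, weights) for p in patch_embeddings]
    return visual + text_embeddings

# Two toy patches of width 3, an "LLM" of width 2, and one text token embedding.
patches = [[1, 0, 0], [0, 1, 0]]
connector = [[1, 2], [3, 4], [5, 6]]   # the only newly learned part in this sketch
text = [[9, 9]]

prefix = build_prefix(patches, connector, text)
print(len(prefix))   # 3 positions: 2 visual + 1 text
print(prefix)        # [[1, 2], [3, 4], [9, 9]]
```

Every position in `prefix` has the same width, which is the whole point: after the connector, the decoder cannot tell a visual position from a text position by shape alone.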

2) The two-stage LLaVA training recipe

Now what is the recipe? Stage one is alignment pretraining. Stage two is instruction tuning. The order matters.

In stage one, image-caption or image-text pairs teach the connector. The goal is not rich conversation yet. The goal is compatibility. Visual features must land in the LLM space cleanly. So the system learns basic cross-modal alignment first.

In stage two, the model sees instruction-style examples. A user asks something about an image. The assistant replies helpfully. Now the model learns format, refusal behavior, grounded answering, and multi-turn style. This stage teaches behavior on top of alignment.

Here is the picture.

┌──────────────────────┐
│ Stage 1: alignment   │
└──────────────────────┘
           │
           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ image        │ ──→│ vision       │ ──→│ connector    │
│ caption pair │    │ encoder      │    │ learns map   │
└──────────────┘    └──────────────┘    └──────────────┘
                                                │
                                                ▼
                                       ┌────────────────┐
                                       │ LLM predicts   │
                                       │ paired text    │
                                       └────────────────┘
                                                │
                                                ▼
┌──────────────────────┐
│ Stage 2: instruction │
└──────────────────────┘
           │
           ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ image + user │ ──→│ visual and   │ ──→│ assistant    │
│ question     │    │ text context │    │ answer       │
└──────────────┘    └──────────────┘    └──────────────┘

Notice the separation. Stage one says, "Can these parts talk at all?" Stage two says, "Can this combined system answer helpfully?" If you skip stage one, stage two wastes effort learning the handshake. If you skip stage two, the model may caption well but answer poorly.

This division also explains many open models. A strong frozen eye plus a modest connector can go far. But instruction tuning still matters for final user experience. Otherwise answers sound generic or brittle. So what to do? Use both stages, then evaluate grounded behavior carefully.
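One way to write down the two stages is as a freeze schedule: which modules receive gradient updates in which stage. The sketch below is a hypothetical configuration, not a real training script. In particular, unfreezing the LLM in stage two is an assumption made here for illustration; some setups keep it frozen or train lightweight adapters instead.

```python
# Hypothetical freeze schedule for the two stages described above.
# Assumption: stage 2 unfreezes the LLM alongside the connector; real recipes vary.

STAGES = {
    "stage1_alignment": {
        "data": "image-caption pairs",
        "trainable": {"vision_encoder": False, "connector": True, "llm": False},
    },
    "stage2_instruction": {
        "data": "image-grounded instruction conversations",
        "trainable": {"vision_encoder": False, "connector": True, "llm": True},
    },
}

def trainable_modules(stage_name):
    """List the modules that receive gradient updates in the given stage."""
    flags = STAGES[stage_name]["trainable"]
    return sorted(name for name, is_trainable in flags.items() if is_trainable)

for name in STAGES:
    print(name, "->", trainable_modules(name))
# stage1_alignment -> ['connector']
# stage2_instruction -> ['connector', 'llm']
```

Reading the schedule this way makes the division of labor visible: stage one only moves the connector, so the handshake is learned before any assistant behavior is taught.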

3) Worked example: what the LLM actually sees in a LLaVA-style stack

Let us trace one image end to end. Assume the image is resized to 336×336. Assume the vision encoder uses 14×14 patches.

Patches per side = 336 ÷ 14 = 24
Total patch positions = 24 × 24 = 576

So the eye produces 576 patch embeddings. Suppose each embedding has width 1024. Then the encoder output is:

E ∈ R^(576 × 1024)

Now use a learned projector into an LLM with width 4096. The projector matrix is:

W ∈ R^(1024 × 4096)

Multiply them:

V = EW

Shapes: (576 × 1024) × (1024 × 4096) = (576 × 4096)

Count the raw numbers.

Encoder output values = 576 × 1024 = 589,824
Projected values = 576 × 4096 = 2,359,296

Now add the user text. Suppose the prompt is: "Describe the safety hazards in this workshop image." Assume tokenization gives 10 text tokens. Then the decoder input prefix is roughly:

576 visual tokens + 10 text tokens = 586 total prefix positions

What does the LLM see at each of those 576 visual positions? Not words. Not pixels. It sees 4096-wide vectors inside its own hidden space. That is the key idea.

Let us do a tiny toy row to make it concrete. Take one patch row from a tiny example and project it.

x = [1, 2, 1]

W_toy = [  2, 0
           1, 3
          -1, 4 ]

Compute:

y1 = 1×2 + 2×1 + 1×(-1) = 2 + 2 - 1 = 3
y2 = 1×0 + 2×3 + 1×4 = 0 + 6 + 4 = 10

So one tiny visual row becomes:

y = [3, 10]

Scale that idea up. Each patch vector becomes an LLM-space embedding. The decoder then attends over those positions while generating text. If the assistant says, "There is a ladder near exposed wiring," that sentence came from attention over visual and text context together.

Now note one subtle point. Some systems keep almost all visual positions. Some compress them with a resampler or query module. So "how many tokens does the LLM see?" is a design choice. Do not answer it as if one fixed number exists. Yes?
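The arithmetic in this worked example can be checked with a few lines of plain Python. The helper names `visual_positions` and `project_row` are made up for this sketch; the sizes (336, 14, 1024, 4096, 10 text tokens) are the ones assumed in the example, not fixed properties of every LLaVA-style model.

```python
# Re-check the worked example: patch count, value counts, prefix length, toy projection.
# Helper names are hypothetical; the sizes are the assumptions stated in the example.

def visual_positions(image_size, patch_size):
    """Number of patch positions for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

def project_row(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

n_visual = visual_positions(336, 14)
encoder_values = n_visual * 1024     # entries in E
projected_values = n_visual * 4096   # entries in V = EW
prefix_len = n_visual + 10           # visual positions plus 10 text tokens

y = project_row([1, 2, 1], [[2, 0], [1, 3], [-1, 4]])

print(n_visual)           # 576
print(encoder_values)     # 589824
print(projected_values)   # 2359296
print(prefix_len)         # 586
print(y)                  # [3, 10]
```

The same two helpers work for any other image size and patch size, as long as the image side divides evenly by the patch side.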

4) Frontier VLMs: shared pattern, different product choices

Look at the big families now. Frontier systems share a broad idea. They all need perception, alignment, and language generation. But they differ in where multimodality enters.

A bolt-on system starts with separate pretrained parts. LLaVA is the clean teaching example. A pretrained vision encoder feeds a pretrained LLM through a learned connector. This is modular. It is easier to assemble and study. It can move fast on open weights.

A native multimodal system mixes modalities much earlier. Gemini is the common reference point here. During pretraining, image and text tokens are interleaved more deeply. The model learns joint behavior from the start. This can improve coordination across modalities. The cost is much higher training complexity.

GPT-4V style systems, from public evidence, look highly integrated at product level. They handle images, charts, screenshots, and documents in one assistant flow. Claude vision is especially strong in document reading and long-context analysis. Qwen-VL variants are notable for practical open deployment and region-aware chat.

So the shared pattern is clear. Perception enters a language-like sequence. The key difference is when and how that fusion was learned.

Here is a compact comparison.

Bolt-on: reuse strong old parts, learn connector, tune behavior.
Native multimodal: train a joint token world earlier, with tighter co-adaptation.

Bolt-on wins on simplicity and reproducibility. Native often wins on deeper coordination, if you can afford it.

Now what is the interview trap? People say "frontier models are just LLaVA but bigger." That is too loose. Some are closer to that recipe. Some are not. The correct answer is to compare training strategy, tokenization, and fusion depth. Simple, no?


Where this lives in the wild

  • ChatGPT with image upload lets users ask about charts, receipts, and whiteboards through one assistant flow.
  • Google Gemini in the Gemini app and Workspace side panel handles screenshot and slide understanding with multimodal context.
  • Claude 3 on claude.ai reads photos, PDFs, and diagrams, then answers in long-form grounded text.
  • Qwen-VL-Max on Alibaba Cloud supports image chat and region grounding for enterprise and commerce use cases.
  • Microsoft Copilot can reason over screenshots and UI captures, turning visual input into assistant actions and explanations.

Pause and recall

  1. Why does LLaVA separate alignment pretraining from instruction tuning?
  2. In a 336×336 image with 14×14 patches, how do we get 576 visual positions?
  3. What does the LLM actually receive from the image in a LLaVA-style model?
  4. Why is calling Gemini "just a bolt-on connector model" usually inaccurate?

Interview Q&A

Q1. Why do LLaVA-style systems often freeze the vision encoder and LLM first, instead of tuning everything immediately?
A. Freezing stabilizes training and focuses learning on the connector. It is cheaper and needs less data. Once alignment works, broader tuning can be added carefully.
Common wrong answer to avoid: "Because frozen models are always more accurate than tuned ones."

Q2. Why does instruction tuning matter if alignment pretraining already matches images and text?
A. Alignment teaches compatibility, not assistant behavior. Instruction tuning teaches grounded dialogue, format following, refusal style, and task framing. Without it, captions may improve while answers stay awkward.
Common wrong answer to avoid: "Instruction tuning only makes responses sound nicer."

Q3. Why might a native multimodal model outperform a bolt-on one on complex image dialogue?
A. Earlier joint training can let visual and text representations co-adapt more deeply. That can improve cross-modal reasoning and long interaction consistency. But it costs much more to train.
Common wrong answer to avoid: "Native multimodal always means there is no separate vision stack anywhere."

Q4. Why is token count analysis important when comparing frontier VLMs?
A. Token count affects latency, memory, and what detail survives into reasoning. A model that compresses aggressively may answer faster but lose small cues. A model that keeps more visual tokens may reason better but cost more.
Common wrong answer to avoid: "Token count is only an inference engineering detail."
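To put a rough number on Q4, here is a back-of-envelope sketch in plain Python. It compares a prefix that keeps all 576 visual positions with one from a hypothetical resampler that emits 64 positions; the 64 is an illustrative assumption, not a figure from any particular model. The cost proxy counts only causal attention score pairs and ignores everything else (MLP compute, KV-cache layout, batching), so treat it as intuition, not a benchmark.

```python
# Rough proxy: how prefix length changes the number of causal attention score pairs.
# The 64-token resampler is a hypothetical design point chosen for illustration.

def prefix_length(n_visual, n_text):
    """Total positions the decoder conditions on before generating."""
    return n_visual + n_text

def attention_pairs(n):
    """Causal self-attention over n positions scores n * (n + 1) / 2 query-key pairs."""
    return n * (n + 1) // 2

full = prefix_length(576, 10)        # keep every patch position
compressed = prefix_length(64, 10)   # hypothetical resampler output

print(full, attention_pairs(full))               # 586 171991
print(compressed, attention_pairs(compressed))   # 74 2775

ratio = attention_pairs(full) / attention_pairs(compressed)
print(round(ratio, 1))   # about 62x more score pairs per layer for the full prefix
```

The compressed prefix is far cheaper on this proxy, but each of its positions has to summarize many patches, which is exactly where small visual cues can get lost.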


Apply now (5 min)

Quick exercise: Take a 448×448 image and assume 14×14 patches. Compute the number of visual positions, then add a 12-token user prompt. State exactly what the decoder receives at each position.

Sketch from memory: Draw the two-stage LLaVA pipeline. Label stage one alignment, stage two instruction tuning, and show where the translator sits between the eye and the LLM.


Bridge. The recipe looks clean. But trained VLMs still fail in predictable, embarrassing ways. → 06-training-vlms-failure-points.md