06. Training VLMs: Failure Points — where strong demos still crack

13 min. Imagine a bright student who talks fast but glances too quickly.

Built on the ELI5 in 00-eli5.md. The eye (the vision encoder) is strong but imperfect, and these failures appear when its signals reach language through an overconfident bridge.


1) Picture first: why good training still leaves silly mistakes

See. A trained VLM can feel impressive for ten minutes. Then one odd image breaks the spell. It names objects that are not there. It miscounts obvious items. It misses tiny text. It flips left and right. It confuses one similar bird for another. Simple, no?

The reason is not only bad training. The task itself is awkward. The eye compresses a rich scene into a limited set of vectors. The translator maps those vectors into language space. Then the decoder keeps predicting words one by one. At each step, language priors can overpower weak visual evidence.

Look at the tension. Language models are rewarded for plausible continuation. Vision questions need faithful grounding. Those are related goals, but not identical. A plausible answer can still be false. That is why failures feel embarrassing: the sentence sounds confident and smooth even when it is wrong.

Another problem sits earlier. A single patch may contain mixed evidence. Thin objects, small text, and crowded scenes get blurred together. Once detail is lost there, later layers cannot magically recover it.

So what to do? First learn the failure shapes clearly. Then design evaluation and mitigation around them.

2) The common failure families

Now let us map the territory. Keep this tree in mind during debugging.

┌──────────────────────────────┐
│     VLM failure taxonomy     │
└──────────────────────────────┘
├── Hallucination
│   ├── object invented
│   ├── attribute invented
│   └── action invented
├── Counting
│   ├── duplicates merged
│   ├── crowded items skipped
│   └── small items ignored
├── OCR
│   ├── tiny text missed
│   ├── rotated text misread
│   └── stylized font confused
├── Spatial reasoning
│   ├── left/right flipped
│   ├── above/below mixed
│   └── relative position blurred
└── Fine-grained grounding
    ├── similar classes merged
    ├── exact region not localized
    └── part-level cue ignored

Hallucination

This is the loudest failure. The model describes a dog, vest, or sign that does not exist. Why? Because weak visual evidence meets strong language prior. If the scene roughly looks like a kitchen, "fork" feels plausible. If it roughly looks like a roadwork area, "helmet" feels plausible. The answer sounds natural. That is exactly the danger.

Counting failures

Now what is the problem with counting? Transformers are good at pattern summaries. Counting needs discrete, instance-level separation. Crowded objects overlap. Similar objects get pooled into one broad feature. Two dogs lying close may become one blob. Five bottles in a rack may become four. Yes?
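
The "two dogs become one blob" effect can be shown with a toy one-dimensional sketch. Everything here is an illustrative assumption: the row of cells, the non-overlapping average pooling, and the helper names `count_runs` and `average_pool` are hypothetical and do not describe how any particular VLM pools features.

```python
def count_runs(row, threshold=0.0):
    """Count maximal runs of adjacent cells whose value exceeds threshold."""
    runs, inside = 0, False
    for value in row:
        active = value > threshold
        if active and not inside:
            runs += 1  # a new instance starts here
        inside = active
    return runs


def average_pool(row, window):
    """Non-overlapping average pooling; assumes len(row) is a multiple of window."""
    return [sum(row[i:i + window]) / window for i in range(0, len(row), window)]


# Two "dogs" (runs of 1s) separated by a single empty cell.
row = [1, 1, 1, 0, 1, 1, 1, 0]
pooled = average_pool(row, 2)

print(count_runs(row))     # instances visible before pooling
print(count_runs(pooled))  # instances visible after pooling
```

Before pooling there are two separate runs. After pooling, the empty cell is averaged with its neighbor, the gap disappears, and only one run remains. The gist ("dog-like stuff is here") survives while the instance boundary does not.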

OCR failures

OCR looks easy until text becomes small or rotated. A few pixels decide one letter. If patching or resizing blurs those pixels, meaning collapses. "8" becomes "3". "EXIT" becomes noise. Stylized fonts make it worse. Screenshots with tiny menus are common failure cases.
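
A little arithmetic shows how quickly text runs out of pixels. This is a minimal sketch under assumed numbers: the 1920×1080 screenshot, the 12-pixel glyph height, the 448-pixel target side, and the 14-pixel patch are illustrative values, not the configuration of any specific model, and `glyph_height_after_resize` is a hypothetical helper.

```python
def glyph_height_after_resize(image_w, image_h, glyph_px, target_long_side):
    """Glyph height in pixels after scaling the longer image side to target_long_side."""
    scale = target_long_side / max(image_w, image_h)
    return glyph_px * scale


# Assumed example: a full-HD screenshot with 12 px menu text.
h = glyph_height_after_resize(1920, 1080, glyph_px=12, target_long_side=448)

# Assumed patch size: how many text lines of that height stack inside one patch.
PATCH_PX = 14
lines_per_patch = PATCH_PX / h

print(f"glyph height after resize: {h:.1f} px")
print(f"text lines per patch height: {lines_per_patch:.1f}")
```

Under these assumptions each letter is left with under three pixels of height, and one patch has to summarize several lines of text at once. That is the regime where "8" and "3" become hard to tell apart.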

Spatial reasoning

Left, right, behind, and between sound basic. But they require stable reference frames. A VLM may know the objects correctly and still relate them wrongly. Ask, "Is the cup left of the plate?" It may answer from a muddled global summary. This is worse when the image is busy.

Fine-grained grounding

Some tasks need exact localization. Which screw is loose? Which sentence in the contract mentions warranty? Which bird species has the white ring near the eye? These need precise region-to-token links. Many VLMs are better at scene gist than surgical grounding.

3) Worked failure case: how a hallucination happens token by token

Let us walk through one toy failure.

Image: a white plate, one spoon, and a napkin on a table.
Question: "Is there a fork on the plate?"

Ground truth is simple. There is no fork. But the model answers: "Yes, there is a fork next to the plate."

Why does this happen? We can sketch the token-level story. Assume the image becomes 16 visual tokens after compression. We only track four of them.

  v05 = plate rim highlights
  v06 = spoon bowl
  v07 = spoon handle glare
  v08 = napkin fold

Now look at the text tokens.

  t1 = Is
  t2 = there
  t3 = a
  t4 = fork
  t5 = on
  t6 = the
  t7 = plate
  t8 = ?

The decoder forms a query around the meaning of "fork". That query attends to visual tokens. For simplicity, assume the other twelve tokens receive negligible attention. Suppose the attention weights are:

  a(v05) = 0.18
  a(v06) = 0.25
  a(v07) = 0.42
  a(v08) = 0.15

These weights sum to:

  0.18 + 0.25 + 0.42 + 0.15 = 1.00

Now assign toy "fork-likeness" scores from the visual side.

  s(v05) = 0.10
  s(v06) = 0.20
  s(v07) = 0.30
  s(v08) = 0.05

Compute the visual evidence score.

  visual_score = 0.18×0.10 + 0.25×0.20 + 0.42×0.30 + 0.15×0.05
  visual_score = 0.018 + 0.050 + 0.126 + 0.0075
  visual_score = 0.2015

Now add a language prior. Table settings often include forks. The text tokens "fork" and "plate" activate that prior. Let the prior score be:

  prior_score = 0.70

Combine them in a toy way.

  combined_score = visual_score + prior_score
  combined_score = 0.2015 + 0.70
  combined_score = 0.9015

Assume the answer threshold for "yes" is 0.80. Since 0.9015 > 0.80, the model says "yes."

Look carefully. The visual evidence alone was weak. 0.2015 is not strong. But the language prior pushed the answer across threshold. That is hallucination in plain form.

Why were the visual scores not zero? Because the patch representation blurred thin shiny shapes. The spoon handle glare resembled generic metal cutlery. The model did not isolate prongs, because there were none to see. Yet the prior said, "plates often have forks." So the decoder committed early.

Once the first token becomes "Yes," the rest follows easily. Autoregressive decoding likes consistency. The next words then reinforce the mistake. "Yes, there is a fork" sounds more internally coherent than backing out mid-sentence. This is why confident tone is not proof of grounding.
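
The arithmetic in this walkthrough can be replayed in a few lines. This is a minimal sketch that copies the toy numbers from the example; the additive combination and the fixed threshold are the walkthrough's own simplifications, not how a real decoder computes token probabilities, and the function names are hypothetical.

```python
# Toy values copied from the walkthrough; none of them are measured.
attention = {"v05": 0.18, "v06": 0.25, "v07": 0.42, "v08": 0.15}
fork_likeness = {"v05": 0.10, "v06": 0.20, "v07": 0.30, "v08": 0.05}
PRIOR_SCORE = 0.70
YES_THRESHOLD = 0.80


def visual_score(attn, scores):
    """Attention-weighted sum of per-token fork-likeness."""
    return sum(attn[token] * scores[token] for token in attn)


def answer(attn, scores, prior, threshold):
    """Return ("yes" or "no", combined score) under the toy additive rule."""
    combined = visual_score(attn, scores) + prior
    return ("yes" if combined > threshold else "no"), combined


visual = visual_score(attention, fork_likeness)
verdict, combined = answer(attention, fork_likeness, PRIOR_SCORE, YES_THRESHOLD)
verdict_no_prior, _ = answer(attention, fork_likeness, 0.0, YES_THRESHOLD)

print(round(visual, 4), round(combined, 4), verdict, verdict_no_prior)
# prints: 0.2015 0.9015 yes no
```

Setting the prior to zero flips the answer to "no" with the same visual evidence. In this toy setup the prior alone decides the outcome.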

4) Why these failure points persist, and what they reveal

See the deeper pattern. Many failures come from information bottlenecks. Resolution gets reduced. Tokens get compressed. Fine details disappear. Then language fills the gaps. Counting fails because the model stores gist better than instances. OCR fails because letters are tiny geometric patterns. Spatial reasoning fails because relative positions are harder than object names. Fine-grained grounding fails because global captions need less precision than region labels. Different symptom. Same family of bottleneck.

Training data also matters. Many datasets reward coarse descriptions. "Two people in a park" is enough for captioning. But it does not teach exact count, exact side, or exact serial number. So what to do? Add harder data, denser supervision, and evaluation that punishes confident mistakes.

Another issue is answer style. Helpful assistants are trained to respond, not to pause. When evidence is weak, abstention is healthier. Yet many systems still answer anyway. This creates polished nonsense. Look. Calibration is as important as raw capability.

A useful mental checklist is short.

  • If the miss involves tiny detail, suspect resolution loss.
  • If the miss involves many similar objects, suspect instance collapse.
  • If the miss sounds plausible but ungrounded, suspect prior override.
  • If the miss flips relations, suspect spatial encoding weakness.

This checklist saves debugging time.

And remember one more thing. Even a stronger eye does not solve everything. The handoff through the translator still matters. If the connector compresses too aggressively, detail vanishes before reasoning starts. So failures are system failures, not just encoder failures. Simple, no?
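
The mental checklist in this section can also be kept as a small lookup when triaging a batch of logged misses. This is a sketch only: the symptom labels, the `(example_id, symptom)` log format, and the function names are hypothetical, and a human still has to decide which symptom a miss shows.

```python
# Symptom -> suspected cause, mirroring the four checklist items.
CHECKLIST = {
    "tiny_detail": "resolution loss",
    "many_similar_objects": "instance collapse",
    "plausible_but_ungrounded": "prior override",
    "flipped_relation": "spatial encoding weakness",
}


def suspect(symptom):
    """Map a hand-labeled symptom to the bottleneck the checklist suggests."""
    return CHECKLIST.get(symptom, "unclassified: inspect by hand")


def triage(misses):
    """Count suspected causes across (example_id, symptom) pairs."""
    counts = {}
    for _example_id, symptom in misses:
        cause = suspect(symptom)
        counts[cause] = counts.get(cause, 0) + 1
    return counts


# Hypothetical log of misses from an evaluation run.
misses = [
    ("img_001", "tiny_detail"),
    ("img_002", "plausible_but_ungrounded"),
    ("img_003", "tiny_detail"),
    ("img_004", "odd_colors"),
]
print(triage(misses))
```

A count dominated by one cause hints at where to look first, for example input resolution before the decoder.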


Where this lives in the wild

  • ChatGPT image questions on receipts and menus show how tiny text quality controls answer reliability.
  • Claude 3 document analysis on dense PDFs reveals OCR and fine-grained grounding limits on footnotes.
  • Google Gemini screenshot reasoning shows spatial mistakes when many interface elements look similar.
  • Qwen-VL commerce demos highlight counting and fine detail issues in crowded product grids.
  • Be My Eyes integrations with multimodal assistants make hallucination control critical because users act on the answer.

Pause and recall

  1. Why can a fluent VLM answer confidently even when visual evidence is weak?
  2. Which failure family is most tied to tiny or rotated text?
  3. In the toy hallucination example, what pushed the combined score above threshold?
  4. Why are counting errors related to instance separation rather than only weak language ability?

Interview Q&A

Q1. Why do hallucinations happen even when the model has a strong vision encoder?
A. Because the final answer is generated by a language decoder that mixes visual evidence with strong priors. If the evidence is weak or compressed, plausible language can dominate grounded perception.
Common wrong answer to avoid: "Hallucination means the image encoder failed completely."

Q2. Why is counting usually harder than object presence detection?
A. Presence needs coarse recognition. Counting needs discrete instance boundaries and stable separation across similar items. Pooling and compression hurt that second requirement.
Common wrong answer to avoid: "Counting is hard only because datasets forgot arithmetic."

Q3. Why does OCR break before ordinary object recognition breaks?
A. Letters depend on tiny strokes and exact orientation. A small blur or rotation can erase the decisive feature. Objects like chairs survive coarser summaries much better.
Common wrong answer to avoid: "OCR is just another classification task, so difficulty should be similar."

Q4. Why is abstention training important for multimodal assistants?
A. Many real questions arrive with weak evidence. A calibrated model should say uncertainty clearly instead of inventing detail. This matters more than fluent guessing in safety-critical or accessibility settings.
Common wrong answer to avoid: "If the model is smart enough, abstention becomes unnecessary."


Apply now (5 min)

Quick exercise: Take one image around you and write four questions. Make one counting question, one OCR question, one spatial question, and one fine-grained grounding question. Predict which one a VLM is most likely to miss, and say why.

Sketch from memory: Draw the failure tree. Then add arrows showing where the eye, the patch, and the translator can each introduce or amplify error.


Bridge. Recognition and description are half the story. The other half is creation. Now the model must paint, not just see. → 07-image-generation-landscape.md