13. Honest admission: what still feels unresolved in image and video models
~12 min read. Good engineers keep one shelf for doubt.
Built on the ELI5 in 00-eli5.md. The translator (the vision-language bridge that turns visual numbers into words) can sound confident even when the underlying visual understanding is shaky or incomplete.
1) Some visible failures are still stubborn
Let us start with the blunt truth. Reliable counting is not solved. Fine-grained spatial reasoning is not solved. Hands and fingers in image generation have improved. They are not solved. Look.
Ask a VLM how many small screws lie in a cluttered tray. It may answer fast. It may answer fluently.
It may still be wrong. Why? Because detecting tiny instances, separating overlaps, and keeping exact count are different skills. A model can capture gist without exact cardinality. The same happens in spatial questions. Which cup is left of the red book and behind the phone?
That requires precise relation binding. A rough scene embedding is not enough. In generation, fingers expose the same weakness differently.
A hand must satisfy part count, geometry, and pose consistency together. Local realism is easy. Global structure is stricter. So the canvas can look beautiful while still making six fingers. Simple, no?
good local texture ──→ skin looks plausible
bad global structure ──→ finger count or joint layout breaks
same image, two very different skill levels
Current models are not reliable geometric reasoners at fine detail. And no benchmark result has closed that story yet.
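A small scoring sketch makes the gap between gist and exact cardinality concrete. It compares exact-match accuracy with a looser within-one accuracy. The function name `counting_report`, the tolerance, and every number below are made up for illustration. They are not results from any real model.

```python
def counting_report(predicted, actual, tolerance=1):
    """Return exact-match and within-tolerance accuracy for count predictions."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(predicted, actual))
    exact = sum(p == a for p, a in pairs)
    near = sum(abs(p - a) <= tolerance for p, a in pairs)
    return {"exact": exact / len(pairs), "within_tolerance": near / len(pairs)}


# Hypothetical screw counts for four cluttered trays (illustrative only).
predicted = [12, 7, 3, 20]
actual = [14, 7, 3, 19]

report = counting_report(predicted, actual)
print(report)  # {'exact': 0.5, 'within_tolerance': 0.75}
```

A model can look decent on the loose metric and still fail the strict one. If the product needs the exact number, only the strict metric matters.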
2) Some tricks work well, but theory still trails practice
Now the uncomfortable theory gap. Classifier-free guidance works very well. Why does it work this well? We have stories. We do not have full closure. The practical recipe is familiar.
Run a conditional prediction. Run an unconditional prediction. Push harder in the conditional direction.
Done. That part is easy to use. The deeper why is still foggier than many interview answers admit. The same feeling appears with diffusion image sharpness. Why does iterative denoising produce such crisp, rich images so consistently? We have mechanistic intuitions.
We have scaling evidence. We do not have one neat final explanation that makes every behavior obvious beforehand. So yes, engineers can ship with partial theory.
That happens often. But we should say it clearly. Some of our best tools are stronger than our cleanest explanations. That is normal science. That is not failure. It is just unfinished understanding.
And the canvas still surprises us more than it should.
unconditional score ──┐
                      ├── difference × guidance scale ──→ sharper prompt pull
conditional score ────┘
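The recipe fits in a few lines. The sketch below uses one common parameterization: start from the unconditional prediction and add the scaled difference. Some codebases write the same idea with a slightly different scale convention, so treat the name `guided_prediction` and the toy vectors as assumptions for illustration only.

```python
def guided_prediction(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy three-element predictions standing in for real model outputs.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

print(guided_prediction(uncond, cond, 0.0))  # [0.0, 1.0, 2.0], prompt ignored
print(guided_prediction(uncond, cond, 1.0))  # [1.0, 1.0, 0.0], plain conditional
print(guided_prediction(uncond, cond, 3.0))  # [3.0, 1.0, -4.0], pushed past conditional
```

With this convention, a scale above 1 extrapolates beyond the conditional prediction. That is the "push harder" step. The code shows what the trick does. It does not explain why it works so well.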
3) Some big design debates are still active
One debate is native multimodal versus bolt-on vision. Should the language model learn images from the start? Or should we attach a vision encoder and projector later?
Native multimodal training may promise cleaner shared representations. Bolt-on systems are often easier to build and reuse. Which wins? Depends on data, compute, and product goals. That is the honest answer. Another debate is end-to-end video versus frame-by-frame or lightly stitched generation.
End-to-end video models can, in principle, reason jointly across time. But they are expensive. Framewise methods are cheaper and easier to control.
But they flicker and drift more easily. Again, no universal winner exists yet. The choice is task-shaped. This matters in interviews. Do not speak like the field has one settled architecture destiny. It does not.
And the frame tape is still where many design arguments become painful.
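For the bolt-on side of the first debate, a rough shape-level sketch may help. It assumes the simplest possible projector, a single linear layer, and uses random arrays in place of a real vision encoder and a real LLM embedding table. All sizes and names (`project`, `d_vision`, `d_model`) are hypothetical. Real systems may use a small MLP or other connector instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
d_vision, d_model = 8, 16   # vision feature width, LLM embedding width
n_patches, n_text = 4, 3    # image patches, text tokens

# Stand-ins for a vision encoder's patch features and the LLM's text embeddings.
patch_features = rng.normal(size=(n_patches, d_vision))
text_embeddings = rng.normal(size=(n_text, d_model))

# The projector: the piece that maps vision features into the LLM's space.
weight = rng.normal(size=(d_vision, d_model)) * 0.02
bias = np.zeros(d_model)


def project(features, weight, bias):
    """Map vision features into the LLM embedding space with one linear layer."""
    return features @ weight + bias


visual_tokens = project(patch_features, weight, bias)
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(sequence.shape)  # (7, 16)
```

The language model then sees one sequence of seven vectors. Whether a late-attached bridge like this can match representations learned jointly from the start is exactly the open question.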
4) Hallucination and physics remain awkward truths
Vision-language hallucination has no reliable universal detector yet. That sentence should be said plainly. A model may describe an object that is absent.
Or miss one that is present. Or invent a relation that the image does not support. Confidence alone does not save you. Calibration is imperfect. Self-checking helps sometimes. External tools help sometimes.
No method gives dependable detection across all images and tasks today. Now video generation. People say the model understands physics.
Be careful. Often it imitates surface regularities from training data. That is not the same as learned physical causality. A generated glass may wobble convincingly. A ball may arc nicely. Then some later frame violates mass, collision, or continuity.
So the clip looks physical. It is not truly reasoning from first principles. Look at a toy example.
Suppose a prompt asks for three bouncing balls.
Frames are 1 to 6.
Ball A and Ball B bounce on time.
Ball C disappears near the floor and returns larger.
The video still feels alive.
Physics is still fake.
frame 1 ──→ three balls visible
frame 2 ──→ downward motion looks fine
frame 3 ──→ one ball intersects floor slightly
frame 4 ──→ same ball vanishes behind nothing
frame 5 ──→ ball reappears larger
frame 6 ──→ viewer still says 'nice motion'
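The toy example can be turned into a crude consistency check. The sketch assumes an object tracker already gives one radius per ball per frame, with `None` when the ball is not detected, and that the scene has no occluders. The name `find_violations`, the 1.2 ratio threshold, and the radii are illustrative assumptions.

```python
def find_violations(track, max_radius_ratio=1.2):
    """Flag reappearance after a gap and sudden size jumps in one object's track.

    track: one positive radius per frame, or None when the object is not detected.
    """
    violations = []
    last_frame, last_radius = None, None
    for frame, radius in enumerate(track, start=1):
        if radius is None:
            continue
        if last_frame is not None:
            if frame - last_frame > 1:
                violations.append((frame, "reappeared after vanishing"))
            ratio = max(radius, last_radius) / min(radius, last_radius)
            if ratio > max_radius_ratio:
                violations.append((frame, "size jump"))
        last_frame, last_radius = frame, radius
    return violations


# Six frames, three balls. Ball C vanishes in frame 4 and returns larger.
tracks = {
    "A": [10, 10, 10, 10, 10, 10],
    "B": [10, 10, 10, 10, 10, 10],
    "C": [10, 10, 10, None, 15, 15],
}

for name, track in tracks.items():
    print(name, find_violations(track))
# A []
# B []
# C [(5, 'reappeared after vanishing'), (5, 'size jump')]
```

This catches the vanish and the size change. It does not catch the floor intersection in frame 3, which would need scene geometry. Hand-written checks like this only cover the failures you thought of in advance.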
5) What to say honestly in an interview
Suppose the interviewer asks what we still do not know. A strong answer is calm. Not dramatic.
You can say this. We know these models are powerful pattern learners. We do not fully know how to guarantee exact counting, grounded spatial reasoning, or hallucination detection. We also do not have a settled theory for why some generation tricks work so strongly. And in video, much apparent physics is imitation, not robust world modeling. That answer is mature.
It shows respect for evidence. It avoids hype. It avoids fake pessimism too.
See. Real seniority is not sounding certain about unresolved things. Real seniority is naming the limits without losing the engineering thread.
Where this lives in the wild
- GPT-4V system card discussions: examples show strong image understanding with remaining hallucination and grounding failures.
- Google Gemini visual reasoning demos: broad capability is real, yet exact counting and fine relation questions still expose weakness.
- Midjourney generations: beautiful scenes still invite users to inspect hands, text, and object counts carefully.
- Runway video outputs: motion can feel cinematic while deeper physical consistency remains brittle across frames.
- Adobe Firefly commercial image workflows: production users still care about exact fingers, exact text, and exact spatial edits, not only style.
Pause and recall
- Why can a model be useful for vision tasks but still fail at reliable counting?
- What is the honest theory gap behind classifier-free guidance and sharp diffusion outputs?
- Why is native multimodal versus bolt-on vision still an open design debate?
- Why should we say video physics is often faked rather than learned?
Interview Q&A
Q: Why is counting still hard and not just a solved subproblem of recognition?
A: Because gist recognition, instance separation, occlusion handling, and exact cardinality are different demands, and current models often optimize the first more than the last.
Common wrong answer to avoid: If the model can name the object class, it can automatically count all instances correctly.
Q: Why be cautious about claims that classifier-free guidance is fully understood and optimal?
A: Because we have practical explanations and strong empirical evidence, but not a single complete theory that predicts all its best behaviors cleanly.
Common wrong answer to avoid: Guidance works, so the underlying theory must already be settled.
Q: Why choose native multimodal training and not bolt-on vision, or the reverse?
A: Because each approach trades off shared representation quality, reuse of existing LLMs, data needs, and engineering cost, and the best answer depends on the product regime.
Common wrong answer to avoid: One architecture style has already won forever for every multimodal system.
Q: Why say video models fake physics and not truly understand it?
A: Because many outputs match the appearance of physical motion without reliably preserving causal rules, object permanence, or stable interactions over time.
Common wrong answer to avoid: Smooth motion always proves the model learned real-world physics.
Apply now (5 min)
Quick exercise. Write one example each for counting failure, spatial-relation failure, hand-generation failure, and VLM hallucination. Then write one sentence on why each failure is hard to detect automatically.
Sketch from memory a split between local plausibility and global correctness. Under the sketch, write the interview answer you would give for what we still do not know.
Bridge. We know what images and videos can and cannot do today. The dominant generation engine — diffusion — deserves its own deep dive. → ../02_diffusion_media_generation/00-eli5.md