13. Honest admission: what still breaks, surprises, and stays unsettled
~12 min read. The thing that keeps us humble after the clean pipeline diagram.
Built on the ELI5 in 00-eli5.md. The blueprint — the text conditioning signal — can be strong, but strong prompting still does not solve every failure, and some of diffusion's best tricks still outpace our theory.
1) Why hands, text in images, and counting still break
Diffusion is very good at local plausibility.
One patch of skin can look excellent.
One letter can look almost right.
But global symbolic correctness is a stricter exam.
Consider a toy prompt:
"five apples on a tray."
The model may render two apples on the left.
Two on the right.
And the fifth becomes a half-hidden reddish blob near the napkin.
Now the image looks plausible locally.
But the clear count is only four.
That is how we end up with the wrong total.
left cluster ┌── apple ── apple ──┐
right cluster └── apple ── apple ──┘
missing fifth → blurred blob near cloth
local realism ≠ exact global count
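The toy apple example can be written down as a tiny count check. This is only a sketch under loud assumptions: the confidence scores are made up for illustration, and `clear_count` is a hypothetical helper, not part of any real detector or diffusion library. It shows how a half-hidden blob drops out of the "clear" count even though every visible apple looks fine.

```python
def clear_count(confidences, threshold=0.5):
    """Count objects whose (hypothetical) detector confidence clears the threshold."""
    return sum(1 for c in confidences if c >= threshold)


requested = 5  # the prompt asked for "five apples on a tray"

# Made-up scores: four crisp apples plus one reddish blob near the napkin.
detections = [0.97, 0.95, 0.96, 0.94, 0.31]

found = clear_count(detections)
matches = found == requested

print(f"requested={requested} clearly_visible={found} match={matches}")
```

Each of the four crisp apples passes on its own, which is the local realism. The mismatch only shows up when the whole image is compared against the requested total.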
Hands and text suffer for related reasons.
Fingers require precise part relations.
Text requires symbolic sequence consistency.
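Diffusion often optimizes "looks right nearby" before "is exactly right everywhere."
The standard denoising objective rewards each region for looking plausible; nothing in it explicitly checks a count or a spelling.
Ideogram-like systems have improved text rendering a lot.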
Still, the failure family has not vanished.
2) Why classifier-free guidance works so well is still partly mysterious
We have a practical story for CFG.
Compare conditional and unconditional predictions.
Amplify the difference.
Fine.
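A minimal sketch of that practical story, assuming the common formulation in which the guided prediction is the unconditional prediction plus a scaled difference. The function name `cfg_combine`, the plain-list "tensors", and the toy numbers are illustrative assumptions; real pipelines do the same arithmetic on model outputs, and some papers parametrize the scale slightly differently.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    along the (conditional - unconditional) direction, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 3-element "image" (made-up numbers).
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]

# scale 0.0 reproduces the unconditional prediction,
# scale 1.0 is (up to rounding) the conditional prediction,
# larger scales amplify the difference between the two.
for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

The code is one line of arithmetic, which is part of the puzzle: nothing in it explains why moderate scales help prompt adherence so reliably.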
But the empirical strength of this trick still feels larger and cleaner than our simplest classroom intuition fully explains.
Why do sweet spots appear so reliably across model families?
Why does the guidance direction help so much without a separate classifier?
Why does it often improve prompt adherence before quality collapses?
Those answers are not fully settled.
Good theory exists.
Complete theory does not.
Mature engineers can live with that.
Use the tool.
Keep the humility.
3) Will diffusion stay on top, or will other generators replace it?
No serious engineer should promise permanent victory.
Diffusion is strong today because quality is high, tooling is rich, and controllability is good.
But other families are improving.
Flow-matching models may offer simpler continuous-time views and faster paths.
Autoregressive image models may shine when exact token-like structure matters.
Hybrid systems may borrow the best of multiple camps.
diffusion ──→ strong quality, strong controls, heavier iterative cost
flow matching ──→ elegant paths, promising speed-quality trade-offs
autoregressive ──→ stronger discrete structure in some settings
The winning family may differ by product.
Concept art has one taste.
Brand-logo generation has another.
Structured document graphics have another.
So the real question is not,
"Who wins forever?"
It is,
"Who wins for this latency budget, this quality bar, and this control need?"
4) Compute cost, quality ceiling, and the mature engineering stance
Let us end with arithmetic, not hype.
Suppose a heavy model call costs 120 ms.
Then a 30-step sampler, at one model call per step, costs 30 × 120 ms = 3.6 s.
A 4-step shortcut costs 4 × 120 ms = 0.48 s.
Nice.
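The same arithmetic as a small helper, in case you want to plug in your own numbers. This is a sketch: `sampler_latency_s` and the `calls_per_step` parameter are names introduced here, and the model is deliberately naive (it ignores text encoding, decoding, and batching). The `calls_per_step=2` case is an assumption that covers setups where guidance runs a conditional and an unconditional pass separately at every step.

```python
def sampler_latency_s(steps, ms_per_call=120.0, calls_per_step=1):
    """Rough sampling latency in seconds: steps x calls per step x cost per call."""
    return steps * calls_per_step * ms_per_call / 1000.0


print(f"30-step sampler:  {sampler_latency_s(30):.2f} s")  # 3.60 s
print(f"4-step shortcut:  {sampler_latency_s(4):.2f} s")   # 0.48 s

# If each step needs two unbatched model calls, the bill doubles.
print(f"30 steps, 2 calls: {sampler_latency_s(30, calls_per_step=2):.2f} s")  # 7.20 s
```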
But imagine the slower system gets a quality score of 8.8 / 10 on your task and the faster one gets 7.9 / 10.
Now the decision depends on product reality.
A mobile sticker app may happily take 0.48 s.
A premium print workflow may demand the slower ceiling.
Adobe Firefly brand work, realtime mobile demos, and internal design copilots do not share the same tolerance for latency or for quality loss.
more compute ──→ more repair room ──→ often higher ceiling
less compute ──→ better latency ──→ often tougher compromises
The mature stance is simple.
Measure quality.
Measure cost.
Match the generator family and sampler to the product job.
Respect what diffusion does well.
Stay honest about where it still struggles.
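"Measure quality, measure cost, match to the job" can be sketched as a filter over measured candidates. The latency and quality figures below are the ones from this section; the product budgets (`max_latency_s`, `min_quality`) and the names `Candidate` and `pick` are hypothetical, chosen only to make the example run. Treat it as an illustration of the decision shape, not a recommended selection policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    latency_s: float  # measured end-to-end latency
    quality: float    # task-specific score out of 10


def pick(candidates: List[Candidate], max_latency_s: float,
         min_quality: float) -> Optional[Candidate]:
    """Return the highest-quality candidate that fits both limits, or None."""
    ok = [c for c in candidates
          if c.latency_s <= max_latency_s and c.quality >= min_quality]
    return max(ok, key=lambda c: c.quality) if ok else None


candidates = [
    Candidate("30-step sampler", latency_s=3.6, quality=8.8),
    Candidate("4-step shortcut", latency_s=0.48, quality=7.9),
]

# Hypothetical product budgets.
sticker_app = pick(candidates, max_latency_s=1.0, min_quality=7.5)
print_workflow = pick(candidates, max_latency_s=10.0, min_quality=8.5)
impossible = pick(candidates, max_latency_s=1.0, min_quality=8.5)

print(sticker_app.name)     # 4-step shortcut
print(print_workflow.name)  # 30-step sampler
print(impossible)           # None
```

The `None` case is the honest one: sometimes no current option meets both the latency budget and the quality bar, and the product requirement has to move.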
Where this lives in the wild
- Midjourney outputs: often look stunning globally, yet users still inspect hands and small symbolic details carefully.
- Stable Diffusion community models: text inside signs or packaging frequently exposes the gap between visual plausibility and exact symbol control.
- Ideogram-style products: strong text rendering becomes a competitive differentiator precisely because generic diffusion often struggles there.
- Adobe Firefly brand imagery: commercial workflows care about exact logos, counts, and structured edits, not only pleasing style.
- Realtime mobile generation demos: latency pressure makes the quality ceiling visible very quickly when shortcuts are pushed too hard.
Pause and recall
- Why can diffusion look globally plausible but still fail on fingers, text, or counting?
- In the toy apple example, how did we end up with the wrong total?
- Why do people say CFG works better than our clean theory fully explains?
- What practical question decides whether diffusion, flow matching, or autoregressive image models win in a product?
Interview Q&A
Q: Why do hands, text, and counting remain hard for diffusion systems?
A: Because these tasks require exact global or symbolic consistency, while diffusion models often optimize local visual plausibility more strongly.
Common wrong answer to avoid: "If the image looks realistic overall, symbolic details will automatically be correct too."
Q: Why is CFG still discussed as a theory gap?
A: Because its empirical strength across many models and settings is larger and cleaner than our simplest intuitive story fully captures.
Common wrong answer to avoid: "We already have a complete settled theory for why CFG always works optimally."
Q: Why might diffusion be challenged by flow matching or autoregressive image models?
A: Because alternative model families may offer better speed-quality trade-offs, stronger symbolic structure, or easier controllability in some regimes.
Common wrong answer to avoid: "Diffusion already won permanently, so comparisons no longer matter."
Q: Why is compute cost tied to the quality ceiling?
A: Because better quality often demands more steps, bigger models, or heavier refinement, all of which spend more memory, time, or money.
Common wrong answer to avoid: "Quality and compute are basically independent once the model is trained."
Apply now (5 min)
Quick exercise. List three product tasks where local realism is enough and three where exact symbolic correctness is required.
Then write which failure family each one would expose: anatomy, text rendering, counting, or cost ceiling.
Sketch from memory the contrast: local patch realism ≠ global symbolic correctness.
Under the sketch, write one line on why the speed shortcut always has to answer to a quality ceiling in real products.
Bridge. Good. We end this module with honesty, not hype. The next step is to build something practical and judge trade-offs with your own hands. → ../33_capstone_project/00-eli5.md