01. Opening failure: the smart judge who still cannot paint
~11 min read. The thing that proves image scoring is not the same as image generation.
Built on the ELI5 in 00-eli5.md. The blueprint — the text conditioning signal telling us what to reveal — is helpful, but by itself it cannot sculpt a believable image out of chaos.
1) A caption judge is not a pixel sculptor
Look. Suppose we start from TV static. Then we ask a smart judge, "Does this look more like a red fox in snow now?" The judge can score the image. That does not mean the judge can paint the image. This is the opening failure.
CLIP is very good at saying, "this picture matches this text more than that one." But naïve image generation asks a harder thing. It asks CLIP to pull random pixels toward realism. So what happens? The optimizer finds cheats. It finds brittle patterns that raise score. It does not reliably find natural images.
The blueprint gives direction. The blueprint does not supply a full image prior. So the marble block of noise stays wild for too long. See the mismatch.
┌──────────────┐    text prompt    ┌───────────────┐
│ random image │ ────────────────→ │  CLIP scorer  │
└──────┬───────┘                   └──────┬────────┘
       │                                  │
       └──── gradient ascent on pixels ◄──┘
                        │
                        ▼
           strange high-score artifact
2) A tiny worked example shows the cheat
Picture a toy 2 × 2 image.
The left column being bright helps the prompt score.
The right column being dark also helps the prompt score.
Our fake judge ignores whether the result looks natural.
Simple, no?
With that picture in mind, here is the toy formula.
score = left_avg - right_avg
That is a caricature of a prompt-matching objective.
It rewards one narrow feature.
It says nothing about natural image statistics.
start pixels = [0.2, 0.1, 0.2, 0.1]
left_avg = (0.2 + 0.2) / 2 = 0.2
right_avg = (0.1 + 0.1) / 2 = 0.1
score s0 = 0.2 - 0.1 = 0.1
after step 1 = [0.4, -0.1, 0.4, -0.1]
left_avg = (0.4 + 0.4) / 2 = 0.4
right_avg = (-0.1 + -0.1) / 2 = -0.1
score s1 = 0.4 - (-0.1) = 0.5
after step 2 = [0.8, -0.6, 0.8, -0.6]
left_avg = (0.8 + 0.8) / 2 = 0.8
right_avg = (-0.6 + -0.6) / 2 = -0.6
score s2 = 0.8 - (-0.6) = 1.4
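The score shoots up, from 0.1 to 0.5 to 1.4. The picture becomes nonsense. By step 1 the right-column pixels are already negative, and no real image has negative brightness. See. A narrow judge can be exploited. Real CLIP is much richer than this toy. Still, the failure shape is similar. Prompt score alone is not enough.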
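The same cheat can be replayed in a few lines of Python. This is a minimal sketch, not part of any real pipeline: the names `toy_score`, `TOY_GRADIENT`, and `ascend` are made up for illustration, and the row-major pixel order `[top_left, top_right, bottom_left, bottom_right]` is an assumption that matches the averages computed above. It uses one fixed step size, so it reproduces step 1 of the hand-worked numbers but not the larger jump in step 2.

```python
def toy_score(pixels):
    """Score a 2x2 image stored row-major: [top_left, top_right, bottom_left, bottom_right]."""
    left_avg = (pixels[0] + pixels[2]) / 2
    right_avg = (pixels[1] + pixels[3]) / 2
    return left_avg - right_avg

# The score is linear, so its gradient is constant:
# +0.5 for each left-column pixel, -0.5 for each right-column pixel.
TOY_GRADIENT = [0.5, -0.5, 0.5, -0.5]

def ascend(pixels, step_size, steps):
    """Plain gradient ascent on toy_score. Nothing keeps the pixels in a valid range."""
    history = [(list(pixels), toy_score(pixels))]
    for _ in range(steps):
        pixels = [p + step_size * g for p, g in zip(pixels, TOY_GRADIENT)]
        history.append((list(pixels), toy_score(pixels)))
    return history

history = ascend([0.2, 0.1, 0.2, 0.1], step_size=0.4, steps=3)
for step, (pixels, score) in enumerate(history):
    print(step, [round(p, 2) for p in pixels], round(score, 2))
```

The score climbs on every step while the right-column pixels sink further below zero. Nothing in the objective ever pushes back.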
3) Why CLIP-guided pixel optimization drifts into weirdness
Now what is the real problem? Pixel space is huge. Natural images occupy a tiny corner of it. A score function can point toward a prompt. It may still point outside the natural-image manifold. That is why weird textures appear.
With that picture in mind, we can write the real objective.
We maximize something like
cos(E_image(I), E_text(t)).
Good.
But that only says,
"make the embedding align better."
It does not say,
"stay on the manifold of photographs."
So we get adversarial behavior. Tiny repeated textures. Over-sharpened edges. Floating fragments. High confidence. Low realism. The blueprint is being obeyed in embedding space. The image is not being repaired in pixel space.
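To make the objective concrete, here is a minimal sketch of gradient ascent on cosine similarity. It does not use CLIP. The "image encoder" is a hypothetical random linear map (`weights`), the text embedding is a random vector (`text_vec`), and the image is 16 pixels of static. All of these are stand-in assumptions chosen so the example runs with only the standard library. What it shows is the shape of the loop: the only signal is embedding alignment.

```python
import math
import random

def embed(weights, pixels):
    """Hypothetical linear 'image encoder': one dot product per embedding dimension."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))

def cosine_grad_wrt_pixels(weights, pixels, text_vec):
    """Gradient of cos(embed(pixels), text_vec) with respect to the pixels."""
    e = embed(weights, pixels)
    ne, nt = norm(e), norm(text_vec)
    dot = sum(p * q for p, q in zip(e, text_vec))
    # d cos / d e = t / (|e||t|) - (e . t) e / (|e|^3 |t|)
    grad_e = [t / (ne * nt) - dot * ej / (ne ** 3 * nt) for ej, t in zip(e, text_vec)]
    # Chain rule through e = W x gives W^T grad_e.
    return [sum(weights[j][i] * grad_e[j] for j in range(len(weights)))
            for i in range(len(pixels))]

rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]  # 16 pixels -> 8 dims
text_vec = [rng.gauss(0, 1) for _ in range(8)]                      # stand-in for E_text(t)
pixels = [rng.random() for _ in range(16)]                          # static in [0, 1]

start_score = cosine(embed(weights, pixels), text_vec)
for _ in range(300):
    grad = cosine_grad_wrt_pixels(weights, pixels, text_vec)
    pixels = [p + 0.1 * g for p, g in zip(pixels, grad)]  # ascent step, no realism term
final_score = cosine(embed(weights, pixels), text_vec)

print(f"similarity: {start_score:.3f} -> {final_score:.3f}")
print(f"pixel range after ascent: [{min(pixels):.2f}, {max(pixels):.2f}]")
```

The loop never asks whether the pixels still look like a photograph, or even whether they stay in a displayable range. It only asks whether the embedding lines up better with the text vector.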
4) What we actually need instead
We need a generator with a prior over natural images. We need a process that knows, step by step, what realistic structure looks like. We need many careful repairs. Not one giant jump.
That is why diffusion works better. It starts from the marble block of pure noise. Then the model applies one chisel stroke at a time. The sculptor's training teaches each repair. CLIP-like guidance can help later. But the denoiser does the real carving.
So remember the diagnosis. A good similarity judge is useful. A good image generator is different. One scores. The other knows how to travel from noise to realism. Our next file builds the corruption process first. Then the learned repair will make sense.
Notice the logic. First we admitted the judge cannot paint. Next we must manufacture the noisy starting point ourselves. If generation begins from pure noise, we need a clean recipe for turning real images into that noise. That recipe is the forward process. The next file builds it step by step.
Where this lives in the wild
- VQGAN+CLIP notebooks: CLIP supplies the score, but the generator prior is what keeps outputs from collapsing into pure adversarial mush.
- OpenAI CLIP playground experiments: prompt-matching can be strong even when the raw optimized pixels look strange.
- Hugging Face diffusers with CLIP guidance hooks: CLIP acts as an extra steering signal, not the whole generator.
- Creative coding tools in TouchDesigner: direct prompt-score optimization often produces psychedelic textures instead of stable scenes.
- Internal brand-image rankers: ranking candidate images by text match is easier than synthesizing them from scratch.
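For contrast, the ranking job in the last item is small enough to sketch in full. This is an illustration under stated assumptions: the three-dimensional embeddings are made-up numbers, not outputs of any real encoder, and `rank_by_text_match` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def rank_by_text_match(text_embedding, candidates):
    """Return candidate names sorted from best to worst text match."""
    return sorted(candidates,
                  key=lambda name: cosine(candidates[name], text_embedding),
                  reverse=True)

# Made-up embeddings for illustration only.
text_embedding = [1.0, 0.0, 0.5]
candidates = {
    "static": [0.1, 0.9, -0.2],
    "cat_photo": [0.5, 0.5, 0.3],
    "fox_photo": [0.9, 0.1, 0.4],
}

ranking = rank_by_text_match(text_embedding, candidates)
print(ranking)
```

Ranking only compares images that already exist. It never has to decide how a single pixel should change, which is exactly the part generation needs.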
Pause and recall
- Why does a strong image-text scorer still fail as a one-shot image generator?
- In the toy 2 × 2 example, why did the score rise while realism fell?
- What crucial thing is missing from prompt-score maximization alone?
- Why do we say the blueprint gives direction but not a full image prior?
Interview Q&A
Q: Why is CLIP useful for retrieval and ranking but insufficient for naïve pixel generation?
A: Because ranking only needs relative semantic comparison, while generation needs a path that stays inside the space of realistic images as pixels change.
Common wrong answer to avoid: "If CLIP understands images, it can directly draw them."
Q: Why do optimized prompt scores often produce adversarial-looking pictures?
A: Because the optimizer exploits whatever visual cues raise the score, even when those cues are brittle textures instead of natural object structure.
Common wrong answer to avoid: "Higher similarity score automatically means higher visual quality."
Q: Why is a generative prior so important?
A: Because the model must know what natural images usually look like, not just what direction increases text-image similarity.
Common wrong answer to avoid: "The text embedding already contains the full image manifold."
Q: Why move from one-shot optimization to gradual denoising?
A: Because many small learned repairs are easier to keep realistic than one giant unconstrained leap in pixel space.
Common wrong answer to avoid: "More gradient steps on CLIP eventually solve the realism problem by themselves."
Apply now (5 min)
Quick exercise. Open any image editor and imagine starting from full static.
Write three things a scorer can say about the static, and three things a generator must know that the scorer does not.
Sketch from memory the loop: random image → CLIP scorer → pixel update → weird artifact.
Under the sketch, write one sentence on why the blueprint is not the same thing as a generator prior.
Bridge. Good. We now know why direct score chasing fails. So what do diffusion models do instead? They first learn a very controlled way to destroy images. → 02-forward-process.md