08. Classifier-free guidance — steering with and without the prompt

~11 min read. The thing that makes the prompt bite harder without training a separate classifier.

Built on the ELI5 in 00-eli5.md. The blueprint — the text conditioning signal — becomes much stronger here because we compare a prompt-aware denoiser and a prompt-free denoiser inside the same model.

1) Two opinions about the same noisy image

Classifier-free guidance starts with a simple social picture.

Ask two people to judge the same noisy draft.

One person sees the prompt.

One person does not.

The difference between their answers tells you what the prompt specifically wants.

That is the whole trick.

noisy x_t ──→ unconditional branch ──→ eps_u
     │
     └──→ conditional branch   ────→ eps_c
                                   │
                                   └── combine ─→ guided epsilon

The unconditional branch says,

"make this look like a plausible image."

The conditional branch says,

"make this look like the prompt."

Subtract the unconditional prediction from the conditional one.

What remains is the prompt-specific push.

Scale that push.

Now the blueprint speaks louder.

Stable Diffusion UIs expose this as the CFG scale slider.

Firefly-like systems often hide the knob, but the same idea matters.
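
The two-branch picture above can be sketched in a few lines of Python. This is a toy illustration, not a real diffusion model: `toy_denoiser` is a hypothetical stand-in for the shared network, its arithmetic is invented so that the two branches disagree, and the prompt is reduced to a single made-up number.

```python
from typing import List, Optional

def toy_denoiser(x_t: List[float], prompt_signal: Optional[float]) -> List[float]:
    """Hypothetical stand-in for the shared denoiser.

    prompt_signal=None plays the unconditional branch;
    a number plays the conditional branch.
    """
    if prompt_signal is None:
        return [0.5 * v for v in x_t]
    return [0.5 * v - prompt_signal for v in x_t]

def guided_step(x_t: List[float], prompt_signal: float, w: float) -> List[float]:
    eps_u = toy_denoiser(x_t, None)           # prompt-free opinion
    eps_c = toy_denoiser(x_t, prompt_signal)  # prompt-aware opinion
    # Prompt-specific push (eps_c - eps_u), scaled by w, added back to eps_u.
    return [u + w * (c - u) for u, c in zip(eps_u, eps_c)]

x_t = [1.0, 2.0, 4.0]
print(guided_step(x_t, prompt_signal=0.25, w=1.0))  # same as the conditional branch
print(guided_step(x_t, prompt_signal=0.25, w=2.0))  # pushed further the same way
```

Note that the same function is called twice per step, once per branch. That doubling of denoiser calls is the main runtime cost of this style of guidance.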

2) A scalar worked example with every intermediate

Use the standard formula.

epsilon_guided = eps_u + w (eps_c - eps_u)

Take eps_u = 0.80.

Take eps_c = 0.60.

Take guidance scale w = 3.

First compute the prompt-specific difference:

eps_c - eps_u = 0.60 - 0.80 = -0.20

Then amplify it:

w (eps_c - eps_u) = 3 × (-0.20) = -0.60

Then add it back to the unconditional prediction:

epsilon_guided = 0.80 + (-0.60)
                = 0.20

Good.

The guided noise value is 0.20.

That is the recall answer.

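In words,

the guided prediction moves three times as far from eps_u as eps_c does, so it lands well past the conditional prediction (0.20, not 0.60). The sampler then denoises much more in the prompt-favored direction than either branch alone would suggest.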
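
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch; `cfg_combine` is an illustrative name, not a function from any library.

```python
import math

def cfg_combine(eps_u: float, eps_c: float, w: float) -> float:
    """Classifier-free guidance on a single scalar noise prediction."""
    push = eps_c - eps_u      # prompt-specific difference
    return eps_u + w * push   # amplify the push, add it back

guided = cfg_combine(eps_u=0.80, eps_c=0.60, w=3.0)
print(round(guided, 2))  # 0.2

# Two sanity checks on the formula:
# w = 0 ignores the prompt, w = 1 returns the conditional prediction.
assert math.isclose(cfg_combine(0.80, 0.60, 0.0), 0.80)
assert math.isclose(cfg_combine(0.80, 0.60, 1.0), 0.60)
```

The value is rounded before printing because binary floating point does not represent 0.80 and 0.60 exactly.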

3) Why training can do this without a separate classifier

Earlier guidance methods used a separate classifier, trained on noisy images, and steered sampling with its gradients.

CFG avoids that extra model.

During training, we randomly drop the prompt for a fraction of examples and feed an empty (null) condition instead.

Same U-Net.

Two operating modes.

With prompt.

Without prompt.

That is why it is called classifier-free.

training batch:
some examples keep text ──→ conditional behavior
some examples drop text ──→ unconditional behavior
one shared denoiser learns both

This is elegant for engineering too.

No second network to serve.

No classifier gradients to tune.

Just prompt dropout and a combined inference formula.

That simplicity is a big reason CFG spread everywhere.
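
Prompt dropout itself is only a few lines. Here is a minimal sketch, assuming the "no prompt" input is an empty string and using an illustrative drop rate of 0.1. Both choices are assumptions made for the example, and real training code may use a different null input or rate.

```python
import random
from typing import List

def drop_prompts(prompts: List[str], p_drop: float, rng: random.Random) -> List[str]:
    """Replace each prompt with an empty string with probability p_drop.

    The empty string is an assumed stand-in for the null condition.
    """
    return ["" if rng.random() < p_drop else p for p in prompts]

rng = random.Random(0)  # seeded only so the example is repeatable
batch = ["a red bicycle", "a snowy cabin", "a bowl of ramen", "a paper boat"]
print(drop_prompts(batch, p_drop=0.1, rng=rng))
```

Examples that keep their text train the conditional behavior, and the ones that come back empty train the unconditional behavior, all through the same denoiser.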

4) Why too much guidance creates its own problems

Louder is not always better.

If CFG scale is too low, the prompt feels weak.

If CFG scale is too high, the image can become oversharp, oversaturated, or strangely repetitive.

Diversity drops too.

Many seeds collapse toward one literal interpretation.

Faces can look crispy.

Backgrounds can look forced.

The prompt may be obeyed to the letter while the image stops looking natural.

small w  ──→ softer prompt adherence, more freedom
medium w ──→ useful balance
huge w   ──→ brittle literalism, reduced diversity

That is why tools like Canva and Firefly, along with open-source UIs, have good reason to tune CFG carefully instead of blindly maximizing it.

Guidance is a steering wheel.

Not a magic turbo button.
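
To get a feel for why huge scales become brittle, it can help to sweep w over the numbers from section 2. This is a small sketch: the list of scales is an arbitrary choice and the function names are illustrative. It shows only the arithmetic of extrapolation, not the visual artifacts themselves.

```python
from typing import List, Tuple

def cfg_combine(eps_u: float, eps_c: float, w: float) -> float:
    return eps_u + w * (eps_c - eps_u)

def sweep(eps_u: float, eps_c: float, scales: List[float]) -> List[Tuple[float, float, float]]:
    """Return (w, guided value, distance from eps_c) for each scale."""
    rows = []
    for w in scales:
        guided = cfg_combine(eps_u, eps_c, w)
        rows.append((w, guided, abs(guided - eps_c)))
    return rows

for w, guided, gap in sweep(0.80, 0.60, [0.0, 1.0, 3.0, 7.5, 15.0]):
    print(f"w={w:5.1f}  guided={guided:+.2f}  distance from eps_c={gap:.2f}")
```

The distance from eps_c grows as |w - 1| × |eps_c - eps_u|. With these numbers, each extra unit of scale pushes the guided value another 0.20 away from anything either branch actually predicted.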

Where this lives in the wild

Stable Diffusion WebUI CFG slider — directly exposes guidance scale as a core user control for prompt adherence.
Leonardo AI prompt-strength controls — product knobs often wrap CFG-style steering in friendlier language.
Playground AI negative prompts — a negative prompt typically takes the place of the empty prompt in the prompt-free branch, so guidance pushes away from the unwanted features.
Adobe Firefly prompt tuning — strong text adherence relies on guidance-like mechanisms even when the knob is hidden.
Canva Magic Media — consumer tools balance prompt following against image naturalness through guidance tuning.
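
The negative-prompt item can be made concrete with the same formula. This is a minimal sketch, assuming the negative prompt simply replaces the empty prompt in the prompt-free branch, which is a common convention in open-source pipelines. Whether any particular product does exactly this is not confirmed here, and the numbers are made up.

```python
def guide_away_from(eps_neg: float, eps_c: float, w: float) -> float:
    """CFG formula with the negative-prompt prediction as the baseline."""
    return eps_neg + w * (eps_c - eps_neg)

# Hypothetical scalar predictions, chosen only for easy arithmetic.
eps_neg = 1.0   # prediction when the model sees the negative prompt
eps_c = 0.5     # prediction when the model sees the real prompt
print(guide_away_from(eps_neg, eps_c, w=2.0))  # 0.0
```

The push now points from "what the negative prompt wants" toward "what the prompt wants", so scaling it up moves the sample away from the unwanted features as well as toward the wanted ones.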

Pause and recall

Why does CFG use both a prompt-aware and prompt-free prediction?
In the worked example, what was the guided noise value when eps_u = 0.80, eps_c = 0.60, and w = 3?
Why is prompt dropout important during training?
What goes wrong when CFG scale becomes too large?

Interview Q&A

Q: Why is classifier-free guidance called classifier-free? A: Because the prompt steering comes from conditional and unconditional denoiser predictions rather than from a separately trained classifier. Common wrong answer to avoid: "Because no text encoder is used at all."

Q: Why does subtracting unconditional from conditional predictions help? A: Because that difference isolates the direction in denoising space that is specifically caused by the prompt. Common wrong answer to avoid: "The subtraction simply removes noise and nothing more."

Q: Why can large CFG scales reduce sample diversity? A: Because they over-amplify prompt-specific directions, pushing many samples toward the same sharp literal solution. Common wrong answer to avoid: "Higher CFG always gives a strictly better image."

Q: Why is CFG so widely used in text-to-image products? A: Because it gives a simple and effective knob for prompt adherence without needing a separate classifier model. Common wrong answer to avoid: "CFG is popular only because libraries happened to default to it."

Apply now (5 min)

Quick exercise. Choose your own eps_u, eps_c, and guidance scale, then compute a guided value by hand.

Repeat it once with a much larger guidance scale and ask yourself what kind of visual overcorrection that might cause.

Sketch from memory the formula eps_u + w (eps_c - eps_u).

Under the sketch, write one line on how the blueprint becomes louder without replacing the denoiser.

Bridge. Good. We can now steer toward the prompt. But doing all this in raw pixel space would be expensive. So modern systems usually move the whole game into a smaller latent space. → 09-latent-diffusion.md