08. Text-to-image pipeline: from prompt words to painted pixels

14 minutes. Follow one prompt through the whole latent diffusion loop.

Built on the ELI5 in 00-eli5.md. The translator, the vision-language bridge that turns meaning into numbers, is what keeps the denoiser aligned with the prompt.

1) The full picture before the formulas

See the whole machine first. A text-to-image system does not paint pixels directly, at least not in latent diffusion. It first turns words into embeddings. Then it paints inside a compressed hidden space. Then it decodes that hidden result back into pixels.

So there are really four actors. The text encoder. The denoiser. The scheduler. The VAE decoder.

One more anchor helps. The denoiser keeps working inside the canvas. The final pixel image appears only at the end. That is why latent diffusion is efficient. It avoids pushing a giant pixel grid through every step.

Here is the full flow.

    ┌───────────────┐
    │  text prompt  │
    └───────┬───────┘
            │ tokens
            ▼
    ┌───────────────┐
    │ text encoder  │  CLIP or T5
    └───────┬───────┘
            │ text embeddings
            │
            │            ┌──────────────────────────┐
            │            │ random latent noise z_T  │
            │            └────────────┬─────────────┘
            │                         │
            │                         ▼
            │            ┌──────────────────────────┐
            └──────────→ │  U-Net or DiT denoiser   │
         cross-attention └────────────┬─────────────┘
                                      │ predicted noise
                                      ▼
                         ┌──────────────────────────┐
                         │     scheduler update     │
                         └────────────┬─────────────┘
                                      │ repeat many steps
                                      ▼
                         ┌──────────────────────────┐
                         │     clean latent z_0     │
                         └────────────┬─────────────┘
                                      │
                                      ▼
                         ┌──────────────────────────┐
                         │       VAE decoder        │
                         └────────────┬─────────────┘
                                      │
                                      ▼
                         ┌──────────────────────────┐
                         │    final pixel image     │
                         └──────────────────────────┘

Simple, no? Words do not hit pixels directly. Words guide denoising inside a smaller space.
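To make the four actors concrete, here is a minimal shape-tracing sketch in Python. It moves no real numbers. Every function is a hypothetical stand-in, not a real library API, and only reports the tensor shape its stage would produce. The sizes (77 × 768 text embeddings, a 64 × 64 × 4 latent, a 512 × 512 × 3 image, 30 steps) are the illustrative ones used later in this lesson, not fixed properties of every model.

```python
# Shape-only trace of a latent diffusion pipeline.
# Every function is a hypothetical stand-in, not a real library API.

def text_encoder(prompt, max_tokens=77, embed_dim=768):
    """Stand-in: a real encoder (CLIP or T5) returns one vector per token slot."""
    return (max_tokens, embed_dim)

def sample_latent_noise(height=64, width=64, channels=4):
    """Stand-in: the starting latent z_T is random noise of this shape."""
    return (height, width, channels)

def denoiser(latent_shape, text_shape):
    """Stand-in: predicted noise has the same shape as the latent it reads."""
    return latent_shape

def scheduler_update(latent_shape, noise_shape):
    """Stand-in: the update changes values, never the shape."""
    assert latent_shape == noise_shape
    return latent_shape

def vae_decoder(latent_shape, upscale=8, out_channels=3):
    """Stand-in: the decoder expands each spatial side by `upscale`."""
    height, width, _ = latent_shape
    return (height * upscale, width * upscale, out_channels)

def trace(prompt, num_steps=30):
    text = text_encoder(prompt)
    latent = sample_latent_noise()
    for _ in range(num_steps):  # the loop never leaves latent space
        noise = denoiser(latent, text)
        latent = scheduler_update(latent, noise)
    image = vae_decoder(latent)  # pixels appear only here, at the end
    return {"text": text, "latent": latent, "image": image}

shapes = trace("a red bicycle on a beach")
print(shapes)
# {'text': (77, 768), 'latent': (64, 64, 4), 'image': (512, 512, 3)}
```

Notice that the pixel-sized shape shows up exactly once, after the loop. Everything inside the loop stays at the latent size.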

2) Text encoder: turning words into useful vectors

Start from the prompt. Suppose the user writes, “a red bicycle on a beach.” The model tokenizes the sentence. That means it breaks text into pieces the model knows. Then a text encoder converts those pieces into embeddings. Each embedding is a vector of numbers.

This is where the translator earns its salary. It turns language into something the image model can attend to. Common encoder choices are CLIP and T5. CLIP is strong at image-text alignment. T5 is strong at richer language representation. Different products choose different text encoders. But the job stays the same. Make text numerically useful.

A typical Stable Diffusion style setup might use 77 tokens. Each token may become a 768-dimensional embedding. So the text output shape is:

    77 × 768

How many numbers is that?

    77 × 768 = 59,136

Those 59,136 values are not the image. They are instructions. Think of them as soft constraints. One token may push toward “bicycle.” Another may push toward “red.” Another may push toward “beach.”

The denoiser does not read plain English. It reads these vectors. That is why prompt wording matters. A better sentence gives the translator cleaner signals.
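The tokenize-then-embed idea can be sketched in a few lines of plain Python. This is a toy under loud assumptions: `toy_tokenize` and `toy_embed` are hypothetical helpers, the tokenizer splits on whitespace (real tokenizers use learned subword pieces and typically add start and end markers), and the vectors are random rather than learned. Only the shape bookkeeping matches the numbers above.

```python
import random

MAX_TOKENS = 77   # token slots, as in the example above
EMBED_DIM = 768   # numbers per token, as in the example above
PAD = "<pad>"

def toy_tokenize(prompt, max_tokens=MAX_TOKENS):
    """Split on whitespace and pad to a fixed length.
    Real tokenizers split into learned subword pieces instead."""
    words = prompt.lower().replace(".", "").split()
    words = words[:max_tokens]
    return words + [PAD] * (max_tokens - len(words))

def toy_embed(tokens, embed_dim=EMBED_DIM, seed=0):
    """Look up one random vector per distinct token.
    A real encoder learns these vectors and then mixes them with attention."""
    rng = random.Random(seed)
    table = {}
    for token in tokens:
        if token not in table:
            table[token] = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
    return [table[token] for token in tokens]

tokens = toy_tokenize("a red bicycle on a beach.")
embeddings = toy_embed(tokens)

rows, cols = len(embeddings), len(embeddings[0])
print(rows, cols, rows * cols)  # 77 768 59136
```

Both occurrences of “a” map to the same vector here. In a real encoder the surrounding words would change each token's final embedding, which is part of why wording matters.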

3) Start from pure noise in latent space

Now come to the image side. Latent diffusion does not usually start from a blank canvas of pixels. It starts from noise inside a latent tensor. Suppose the target image size is 512 × 512. A common latent size is 64 × 64 × 4. Let us compare the raw counts.

    Pixel image values: 512 × 512 × 3 = 786,432
    Latent values: 64 × 64 × 4 = 16,384

Now divide.

    786,432 ÷ 16,384 = 48

So the model denoises about 48 times fewer values. That is the big win. The heavy work happens on a much smaller board. That smaller board is the canvas for latent diffusion. The final image is hidden there first.

At step T, we start with random latent noise, often written as z_T. Every location is basically meaningless static. Then the model runs a denoising loop. Each step predicts how much noise is present. The scheduler uses that prediction to update the latent. The rough rule is: noisy latent in, slightly cleaner latent out. Repeat many times.

Early steps fix broad composition. Middle steps fix object layout. Late steps sharpen local detail. Look. This is why you often see a ghost image appear first. Then it becomes more coherent. Then it becomes crisp.
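The counts and the “noisy latent in, slightly cleaner latent out” rule can be checked with a small Python sketch. The loop is a toy: `toy_denoiser` cheats by knowing the clean target, which a trained denoiser has to estimate, and `toy_scheduler_step` with its fixed `step_size` is a hypothetical simplification of a real scheduler. The four-value “latent” is only there so the loop runs instantly.

```python
import random

# Value counts from the text above.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
print(pixel_values, latent_values, pixel_values // latent_values)  # 786432 16384 48

# Toy loop on a tiny 1-D "latent".
rng = random.Random(0)
target = [0.5, -1.0, 0.25, 0.0]                 # pretend clean latent z_0 (hypothetical)
latent = [rng.gauss(0.0, 1.0) for _ in target]  # random start z_T

def toy_denoiser(z):
    """Cheating stand-in: returns the exact noise, z minus the known target.
    A trained denoiser has to estimate this from the latent and the prompt."""
    return [zi - ti for zi, ti in zip(z, target)]

def toy_scheduler_step(z, noise, step_size=0.2):
    """Remove a fraction of the predicted noise: slightly cleaner latent out."""
    return [zi - step_size * ni for zi, ni in zip(z, noise)]

def distance(z):
    """Total absolute gap between the current latent and the clean target."""
    return sum(abs(zi - ti) for zi, ti in zip(z, target))

start_error = distance(latent)
for _ in range(30):
    latent = toy_scheduler_step(latent, toy_denoiser(latent))
end_error = distance(latent)

print(end_error < start_error)  # True
```

Each pass removes a fifth of the remaining gap, so no single step does much, but thirty of them leave almost nothing of the original static.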

4) U-Net or DiT: the denoiser doing the hard work

The denoiser is the heart. In classic Stable Diffusion, it is a U-Net. In newer systems, it may be a DiT, a diffusion transformer. Do not get stuck on names first. The job is the same. Given a noisy latent, predict the noise pattern to remove.

How does the prompt enter this loop? Through cross-attention. Cross-attention lets image-side features look at text embeddings. So each denoising block can ask, “Which words matter here?” One region may attend more to “bicycle.” Another may attend more to “beach.” Another may attend more to “red.” That is how text conditioning gets injected everywhere. Not only once. At many layers. At many steps.

A small mental picture:

    ┌──────────── noisy latent ────────────┐
    │ shape: 64 × 64 × 4                   │
    └──────────────────┬───────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │ denoiser block  │ ◄──── text embeddings
              └────────┬────────┘       via cross-attention
                       │
                       ▼
              ┌─────────────────┐
              │ denoiser block  │ ◄──── text embeddings again
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │ predicted noise │
              └─────────────────┘

So what to do when prompt following is weak? Better text encoder. Better cross-attention training. Better guidance. Maybe better captions in training too. Because if text signals are weak, the canvas drifts. The image may look pretty, but not obedient.
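Here is a minimal single-head cross-attention sketch in plain Python, to show what “image-side features look at text embeddings” means mechanically. Everything in it is an assumption for illustration: the 2-D word vectors and the two query vectors are made up, and the learned query, key, and value projections that a real denoiser block applies are left out. A real model would run this over thousands of latent positions and all 77 token slots.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Each image-side query takes a weighted average of the text values.
    Learned query/key/value projections are omitted in this sketch."""
    dim = len(text_keys[0])
    outputs, all_weights = [], []
    for query in image_queries:
        # Scaled dot product: how well does this location match each word?
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, text_values))
                 for i in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-D embeddings for three prompt words.
words = ["red", "bicycle", "beach"]
text = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

# Two latent locations: one resembles a bicycle region, one a beach region.
queries = [[0.0, 3.0], [-3.0, 0.0]]

outputs, weights = cross_attention(queries, text, text)
for w in weights:
    print(words[w.index(max(w))])  # bicycle, then beach
```

The weights are the answer to “Which words matter here?” for each location. Because the same text keys and values are reused at every block and every step, the prompt keeps steering the latent throughout the loop.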

5) Guidance, decoding, and one worked example

Now the last two pieces. Classifier-free guidance. And VAE decoding.

Classifier-free guidance means we run the denoiser under two text conditions. One with the actual prompt. One with an empty or null prompt. Why both? Because the empty prompt shows what the model wants naturally. The real prompt shows what the user wants specifically. Then we combine them. A common form is:

    guided noise = uncond + scale × (cond - uncond)

If scale is larger, prompt pressure becomes stronger. Too large, and the image may become harsh or distorted.

Now let us trace one prompt fully. Prompt: “a red bicycle on a beach.”

Step 1: tokenize the text. Suppose the system uses 77 slots. The actual words fill 8 slots. The rest are padding.

Step 2: encode the text. Shape becomes 77 × 768. Total numbers = 59,136.

Step 3: sample starting latent noise. Shape becomes 64 × 64 × 4. Total numbers = 16,384.

Step 4: first denoising step. The denoiser reads the noisy latent, and cross-attends to the 77 × 768 text embeddings. Output shape stays 64 × 64 × 4. It predicts noise, not pixels.

Step 5: apply classifier-free guidance. Suppose at one latent location, the unconditional prediction is 0.50. The text-conditioned prediction is 0.35. Guidance scale = 4.0. Now compute carefully.

    cond - uncond = 0.35 - 0.50 = -0.15
    scale × difference = 4.0 × (-0.15) = -0.60
    guided noise = 0.50 + (-0.60) = -0.10

That means this location gets pulled more strongly toward the prompt-driven direction. Yes, a noise value can be negative. That is normal.

Step 6: scheduler update. Suppose the current latent value at that location is 0.82. Suppose this scheduler subtracts 0.1 × guided noise.

    0.1 × guided noise = 0.1 × (-0.10) = -0.01
    new latent = 0.82 - (-0.01) = 0.83

One tiny change means nothing alone. But the model does this across all 16,384 latent values, for many steps. That is the magic.

Step 7: repeat the loop. Maybe 30 denoising steps, counted down from step 30 to step 1. At step 25, you mostly see rough coast, wheel placement, and horizon. At step 10, the bicycle frame becomes clearer. At step 1, small texture details settle.

Step 8: decode with the VAE decoder.

    Input shape: 64 × 64 × 4
    Output shape: 512 × 512 × 3

The decoder expands the hidden image into pixels. So the user finally sees a normal bitmap, not a latent tensor.

That is the complete loop. Text meaning from the translator. Iterative painting on the canvas. Then VAE decoding back to the visible world.
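The arithmetic in Steps 5 and 6 can be replayed in Python. This is a sketch of the simplified numbers in the worked example only. `toy_scheduler_update` and its fixed `step_size=0.1` are hypothetical; real schedulers use coefficients that change from step to step.

```python
def classifier_free_guidance(uncond, cond, scale):
    """guided = uncond + scale * (cond - uncond), as in the formula above."""
    return uncond + scale * (cond - uncond)

def toy_scheduler_update(latent, guided_noise, step_size=0.1):
    """The simplified rule from the worked example: subtract a fraction
    of the guided noise. Real schedulers use step-dependent coefficients."""
    return latent - step_size * guided_noise

# Step 5: one latent location, guidance scale 4.0.
guided = classifier_free_guidance(uncond=0.50, cond=0.35, scale=4.0)

# Step 6: update the latent value at that location.
new_latent = toy_scheduler_update(latent=0.82, guided_noise=guided)

print(round(guided, 2), round(new_latent, 2))  # -0.1 0.83

# Two useful sanity checks on the formula:
same_as_cond = classifier_free_guidance(0.50, 0.35, scale=1.0)    # plain conditional pass
same_as_uncond = classifier_free_guidance(0.50, 0.35, scale=0.0)  # prompt ignored
```

A scale of 1.0 gives back the conditional prediction and 0.0 gives back the unconditional one. Values above 1.0 extrapolate past the conditional prediction, which is where the extra prompt pressure comes from.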

Where this lives in the wild

Stability AI SDXL uses a latent diffusion pipeline with text encoders plus a denoiser operating in compressed latent space.
Black Forest Labs FLUX.1 uses strong text conditioning with a transformer-style denoiser, pushing prompt following and image quality.
Google Imagen 3 uses powerful text understanding to improve prompt fidelity before and during denoising.
Adobe Firefly uses text-to-image diffusion inside Photoshop and Express, where editability and pixel decode quality matter.
Microsoft Designer and Copilot Image Creator serve prompt-to-image generation through a production text-conditioning pipeline similar in spirit.

Pause and recall

Why does latent diffusion start from noise instead of a blank pixel image?
Why is 64 × 64 × 4 cheaper to denoise than 512 × 512 × 3?
What job does cross-attention do inside the denoiser?
Why do we need both unconditional and conditional predictions for guidance?

Interview Q&A

Q1. Why use latent diffusion and not pixel-space diffusion for many products?

Because latent diffusion cuts the working size sharply, which lowers compute and memory cost. You still recover pixels at the end through the VAE decoder. So you keep much of the quality with much better efficiency. Common wrong answer to avoid: “Latent diffusion is only a compression trick with no modeling impact.”

Q2. Why inject text with cross-attention instead of only concatenating the prompt once at the input?

Because prompt relevance changes by region, layer, and denoising step. Cross-attention lets different image features consult different words repeatedly. That gives much tighter control. Common wrong answer to avoid: “One text vector at the start is always enough.”

Q3. Why use classifier-free guidance and not just trust the text-conditioned pass alone?

Because the unconditional pass estimates the model’s natural prior. The difference between the conditional and unconditional predictions gives the direction toward the prompt, and the guidance scale sets how hard to push along it. That controllable push is the whole point. Common wrong answer to avoid: “Guidance just makes images sharper, nothing more.”

Q4. Why keep a VAE decoder at the end if the denoiser already made the image internally?

Because the denoiser works in latent space, not in visible pixels. The VAE decoder is what turns the compact hidden representation back into a standard image grid. Without it, you do not have a user-facing picture. Common wrong answer to avoid: “The denoiser already outputs pixels directly in latent diffusion.”

Apply now (5 min)

Quick exercise. Take the prompt, “a blue auto-rickshaw in heavy rain.” Write the shape at each stage. Text embeddings. Starting latent. Predicted noise. Decoded image. Now sketch from memory. Draw one arrow chain from prompt to pixels. Add one side arrow from text embeddings into the denoiser. Then write the guidance formula below it. If you can redraw that loop, you understand the pipeline, not just the buzzwords.

Bridge. The pipeline generates freely. But what if you want control — specific poses, edges, or regions? That needs extra steering. → 09-image-editing-and-control.md