10. Text-to-image architecture: Stable Diffusion as a whole factory
~12 min read. The thing that turns separate ideas into one product pipeline.
Builds on the ELI5 in 00-eli5.md. The blueprint (the text conditioning signal) flows through a full factory here: text encoder, noisy latent, U-Net denoiser, sampler, and VAE decoder.
1) See the whole factory before any formula
Do not stare at one network first.
See the whole factory.
A text-to-image system is a relay race.
The prompt is encoded.
Noise lives in latent space.
The U-Net denoises while reading text context.
The scheduler decides the step path.
The VAE decoder turns the final latent into pixels.
┌────────┐  tokens  ┌──────────────┐ context  ┌──────────┐  latent path  ┌─────────┐
│ prompt │ ───────→ │ text encoder │ ───────→ │  U-Net   │ ────────────→ │ decoder │
└────────┘          └──────────────┘          └────┬─────┘               └────┬────┘
                                                   ▲                          │
                          noisy latent z_t ────────┘                          ▼
                                                                            image
That picture saves hours of confusion.
If prompt adherence is weak, maybe the text side is weak.
If composition is fine but textures are muddy, maybe the sampler or decoder is the bottleneck.
If repeated runs differ wildly, maybe scheduler choice or seed handling is the issue.
SDXL, Adobe Firefly, Microsoft Designer, and Leonardo-style tools all live or die by this system view.
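The relay can be sketched as plain function composition. This is a toy under stated assumptions: every function below (`encode_text`, `init_latent`, `denoise_step`, `decode_latent`, `generate`) is a hypothetical stand-in that only tracks shapes and seeds. None of them is a real model or a real library API.

```python
import random

# Hypothetical stand-ins: each stage only reports the shape it would produce.
TEXT_TOKENS, TEXT_WIDTH = 77, 768      # shapes from the worked example in section 2
LATENT_SHAPE = (64, 64, 4)
IMAGE_SHAPE = (512, 512, 3)


def encode_text(prompt):
    """Text encoder stage: prompt -> context of shape (tokens, width)."""
    return {"prompt": prompt, "shape": (TEXT_TOKENS, TEXT_WIDTH)}


def init_latent(seed):
    """Seeded starting noise; the same seed gives the same starting point."""
    rng = random.Random(seed)
    return {"shape": LATENT_SHAPE, "fingerprint": rng.random()}


def denoise_step(latent, context, step):
    """U-Net stage: reads latent + text context, returns a same-shape latent."""
    return {**latent, "steps_done": step + 1}


def decode_latent(latent):
    """VAE decoder stage: final latent -> pixels."""
    return {"shape": IMAGE_SHAPE, "from_latent": latent}


def generate(prompt, seed=0, num_steps=20):
    context = encode_text(prompt)            # 1. the prompt is encoded
    latent = init_latent(seed)               # 2. noise lives in latent space
    for step in range(num_steps):            # 3. the scheduler sets the step path
        latent = denoise_step(latent, context, step)
    return decode_latent(latent)             # 4. the decoder produces pixels


image = generate("a red umbrella on a night street", seed=42, num_steps=20)
print(image["shape"])
```

Keeping each stage behind its own function is the point of the sketch: any one stage can be swapped or inspected without touching the others, and a fixed seed pins down the starting noise.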
2) A worked shape example across the pipeline
Use a standard 512 × 512 generation setup.
Suppose the tokenizer keeps up to 77 text tokens.
Suppose the text encoder outputs width 768 per token.
Then the text context shape is 77 × 768.
Good.
Remember that shape. The recall questions ask for it.
Now the image side.
The noisy latent is 64 × 64 × 4: the VAE shrinks each side by a factor of 8 (512 / 8 = 64) and keeps 4 latent channels.
The U-Net takes that latent and the 77 × 768 context together.
Its output is another 64 × 64 × 4 tensor, usually an epsilon prediction.
After many denoising steps, the final latent z_0 still has shape 64 × 64 × 4.
Then the VAE decoder maps it back to 512 × 512 × 3 pixels.
prompt tokens    : 77
text context     : 77 × 768
noisy latent z_t : 64 × 64 × 4
U-Net output     : 64 × 64 × 4
decoded image    : 512 × 512 × 3
This is why cross-modal architecture work matters.
Two streams with different shapes meet inside one denoising loop.
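The shapes in this section follow from a handful of numbers, which the sketch below makes explicit. It assumes a VAE that downsamples each spatial side by 8 and uses 4 latent channels, which matches the 512 to 64 relationship in the example. The function name `pipeline_shapes` and its parameter names are illustrative, not taken from any library.

```python
def pipeline_shapes(height=512, width=512, max_tokens=77, text_width=768,
                    vae_factor=8, latent_channels=4):
    """Return the tensor shapes at each stage of the worked example."""
    if height % vae_factor or width % vae_factor:
        raise ValueError("image sides must be divisible by the VAE factor")
    latent = (height // vae_factor, width // vae_factor, latent_channels)
    return {
        "text_context": (max_tokens, text_width),
        "noisy_latent": latent,
        "unet_output": latent,          # the prediction has the latent's shape
        "decoded_image": (height, width, 3),
    }


shapes = pipeline_shapes()
for name, shape in shapes.items():
    print(f"{name:14s}: {' x '.join(map(str, shape))}")

# The U-Net works on far fewer values than the pixel grid holds.
pixels = 512 * 512 * 3
latent_values = 64 * 64 * 4
print(pixels // latent_values)  # 48
```

The factor of 48 is why denoising happens in latent space: every step touches a much smaller tensor than the final image.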
3) Why cross-attention is the hinge of text-to-image
Cross-attention is where language actually touches visual denoising.
Without it, the prompt would be a weak global tag.
With it, latent features can query token features directly.
That lets the model decide: "this region should listen to 'red umbrella', that region should listen to 'night street'."
So cross-attention is the hinge.
Not because it is fashionable.
Because it gives the blueprint a precise place to act.
That is why prompt wording, tokenization, and text encoder quality matter so much in real products.
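A minimal sketch of the mechanism, in plain Python so the arithmetic stays visible. It is single-head scaled dot-product attention with queries taken from latent positions and keys and values taken from text tokens. As a simplifying assumption it skips the learned query, key, and value projections a real U-Net block would apply, and the tiny vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_queries, token_keys, token_values):
    """Single-head scaled dot-product cross-attention.

    latent_queries: one vector per latent position (image side)
    token_keys / token_values: one vector per text token (text side)
    Returns (outputs, weights), one row per latent position.
    """
    dim = len(token_keys[0])
    outputs, weights = [], []
    for q in latent_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in token_keys]
        w = softmax(scores)                       # how much each token is heard
        out = [sum(w_t * v[i] for w_t, v in zip(w, token_values))
               for i in range(len(token_values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy setup: 2 latent positions, 2 tokens, feature width 2.
tokens = ["red umbrella", "night street"]
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[4.0, 0.0],    # a region whose features align with token 0
           [0.0, 4.0]]    # a region whose features align with token 1

outputs, weights = cross_attention(queries, keys, values)
for region, w in enumerate(weights):
    print(region, tokens[w.index(max(w))])
```

Each latent position gets its own row of weights over the tokens, which is what lets different regions listen to different words.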
4) Where failures can enter the pipeline
Failures can enter almost anywhere.
The tokenizer may split a phrase badly.
The text encoder may under-represent a style concept.
The U-Net may miss object relations.
The scheduler may use too few steps.
The VAE may wash out detail.
weak tokens    ──→ weak meaning
weak attention ──→ weak prompt binding
weak sampler   ──→ weak refinement time
weak decoder   ──→ weak final crispness
Mature debugging goes stage by stage.
Do not blame the prompt for everything.
Do not blame the U-Net for everything either.
A text-to-image product is a system.
Treat it like one.
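One way to organise stage-by-stage debugging is a one-factor-at-a-time sweep: hold the prompt, seed, and every other setting fixed, then change a single stage per run. The sketch below only builds the run configurations. All keys and values (`encoder-A`, `scheduler-B`, and so on) are hypothetical placeholders, not real model or scheduler names.

```python
BASELINE = {
    "prompt": "a red umbrella on a night street",
    "seed": 42,
    "text_encoder": "encoder-A",
    "scheduler": "scheduler-A",
    "num_steps": 20,
    "vae": "vae-A",
}

# One alternative per stage; each maps to a failure listed above.
ALTERNATIVES = {
    "text_encoder": "encoder-B",   # style concept under-represented?
    "scheduler": "scheduler-B",    # sampler path at fault?
    "num_steps": 50,               # too few steps?
    "vae": "vae-B",                # decoder washing out detail?
}


def one_factor_runs(baseline, alternatives):
    """Yield (changed_key, config) pairs that differ from baseline in one key."""
    for key, value in alternatives.items():
        config = dict(baseline)    # copy so the baseline is never mutated
        config[key] = value
        yield key, config


runs = list(one_factor_runs(BASELINE, ALTERNATIVES))
for changed, config in runs:
    print(f"vary {changed:12s} -> {config[changed]}")
```

If the output improves only when one particular stage changes, that stage is the first place to look.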
Where this lives in the wild
- Stable Diffusion XL: combines two text encoders, a latent U-Net, a scheduler, and a VAE decoder into a full text-to-image product stack.
- Adobe Firefly: production quality depends on the coordination of prompt encoding, denoising, and final decoding.
- Microsoft Designer: the end-to-end user experience comes from the whole pipeline, not just the U-Net checkpoint.
- Canva Magic Media: prompt interpretation, denoising speed, and decode quality all shape perceived usefulness.
- Leonardo AI: prompt adherence and style control live at the architecture level, not only in prompt wording.
Pause and recall
- What are the core modules in a Stable Diffusion style text-to-image system?
- In the worked example, what text-context shape did we use?
- Why is cross-attention the hinge between language and image generation?
- What are three different places where the pipeline can fail?
Interview Q&A
Q: Why is a text-to-image model really a system rather than a single network?
A: Because prompt encoding, denoising, scheduling, and decoding are separate stages that must all work together.
Common wrong answer to avoid: "Once you have the U-Net, the rest is packaging."
Q: Why does cross-attention matter so much?
A: Because it lets latent image features query token information directly, so prompt meaning can influence local denoising decisions.
Common wrong answer to avoid: "Cross-attention is only a minor optimization for speed."
Q: Why can two products with similar checkpoints feel different?
A: Because text encoder setup, sampler defaults, classifier-free guidance (CFG) tuning, and VAE quality change the full system behavior.
Common wrong answer to avoid: "If the checkpoint is the same, the product result must be identical."
Q: Why should engineers debug the pipeline stage by stage?
A: Because different failures originate in different modules, and blaming the prompt for everything wastes time.
Common wrong answer to avoid: "All image quality issues are really prompt-writing issues."
Apply now (5 min)
Quick exercise. Draw the pipeline from prompt to final image and label each module with one job.
Then invent one failure mode for the text encoder, one for the U-Net, and one for the VAE.
Sketch from memory the two streams: prompt stream and noisy-latent stream meeting inside the U-Net.
Under the sketch, write one line on how the blueprint becomes useful only after the architecture gives it a place to act.
Bridge. Good. We have the full factory now. But products often need stronger structural control than a prompt alone can give. That is where ControlNet and image-to-image workflows enter. → 11-controlnet-image-to-image.md