07. Image generation families: four ways to paint on the same canvas

12 minutes. One prompt. Four model families. See the trade-offs clearly.

Built on the ELI5 in 00-eli5.md. The canvas, the generation space where images are painted, is exactly what each family tries to shape differently.

1) One prompt, four painting habits

Start with a simple picture. Prompt: “a red bicycle parked on a beach at sunset.” All four families want the final image. But they walk to it very differently.

Think of four painters in one classroom.

One painter keeps bluffing past a strict critic. That is GAN.
One painter squeezes ideas through a small memory bottle. That is VAE.
One painter starts with static, then slowly cleans the mess. That is diffusion.
One painter writes the picture one unit at a time. That is autoregressive generation.

So first keep the mental movie clear. The goal is the same. The route is not the same. Here is the side-by-side map.

┌─────────────────────┬─────────────────────┬────────────────────────┬──────────────────────────┐
│ GAN                 │ VAE                 │ Diffusion              │ Autoregressive           │
├─────────────────────┼─────────────────────┼────────────────────────┼──────────────────────────┤
│ z noise             │ image → encoder     │ pure noise in          │ start token              │
│     │               │     │               │     │                  │     │                    │
│     ▼               │     ▼               │     ▼                  │     ▼                    │
│ generator           │ latent bottleneck   │ denoiser step 1        │ predict next token       │
│     │               │     │               │     │                  │     │                    │
│     ▼               │     ▼               │     ▼                  │     ▼                    │
│ fake image ──→      │ decoder             │ denoiser step 2        │ append token             │
│ discriminator       │     │               │     │                  │     │                    │
│     ▲               │     ▼               │     ▼                  │     ▼                    │
│ real image          │ reconstruction      │ ... many steps ...     │ ... many steps ...       │
│                     │                     │                        │                          │
│ train by contest    │ train by compress   │ reverse learned noise  │ sequential prediction    │
└─────────────────────┴─────────────────────┴────────────────────────┴──────────────────────────┘

Notice one thing. GAN and diffusion are mainly generators. VAE is compression plus generation. Autoregressive is sequence modeling turned toward images.

Also notice where the canvas lives. GAN shapes it by adversarial pressure. VAE shapes it by compression. Diffusion shapes it by denoising paths. Autoregressive shapes it as a long sequence of decisions.

2) GANs: generator versus discriminator

GAN means Generative Adversarial Network. Two networks fight during training. The generator makes fake images. The discriminator checks if they look real. If the discriminator is strong, it punishes obvious nonsense. If the generator improves, it fools the discriminator more often. So training feels like a cat-and-mouse game. Simple, no?

Picture it like this.

┌──────────────┐
│   random z   │
└──────┬───────┘
       │
       ▼
┌──────────────┐   fake image    ┌──────────────────┐
│  generator   │ ──────────────→ │  discriminator   │ ──→ real or fake?
└──────────────┘                 └────────┬─────────┘
       ▲                                  │
       └──────── training signal ─────────┘

Why people liked GANs so much: they sample very fast at inference. Usually one forward pass is enough. That means low latency. Interactive products love that.

But the price comes in training. The fight can become unstable. Two famous failures appear again and again. Mode collapse is one. Oscillation is another.

Mode collapse means the generator finds one winning trick. Then it keeps painting the same face, car, or beach again and again.

Oscillation means the two players keep chasing each other. Today one strategy wins. Tomorrow it fails. The loss curves jump around.

So what to do? People use tricks. Feature matching. Gradient penalties. Careful learning rates. Balanced updates. Still, for open-ended prompt generation, GANs became harder to scale cleanly.

GANs shine when the domain is tight. Faces. Shoes. Specific art styles. Synthetic avatars. In those cases, fast inference matters a lot. The critic can also teach sharp texture.

You can think of GAN output like this. Strong local sharpness. Weak coverage of all possibilities. That is why GAN images can look crisp, yet surprisingly repetitive.
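
The cat-and-mouse game can be sketched as two opposing losses. This is a minimal illustration, assuming the common binary cross-entropy setup with a non-saturating generator loss, which the text above does not specify. The probability lists are made-up discriminator outputs, not measurements.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the critic: real images should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the critic rates fakes as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that an image is real).
d_real = [0.9, 0.8, 0.95]
weak_fakes = [0.1, 0.2, 0.05]    # critic easily spots these
strong_fakes = [0.6, 0.7, 0.55]  # critic is partly fooled

print("generator loss:", generator_loss(weak_fakes), generator_loss(strong_fakes))
print("critic loss:   ", discriminator_loss(d_real, weak_fakes),
      discriminator_loss(d_real, strong_fakes))
```

When the fakes improve, the generator loss drops and the discriminator loss rises. One player's gain is the other's pain, which is why the curves can keep chasing each other.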

3) VAEs: compress, bottleneck, rebuild

VAE means Variational Autoencoder. Do not fear the name. The mental model is simple. An encoder shrinks the image into a smaller latent code. A decoder rebuilds the image from that code. So the system learns a compact canvas.

This compact space is smooth. Nearby points give nearby images. That smoothness is a big advantage. Interpolation looks sensible. Move from one point to another, and the image changes gradually.

Look at the pipeline.

┌──────────────┐    ┌──────────────────┐    ┌──────────────┐
│ input image  │ ─→ │     encoder      │ ─→ │   latent z   │
└──────────────┘    └──────────────────┘    └──────┬───────┘
                                                   │
                                                   ▼
                                            ┌──────────────────┐
                                            │     decoder      │
                                            └──────┬───────────┘
                                                   │
                                                   ▼
                                            ┌──────────────┐
                                            │ rebuilt image│
                                            └──────────────┘

Why does a VAE often blur? Because it must compress a lot. Fine details get averaged away. If many sharp answers are possible, the decoder may pick a safe average. Averaging gives soft edges. That is why VAE samples can feel washed out. The shapes are reasonable. The texture can be weak.

Still, VAEs are very useful. They give stable latent spaces. They are great for representation learning. They also appear inside diffusion systems. Yes, that matters. Many modern text-to-image systems use a VAE, not as the main generator, but as the front and back door to the canvas.

One more anchor. In VAE-style compression, pixels are squeezed before painting continues. So the model works on a smaller board, not on every raw patch at full resolution.
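
The "safe average" argument can be checked with a tiny example. This is a sketch under one stated assumption: the decoder is scored by pixel-wise squared error, which is a common choice but not something the text above specifies. Real VAEs also carry a KL term that is left out here. The two four-pixel "images" are invented for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def expected_mse(pred, candidates):
    """Average squared error when each candidate is equally likely to be the truth."""
    return sum(mse(pred, c) for c in candidates) / len(candidates)

# Two equally plausible sharp "images": a 1-D edge at two different positions.
edge_left = [0.0, 1.0, 1.0, 1.0]
edge_right = [0.0, 0.0, 1.0, 1.0]
candidates = [edge_left, edge_right]

# The pixel-wise average is a soft edge.
blurry = [(a + b) / 2 for a, b in zip(edge_left, edge_right)]

print("blurry guess:", blurry)
for name, pred in [("left", edge_left), ("right", edge_right), ("blurry", blurry)]:
    print(name, expected_mse(pred, candidates))
```

The soft edge has the lowest expected error, even though it matches neither sharp answer. A decoder trained on that kind of loss is rewarded for hedging, and hedging looks like blur.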

4) Diffusion: add noise, then remove it

Diffusion is the dominant family today. The idea starts backward. We destroy images with noise. Then we learn to reverse that destruction.

During training, a clean image gets more and more noisy. At a randomly chosen noise level, the model learns to predict what noise was added. During sampling, we begin from pure noise. Then the model removes noise step by step. So the generation loop feels patient. Very patient.

This is why diffusion wins on quality. Each step can correct structure, lighting, texture, and composition a little more. The downside is speed. One image may need 20, 30, or 50 denoising steps. So latency is worse than GANs. But fidelity is usually better. Prompt control is better too.

A simple denoising view:

noise x_T ──→ x_{T-1} ──→ x_{T-2} ──→ ... ──→ x_1 ──→ clean image x_0

At early steps, only rough layout appears. At middle steps, large objects settle. At late steps, fine texture and edges sharpen.

This is why diffusion handles open prompts so well. It can revise itself repeatedly. The painter gets many chances. In other words, the canvas is not painted in one gamble. It is cleaned gradually. That lowers the chance of one catastrophic mistake.
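
The "learn what noise was added" idea can be sketched at a single noise level. This assumes a DDPM-style mixing rule, x_t = sqrt(alpha_bar)·x_0 + sqrt(1 − alpha_bar)·noise, which is one common formulation and not something the text above commits to. The name `alpha_bar` and the four-pixel "image" are illustrative. The trained denoiser is replaced by an oracle that already knows the true noise.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward step: mix a clean signal with Gaussian noise at one noise level."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, eps_pred, alpha_bar):
    """Invert the mix, given a prediction of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [0.2, 0.8, 0.5, 0.1]  # toy "image" with four pixels
xt, eps = add_noise(x0, alpha_bar=0.3, rng=rng)

# Stand-in for a trained denoiser: here we cheat and hand it the true noise.
x0_hat = remove_noise(xt, eps, alpha_bar=0.3)
print("noisy:    ", [round(v, 3) for v in xt])
print("recovered:", [round(v, 3) for v in x0_hat])
```

With a perfect noise guess, the clean image comes back exactly. A real model only guesses approximately, which is one reason sampling takes many small steps instead of one big jump.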

5) Autoregressive image models: one token after another

Autoregressive models say, “Treat the image as a sequence.” Then predict the next unit. That unit may be a pixel. More often, it is a token from a visual codebook. So the model sees image generation like sentence generation. One step. Then the next. Then the next.

This is conceptually elegant. Language and image can share one modeling idea. That is why early systems like ImageGPT and DALL·E 1 mattered. They unified text and image under sequence prediction. Same broad recipe. Different token types.

But there is a catch. Sampling is sequential. If you need many tokens, you wait many steps. Also, an early mistake can echo forward. A wrong roof token may hurt later windows. A wrong wheel token may hurt later spokes. This family thinks strongly in units. Almost like placing one patch after another. So the sequence view is clean, but long generations can be slow.

Now let us compare all four on one prompt.

Worked example. Prompt: “a red bicycle parked on a beach at sunset.” Suppose a team rates each output on four axes. Sharpness out of 10. Prompt match out of 10. Diversity across four samples out of 10. Time in seconds. Here are imagined but realistic numbers.

- GAN: sharpness 8, prompt match 5, diversity 4, time 0.2
- VAE: sharpness 5, prompt match 6, diversity 7, time 0.3
- Diffusion: sharpness 9, prompt match 9, diversity 8, time 4.0
- Autoregressive: sharpness 6, prompt match 7, diversity 6, time 7.0

Now compute one crude product score. Use:

score = 0.35×sharpness + 0.35×prompt match + 0.20×diversity + 0.10×speed score

Let speed score = 10 - time, with a floor at 1. So first get speed scores.

- GAN speed score = 10 - 0.2 = 9.8
- VAE speed score = 10 - 0.3 = 9.7
- Diffusion speed score = 10 - 4.0 = 6.0
- Autoregressive speed score = 10 - 7.0 = 3.0

Now calculate each final score.

- GAN: 0.35×8 + 0.35×5 + 0.20×4 + 0.10×9.8 = 2.80 + 1.75 + 0.80 + 0.98 = 6.33
- VAE: 0.35×5 + 0.35×6 + 0.20×7 + 0.10×9.7 = 1.75 + 2.10 + 1.40 + 0.97 = 6.22
- Diffusion: 0.35×9 + 0.35×9 + 0.20×8 + 0.10×6.0 = 3.15 + 3.15 + 1.60 + 0.60 = 8.50
- Autoregressive: 0.35×6 + 0.35×7 + 0.20×6 + 0.10×3.0 = 2.10 + 2.45 + 1.20 + 0.30 = 6.05

What did we learn? Diffusion wins on overall quality. GAN wins on speed. Those two show up directly in the numbers. The other two strengths sit outside this score: VAE wins on smooth latent control, and autoregressive wins on one-model sequence simplicity.

So the question is never, “Which family is universally best?” The better question is, “Best for what pressure?”
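
The same arithmetic as a small script, in case you want to change the weights and watch the ranking move. The ratings are the imagined numbers from the worked example. The function and variable names are only illustrative.

```python
WEIGHTS = {"sharpness": 0.35, "prompt_match": 0.35, "diversity": 0.20, "speed": 0.10}

def speed_score(seconds):
    """10 minus the time in seconds, floored at 1."""
    return max(1.0, 10.0 - seconds)

def product_score(sharpness, prompt_match, diversity, seconds):
    """Weighted sum of the four axes from the worked example."""
    return (WEIGHTS["sharpness"] * sharpness
            + WEIGHTS["prompt_match"] * prompt_match
            + WEIGHTS["diversity"] * diversity
            + WEIGHTS["speed"] * speed_score(seconds))

# (sharpness, prompt match, diversity, time in seconds)
ratings = {
    "GAN": (8, 5, 4, 0.2),
    "VAE": (5, 6, 7, 0.3),
    "Diffusion": (9, 9, 8, 4.0),
    "Autoregressive": (6, 7, 6, 7.0),
}

scores = {name: round(product_score(*r), 2) for name, r in ratings.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} {score:.2f}")
```

Try raising the speed weight to 0.50 and shrinking the others. The ranking is a property of the pressure you choose, not of the families alone.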

Where this lives in the wild

NVIDIA Canvas and earlier GauGAN-style demos use GAN ideas for near-instant landscape synthesis from rough semantic layouts.
Stability AI SDXL uses latent diffusion as the main generator, because quality and prompt control matter more than one-shot speed.
Adobe Firefly uses diffusion-style generation inside creative tools, where commercial-quality detail and editability matter.
OpenAI DALL·E 1 used an autoregressive token pipeline, showing how image generation could align with language-style sequencing.
Stable Diffusion pipelines use a VAE for latent compression and decoding, giving a practical bottleneck around the main denoiser.

Pause and recall

Why can GANs feel sharp but repetitive?
Why do VAEs often produce blurrier outputs than diffusion?
Why is diffusion usually slower at inference time?
Why do autoregressive image models fit naturally beside language models?

Interview Q&A

Q1. Why did diffusion beat GANs for open text-to-image systems, and not the other way around?

Diffusion gives repeated correction steps. That improves prompt fidelity, composition, and stability across many prompt types. GANs are faster, but training instability and mode collapse become painful at scale. Common wrong answer to avoid: “Diffusion is better only because it is newer.”

Q2. Why keep VAEs around if their outputs can be blurry?

Because the VAE bottleneck gives a smooth, compact latent space. That is extremely useful for compression, interpolation, and latent diffusion pipelines. The VAE is often infrastructure, not the final artist. Common wrong answer to avoid: “VAEs are obsolete and have no role now.”

Q3. Why use autoregressive image generation at all when diffusion has better image quality?

Autoregressive models unify image and language under one sequence objective. That makes architecture sharing and token-level reasoning cleaner. The trade-off is slower sequential sampling. Common wrong answer to avoid: “Autoregressive models are just worse diffusion models.”
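
The sequential-sampling trade-off in this answer can be sketched as a loop. This is a toy, not any real system's API: `toy_model` is a hypothetical stand-in for a trained network, and the vocabulary of four visual tokens is invented for illustration.

```python
import random

def sample_sequence(next_token_probs, num_tokens, rng, start_token=0):
    """Generate tokens one at a time; each step conditions on everything so far."""
    tokens = [start_token]
    model_calls = 0
    for _ in range(num_tokens):
        probs = next_token_probs(tokens)  # one model call per new token
        model_calls += 1
        next_tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_tok)
    return tokens[1:], model_calls

def toy_model(tokens, vocab_size=4):
    """Hypothetical stand-in for a trained network: favors repeating the last token."""
    probs = [1.0] * vocab_size
    probs[tokens[-1]] += 3.0
    total = sum(probs)
    return [p / total for p in probs]

rng = random.Random(0)
# A 4x4 grid of visual tokens needs 16 sequential model calls.
image_tokens, calls = sample_sequence(toy_model, num_tokens=16, rng=rng)
print("model calls:", calls)
print("tokens:", image_tokens)
```

The loop is the same one a language model would use, which is the unification point. It also shows the cost: the number of model calls grows with the number of tokens, and each call depends on the tokens chosen before it.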

Q4. Why not simply choose GANs when product latency is critical?

Because latency is only one axis. If the prompt space is wide, coverage, controllability, and training stability may matter more. A fast wrong image is still wrong. Common wrong answer to avoid: “Inference speed alone decides the model family.”

Apply now (5 min)

Quick exercise. You are building three products. A face-filter app. A stock-image generator. A design tool with heavy prompt control. Pick one family for each, and write one sentence defending each choice.

Now sketch from memory. Draw the four pipelines. Put one starting state in each box. GAN starts from z. VAE starts from encoder compression. Diffusion starts from noise. Autoregressive starts from a start token. If you can redraw that without looking, the landscape is now in your head.

Bridge. Diffusion won the generation war. But how does the full prompt-to-pixel pipeline actually work? → 08-text-to-image-pipeline.md