
11. Text-to-video systems — how moving clips are actually built

~13 min read. Nice frames are easy. Coherent motion is the exam.

Built on the ELI5 in 00-eli5.md. The canvas (the mathematical space where images are painted) now has to stay stable across time while text keeps steering the whole clip.


1) First picture the factory before any equations

A modern text-to-video model rarely paints raw pixels directly. It usually works in a compressed latent space. Why? Because raw video is huge. Latents are smaller. Smaller means cheaper denoising. So the usual flow is this. Text becomes embeddings. Noise becomes latent video. A denoiser removes noise step by step. Then a decoder turns the clean latent into frames. Look.
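To put rough numbers on "raw video is huge", here is a minimal sketch. The clip size (64 frames at 512 × 512), the spatial compression factor of 8, and the 4 latent channels are illustrative assumptions, and the function name is hypothetical. It also assumes the encoder compresses only in space, not in time.

```python
def video_value_counts(frames, height, width, rgb_channels=3,
                       spatial_factor=8, latent_channels=4):
    """Return (raw_values, latent_values) for one clip.

    Assumes the encoder shrinks height and width by `spatial_factor`
    and leaves the number of frames unchanged.
    """
    raw = frames * height * width * rgb_channels
    latent = (frames * (height // spatial_factor)
              * (width // spatial_factor) * latent_channels)
    return raw, latent


raw, latent = video_value_counts(frames=64, height=512, width=512)
print(raw)           # 50331648 raw pixel values
print(latent)        # 1048576 latent values
print(raw / latent)  # 48.0, so the denoiser touches 48x fewer numbers
```

Under these assumptions every denoising step works on about 2% of the values it would face in pixel space.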

The key is not only spatial detail. The key is consistency over time. The same dog must remain the same dog. The same shadow must move sensibly. The same camera should not teleport. That is why the frame tape keeps coming back. The model must reason over motion, not only appearance.

prompt ──→ text encoder ──→ text tokens
noise video latent ──→ denoiser with spatial + temporal blocks ──→ clean video latent ──→ decoder ──→ frames
                              └──────────── guidance from text at every step

Now split the family into two common product styles. One is the Sora-style diffusion transformer. The other is the Runway-style diffusion U-Net with temporal layers. Both try to keep the clip coherent. They just organize compute differently.
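Before the two families diverge, the shared flow in the diagram (text encoder, noisy latent, step-by-step denoiser, decoder) can be sketched as a toy loop. Everything here is a stand-in: `encode_text`, `toy_denoiser`, and `decode` are hypothetical placeholders for learned networks, and the fixed 0.2 update is not a real sampler. Only the control flow is the point.

```python
import random


def encode_text(prompt):
    # Stand-in text encoder: a real system uses a learned model.
    return [float(len(word)) for word in prompt.split()]


def toy_denoiser(latent, text_tokens):
    # Stand-in for the spatial + temporal network. It "predicts" the noise
    # as the gap between each latent value and a text-dependent target.
    target = sum(text_tokens) / len(text_tokens)
    return [x - target for x in latent]


def decode(latent, frames):
    # Stand-in decoder: split the flat latent into equal per-frame chunks.
    per_frame = len(latent) // frames
    return [latent[i * per_frame:(i + 1) * per_frame] for i in range(frames)]


def generate(prompt, frames=4, values_per_frame=3, num_steps=30, seed=0):
    rng = random.Random(seed)
    text_tokens = encode_text(prompt)
    # Start from pure noise in latent space.
    latent = [rng.gauss(0.0, 1.0) for _ in range(frames * values_per_frame)]
    for _ in range(num_steps):
        # The text tokens are passed in at every step, as in the diagram.
        predicted_noise = toy_denoiser(latent, text_tokens)
        # Remove a fraction of the predicted noise.
        latent = [x - 0.2 * n for x, n in zip(latent, predicted_noise)]
    return decode(latent, frames)


clip = generate("a dog runs on a beach")
print(len(clip), len(clip[0]))  # 4 frames, 3 values per frame
```

In this toy every value drifts toward one text-dependent number, so all frames end up nearly identical. A real denoiser has to produce frames that differ in a coherent way, which is the hard part the rest of this page is about.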

2) Sora-style systems: patch transformer over spacetime latents

In the Sora-style picture, video is tokenized into patches or cubes in latent space. Then a transformer processes those tokens. This feels closer to large token models. Simple, no? You have a long token sequence. Tokens carry spatial and temporal positions.

Attention mixes information globally. That gives flexibility. A token near the top-left in frame 3 can attend to a token in frame 20. Good for long-range consistency. Good for camera motion too. But the price is heavy attention cost. So these systems often rely on latent compression and smart patching. In this setup, the canvas is not a flat image board anymore. It is a spacetime latent board. The model paints motion trajectories inside that board.

latent video volume
┌─────────────────────────────────────┐
│ frame 1 cubes │ frame 2 cubes │ ...│
│ frame 3 cubes │ frame 4 cubes │ ...│
└─────────────────────────────────────┘
          │ tokenize to sequence
[token 1][token 2][token 3] ... [token N]
 diffusion transformer blocks

So why do people like this design? Because one architecture can scale across images and video. Because global attention helps with long context. Because patch sequences fit transformer tooling well. Yes?
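The link between patch size and attention cost can be made concrete with a small count. This is a sketch under assumed numbers: the 64 × 64 × 64 latent volume and the three patch sizes are illustrative, not published settings for any product, and the function names are hypothetical.

```python
def spacetime_tokens(frames, height, width, patch_t, patch_h, patch_w):
    """Number of tokens when a latent video is cut into non-overlapping cubes."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


def full_attention_pairs(num_tokens):
    """Query-key pairs for one head of global self-attention."""
    return num_tokens * num_tokens


# (time, height, width) patch sizes, all hypothetical.
for patch in [(1, 2, 2), (2, 2, 2), (4, 4, 4)]:
    n = spacetime_tokens(64, 64, 64, *patch)
    print(patch, n, full_attention_pairs(n))
# (1, 2, 2) 65536 4294967296
# (2, 2, 2) 32768 1073741824
# (4, 4, 4) 4096 16777216
```

Halving the token count cuts the attention pairs by four, which is why "smart patching" matters so much for this family.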

3) Runway-style systems: diffusion U-Net with temporal layers

Now the other family. A U-Net is strong at multiscale spatial processing. It compresses. It expands. It keeps skip connections. For video, people add temporal blocks inside that backbone. Maybe temporal attention. Maybe temporal convolutions. Maybe both. This is the picture usually drawn for Runway's Gen-2 and Gen-3 models, though Runway has not published full architecture details. The model denoises frame stacks while exchanging information across time. That is cheaper than a giant all-token transformer in many settings.
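To see roughly why splitting the work into a spatial pass and a temporal pass can be cheaper than attending over every token at once, here is a counting sketch. It assumes a hypothetical 16 × 32 × 32 latent, one attention head, and that both designs use pure attention. Real U-Nets lean heavily on convolutions, so treat this as an illustration of scaling only.

```python
def full_spacetime_pairs(frames, height, width):
    """Every token attends to every other token in the clip."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_pairs(frames, height, width):
    """Spatial attention inside each frame, then temporal attention per position."""
    spatial = frames * (height * width) ** 2   # each frame attends within itself
    temporal = height * width * frames ** 2    # each position attends across time
    return spatial + temporal


f, h, w = 16, 32, 32
print(full_spacetime_pairs(f, h, w))  # 268435456
print(factorized_pairs(f, h, w))      # 17039360
```

In this toy setting the factorized layout needs about 16 times fewer pairs. The trade-off is that a token no longer sees a different position in a different frame in a single hop.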

The temporal U-Net route also reuses proven image diffusion recipes. So what do you do if you already have a good image generator? Add temporal layers. Train on video latents. Push the image factory into motion. Image-to-video uses the same idea. Start from one source frame. Encode it. Hold identity anchors from that frame. Then generate the future frames forward. This is useful for animation, camera push-ins, and talking-head motion. But it is fragile. Identity drift can appear. Clothing texture can change. Physics can break. Water may slosh the wrong way. Flicker may appear between nearby frames. That is the daily engineering pain.
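Flicker can be given a crude number. The sketch below averages the absolute change between consecutive frames. The metric and the function name are illustrative assumptions, not a standard benchmark, and genuine motion raises the score too, so it is only useful for comparing clips with similar content.

```python
def mean_abs_frame_diff(frames):
    """Average absolute change between consecutive frames.

    `frames` is a list of equally sized flat lists of pixel values.
    """
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    # Fewer than two frames (or empty frames) means nothing to compare.
    return total / count if count else 0.0


steady = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
flicker = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
print(mean_abs_frame_diff(steady))   # no change between frames
print(mean_abs_frame_diff(flicker))  # brightness jumps back and forth
```

The second clip returns to its starting frame, so nothing has really moved, yet its score is far higher. That back-and-forth pattern is what flicker looks like in numbers.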

4) Worked example: trace a 4-second clip through the pipeline

Take a 4-second clip at 16 fps. That means 4 × 16 = 64 frames. Suppose final resolution is 512 × 512. Assume a VAE compresses by factor 8 spatially. Then each frame becomes 64 × 64 in latent size. Suppose latent channels are 4. So one frame latent is 64 × 64 × 4. Now stack 64 frames. Video latent shape becomes 64 frames × 64 × 64 × 4. Good. Now compute total latent values. 64 × 64 × 64 × 4 = 1,048,576 latent numbers. That is before batching. That is before activations inside the denoiser.

Now suppose the system uses 30 denoising steps. At each step, spatial blocks refine within frames. Temporal attention mixes across frames. Say the temporal block uses 8 attention heads. Say each spatial position attends across all 64 frames. Then one spatial location has 64 × 64 = 4,096 frame-pair scores per head. Across 8 heads, that is 4,096 × 8 = 32,768 scores for one position. Now there are 64 × 64 = 4,096 spatial positions. So one temporal layer computes 32,768 × 4,096 = 134,217,728 attention scores. That is one layer. One step.

See why video is expensive? Now imagine 30 denoising steps. 134,217,728 × 30 = 4,026,531,840 temporal scores per temporal layer across the loop. This is rough arithmetic, not full implementation detail. But it gives the smell of the cost.
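The same arithmetic as a short script, so the numbers can be re-run with other settings. All inputs are the worked example's own assumptions (4 seconds, 16 fps, 512 × 512, factor-8 VAE, 4 latent channels, 8 heads, 30 steps). The variable names are just for illustration.

```python
seconds, fps = 4, 16
height = width = 512
vae_factor, latent_channels = 8, 4
heads, steps = 8, 30

frames = seconds * fps                              # 64
lat_h, lat_w = height // vae_factor, width // vae_factor
latent_shape = (frames, lat_h, lat_w, latent_channels)
latent_values = frames * lat_h * lat_w * latent_channels

# Temporal attention: each spatial position attends across all frames.
pairs_per_position_per_head = frames * frames       # 4096
scores_per_position = pairs_per_position_per_head * heads
scores_per_layer = scores_per_position * lat_h * lat_w
scores_all_steps = scores_per_layer * steps

print(latent_shape, latent_values)  # (64, 64, 64, 4) 1048576
print(scores_per_position)          # 32768
print(scores_per_layer)             # 134217728
print(scores_all_steps)             # 4026531840
```

Changing `fps` from 16 to 24 is a quick way to feel the quadratic term: the frame count grows 1.5 times, but the temporal scores per position grow 2.25 times.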

prompt text
text embeddings
noisy video latent: 64 × 64 × 64 × 4
   ├── spatial denoise blocks
   ├── temporal attention across 64 frames
   ├── repeat for 30 steps
clean video latent
decoder → 64 output frames → 4-second clip

Now connect this back to failures. If temporal mixing is weak, flicker rises. If text guidance is weak, prompt adherence drops. If denoising steps are too few, motion looks muddy. If conditioning is too strong, motion can become stiff. That is why good systems tune all three. Text. Spatial detail. Temporal coherence. And the canvas must hold all of them together.
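The text does not say how guidance strength is set. One common mechanism in diffusion models is classifier-free guidance, sketched here as an assumption about how such a knob might look. The three-element noise vectors and the scale values are made up for illustration.

```python
def guided_noise(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance.

    Start from the unconditional noise prediction and move toward
    (or past) the text-conditioned one by `guidance_scale`.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_text)]


uncond = [0.10, -0.20, 0.30]   # prediction with an empty prompt
text = [0.30, -0.10, 0.10]     # prediction with the real prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, guided_noise(uncond, text, scale))
```

A scale of 0 ignores the prompt, a scale of 1 uses the text-conditioned prediction as is, and larger values extrapolate beyond it. That matches the trade-off above: too little and prompt adherence drops, too much and the result can turn stiff.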


Where this lives in the wild

  • OpenAI Sora — diffusion transformer over spacetime latents gives long-range token mixing for coherent generated scenes.

  • Runway Gen-2 — U-Net diffusion with temporal modules turns image diffusion tricks into video generation.

  • Runway Gen-3 Alpha — stronger temporal modeling targets smoother motion and better subject consistency.

  • Pika 1.0 — image-to-video workflows animate a source image forward while trying to preserve identity.

  • Google Veo — text-conditioned video generation emphasizes cinematic motion and longer coherent clips.


Pause and recall

  • Why do most modern text-to-video systems generate in latent space rather than raw pixels?

  • How does a Sora-style patch transformer differ from a Runway-style temporal U-Net?

  • In image-to-video, what extra challenge appears after the first frame is fixed?

  • In the worked example, what latent shape did the 4-second clip have before denoising?


Interview Q&A

Q: Why use a spacetime transformer and not only a temporal U-Net?

A: A spacetime transformer can model longer and more flexible token relationships, which helps with global motion and scene consistency across a clip. Common wrong answer to avoid: Transformers are chosen only because they are fashionable, not because token interactions matter.

Q: Why keep U-Net diffusion designs alive and not move everything to transformers immediately?

A: U-Nets remain efficient, multiscale, and production-proven, especially when temporal layers can extend strong image-generation backbones. Common wrong answer to avoid: U-Nets are old, so they are automatically worse for video.

Q: Why is image-to-video harder than it first sounds?

A: Because the first frame gives appearance, but the system still has to invent plausible motion, preserve identity, and avoid flicker over future frames. Common wrong answer to avoid: Once the first frame is good, the rest is just interpolation.

Q: Why does temporal consistency fail even when single frames look beautiful?

A: Because per-frame realism does not guarantee cross-frame agreement on identity, geometry, or physical motion. Common wrong answer to avoid: If every frame looks realistic alone, the clip will automatically feel realistic.


Apply now (5 min)

Quick exercise. Take a 2-second 12 fps clip and compute how many frames the denoiser must output. Then assume each latent frame is 48 × 48 × 4 and calculate the full video latent size.

Sketch from memory the pipeline from prompt to noisy latent to temporal denoiser to decoder. Under the sketch, write one sentence on where flicker is born.


Bridge. The model can generate video. But shipping it to users means solving latency, batching, and cost at scale. → 12-production-multimodal-systems.md