10. Video tokenization and temporal modeling — time turns one picture into a crowd

~12 min read. One image is neat. A clip is a traffic jam.

Built on the ELI5 in 00-eli5.md. The frame tape (the timeline that links video frames) forces tokenization to care about motion, identity, and order, not just appearance.


1) First see the pileup before any formula

An image is one sheet. A video is many sheets stacked by time. So image tokenization cuts across height and width. Video tokenization cuts across height, width, and time.

Look. A frame can be split into little squares. Those squares are patches. A video clip can be split into little cubes. Each cube spans space and a short slice of time.

That is the first mental shift. The model is not only asking what is here. It is asking what stayed, what moved, and what changed. So the patch stops being only spatial. It becomes spatiotemporal in practice.

If you ignore time, a red car in frame 1 and frame 2 may become two unrelated objects. Then identity drifts. Motion becomes jumpy. Lighting can blink.

Simple, no? Now the quick arithmetic. A common image setup uses 14 × 14 = 196 patches, for example a 224 × 224 image cut into 16 × 16 patches. That is fine for one frame. But one second of video at 30 frames per second means 196 × 30 = 5,880 tokens.

That is before text tokens. That is before special tokens. That is before deeper layers multiply compute. So the token game changes immediately.
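The arithmetic above can be sketched as a tiny helper. This is a minimal illustration, not any model's real preprocessing: the function name `frame_tokens` is hypothetical, and the 224 × 224 frame with 16 × 16 patches is an assumed setup that yields the 14 × 14 grid.

```python
def frame_tokens(height: int, width: int, patch: int, fps: int, seconds: int) -> int:
    """Count tokens when every frame is patched on its own (no grouping in time)."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    patches_per_frame = (height // patch) * (width // patch)
    num_frames = fps * seconds
    return patches_per_frame * num_frames


# Assumed setup: 224 x 224 frames, 16 x 16 patches -> 14 x 14 = 196 patches per frame.
print(frame_tokens(224, 224, 16, fps=30, seconds=1))  # 5880
```

Text tokens and special tokens would come on top of this count.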

2) How spatiotemporal patching actually looks

See the clip as a 3D block. Height is one axis. Width is another. Time is the third. Now slice the whole block into little cubes. Each cube becomes one token after projection.

                 time
        frame 1   frame 2   frame 3   frame 4
      ┌────────┬────────┬────────┬────────┐
row 1 │┌──┬──┐ │┌──┬──┐ │┌──┬──┐ │┌──┬──┐ │
      ││A │B │ ││E │F │ ││I │J │ ││M │N │ │
      │├──┼──┤ │├──┼──┤ │├──┼──┤ │├──┼──┤ │
row 2 ││C │D │ ││G │H │ ││K │L │ ││O │P │ │
      │└──┴──┘ │└──┴──┘ │└──┴──┘ │└──┴──┘ │
      └────────┴────────┴────────┴────────┘
cube token examples
A+E  ──→ same spatial spot across two times
C+G  ──→ motion evidence in lower-left area
B+F  ──→ another spacetime cube

The exact implementation varies. Some systems patch each frame first, then add temporal attention later. Some systems patch directly in spacetime latents. But the idea is the same. The input volume is no longer flat.

That is where the frame tape starts charging rent. Every extra second adds another slice. Every extra slice creates more relationships.

And attention must decide which relationships are worth paying for.
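Here is a minimal NumPy sketch of the cube slicing described in this section. It assumes a clip stored as a (T, H, W, C) array and cuts it with plain reshapes. The name `cube_tokens` is hypothetical. A real system would follow this with a learned projection, which is left out here.

```python
import numpy as np


def cube_tokens(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a (T, H, W, C) clip into flattened spacetime cubes.

    Returns an array of shape (num_cubes, pt * ph * pw * C).
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("clip dimensions must be divisible by the cube size")
    # Break each axis into (number of cubes, size inside one cube).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three cube-index axes to the front, cube contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)


# Same layout as the diagram: 4 frames, a 2 x 2 patch grid, cubes spanning 2 frames.
clip = np.zeros((4, 32, 32, 3), dtype=np.float32)
tokens = cube_tokens(clip, pt=2, ph=16, pw=16)
print(tokens.shape)  # (8, 1536)
```

The first row of `tokens` corresponds to the A+E cube in the diagram: the top-left patch across frames 1 and 2.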

3) Why temporal attention exists at all

Suppose a person lifts a blue cup. Frame 1 shows the hand near the cup. Frame 2 shows contact. Frame 3 shows the cup rising. If each frame is processed alone, the model may keep color right. But it may lose object continuity.

The cup may change shape. The sleeve may change pattern. The hand may teleport slightly.

Temporal attention is the repair tool. It lets tokens from one frame attend to tokens from other frames. So the model can ask which earlier evidence should guide this frame.

frame t-1 tokens ──┐
                   ├──→ temporal attention ──→ frame t update
frame t tokens   ──┤
                   └──→ frame t+1 context
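
A minimal NumPy sketch of the idea in the diagram, under strong simplifying assumptions. Each spatial position attends only to the same position in the other frames. There are no learned query, key, or value projections. Attention runs in both time directions. The name `temporal_attention` is hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(x: np.ndarray) -> np.ndarray:
    """x has shape (T, S, D): T frames, S spatial positions, D channels.

    Each spatial position mixes information across its own T time steps.
    Queries, keys, and values are the tokens themselves (an assumption).
    """
    T, S, D = x.shape
    seq = x.transpose(1, 0, 2)                          # (S, T, D)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(D)  # (S, T, T)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    out = weights @ seq                                 # (S, T, D)
    return out.transpose(1, 0, 2)                       # back to (T, S, D)


frames = np.random.default_rng(0).normal(size=(3, 4, 8))
print(temporal_attention(frames).shape)  # (3, 4, 8)
```

If every frame is identical, the output equals the input, which is the "nothing moved" case.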

Now the painful part. Full 3D attention is expensive. If you have N video tokens, full attention does roughly N × N = N² pair checks. When N becomes thousands, memory jumps fast: the 5,880 tokens from one second of video already mean about 34.6 million pairs per attention map. So people factor the problem. One common trick is spatial-then-temporal attention.

First attend within each frame. Then attend across frames at matched spatial positions. Good. Cheaper. More scalable. But not free.

Factored attention may miss some rich cross-frame, cross-region interactions. Full 3D attention can capture more joint structure. But it burns more compute.

So what to do? Use full attention where quality needs it. Use factored attention where budget demands it. Yes?
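To put rough numbers on the trade-off, here is a small sketch that counts pair checks for one attention map. It assumes factored attention means full attention inside each frame, followed by attention across frames at each matched spatial position. Heads, layers, and constant factors are ignored. The function names are hypothetical.

```python
def full_pairs(frames: int, patches: int) -> int:
    """Every token attends to every token in the clip."""
    n = frames * patches
    return n * n


def factored_pairs(frames: int, patches: int) -> int:
    """Spatial attention inside each frame, then temporal attention per position."""
    spatial = frames * patches * patches   # each frame: patches x patches
    temporal = patches * frames * frames   # each position: frames x frames
    return spatial + temporal


T, S = 30, 196  # the one-second example: 30 frames, 196 patches per frame
print(full_pairs(T, S))      # 34574400
print(factored_pairs(T, S))  # 1328880
```

In this setup the factored version does about 26 times fewer pair checks. The price is that a token never looks directly at a different position in a different frame.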

4) Worked example: count tokens for a 4-second clip

Take a 256 × 256 video. Frame rate is 8 fps. Duration is 4 seconds. Patch size is 16 × 16.

First count frames. 4 seconds × 8 frames/second = 32 frames.

Now patches per frame. 256 / 16 = 16 patches along height. 256 / 16 = 16 patches along width. So patches per frame are 16 × 16 = 256.

Now total video tokens if we patch each frame separately. 32 frames × 256 patches/frame = 8,192 tokens. Good.

Now compare with the earlier one-second example. That example was 196 × 30 = 5,880 tokens. Here we already crossed that with only 8 fps. Why? Because each frame has more patches: 256 is bigger than 196. And four full seconds at 8 fps give 32 frames, slightly more than the 30 frames in that one second.

Now see the same clip as cubes. Suppose we group time in chunks of 2 frames. Spatial grid is still 16 × 16. Temporal groups become 32 / 2 = 16. So spacetime cubes are 16 × 16 × 16 = 4,096 cube tokens. That halves the token count, and it cuts full-attention pair checks to roughly a quarter. Nice.

But each token now summarizes two frames. You saved compute. You also blurred fine motion a bit. That is the trade-off in one line.

frames                 = 4 × 8 = 32
patches per frame      = (256/16) × (256/16) = 16 × 16 = 256
total frame tokens     = 32 × 256 = 8,192
2-frame cube groups    = 32/2 = 16
total cube tokens      = 16 × 16 × 16 = 4,096
tokens saved           = 8,192 - 4,096 = 4,096 tokens
motion detail sacrificed = finer changes inside each 2-frame chunk

See the lesson. Video work is a three-way bargain. Resolution fights duration. Duration fights frame rate. All three fight compute. And the frame tape makes every mistake repeat across time.
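
One way to feel the bargain is to fix a token budget and see how long a clip fits. The sketch below assumes the 8,192-token budget from the worked example and ignores rounding to whole frames or whole cubes. The name `max_seconds` is hypothetical.

```python
def max_seconds(budget: int, height: int, width: int, fps: int,
                patch: int = 16, frames_per_cube: int = 1) -> float:
    """Longest clip, in seconds, whose token count fits within `budget`."""
    patches_per_frame = (height // patch) * (width // patch)
    tokens_per_second = patches_per_frame * fps / frames_per_cube
    return budget / tokens_per_second


budget = 8192  # token count of the worked example
print(max_seconds(budget, 256, 256, fps=8))                     # 4.0
print(max_seconds(budget, 256, 256, fps=8, frames_per_cube=2))  # 8.0
print(max_seconds(budget, 512, 512, fps=8))                     # 1.0
print(max_seconds(budget, 256, 256, fps=16))                    # 2.0
```

Doubling the resolution per side cuts the affordable duration to a quarter. Doubling the frame rate cuts it in half. Grouping two frames per cube buys the duration back at the cost of fine motion.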


Where this lives in the wild

  • OpenAI Sora — patch-based spacetime tokenization lets the model treat video like a latent world volume, not a bag of independent frames.

  • Runway Gen-3 — temporal modules keep subjects and camera motion coherent after spatial features are built.

  • Google Veo — temporal consistency work matters for long clips, especially when identity must survive scene motion.

  • Pika — image-to-video motion needs cross-frame links so the starting subject does not melt after a second.

  • Meta Emu Video — learned temporal structure helps generated clips keep appearance stable across frames.


Pause and recall

  • Why does video tokenization become H × W × T instead of only H × W?

  • How do we get 5,880 tokens from 196 patches per frame and 30 frames?

  • Why does temporal attention help with identity and motion consistency?

  • What trade-off separates full 3D attention from spatial-then-temporal attention?


Interview Q&A

Q: Why use temporal attention and not independent frame generation?

A: Because independent frames can look sharp alone but fail to preserve identity, motion direction, and small causal links across time. Common wrong answer to avoid: Independent frames are fine because a video is only a list of images.

Q: Why choose spatial-then-temporal attention and not full 3D attention everywhere?

A: Because factored attention cuts memory and compute sharply, which often makes training and serving feasible at useful resolutions and durations. Common wrong answer to avoid: Factored attention is always better because it gives the same quality for free.

Q: Why can spacetime cube tokenization be more attractive than per-frame patching alone?

A: Because cube tokens reduce sequence length and carry short motion evidence inside each token, which can lower cost on long clips. Common wrong answer to avoid: Bigger tokens only help because they make the model simpler, with no information loss.

Q: Why does token count explode faster in video than many engineers expect?

A: Because every spatial position contributes a new token in every frame, and full attention cost grows with the number of pairs among all those tokens, which is roughly quadratic in the token count. Common wrong answer to avoid: Video cost rises linearly in frames and stays easy because each frame is small.


Apply now (5 min)

Quick exercise. Compute token counts for a 2-second 224 × 224 clip at 12 fps with 16 × 16 patches. Then recompute if you merge every 3 frames into one cube token.

Sketch from memory the video volume and show where one spacetime cube lives. Under the sketch, write one line on when you would pay for full 3D attention anyway.


Bridge. Tokens are ready. But how do modern systems actually generate coherent video from text? → 11-text-to-video-systems.md