00. Transformer architecture in kid words — the assembly line

Read this first. Every later file in this module calls back to this picture. Five minutes.


The setup

Imagine a giant factory called the assembly line. Each word enters as a small packet of meaning — a vector. The packet travels through a series of identical stations. At each station, two things happen.

First, the social bench. Workers at the station ask the other workers what they know, near or far. "Hey, what did you see? How does your word relate to mine?" That consultation step is attention. Several parallel crews do this at the same time. Picture one crew watching grammar, one watching subject-object links, one watching long-range topic. After comparing notes, they write a combined edit onto the packet.
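The social bench can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `multi_head_attention`, the weight names, the toy sizes, and the random weights are all assumptions made for the example. Each "crew" is one attention head.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model). Each weight matrix is (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Each crew (head) scores every token against every other token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    weights = softmax(scores)
    per_head = weights @ v                                 # (n_heads, seq, d_head)

    # Crews compare notes: concatenate the heads, then mix them with w_o.
    merged = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 3, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

edit, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(edit.shape)      # (3, 8): one combined edit per token
print(weights.shape)   # (2, 3, 3): one token-by-token table per crew
```

The returned `edit` has the same shape as the input, which is what lets the station add it onto the packet.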

Second, the private bench. Each worker sits alone and does private thinking — transforms the packet through a small neural network. No consultation. Pure local computation. That is the feed-forward network.
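A small sketch of the private bench, assuming a two-layer network with a ReLU in the middle (real models often use GELU or a gated variant). The names `feed_forward`, `w1`, `w2` and the sizes are illustrative. The check at the end shows the "no consultation" property: a token gets the same result whether or not the other tokens are in the room.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: the same two-layer network applied to each row of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU, chosen here for simplicity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
w1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))                       # three tokens
out_all = feed_forward(x, w1, b1, w2, b2)               # all tokens together
out_row1_alone = feed_forward(x[1:2], w1, b1, w2, b2)   # token 1 by itself

# Token 1's output does not depend on the other tokens.
print(np.allclose(out_all[1], out_row1_alone[0]))       # True
```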

But there are two safety devices.

A shortcut pipe carries the old packet around the station. So even if the station adds a poor edit, the old packet survives. The station writes corrections, not rewrites.

A quality inspector stands at the station entrance. The inspector does not care about meaning. The inspector checks whether the packet is too large or too tiny. If the scale looks wrong, the inspector rescales it. That is layer normalization.
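The inspector's job fits in one function. This sketch assumes the standard layer-norm recipe: normalize each token's vector to zero mean and unit variance, then apply a learned scale and shift (called `gamma` and `beta` here, set to their neutral values for the demo).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one token's vector), then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d_model = 8
gamma, beta = np.ones(d_model), np.zeros(d_model)

rng = np.random.default_rng(0)
packet = rng.normal(size=(1, d_model))
huge = 1000.0 * packet   # same direction, wildly different scale

normed_packet = layer_norm(packet, gamma, beta)
normed_huge = layer_norm(huge, gamma, beta)

# After inspection, the two packets are (almost exactly) the same.
print(np.allclose(normed_packet, normed_huge, atol=1e-3))   # True
```

The inspector ignores how loud the packet is and keeps only its shape, which is why the meaning is untouched.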


The packet never dies

This is the key picture. The packet enters the factory. It passes through 12, 24, even 96 stations. At every station, someone adds notes. But the original packet, plus all previous notes, always survives through the shortcut pipe.

packet x0 ── station 1 ──→ x1 ── station 2 ──→ x2 ── station 3 ──→ x3
               |                    |                    |
           shortcut pipe       shortcut pipe       shortcut pipe
           preserves x0        preserves x1        preserves x2

That traveling packet is the residual stream. It is the main highway of the transformer. Every station reads from it. Every station writes back into it. Nothing else carries meaning through the model.
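The "packet never dies" claim can be checked numerically. In this sketch, `station_edit` is a hypothetical stand-in for whatever a real station computes (attention plus feed-forward); the only point is the update rule `x = x + edit`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_stations = 8, 3

# Hypothetical stand-in weights: one small matrix per station.
station_weights = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_stations)]

def station_edit(x, w):
    # Placeholder for the station's real work (attention + FFN).
    return np.tanh(x @ w)

x0 = rng.normal(size=(3, d_model))   # three tokens entering the factory
x = x0.copy()
edits = []
for w in station_weights:
    edit = station_edit(x, w)   # the station reads from the stream
    edits.append(edit)
    x = x + edit                # the station writes back: old packet + new note

# The final packet is the original plus every note added along the way.
print(np.allclose(x, x0 + sum(edits)))   # True
```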


Three flavors of factory

Not every factory is built the same.

  • Encoder-only. Every worker can look at every other worker — past, present, future. Good for understanding a full sentence. BERT does this.
  • Decoder-only. Every worker can only look leftward — at workers who came before. Good for generating text word by word. GPT, LLaMA, Claude do this.
  • Encoder-decoder. Two assembly lines. The encoder reads the full source. The decoder generates the target, consulting the encoder's output through a special window. Good for translation. T5 does this.

The decoder has a future-blocking rule. Workers cannot peek at tomorrow's word. During training the whole sentence sits on the line at once, so without this rule the factory cheats: it reads the answer instead of learning to predict it. That rule is the causal mask.
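A minimal sketch of the future-blocking rule, assuming the usual trick: set the scores for blocked positions to negative infinity before the softmax, so they receive exactly zero weight. The helper names are illustrative and the scores are random. An encoder-only factory would use the same code with a mask that is `True` everywhere.

```python
import numpy as np

def causal_mask(seq_len):
    # True where looking is allowed: position i may see positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions get -inf, so they end up with exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 3))   # raw scores for [The, sky, is]
weights = masked_softmax(scores, causal_mask(3))

print(np.round(weights, 2))
# Row 0 ("The") can only see itself; everything above the diagonal is zero.
```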

And during inference, the factory has a cache. Yesterday's workers already wrote their notes. Why rewrite them? The KV cache stores past notes and reuses them. Without it, generating each new word recomputes the notes for every earlier word from scratch.
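Here is a toy version of the cache, assuming a single attention head and random weights. Tokens arrive one at a time; each step computes a key and value only for the new token and reuses the stored ones. The last lines compare against recomputing everything at once under a causal mask.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
tokens = rng.normal(size=(4, d))   # four token vectors, arriving one by one

# Cached path: each step computes K and V for the new token only.
k_cache, v_cache, cached_out = [], [], []
for x_t in tokens:
    k_cache.append(x_t @ w_k)
    v_cache.append(x_t @ w_v)
    q_t = x_t @ w_q
    K, V = np.stack(k_cache), np.stack(v_cache)   # past notes, reused as-is
    weights = softmax(q_t @ K.T / np.sqrt(d))
    cached_out.append(weights @ V)
cached_out = np.stack(cached_out)

# Reference path: recompute everything at once with a causal mask.
Q, K, V = tokens @ w_q, tokens @ w_k, tokens @ w_v
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((4, 4), dtype=bool)), scores, -np.inf)
full_out = softmax(scores) @ V

print(np.allclose(cached_out, full_out))   # True
```

Both paths give the same outputs. The cache trades memory for not redoing the key and value projections of earlier tokens at every step.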


The placeholders you will see called back

Whenever a later file uses one of these names, picture this factory:

| Placeholder | Picture |
| --- | --- |
| The assembly line | The full transformer: a sequence of identical stations processing packets. |
| The station | One transformer block: social bench (attention) + private bench (FFN). |
| The social bench | Multi-head attention: tokens consulting other tokens. |
| The private bench | Feed-forward network: per-token transformation, no consultation. |
| The parallel crews | Multiple attention heads running simultaneously. |
| The shortcut pipe | Residual connection: x + f(x). Old packet survives. |
| The quality inspector | Layer normalization: rescale before heavy work. |
| The residual stream | The vector traveling through all layers. The main highway. |
| The causal mask | Future-blocking rule in decoders. |
| The cache | KV cache: stored past keys and values for fast inference. |

Every later file calls back to these by name. If a sentence feels abstract, return here and re-read the factory picture.


A tiny worked example

Three tokens enter the factory: [The, sky, is].

  • Station 1. The social bench notices "sky" relates to "is" (subject-verb). The private bench enriches each token's representation. The shortcut pipe preserves the original embeddings.
  • Station 2. The social bench notices "The" modifies "sky" (determiner-noun). The private bench adds more nuance. The shortcut pipe preserves everything from station 1.
  • Station 3. By now, the packet for "is" carries echoes of every word before it. The model reads that last packet to predict the next word: "blue".

At every station, the quality inspector checked the packet scale. At every station, the shortcut pipe kept the old packet alive. The prediction emerged from three stations of incremental edits, not one giant rewrite.
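The walk-through above can be run end to end as a toy. Everything here is an assumption made for illustration: the five-word vocabulary, the sizes, single-head attention, the inspector placed before each bench, and reusing the embedding table to score output words. The weights are random and untrained, so the predicted word is arbitrary; only the flow of the packet matters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["The", "sky", "is", "blue", "green"]   # toy vocabulary
d_model, d_hidden, n_stations = 8, 16, 3

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def init(*shape):
    return rng.normal(size=shape) * 0.1

def make_station():
    return {"wq": init(d_model, d_model), "wk": init(d_model, d_model),
            "wv": init(d_model, d_model), "wo": init(d_model, d_model),
            "w1": init(d_model, d_hidden), "w2": init(d_hidden, d_model)}

def station(x, p):
    n = x.shape[0]
    # Social bench: inspector first, then single-head causal attention.
    h = layer_norm(x)
    scores = (h @ p["wq"]) @ (h @ p["wk"]).T / np.sqrt(d_model)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    x = x + softmax(scores) @ (h @ p["wv"]) @ p["wo"]   # shortcut pipe
    # Private bench: inspector first, then per-token FFN.
    h = layer_norm(x)
    x = x + np.maximum(0.0, h @ p["w1"]) @ p["w2"]       # shortcut pipe
    return x

embed = init(len(vocab), d_model)
stations = [make_station() for _ in range(n_stations)]

token_ids = [vocab.index(w) for w in ["The", "sky", "is"]]
x = embed[token_ids]          # (3, d_model): three packets enter the line
for p in stations:
    x = station(x, p)

# Score every vocabulary word against the last token's packet.
logits = layer_norm(x[-1]) @ embed.T
probs = softmax(logits)
print(x.shape, probs.shape)           # (3, 8) (5,)
print(vocab[int(np.argmax(probs))])   # arbitrary word: the weights are untrained
```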


What's coming

The rest of the module is the failure-fix chain:

  1. The stacking failure. Why one big block without rails produces garbage. → 01-stacking-failure.md
  2. Residual connections. The shortcut pipe — keep the old packet alive.
  3. The residual stream. The shared canvas every block reads and writes.
  4. Layer normalization. The quality inspector — numerical hygiene per token.
  5. Pre-norm vs post-norm. Where the inspector stands changes everything.
  6. The transformer block. Two benches, one station — the complete picture.
  7. Attention inside the block. The social bench — parallel crews consulting.
  8. The feed-forward network. The private bench — per-token thinking.
  9. Encoder, decoder, encoder-decoder. Three factory layouts.
  10. Causal masking. The future-blocking rule.
  11. KV cache. Not rewriting yesterday's notes.
  12. GQA and MQA. Sharing KV heads to shrink the cache.
  13. Flash Attention. Making O(n²) attention fast by respecting the GPU memory hierarchy.
  14. Honest admission. What this module glossed over.

Each file is one piece. Each piece exists because the previous piece broke at something specific.


Bridge. The first thing that breaks is depth itself. Stack powerful functions without rails and the signal explodes, collapses, or oscillates. Read 01-stacking-failure.md next.