02. Transformer Architecture — Narrative Explainer¶

Companion to 03_study_material.md. That file is your quick lookup sheet. This file is the picture in your head. Use 04_daily_recall.md after each chapter. Use 05_hands_on_lab.md when the diagrams feel stable. Use 06_revision.md before moving to Module 04.

Table of contents¶

ELI5 — the whole thing in kid words (start here)
Chapter 1: Opening failure & stakes
1.1 One big block, garbage output
1.2 Why this matters to you
Chapter 2: The residual connection — the shortcut pipe
2.1 Mental model — highway bypass around the station
2.2 Why plain stacking kills gradients
2.3 The residual update rule
2.4 The residual stream
2.5 Worked mini-block
Chapter 3: Layer normalization — the quality inspector
3.1 Mental model — ruler checking height
3.2 Why activations drift
3.3 What LayerNorm actually does
3.4 Pre-norm vs post-norm
3.5 Worked token example
Chapter 4: The full transformer block
4.1 Mental model — one station, two benches
4.2 The pre-norm block diagram
4.3 Attention inside the block
4.4 FFN inside the block
4.5 Worked forward pass
Chapter 5: Encoder, decoder, causal mask, KV cache
5.1 Three common stack patterns
5.2 Causal mask mechanics
5.3 Cross-attention in encoder-decoder models
5.4 KV cache for inference
5.5 Worked decoding pass
Retrieval prompts
Honest admission
Chapter 6: Recap & application
6.1 Failure-fix table
6.2 Key points to remember
6.3 Important interview questions
6.4 Production experience
6.5 Apply now — graded exercises

ELI5 — the whole thing in kid words¶

Gaurav, picture a giant factory called the assembly line. Each token enters as a small packet of meaning. That packet is a vector, but keep the picture physical. At every stop, the packet reaches the station.

Workers at the station first ask nearby workers what they know. That consultation step is attention. No worker trusts only personal memory. Workers compare notes before writing their update. Several parallel crews do this consultation at the same time. One crew may watch grammar. One crew may watch subject-object links. One crew may watch long-range topic.

After consulting colleagues, the station does private thinking. That private thinking is the feed-forward network.

But there is a safety device. A shortcut pipe carries the old packet around the station. So even if the station adds a poor update, the old packet survives. There is also a quality inspector at each station entrance. The inspector checks whether the packet is too large or too tiny. If the scale looks wrong, the inspector rescales it. That is layer normalization.

So the real story is simple. The assembly line keeps a running meaning packet. Each station consults colleagues, adds a local edit, and passes it onward. The shortcut pipe preserves the running packet. The quality inspector keeps the packet numerically sane. The parallel crews let different relationships be checked together.

Without the pipe, the packet gets overwritten. Without the inspector, the packet scale drifts. Without the crews, one worker must do every relationship at once. Without a future-blocking rule in decoders, workers cheat by reading tomorrow's answer. Without caching, workers keep rewriting yesterday's notes during inference.

Good. Keep this factory picture steady. The next chapters will replace toy words with transformer language.

Chapter 1: Opening failure & stakes¶

1.1 One big block, garbage output¶

You feed attention plus embeddings into one big block. Output is garbage. Layer outputs explode or collapse to zero. Why? Because plain stacking rewrites the full state every time. Nothing protects the old representation. Nothing keeps feature scale under control. So the stack becomes numerically fragile very quickly. Picture the bad design first.

embedding
   |
   v
attention -> FFN -> attention -> FFN -> attention -> FFN
   |
   v
output
No shortcut pipe.
No quality inspector.
Every stage rewrites the whole vector.

Now use a toy scalar example. Real transformers use vectors. The scalar still shows the failure cleanly. Suppose one block roughly multiplies by 1.8. Start with signal x = 4. Attempt A:

layer 0: 4.000
layer 1: 7.200
layer 2: 12.960
layer 3: 23.328

Same pattern, three steps, already exploding. Now try a block that shrinks too hard. Suppose it multiplies by 0.2. Attempt B:

layer 0: 4.000
layer 1: 0.800
layer 2: 0.160
layer 3: 0.032

Meaning collapses toward zero. Now try a signed unstable block. Suppose it multiplies by -1.2. Attempt C:

layer 0: 4.000
layer 1: -4.800
layer 2: 5.760
layer 3: -6.912

The sign flips and magnitude grows. So the same architecture family gives three ugly behaviors. Explode. Collapse. Oscillate.

Now make the picture more transformer-like. Let one token state be [2, 1, -1]. Assume a badly scaled sublayer outputs [6, 3, -4]. Add another badly scaled sublayer after that, giving [18, 9, -12]. Feature norms jump from about 2.45 to 7.81 to 23.43. That is not thoughtful computation. That is uncontrolled amplification.

Try the opposite direction. Start again from [2, 1, -1]. Suppose each sublayer outputs only ten percent of the previous scale. You get [0.2, 0.1, -0.1], then [0.02, 0.01, -0.01], then [0.002, 0.001, -0.001]. The model loses usable signal.

Attention is not enough by itself. FFN is not enough by itself. Stacking powerful functions is not the same as stacking stable functions. That is the opening failure of this module. The block needs rails. Residual connections give one rail. Layer normalization gives the second rail.
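If you want to replay the three attempts and the norm jump yourself, a minimal sketch follows. The helper names `repeat_gain` and `l2_norm` are made up for this illustration. The "block" is just a scalar multiply, as in the toy example.

```python
import math

def repeat_gain(x0, gain, layers):
    """Multiply by the same gain once per layer and keep every state."""
    states = [x0]
    for _ in range(layers):
        states.append(states[-1] * gain)
    return states

def l2_norm(vec):
    """Euclidean length of a feature vector."""
    return math.sqrt(sum(v * v for v in vec))

# Attempts A, B, C: explode, collapse, oscillate.
for gain in (1.8, 0.2, -1.2):
    print(gain, [round(s, 3) for s in repeat_gain(4.0, gain, 3)])

# Norm growth for the badly scaled vector example.
for vec in ([2, 1, -1], [6, 3, -4], [18, 9, -12]):
    print(vec, round(l2_norm(vec), 2))
```

Change the gain or the layer count and watch how fast the numbers leave a usable range.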

1.2 Why this matters to you¶

You are not studying this for trivia, Gaurav. You are studying it for debugging power. A Lead AI Engineer gets called when training diverges. A Lead AI Engineer gets called when depth increases break stability. A Lead AI Engineer must decide pre-norm or post-norm. A Lead AI Engineer must explain decoder-only versus encoder-decoder tradeoffs. A Lead AI Engineer must know when KV cache matters operationally.

If loss spikes at step 600, you need a checklist. Are hidden-state norms growing per layer? Did attention logits get too sharp? Did a masking bug leak future tokens? Did cache positions drift during inference? Those are architecture questions. Not library questions. Not paper-summary questions.

Architecture understanding also changes your interview answers. If someone asks why residual connections matter, do not say "better gradients" and stop. Say the identity path preserves the residual stream and stabilizes depth. If someone asks why pre-norm dominates modern LLMs, do not answer mechanically. Say the inspector comes before heavy work, so the clean path stays usable.

This module fills five foundation gaps that Module 04 assumes. You need the residual stream concept. You need the pre-norm block diagram. You need encoder-decoder versus decoder-only clarity. You need causal mask mechanics. You need KV cache intuition for inference.

Read 03_study_material.md when you need compact formulas. Use 04_daily_recall.md to make these pictures retrievable under pressure. Then use 05_hands_on_lab.md to turn the pictures into code.

Chapter 2: The residual connection — the shortcut pipe¶

2.1 Mental model — highway bypass around the station¶

Picture a highway crossing several repair workshops. Each workshop can add improvements. But trucks do not enter a dead-end warehouse. They keep a bypass lane. That bypass lane is the residual connection. The transformer does not say, "replace the whole vector." It says, "keep the old vector, then add a learned edit." Write the rule cleanly:

x_{l+1} = x_l + f(x_l)

x_l is the running meaning state. f(x_l) is the sublayer's proposed change. The block is not asked to create meaning from zero. It is asked to write a delta. That is a much easier job. See the diagram.

          +------------------- shortcut pipe -------------------+
          |                                                     |
          |                                                     v
residual stream x -----> [ sublayer does useful work ] -----> (+) ----> next x

In ELI5 language, the assembly line keeps one packet moving forward. The station writes notes onto the packet. The shortcut pipe ensures the packet itself never disappears.

2.2 Why plain stacking kills gradients¶

Now see the backward pass picture. Training sends gradient information from later layers to earlier layers. Without residual paths, those gradients multiply through many local slopes. When many numbers multiply, they usually become tiny or huge. Take a simple chain with local slope 0.6 at every layer. Attempt A:

0.6^2  = 0.3600
0.6^6  = 0.0467
0.6^12 = 0.0022

By layer 12, almost nothing survives. Now take local slope 1.4 at every layer. Attempt B:

1.4^2  = 1.9600
1.4^6  = 7.5295
1.4^12 = 56.6939

That gradient explodes. Now take a more realistic mixed chain. Use slopes 0.8, 1.1, 0.7, 1.3, 0.9, 1.2, 0.6, 1.4. Attempt C:

product = 0.8 * 1.1 * 0.7 * 1.3 * 0.9 * 1.2 * 0.6 * 1.4
        ≈ 0.7265

That one did not explode. But it still changed scale unpredictably. And a deep stack multiplies many more such terms. So the issue is not only vanishing. It is uncontrolled sensitivity. One block shrinks. Another enlarges. Another flips direction. The whole chain becomes numerically moody. That is why "just stack more layers" is incomplete advice.
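The three chains above can be multiplied out in a few lines. This is only a sketch of the toy arithmetic: `chain_gain` is a name invented here, and a real backward pass multiplies Jacobian matrices, not scalars.

```python
import math

def chain_gain(slopes):
    """Product of local slopes along a plain (no-residual) chain."""
    return math.prod(slopes)

# Uniform shrink and uniform growth over 12 layers.
print(round(chain_gain([0.6] * 12), 4))
print(round(chain_gain([1.4] * 12), 4))

# Mixed chain: some layers shrink, some enlarge.
mixed = [0.8, 1.1, 0.7, 1.3, 0.9, 1.2, 0.6, 1.4]
print(round(chain_gain(mixed), 4))
```

Reorder or extend the `mixed` list to see how sensitive the product is to a few extra layers.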

2.3 The residual update rule¶

Residual connection changes the derivative picture. If y = x + f(x), then:

dy/dx = 1 + df/dx

Notice the 1. That is the identity path. Even if the learned branch is weak, some signal still passes. Try three simple cases. Attempt A:

if df/dx = 0.0, then dy/dx = 1.0

Pure carry. Attempt B:

if df/dx = 0.2, then dy/dx = 1.2

Useful amplification, but still near one. Attempt C:

if df/dx = -0.3, then dy/dx = 0.7

Some shrinkage, but not collapse. This does not make bad behavior impossible. It makes the network far harder to destroy accidentally. Residual learning also changes the forward story. The sublayer learns edits, not full rewrites. Try three forward updates from x = 4. Attempt A:

f(x) = 0.5  -> y = 4.5

Small positive edit. Attempt B:

f(x) = -0.2 -> y = 3.8

Small corrective edit. Attempt C:

f(x) = 0.0  -> y = 4.0

Perfect carry. That is exactly what you want from deep stacks. Useful edits. Safe default behavior.
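You can check the dy/dx = 1 + df/dx rule numerically. The sketch below assumes a hypothetical linear sublayer with a constant slope, and estimates derivatives with finite differences. `residual`, `linear_branch`, and `numeric_slope` are illustration names, not library functions.

```python
def residual(f):
    """Wrap a sublayer f so the output is x + f(x)."""
    return lambda x: x + f(x)

def linear_branch(slope):
    """Hypothetical sublayer whose local slope df/dx is a constant."""
    return lambda x: slope * x

def numeric_slope(fn, x, h=1e-6):
    """Central finite-difference estimate of d fn / dx at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Compare the plain branch slope with the slope of the residual wrapper.
for slope in (0.0, 0.2, -0.3):
    plain = numeric_slope(linear_branch(slope), 4.0)
    skip = numeric_slope(residual(linear_branch(slope)), 4.0)
    print(slope, round(plain, 6), round(skip, 6))
```

The second printed column is the branch alone. The third column is the same branch behind the shortcut pipe, shifted up by one.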

2.4 The residual stream¶

Now we can name the central object. The vector traveling through all layers is the residual stream. This is not fancy jargon. It is the main data highway. Every block reads from it. Every block writes back into it. The width usually stays fixed at d_model. That fixed width is deliberate. It lets every station speak the same vector language. Picture the stream across layers.

x0 ---- block 1 ----> x1 ---- block 2 ----> x2 ---- block 3 ----> x3
 |                     |                     |                     |
 same width            same width            same width            same width
 d_model               d_model               d_model               d_model

Attention reads the stream and proposes an edit. FFN reads the updated stream and proposes another edit. Neither sublayer owns the whole representation permanently. They keep writing into a shared canvas. This idea matters later for interpretability work. People often say a head writes to the residual stream. Or an MLP writes a feature into the residual stream. Now that phrase should sound physical, not mystical. In ELI5 language, the residual stream is the packet moving along the assembly line. The packet survives. Each station adds markings. The packet is the continuity.

2.5 Worked mini-block¶

Take a tiny residual stream vector:

x = [2.0, -1.0, 3.0]

Suppose the attention sublayer proposes this edit:

a(x) = [0.3, 0.1, -0.2]

After the first add:

x' = x + a(x) = [2.3, -0.9, 2.8]

Now suppose the FFN proposes this edit:

m(x') = [-0.1, 0.4, 0.2]

After the second add:

x'' = x' + m(x') = [2.2, -0.5, 3.0]

Notice what happened. The original signal never vanished. Feature one stayed near 2. Feature three returned exactly to 3. The block made corrections, not a total rewrite. That is the shortcut pipe doing its job. Checkpoint questions for you, Gaurav:

If the attention branch outputs zeros, what happens to the stream?
Why is adding an edit easier than recreating the whole token meaning?
Why does fixed-width d_model make stacking simpler?

If these feel obvious, good. Module 04 will depend on that comfort.

Chapter 3: Layer normalization — the quality inspector¶

3.1 Mental model — ruler checking height¶

Now add the second rail. Picture a quality inspector standing before the station. The inspector does not ask, "What word is this?" The inspector asks, "Is this vector numerically well-scaled?" That distinction matters. Layer normalization is not semantics. Layer normalization is numerical hygiene. The inspector works per token. It looks across that token's feature dimensions. It does not average across the batch. It does not average across positions. That is why LayerNorm fits variable-length transformers well.

LayerNorm(x) = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta

Read this physically. First center the vector. Then scale it to a stable spread. Then allow a learned rescale with gamma and a learned shift with beta. In ELI5 language, the quality inspector checks packet size before heavy processing.

3.2 Why activations drift¶

Without normalization, hidden states drift in magnitude and mean. That drift can destroy training. Attempt A:

layer 0: [2, 4, 6]      mean = 4     std ≈ 1.63
layer 1: [6, 12, 18]    mean = 12    std ≈ 4.90
layer 2: [18, 36, 54]   mean = 36    std ≈ 14.70

Exploding scale. Attempt B:

layer 0: [0.2, 0.1, -0.1]       mean ≈ 0.067  std ≈ 0.125
layer 1: [0.04, 0.02, -0.02]    mean ≈ 0.013  std ≈ 0.025
layer 2: [0.008, 0.004, -0.004] mean ≈ 0.003  std ≈ 0.005

Collapsing scale. Attempt C:

layer 0: [10, 11, 12]   mean = 11   std ≈ 0.82
layer 1: [13, 14, 15]   mean = 14   std ≈ 0.82
layer 2: [16, 17, 18]   mean = 17   std ≈ 0.82

Mean shift without spread change. All three can hurt optimization. Large norms sharpen logits. Tiny norms weaken updates. Mean drift changes what later layers expect.

3.3 What LayerNorm actually does¶

Take one token vector:

x = [2, 4, 6]
mean = 4
x - mean = [-2, 0, 2]
var = 8/3 ≈ 2.667
std ≈ 1.633
normalized ≈ [-1.225, 0.000, 1.225]

Now compare three inputs.

[2, 4, 6]      -> [-1.225, 0.000, 1.225]
[12, 14, 16]   -> [-1.225, 0.000, 1.225]
[0.2, 0.4, 0.6] -> [-1.225, 0.000, 1.225]

Same shape pattern. Different raw scale or offset. Same normalized output. That is why the inspector helps.
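A small sketch of the formula, in plain Python. It assumes scalar `gamma` and `beta` for brevity (real layers learn one value per feature) and uses the population variance, which matches the std values in the examples above.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one token vector across its feature dimensions."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # population variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

for vec in ([2, 4, 6], [12, 14, 16], [0.2, 0.4, 0.6]):
    print(vec, [round(v, 3) for v in layer_norm(vec)])
```

One detail the hand calculation hides: `eps` is added to the variance, so very small inputs such as [0.2, 0.4, 0.6] come out a hair below the ideal value. That is expected, and it is what prevents division by zero on a constant vector.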

3.4 Pre-norm vs post-norm¶

This distinction is easy to memorize and easy to misunderstand. So picture it first.

Pre-norm:

x -> LN -> Attention -> add x
x -> LN -> FFN      -> add x

Post-norm:

x -> Attention -> add x -> LN
x -> FFN      -> add x -> LN

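Now ask the engineering question. Where do you want the clean identity path? Pre-norm keeps the shortcut pipe cleaner. That makes deep stacks easier to optimize. Modern LLMs mostly choose pre-norm, often with an RMSNorm-style normalizer in place of LayerNorm. To picture why a clean path matters, treat the skip path as a per-layer gain and compound it over depth. Attempt A:

0.5^12 ≈ 0.0002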

Attempt B:

1.3^12 ≈ 23.30

Attempt C:

0.95^8 ≈ 0.66, 1.00^8 = 1.00, 1.05^8 ≈ 1.48

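These are not exact transformer proofs. They are stability pictures. A per-layer gain far from one compounds toward zero or blows up within 12 layers. Gains within five percent of one stay usable over 8 layers.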
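One way to see what "cleaner skip path" means is to feed both layouts a sublayer that proposes no edit at all. The sketch below is an illustration under simple assumptions: no learned gamma or beta, and `pre_norm_step` and `post_norm_step` are names invented here.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gamma/beta, to keep the sketch small."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def pre_norm_step(x, sublayer):
    """x + sublayer(LN(x)): the skip path carries x untouched."""
    return add(x, sublayer(layer_norm(x)))

def post_norm_step(x, sublayer):
    """LN(x + sublayer(x)): the normalizer sits on the skip path."""
    return layer_norm(add(x, sublayer(x)))

# A branch that proposes a zero edit.
zero_branch = lambda h: [0.0 for _ in h]

x = [2.0, 4.0, 6.0]
print(pre_norm_step(x, zero_branch))
print([round(v, 3) for v in post_norm_step(x, zero_branch)])
```

With a zero edit, the pre-norm step hands back x exactly. The post-norm step still rewrites x, because the normalizer is part of the main path.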

3.5 Worked token example¶

Let the incoming residual stream be:

x = [2, 4, 6]
LN(x) ≈ [-1.225, 0.000, 1.225]
attn(LN(x)) = [0.6, -0.2, 0.1]
x1 = [2.6, 3.8, 6.1]
LN(x1) ≈ [-1.08, -0.25, 1.33]
ffn(LN(x1)) = [0.1, 0.3, -0.2]
x2 = [2.7, 4.1, 5.9]

The token changed. The scale stayed reasonable. The old signal survived.

Chapter 4: The full transformer block¶

4.1 Mental model — one station, two benches¶

A transformer block is one station with two work benches. Bench one is social. Tokens consult other tokens. That is attention. Bench two is private. Each token thinks by itself. That is the feed-forward network. Attention mixes across positions. FFN transforms within one position. Both write edits into the same residual stream.

4.2 The pre-norm block diagram¶

Draw this until it feels boring.

residual stream x
      |
      +------------------------------+
      |                              |
      v                              |
   LayerNorm                         |
      |                              |
      v                              |
 Multi-Head Attention                |
      |                              |
      +---------- add ---------------+
                   |
                   v
                 x1
                   |
                   +------------------------------+
                   |                              |
                   v                              |
                LayerNorm                         |
                   |                              |
                   v                              |
          Feed-Forward Network                    |
                   |                              |
                   +----------- add --------------+
                                |
                                v
                              x2

That is the modern block picture Module 04 assumes.

4.3 Attention inside the block¶

At the social bench, each token asks three questions. What am I looking for? What do others advertise? What content do they offer? Those become query, key, and value. Several heads run as parallel crews. Tiny example:

weights = [0.7, 0.2, 0.1]
v1 = [1, 0]
v2 = [0, 2]
v3 = [3, 1]
head output = 0.7*v1 + 0.2*v2 + 0.1*v3 = [1.0, 0.5]

That becomes one head's suggestion. The output projection maps all heads back to d_model.
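The weighted sum above, plus one way the weights can arise, can be sketched as follows. The query and key vectors are hypothetical numbers chosen only for illustration. The scoring rule is standard scaled dot-product attention for a single head, with no learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Mix value vectors using one weight per position."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def attend(query, keys, values):
    """Scaled dot-product attention for one query and one head."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return weighted_sum(softmax(scores), values)

values = [[1, 0], [0, 2], [3, 1]]
print(weighted_sum([0.7, 0.2, 0.1], values))

# Hypothetical query and keys, only to show where weights come from.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.5, 1.0], [0.0, 1.0]]
print([round(v, 3) for v in attend(query, keys, values)])
```

A multi-head layer runs several such heads on different learned projections and then applies the output projection.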

4.4 FFN inside the block¶

Now the private bench. FFN applies the same per-token MLP to every position. Typical shape:

d_model -> d_ff -> activation -> d_model

Tiny example:

h = [1, 2]
W1 h = [1, 4, 3]
W2 [1, 4, 3] = [4, 7]

No other token was consulted there. That is why FFN is not redundant.
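The tiny example does not give the weight matrices. The sketch below picks hypothetical `W1` and `W2` that happen to reproduce its numbers, leaves out biases, and uses ReLU as the activation. All three choices are assumptions made for this illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

# Hypothetical weights: d_model = 2, d_ff = 3. Biases omitted.
W1 = [[1, 0], [0, 2], [1, 1]]
W2 = [[1, 0, 1], [0, 1, 1]]

def ffn(h):
    """d_model -> d_ff -> activation -> d_model, for ONE token."""
    return matvec(W2, relu(matvec(W1, h)))

print(ffn([1, 2]))

# The same function is applied to every position independently.
sequence = [[1, 2], [0, 1], [-1, 0]]
print([ffn(h) for h in sequence])
```

Notice that `ffn` takes one token at a time. Nothing in it can see a neighbor, which is the "private bench" in code form.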

4.5 Worked forward pass¶

Let the incoming token state be:

x = [2, -1, 3]
LN(x) ≈ [0.39, -1.37, 0.98]
attn edit = [0.4, 0.2, -0.1]
x1 = [2.4, -0.8, 2.9]
LN(x1) ≈ [0.55, -1.40, 0.85]
ffn edit = [-0.2, 0.5, 0.3]
x2 = [2.2, -0.3, 3.2]

Inspect. Consult. Add. Inspect. Transform privately. Add. That rhythm is the transformer block.
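That rhythm fits in one small function. The sketch below recomputes the LayerNorm values directly from the vectors and uses stand-in sublayers that simply return the fixed edits from the worked pass. Real attention and FFN sublayers would compute their edits from the normalized input. `pre_norm_block` is a name invented for this illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gamma/beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def pre_norm_block(x, attention, ffn):
    """Inspect, consult, add. Inspect, transform privately, add."""
    x1 = add(x, attention(layer_norm(x)))
    x2 = add(x1, ffn(layer_norm(x1)))
    return x1, x2

# Stand-ins: they ignore their input and return the fixed toy edits.
attention = lambda h: [0.4, 0.2, -0.1]
ffn = lambda h: [-0.2, 0.5, 0.3]

x = [2.0, -1.0, 3.0]
x1, x2 = pre_norm_block(x, attention, ffn)
print("LN(x) ", [round(v, 2) for v in layer_norm(x)])
print("x1    ", [round(v, 2) for v in x1])
print("LN(x1)", [round(v, 2) for v in layer_norm(x1)])
print("x2    ", [round(v, 2) for v in x2])
```

Swap the stand-ins for real functions later. The two `add` calls are the only places where the residual stream is written.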

Chapter 5: Encoder, decoder, causal mask, KV cache¶

5.1 Three common stack patterns¶

Encoder-only: every token can attend everywhere.

Decoder-only: every token can attend only leftward and itself.

Encoder-decoder: encoder reads source fully; decoder generates causally and cross-attends to encoder outputs.

Encoder-only:
input -> full self-attention blocks -> representations
Decoder-only:
input -> causal self-attention blocks -> next-token logits
Encoder-decoder:
source -> encoder blocks --------+
                                  |
target prefix -> decoder blocks -> cross-attention -> logits

For modern chat LLMs, decoder-only dominates. For translation and summarization, encoder-decoder remains important.

5.2 Causal mask mechanics¶

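A decoder token must not read future tokens during training. Otherwise it cheats. Allowed pattern for sequence length 4 (rows are query positions, columns are key positions, 1 = allowed, 0 = blocked):

1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1

Blocked logits get -inf before softmax. Attempt A: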

row 1 before = [5, 2, 9, 1]
row 1 after  = [5, -inf, -inf, -inf]

Attempt B:

row 2 before = [1, 6, 8, 3]
row 2 after  = [1, 6, -inf, -inf]

Attempt C:

row 3 before = [2, 4, 7, 5]
row 3 after  = [2, 4, 7, -inf]

Without the mask, training loss can look good while generation fails.
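The masking step can be sketched in a few lines. Note one assumption: the code uses 0-indexed rows, so "row 1" in the attempts above is `row=0` here. `causal_mask_row` is an illustration name.

```python
import math

NEG_INF = float("-inf")

def causal_mask_row(scores, row):
    """Keep key positions 0..row (0-indexed); block everything to the right."""
    return [s if j <= row else NEG_INF for j, s in enumerate(scores)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

rows = [[5, 2, 9, 1], [1, 6, 8, 3], [2, 4, 7, 5]]
for row, scores in enumerate(rows):
    masked = causal_mask_row(scores, row)
    print(masked, [round(w, 3) for w in softmax(masked)])
```

The mask is applied before softmax on purpose. A -inf logit becomes an exact zero weight, and the remaining weights still sum to one.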

5.3 Cross-attention in encoder-decoder models¶

Cross-attention means decoder queries use encoder outputs as memory. Queries come from decoder state. Keys and values come from encoder state. That is consultation across two streams, not one stream.
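Here is a minimal sketch of that ownership rule. It deliberately skips the learned Q, K, and V projections and treats them as identity, which real models do not do. The state vectors are made-up numbers, and `cross_attention` is an illustration name.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder.

    Learned projections are omitted (identity) to keep the sketch small.
    """
    return [attend(q, encoder_states, encoder_states) for q in decoder_states]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # source, length 3
decoder_states = [[2.0, 0.0], [0.0, 2.0]]              # target prefix, length 2

out = cross_attention(decoder_states, encoder_states)
print(len(out), [[round(v, 3) for v in row] for row in out])
```

The output has one row per decoder position, and each row is a mix of encoder states. Self-attention would pass the same sequence in both arguments.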

5.4 KV cache for inference¶

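During autoregressive inference, past tokens do not change. So cache their keys and values. Do not recompute them every step. One way to read the toy counts below: at step t, the naive pass re-scores the whole t-by-t prefix, while the cached pass scores only the new query against t keys. Attempt A: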

T = 4
naive  = 1^2 + 2^2 + 3^2 + 4^2 = 30
cached = 1 + 2 + 3 + 4 = 10

Attempt B:

T = 16
naive  = 1496
cached = 136

Attempt C:

T = 128
naive  = 707264
cached = 8256

That is why KV cache matters for latency.
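The toy counts are just two sums, so they are easy to reproduce. This sketch only counts work under the toy model above (t squared per naive step, t per cached step). It does not measure real latency, and the function names are invented here.

```python
def naive_cost(T):
    """Toy cost without a cache: step t redoes a t-by-t computation."""
    return sum(t * t for t in range(1, T + 1))

def cached_cost(T):
    """Toy cost with a cache: step t touches t cached entries once."""
    return sum(range(1, T + 1))

for T in (4, 16, 128):
    naive, cached = naive_cost(T), cached_cost(T)
    print(T, naive, cached, round(naive / cached, 1))
```

The last printed column is the ratio. It grows with T, which is why long generations feel the difference most.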

5.5 Worked decoding pass¶

Prompt:

[The, sky, is]

Predict blue. Next prompt becomes [The, sky, is, blue]. Without cache, old keys and values are recomputed. With cache, old keys and values are reused. Only the new token needs fresh projections.
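To make "only the new token needs fresh projections" concrete, the sketch below counts projection calls for this exact prompt. `CountingProjector` is a hypothetical stand-in that returns labeled strings instead of real key and value vectors.

```python
class CountingProjector:
    """Hypothetical stand-in for the key/value projections of one token."""

    def __init__(self):
        self.calls = 0

    def __call__(self, token):
        self.calls += 1
        return ("k:" + token, "v:" + token)

prompt = ["The", "sky", "is"]

# Without cache: every decoding step projects the whole prefix again.
naive = CountingProjector()
kv_step1 = [naive(t) for t in prompt]
kv_step2 = [naive(t) for t in prompt + ["blue"]]

# With cache: project the prompt once, then only the new token.
cached = CountingProjector()
cache = [cached(t) for t in prompt]
cache.append(cached("blue"))

print("projection calls:", naive.calls, "vs", cached.calls)
print("same keys and values:", cache == kv_step2)
```

Both paths end with identical keys and values. The cached path simply gets there with fewer projection calls, which is also a handy equality check when testing cached against uncached decoding.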

Retrieval prompts¶

Draw the modern pre-norm transformer block from memory.

Explain the shortcut pipe and quality inspector without equations first.

What exact bug would you suspect if training loss looks good but generation fails?

Compare self-attention, cross-attention, and FFN in one table from memory.

Explain KV cache to a product engineer using latency numbers.

Honest admission¶

This module gives a strong foundation, not the full frontier map. Real models often use RMSNorm, SwiGLU, RoPE, grouped-query attention, and memory optimizations. Residuals and normalization solve crucial stability problems. They do not solve every training problem. These toy numbers are intuition tools, not full proofs. That is acceptable. Your goal here is operational understanding.

Chapter 6: Recap & application¶

6.1 Failure-fix table¶

Failure	What you observe	Fix	Why the fix helps
Deep stack overwrites useful features	later layers forget earlier meaning	Residual connection	old state stays available through the shortcut pipe
Gradients vanish through many transforms	early layers learn slowly	Residual identity path	derivative keeps a near-one carry route
Hidden-state norms drift upward	logits become sharp	LayerNorm	rescales each token before heavy work
Hidden-state norms collapse	signals become tiny	LayerNorm	restores usable feature spread
Mean shifts accumulate	later blocks see inconsistent scale	LayerNorm	recenters token features
Post-norm depth becomes brittle	deeper training gets harder	Pre-norm block	keeps a cleaner skip path
Tokens need context from other positions	isolated token views miss dependencies	Self-attention	mixes information across positions
Tokens need local nonlinear transformation	pure mixing is not enough	FFN	applies per-token computation
Decoder reads future targets	training cheats, inference breaks	Causal mask	blocks rightward attention
Decoder recomputes old prefixes	latency grows badly	KV cache	stores past keys and values once
Target tokens need source information	seq2seq output misses source detail	Cross-attention	decoder queries encoder memory

6.2 Key points to remember¶

The residual stream is the main highway.
Residual blocks learn edits, not full rewrites.
LayerNorm is numerical hygiene.
Pre-norm means inspect first, then work, then add.
Attention mixes across positions.
FFN transforms within one position.
Decoder-only models require causal masking.
KV cache is an inference optimization.

6.3 Important interview questions¶

Why do residual connections matter in transformers?
What does LayerNorm normalize over?
Why do modern LLMs prefer pre-norm?
Self-attention vs cross-attention — what changes mechanically?
Why is FFN needed if attention already mixes tokens?
What exactly does a causal mask do?
Why does KV cache help inference but not training?
How would you debug exploding activations in a deep transformer?

6.4 Production experience¶

Log hidden-state norms per layer.
Verify normalization order in code.
Check masking before softmax.
Test cached and uncached decoding on short prompts.
Confirm positional indices advance correctly with cache.
Use the block diagram when explaining bugs to teammates.

6.5 Apply now — graded exercises¶

Easy: define the residual stream in one sentence.

Easy: compute LayerNorm for [3, 5, 7].

Medium: apply a causal mask to one decoder score row.

Medium: compare self-attention and cross-attention using Q, K, and V ownership.

Hard: draw the full pre-norm block and label every add point.

Hard: list five architecture checks when depth increases from 12 to 36 layers and training destabilizes.

Drawing task: draw three diagrams from memory.

Plain unstable stack without residual or norm.
One clean pre-norm transformer block.
Decoder causal mask as a lower-triangular matrix.

If you cannot draw them, you do not yet own them. Repeat the cycle. Read 02_explainer.md again. Use 04_daily_recall.md aloud. Then confirm readiness with 06_revision.md. The next module, 04_autoregressive_generation, implements everything from this module in raw Python. You will code attention, residuals, and layer norm from scratch.