04. Layer normalization — the quality inspector¶

Five minutes. One picture. Why every token needs a scale check before heavy work.

Built on the ELI5 in 00-eli5.md. The quality inspector — the guard checking whether a packet is too large or too shifted — is exactly what this file formalizes.

The picture before the math¶

Imagine each token arrives at a station with a packet of numbers. Some packets are too large. Some are too tiny. Some have drifted upward, so every feature is sitting on a big positive baseline. That is bad hygiene. The next station should not guess whether the signal is meaningful or merely badly scaled. So what to do? Put a quality inspector at the entrance. The inspector checks one token at a time. Then rescales that token's feature vector into a cleaner range.

raw token vector ──→ check mean and scale ──→ normalize ──→ pass to station

This is layer normalization. It is about numerical discipline. Not new meaning.

Why activations drift¶

Three common drifts show up in deep stacks. First, explode. The vector norm keeps growing. Second, collapse. The norm shrinks toward zero. Third, mean shift. All coordinates move upward or downward together. Tiny picture:

healthy:    [-1,  0,  1]
exploded:   [-10, 0, 10]
collapsed:  [-0.1, 0, 0.1]
mean-shift: [ 9, 10, 11]

The last one is sneaky. Relative differences may still exist. But the whole packet is riding on an offset. That can make later layers harder to train. LayerNorm fights all three issues. It recenters and rescales per token.

The formula¶

For one token vector x = [x1, x2, ..., xd], LayerNorm computes:

μ = (1/d) Σ xi

σ² = (1/d) Σ (xi - μ)^2

Then normalize:

LN(x) = γ ⊙ (x - μ) / sqrt(σ² + ε) + β

Read the pieces. μ is the mean of that token's features. σ² is the variance of that token's features. γ and β are learned scale and shift parameters. ε is a tiny safety constant. See the scope carefully. One token. Across features. Not across the batch.

Worked example — normalize `[2, 4, 6]`¶

Take one token vector:

x = [2, 4, 6]

Step 1. Mean:

μ = (2 + 4 + 6) / 3 = 4

Step 2. Variance:

σ² = ((2-4)^2 + (4-4)^2 + (6-4)^2) / 3
   = (4 + 0 + 4) / 3
   = 8/3
   ≈ 2.6667

Step 3. Standard deviation:

σ = sqrt(8/3) ≈ 1.6330

Step 4. Center and scale:

(2 - 4) / 1.6330 ≈ -1.2247
(4 - 4) / 1.6330 = 0
(6 - 4) / 1.6330 ≈ 1.2247

So the normalized output is approximately:

[-1.225, 0, 1.225]

That is the inspector at work. Same shape information. Cleaner scale.

Same normalized shape from different scales¶

Now check two more vectors. First:

[12, 14, 16]

Mean is 14. Differences from the mean are [-2, 0, 2]. Standard deviation is again 1.6330. So the normalized output is still:

[-1.225, 0, 1.225]

Second:

[0.2, 0.4, 0.6]

Mean is 0.4. Differences from the mean are [-0.2, 0, 0.2]. Variance is:

(0.04 + 0 + 0.04) / 3 = 0.02667

Standard deviation is about 0.1633. Divide by that and again you get:

[-1.225, 0, 1.225]

Put them together.

| Input | Normalized output |

|---|---|

| [2, 4, 6] | [-1.225, 0, 1.225] |

| [12, 14, 16] | [-1.225, 0, 1.225] |

| [0.2, 0.4, 0.6] | [-1.225, 0, 1.225] |

See the message. LayerNorm cares about relative shape inside the token. It removes absolute scale and mean offset.

Gamma and beta — learned rescaling after cleanup¶

If LayerNorm always ended at zero mean and unit variance, maybe the model would lose useful freedom. So it learns γ and β. Take the normalized vector from above:

n = [-1.2247, 0, 1.2247]

Choose:

γ = [2, 1, 0.5]
β = [0.5, -1, 0]

Apply them coordinate-wise:

2 × (-1.2247) + 0.5  ≈ -1.9494
1 × 0 + (-1)        = -1
0.5 × 1.2247 + 0    ≈ 0.6124

Final output:

[-1.9494, -1, 0.6124]

So the inspector first cleans. Then the model learns how much scale and shift it actually wants back. That is a useful balance.

Per-token, across features — not batch norm¶

This part matters a lot. Suppose you have two tokens, each with three features.

Token 1: [2, 4, 6]
Token 2: [10, 20, 30]

LayerNorm handles each row separately.

[2,  4,  6 ]  ── normalize across row ──→  [-1.225, 0, 1.225]
[10, 20, 30]  ── normalize across row ──→  [-1.225, 0, 1.225]

Token 1 does not care what Token 2 looked like. That is why LayerNorm fits sequence models. Batch norm would mix statistics across examples or time positions. That is awkward when sequence length changes. It is awkward when batch size changes. It is awkward when generation happens one token at a time. LayerNorm avoids that whole mess.

Why LayerNorm fits variable-length sequences¶

A language model may see 8 tokens. Or 8000 tokens. The normalization rule should not depend on how many other tokens happened to be nearby. LayerNorm works locally. One token. Across its own features. So whether the sequence is short or long, the inspector can still do the same job. That makes it a natural fit for transformers.

RMSNorm — a modern variant¶

Some modern models use RMSNorm instead of full LayerNorm. RMSNorm skips mean centering. It only rescales by root-mean-square magnitude. Formula:

RMS(x) = sqrt((1/d) Σ xi^2)

RMSNorm(x) = γ ⊙ x / RMS(x)

Take x = [2, 4, 6]. Compute the RMS:

RMS = sqrt((4 + 16 + 36) / 3)
    = sqrt(56/3)
    ≈ 4.3205

Then divide:

[2, 4, 6] / 4.3205 ≈ [0.463, 0.926, 1.389]

Notice the difference. The mean is not forced to zero. Only the scale is controlled. That can be cheaper and still very effective.

Why the quality inspector matters in the whole transformer¶

The shortcut pipe lets the old packet survive. Good. But packets can still drift in size as many edits accumulate. The quality inspector keeps each station from receiving wildly scaled inputs. That makes attention more predictable. It makes FFN activations more manageable. It makes deep optimization calmer. So residual connections and LayerNorm are partners. One preserves the path. One cleans the numbers on the path.

Where this lives in the wild¶

OpenAI GPT-family / ChatGPT. Transformer blocks use normalization so deep token processing stays numerically stable.
Anthropic Claude. Deep decoder stacks rely on normalization before heavy sublayer work.
Google Gemini / Gemma. Every block includes normalization machinery to control hidden-state drift.
Meta Llama 3. Uses the RMSNorm variant instead of classic LayerNorm, but the purpose is the same: scale control.
Mistral models. Modern decoder architectures also use RMSNorm-style normalization for stable deep stacks.

Interview Q&A¶

Q: Across what dimension does LayerNorm normalize in a transformer? A: Per token, across that token's feature dimensions. It does not use batch-wide statistics. Q: Why is LayerNorm a better fit than batch norm for autoregressive language models? A: Because sequence length and batch composition can change constantly, especially at inference time. LayerNorm depends only on the current token vector, so it behaves consistently. Common wrong answer to avoid: "Batch norm is avoided only because it is slower." Speed is not the main issue. The statistical dependency is. Q: What do gamma and beta do after normalization? A: They let the model learn a useful rescaling and re-centering after the cleanup step. Normalization standardizes first, then γ and β restore flexible representational power. Q: How is RMSNorm different from LayerNorm? A: RMSNorm controls scale using root-mean-square magnitude but skips subtracting the mean. Common wrong answer to avoid: "RMSNorm is just LayerNorm with different notation." It changes the computation by removing mean centering.

Apply now (5 min)¶

Take the vector [2, 4, 6]. Compute its mean. Compute its variance. Normalize it by hand. Then repeat for [12, 14, 16] and [0.2, 0.4, 0.6]. Notice the same normalized answer. Now sketch from memory:

one token row with three features
the formula for μ
the formula for LN(x)
one sentence: "LayerNorm is per token, across features."

If you can explain why the inspector should not depend on the rest of the batch, you own the idea.

Bridge. The last subtle question is placement: should the quality inspector stand before the heavy work or after it? That choice changes gradient flow and training behavior. Read 05-pre-norm-vs-post-norm.md next.