03. Week 3 — Transformer Architecture

Start with 02_explainer.md. This file is the quick-reference sheet. The "See explainer" pointer after each section leads to the full failure-fix story.

Section 1 — Pre-norm block at a glance

Modern LLM blocks are usually pre-norm:

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

Two sublayers. Two residual additions. Same residual stream width d_model throughout.

See explainer §4.2.
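The two update lines above can be sketched as a plain function. This is a minimal illustration, not a reference implementation: the names `pre_norm_block`, `add`, `identity`, and `zeros` are hypothetical, and the sublayers are passed in as toy callables rather than real attention or FFN modules.

```python
from typing import Callable, List

Seq = List[List[float]]          # shape: [tokens][d_model]
Sublayer = Callable[[Seq], Seq]  # maps a sequence to a same-shaped sequence


def add(a: Seq, b: Seq) -> Seq:
    """Element-wise residual addition; both inputs share width d_model."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def pre_norm_block(x: Seq, ln1: Sublayer, attention: Sublayer,
                   ln2: Sublayer, ffn: Sublayer) -> Seq:
    x = add(x, attention(ln1(x)))  # x = x + Attention(LayerNorm(x))
    x = add(x, ffn(ln2(x)))        # x = x + FFN(LayerNorm(x))
    return x


# Toy stand-ins (assumptions for illustration only).
def identity(s: Seq) -> Seq:
    return s


def zeros(s: Seq) -> Seq:
    return [[0.0] * len(row) for row in s]


x = [[1.0, 2.0], [3.0, 4.0]]
# If both sublayers write nothing, the block passes x through unchanged.
unchanged = pre_norm_block(x, identity, zeros, identity, zeros)
```

Because every sublayer returns a sequence of the same shape, the width d_model never changes between the two additions.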

Residual rule:

x_{l+1} = x_l + f(x_l)

Read it as "carry old state + learned edit".

Why it matters:

- preserves a clean identity path
- stabilizes deep optimization
- lets every block write into the same residual stream

Residual stream = the running token representation passed across blocks.

See explainer §2.1-§2.5.
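One way to see the "old state + learned edit" reading is to unroll the rule over a few layers: the final state equals the initial state plus the sum of all edits. The sketch below assumes toy edit functions in place of real sublayers; `run_stack` and the lambdas are hypothetical names.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def run_stack(x0: Vector,
              layers: List[Callable[[Vector], Vector]]) -> Tuple[Vector, List[Vector]]:
    """Apply x_{l+1} = x_l + f_l(x_l) for each layer and record every edit."""
    x = list(x0)
    edits = []
    for f in layers:
        delta = f(x)                      # the learned edit for this layer
        edits.append(delta)
        x = [xi + di for xi, di in zip(x, delta)]
    return x, edits


# Toy edits (assumptions): one block writes nothing, two write something.
layers = [
    lambda x: [0.0 for _ in x],
    lambda x: [0.5 * v for v in x],
    lambda x: [1.0 for _ in x],
]

x0 = [1.0, -2.0]
x_final, edits = run_stack(x0, layers)

# Sum of all edits per feature; the stream is x0 plus this total.
total_edit = [sum(e[i] for e in edits) for i in range(len(x0))]
```

The block that returns zeros leaves the stream untouched, which is the clean identity path in action.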

LayerNorm operates per token across features:

LN(x) = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta

Why it matters:

- controls activation scale drift
- recenters token features
- gives later sublayers more predictable input statistics

See explainer §3.1-§3.3.
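The LayerNorm formula can be checked on a single token vector. This is a minimal sketch, assuming the population variance (divide by the feature count) and an `eps` of 1e-5; the function name and default are illustrative choices, not taken from the explainer.

```python
import math
from typing import List


def layer_norm(x: List[float], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[float]:
    """Normalize one token's features, then apply a per-feature scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d   # variance over features, not batch
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


token = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)

# Statistics of the output (with gamma = 1 and beta = 0).
out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
```

With `gamma = 1` and `beta = 0`, the output has mean close to 0 and variance close to 1 regardless of the input scale, which is the scale-drift control described above.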

Pre-norm:

x -> LN -> sublayer -> add x

Post-norm:

x -> sublayer -> add x -> LN

Modern LLMs usually prefer pre-norm: the residual path stays an un-normalized identity, so deep stacks train more stably.

See explainer §3.4.
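The difference between the two orderings shows up when the sublayer contributes nothing. In this sketch (hypothetical names, and a normalization without learned scale or shift as a simplifying assumption), pre-norm returns the input untouched while post-norm still rescales it.

```python
import math
from typing import Callable, List

Vector = List[float]


def normalize(x: Vector, eps: float = 1e-5) -> Vector:
    """LayerNorm without gamma/beta, kept minimal for the comparison."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_step(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # x -> LN -> sublayer -> add x
    return [a + b for a, b in zip(x, sublayer(normalize(x)))]


def post_norm_step(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # x -> sublayer -> add x -> LN
    return normalize([a + b for a, b in zip(x, sublayer(x))])


def silent(x: Vector) -> Vector:
    """Toy sublayer that writes nothing into the stream."""
    return [0.0] * len(x)


x = [2.0, 4.0, 6.0]
pre_out = pre_norm_step(x, silent)    # identity path is preserved
post_out = post_norm_step(x, silent)  # the stream itself gets normalized
```

In the post-norm case every block passes the stream through a normalization, so there is no untouched identity path from input to output.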

Attention mixes information across positions.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Q, K, V come from learned projections. Multi-head attention runs several heads in parallel, then projects back to d_model.

See explainer §4.3.
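A single-head version of the attention formula fits in a few lines. This sketch takes Q, K, and V directly as inputs, so the learned projections and the multi-head split are left out; `softmax` and `attention` are illustrative helper names.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)             # one weight per key position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query scores every key equally, so the output is the mean of V.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
mixed = attention(Q, K, V)
```

Each output row is a weighted average of the value rows, which is what "mixes information across positions" means in practice.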

FFN applies the same per-token MLP to each position independently.

FFN(x) = W_2 activation(W_1 x + b_1) + b_2

Typical shape: d_model -> 4*d_model -> d_model.

Attention mixes across tokens. FFN transforms within one token.

See explainer §4.4.
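The FFN formula and the d_model -> 4*d_model -> d_model shape can be sketched with small hand-picked weights. ReLU is used here as an assumption for the activation (the sheet leaves it unspecified), and all weights and helper names are toy values for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(x: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """FFN(x) = W_2 activation(W_1 x + b_1) + b_2 for one token."""
    hidden = relu([a + b for a, b in zip(matvec(W1, x), b1)])   # width 4 * d_model
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]      # back to d_model


# Toy weights: d_model = 2, hidden width = 8.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
      [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0] * 8
W2 = [[1.0] * 8, [0.5] * 8]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [3.0, -1.0]]
# The same weights are applied to each position independently.
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
```

No term in `ffn` reads another token's features, so changing one position cannot affect the output at any other position.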

Encoder block: full self-attention. Every token can attend to every token.

Decoder block: causal self-attention. A token can attend only to earlier positions and itself.

Encoder-decoder model: decoder also includes cross-attention to encoder outputs.

See explainer §5.1 and §5.3.

The causal mask is a lower-triangular visibility pattern. Blocked future logits get -inf before softmax, so their attention weights come out as exactly 0.

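Example for sequence length 4 (rows = query positions, columns = key positions; 1 = visible, 0 = blocked):

1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1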

Mask bug = future leakage during training.

See explainer §5.2.
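A minimal sketch of building the mask and applying it before softmax, assuming plain Python lists for the score matrix; `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import math
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """True where query position i may see key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores: List[List[float]],
                   mask: List[List[bool]]) -> List[List[float]]:
    out = []
    for row, allowed in zip(scores, mask):
        # Blocked logits become -inf, so exp() maps them to exactly 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)                     # finite: the diagonal is always visible
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


mask = causal_mask(4)
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(uniform_scores, mask)
```

With uniform scores, each row spreads its weight evenly over the visible positions only. A quick leakage check is to assert that every weight above the diagonal is 0.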

During autoregressive inference, cache past keys and values. At each step, compute fresh Q, K, V only for the new token, append the new K and V to the cache, and attend from the new Q over all cached entries.

Why it matters:

- lower latency
- less repeated work
- critical for long generations

The cache helps inference, not standard parallel training, where all positions are processed in one pass.

See explainer §5.4-§5.5.
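The caching idea can be sketched with a small class that stores keys and values and attends from only the newest query. As a simplifying assumption, each token vector is used directly as its own q, k, and v in place of learned projections; `KVCache`, `step`, and `attend` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


class KVCache:
    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, q: Vector, k: Vector, v: Vector) -> Vector:
        """Append the new token's k, v, then attend from its q over the cache."""
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding: one new token per step, past K/V reused from the cache.
cache = KVCache()
cached_out = [cache.step(t, t, t) for t in tokens]

# Reference: recompute attention over the full visible prefix at every step.
full_out = [attend(t, tokens[:i + 1], tokens[:i + 1]) for i, t in enumerate(tokens)]
```

Both paths give the same outputs, but the cached path never recomputes keys or values for earlier tokens, which is where the latency saving comes from.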

What is the residual stream? (See explainer §2.4)
Why does LayerNorm operate per token, not per batch? (§3.1)
Why is pre-norm usually easier to train deeply? (§3.4)
Attention vs FFN — what does each one change? (§4.1)
Why must decoder attention be masked? (§5.2)
Why does KV cache matter operationally? (§5.4)