05. Assignment 3: Build and Explain a Pre-Norm Transformer Block

Week 3. Implement the smallest useful transformer block yourself. Use NumPy or PyTorch. Do not use nn.TransformerEncoderLayer or other full-block helpers.

Required reading first: 02_explainer.md chapters 2-5. You should be able to explain the shortcut pipe (§2), quality inspector (§3), full block (§4), and causal mask + KV cache (§5) before coding.

Goal

Build a minimal block that includes:

- scaled dot-product self-attention
- residual connection around attention
- LayerNorm in pre-norm order
- FFN with residual connection
- optional causal masking for decoder-style behavior

Required deliverables

attention.py or attention.ipynb — scaled dot-product attention with mask support
block.py — pre-norm transformer block
verify.py — toy-input checks for shapes, masking, and residual behavior
README.md — explain residual stream, LayerNorm, and causal mask in plain language
One block diagram — hand-drawn or digital, matching explainer §4.2

Spec

Input shape: [batch, seq_len, d_model]
Suggested config: d_model = 128, n_heads = 4, d_ff = 512
Residual pattern must be:

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

Causal mode must block future positions correctly (see explainer §5.2)
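
The spec above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference solution: the parameter names (`wq`, `wk`, `wv`, `wo`, `w1`, `w2`), the random initialization, the ReLU activation, and the omission of biases and LayerNorm gain/bias are all choices made here for brevity and are not prescribed by the assignment.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    # Learnable gain and bias are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p, n_heads, causal):
    b, t, d = x.shape
    dh = d // n_heads

    def split(h):  # [b, t, d] -> [b, n_heads, t, dh]
        return h.reshape(b, t, n_heads, dh).transpose(0, 2, 1, 3)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh)  # [b, n_heads, t, t]
    if causal:
        # True strictly above the diagonal marks future positions.
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # mask BEFORE softmax
    out = softmax(scores) @ v                         # [b, n_heads, t, dh]
    out = out.transpose(0, 2, 1, 3).reshape(b, t, d)  # merge heads
    return out @ p["wo"]

def ffn(x, p):
    # Two linear maps with a ReLU in between (activation is an assumption).
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]

def pre_norm_block(x, p, n_heads=4, causal=False):
    x = x + attention(layer_norm(x), p, n_heads, causal)  # first residual add
    x = x + ffn(layer_norm(x), p)                         # second residual add
    return x

rng = np.random.default_rng(0)
d_model, n_heads, d_ff = 128, 4, 512
shapes = {
    "wq": (d_model, d_model), "wk": (d_model, d_model),
    "wv": (d_model, d_model), "wo": (d_model, d_model),
    "w1": (d_model, d_ff), "w2": (d_ff, d_model),
}
params = {name: rng.normal(0.0, 0.02, size=s) for name, s in shapes.items()}

x = rng.normal(size=(2, 4, d_model))  # [batch, seq_len, d_model]
y = pre_norm_block(x, params, n_heads=n_heads, causal=True)
print(y.shape)  # (2, 4, 128)
```

Both sublayers read a normalized copy of `x` but add their result to the unnormalized `x`, which is what keeps the residual stream itself untouched by LayerNorm.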

Verification checklist

Attention output has the same shape as the input: [batch, seq_len, d_model]
Residual additions preserve the same d_model
LayerNorm runs before each sublayer, not after (§3.4)
Causal mask sets future logits to -inf (or a very large negative value) before softmax, so future positions receive effectively zero probability (§5.2)
Toy forward pass can be narrated using explainer §4.5
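
A few of the checks above can be expressed as assertions in the style of verify.py. The sketch below is only one possible shape for such a script. It assumes a toy attention with Q = K = V = x (no learned projections) so that the causality check stays short; the function name `causal_self_attention` is hypothetical.

```python
import numpy as np

def causal_self_attention(x):
    # Toy attention with Q = K = V = x, an assumption made only to keep
    # this check small. Returns the output and the attention weights.
    t, d = x.shape[-2], x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))  # [batch, seq_len, d_model]
out, weights = causal_self_attention(x)

# Check 1: the output keeps the input shape, so a residual add is legal.
assert out.shape == x.shape
assert (x + out).shape == x.shape

# Check 2: every weight on a future position is exactly zero.
assert np.all(np.triu(weights[0], k=1) == 0.0)

# Check 3: each row of weights is still a probability distribution.
assert np.allclose(weights.sum(axis=-1), 1.0)

# Check 4: perturbing the last token must not change earlier outputs.
x_perturbed = x.copy()
x_perturbed[0, -1] += 10.0
out_perturbed, _ = causal_self_attention(x_perturbed)
assert np.allclose(out[0, :-1], out_perturbed[0, :-1])
assert not np.allclose(out[0, -1], out_perturbed[0, -1])
```

Check 4 is a behavioral test: it does not inspect the mask at all, so it still catches a mask that has the right shape but is applied along the wrong axis.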

Suggested workflow

Implement attention with a tiny toy example first (§4.3)
Add LayerNorm and confirm per-token normalization (§3.1-§3.3)
Wrap attention with the first residual add (§2.5)
Implement the FFN branch and second residual add (§4.4-§4.5)
Add a lower-triangular mask and test with a 4-token sequence (§5.2)
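
Two of the steps above lend themselves to quick sanity checks: confirming that LayerNorm normalizes each token independently, and inspecting the lower-triangular mask for a 4-token sequence. The sketch below assumes a LayerNorm with gain `gamma` and bias `beta` set to their identity values; the input statistics (mean 3, standard deviation 5) are arbitrary.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the last axis only, so each token is
    # normalized on its own, independent of batch and sequence position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model = 128
x = rng.normal(loc=3.0, scale=5.0, size=(2, 4, d_model))
y = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))

per_token_mean = y.mean(axis=-1)  # shape [2, 4], one value per token
per_token_std = y.std(axis=-1)
assert np.allclose(per_token_mean, 0.0, atol=1e-6)
assert np.allclose(per_token_std, 1.0, atol=1e-3)

# Lower-triangular mask for 4 tokens: row i may attend to columns 0..i.
allowed = np.tril(np.ones((4, 4), dtype=bool))
print(allowed.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Reading the mask row by row gives the narration for the toy example: token 0 sees only itself, and token 3 sees the whole sequence.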

Common pitfalls

Forgetting pre-norm order and accidentally building post-norm (§3.4)
Changing tensor width across residual additions, which breaks the stream (§2.4)
Applying the mask after softmax instead of before it (§5.2)
Treating FFN as optional decoration instead of a core sublayer (§4.4)
Explaining the code without a block diagram (§4.2)
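
The mask-ordering pitfall is easy to see numerically. The sketch below uses random 4x4 scores as a stand-in for real attention logits (an assumption for illustration) and compares masking before softmax with zeroing weights after softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                # stand-in attention logits
allowed = np.tril(np.ones((4, 4), dtype=bool))  # True where attention is allowed

# Intended order: set future logits to -inf, then apply softmax.
masked_before = softmax(np.where(allowed, scores, -np.inf))

# Pitfall: apply softmax over all positions, then zero the future weights.
masked_after = softmax(scores) * allowed

print(masked_before.sum(axis=-1))  # every row sums to 1
print(masked_after.sum(axis=-1))   # rows that lost future weight sum to less than 1
```

In the second variant, probability mass was already assigned to future positions before it was discarded, so every row except the last no longer sums to 1 unless it is renormalized.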

Stretch

Add a simple KV-cache demonstration for greedy decoding (§5.4)
Compare masked and unmasked attention weights on the same toy prompt (§5.2)
Replace LayerNorm with RMSNorm and note what changed conceptually (§Honest admission)
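
For the KV-cache item, one possible starting point is to show that stepping through a sequence with cached keys and values reproduces full causal attention. The sketch below is single-head, uses random projection matrices, and feeds a fixed toy sequence instead of sampling tokens from a model, so it demonstrates only the cache mechanics and not greedy decoding itself. All names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
t, d = 5, 8
x = rng.normal(size=(t, d))  # one toy sequence, [seq_len, d_model]
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Reference: full causal attention over the whole sequence at once.
q, k, v = x @ wq, x @ wk, x @ wv
scores = q @ k.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
full = softmax(scores) @ v

# Incremental: process one token per step and reuse cached keys and values.
k_cache, v_cache, steps = [], [], []
for i in range(t):
    xi = x[i]
    k_cache.append(xi @ wk)  # compute this token's key and value once
    v_cache.append(xi @ wv)
    keys = np.stack(k_cache)  # [i + 1, d]: past and current keys only
    values = np.stack(v_cache)
    qi = xi @ wq              # only the newest query is needed
    weights = softmax(keys @ qi / np.sqrt(d))
    steps.append(weights @ values)
incremental = np.stack(steps)

assert np.allclose(full, incremental)
```

No explicit mask appears in the incremental loop because the cache never contains future keys; the causal structure comes from the order in which tokens are appended.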

Why this matters

Module 04 will ask you to code these mechanics from scratch. If this assignment feels smooth, your conceptual foundation is working. If it feels fuzzy, revisit 02_explainer.md before moving on.