05. Assignment 3 — Build and Explain a Pre-Norm Transformer Block

Week 3. Implement the smallest useful transformer block yourself. Use NumPy or PyTorch. Do not use nn.TransformerEncoderLayer or other full-block helpers.

Required reading first: 02_explainer.md chapters 2-5. You should be able to explain the shortcut pipe (§2), quality inspector (§3), full block (§4), and causal mask + KV cache (§5) before coding.

Goal

Build a minimal block that includes:

  • scaled dot-product self-attention
  • residual connection around attention
  • LayerNorm in pre-norm order
  • FFN with residual connection
  • optional causal masking for decoder-style behavior

Required deliverables

  1. attention.py or attention.ipynb — scaled dot-product attention with mask support
  2. block.py — pre-norm transformer block
  3. verify.py — toy-input checks for shapes, masking, and residual behavior
  4. README.md — explain residual stream, LayerNorm, and causal mask in plain language
  5. One block diagram — hand-drawn or digital, matching explainer §4.2

Spec

  • Input shape: [batch, seq_len, d_model]
  • Suggested config: d_model = 128, n_heads = 4, d_ff = 512
  • Residual pattern must be:
x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
  • Causal mode must block future positions correctly (see explainer §5.2)
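The residual pattern above can be sketched in NumPy as follows. This is a wiring illustration under stated assumptions, not a reference solution: `pre_norm_block`, `toy_ffn`, and `placeholder_attention` are hypothetical names, the LayerNorm omits learnable scale and shift, the ReLU activation is an assumption, and the attention stand-in only averages tokens so the block can run end to end.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over the last (d_model) axis.
    # Learnable scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention, ffn):
    # Each sublayer reads a normalized copy; the stream itself is only added to.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 128, 512
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))

def toy_ffn(h):
    # Two linear maps with a ReLU in between (the activation is an assumption).
    return np.maximum(h @ W1, 0.0) @ W2

def placeholder_attention(h):
    # Stand-in only: averages over tokens so the wiring can be exercised.
    return np.broadcast_to(h.mean(axis=1, keepdims=True), h.shape)

x = rng.normal(size=(2, 4, d_model))   # [batch, seq_len, d_model]
y = pre_norm_block(x, placeholder_attention, toy_ffn)
print(x.shape, y.shape)
```

Passing the sublayers in as callables keeps the residual wiring separate from the attention and FFN code, so you can swap in your real implementations without touching the two residual lines.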

Verification checklist

  1. Attention output has the same shape as the input: [batch, seq_len, d_model]
  2. Residual additions preserve the same d_model
  3. LayerNorm runs before each sublayer, not after (§3.4)
  4. Causal mask sets future logits to -inf (or a very large negative value) before softmax, so future positions receive effectively zero probability (§5.2)
  5. Toy forward pass can be narrated using explainer §4.5
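One way to check item 4 in `verify.py` is a perturbation test: change only the last token and confirm that the outputs at earlier positions do not move. The sketch below assumes a single-head attention with random projection matrices. `causal_attention`, `Wq`, `Wk`, and `Wv` are hypothetical names chosen for illustration, and the small `d_model` is only to keep the example fast.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention with a causal mask.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    seq_len = x.shape[1]
    # True above the diagonal marks future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # mask before softmax
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d_model = 16
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
x = rng.normal(size=(1, 4, d_model))

x_changed = x.copy()
x_changed[:, -1, :] += 10.0   # perturb only the last token

out = causal_attention(x, Wq, Wk, Wv)
out_changed = causal_attention(x_changed, Wq, Wk, Wv)

assert out.shape == x.shape                               # shape check
assert np.allclose(out[:, :-1], out_changed[:, :-1])      # past is unaffected
assert not np.allclose(out[:, -1], out_changed[:, -1])    # last token does change
```

If the second assertion fails, information from a later token is leaking into earlier positions, which usually points to a missing or transposed mask.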

Suggested workflow

  1. Implement attention with a tiny toy example first (§4.3)
  2. Add LayerNorm and confirm per-token normalization (§3.1-§3.3)
  3. Wrap attention with the first residual add (§2.5)
  4. Implement the FFN branch and second residual add (§4.4-§4.5)
  5. Add a lower-triangular mask and test with a 4-token sequence (§5.2)
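For step 5, a tiny example with all-zero logits makes the effect of the lower-triangular mask easy to read by eye. This is a minimal sketch assuming NumPy; the variable names are illustrative, and the zero logits are a deliberate toy choice rather than anything a trained model would produce.

```python
import numpy as np

seq_len = 4
# True on and below the diagonal: a token may attend to itself and the past.
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.zeros((seq_len, seq_len))            # toy logits, all equal
masked = np.where(allowed, scores, -np.inf)      # block future positions

# Softmax over each row; exp(-inf) evaluates to exactly 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```

Because every visible logit is equal, each row should spread its weight evenly over itself and the earlier positions: all weight on position 0 in the first row, one half each in the second, one third each in the third, and one quarter each in the fourth, with zeros above the diagonal.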

Common pitfalls

  • Forgetting pre-norm order and accidentally building post-norm (§3.4)
  • Changing tensor width across residual additions, which breaks the stream (§2.4)
  • Applying the mask after softmax instead of before it (§5.2)
  • Treating FFN as optional decoration instead of a core sublayer (§4.4)
  • Explaining the code without a block diagram (§4.2)
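The first pitfall is easy to detect with a sublayer that outputs zeros. In pre-norm order the residual stream passes through unchanged, while post-norm order normalizes the stream itself. The sketch below assumes NumPy and a LayerNorm without learnable parameters; `pre_norm_step`, `post_norm_step`, and `silent` are hypothetical helper names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization over the last axis, no learnable parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(x, sublayer):
    # Pre-norm: normalize the sublayer input, leave the stream alone.
    return x + sublayer(layer_norm(x))

def post_norm_step(x, sublayer):
    # Post-norm: normalize after the residual add, rescaling the stream.
    return layer_norm(x + sublayer(x))

def silent(h):
    # A sublayer that contributes nothing, used only for this check.
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(2, 4, 8))   # deliberately not normalized

pre = pre_norm_step(x, silent)
post = post_norm_step(x, silent)
print(np.allclose(pre, x))    # True: the stream is untouched
print(np.allclose(post, x))   # False: the stream itself was normalized
```

Running the same check against your own block, with both sublayers temporarily replaced by a zero function, is a quick way to confirm which order you actually built.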

Stretch

  • Add a simple KV-cache demonstration for greedy decoding (§5.4)
  • Compare masked and unmasked attention weights on the same toy prompt (§5.2)
  • Replace LayerNorm with RMSNorm and note what changed conceptually (§Honest admission)
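For the RMSNorm stretch item, a side-by-side comparison on one small vector shows the main conceptual difference: LayerNorm subtracts the mean before scaling, while RMSNorm only divides by the root mean square. This is a minimal sketch assuming NumPy, with learnable gain and bias omitted from both functions. The placement of `eps` inside the square root is one common convention, not the only one.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Center on the mean, then scale by the standard deviation.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # Scale by the root mean square only; no mean subtraction.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 6.0]])
ln, rn = layer_norm(x), rms_norm(x)

print("LayerNorm mean:", ln.mean(axis=-1))      # close to 0
print("RMSNorm mean:  ", rn.mean(axis=-1))      # stays positive here
print("RMSNorm mean square:", (rn ** 2).mean(axis=-1))  # close to 1
```

The LayerNorm output is centered on zero, whereas the RMSNorm output keeps the sign and relative offset of the input and only fixes its overall scale.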

Why this matters

Module 04 will ask you to code these mechanics from scratch. If this assignment feels smooth, your conceptual foundation is working. If it feels fuzzy, revisit 02_explainer.md before moving on.