05. Assignment 3: Build and Explain a Pre-Norm Transformer Block
Week 3. Implement the smallest useful transformer block yourself.
Use NumPy or PyTorch.
Do not use nn.TransformerEncoderLayer or other full-block helpers.
Required reading first:
02_explainer.md chapters 2-5. You should be able to explain the shortcut pipe (§2), quality inspector (§3), full block (§4), and causal mask + KV cache (§5) before coding.
Goal
Build a minimal block that includes:
- scaled dot-product self-attention
- residual connection around attention
- LayerNorm in pre-norm order
- FFN with residual connection
- optional causal masking for decoder-style behavior
Required deliverables
- attention.py or attention.ipynb: scaled dot-product attention with mask support
- block.py: pre-norm transformer block
- verify.py: toy-input checks for shapes, masking, and residual behavior
- README.md: explain residual stream, LayerNorm, and causal mask in plain language
- One block diagram: hand-drawn or digital, matching explainer §4.2
Spec
- Input shape: [batch, seq_len, d_model]
- Suggested config: d_model = 128, n_heads = 4, d_ff = 512
- Residual pattern must be pre-norm: x = x + Attention(LayerNorm(x)), then x = x + FFN(LayerNorm(x))
- Causal mode must block future positions correctly (see explainer §5.2)
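The spec above can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function names (`layer_norm`, `self_attention`, `ffn`, `prenorm_block`, `init_params`), the parameter dictionary layout, the ReLU activation, the 0.02 init scale, and the omission of learned LayerNorm gain and bias are all illustrative choices that the assignment does not prescribe.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization over the last axis.
    # Assumption: no learned gain/bias, to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Subtract the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, p, n_heads, causal=False):
    b, t, d = x.shape
    d_head = d // n_heads

    def split(a):
        # [batch, seq_len, d_model] -> [batch, n_heads, seq_len, d_head]
        return a.reshape(b, t, n_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    if causal:
        # True above the diagonal marks future positions; mask BEFORE softmax.
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v
    out = out.transpose(0, 2, 1, 3).reshape(b, t, d)  # merge heads
    return out @ p["w_o"]

def ffn(x, p):
    # Assumption: ReLU activation; the assignment does not fix one.
    return np.maximum(x @ p["w1"] + p["b1"], 0.0) @ p["w2"] + p["b2"]

def prenorm_block(x, p, n_heads, causal=False):
    # Pre-norm residual pattern: normalize the sublayer input,
    # then add the sublayer output back onto the raw stream.
    x = x + self_attention(layer_norm(x), p, n_heads, causal)
    x = x + ffn(layer_norm(x), p)
    return x

def init_params(d_model, d_ff, seed=0):
    # Assumption: small Gaussian init (std 0.02) and zero biases.
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)

    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

params = init_params(d_model=128, d_ff=512)
x = np.random.default_rng(1).normal(size=(2, 4, 128))  # [batch, seq_len, d_model]
y = prenorm_block(x, params, n_heads=4, causal=True)
print(y.shape)  # (2, 4, 128)
```

Because both residual adds operate on tensors of width d_model, the output shape equals the input shape, which is what lets blocks be stacked.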
Verification checklist
- Attention output has the same shape as its input: [batch, seq_len, d_model]
- Residual additions preserve the same d_model
- LayerNorm runs before each sublayer, not after (§3.4)
- Causal mask sets future logits to -inf (or a very large negative value) before softmax, so future positions get effectively zero probability (§5.2)
- Toy forward pass can be narrated using explainer §4.5
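Some of these checks could be written as plain assertions in verify.py. The sketch below is illustrative and self-contained: `layer_norm`, `prenorm`, and `postnorm` are hypothetical stand-ins for your own code, and the stand-in sublayers are trivial functions rather than real attention or FFN. The point is the pre-norm check: if a sublayer outputs zeros, a pre-norm block returns its input unchanged, while a post-norm block does not.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumption: no learned gain/bias in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def prenorm(x, sublayers):
    # Pre-norm: normalize the sublayer INPUT, add the result to the raw stream.
    for f in sublayers:
        x = x + f(layer_norm(x))
    return x

def postnorm(x, sublayers):
    # Post-norm, shown only for contrast: normalize AFTER the residual add.
    for f in sublayers:
        x = layer_norm(x + f(x))
    return x

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 128))  # [batch, seq_len, d_model]

# Check 1: LayerNorm is per token, so every token vector
# ends up with mean ~0 and variance ~1 along d_model.
normed = layer_norm(x)
assert np.allclose(normed.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(normed.var(axis=-1), 1.0, atol=1e-3)

# Check 2: with sublayers that output zeros, the pre-norm residual
# stream passes the input through untouched; post-norm rescales it.
zero = lambda h: np.zeros_like(h)
assert np.allclose(prenorm(x, [zero, zero]), x)
assert not np.allclose(postnorm(x, [zero, zero]), x)

# Check 3: residual additions preserve the shape, including d_model.
w = rng.normal(0.0, 0.02, size=(128, 128))
linear = lambda h: h @ w
assert prenorm(x, [linear, linear]).shape == x.shape
```

In your own verify.py, the zero-output trick from Check 2 can be reproduced by zeroing the output projection of attention and the second FFN matrix.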
Suggested workflow
- Implement attention with a tiny toy example first (§4.3)
- Add LayerNorm and confirm per-token normalization (§3.1-§3.3)
- Wrap attention with the first residual add (§2.5)
- Implement the FFN branch and second residual add (§4.4-§4.5)
- Add a lower-triangular mask and test with a 4-token sequence (§5.2)
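As a starting point for the first and last steps, here is a minimal sketch of masked scaled dot-product attention on a 4-token toy input. The function name `scaled_dot_product_attention`, the boolean mask convention (True means "may attend"), and the all-zero queries and keys are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [..., seq_len, d_head]
    # mask: boolean, True where a query may attend to a key.
    d_head = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_head)
    if mask is not None:
        # Blocked positions get -inf so softmax assigns them weight 0.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

seq_len, d_head = 4, 8
# Lower-triangular mask: token i may attend to tokens 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# All-zero queries and keys make every score 0, so each row of the
# softmax is uniform over the positions the mask allows.
q = np.zeros((seq_len, d_head))
k = np.zeros((seq_len, d_head))
v = np.arange(seq_len * d_head, dtype=float).reshape(seq_len, d_head)

out, weights = scaled_dot_product_attention(q, k, v, mask=causal_mask)
print(np.round(weights, 3))
# Row i is uniform over positions 0..i:
# [1, 0, 0, 0], [0.5, 0.5, 0, 0], [0.333, 0.333, 0.333, 0], [0.25, 0.25, 0.25, 0.25]
```

Since token 0 can only attend to itself, `out[0]` equals `v[0]`, which is a quick sanity check before moving to random inputs.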
Common pitfalls
- Forgetting pre-norm order and accidentally building post-norm (§3.4)
- Changing tensor width across residual additions, which breaks the stream (§2.4)
- Applying the mask after softmax instead of before it (§5.2)
- Treating FFN as optional decoration instead of a core sublayer (§4.4)
- Explaining the code without a block diagram (§4.2)
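The mask-ordering pitfall is easy to see numerically. The following sketch uses random scores as an assumed stand-in for real attention logits and compares masking before softmax with masking after it; the variable names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                # toy attention logits
allowed = np.tril(np.ones((4, 4), dtype=bool))  # True = may attend

# Correct: mask the logits, then normalize over what is left.
correct = softmax(np.where(allowed, scores, -np.inf))

# Pitfall: normalize over all positions, then zero out the future.
# Future tokens already took a share of the probability mass.
wrong = softmax(scores) * allowed

print(correct.sum(axis=-1))  # every row sums to 1
print(wrong.sum(axis=-1))    # rows with masked entries sum to less than 1
```

Both versions put zero weight on future positions, so inspecting only the upper triangle will not catch the bug; checking that each row sums to 1 will.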
Stretch
- Add a simple KV-cache demonstration for greedy decoding (§5.4)
- Compare masked and unmasked attention weights on the same toy prompt (§5.2)
- Replace LayerNorm with RMSNorm and note what changed conceptually (§Honest admission)
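For the KV-cache stretch item, one way to convince yourself the cache is correct is to check that step-by-step attention with cached keys and values matches full causal attention over the same sequence. The sketch below assumes a single head with no output projection, and `causal_attention` and `cached_step` are hypothetical names. A real greedy-decoding loop would also need a model head and an argmax over logits, which are omitted here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, w_q, w_k, w_v):
    # Full recomputation over the whole sequence; x is [seq_len, d_model].
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

def cached_step(x_t, cache, w_q, w_k, w_v):
    # One decoding step; x_t is [d_model]. Only the new token's key and
    # value are computed; earlier ones are reused from the cache.
    q_t = x_t @ w_q
    cache["k"].append(x_t @ w_k)
    cache["v"].append(x_t @ w_v)
    k, v = np.stack(cache["k"]), np.stack(cache["v"])
    # No mask needed: the cache holds only past and current tokens.
    weights = softmax(k @ q_t / np.sqrt(q_t.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
d_model, seq_len = 16, 5
w_q, w_k, w_v = (rng.normal(0.0, 0.1, size=(d_model, d_model)) for _ in range(3))
x = rng.normal(size=(seq_len, d_model))

full = causal_attention(x, w_q, w_k, w_v)
cache = {"k": [], "v": []}
stepwise = np.stack([cached_step(x[t], cache, w_q, w_k, w_v) for t in range(seq_len)])
print(np.allclose(full, stepwise))  # True
```

The cached path computes one query per step instead of recomputing the full score matrix, which is the saving the cache exists to provide.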
Why this matters
Module 04 will ask you to code these mechanics from scratch.
If this hands-on lab feels smooth, your conceptual foundation is working.
If it feels fuzzy, revisit 02_explainer.md before moving on.