05. Assignment 5 — Fine-tune GPT-2 on a Domain Corpus¶
Week 5. Take a pretrained model, fine-tune it on a focused domain corpus, and measure the difference.
Required reading first:
02_explainer.md §2.2-§3.5 and §5.1-§5.5. You should understand the wiki reader, the shadowing, chat templates, catastrophic forgetting, LR schedules, precision, accumulation, and stop criteria before touching the run.
Goal¶
Fine-tune GPT-2 (124M) on a focused domain corpus. Measure perplexity before and after. Understand the training loop at a mechanical level. Not just "I ran Trainer."
Requirements¶
- Data pipeline — load, clean, tokenize, split, and batch a domain corpus.
- Training loop — use Hugging Face Trainer or a manual loop, but be able to explain every major config choice.
- Eval — compute perplexity on a held-out test set, before and after fine-tuning.
- Generation — produce at least 3 qualitative samples showing domain adaptation.
- Training log — record at least one failure → fix iteration using explainer §6.1 vocabulary.
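The eval requirement reduces to one formula: perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is a minimal illustration, not a prescribed eval.py. The function names are hypothetical, and it assumes each eval batch reports a mean cross-entropy loss in nats plus the number of tokens that actually contributed to that loss.

```python
import math


def perplexity(total_nll: float, total_tokens: int) -> float:
    """Return exp(mean per-token negative log-likelihood), with NLL in nats."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return math.exp(total_nll / total_tokens)


def corpus_perplexity(batches) -> float:
    """Aggregate (mean_loss, n_tokens) pairs into one corpus-level perplexity.

    Each mean loss is weighted by its token count, so a short final batch
    does not count as much as a full one.
    """
    total_nll = 0.0
    total_tokens = 0
    for mean_loss, n_tokens in batches:
        total_nll += mean_loss * n_tokens
        total_tokens += n_tokens
    return perplexity(total_nll, total_tokens)
```

Run the same function over the same held-out batches for the base checkpoint and the fine-tuned one. Averaging per-batch perplexities instead of token-weighted losses gives a different number, so pick one method and use it for both measurements.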
Suggested corpora¶
- Your own writing: notes, blog posts, docs.
- A narrow support or ops dataset.
- Legal or medical public-domain text.
- Internal-style technical documentation.
- A chosen writing style if you want a lighter project.
Deliverables¶
- train.py: training script with hyperparameters documented.
- eval.py: compute perplexity on the held-out set.
- generate.py: generate samples with at least two temperatures.
- config.yaml or equivalent: hyperparameters and paths.
- README.md: dataset, training setup, curves, perplexity delta, sample outputs.
- TRAINING_LOG.md or an equivalent section in the README: what failed, what you changed, what improved.
Success criteria¶
- Perplexity drops measurably on your domain test set.
- Generated text sounds more domain-specific than the base model.
- You can explain your learning rate, batch size, sequence length, and number of epochs.
- You can explain whether the run was more like broad shadowing or narrow specialization.
- You documented at least one failure → fix step.
Constraints¶
- Use an existing pretrained GPT-2 checkpoint.
- Do not pretend this is frontier pretraining. This hands-on lab is about the post-training mechanics on a small scale.
- Keep the dataset small enough to finish locally or on a rented GPU.
- Preserve a held-out split. No cheating by evaluating on train text.
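One way to keep the held-out split honest is to assign each document to a split from a hash of its content, so the assignment never changes between runs and identical documents cannot land on both sides. This is a sketch under assumptions: the function names and the 10% test fraction are illustrative, and documents are assumed to be plain strings.

```python
import hashlib


def assign_split(text: str, test_fraction: float = 0.1) -> str:
    """Map a document to 'train' or 'test' from a hash of its content."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_corpus(docs, test_fraction: float = 0.1):
    """Return (train_docs, test_docs); the result is the same on every run."""
    train, test = [], []
    for doc in docs:
        if assign_split(doc, test_fraction) == "test":
            test.append(doc)
        else:
            train.append(doc)
    return train, test
```

Split at the document level before chunking into training sequences. Splitting after chunking can put neighboring chunks of the same document in both train and test, which inflates the measured improvement.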
Recommended workflow¶
- Pick a domain with visible stylistic or vocabulary patterns.
- Clean obvious duplicates and junk. Apply the curriculum lesson from explainer §2.1.
- Decide the training format. If you want instruction-style behavior, use a consistent prompt template. Explainer §3.3 matters here.
- Run a base-model eval first. Save perplexity and sample generations.
- Fine-tune with a conservative LR. Explainer §5.1 and §3.4 explain why.
- Evaluate again.
- Compare outputs and write the failure → fix story.
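The cleaning step in the workflow above can start very simply. The sketch below removes exact duplicates after whitespace and case normalization and drops very short fragments. The helper names and the 20-character threshold are assumptions for illustration; near-duplicate detection is out of scope here.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and ignore case so trivial variants match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def dedupe(docs, min_chars: int = 20):
    """Keep the first occurrence of each normalized document.

    Documents whose normalized form is shorter than min_chars are treated
    as junk and dropped.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = normalize(doc)
        if len(key) < min_chars or key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```

Record how many documents were dropped and why. That count belongs in the cleaning section of the writeup.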
Hyperparameter suggestions¶
These are starting points. Not laws.
- Learning rate: 5e-5 to 5e-4, depending on dataset size and objective.
- Batch size: as large as fits, or simulate with gradient accumulation.
- Sequence length: 128-512 for small runs.
- Epochs: 1-5 for small corpora.
- Precision: bf16 if available; otherwise fp16 or fp32.
- Warmup: small fraction of total steps.
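To make the learning-rate and warmup suggestions above concrete, here is a minimal sketch of a linear warmup followed by linear decay to zero. It is an illustration of the shape, not the exact scheduler any library ships. The function name, the 5% warmup fraction, and the 5e-5 peak are assumptions.

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-5, warmup_fraction: float = 0.05) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Ramps linearly up to peak_lr over the warmup steps, then decays
    linearly so the rate reaches zero at total_steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return peak_lr
    progress = (step - warmup_steps) / remaining
    return peak_lr * max(0.0, 1.0 - progress)
```

Printing `lr_at_step` for a handful of steps before launching a run is a cheap way to check that warmup ends where you expect and that the peak matches your config.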
Hints¶
- Clean the corpus first. Duplicates and junk distort tiny runs fast. See explainer §2.1.
- Keep the format consistent. If training examples have different separators every few rows, the model learns noise. See explainer §3.3.
- Use a small LR at first. Tiny corpora + big LR = catastrophic forgetting or unstable loss. See explainer §3.4 and §5.1.
- Track effective batch size. If memory is tight, use accumulation and write down the formula. See explainer §5.3.
- Checkpoint the run. Even a small local job can crash. See explainer §5.4.
- Stop from eval, not ego. If train loss falls but samples get worse, stop. See explainer §5.5.
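Two of the hints above reduce to small pieces of bookkeeping: the effective batch size formula and a patience-based stop rule driven by eval loss. The sketch below shows one possible version of each. The names, the default patience of 2, and the `min_delta` parameter are assumptions, not something the explainer prescribes.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of sequences that contribute to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def should_stop(eval_losses, patience: int = 2, min_delta: float = 0.0) -> bool:
    """Return True when the last `patience` evals all failed to improve.

    An eval counts as an improvement only if it beats the best earlier
    eval loss by more than min_delta.
    """
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)
```

For example, a per-device batch of 4 with 8 accumulation steps on one GPU behaves like a batch of 32 for the optimizer. A loss-based rule like this does not replace reading samples: if generations degrade while eval loss is still flat, that is also a reason to stop.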
Common pitfalls¶
- Forgetting a held-out split.
- Training on duplicate-heavy text.
- Using inconsistent prompt separators.
- Using too high a learning rate and then blaming the model.
- Reporting only train loss.
- Claiming "the model improved" without before/after samples.
- Comparing outputs at different decoding settings and calling it a fair test.
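The last pitfall is easy to avoid if the comparison is planned before anything is generated. The sketch below builds a list of generation jobs in which the base and tuned models always receive the same prompt, temperature, seed, and length limit. The function name, field names, and default values are hypothetical; the actual generation call is left to your generate.py.

```python
from itertools import product


def comparison_plan(prompts, temperatures=(0.7, 1.0), seed: int = 0,
                    max_new_tokens: int = 80):
    """Return paired jobs: each (prompt, temperature) appears once per model.

    Jobs are emitted in base/tuned pairs with identical decoding settings,
    so any difference in output comes from the weights, not the sampler.
    """
    plan = []
    for prompt, temperature in product(prompts, temperatures):
        settings = {
            "prompt": prompt,
            "temperature": temperature,
            "seed": seed,
            "max_new_tokens": max_new_tokens,
        }
        for model_name in ("base", "tuned"):
            plan.append({"model": model_name, **settings})
    return plan
```

With three prompts and the two default temperatures this yields twelve jobs, which covers the three-sample and two-temperature requirements in one pass.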
What to demonstrate in the writeup¶
- Why this corpus was chosen.
- What cleaning you did.
- Exact training format.
- Hyperparameters and why you picked them.
- Perplexity before and after.
- Three sample generations before and after.
- One failure → fix note mapped to explainer §6.1.
- One sentence on whether the run risked catastrophic forgetting.
Optional stretch goals¶
- Try two data formats and compare.
- Try one run with accumulation and one without.
- Compare plain text continuation versus instruction-style formatting.
- Build a tiny chosen/rejected set and write a note on how DPO would differ conceptually.
LinkedIn post template¶
Fine-tuned GPT-2 this week on a narrow domain corpus.
Before tuning, the model sounded generic. After tuning, the outputs picked up the domain vocabulary and structure much more reliably.
Three things that stood out:
1. Data cleaning mattered more than I expected.
2. Prompt / chat formatting changed behavior a lot.
3. Small LR decisions mattered because over-tuning can erase general behavior.
Biggest lesson: a pretrained model is the wiki reader. Fine-tuning is the shadowing. If the shadowing is messy, the behavior is messy.
Repo / notes: [link]
Why this hands-on lab matters¶
Many people can repeat the words pretraining, SFT, and RLHF. Far fewer can connect those stages to actual training knobs. This hands-on lab gives you the bridge. You will feel what small-data tuning changes, what it does not change, and how easily formatting and learning-rate mistakes show up in practice.