05. Assignment 5: Fine-tune GPT-2 on a Domain Corpus

Week 5. Take a pretrained model, fine-tune it on a focused domain corpus, and measure the difference.

Required reading first: 02_explainer.md §2.2-§3.5 and §5.1-§5.5. You should understand the wiki-reader and shadowing analogies (pretraining versus fine-tuning), chat templates, catastrophic forgetting, LR schedules, precision, gradient accumulation, and stop criteria before touching the run.

Goal

Fine-tune GPT-2 (124M) on a focused domain corpus. Measure perplexity before and after. Understand the training loop at a mechanical level. Not just "I ran Trainer."

Requirements

Data pipeline — load, clean, tokenize, split, and batch a domain corpus.
Training loop — use Hugging Face Trainer or a manual loop, but be able to explain every major config choice.
Eval — compute perplexity on a held-out test set, before and after fine-tuning.
Generation — produce at least 3 qualitative samples showing domain adaptation.
Training log — record at least one failure → fix iteration using explainer §6.1 vocabulary.
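
The eval requirement depends on computing perplexity the same way before and after fine-tuning. A minimal sketch, assuming your eval loop can report the summed negative log-likelihood (in nats) and the number of predicted tokens for each batch; the function name and the numbers below are illustrative, not part of the assignment.

```python
import math

def perplexity(batches):
    """Token-weighted perplexity over an eval set.

    `batches` is an iterable of (nll_sum, num_tokens) pairs, one per batch.
    Summing first and dividing once avoids over-weighting short batches,
    which happens if you average per-batch mean losses instead.
    """
    total_nll = 0.0
    total_tokens = 0
    for nll_sum, num_tokens in batches:
        total_nll += nll_sum
        total_tokens += num_tokens
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(total_nll / total_tokens)

# Hypothetical numbers, only to show the call shape.
before = perplexity([(350.0, 100), (180.0, 50)])
after = perplexity([(300.0, 100), (150.0, 50)])
print(f"perplexity before={before:.2f} after={after:.2f}")
```

A useful sanity check: a model that spreads probability uniformly over a vocabulary of size V has perplexity V, so a reported value far above the vocabulary size usually points to a bug in the eval code.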

Suggested corpora

Your own writing: notes, blog posts, docs.
A narrow support or ops dataset.
Legal or medical public-domain text.
Internal-style technical documentation.
A chosen writing style if you want a lighter project.

Deliverables

train.py — training script with hyperparameters documented.
eval.py — compute perplexity on the held-out set.
generate.py — generate samples with at least two temperatures.
config.yaml or equivalent — hyperparameters and paths.
README.md — dataset, training setup, curves, perplexity delta, sample outputs.
TRAINING_LOG.md or equivalent section in README — what failed, what you changed, what improved.
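
For generate.py, it helps to know what temperature does to the next-token distribution before comparing samples. The sketch below works on a toy list of logits in plain Python; it is not how any particular library implements sampling, and `toy_logits` is made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores, not real GPT-2 output
for temperature in (0.7, 1.2):
    rng = random.Random(0)  # same seed for every setting keeps the comparison fair
    probs = softmax_with_temperature(toy_logits, temperature)
    token = sample_token(toy_logits, temperature, rng)
    print(temperature, [round(p, 3) for p in probs], token)
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution. When comparing base and fine-tuned samples, keep temperature, seed, and prompt identical across both models.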

Success criteria

Perplexity drops measurably on your domain test set.
Generated text sounds more domain-specific than the base model.
You can explain your learning rate, batch size, sequence length, and number of epochs.
You can explain whether the run was more like broad shadowing or narrow specialization.
You documented at least one failure → fix step.

Constraints

Use an existing pretrained GPT-2 checkpoint.
Do not pretend this is frontier pretraining. This hands-on lab is about post-training mechanics at small scale.
Keep the dataset small enough to finish locally or on a rented GPU.
Preserve a held-out split. No cheating by evaluating on train text.
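
One way to keep the held-out split honest is to assign each document to a split by hashing its text, so the assignment is stable across runs and identical documents can never land on both sides. This is a sketch under that assumption; the function names and the 5% defaults are illustrative choices, not requirements.

```python
import hashlib

def assign_split(text, val_pct=5, test_pct=5):
    """Map a document to 'train', 'val', or 'test' from a hash of its content."""
    digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

def split_corpus(docs, val_pct=5, test_pct=5):
    """Split whole documents, before any chunking into training sequences."""
    splits = {"train": [], "val": [], "test": []}
    for doc in docs:
        splits[assign_split(doc, val_pct, test_pct)].append(doc)
    return splits
```

Splitting at the document level, before chunking, matters: if you chunk first and split second, neighboring chunks of the same document can end up in both train and test.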

Recommended workflow

Pick a domain with visible stylistic or vocabulary patterns.
Clean obvious duplicates and junk. Apply the curriculum lesson from explainer §2.1.
Decide the training format. If you want instruction-style behavior, use a consistent prompt template. Explainer §3.3 matters here.
Run a base-model eval first. Save perplexity and sample generations.
Fine-tune with a conservative LR. Explainer §5.1 and §3.4 explain why.
Evaluate again.
Compare outputs and write the failure → fix story.
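
The cleaning step in the workflow above can start very simply. The sketch below drops exact duplicates (after whitespace and case normalization) and very short fragments; the `min_chars` threshold and function names are assumptions for illustration, and near-duplicate detection is deliberately left out.

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def clean_corpus(docs, min_chars=20):
    """Keep the first copy of each document; drop short junk and repeats."""
    seen = set()
    kept = []
    for doc in docs:
        key = normalize(doc)
        if len(key) < min_chars:
            continue  # too short to carry domain signal
        if key in seen:
            continue  # duplicate of something already kept
        seen.add(key)
        kept.append(doc.strip())
    return kept

raw = [
    "Reset the router  before calling support.",
    "reset the router before calling support.",
    "ok",
    "Check the cable first, then the power supply.",
]
print(len(raw), "->", len(clean_corpus(raw)))
```

Run the cleaning before the train/test split so a duplicate cannot survive on both sides of it, and record the before/after counts in your README.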

Hyperparameter suggestions

These are starting points. Not laws.

Learning rate: 5e-5 to 5e-4, depending on dataset size and objective. Start at the low end.
Batch size: as large as fits, or simulate with gradient accumulation.
Sequence length: 128-512 for small runs.
Epochs: 1-5 for small corpora.
Precision: bf16 if available; otherwise fp16 or fp32.
Warmup: small fraction of total steps.
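
Batch size, accumulation, epochs, and warmup are linked: the effective batch size determines how many optimizer steps the run has, and warmup is a fraction of those steps. The sketch below writes the arithmetic out, using linear warmup followed by linear decay as one common schedule. The helper names and example numbers are assumptions, and the steps-per-epoch count is approximate because frameworks differ in how they treat a final partial batch.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Sequences that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def total_optimizer_steps(num_sequences, eff_batch, epochs):
    """Approximate update count, keeping the last partial batch of each epoch."""
    return math.ceil(num_sequences / eff_batch) * epochs

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_span = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_span)

# Hypothetical run: 2000 training sequences, micro-batch 4, accumulate 8, 3 epochs.
eff = effective_batch_size(per_device_batch=4, grad_accum_steps=8)
steps = total_optimizer_steps(num_sequences=2000, eff_batch=eff, epochs=3)
warmup = int(0.05 * steps)  # 5% warmup, an assumed "small fraction"
for step in (0, warmup, steps // 2, steps):
    print(step, lr_at_step(step, steps, peak_lr=5e-5, warmup_steps=warmup))
```

Writing these numbers into config.yaml next to the raw hyperparameters makes it easier to explain later why a run with a smaller micro-batch and more accumulation should behave similarly.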

Hints

Clean the corpus first. Duplicates and junk distort tiny runs fast. See explainer §2.1.
Keep the format consistent. If training examples have different separators every few rows, the model learns noise. See explainer §3.3.
Use a small LR at first. Tiny corpora + big LR = catastrophic forgetting or unstable loss. See explainer §3.4 and §5.1.
Track effective batch size. If memory is tight, use accumulation and write down the formula. See explainer §5.3.
Checkpoint the run. Even a small local job can crash. See explainer §5.4.
Stop from eval, not ego. If train loss falls but samples get worse, stop. See explainer §5.5.
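
"Stop from eval" can be made concrete with a patience rule on the held-out loss. This is one simple stopping criterion among several, sketched with assumed names and defaults; the explainer may define its stop criteria differently.

```python
def should_stop(eval_losses, patience=2, min_delta=0.0):
    """True when the last `patience` evals all failed to beat the earlier best.

    `eval_losses` is the history of held-out losses, oldest first.
    An eval counts as an improvement only if it is at least `min_delta`
    below the best loss seen before the patience window.
    """
    if len(eval_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)

def best_checkpoint(eval_losses):
    """Index of the eval with the lowest held-out loss (first one on ties)."""
    return eval_losses.index(min(eval_losses))

history = [3.20, 3.00, 2.90, 2.95, 3.00]  # hypothetical eval losses
print(should_stop(history), best_checkpoint(history))
```

Pairing this with checkpointing means you can roll back to the checkpoint at `best_checkpoint(history)` instead of keeping whatever the final step produced. A loss-based rule does not replace reading samples; it only catches the numeric half of the problem.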

Common pitfalls

Forgetting a held-out split.
Training on duplicate-heavy text.
Using inconsistent prompt separators.
Using too high a learning rate and then blaming the model.
Reporting only train loss.
Claiming "the model improved" without before/after samples.
Comparing outputs at different decoding settings and calling it a fair test.
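
The separator pitfall is easiest to avoid when every training example and every inference prompt comes from a single template. The sketch below uses a made-up question/answer layout; the `### Question:` and `### Answer:` markers are illustrative assumptions, while `<|endoftext|>` is GPT-2's end-of-text token string.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token
PROMPT_PART = "### Question:\n{question}\n\n### Answer:\n"  # hypothetical layout

def format_prompt(question):
    """Prompt used at inference time; training examples start with exactly this."""
    return PROMPT_PART.format(question=question.strip())

def format_example(question, answer):
    """Full training example: prompt, answer, then the end-of-text marker."""
    return format_prompt(question) + answer.strip() + EOS

def check_format(example):
    """Cheap validation to run over the whole dataset before training."""
    return (
        example.startswith("### Question:\n")
        and example.count("\n\n### Answer:\n") == 1
        and example.endswith(EOS)
    )

example = format_example(" How do I reset the router? ", " Hold the button for 10 seconds. ")
print(check_format(example))
```

Because `format_example` is built from `format_prompt`, the prompt seen at generation time cannot drift from the one seen in training. Running `check_format` over every row also catches stray rows imported from a differently formatted source.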

What to demonstrate in the writeup

Why this corpus was chosen.
What cleaning you did.
Exact training format.
Hyperparameters and why you picked them.
Perplexity before and after.
Three sample generations before and after.
One failure → fix note mapped to explainer §6.1.
One sentence on whether the run risked catastrophic forgetting.

Optional stretch goals

Try two data formats and compare.
Try one run with accumulation and one without.
Compare plain text continuation versus instruction-style formatting.
Build a tiny chosen/rejected set and write a note on how DPO would differ conceptually.
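
For the DPO note, it may help to see the per-pair loss written out. The sketch below follows the DPO objective as it is usually stated: it compares how much the policy prefers the chosen response over the rejected one, relative to a frozen reference model. Inputs are assumed to be summed log-probabilities of each full response; the function name and the `beta=0.1` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp        # policy vs reference on chosen
    rejected_shift = policy_rejected_logp - ref_rejected_logp  # policy vs reference on rejected
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probabilities, only to show the call shape.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The conceptual contrast with this assignment: supervised fine-tuning raises the likelihood of each target text on its own, while this loss only cares about the gap between a preferred and a rejected response, and it needs a second, frozen copy of the model as a reference.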

LinkedIn post template

Fine-tuned GPT-2 this week on a narrow domain corpus.

Before tuning, the model sounded generic. After tuning, the outputs picked up the domain vocabulary and structure much more reliably.

Three things that stood out:
1. Data cleaning mattered more than I expected.
2. Prompt / chat formatting changed behavior a lot.
3. Small LR decisions mattered because over-tuning can erase general behavior.

Biggest lesson: a pretrained model is the wiki reader. Fine-tuning is the shadowing. If the shadowing is messy, the behavior is messy.

Repo / notes: [link]

Why this hands-on lab matters

Many people can repeat the words pretraining, SFT, and RLHF. Far fewer can connect those stages to actual training knobs. This hands-on lab gives you the bridge. You will feel what small-data tuning changes, what it does not change, and how easily formatting and learning-rate mistakes show up in practice.