
05. Assignment 5 — Fine-tune GPT-2 on a Domain Corpus

Week 5. Take a pretrained model, fine-tune it on a focused domain corpus, and measure the difference.

Required reading first: 02_explainer.md §2.2-§3.5 and §5.1-§5.5. You should understand the wiki reader, the shadowing, chat templates, catastrophic forgetting, LR schedules, precision, accumulation, and stop criteria before touching the run.

Goal

Fine-tune GPT-2 (124M) on a focused domain corpus. Measure perplexity before and after. Understand the training loop at a mechanical level, not just "I ran Trainer."

Requirements

  1. Data pipeline — load, clean, tokenize, split, and batch a domain corpus.
  2. Training loop — use Hugging Face Trainer or a manual loop, but be able to explain every major config choice.
  3. Eval — compute perplexity on a held-out test set, before and after fine-tuning.
  4. Generation — produce at least 3 qualitative samples showing domain adaptation.
  5. Training log — record at least one failure → fix iteration using explainer §6.1 vocabulary.
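
For requirement 3, perplexity is the exponential of the mean per-token negative log-likelihood on the held-out set. The sketch below assumes your eval loop can report, for each batch, the summed loss in nats and the number of predicted tokens; the function name and input shape are illustrative, not part of any library. If your loop only exposes a mean loss per batch, multiply it by that batch's token count first, so short batches do not get the same weight as long ones.

```python
import math


def perplexity(batch_stats):
    """Token-weighted perplexity.

    batch_stats: iterable of (summed_nll_in_nats, n_predicted_tokens),
    one pair per eval batch (an assumed format for this sketch).
    """
    total_nll = 0.0
    total_tokens = 0
    for nll_sum, n_tokens in batch_stats:
        total_nll += nll_sum
        total_tokens += n_tokens
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    # exp(mean NLL per token); lower is better.
    return math.exp(total_nll / total_tokens)
```

Run the same function on the same held-out batches for the base model and the fine-tuned model, so the before/after numbers are directly comparable.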

Suggested corpora

  • Your own writing: notes, blog posts, docs.
  • A narrow support or ops dataset.
  • Legal or medical public-domain text.
  • Internal-style technical documentation.
  • A chosen writing style if you want a lighter project.

Deliverables

  1. train.py — training script with hyperparameters documented.
  2. eval.py — compute perplexity on the held-out set.
  3. generate.py — generate samples with at least two temperatures.
  4. config.yaml or equivalent — hyperparameters and paths.
  5. README.md — dataset, training setup, curves, perplexity delta, sample outputs.
  6. TRAINING_LOG.md or equivalent section in README — what failed, what you changed, what improved.

Success criteria

  • Perplexity drops measurably on your domain test set.
  • Generated text sounds more domain-specific than the base model.
  • You can explain your learning rate, batch size, sequence length, and number of epochs.
  • You can explain whether the run was more like broad shadowing or narrow specialization.
  • You documented at least one failure → fix step.

Constraints

  • Use an existing pretrained GPT-2 checkpoint.
  • Do not pretend this is frontier pretraining. This hands-on lab is about post-training mechanics on a small scale.
  • Keep the dataset small enough to finish locally or on a rented GPU.
  • Preserve a held-out split. Do not evaluate on training text.

Suggested workflow

  1. Pick a domain with visible stylistic or vocabulary patterns.
  2. Clean obvious duplicates and junk. Apply the curriculum lesson from explainer §2.1.
  3. Decide the training format. If you want instruction-style behavior, use a consistent prompt template. Explainer §3.3 matters here.
  4. Run a base-model eval first. Save perplexity and sample generations.
  5. Fine-tune with a conservative LR. Explainer §5.1 and §3.4 explain why.
  6. Evaluate again.
  7. Compare outputs and write the failure → fix story.
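
The cleaning and split steps above can be sketched with the standard library alone. This is a minimal example under stated assumptions: documents are plain strings, "duplicate" means identical after lowercasing and whitespace collapsing, and the helper names (`normalize`, `dedupe`, `holdout_split`) are hypothetical. Real corpora may also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import random


def normalize(text):
    # Collapse whitespace and lowercase so trivial variants compare equal.
    return " ".join(text.split()).lower()


def dedupe(docs):
    """Drop empty documents and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if not norm:
            continue
        key = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def holdout_split(docs, test_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and return (train_docs, test_docs)."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_test = max(1, int(len(docs) * test_fraction))
    return docs[n_test:], docs[:n_test]
```

Deduplicate before splitting. Otherwise the same document can land in both train and test, and the perplexity drop will look better than it is.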

Hyperparameter suggestions

These are starting points. Not laws.

  • Learning rate: 5e-5 to 5e-4, depending on dataset size and objective.
  • Batch size: as large as fits, or simulate with gradient accumulation.
  • Sequence length: 128-512 for small runs.
  • Epochs: 1-5 for small corpora.
  • Precision: bf16 if available; otherwise fp16 or fp32.
  • Warmup: small fraction of total steps.

Hints

  • Clean the corpus first. Duplicates and junk distort tiny runs fast. See explainer §2.1.
  • Keep the format consistent. If training examples have different separators every few rows, the model learns noise. See explainer §3.3.
  • Use a small LR at first. Tiny corpora + big LR = catastrophic forgetting or unstable loss. See explainer §3.4 and §5.1.
  • Track effective batch size. If memory is tight, use accumulation and write down the formula. See explainer §5.3.
  • Checkpoint the run. Even a small local job can crash. See explainer §5.4.
  • Stop from eval, not ego. If train loss falls but samples get worse, stop. See explainer §5.5.
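
Two of the hints above reduce to a few lines of arithmetic. The sketch below writes down the effective batch size formula and a patience-based stop check on eval loss. Both function names and the default `patience=2` are assumptions made for this example; treat the stop rule as one reasonable criterion, not the one the explainer prescribes.

```python
def effective_batch_size(per_device_batch, accumulation_steps, n_devices=1):
    # Sequences contributing to each optimizer update.
    return per_device_batch * accumulation_steps * n_devices


def should_stop(eval_losses, patience=2, min_delta=0.0):
    """True when the last `patience` evals all failed to beat the earlier best.

    eval_losses: eval loss after each evaluation, oldest first.
    """
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)
```

For example, a per-device batch of 4 with 8 accumulation steps on one GPU gives an effective batch of 32 sequences per update. Record that number in the README alongside the learning rate, since the two interact.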

Common pitfalls

  • Forgetting a held-out split.
  • Training on duplicate-heavy text.
  • Using inconsistent prompt separators.
  • Using too high a learning rate and then blaming the model.
  • Reporting only train loss.
  • Claiming "the model improved" without before/after samples.
  • Comparing outputs at different decoding settings and calling it a fair test.
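
One way to avoid the last pitfall is to freeze the decoding settings in a single object and pass the same values to both models. In this sketch, `generate_base` and `generate_tuned` are hypothetical callables you would write around your own generation code; the field names and default values are illustrative, not recommendations.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodeSettings:
    temperature: float = 0.7
    top_p: float = 0.9
    max_new_tokens: int = 80
    seed: int = 0


def compare(prompts, generate_base, generate_tuned, settings):
    """Run both models on every prompt with identical decoding settings.

    generate_*: hypothetical callables, (prompt, **settings) -> str.
    """
    rows = []
    for prompt in prompts:
        kwargs = asdict(settings)
        rows.append({
            "prompt": prompt,
            "settings": kwargs,  # stored so the writeup can cite them
            "base": generate_base(prompt, **kwargs),
            "tuned": generate_tuned(prompt, **kwargs),
        })
    return rows
```

Because the settings are stored in each row, the before/after samples in the README can state exactly how they were decoded.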

What to demonstrate in the writeup

  • Why this corpus was chosen.
  • What cleaning you did.
  • Exact training format.
  • Hyperparameters and why you picked them.
  • Perplexity before and after.
  • Three sample generations before and after.
  • One failure → fix note mapped to explainer §6.1.
  • One sentence on whether the run risked catastrophic forgetting.

Optional stretch goals

  • Try two data formats and compare.
  • Try one run with accumulation and one without.
  • Compare plain text continuation versus instruction-style formatting.
  • Build a tiny chosen/rejected set and write a note on how DPO would differ conceptually.
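
For the DPO stretch goal, it may help to see the per-pair loss written out. Supervised fine-tuning minimizes next-token loss on one target text; DPO instead compares a chosen and a rejected response, each scored by the policy being trained and by a frozen reference model. The sketch below assumes you already have the four summed sequence log-probabilities; the argument names and `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference, the loss is log 2. It falls as the policy raises the chosen response relative to the rejected one, which is the conceptual difference from SFT worth noting in your writeup.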

LinkedIn post template

Fine-tuned GPT-2 this week on a narrow domain corpus.

Before tuning, the model sounded generic. After tuning, the outputs picked up the domain vocabulary and structure much more reliably.

Three things that stood out:

  1. Data cleaning mattered more than I expected.
  2. Prompt / chat formatting changed behavior a lot.
  3. Small LR decisions mattered because over-tuning can erase general behavior.

Biggest lesson: a pretrained model is the wiki reader. Fine-tuning is the shadowing. If the shadowing is messy, the behavior is messy.

Repo / notes: [link]

Why this hands-on lab matters

Many people can repeat the words pretraining, SFT, and RLHF. Far fewer can connect those stages to actual training knobs. This hands-on lab gives you the bridge. You will feel what small-data tuning changes, what it does not change, and how easily formatting and learning-rate mistakes show up in practice.