Assignment 5 — GPT-2 Domain Fine-Tune

This folder implements the Week 5 hands-on lab from ../05_hands_on_lab.md.

What is included

  • train.py — fine-tuning script using Hugging Face Trainer
  • eval.py — held-out perplexity evaluation for base vs tuned checkpoints
  • generate.py — before/after sample generation at two temperatures
  • config.yaml — model, data, and training hyperparameters
  • TRAINING_LOG.md — one failure → fix note in the module’s language
  • data/domain_corpus.txt — a narrow corpus of AI-platform runbook notes
  • data/prompts.json — three prompts for qualitative comparison

Dataset

The local corpus is intentionally narrow:

  • AI platform runbooks
  • support-assistant operating guidelines
  • prompt, eval, and incident-response notes

That gives the model a visible vocabulary target: latency, citations, prompt templates, catastrophic forgetting, escalation, held-out evals.

Training setup

The default config targets gpt2 for the real hands-on lab. For local smoke tests on CPU, override the model with sshleifer/tiny-gpt2 by passing --model-name to train.py, as in the smoke-test commands under Commands.

Key defaults:

  • learning rate: 5e-5
  • batch size: 2
  • gradient accumulation: 4
  • effective batch size: 8
  • sequence length: 128
  • epochs: 2
  • warmup ratio: 0.1

This setup is conservative on purpose. The module explainer warns that a tiny corpus combined with a high learning rate can trigger catastrophic forgetting quickly.
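
The arithmetic behind these defaults can be sketched as follows. This is a minimal illustration, not the code in train.py: it assumes a linear warmup followed by linear decay (the default scheduler type in Hugging Face Trainer) and a hypothetical count of 400 training sequences. The real count depends on data/domain_corpus.txt and the tokenizer, and Trainer's exact step accounting may differ slightly.

```python
import math

# Values taken from the "Key defaults" list.
LEARNING_RATE = 5e-5
BATCH_SIZE = 2
GRAD_ACCUM = 4
EPOCHS = 2
WARMUP_RATIO = 0.1

# Assumption: hypothetical number of 128-token training sequences.
NUM_TRAIN_SEQUENCES = 400

# One optimizer step consumes BATCH_SIZE * GRAD_ACCUM sequences.
effective_batch = BATCH_SIZE * GRAD_ACCUM
steps_per_epoch = math.ceil(NUM_TRAIN_SEQUENCES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With a corpus this small, the warmup phase covers only a handful of optimizer steps, which is one reason the peak learning rate is kept low.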

Commands

Real hands-on lab path

python3 train.py --config config.yaml
python3 eval.py --config config.yaml
python3 generate.py --config config.yaml

Local smoke-test path

python3 train.py --config config.yaml --model-name sshleifer/tiny-gpt2 --output-dir outputs/tiny_gpt2_smoke --max-train-samples 10 --max-eval-samples 2
python3 eval.py --config config.yaml --base-model-name sshleifer/tiny-gpt2 --tuned-model-path outputs/tiny_gpt2_smoke --max-eval-samples 2
python3 generate.py --config config.yaml --base-model-name sshleifer/tiny-gpt2 --tuned-model-path outputs/tiny_gpt2_smoke

Smoke validation snapshot

The local tiny-model smoke run completed successfully in this workspace.

  • base eval loss: 10.8128586
  • tuned eval loss: 10.8124733
  • base perplexity: 49655.21
  • tuned perplexity: 49636.08
  • perplexity delta (base minus tuned): 19.13

That is only a sanity check, not a meaningful domain-tuning claim: the delta is roughly 0.04% of the base perplexity. sshleifer/tiny-gpt2 is too small for high-quality generations here, but the run proves the train/eval/generate path works end to end.
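
The perplexity figures in the snapshot follow from the eval losses, since perplexity is the exponential of the mean cross-entropy loss (in nats). A minimal sketch that recomputes them from the two losses reported above; the function name is illustrative and not necessarily what eval.py uses:

```python
import math


def perplexity(eval_loss: float) -> float:
    """Perplexity is exp() of the mean token-level cross-entropy loss."""
    return math.exp(eval_loss)


# Eval losses copied from the smoke validation snapshot.
base_loss = 10.8128586
tuned_loss = 10.8124733

base_ppl = perplexity(base_loss)
tuned_ppl = perplexity(tuned_loss)
delta = base_ppl - tuned_ppl          # positive means the tuned model is better
relative = delta / base_ppl           # fraction of base perplexity

print(f"base perplexity:  {base_ppl:.2f}")
print(f"tuned perplexity: {tuned_ppl:.2f}")
print(f"delta:            {delta:.2f} ({relative:.4%} of base)")
```

Reporting the relative change next to the absolute delta makes it harder to over-read a number like 19.13 when the base perplexity is in the tens of thousands.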

What to write up after a full run

  1. Base-model perplexity on the held-out split
  2. Tuned-model perplexity on the held-out split
  3. Perplexity delta
  4. Three before/after generations at matched decoding settings
  5. One failure → fix note
  6. One sentence on catastrophic-forgetting risk
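
For item 4, "matched decoding settings" means the base and tuned models are sampled with the same temperature, seed, and other decoding parameters, so that any difference in output comes from the weights. The sketch below shows what temperature does to a next-token distribution. It is a toy illustration in plain Python, not the logic in generate.py; the logits and the two temperature values (0.7 and 1.0) are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.7, 1.0):  # assumed values, not read from config.yaml
    probs = softmax_with_temperature(toy_logits, temperature)
    rng = random.Random(0)  # reusing one seed keeps the comparison matched
    draws = [sample_token(toy_logits, temperature, rng) for _ in range(5)]
    print(temperature, [round(p, 3) for p in probs], draws)
```

Lower temperatures concentrate probability on the highest-scoring token, so the low-temperature samples show what the model prefers while the higher-temperature samples show how varied its output can be.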

Why this hands-on lab is mechanically useful

This is not frontier pretraining. It is a small post-training exercise that makes the knobs concrete:

  • data cleaning
  • formatting consistency
  • LR choice
  • warmup
  • effective batch size
  • held-out evaluation

If those words still feel abstract, run this folder end to end and inspect the saved metrics JSON.
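
One way to inspect a metrics file without knowing its schema is to flatten it and print every numeric field. This is a minimal sketch: the file name metrics.json and the key names in the demo are hypothetical stand-ins, so point load_numeric_fields at whatever path the scripts actually write. The demo values are the eval losses from the smoke validation snapshot.

```python
import json
import tempfile
from pathlib import Path


def numeric_fields(node, prefix=""):
    """Yield (dotted_key, value) for every number in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from numeric_fields(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from numeric_fields(value, f"{prefix}{index}.")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix.rstrip("."), node


def load_numeric_fields(path):
    """Read a JSON file and return its numeric fields as a flat dict."""
    with open(path, encoding="utf-8") as handle:
        return dict(numeric_fields(json.load(handle)))


# Demo on a stand-in file; the key names here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "metrics.json"
    demo_path.write_text(
        json.dumps(
            {"base": {"eval_loss": 10.8128586}, "tuned": {"eval_loss": 10.8124733}}
        ),
        encoding="utf-8",
    )
    demo_fields = load_numeric_fields(demo_path)
    for key, value in demo_fields.items():
        print(f"{key}: {value}")
```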