Assignment 5: GPT-2 Domain Fine-Tune

This folder implements the Week 5 hands-on lab described in ../05_hands_on_lab.md.

What is included

train.py — fine-tuning script using Hugging Face Trainer
eval.py — held-out perplexity evaluation for base vs tuned checkpoints
generate.py — before/after sample generation at two temperatures
config.yaml — model, data, and training hyperparameters
TRAINING_LOG.md — one failure → fix note in the module’s language
data/domain_corpus.txt — a narrow corpus of AI-platform runbook notes
data/prompts.json — three prompts for qualitative comparison

Dataset

The local corpus is intentionally narrow:

AI platform runbooks
support-assistant operating guidelines
prompt, eval, and incident-response notes

That gives the model a visible vocabulary target: latency, citations, prompt templates, catastrophic forgetting, escalation, held-out evals.

Training setup

The default config targets gpt2 for the real hands-on lab. For local smoke tests on CPU, override the model with sshleifer/tiny-gpt2.

Key defaults:

learning rate: 5e-5
batch size: 2
gradient accumulation: 4
effective batch size: 8 (batch size 2 × gradient accumulation 4)
sequence length: 128
epochs: 2
warmup ratio: 0.1

This setup is conservative on purpose. The module explainer warns that a tiny corpus combined with a high learning rate can trigger catastrophic forgetting quickly.
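To make the arithmetic behind these defaults concrete, here is a minimal sketch of how the effective batch size, optimizer step count, and warmup length relate. The number of training sequences is a hypothetical placeholder (this README does not state it), and the linear warmup followed by linear decay is an assumed schedule, not necessarily the one train.py configures. The step count is also a simplified approximation rather than the exact bookkeeping a training framework performs.

```python
import math

# Values copied from the "Key defaults" list above.
LEARNING_RATE = 5e-5
BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 4
EPOCHS = 2
WARMUP_RATIO = 0.1

# Hypothetical: the number of training sequences is not stated in this README.
NUM_TRAIN_SEQUENCES = 400


def effective_batch_size(batch_size, grad_accum_steps, num_devices=1):
    """Sequences contributing to each optimizer update."""
    return batch_size * grad_accum_steps * num_devices


def total_optimizer_steps(num_sequences, eff_batch, epochs):
    """Approximate number of optimizer updates over the whole run."""
    return math.ceil(num_sequences / eff_batch) * epochs


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


eff = effective_batch_size(BATCH_SIZE, GRAD_ACCUM_STEPS)
total = total_optimizer_steps(NUM_TRAIN_SEQUENCES, eff, EPOCHS)
warmup = math.ceil(total * WARMUP_RATIO)

print(f"effective batch size: {eff}")
print(f"optimizer steps: {total}, warmup steps: {warmup}")
for step in (0, warmup // 2, warmup, total):
    print(f"step {step:>3}: lr = {lr_at_step(step, total, warmup, LEARNING_RATE):.2e}")
```

With a corpus this small the run has few optimizer steps in total, so the warmup ratio translates into only a handful of low-LR updates. That is part of why the peak learning rate matters so much here.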

Commands

Real hands-on lab path

python3 train.py --config config.yaml
python3 eval.py --config config.yaml
python3 generate.py --config config.yaml

Local smoke-test path

python3 train.py --config config.yaml --model-name sshleifer/tiny-gpt2 --output-dir outputs/tiny_gpt2_smoke --max-train-samples 10 --max-eval-samples 2
python3 eval.py --config config.yaml --base-model-name sshleifer/tiny-gpt2 --tuned-model-path outputs/tiny_gpt2_smoke --max-eval-samples 2
python3 generate.py --config config.yaml --base-model-name sshleifer/tiny-gpt2 --tuned-model-path outputs/tiny_gpt2_smoke
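The smoke-test commands reuse config.yaml and override individual settings from the command line. The following is a minimal sketch of one way such overrides can be layered on top of a loaded config. It is not taken from train.py: the config key names, the default output directory, and the `apply_overrides` helper are all hypothetical, and a plain dict stands in for the parsed YAML so the example stays within the standard library.

```python
import argparse

# Stand-in for the parsed config.yaml. Key names and the default output
# directory are assumptions made for illustration only.
CONFIG = {
    "model_name": "gpt2",
    "output_dir": "outputs/gpt2_domain",
    "max_train_samples": None,
    "max_eval_samples": None,
}


def build_parser():
    """Flags mirror the ones used in the train.py smoke-test command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config.yaml")
    parser.add_argument("--model-name", default=None)
    parser.add_argument("--output-dir", default=None)
    parser.add_argument("--max-train-samples", type=int, default=None)
    parser.add_argument("--max-eval-samples", type=int, default=None)
    return parser


def apply_overrides(config, args):
    """Return a copy of config where any flag that was passed wins."""
    merged = dict(config)
    for key in config:
        value = getattr(args, key, None)
        if value is not None:
            merged[key] = value
    return merged


smoke_args = build_parser().parse_args([
    "--config", "config.yaml",
    "--model-name", "sshleifer/tiny-gpt2",
    "--output-dir", "outputs/tiny_gpt2_smoke",
    "--max-train-samples", "10",
    "--max-eval-samples", "2",
])
smoke_config = apply_overrides(CONFIG, smoke_args)
print(smoke_config)
```

Keeping the file as the source of defaults and treating flags as overrides means the real run and the smoke run differ only in the values explicitly passed on the command line.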

Smoke validation snapshot

The local tiny-model smoke run completed successfully in this workspace.

base eval loss: 10.8128586
tuned eval loss: 10.8124733
base perplexity: 49655.21
tuned perplexity: 49636.08
perplexity delta (base minus tuned): 19.13

That is only a sanity check, not a meaningful domain-tuning claim. The sshleifer/tiny-gpt2 model is too small to produce high-quality generations here, but the run shows that the train/eval/generate path works end to end.
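The perplexity figures above are consistent with the usual relationship between mean cross-entropy loss (in nats) and perplexity, namely perplexity = exp(loss). The sketch below recomputes them from the reported losses. It assumes eval.py uses that same definition, which the numbers suggest but this README does not state explicitly.

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from a mean per-token cross-entropy loss in nats."""
    return math.exp(mean_loss_nats)


# Eval losses copied from the smoke validation snapshot above.
base_loss = 10.8128586
tuned_loss = 10.8124733

base_ppl = perplexity(base_loss)
tuned_ppl = perplexity(tuned_loss)
delta = base_ppl - tuned_ppl  # positive means the tuned model scored lower

print(f"base perplexity:  {base_ppl:.2f}")
print(f"tuned perplexity: {tuned_ppl:.2f}")
print(f"perplexity delta: {delta:.2f}")
```

Because perplexity is exponential in the loss, a loss difference in the fourth decimal place still moves perplexity by double digits at this scale. That is another reason to read the smoke-run delta as noise rather than as evidence of learning.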

What to write up after a full run

Base-model perplexity on the held-out split
Tuned-model perplexity on the held-out split
Perplexity delta
Three before/after generations at matched decoding settings
One failure → fix note
One sentence on catastrophic-forgetting risk
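Matched decoding settings matter for the before/after generations because temperature alone reshapes the next-token distribution, independent of any fine-tuning. The following is a minimal standard-library sketch of temperature scaling applied to a softmax. The logits and the two temperature values are hypothetical, since this README does not state which temperatures generate.py uses.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

# Hypothetical temperatures; lower values sharpen the distribution.
for temperature in (0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.3f}" for p in probs)
    print(f"T={temperature}: [{formatted}]")
```

If the base and tuned models were sampled at different temperatures, differences in the outputs could come from the sampler rather than from the weights, so each before/after pair should share the same temperature and seed.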

Why this hands-on lab is mechanically useful

This is not frontier pretraining. It is a small post-training exercise that makes the knobs concrete:

data cleaning
formatting consistency
LR choice
warmup
effective batch size
held-out evaluation

If those terms still feel abstract, run this folder end to end and inspect the saved metrics JSON.