08. Preferences, Reward Models, PPO, and DPO — choosing better answers without breaking the model¶

What SFT fixed and what preferences still must choose¶

In chapter 6, SFT made assistant-shaped answers cheap. In chapter 7, protocol discipline made sure the model sees the same role boundaries during training, eval, and serving.

The new problem is that two answers can both follow the format, preserve the facts, and obey the protocol, yet one is much better for the user. One is concise and specific; another is polite but vague. One refuses safely; another refuses too broadly.

This chapter teaches the preference desk: collect chosen/rejected pairs, update against a reference model, and watch drift signals so taste improves without breaking skill.

What this file solves¶

SFT shows one acceptable answer, but products often need to choose between several acceptable answers. This file shows how to collect chosen/rejected pairs, tune against a reference model, and watch KL, length, refusal rate, and factuality while improving taste.

Why SFT is not enough for taste¶

One demonstration says "this answer is acceptable." Product behavior often needs a ranking among acceptable answers: concise over verbose, calibrated over overconfident, useful refusal over blanket refusal.

When optimizing preference can break skill¶

The naive repair is to push hard toward whatever the judge rewards. If the reward is imperfect or the update moves too far from the reference model, the checkpoint can become stylish while losing factuality, coverage, or refusal balance.

When the shorter answer should win¶

Answer A is short, specific, and useful.
Answer B is polite, long, and vague.
Preference training teaches the model that A should win, without letting it forget how to answer other tasks.

Rule: preferences move taste without breaking skill¶

Preference training should make better answers more likely without breaking the model's useful skills.

Why the leash matters. Preference training teaches the model which good-looking answer should win. It also needs a leash, because chasing the judge too hard can damage the model.

1) Hook — both answers are correct, only one should win¶

Prompt:

Explain why retries rose after the deploy.

Answer A is concise and names the likely cause. Answer B is longer, polite, and vague. SFT may imitate either if it appears as the target. Preference training says: when both are plausible, choose A.

The surprise is that the "better" answer may contain fewer tokens, less politeness, and less explanation. Preference training exists because usefulness is often relational: it shows up when two answers are compared, not in a single gold answer.

2) Mental model — preference desk with a tether¶

prompt ──→ answer A ─┐
                     ├─→ human/model preference ─→ training step
prompt ──→ answer B ─┘                                │
                                                       ▼
                                      improved policy, tethered to reference

Without a tether, the model can exploit the judge. With too strong a tether, it cannot improve.

chosen/rejected pair ──→ preference signal
preference signal + reference model ──→ small behavior shift
small shift + evals ──→ useful change or caught reward hack

3) Running example — incident explanation ranking¶

Chosen:

Retries rose because the deploy likely changed payment-worker behavior;
rollback began at 14:18, so compare retry rate before and after that point.

Rejected:

Retries can happen for many reasons. The team should investigate carefully
and communicate with stakeholders.

The chosen answer is not more grammatical. It is more operational.
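
A minimal sketch of how this pair might be stored for training, assuming one JSON object per line. The `PreferencePair` class, the `to_jsonl_line` helper, and the `reason` field are hypothetical names chosen for illustration, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    reason: str  # hypothetical field: the product tradeoff this label encodes


pair = PreferencePair(
    prompt="Explain why retries rose after the deploy.",
    chosen=(
        "Retries rose because the deploy likely changed payment-worker behavior; "
        "rollback began at 14:18, so compare retry rate before and after that point."
    ),
    rejected=(
        "Retries can happen for many reasons. The team should investigate carefully "
        "and communicate with stakeholders."
    ),
    reason="operational specificity over generic advice",
)


def to_jsonl_line(p: PreferencePair) -> str:
    """Serialize one pair as a single JSON line."""
    return json.dumps(asdict(p), ensure_ascii=False)


line = to_jsonl_line(pair)
```

Recording the reason next to the label makes it easier to audit later which tradeoff each pair was meant to teach.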

4) Choosing RM+PPO or DPO under drift risk¶

Reward model + PPO trains a judge and optimizes the policy against it under a KL penalty; it fits online-style improvement, but is complex and can be unstable.
DPO optimizes directly from chosen/rejected pairs; it is simpler offline preference tuning, but assumes the collected pairs and a fixed reference model are enough, with no fresh sampling from the policy.
Rejection sampling generates many answers and keeps the best; it is simple when the judge is good, but expensive and diversity-reducing.
More SFT imitates chosen answers and is easy to run, but gives weak pairwise pressure.
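
The rejection-sampling option above can be sketched as best-of-n selection. This assumes the candidate answers were already generated; `best_of_n` and `toy_judge` are hypothetical names, and the toy judge is a deliberately crude stand-in for a real reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_judge(answer: str) -> float:
    """Hypothetical judge: rewards a concrete number or time, lightly penalizes length."""
    has_number = any(ch.isdigit() for ch in answer)
    return (1.0 if has_number else 0.0) - 0.001 * len(answer)


candidates = [
    "Retries can happen for many reasons.",
    "Rollback began at 14:18; compare retry rate before and after.",
]
winner = best_of_n(candidates, toy_judge)
```

Everything the method learns comes from `score`, which is why it is only as good as the judge.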

PPO exposes the control loop. DPO compresses the loop into a supervised-looking objective.
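
To make "supervised-looking objective" concrete, here is a sketch of the DPO loss for one pair. It assumes you already have the summed log-probability of each answer under the policy and under the frozen reference; the function names and the default `beta=0.1` are illustrative choices, not recommendations.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    return softplus(-margin)  # equals -log(sigmoid(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). The loss falls only as the policy raises the chosen answer relative to the rejected one, measured against the reference, which is where the tether enters.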

5) KL sets the drift budget¶

The KL penalty asks: how far did the tuned policy move from the reference policy? If reward rises while KL explodes, the model may be gaming the reward model instead of becoming more useful.

Too little drift gives little improvement.
Moderate drift gives useful style shift.
Too much drift invites reward hacking, verbosity, and weird refusals.
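
One way to read the drift budget in code: estimate KL from per-token log-probabilities of a sampled answer, then subtract a scaled penalty from the reward. This is a sketch under assumptions; the function names and `beta=0.05` are hypothetical, and production PPO setups often apply the penalty per token rather than per sequence.

```python
from typing import Sequence


def sequence_kl_estimate(
    policy_logps: Sequence[float], ref_logps: Sequence[float]
) -> float:
    """Sum of per-token log-ratios for tokens sampled from the policy.

    A single-sample Monte Carlo estimate of KL(policy || reference);
    it can be negative for an individual answer.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def shaped_reward(reward: float, kl: float, beta: float = 0.05) -> float:
    """Reward-model score minus the KL penalty that PPO would optimize."""
    return reward - beta * kl
```

A larger `beta` shortens the leash: the same reward gain has to pay more for each unit of drift.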

6) Reward hacking turns taste into damage¶

If the preference desk over-rewards polite caveats, PPO can produce long, hedged answers that score well and annoy users.

flowchart LR
  A[Reward model likes politeness] --> B[PPO optimizes reward]
  B --> C[More caveats]
  C --> D[Higher reward score]
  C --> E[Lower product usefulness]
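
A cheap diagnostic for this loop is to check whether reward simply tracks answer length across a batch. The sketch below uses word count as the length measure and a hypothetical threshold of 0.8; both are assumptions to tune, and a high correlation is a prompt to inspect samples, not proof of hacking.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def reward_tracks_length(
    rewards: Sequence[float], answers: Sequence[str], threshold: float = 0.8
) -> bool:
    """Flag a batch where reward rises almost in lockstep with word count."""
    lengths = [float(len(a.split())) for a in answers]
    return pearson(rewards, lengths) >= threshold
```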

7) What preference tuning improves and risks¶

Preference collection may need 20k chosen/rejected pairs; human judgment becomes the bottleneck.
Reward modeling uses a small classifier or ranker; proxy quality becomes the risk.
PPO carries policy, reference, and reward model together (and usually a value model as well); memory and stability become expensive.
DPO trains policy against a reference on pairs; it is simpler, but explores less online.
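
For the reward-modeling row, one common formulation is a Bradley-Terry style pairwise loss over scalar scores. This sketch assumes the reward model has already produced one score per answer; the function names are hypothetical.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Modelled probability that the chosen answer beats the rejected one."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)  # avoids overflow for large negative differences
    return e / (1.0 + e)


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), in a stable form."""
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss depends only on the score difference, so the reward model learns a ranking, not an absolute measure of quality. That is one reason the score is a proxy.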

8) Signals that preference training is helping or reward-hacking¶

Healthy: preference win rate rises while KL, length, refusal rate, and factuality stay bounded.
First degrading metric: answer length or refusal rate drifts before humans complain.
Misleading beginner metric: reward score alone.
Expert graph: reward, KL, length, factuality, and human win rate together.
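
The "expert graph" can be approximated by a guardrail check that compares each checkpoint against a baseline. This is a sketch: `EvalSnapshot`, `drift_alerts`, and every threshold below are hypothetical placeholders that a team would set from its own evals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalSnapshot:
    win_rate: float      # human or judged preference win rate
    kl: float            # estimated KL from the reference model
    mean_length: float   # mean answer length in tokens
    refusal_rate: float  # fraction of prompts refused
    factuality: float    # score from an eval independent of the reward model


def drift_alerts(
    baseline: EvalSnapshot,
    current: EvalSnapshot,
    *,
    kl_budget: float = 10.0,
    max_length_ratio: float = 1.3,
    max_refusal_delta: float = 0.05,
    max_factuality_drop: float = 0.02,
) -> List[str]:
    """Return the names of guardrail metrics that left their bounds."""
    alerts = []
    if current.kl > kl_budget:
        alerts.append("kl")
    if current.mean_length > baseline.mean_length * max_length_ratio:
        alerts.append("length")
    if abs(current.refusal_rate - baseline.refusal_rate) > max_refusal_delta:
        alerts.append("refusal_rate")
    if baseline.factuality - current.factuality > max_factuality_drop:
        alerts.append("factuality")
    return alerts
```

Win rate is deliberately not a guardrail here: a rising win rate with a non-empty alert list is exactly the pattern to investigate.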

9) Where preferences help and where missing facts remain¶

Preference methods work well for style, helpfulness, refusal calibration, and tradeoffs among valid answers. They become pathological when the judge is shallow, labels are inconsistent, or factuality is not separately evaluated. They cannot create missing domain knowledge.

10) Wrong model: the reward model is human values¶

Wrong model: "The reward model represents human values."

Replacement: the reward model is a proxy trained from limited comparisons. The preference desk can be useful and still incomplete.

11) Other ways the judge can be gamed¶

reward hacking
verbosity inflation
sycophancy
over-refusal
under-refusal
factuality regression hidden by style wins
preference label inconsistency
KL collapse or no movement
DPO overfitting to pair artifacts
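
Several items in this list start in the pair data itself. As one example, a quick audit for a length artifact: if the chosen answer is almost always the longer one, the model can learn "longer wins" instead of the intended tradeoff. The function name is hypothetical and word count is an assumed proxy for length.

```python
from typing import Sequence, Tuple


def chosen_longer_fraction(pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs where chosen has more words."""
    if not pairs:
        raise ValueError("need at least one pair")
    longer = sum(1 for chosen, rejected in pairs if len(chosen.split()) > len(rejected.split()))
    return longer / len(pairs)
```

A value far from 0.5 in either direction is worth a manual look before training, since it may reflect labeler habit more than product value.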

12) The same dashboard-chasing failure in product systems¶

This is the same failure that metrics-driven product teams hit: chase the dashboard too hard and users suffer. It also mirrors control systems: feedback helps only when the measured signal tracks the real goal.

13) Quick test: do the pairs encode the real tradeoff?¶

What real product value does each preference label encode?
Is factuality evaluated outside the reward score?
Are length and refusal rates monitored?
Is there a reference model and drift budget?
Do pair labels include hard tradeoff cases, not only obvious wins?

Where preference training shapes assistant behavior¶

InstructGPT-style RLHF — SFT, reward model, PPO, and KL control.
DPO-tuned open models — pairwise preferences without explicit PPO rollout.
Chatbot helpfulness tuning — chooses concise, direct answers over generic prose.
Safety tuning — calibrates refusal boundaries from comparisons.
Code assistants — ranks patch usefulness, not just syntactic validity.
Customer support bots — optimizes tone and escalation choices.
Evaluation vendors — collect pairwise judgments and rubrics.
Red-team loops — add preference pressure after failure discovery.
Search ranking — pairwise clicks or judgments teach which result should win.
Recommendation systems — optimizing engagement without guardrails creates proxy harm.
Summarization products — users prefer concise faithful summaries over verbose coverage.
Coding agents — patch usefulness and test behavior need ranking beyond syntax.
Policy tuning — refusal boundaries are comparisons, not one universal phrase.
Tutor models — explanations are ranked by pedagogical usefulness, not length.
Enterprise assistants — brand tone and escalation choices require taste calibration.

What you should remember¶

This chapter explained why preference training exists after SFT. The important idea is that SFT can show one acceptable answer, but products often need to choose between several acceptable answers without letting the model drift away from useful behavior.

You learned to collect chosen/rejected pairs, choose an update method such as RM+PPO or DPO, anchor the model against a reference, and watch KL, length, refusal rate, factuality, and slice metrics. That solves the opening failure because it teaches which good-looking answer should win while limiting reward hacking and skill loss.

Carry this diagnostic forward: when preference tuning improves style but hurts truth, coverage, or refusals, inspect the judge and the drift budget. A reward signal is not human values; it is a training pressure that can be gamed.

Remember:

SFT shows one acceptable answer; preferences choose between acceptable answers.
Pair quality matters more than preference jargon.
The reference model is the leash.
Reward score is not product truth.
Watch KL, length, refusal rate, factuality, and human win rate together.
If style improves while usefulness drops, suspect reward hacking or drift.

Check your understanding of preference tuning¶

Why does preference training come after SFT?
What does KL protect?
Why is reward score alone dangerous?
When is DPO simpler than PPO, and what does it give up?
Why is a preference label not the same thing as a truth label?
Which independent metrics would you monitor while optimizing a reward proxy?

Interview Q&A¶

Q. Why not use SFT for all preference learning?
A. SFT imitates a target answer; preference data directly teaches which of two plausible answers should win.
Common wrong answer to avoid: "SFT and preference training are identical because both use text."

Q. What is reward hacking in RLHF?
A. The policy learns behaviors that score well under the reward model but do not improve true user value.
Common wrong answer to avoid: "Reward hacking only happens with malicious prompts."

Q. Why use a KL penalty in PPO?
A. It limits drift from the reference model so training does not destroy capability or chase the reward model too aggressively.
Common wrong answer to avoid: "KL is just a regularization term with no behavioral meaning."

Q. Why can DPO be simpler operationally than PPO?
A. It optimizes directly from chosen/rejected pairs against a fixed reference model, avoiding online rollouts and a separate reward-model control loop.
Common wrong answer to avoid: "DPO means there is no reference or drift concern."

Q. Why should factuality be evaluated outside the reward score?
A. A reward model may learn style signals correlated with preference while missing factual errors that get worse when training chases the score.
Common wrong answer to avoid: "If humans preferred it, it must be factual."

Q. What kind of pair data is most valuable?
A. Pairs where both answers are plausible but differ on a real product tradeoff, because obvious bad/good pairs teach little about the boundary.
Common wrong answer to avoid: "Only collect easy wins for clean labels."

Apply now (10 min)¶

Model the exercise: write a chosen/rejected pair for the incident bot.
Your turn: name one proxy failure your pair could introduce.
Reproduce from memory: draw the preference desk with a reference-model tether.

Bridge. Preference training gives another powerful knob, but a lifecycle is not complete until you decide when to stop, what to ship, and which failures belong to the next module's adaptation and compression work. → 09-lifecycle-decisions-evals.md