
06. SFT Behavior Copying — same loss, different scenes

What tooling makes visible and what behavior still lacks

In chapter 3, we saw that training still reduces to next-token loss. In chapter 5, we saw that tooling must expose tokenizer, collator, masks, loss, gradients, and checkpoint artifacts before we trust a wrapper.

The new problem is behavioral. A base model may know the content, and the tooling may run correctly, but the model can still answer like a document continuer instead of an assistant doing a job.

This chapter teaches the shadow shift: show user requests and assistant answers until the desired role, shape, and style become cheap continuations. The loss stays familiar; the scenes change.

What this file solves

A base model may know the content but answer in the wrong role, shape, or style. This file shows how clean user/assistant examples, loss on assistant tokens only, and behavior-plus-forgetting evals turn broad knowledge into assistant-shaped work.

Why known content still needs behavior examples

Knowing facts does not teach the model which role to play or which answer shape the product expects. SFT makes the desired assistant scene cheap by repeating exact user request and assistant response patterns.

When prompting cannot make the role cheap enough

The tempting repair is to keep adding instructions to the prompt. If the model has rarely practiced the desired assistant behavior, long prompts stay fragile and the answer drifts back toward generic continuation.

When examples change the answer shape

Before SFT, the model may answer an incident prompt with an essay.
After many clean examples, it learns that the assistant reply should be short bullets.
The loss is still next-token loss; the scenes changed.

Rule: SFT teaches by showing assistant behavior

SFT changes behavior by training on assistant answers we want the model to copy.

Why imitation changes behavior. SFT does not add a new brain. It shows the model many scenes where a user asks for work and the assistant gives the kind of answer we want.


1) Hook — no new magic loss appears

Pretraining predicts the next token. SFT also predicts the next token. The difference is the sequence:

<system> You are concise.
<user> Summarize this incident in 3 bullets.
<assistant> - Retry spike followed deploy...

The model learns that after <assistant>, the useful continuation is a concise answer, not generic prose.

The hook is that the loss did not become "helpfulness loss." The same token machine learns manners because the training scenes changed from loose documents to supervised work shifts.
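
To make the "same loss" claim concrete, here is a minimal sketch in plain Python. The probabilities are made up for illustration, and `next_token_loss` is a hypothetical helper rather than a library function; a real trainer would compute the same quantity from logits.

```python
import math

def next_token_loss(correct_token_probs):
    """Mean negative log-likelihood of the correct next tokens."""
    return sum(-math.log(p) for p in correct_token_probs) / len(correct_token_probs)

# Hypothetical probabilities the model assigned to each correct next token.
document_scene = [0.20, 0.50, 0.10, 0.40]  # loose pretraining text
chat_scene = [0.30, 0.25, 0.60, 0.05]      # <system>/<user>/<assistant> sequence

# One function, two scenes: only the input sequence differs.
pretraining_loss = next_token_loss(document_scene)
sft_loss = next_token_loss(chat_scene)
print(f"pretraining: {pretraining_loss:.3f}  sft: {sft_loss:.3f}")
```

Nothing in the function knows whether it is looking at a document or a chat turn. Any change in behavior has to come from which sequences are fed in and which positions contribute to the loss.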

2) Mental model — shadowing a senior operator

user task ──→ senior answer ──→ masked sequence ──→ loss on assistant
                 the shadow shift

The junior does not learn a new law of physics. They copy the demonstrated job.

pretraining scene: many possible continuations
SFT scene:         one demonstrated assistant continuation
preference scene:  several continuations ranked by usefulness

3) Running example — incident answer shape

Before SFT, first-token probabilities after the incident prompt might be:

Token      Probability
Incident   0.31
The        0.22
-          0.07

After 80k strong demonstrations:

Token      Probability
-          0.46
1.         0.18
Incident   0.05

The model is not suddenly more knowledgeable about rollbacks. It now expects assistant-style output.
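
One way to read "cheap continuation" is as negative log-probability: the loss the model pays for emitting a given first token. The sketch below uses only the probabilities listed in this section; `token_cost` is a hypothetical helper name.

```python
import math

# First-token probabilities from the running example in this section.
before_sft = {"Incident": 0.31, "The": 0.22, "-": 0.07}
after_sft = {"-": 0.46, "1.": 0.18, "Incident": 0.05}

def token_cost(prob):
    """Negative log-probability in nats: the loss paid for emitting this token."""
    return -math.log(prob)

for token in ("-", "Incident"):
    cost_before = token_cost(before_sft[token])
    cost_after = token_cost(after_sft[token])
    print(f"{token!r}: {cost_before:.2f} -> {cost_after:.2f} nats")
# '-': 2.66 -> 0.78 nats
# 'Incident': 1.17 -> 3.00 nats
```

Starting with a bullet became much cheaper, and opening with essay-style prose became more expensive. That shift in cost is the behavior change.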

4) Choosing SFT, prompting, or preferences

  • Prompting is best at selecting existing behavior at inference time, but weak at creating missing behavior priors.
  • SFT is best at teaching format, role, and task framing, but weak at ranking two acceptable answers.
  • Reward/PPO/DPO is best at choosing preferred behavior, but weak at teaching basic syntax from scratch.

SFT is the behavior foundation. Preference training sharpens taste after the foundation exists.

5) Loss masks decide what gets imitated

If loss covers all tokens, the model is also trained to write system and user prompts. If loss covers only assistant tokens, the gradient comes only from the answer behavior, and the system and user tokens serve purely as context.

system/user tokens:     context, usually masked
assistant tokens:       target behavior, usually trained
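
A minimal sketch of how such a mask might be built, assuming labels are aligned one-to-one with input tokens and that the training code skips positions labeled with an ignore value. The token ids, the `marker` role, and `build_labels` are hypothetical; `-100` is chosen because it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # assumption: the loss function skips this label value

def build_labels(segments):
    """segments: list of (role, token_ids). Returns (input_ids, labels).

    Only tokens in 'assistant' segments keep their ids as labels;
    every other position is replaced by IGNORE_INDEX.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Made-up token ids for one chat example.
segments = [
    ("system", [1, 11, 12]),           # <system> You are concise.
    ("user", [2, 21, 22, 23]),         # <user> Summarize this incident...
    ("marker", [3]),                   # <assistant> tag, treated as context here
    ("assistant", [31, 32, 33, 99]),   # answer tokens plus an end-of-turn token
]
input_ids, labels = build_labels(segments)
print(labels)
```

Whether the role marker and the end-of-turn token receive loss is a design choice. This sketch masks the marker and trains the end-of-turn token so the model learns where an answer stops. The shift that turns aligned labels into next-token targets is assumed to happen inside the loss code.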

6) Sloppy examples make sloppy assistants

If demonstrations include rambling caveats, fake certainty, or inconsistent JSON, SFT faithfully copies those flaws. The shadow shift is only as good as the senior operator.

Mini-FAQ. "Can preference training clean that later?" Some of it. But preference stages are weaker and more expensive when they must fight a badly copied base behavior.

7) What imitation fixes and can forget

  • 10k pristine rows are small, but can strongly improve style and format.
  • 500k mixed rows add coverage, but may also add style noise.
  • 5M weak synthetic rows add breadth, but risk blandness and copied artifacts.

The useful unit is not row count. It is correct behavioral demonstrations per failure mode.
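
Counting by that unit can be sketched as follows. The rows and the field names `failure_mode` and `reviewed_ok` are hypothetical, not a standard schema; the point is only that the count groups accepted demonstrations by the failure they are meant to fix.

```python
from collections import Counter

# Hypothetical dataset rows tagged with the failure mode they target.
rows = [
    {"failure_mode": "essay_instead_of_bullets", "reviewed_ok": True},
    {"failure_mode": "essay_instead_of_bullets", "reviewed_ok": True},
    {"failure_mode": "essay_instead_of_bullets", "reviewed_ok": False},
    {"failure_mode": "invalid_json", "reviewed_ok": True},
    {"failure_mode": "missing_escalation", "reviewed_ok": False},
]

def correct_demos_per_failure_mode(rows):
    """Count only rows a reviewer accepted, grouped by the failure they target."""
    return Counter(r["failure_mode"] for r in rows if r["reviewed_ok"])

coverage = correct_demos_per_failure_mode(rows)
all_modes = {r["failure_mode"] for r in rows}
uncovered = sorted(all_modes - set(coverage))  # modes with zero accepted rows
print(dict(coverage), uncovered)
```

A dataset can have many rows and still leave a failure mode with no accepted demonstration, which row count alone would not reveal.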

8) Signals that SFT improved behavior or caused forgetting

  • Healthy: task pass rate improves without broad capability collapse.
  • First degrading metric: regression on unrelated capabilities.
  • Misleading beginner metric: training loss alone.
  • Expert graph: per-task eval deltas versus SFT dataset source buckets.
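
The per-task delta check can be sketched in a few lines. The task names, scores, and the 0.02 tolerance are illustrative assumptions, and `eval_deltas` and `flag_regressions` are hypothetical helpers; a full version of the expert graph would repeat this per dataset source bucket.

```python
def eval_deltas(before, after):
    """Per-task score change; both dicts map task name -> pass rate in [0, 1]."""
    return {task: after[task] - before[task] for task in before}

def flag_regressions(deltas, target_tasks, tolerance=0.02):
    """Non-target tasks whose score dropped by more than `tolerance`."""
    return sorted(
        task for task, delta in deltas.items()
        if task not in target_tasks and delta < -tolerance
    )

# Hypothetical pass rates before and after an SFT run.
before = {"incident_summary": 0.62, "arithmetic": 0.70, "translation": 0.55}
after = {"incident_summary": 0.81, "arithmetic": 0.61, "translation": 0.54}

deltas = eval_deltas(before, after)
regressed = flag_regressions(deltas, target_tasks={"incident_summary"})
print(regressed)
```

In this made-up run the target task improves while one unrelated task drops past the tolerance, which is the forgetting signal that training loss alone would hide.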

9) Where SFT teaches behavior and where it cannot rank taste

SFT works unusually well for format, turn-taking, role behavior, and common task patterns. It becomes pathological when teams use it to force facts, policy nuance, or taste ranking into one canonical answer. It hits a limit when there are many valid responses and one target hides tradeoffs.

10) Wrong model: SFT teaches the model to reason

Wrong model: "SFT teaches the model to reason."

Replacement: SFT mostly teaches the model how good answers look for a task. Reasoning may improve if demonstrations expose useful intermediate structure, but imitation is the primitive.

11) Other ways imitation copies the wrong behavior

  • overfitting to one response style
  • loss on prompt tokens
  • synthetic homogeneity
  • refusal overgeneralization
  • train/eval template mismatch
  • catastrophic forgetting from narrow data
  • hidden benchmark leakage
  • copied hallucination patterns
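
Some of these failures can be screened mechanically. As one example, the sketch below checks for the crudest form of benchmark leakage: eval prompts that also appear in the training set after light normalization. The prompts and helper names are hypothetical, and this only catches near-verbatim overlap, not paraphrases.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())

def leaked_eval_prompts(train_prompts, eval_prompts):
    """Return eval prompts whose normalized form also occurs in training data."""
    train_set = {normalize(p) for p in train_prompts}
    return [p for p in eval_prompts if normalize(p) in train_set]

train_prompts = [
    "Summarize this incident in 3 bullets.",
    "Extract the order ID as JSON.",
]
eval_prompts = [
    "summarize this  incident in 3 bullets.",  # same prompt, different casing/spacing
    "Draft a rollback plan.",
]
leaked = leaked_eval_prompts(train_prompts, eval_prompts)
print(leaked)
```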

12) The same apprenticeship pattern in code review and ML

SFT has the same shape as apprenticeship in code review: repeated examples establish taste faster than policy documents. It also echoes supervised learning in classical ML: labels define the target, and bad labels become model behavior.

13) Quick test: do the examples teach the desired behavior?

  • Which behavior is each dataset slice supposed to teach?
  • Is loss applied only to assistant tokens, with system and user tokens masked out?
  • Does eval include non-target regressions?
  • Are synthetic rows distinguishable in logs?
  • Can humans explain why a target answer is good?

Where SFT turns base models into assistants

  • Instruct model variants — base checkpoints become chat-friendly through SFT.
  • Copilot-style repair agents — demonstrations teach edit/test/explain patterns.
  • Customer-support bots — tone, brevity, and escalation behavior come from examples.
  • JSON extraction models — repeated target shapes teach parser-compatible output.
  • Safety assistants — refusal wording is first demonstrated before preference tuning.
  • Tool-use datasets — assistant learns when to call a tool versus answer directly.
  • Domain copilots — SFT teaches domain workflows on top of general language.
  • Code-edit assistants — examples teach patch shape, test invocation, and explanation order.
  • Legal clause tools — demonstrations teach conservative drafting format.
  • Medical note summarizers — SFT teaches what to preserve and what not to invent.
  • SQL generation systems — examples teach schema-aware answer style.
  • Math tutors — demonstrations teach when to show steps versus final answer.
  • Moderation assistants — examples teach policy categories and refusal wording.
  • Enterprise search assistants — SFT teaches citation behavior on internal documents.
  • Report-writing copilots — examples teach executive summary, evidence, and caveat shape.

What you should remember

This chapter explained why SFT can change assistant behavior without introducing a new magic loss. The important idea is that the model still predicts next tokens, but now it practices user/assistant scenes where the desired answer shape is repeatedly shown.

You learned to build clean demonstrations, compute loss only on assistant tokens, and evaluate both improved behavior and possible forgetting. That addresses the opening failure: the base model may already know the content, and SFT makes the correct role, shape, and style cheap to continue.

Carry this diagnostic forward: when a model knows the facts but answers in the wrong form, inspect the demonstrations before assuming it needs more knowledge. If the examples are sloppy, ambiguous, or masked incorrectly, SFT will copy the sloppiness.

Remember:

  • SFT uses the same next-token loss in different scenes.
  • The examples teach role, format, and answer shape.
  • Loss masks decide what behavior is copied.
  • Clean small datasets can beat huge weak datasets.
  • SFT does not rank two good answers; it imitates one target.
  • Bad demonstrations make bad behavior statistically normal.

Check your understanding of SFT as imitation

  • Why can SFT use the same next-token loss as pretraining?
  • What does masking decide?
  • Why does SFT struggle to rank two valid answers?
  • What metric catches forgetting after SFT?
  • Why can a small clean SFT set outperform a huge weak one?
  • What behavior would you avoid trying to teach with one canonical target?

Interview Q&A

Q. Why does SFT improve formatting so strongly?
A. It repeatedly places correct format tokens after instruction contexts, raising their probability in exactly those scenes.
Common wrong answer to avoid: "The model gets a formatting module."

Q. Why is SFT not enough for preference nuance?
A. It usually provides one target answer, not a comparison between acceptable alternatives under a value function.
Common wrong answer to avoid: "The target answer is always uniquely best."

Q. What is the first SFT bug you check in code?
A. Label masking and chat template correctness, because loss can look good while training the wrong tokens.
Common wrong answer to avoid: "Only check final benchmark score."

Q. Why does SFT often improve perceived intelligence?
A. It makes existing knowledge appear in user-useful forms: concise answers, correct roles, structured output, and better task framing.
Common wrong answer to avoid: "SFT adds most of the model's knowledge."

Q. Why can SFT cause forgetting?
A. A narrow set of demonstrations can move probabilities toward one behavior family while reducing performance on unrelated tasks not represented in the data.
Common wrong answer to avoid: "Fine-tuning only affects the target task."

Q. When is more SFT data worse?
A. When added rows are noisy, duplicated, bland, contradictory, or synthetic in a way that teaches the model cheap style instead of robust behavior.
Common wrong answer to avoid: "More rows always reduce overfitting."

Apply now (10 min)

  1. Model the exercise: write one prompt/assistant pair for the incident summarizer.
  2. Your turn: mark which tokens should receive loss.
  3. Reproduce from memory: explain why SFT is same loss, different scenes.

Bridge. If SFT copies demonstrations, the next pressure is demonstration quality and protocol clarity. The model cannot obey a chat contract that the dataset formats inconsistently. → 07-chat-protocol-and-data-quality.md