07. Chat Protocol and Data Quality — behavior lives in tiny boundaries¶

What SFT teaches and what protocol can still break¶

In chapter 6, we saw that SFT teaches assistant behavior by showing user/assistant scenes and masking the right tokens. That solves the broad role problem: the model gets repeated examples of the answer shape we want.

The new problem is that scenes are not just human-readable text. They are rendered through chat templates, role markers, special tokens, masks, and validators. If those boundaries move between training, eval, and serving, the same visible words become a different model input.

This chapter teaches protocol discipline: render every row with the same template, mask only assistant tokens, reject bad role sequences, and inspect exactly what the model sees.

What this file solves¶

The same text can behave differently when role markers, templates, masks, or data quality drift. This file shows how to render every row with the same chat template, mask only assistant tokens, and reject bad role sequences before training, eval, or serving.

Why chat protocol is part of the model's job¶

Role markers and templates tell the model where instructions stop and assistant tokens begin. Treating them as formatting hides the real contract: training, eval, and serving must present the same conversation boundary.

When same words become a different conversation¶

The naive repair is to keep the visible text and change only the wrapper. If the special tokens, role order, or masks change, the model sees a different protocol even though the human-readable prompt looks unchanged.

When role markers move¶

<user> summarize this and <assistant> summary... teach two roles.
If serving uses different markers, the model may not know where the user's text ends and the assistant's answer begins.

Rule: chat behavior follows role boundaries¶

Chat behavior is stable only when role boundaries stay the same in training, eval, and serving.

Why boundaries change behavior. A chat template tells the model who is speaking. If those boundaries move between training and serving, the model is asked to do a different job.

1) Hook — the invisible template breaks the visible answer¶

Two datasets contain the same text. One wraps it as:

<|user|> ... <|assistant|> ...

Another uses:

[INST] ... [/INST] ...

If the checkpoint expects one protocol and inference uses another, quality drops for reasons that look mysterious.

The interesting part is that the visible words can be identical while the learned boundary is different. To the model, special tokens and role markers are part of the world, not decoration around it.

2) Mental model — chat as a wire protocol¶

system rules ─┐
user request ─┼─→ template ─→ token ids ─→ model ─→ assistant tokens
tool result  ─┘

The template is part of the API. It tells the model where instruction ends and answer begins.

same conversation
      │
      ├─ rendered with train template  → familiar boundary
      └─ rendered with serve template  → shifted boundary, strange behavior

3) Running example — incident bot with role confusion¶

Bad row:

User: Summarize the incident.
Assistant: User should not restart workers.

The answer says "User" as prose. In some templates, that can be confused with a role marker. Clean data avoids collisions and enforces exact assistant behavior.

Attempt A: concatenate text with loose separators. Faster ingestion, more role confusion.

Attempt B: canonical chat template, validation checks, and loss masks generated from roles.

4) Better rows beat more rows for protocol bugs¶

Add weak rows — helps coverage, but adds noisy style and contradictions.
Curate fewer rows — helps precision, but can miss long-tail cases.
Synthetic expansion — helps rapid breadth, but risks blandness and hidden artifacts.
Human gold rows — helps trust, but costs more and scales slowly.

For assistant behavior, one contradictory template can poison many otherwise good examples.

5) Curation prevents regression¶

Quality filters should check:

role sequence legality
assistant answer presence
refusal calibration
factual preservation
duplicate prompts with conflicting targets
unsafe or private content
template render/decode round trip

6) Train and eval templates can disagree¶

flowchart TD
  A[Train with Template A] --> B[Good train loss]
  C[Serve with Template B] --> D[Role boundary shift]
  D --> E[Bad completions]
  B --> F[Misleading confidence]

The model did not forget. The serving input is out of protocol.

7) What strict templates fix and slow down¶

Exact duplicate prompt checks at 100% duplicate match catch conflicting targets.
Source-specific p99 answer-length checks catch rambling targets.
Template round-trip tests on every row catch broken special tokens.
Human audit samples of about 200 rows per bucket catch subtle policy and style bugs.

8) Signals that train and serve templates disagree¶

Healthy: train, eval, and serving all render identical role boundaries.
First degrading metric: format pass rate drops only in served traffic.
Misleading beginner metric: aggregate SFT loss.
Expert graph: eval score by template version and dataset bucket.

9) Where protocol rigor matters most¶

Protocol rigor is strongest for chat, tool use, JSON, and multi-turn systems. It becomes less central for plain completion models. It hits a limit when product behavior depends on external state or policy engines that cannot be represented in static examples.

10) Wrong model: chat templates are just formatting¶

Wrong model: "Chat templates are formatting."

Replacement: chat templates are behavioral boundary markers. The wiki reader learns where the assistant identity starts because the protocol repeats.

11) Other ways role boundaries leak into behavior¶

assistant learns to generate role markers
system message appears inside user text
tool outputs are trained as assistant prose
refusal examples contradict policy
duplicate prompts have different labels
synthetic answers are overlong
answer masks include padding tokens
eval uses a different BOS/EOS convention

12) The same boundary problem in APIs and tool agents¶

This mirrors API schema discipline in backend systems: loose serialization creates downstream ambiguity. It also foreshadows tool-calling agents, where role and tool boundaries become security boundaries, not just quality boundaries.

13) Quick test: can you render exactly what the model sees?¶

Can you render any row exactly as the model sees it?
Are special tokens checkpoint-compatible?
Can a validator reject illegal role sequences?
Do train and inference share the same template code?
Are duplicate conflicts measured?

Where chat protocol bugs appear in real systems¶

Llama chat templates — [INST] conventions affect behavior.
ChatML-style formats — role markers define system/user/assistant boundaries.
Tool-calling datasets — function results must not be learned as assistant speech.
OpenAI-compatible servers — APIs hide templates, but models still need them.
HF Tokenizers — apply_chat_template is a training and serving contract.
Enterprise assistants — system policy drift creates inconsistent refusals.
JSON mode fine-tunes — schema examples need exact target masking.
Function-calling agents — tool results must be context, not assistant-authored claims.
Moderation systems — policy role and user content must not blur.
Multi-agent chats — speaker identity becomes part of task correctness.
Evaluation harnesses — prompt serialization must match deployment.
Voice chat systems — turn boundaries affect interruption and continuation.
Notebook copilots — code, output, markdown, and user intent need separate roles.
Customer-support transcripts — copied agent/customer labels can poison speaker behavior.
RAG chatbots — retrieved context must stay source material, not become assistant identity.

What you should remember¶

This chapter explained why chat protocol is part of the model's learned job, not decoration around the text. The important idea is that role markers, templates, special tokens, and loss masks define where the user stops and the assistant begins.

You learned to render every row with the same chat template, mask only assistant tokens, reject bad role sequences, and keep train, eval, and serving prompts aligned. That solves the opening failure because identical visible words can become different conversations when the boundary tokens change.

Carry this diagnostic forward: when a chat checkpoint behaves oddly after data conversion or serving changes, inspect the rendered tokens before blaming the model. If boundaries moved, the model may be answering a different protocol than the one it practiced.

Remember:

Same visible chat can become different token sequences.
The template is part of the training contract.
Loss should teach assistant behavior, not user or system text.
Bad role order creates bad behavior signal.
Train, eval, and serving must use the same renderer.
If quality drops after deployment, inspect rendered tokens before blaming the checkpoint.

Check your understanding of role boundaries¶

Why is a chat template part of model behavior?
What failure appears when train and serve templates differ?
Why can fewer curated rows beat many weak rows?
What validator would you add before SFT?
Why is a role marker more like an API boundary than a visual label?
What data bug can make loss fall while deployment quality drops?

Interview Q&A¶

Q. Why do role markers matter in SFT?
A. They mark which text is instruction, context, tool output, and assistant answer, so the model learns correct speaker boundaries.
Common wrong answer to avoid: "They only make logs readable."

Q. How can train loss hide template bugs?
A. The model can learn the training serialization well while serving uses a different serialization that shifts boundaries.
Common wrong answer to avoid: "Low loss always means deployment behavior is good."

Q. What does data quality mean beyond grammar?
A. Correct role order, consistent targets, policy accuracy, factual preservation, deduplication, and template compatibility.
Common wrong answer to avoid: "Quality means polished writing."

Q. Why should train and inference share template code?
A. Separate renderers can drift in special tokens, BOS/EOS placement, role order, or spacing, creating inputs the model was not tuned to handle.
Common wrong answer to avoid: "Templates only affect readability."

Q. How can duplicate prompts with conflicting answers hurt SFT?
A. They give the same context different targets, weakening the behavior signal and teaching inconsistency as if it were legitimate variance.
Common wrong answer to avoid: "Duplicates only waste tokens."

Q. Why are tool outputs dangerous in chat data?
A. If tool outputs are labeled or masked like assistant speech, the model may learn to fabricate tool results instead of treating them as external context.
Common wrong answer to avoid: "All text in the conversation should be learned equally."

Apply now (10 min)¶

Model the exercise: render one chat row with explicit system/user/assistant tokens.
Your turn: write three validation rules for that row.
Reproduce from memory: draw chat as a wire protocol.

Bridge. Clean demonstrations create useful behavior, but they still show only one answer at a time. The next pressure is choosing between multiple plausible answers, which moves us to the preference desk. → 08-preferences-reward-ppo-dpo.md