01. Base Model Product Contract — fluency is not usefulness¶

What the module setup gives us and what still breaks¶

In 00-eli5.md, we set up the lifecycle as a chain of pressures: the wiki reader reads broadly, the curriculum decides what world it practices, the shadow shift teaches assistant-shaped work, and the preference desk later chooses between plausible answers. That gives us the map, but not the first failure.

The first failure is simple and easy to miss: the base model can be good at continuing text while bad at doing the user's job. It may know the topic, sound fluent, and still ignore the product contract.

This chapter teaches the first split to carry through the whole module: knowledge is not the same as contract-following. Before adding data, tools, preferences, or eval gates, we need to ask whether the model lacks facts or lacks the practiced behavior the product expects.

What this file solves¶

A base model can sound fluent while ignoring the user's actual job. This file shows how to separate text continuation from the product contract, write the contract as explicit eval rows, and add demonstrations that make contract-shaped answers likely.

Why product contracts need more than fluent continuation¶

The product does not buy "likely next text." It buys a bounded outcome: exact format, preserved facts, safe refusals, and useful next action. A base model may know the words but still not know which obligations matter.

When prompting a base model still misses the job¶

The tempting repair is to ask more clearly: "Return exactly three bullets." If format obedience was never practiced as part of the job, the model may still answer with a fluent paragraph instead of three fact-preserving bullets.

Rule: continuation is not the product contract¶

A base model continues text; an assistant must finish the user's job.

Why fluency is not enough. The base model is trying to write likely next text. The product needs a useful outcome, so training has to make the job-shaped answer easier than the generic continuation.

1) Hook — the answer sounds nearby but misses the job¶

Ask a base checkpoint:

Summarize this incident update in exactly 3 bullets.

Payments retries rose after the 14:05 deploy. Rollback started at 14:18.
Do not restart workers manually; the queue is draining. Finance wants an ETA.

It may answer:

Incident updates help teams coordinate during outages. Payment retries can
increase when deployments change service behavior. Rollbacks are common.

The answer is fluent. It is also a contract failure. It did not produce three bullets, preserve the operational warnings, or help finance.

Teacher voice. The dangerous failure is not gibberish. It is nearby prose that looks intelligent enough to hide that the product task was ignored.

The curiosity hook is that the model can fail because it is doing its original job well. Article-like continuation is not a bug under pretraining. It becomes a bug only after a product wraps the model in an instruction API and expects obedience.

2) Mental model — continuation versus contract¶

┌─────────────────────┐        ┌─────────────────────┐
│ base model pressure │        │ assistant pressure  │
├─────────────────────┤        ├─────────────────────┤
│ continue text       │        │ obey instruction    │
│ imitate corpus      │   ≠    │ preserve context    │
│ be plausible        │        │ respect format      │
│ minimize token loss │        │ satisfy user job    │
└─────────────────────┘        └─────────────────────┘

The wiki reader sees a document prefix. The product sees an API call with obligations. Same prompt, different contract.

same visible text
      │
      ├─ pretraining view: "what text usually follows?"
      │
      └─ product view:     "what outcome did the caller buy?"

3) Running example — incident summarizer¶

We will carry one task through the chapter: a support system needs an incident summarizer that returns exactly three bullets, each under twelve words, preserving actions.
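The format half of this contract is concrete enough to check mechanically. The following is a minimal sketch, assuming a bullet is a line that starts with `- ` and that "under twelve words" means at most eleven whitespace-separated words. The names `check_format` and `passes_format` are illustrative, not part of any library, and the check says nothing about whether actions are preserved.

```python
import re

REQUIRED_BULLETS = 3
MAX_WORDS_PER_BULLET = 11  # assumption: "under twelve words" means at most eleven


def check_format(answer: str) -> dict:
    """Return one pass/fail flag per format rule of the three-bullet contract."""
    lines = [line.strip() for line in answer.strip().splitlines() if line.strip()]
    bullets = [line for line in lines if re.match(r"^- \S", line)]
    word_counts = [len(bullet[2:].split()) for bullet in bullets]
    return {
        "only_bullets": len(bullets) == len(lines),
        "exactly_three": len(bullets) == REQUIRED_BULLETS,
        "short_bullets": all(n <= MAX_WORDS_PER_BULLET for n in word_counts),
    }


def passes_format(answer: str) -> bool:
    """True only when every format rule holds."""
    return all(check_format(answer).values())


# The fluent article-style answer from the hook section.
fluent_paragraph = (
    "Incident updates help teams coordinate during outages. Payment retries can\n"
    "increase when deployments change service behavior. Rollbacks are common."
)
# A three-bullet answer built from the incident text.
bulleted_answer = (
    "- Retry spike followed the 14:05 deploy.\n"
    "- Rollback began at 14:18.\n"
    "- Do not restart workers; queue is draining."
)

print(passes_format(fluent_paragraph))  # False
print(passes_format(bulleted_answer))   # True
```

Returning one flag per rule, rather than a single boolean, keeps it visible which obligation failed.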

Attempt A: keep pretraining and prompt harder.

Please, very carefully, summarize in exactly 3 bullets...

This sometimes works. It is not a dependable product boundary, because training never made the three-bullet behavior statistically central.

Attempt B: create demonstrations.

User: Summarize in exactly 3 bullets...
Assistant:
- Retry spike followed the 14:05 deploy.
- Rollback began at 14:18.
- Do not restart workers; queue is draining.

The second approach teaches the model what answer shape belongs after that request.
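One way to picture how a demonstration becomes a training signal is to join the request and the answer into one text and supervise only the answer. The sketch below is an assumption about the data layout, not the module's actual pipeline: it uses the `User:` / `Assistant:` template from the example above and a per-character mask for readability, whereas real trainers typically mask per token. `build_sft_example` is a hypothetical helper.

```python
def build_sft_example(user_text: str, assistant_text: str) -> dict:
    """Join one demonstration into training text plus a per-character loss mask.

    Mask value 0 marks prompt characters (no loss); 1 marks answer characters.
    """
    prompt = f"User: {user_text}\nAssistant:\n"
    text = prompt + assistant_text
    loss_mask = [0] * len(prompt) + [1] * len(assistant_text)
    return {"text": text, "loss_mask": loss_mask}


example = build_sft_example(
    user_text="Summarize in exactly 3 bullets...",
    assistant_text=(
        "- Retry spike followed the 14:05 deploy.\n"
        "- Rollback began at 14:18.\n"
        "- Do not restart workers; queue is draining."
    ),
)

# Recover only the characters that would contribute to the loss.
supervised = "".join(
    ch for ch, keep in zip(example["text"], example["loss_mask"]) if keep
)
print(supervised.splitlines()[0])  # - Retry spike followed the 14:05 deploy.
```

Masking the prompt means the model is graded on producing the three bullets, not on reproducing the request.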

4) Why demonstrations beat more pretraining for contract failures¶

More raw pretraining helps when the model lacks domain facts, but it gives weak signal for the exact incident-summary contract.
Better prompting helps when the behavior already exists, but it stays fragile across wording and long contexts.
SFT examples help when format and task behavior are missing, but they can imitate bad examples if curation is weak.
Preference training helps when valid answers need ranking, but it comes after basic behavior already exists.

For this workload, 50,000 clean incident pairs can beat 5B generic text tokens because the missing behavior is concentrated in the pairs.

5) Local plausibility is not user usefulness¶

Suppose the first answer token distribution before SFT looks like this:

| Token | Probability |
|---|---|
| `Incident` | 0.34 |
| `The` | 0.20 |
| `-` | 0.08 |
| `1.` | 0.05 |
| other | 0.33 |

For a three-bullet product, `-` or `1.` should dominate, yet together they hold only 0.13 of the probability mass. The base model's top choice is not "wrong" under pretraining; article-style continuations were common in the curriculum.
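The gap can be read straight off the table by summing the mass on list-opening tokens. A small sketch using the illustrative probabilities above; the grouping into "list openers" and "prose openers" is an assumption made for this example, and the `other` bucket is left unclassified.

```python
# First-token probabilities from the table above (illustrative values).
first_token_probs = {
    "Incident": 0.34,
    "The": 0.20,
    "-": 0.08,
    "1.": 0.05,
    "other": 0.33,
}

LIST_OPENERS = {"-", "1."}          # tokens that can start a bullet or numbered list
PROSE_OPENERS = {"Incident", "The"}  # tokens that start an article-style sentence

list_mass = sum(p for tok, p in first_token_probs.items() if tok in LIST_OPENERS)
prose_mass = sum(p for tok, p in first_token_probs.items() if tok in PROSE_OPENERS)

print(f"list openers:  {list_mass:.2f}")   # list openers:  0.13
print(f"prose openers: {prose_mass:.2f}")  # prose openers: 0.54
```

Under these numbers, the contract-compatible opening is roughly four times less likely than the article-style one before any answer content is written.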

6) Format obeyed, content dropped¶

A common intermediate model returns:

- Retry spike followed the deploy.
- Rollback started.
- Finance wants an ETA.

It obeys format but drops "do not restart workers." That exposes the next pressure: product behavior is not only shape; it is preservation of task-critical facts.
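A separate check for task-critical facts makes this failure visible even when the format is perfect. The sketch below is deliberately crude: it assumes each critical fact can be detected by lowercase substring matching, which would miss paraphrases in practice. `CRITICAL_FACTS` and `missing_facts` are hypothetical names, and the single phrase list is an assumption chosen for this incident.

```python
# Each critical fact maps to phrases that must all appear in the answer.
# Assumption: substring matching is good enough for a first-pass check.
CRITICAL_FACTS = {
    "no_manual_restart": ["not restart"],
}


def missing_facts(answer: str, critical_facts: dict) -> list:
    """Return the names of critical facts the answer fails to mention."""
    text = answer.lower()
    return [
        name
        for name, phrases in critical_facts.items()
        if not all(phrase in text for phrase in phrases)
    ]


intermediate_answer = (
    "- Retry spike followed the deploy.\n"
    "- Rollback started.\n"
    "- Finance wants an ETA."
)
preserving_answer = (
    "- Retry spike followed the 14:05 deploy.\n"
    "- Rollback began at 14:18.\n"
    "- Do not restart workers; queue is draining."
)

print(missing_facts(intermediate_answer, CRITICAL_FACTS))  # ['no_manual_restart']
print(missing_facts(preserving_answer, CRITICAL_FACTS))    # []
```

Both answers have three short bullets, so only a check like this one separates them.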

7) What contract training fixes and costs¶

5B extra generic tokens cost many GPU-days and mostly buy broader language compression.
50k incident SFT rows at 180 tokens each cost about 9M targeted tokens and buy task framing plus output shape.
5k pairwise preferences cost human-review time and buy ranking between already acceptable summaries.

Small targeted stages can move behavior more than large unfocused stages.
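The token arithmetic behind that comparison is short enough to verify directly. This sketch only compares token counts from the figures above; it assumes a single pass over the SFT rows and does not model GPU-days or human-review time.

```python
sft_rows = 50_000
tokens_per_row = 180
generic_tokens = 5_000_000_000  # the 5B extra generic tokens

sft_tokens = sft_rows * tokens_per_row
share_of_generic = sft_tokens / generic_tokens

print(f"{sft_tokens:,} targeted tokens")           # 9,000,000 targeted tokens
print(f"{share_of_generic:.2%} of the 5B budget")  # 0.18% of the 5B budget
```

The targeted stage is under a fifth of a percent of the generic token budget, which is why it can be cheap and still decisive for a contract failure.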

8) Signals that fluency is hiding a contract failure¶

Healthy: exact-format pass rate and action-preservation rate rise together.
First degrading metric: critical-fact omission increases on long incidents.
Misleading beginner metric: average response fluency.
Expert graph: format pass rate versus factual preservation by incident length.
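The expert graph reduces to two pass rates per incident-length bucket. A minimal sketch follows, assuming each eval result already carries a format flag and a fact-preservation flag. The rows are invented toy data to show the shape of the computation, not measured results, and the 200-word bucket boundary is an arbitrary assumption.

```python
from collections import defaultdict

LONG_INCIDENT_WORDS = 200  # assumption: arbitrary boundary between buckets


def bucket_for(incident_words: int) -> str:
    return "long" if incident_words >= LONG_INCIDENT_WORDS else "short"


def rates_by_bucket(results: list) -> dict:
    """Compute format and fact-preservation pass rates per length bucket."""
    totals = defaultdict(lambda: {"n": 0, "format": 0, "facts": 0})
    for row in results:
        bucket = totals[bucket_for(row["incident_words"])]
        bucket["n"] += 1
        bucket["format"] += int(row["format_ok"])
        bucket["facts"] += int(row["facts_ok"])
    return {
        name: {
            "format_pass_rate": b["format"] / b["n"],
            "fact_preservation_rate": b["facts"] / b["n"],
        }
        for name, b in totals.items()
    }


# Toy rows for illustration only.
toy_results = [
    {"incident_words": 30, "format_ok": True, "facts_ok": True},
    {"incident_words": 40, "format_ok": True, "facts_ok": True},
    {"incident_words": 400, "format_ok": True, "facts_ok": False},
    {"incident_words": 650, "format_ok": True, "facts_ok": True},
]

rates = rates_by_bucket(toy_results)
for name, bucket_rates in rates.items():
    print(name, bucket_rates)
```

In the toy rows, format passes everywhere while fact preservation drops in the long bucket, which is the pattern an average over all incidents would hide.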

9) Where contract training helps and where facts are missing¶

This diagnosis fits when the model knows the words but violates the job. It stops fitting when the domain facts are absent: demonstrations can teach the answer shape, but SFT cannot surface knowledge the model never acquired. It also breaks at scale when the product contract includes hidden policies that appear in neither the demonstrations nor the evals.

10) Wrong model: a smarter base model will naturally obey¶

Wrong model: "A smarter base model will naturally follow instructions."

Replacement: knowledge and obedience are learned from different distributions. The wiki reader can know the policy and still fail the response contract until the shadow shift makes that contract common.

11) Other ways base models miss the product job¶

correct topic, wrong format
correct format, missing critical action
answers as if writing an article
continues the user's text as another speaker
over-explains because the corpus rewarded exposition
refuses harmless tasks because safety wording is misgeneralized
follows the last instruction but ignores earlier constraints

12) The same contract gap in APIs, agents, and prompts¶

This is the same shape as API contract drift: a service can be internally healthy while violating the caller's schema. It also echoes prompt-injection failures later in the track: the model confuses text to transform with instructions to obey. The shared invariant is boundary recognition under ambiguous text.

13) Quick test: is this knowledge failure or contract failure?¶

Can you name the exact product contract?
Does the eval separate fluency from task completion?
Do demonstrations contain the boundary cases users actually send?
Can prompt-only fixes survive paraphrases?
Do you know whether the failure is factual, behavioral, or preference-ranked?
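The paraphrase question in the list above can be turned into a number: send the same incident under several instruction wordings and count how often the format holds. The sketch below assumes a `generate` callable that maps a prompt to text. `toy_model` is a stand-in written for this example, not a real model; it obeys only one exact phrasing, to mimic a fragile prompt-only fix. The paraphrases are likewise invented.

```python
from typing import Callable, Iterable


def is_three_bullets(answer: str) -> bool:
    """Format check only: exactly three non-empty lines, each starting with '- '."""
    lines = [line.strip() for line in answer.splitlines() if line.strip()]
    return len(lines) == 3 and all(line.startswith("- ") for line in lines)


def paraphrase_pass_rate(
    generate: Callable[[str], str],
    instructions: Iterable[str],
    incident: str,
) -> float:
    """Fraction of instruction wordings for which the answer keeps the format."""
    instructions = list(instructions)
    passed = sum(
        is_three_bullets(generate(f"{instruction}\n\n{incident}"))
        for instruction in instructions
    )
    return passed / len(instructions)


def toy_model(prompt: str) -> str:
    # Stand-in for a checkpoint: complies only with one exact wording.
    if "exactly 3 bullets" in prompt:
        return "- first point\n- second point\n- third point"
    return "Incident updates help teams coordinate during outages."


incident = "Payments retries rose after the 14:05 deploy. Rollback started at 14:18."
paraphrases = [
    "Summarize this incident update in exactly 3 bullets.",
    "Summarize in exactly 3 bullets.",
    "Give me three bullet points summarizing this update.",
    "Three-bullet summary, please.",
]

print(paraphrase_pass_rate(toy_model, paraphrases, incident))  # 0.5
```

A rate well below 1.0 across harmless rewordings suggests the behavior depends on the prompt wording rather than on something the model has practiced.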

Where base-model contract gaps show up in products¶

ChatGPT-style assistants — base capability needs instruction behavior before it feels usable.
Copilot Chat — code knowledge is not enough; repair requests need task framing.
Support summarizers — exact action preservation matters more than fluent prose.
Legal drafting tools — continuation can sound credible while missing clause constraints.
Incident bots — format adherence and operational salience must be evaluated separately.
JSON extraction systems — one wrong opening token can break downstream parsers.
Safety classifiers — refusals must follow policy, not generic caution.
Email assistants — tone and brevity are product contracts, not facts.
SQL assistants — valid SQL still fails if it ignores schema, cost, or result shape.
Documentation bots — summarizing a page is different from answering the user's migration question.
Voice assistants — fluent text is useless if it cannot fit latency and turn-taking constraints.
Code review agents — comments must identify actionable defects, not produce generic advice.
Data-labeling copilots — the contract is label consistency, not beautiful explanation.
Compliance tools — a correct citation must also match jurisdiction and policy scope.
Report generators — stakeholders need decisions and exceptions, not a textbook paragraph.

What you should remember¶

This chapter explained why a fluent base model can still fail the product. The important idea is that pretraining teaches likely continuation, not the user's contract: exact format, preserved facts, safe boundaries, and useful action.

You learned to separate "the model knows words about this" from "the model performs this job." The concrete move is to write contract-shaped eval rows and demonstrations that make the desired answer shape cheap, instead of expecting more generic pretraining to create obedience.

Carry this diagnostic forward: when an answer sounds smart but misses the requested job, ask whether this is a knowledge gap or a contract gap. If the facts are present but the shape, role, or obligation is wrong, inspect demonstrations, instructions, and eval rows before blaming model size.

Remember:

Fluency is not the same as task success.
A base model continues likely text; an assistant satisfies a caller's contract.
If facts are present but the job is missed, inspect behavior examples and eval rows.
Prompting helps only when the behavior already exists.
Measure format, action preservation, and usefulness separately from prose quality.

Check your understanding of continuation versus contract¶

Why can a base model be fluent and still fail?
What makes more pretraining a weak repair for instruction following?
Which metric would expose the incident summarizer's hidden failure?
Why is prompt engineering not a replacement for missing behavior priors?
Why is "the model knows the fact" a weaker guarantee than "the model obeys the task"?
What would you measure separately if format improved but action preservation got worse?

Interview Q&A¶

Q. A base model knows your domain but ignores output format. What stage do you inspect first?
A. Inspect instruction demonstrations and format-specific evals, because the failure is probably behavioral rather than factual.
Common wrong answer to avoid: "Add more raw documents."

Q. Why is fluency a misleading eval for assistants?
A. Fluency measures plausibility, while the product contract may require exact format, preserved actions, refusal calibration, or tool-call structure.
Common wrong answer to avoid: "Fluent answers mean the model understood."

Q. When would more pretraining be the correct repair?
A. When the model lacks underlying domain knowledge or language coverage, not when it already knows the content but fails the interaction contract.
Common wrong answer to avoid: "Never pretrain more after a base model exists."

Q. Why can a larger base model still need SFT?
A. Scale can improve knowledge and latent skills, but the assistant contract still needs examples that make role, format, and task completion statistically central.
Common wrong answer to avoid: "Bigger models automatically infer every product contract."

Q. What is the first-principles difference between continuation and instruction following?
A. Continuation minimizes local text surprise; instruction following optimizes for an external caller goal, which may not be the most likely next text in the corpus.
Common wrong answer to avoid: "They are the same because both produce tokens."

Q. How do you diagnose whether a failure is contract failure or knowledge failure?
A. Ask whether the missing information is available in the context or within the base model's capability. If it is, inspect behavior demonstrations, loss masking, templates, and eval gates before adding raw data.
Common wrong answer to avoid: "Any wrong answer means the model needs more facts."

Apply now (10 min)¶

Model the exercise: take one assistant failure and label it factual, behavioral, or preference-ranked.
Your turn: write three eval rows that would catch that failure without rewarding generic fluency.
Reproduce from memory: draw the continuation-versus-contract diagram and explain it in thirty seconds.

Bridge. Once you see the base model as the wiki reader, the next question is what it was allowed to read. The curriculum decides which patterns become cheap for the model to imitate. → 02-curriculum-data-mix.md