02. Curriculum Data Mix — the model learns what it repeatedly reads¶
What the product-contract failure exposes next¶
In chapter 1, we separated fluent continuation from the product contract. A base model may sound smart while missing the user's job, so later stages need demonstrations, preferences, and evals. But those stages do not start from a blank model.
Before the model ever becomes an assistant, the wiki reader spends most of its life reading. If it reads calm support replies, calm support language feels normal. If it reads angry forum arguments, angry argument language feels normal. If it reads the same stale policy page a thousand times, that stale policy feels more certain than it deserves.
So the root cause is not "the model has a bad personality." The root cause is that repeated text becomes practice, and practice becomes habit.
So how can we shape the habits before SFT has to fight them? We control the reading diet: source buckets, caps, deduplication, protected eval sets, and source-sliced checks. That reading diet is the curriculum.
What this file solves¶
A model can pick up unwanted habits because the training pile over-represents the wrong text. This file shows how to turn "more data" into an explicit curriculum: bucket sources, cap noisy buckets, deduplicate repeats, and check evals by source family before later stages have to fight those habits.
Why data mix decides behavior before alignment¶
Imagine teaching a junior teammate by handing them a giant reading folder. If most pages are calm runbooks, they learn calm operational language. If most pages are flame-war threads, they learn that arguing is normal. The teammate is not judging the folder; they are absorbing patterns from it.
The base model is similar. Before SFT or preference tuning, it is already practicing what normal text looks like. The cheapest fix is to shape that practice early instead of paying later stages to fight a bad prior.
When more data teaches the wrong habit¶
The naive repair is simple: "add more relevant text." That feels reasonable because more examples usually help.
But more of the wrong thing teaches the wrong lesson. If the added bucket repeats angry, stale, duplicated, or SEO-shaped text, the model learns those habits as normal instead of learning the product's desired voice.
When the reading pile changes the model¶
If the model reads 10,000 calm support replies and 10 angry forum replies, calm support style feels normal. If it reads 10,000 angry forum replies, angry forum style feels normal. The model is practicing, not judging.
So when a model starts sounding argumentative, do not first ask which layer broke. Ask what it practiced.
The real problem is not the number of tokens. It is the pressure created by which tokens appear, how often they appear, and which important tokens are missing.
So how can we make the reading pile teach the right pressure? We design the curriculum instead of treating the corpus like a dump truck.
Rule: repeated text becomes model habit¶
The base model learns what appears often, cleanly, and repeatedly in its training text.
Why the reading pile wins. The model does not read the real world. It reads the text we give it. Text that appears often feels normal, duplicated text feels extra true, missing text stays weak, and bad text teaches bad habits.
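A toy counting model makes the rule concrete. The sketch below is not how a transformer trains; it is a minimal bigram counter, and the two refund sentences are made-up examples. It only shows that repeating one document shifts next-word probability toward it, even though no new information was added.

```python
from collections import Counter, defaultdict

def train_bigram(docs):
    """Count which word follows which across all documents."""
    counts = defaultdict(Counter)
    for doc in docs:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, nxt):
    """Probability of `nxt` following `prev` under the raw counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

fresh = "refunds take 5 days"    # hypothetical current policy
stale = "refunds take 30 days"   # hypothetical outdated policy

balanced = train_bigram([fresh, stale])
skewed = train_bigram([fresh] + [stale] * 9)  # stale page scraped nine times

p_balanced = next_word_prob(balanced, "take", "30")
p_skewed = next_word_prob(skewed, "take", "30")
print(p_balanced, p_skewed)
```

With one copy of each page the stale answer gets half the probability mass. With nine copies of the stale page it gets nine tenths. The stale policy did not become more correct; it only became more frequent.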
1) Hook — one extra source changes model personality¶
A team adds a large forum dump to improve conversational ability. Casual Q&A scores rise, but support replies become more argumentative. Nothing in the architecture changed. The model practiced on a different kind of text.
Teacher voice. Data mix is product design wearing an infrastructure costume.
The interesting part is that "more human conversation" can make a model less helpful for humans in a product. If the added conversations reward dunking, hedging, or speculation, the wiki reader learns that those moves are normal.
This is not a bigger-model problem or a prompt-wording problem. It is a practice-diet problem.
2) Mental model — curriculum as diet¶
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ web prose  │ │   code   │ │   math   │ │ dialogue │
└─────┬──────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
      └─────────────┴─────┬──────┴────────────┘
                          │
                   filters + dedup
                          │
                          ▼
                sampled token stream
                          │
                          ▼
                 **the wiki reader**
The model does not ingest "the internet." It ingests the result of many quiet choices.
what appears often ──→ easy continuation
what appears rarely ──→ fragile skill
what appears duplicated ──→ overconfident memory
what never appears ──→ cannot be recovered by vibes
3) Running example — incident summarizer corpus¶
For our incident summarizer, useful pretraining includes status pages, runbooks, support tickets, code comments, and concise operational writing. Harmful over-weighting includes SEO prose, stale policies, duplicate boilerplate, and synthetic summaries that erase uncertainty.
Attempt A: sample all available text uniformly. Duplicate vendor notices dominate because they are easy to scrape.
Attempt B: cap source families, deduplicate near-identical pages, keep high-signal operational text, and reserve held-out incident styles for evals.
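The capping step in Attempt B can be sketched as a per-family token budget. This is a minimal illustration, not a production sampler: the family names, the cap value, and the whitespace token count are all assumptions made for the example, and a real pipeline would use the model's tokenizer.

```python
from collections import defaultdict

def cap_by_family(docs, caps, default_cap=None):
    """Keep documents in order while each source family stays under its token cap.

    docs: iterable of (family, text) pairs.
    caps: dict mapping family name -> max tokens to keep for that family.
    Families missing from caps use default_cap (None means uncapped).
    """
    used = defaultdict(int)
    kept = []
    for family, text in docs:
        n_tokens = len(text.split())  # crude stand-in for a real tokenizer
        cap = caps.get(family, default_cap)
        if cap is not None and used[family] + n_tokens > cap:
            continue  # this document would push the family over budget
        used[family] += n_tokens
        kept.append((family, text))
    return kept, dict(used)

# Hypothetical mini-corpus: the vendor notice is easy to scrape, so it repeats.
docs = [
    ("vendor_notice", "service restored no action needed"),
    ("vendor_notice", "service restored no action needed"),
    ("vendor_notice", "service restored no action needed"),
    ("runbook", "restart the queue worker then verify consumer lag"),
    ("incident_report", "partial rollback at 14:02 root cause still unknown"),
]
kept, used = cap_by_family(docs, caps={"vendor_notice": 5})
print([family for family, _ in kept], used)
```

Under uniform sampling the vendor notice would be three of five documents. With a five-token cap on that family it is one of three, and the runbook and incident report keep their full weight.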
4) Filtering early versus fighting habits later¶
- Aggressive filtering removes junk and unsafe text, but may erase rare domains.
- Weak filtering preserves coverage, but teaches spam and stale patterns.
- Deduplication prevents memorized boilerplate dominance, but can remove legitimate repeated formats.
- Deferring the fix to SFT speeds up initial corpus work, but later stages then fight entrenched priors.
For the incident assistant, cleaning the operational corpus is cheaper than teaching the model to unlearn millions of duplicated non-actionable updates.
5) Token count is not experience count¶
Suppose a 100B-token corpus contains:
- Duplicate policy pages: 18B tokens — creates memorization pressure.
- Code and logs: 12B tokens — useful for diagnostics.
- Incident reports: 0.2B tokens — rare but product-critical.
- General prose: 69.8B tokens — broad language.
The product-critical slice can be tiny: here incident reports are 0.2% of all tokens, while duplicate policy pages are 18%. Sampling ratios decide whether that slice is learnable signal or rounding noise.
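The arithmetic is worth doing once by hand. The sketch below uses the token counts from the list above; the sampling multipliers (0.1 for duplicate policy pages, 10 for incident reports) are hypothetical values chosen only to show how reweighting moves the shares.

```python
# Token counts in billions, taken from the 100B-token example above.
corpus_tokens_b = {
    "duplicate_policy_pages": 18.0,
    "code_and_logs": 12.0,
    "incident_reports": 0.2,
    "general_prose": 69.8,
}

def natural_share(tokens):
    """Share of the stream each bucket gets under uniform token sampling."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}

def reweighted_share(tokens, multipliers):
    """Share of the stream after scaling each bucket by a sampling multiplier."""
    weighted = {name: count * multipliers.get(name, 1.0) for name, count in tokens.items()}
    total = sum(weighted.values())
    return {name: count / total for name, count in weighted.items()}

# Hypothetical multipliers: damp the duplicates, boost the rare critical slice.
multipliers = {"duplicate_policy_pages": 0.1, "incident_reports": 10.0}

natural = natural_share(corpus_tokens_b)
mixed = reweighted_share(corpus_tokens_b, multipliers)

for name in corpus_tokens_b:
    print(f"{name:24s} natural={natural[name]:.4f} mixed={mixed[name]:.4f}")
```

Under these assumed multipliers, incident reports move from 0.2% of the stream to a little over 2%, and duplicate policy pages fall from 18% to about 2%. One caution: a 10x multiplier on a small bucket means the same incident reports are read roughly ten times, so upsampling creates its own repetition pressure and should be paired with held-out incident evals.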
6) Synthetic smoothness erases operational edges¶
Synthetic data often makes answers cleaner. It can also remove the messy details users rely on. If every generated incident summary has perfect structure, the model may fail when real incidents contain contradictions, partial rollbacks, or uncertain ETAs.
flowchart LR
A[Raw messy incidents] --> B[Synthetic rewrite]
B --> C[Cleaner language]
B --> D[Lost ambiguity]
C --> E[Better format eval]
D --> F[Worse real incident handling]
7) What source caps fix and make expensive¶
- Exact dedup over 100B tokens is a cheap hash pass that avoids repeated boilerplate.
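- Near-dedup with MinHash costs moderate compute and catches near-identical pages that differ by small edits, such as a changed date or footer. It compares word overlap, so heavily reworded paraphrases can still slip through.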
- Source-quality classifiers need labels and model passes, but remove spam and low-signal pages.
- A 5k-document manual gold set has high human cost and prevents eval blindness.
Data curation cost is front-loaded; bad curriculum cost recurs in every downstream stage.
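The first two rows can be sketched with the standard library alone. This is a teaching sketch under stated assumptions: the normalization rule, the 3-word shingles, the 64 hash functions, and the salt-based seeding are illustrative choices, and the all-pairs comparison here would be replaced by LSH banding or a dedicated library at real corpus scale.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedup(docs):
    """Keep the first copy of each document, comparing hashes of normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Set of k-word shingles from the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seeded hash function over the shingle set."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # assumption: seed each hash via the BLAKE2 salt
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical notices: a and b differ only at the end, c is unrelated.
a = "Service restored. No action is needed from customers at this time."
b = "Service restored. No action is needed from customers at this time, thanks."
c = "Partial rollback at 14:02; root cause still unknown."

unique = exact_dedup(["Hello  World", "hello world", "other"])
est_ab = estimated_jaccard(minhash_signature(a), minhash_signature(b))
est_ac = estimated_jaccard(minhash_signature(a), minhash_signature(c))
print(unique, est_ab, est_ac)
```

The exact pass removes the second greeting because it normalizes to the same string. The MinHash estimate for `a` and `b` lands well above the estimate for `a` and `c`, which share no shingles. The threshold for calling two pages duplicates is a curriculum decision: set it too low and legitimate repeated formats, such as incident templates, get removed.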
8) Signals that the data mix is teaching the wrong habits¶
- Healthy: held-out perplexity improves across source families, not only dominant web text.
- First degrading metric: rare-domain eval gets worse after a mix change.
- Misleading beginner metric: total token count.
- Expert graph: quality-stratified loss by source bucket and duplication band.
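The data behind that expert graph is a small aggregation. In the sketch below, the record fields (`bucket`, `dup_band`, `loss`, `tokens`) and all numbers are hypothetical; the point is only that an average can improve while a rare slice regresses.

```python
from collections import defaultdict

def loss_by_slice(records, keys=("bucket", "dup_band")):
    """Token-weighted mean loss per slice, where a slice is a tuple of key values."""
    totals = defaultdict(lambda: [0.0, 0])  # slice -> [sum of loss * tokens, tokens]
    for r in records:
        slice_key = tuple(r[k] for k in keys)
        totals[slice_key][0] += r["loss"] * r["tokens"]
        totals[slice_key][1] += r["tokens"]
    return {k: s / n for k, (s, n) in totals.items() if n}

def aggregate_loss(records):
    """Single token-weighted mean over all records (the misleading number)."""
    return loss_by_slice(records, keys=())[()]

def regressed_slices(before, after, tolerance=0.0):
    """Slices whose loss got worse after the mix change."""
    b, a = loss_by_slice(before), loss_by_slice(after)
    return sorted(k for k in a if k in b and a[k] > b[k] + tolerance)

# Hypothetical held-out results before and after a mix change.
before = [
    {"bucket": "web", "dup_band": "unique", "loss": 2.50, "tokens": 9000},
    {"bucket": "incident", "dup_band": "unique", "loss": 2.80, "tokens": 1000},
]
after = [
    {"bucket": "web", "dup_band": "unique", "loss": 2.30, "tokens": 9000},
    {"bucket": "incident", "dup_band": "unique", "loss": 3.10, "tokens": 1000},
]

print(aggregate_loss(before), aggregate_loss(after))
print(regressed_slices(before, after))
```

In this made-up run the aggregate loss falls, so a dashboard with one number reports success. The sliced view shows the incident bucket got worse, which is exactly the "first degrading metric" from the list above.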
9) Where curriculum design helps and where it overfits¶
Curriculum design is strongest when the target behavior depends on broad priors. It becomes pathological when teams overfit the corpus to current product tasks and erase generality. It hits a scale limit when data governance, copyright, privacy, or provenance cannot be audited.
10) Wrong model: more data is automatically better data¶
Wrong model: "Data is fuel; more clean fuel is always better."
Replacement: data is environment. The wiki reader adapts to the environment's frequency, style, omissions, and repeated mistakes.
11) Other ways the reading pile poisons behavior¶
- duplicated boilerplate creates memorized phrases
- source imbalance creates personality drift
- stale docs teach deprecated APIs
- synthetic data hides uncertainty
- filters remove minority dialects or rare domains
- benchmark leakage inflates confidence
- PII cleanup misses structured identifiers
- tokenizer choices make important strings expensive
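Benchmark leakage is one of the few items on this list with a cheap mechanical check. A common approach is to flag eval items that share a long word n-gram with training text. The sketch below assumes a 5-word window and made-up documents; real checks often use longer windows and need to run against the full corpus.

```python
import re

def ngrams(text, n):
    """Set of n-word tuples from lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_eval_items(train_docs, eval_items, n=5):
    """Return eval items that share at least one n-gram with any training document."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]

# Hypothetical training document and eval items.
train_docs = [
    "Weekly notes: the database failover took nine minutes to complete on Tuesday.",
]
eval_items = [
    "How long did it take? The database failover took nine minutes to complete.",
    "Summarize an incident where the ETA is still uncertain.",
]

leaked = contaminated_eval_items(train_docs, eval_items)
print(leaked)
```

Only the first eval item is flagged, because it repeats a five-word run from the training notes. A flagged item is a candidate for removal from either the eval set or the training pile; which one depends on which copy is the protected gold data.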
12) The same upstream-data problem in RAG and data systems¶
This echoes retrieval corpus design in RAG: ingestion quality bounds answer quality before ranking begins. It also echoes a familiar data engineering lesson: cheap ingestion creates expensive downstream cleanup. The shared invariant is that upstream mixture choices become downstream behavior.
13) Quick test: can you explain what each source teaches?¶
- Can you state the source buckets and sampling ratios?
- Do you track loss by bucket, not only aggregate loss?
- Are duplicates measured before and after filtering?
- Is there a gold set protected from synthetic contamination?
- Can you explain which downstream behavior each data source is supposed to improve?
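Several of these questions can be answered from a written mix manifest. The sketch below is one hypothetical shape for such a manifest: the `Bucket` fields, bucket names, and ratios are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    name: str
    ratio: float        # share of the sampled token stream
    teaches: str        # downstream behavior this bucket is supposed to improve
    loss_tracked: bool  # is held-out loss reported for this bucket on its own?

def audit_mix(buckets, tolerance=1e-6):
    """Return a list of human-readable problems; an empty list means the manifest passes."""
    problems = []
    total = sum(b.ratio for b in buckets)
    if abs(total - 1.0) > tolerance:
        problems.append(f"ratios sum to {total:.3f}, expected 1.0")
    for b in buckets:
        if not b.teaches.strip():
            problems.append(f"{b.name}: no stated downstream behavior")
        if not b.loss_tracked:
            problems.append(f"{b.name}: loss is not tracked per bucket")
    return problems

# Hypothetical four-bucket mix for the incident assistant.
good_mix = [
    Bucket("general_prose", 0.55, "broad language", True),
    Bucket("code_and_logs", 0.25, "reading diagnostics", True),
    Bucket("runbooks", 0.15, "calm operational voice", True),
    Bucket("incident_reports", 0.05, "summaries that keep uncertainty", True),
]
bad_mix = [
    Bucket("forum_dump", 0.70, "", False),
    Bucket("incident_reports", 0.20, "summaries that keep uncertainty", True),
]

print(audit_mix(good_mix))
print(audit_mix(bad_mix))
```

The second manifest fails three ways: the ratios do not sum to one, the forum dump has no stated purpose, and its loss is not tracked. A bucket nobody can justify in one phrase is a good first candidate for a cap.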
Where data mix becomes product behavior¶
- Llama-family pretraining — source mixture decisions visibly shape code, reasoning, and chat readiness.
- Code models — permissively licensed, high-quality code data changes completion behavior more than generic prose does.
- Medical assistants — stale or low-quality health pages create dangerous priors.
- Enterprise copilots — internal docs need provenance and freshness before tuning.
- Search snippets — SEO-heavy corpora teach verbose, generic phrasing.
- Synthetic instruction corpora — useful for coverage but risky when they smooth away real mess.
- Tokenizer training — corpus choice decides which substrings become cheap tokens.
- Benchmark suites — leaked examples turn evals into memorization tests.
- Search engines — crawling policy decides what the index believes is common.
- Fraud models — oversampling confirmed fraud changes calibration on normal traffic.
- Speech recognition — accent and microphone mix become real-world reliability.
- Translation systems — parallel-corpus imbalance changes which languages get nuance.
- Security copilots — stale CVE writeups teach outdated remediation patterns.
- Education tutors — textbook-heavy data can miss student misconception language.
- Healthcare triage — source quality decides whether rare symptoms stay visible.
Check your understanding of curriculum pressure¶
- Why is token count a weak proxy for curriculum quality?
- What does dedup prevent, and what can it accidentally remove?
- Why can synthetic data improve one eval while hurting real usage?
- Which graph would you inspect after changing source ratios?
- Why is data mix closer to environment design than fuel selection?
- What downstream stage pays the price for duplicated or stale pretraining data?
Interview Q&A¶
Q. Why not fix a bad pretraining mix entirely with SFT?
A. SFT can steer behavior, but it fights base priors created by much larger token volumes; some omissions and biases are cheaper to prevent upstream.
Common wrong answer to avoid: "SFT overwrites pretraining completely."
Q. What is the risk of aggressive filtering?
A. It can remove rare but important domains, dialects, adversarial examples, or messy real-world forms the model must handle.
Common wrong answer to avoid: "Filtering only removes bad data."
Q. How do you audit a curriculum change?
A. Compare quality-stratified and source-stratified loss, contamination checks, duplication metrics, and downstream task evals.
Common wrong answer to avoid: "Look at one aggregate validation loss."
Q. Why can rare high-quality data matter more than its token share suggests?
A. Rare slices may encode product-critical behaviors or domains; sampling and eval design decide whether they become learnable signal instead of statistical dust.
Common wrong answer to avoid: "Only large buckets affect behavior."
Q. What makes dedup a behavioral intervention rather than storage cleanup?
A. Repeated text increases gradient pressure and memorization risk, so dedup changes what the model treats as common or certain.
Common wrong answer to avoid: "Dedup only saves disk."
Q. Why protect a held-out gold set before synthetic expansion?
A. Synthetic rows can leak patterns into evals and make a model look robust on sanitized cases while failing messy originals.
Common wrong answer to avoid: "Synthetic data is always safe because no human wrote it."
Apply now (10 min)¶
- Model the exercise: make a four-bucket data mix for an incident assistant.
- Your turn: name one failure each bucket prevents and one failure it can introduce.
- Reproduce from memory: redraw the source-to-filter-to-token-stream diagram.
What you should remember¶
This chapter explained why data mix is not neutral fuel. It is the practice environment that shapes the base model before alignment. The model learns from what appears often, what is duplicated, what is missing, and what is over-sampled.
You learned to treat curriculum as an engineering control surface. The concrete move is to bucket sources, cap noisy buckets, deduplicate repeated text, protect messy real eval sets, and slice metrics by source family so later SFT or preference tuning does not fight habits the base model learned too cheaply.
Carry this diagnostic forward: if behavior changes after a corpus change, ask what the model practiced more, less, or repeatedly before blaming the architecture, prompt, or preference stage.
Remember:
- Data mix is practice, not neutral fuel.
- What appears often becomes familiar behavior.
- Duplicates create false confidence.
- Missing sources become weak skills.
- Slice evals by source family before trusting an average.
- Later alignment pays for bad upstream curriculum choices.
Use DIET when you need the chapter in 20 seconds:
- D — Distribution becomes behavior. The model does not learn "the world"; it learns the sampled world.
- I — Imbalance becomes personality. Overweight one source and its tone, format, and assumptions start to feel normal.
- E — Echoes become confidence. Duplicates and near-duplicates make repeated text feel more true than it deserves.
- T — Test by source, not just average. Slice evals by source family, duplication band, freshness, and synthetic-vs-real rows.
The quick trick: ask, "What did this model eat too much of, too little of, and too many times?" That usually points to the first curriculum audit.
Bridge. After choosing the curriculum, the next question is mechanical: how does a token stream turn into changed weights? That is where next-token loss and the workbench meet. → 03-next-token-training-loop.md