11. Scaling laws: bigger pile, more photos, in the right ratio

Why doubling the rule pile predictably lowers loss — and why photos must double with it.

Built on the ELI5 in 00-eli5.md. "Bigger pile + more photos = smarter robot" is exactly this chapter. We now put numbers on it.

The promise that almost worked again

You have a working cat-robot. ReLU bends. Adam smart-nudge. Dropout keeping it honest. It runs.

Now somebody hands you a 10× bigger compute budget. What to do?

The naive answer — make the rule pile 10× bigger. Same photos. Train longer.

You try. Loss drops a little. Then stalls way above where it should be. The bigger robot is barely smarter than the smaller one. You wasted the budget.

Something is structurally wrong. Not a tuning problem.

The picture before the formula: a ramp on log-log paper

See. On a normal graph, the loss curve does not look clean. It looks clean only when you plot log of loss against log of model size (or log of data, or log of compute).

On log-log paper, loss vs parameters is a straight downward ramp. Same shape for loss vs tokens. The slope of that ramp is small, but across the range labs have measured it does not go flat. Diminishing returns, yes. But not zero returns.

log(loss)
   ↑
   │  ●
   │   ●
   │    ●●
   │      ●●            ← slope = −α  (small, negative, steady)
   │        ●●
   │          ●●●
   │             ●●
   │                ●●
   │                   ●●
   └─────────────────────────→ log(parameters)

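Now flip the x-axis to tokens. Same picture: a straight ramp with a small, steady, negative slope. The fitted exponent for tokens is usually close to the one for parameters but not identical; the sketches here reuse α for both. Loss drops as a power law in both parameters and tokens.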

log(loss)
   ↑
   │  ●
   │   ●●
   │     ●●
   │       ●●          ← also a ramp, also slope −α
   │         ●●●
   │            ●●
   │               ●●
   └─────────────────────────→ log(tokens)

Two ramps. Same shape. The catch — both must scale together. If you push parameters way ahead of tokens, the parameter ramp flattens early. Pile starves for photos. If you push tokens way ahead of parameters, the token ramp flattens early. Pile cannot absorb more photos.

This is the rule pile from the ELI5, with a balance scale on top:

        compute budget (fixed)
             /        \
       parameters    tokens
       (rule pile)   (cat photos)
              ratio ≈ 1 : 20

About 20 tokens per parameter (the Chinchilla result) is the empirical sweet spot for training compute. Each parameter wants ~20 tokens to be fed properly.

The formula in one line

L(N) ≈ a · N^(−α), where L is the loss, N is the parameter count, a is a fitted constant, and α ≈ 0.05 to 0.1.

Take logs. log L = log a − α · log N. Straight line. Slope is −α. That slope is the scaling exponent.

Same form for tokens D and total compute C. Three power laws. One picture.
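A minimal sketch of how the slope could be read off in practice, assuming you already have (size, loss) pairs from training runs. The points here are synthetic, generated from the formula itself with made-up constants (`TRUE_A`, `TRUE_ALPHA`), so the fit only demonstrates that a straight-line fit in log-log space recovers α. The helper names are illustrative.

```python
import math

def power_law_loss(n_params: float, a: float, alpha: float) -> float:
    """L(N) = a * N^(-alpha), the one-line formula from the text."""
    return a * n_params ** (-alpha)

def fit_log_log_slope(xs, ys):
    """Ordinary least squares on (log x, log y). Returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    var = sum((x - mean_x) ** 2 for x in lx)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Synthetic points from the formula (hypothetical constants, not measurements).
TRUE_A, TRUE_ALPHA = 10.0, 0.07
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [power_law_loss(n, TRUE_A, TRUE_ALPHA) for n in sizes]

slope, intercept = fit_log_log_slope(sizes, losses)
fitted_alpha = -slope            # the ramp's slope is -alpha
fitted_a = math.exp(intercept)   # intercept is log(a)
print(f"fitted alpha = {fitted_alpha:.4f}, fitted a = {fitted_a:.3f}")
```

With real runs the points scatter around the line, so the fitted α comes with error bars and depends on which runs you include.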

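Worked numerical: three token budgets for the same 1B pile

Take a 1B-parameter rule pile. Training compute scales roughly as C ≈ 6 · N · D FLOPs (the standard transformer training-compute estimate: six FLOPs per parameter per token).

So with N = 10⁹, C ≈ 6 · 10⁹ · D. For D = 20B tokens, C ≈ 1.2 · 10²⁰ FLOPs. Call this the reference budget. The three scenarios below keep the pile at 1B parameters and change only the number of tokens, so their compute differs: A uses 5% of the reference budget, B uses exactly the reference budget, and C uses 10× the reference budget.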
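The estimate above is easy to keep as two small helpers. This is a sketch of the C ≈ 6 · N · D rule of thumb only; the function names are illustrative, and the real cost of a run also depends on things the rule ignores (attention cost at long context, hardware utilization, restarts).

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D (rule of thumb from the text)."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(flops: float, n_params: float) -> float:
    """Invert C ~= 6 * N * D to get D for a fixed model size."""
    return flops / (6.0 * n_params)

# The worked example: a 1B-parameter pile on 20B tokens.
budget = training_flops(1e9, 20e9)
tokens = tokens_for_budget(budget, 1e9)
print(f"compute = {budget:.2e} FLOPs, tokens = {tokens:.2e}")
```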

Scenario A: under-trained (1B params, 1B tokens)

Compute used: 6 · 10⁹ · 10⁹ = 6 · 10¹⁸ FLOPs. Tiny fraction of the budget. Pile is huge for the photos. Photos run out before the pile is fed. Loss sits at, say, 2.40 nats/token — leaves loss on the table because we stopped early.

Scenario B: compute-optimal (1B params, 20B tokens)

Compute used: 6 · 10⁹ · 2 · 10¹⁰ = 1.2 · 10²⁰ FLOPs. Ratio 1:20. Both ramps still descending at the same rate. Loss lands at, say, 2.05 nats/token. This is the Chinchilla balance.

Scenario C: over-trained (1B params, 200B tokens)

Compute used: 6 · 10⁹ · 2 · 10¹¹ = 1.2 · 10²¹ FLOPs. 10× the budget of B. We bought 10× the compute, and the photos ramp keeps descending, but slowly. Loss reaches, say, 1.95 nats/token. Improvement over B: 0.10 nats. For 10× the cost.

The cat-robot picture. A is a rule pile that never saw enough photos. C is a small pile drowning in photos, with most of the extra ones buying very little. B is the balanced robot. It costs 20× the compute of A and buys 0.35 nats. C costs 10× the compute of B and buys only 0.10 nats more. B: one tenth the cost of C. Almost the same loss.

So what to do? Match the photos to the pile. Twenty per parameter is the rule.
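To see the three scenarios side by side, here is a small sketch that recomputes their compute and tokens-per-parameter from the numbers above. The loss values are the illustrative "say" figures from the text, not measurements, and the dictionary layout is just one convenient choice.

```python
# Illustrative scenario numbers copied from the text (losses are "say" values).
scenarios = {
    "A": {"params": 1e9, "tokens": 1e9, "loss": 2.40},
    "B": {"params": 1e9, "tokens": 20e9, "loss": 2.05},
    "C": {"params": 1e9, "tokens": 200e9, "loss": 1.95},
}

def training_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

rows = {}
for name, s in scenarios.items():
    rows[name] = {
        "tokens_per_param": s["tokens"] / s["params"],
        "flops": training_flops(s["params"], s["tokens"]),
        "loss": s["loss"],
    }

base = rows["B"]["flops"]  # scenario B is the reference budget
for name, row in rows.items():
    print(f"{name}: {row['tokens_per_param']:.0f} tokens/param, "
          f"{row['flops'] / base:.2f}x B's compute, loss {row['loss']:.2f}")

# Loss bought by each step up in compute.
gain_a_to_b = rows["A"]["loss"] - rows["B"]["loss"]   # for 20x more compute
gain_b_to_c = rows["B"]["loss"] - rows["C"]["loss"]   # for 10x more compute
print(f"A->B gain: {gain_a_to_b:.2f} nats, B->C gain: {gain_b_to_c:.2f} nats")
```

The step from A to B buys far more loss per unit of compute than the step from B to C, which is the whole argument for stopping near 20 tokens per parameter when training cost is what you are optimizing.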

"Bigger model > more data" — why that claim is wrong¶

The Kaplan et al. paper (early LLM scaling work) said one thing. Chinchilla overturned it. The early claim: as the compute budget grows, most of the extra compute should go into a bigger model, with comparatively few extra tokens. Frontier labs believed this. Built giant under-trained piles.

Show the math. Three concrete budget comparisons.

Comparison 1: GPT-3-style vs Chinchilla-style at similar compute

Model	Params N	Tokens D	Compute (6·N·D)	Token ratio	Reported loss
GPT-3 style	175B	300B	3.15 · 10²³	1 : 1.7	higher
Chinchilla style	70B	1.4T	5.88 · 10²³	1 : 20	lower

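By the 6·N·D estimate, Chinchilla used about 1.9× the compute of GPT-3, so this is not a strictly equal-budget comparison. But the lesson holds at GPT-3's own budget: applying the 20-tokens-per-parameter rule to 3.15 · 10²³ FLOPs gives roughly a 51B pile on 1T tokens, and the Chinchilla analysis predicts that such a pile beats the 175B-on-300B pile. Smaller pile, more photos, wins. The ratio matters more than the parameter count.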
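Splitting a fixed budget by the 20-tokens-per-parameter rule is a one-line solve: substitute D = r · N into C = 6 · N · D and you get N = sqrt(C / (6 · r)). A sketch, assuming the 6 · N · D estimate and treating r = 20 as a rule of thumb rather than a constant of nature; `compute_optimal_split` is an illustrative name.

```python
import math

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = r * N, so N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# GPT-3-style budget from the table: 175B params on 300B tokens.
gpt3_style_flops = 6.0 * 175e9 * 300e9   # about 3.15e23 FLOPs
n, d = compute_optimal_split(gpt3_style_flops)
print(f"N = {n / 1e9:.0f}B params, D = {d / 1e12:.2f}T tokens")
# prints: N = 51B params, D = 1.02T tokens
```

Under these assumptions the same budget would have gone to a pile less than a third the size of GPT-3, fed more than three times the tokens.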

Comparison 2: the same lesson at smaller scale

Two piles, both targeting ~10²¹ FLOPs:

Pile	Params	Tokens	Ratio	Outcome
Top-heavy	10B	17B	1 : 1.7	starves for photos
Balanced	3.3B	50B	1 : 15	lower loss

Balanced wins again. Same compute. Smaller pile fed more.

Comparison 3: over-trained inference-economy pile

Now flip the goal. We will serve this model to a billion users. Inference cost matters more than training cost.

Pile	Params	Tokens	Ratio	Train cost (6·N·D)	Inference cost	Quality
Chinchilla-optimal	70B	1.4T	1 : 20	5.88 · 10²³ FLOPs (optimal for 70B)	high (big pile)	good
LLaMA 3 8B style	8B	15T	1 : 1875	7.2 · 10²³ FLOPs (about 94× the 20:1 budget for 8B)	low (small pile)	comparable

LLaMA 3 8B is wildly past Chinchilla. The training compute is far from optimal. But the small pile is forever cheap to serve. Over-training small models is the inference-economy move. Chinchilla optimizes training. Production optimizes serving.
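A back-of-envelope sketch of the serving trade-off. It assumes two things the text does not state: inference costs about 2 FLOPs per parameter per token (forward pass only), and the two piles really are of equal quality. Both are simplifications, and the function names are illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def inference_flops(n_params: float, tokens_served: float) -> float:
    """Assumption: ~2 FLOPs per parameter per token served (forward pass only)."""
    return 2.0 * n_params * tokens_served

def lifetime_flops(n_params: float, train_tokens: float, tokens_served: float) -> float:
    """One-time training bill plus the running serving bill."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, tokens_served)

def break_even_tokens(big, small) -> float:
    """Tokens served at which both lifetime costs match. Each arg is (params, train_tokens)."""
    extra_training = training_flops(*small) - training_flops(*big)
    saving_per_token = 2.0 * (big[0] - small[0])
    return extra_training / saving_per_token

chinchilla_style = (70e9, 1.4e12)   # rows from the table above
llama3_8b_style = (8e9, 15e12)

be = break_even_tokens(chinchilla_style, llama3_8b_style)
print(f"break-even at about {be:.2e} tokens served")
```

Under these assumptions the small pile's extra training bill is paid back after roughly 10¹² served tokens, and every token after that is cheaper. At low traffic the big Chinchilla-style pile is still the cheaper one overall.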

The rule pile from the ELI5, picked for the right job.

Diminishing returns but never flat

The slope on log-log paper is the scaling exponent. With α = 0.05, doubling parameters multiplies loss by 2⁻⁰·⁰⁵ ≈ 0.966, a 3.4% improvement. Tiny per doubling.

But power laws compound. Twenty doublings (1M× more parameters) → 0.966²⁰ ≈ 0.50. Loss halves. The ramp keeps descending. No magic wall.

This is why labs scale. Each step looks small. The compounded gain is huge. Until the ramp eventually does flatten — at the entropy floor of the data, roughly. You cannot predict text below the inherent randomness of text. That is the asymptote, not the slope.
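The compounding arithmetic can be checked in a few lines. This sketch uses the pure power law L ∝ N^(−α) and ignores the entropy floor mentioned above, so it describes the ramp, not the asymptote. The function names are illustrative.

```python
def loss_multiplier(alpha: float, doublings: float) -> float:
    """Factor applied to loss after `doublings` doublings of N, for L ~ N^(-alpha)."""
    return 2.0 ** (-alpha * doublings)

def doublings_to_halve(alpha: float) -> float:
    """Solve 2^(-alpha * k) = 1/2 for k, which gives k = 1 / alpha."""
    return 1.0 / alpha

alpha = 0.05
one = loss_multiplier(alpha, 1)
twenty = loss_multiplier(alpha, 20)
print(f"one doubling:     x{one:.3f}")
print(f"twenty doublings: x{twenty:.3f}")
print(f"doublings needed to halve loss: {doublings_to_halve(alpha):.0f}")
```

A larger α shortens the road: at α = 0.1 the same halving takes ten doublings instead of twenty.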

Pause and recall before moving on. Without scrolling: what is the ratio? On log-log paper, what is the shape of loss vs parameters? Why is "bigger pile, fewer photos" wrong at fixed compute?

Where this lives in the wild

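Chinchilla / DeepMind scaling work. Established the ~20:1 tokens-to-parameters rule of thumb. Showed that 70B params trained on 1.4T tokens beats Gopher (280B params on 300B tokens) at roughly the same training compute, and also outperforms GPT-3 (175B params on 300B tokens). Reset frontier-lab strategy.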
LLaMA 3 (Meta). The 8B and 70B variants were trained on 15T tokens. Ratio of ~1:1875 for the 8B — wildly over-trained relative to Chinchilla. Why? Inference economics. Smaller pile is cheaper to serve forever. The training over-spend pays back across billions of queries.
GPT-4 (rumored). Industry analysis suggests its training mix moved toward Chinchilla-style or past-Chinchilla ratios after GPT-3's lesson. Mixture-of-experts adds another knob — active parameters per token differ from total parameters.
Mosaic-BERT (MosaicML / Databricks). Encoder-style training focused on cost. Reported that a BERT-base-class model could be pretrained for roughly $20, mostly through architecture and training-efficiency choices rather than a Chinchilla-style ratio. Same theme: a small pile, sized and trained deliberately for its budget.
Gemini / Gemini Nano (Google). The Nano variants are deliberately over-trained small piles for on-device inference. Same playbook as LLaMA 3 8B — sacrifice training-compute optimality for serving cost.

The pattern. Every modern frontier-lab pile sits somewhere on the parameter-vs-token curve. The choice of where depends on whether training cost or inference cost dominates.

Honest admission

Scaling laws are empirical, not theoretical. The slope α is fitted from runs. It varies with architecture, dataset, tokenizer, training recipe. The 20:1 ratio is a fit on one family of decoder-only transformers. Encoder models scale differently. Mixture-of-experts piles scale differently. Multi-modal piles scale differently.

And the ramp eventually flattens. Where? Nobody knows precisely. Frontier labs hit the wall of data quality, not compute. There are only so many high-quality tokens on the internet. Past that, the cat photos are blurry — and a bigger pile cannot extract a sharper signal from blurry photos.

This is the modern bottleneck. Not GPUs. Photos.

Interview Q&A

Q: What does "compute-optimal" mean exactly? A: For a fixed compute budget C, compute-optimal means choosing parameters N and tokens D such that loss is minimized. Empirically, this happens near D / N ≈ 20. It is the bottom of the loss-vs-(N, D) surface at constant compute. Not the smallest model, not the biggest model — the balanced one. Common wrong answer to avoid: "compute-optimal = lowest possible loss". No. Compute-optimal is lowest loss for a given compute. With infinite compute you train a bigger pile on more photos and get lower loss. The optimum is conditional on the budget.
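"The bottom of the loss surface at constant compute" can be made concrete with a toy surface. The sketch below uses a loss of the form E + A/N^p + B/D^p with constants picked by hand so that the minimum lands at 20 tokens per parameter. They are NOT fitted values from any paper or run, and real fits generally use different exponents for the two terms. The point is only the procedure: fix C, sweep N, let D follow from C = 6 · N · D, and take the lowest loss.

```python
# Toy constants, chosen only so the minimum lands near 20 tokens per parameter.
# They are hypothetical, not fitted values from any paper or training run.
E = 1.7                     # irreducible loss floor
A = 400.0                   # scale of the parameter term
EXPONENT = 0.3              # shared exponent for both terms
B = A * 20.0 ** EXPONENT    # scale of the token term, set so the optimum is D/N = 20

def toy_loss(n_params: float, n_tokens: float) -> float:
    """Toy loss surface: floor + parameter-limited term + data-limited term."""
    return E + A / n_params ** EXPONENT + B / n_tokens ** EXPONENT

def best_split(flops: float, steps: int = 2000):
    """Grid-search N on a log scale; D is forced by C = 6 * N * D."""
    best = None
    for i in range(steps + 1):
        n_params = 10.0 ** (8.0 + 3.0 * i / steps)   # sweep 1e8 .. 1e11
        n_tokens = flops / (6.0 * n_params)
        loss = toy_loss(n_params, n_tokens)
        if best is None or loss < best[0]:
            best = (loss, n_params, n_tokens)
    return best

loss, n_params, n_tokens = best_split(1e21)
ratio = n_tokens / n_params
print(f"N = {n_params:.2e}, D = {n_tokens:.2e}, D/N = {ratio:.1f}, loss = {loss:.3f}")
```

Moving N away from the minimum in either direction raises the toy loss: too large and the token term dominates, too small and the parameter term dominates. Rerunning with a larger budget gives a lower minimum, which is why the optimum is conditional on C.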

Q: Why is the 20:1 ratio not always the right answer? A: Because Chinchilla optimizes training compute. Production-served models (LLaMA 3 8B, Gemini Nano) are over-trained on purpose — far past 1:20 — because inference cost dominates total lifetime cost when the model serves billions of queries. A smaller, over-trained pile is forever cheaper to serve. Common wrong answer to avoid: "20:1 is the law". It is the training-optimal point. Inference economics, data availability, and architecture choices shift the optimum.

Q: Why do labs train past the Chinchilla optimum? A: Inference cost. Training is a one-time bill. Inference is forever. Halving parameters halves serving FLOPs forever. Trading 10× more training compute for a smaller pile of equal quality pays back across the deployment lifetime. The math shifts when usage is high.

Q: Where does scaling break down? A: Three places. One: data quality runs out. Past trillions of high-quality tokens, you start training on noise and gains shrink. Two: architecture saturates. Nothing guarantees the decoder-only transformer ramp keeps its slope indefinitely; new architectures may have a different α. Three: the entropy floor of language. You cannot predict below the inherent randomness of the data. The ramp does eventually flatten. Common wrong answer to avoid: "scaling laws prove AGI is inevitable". They show only that on the current curve, more compute and more clean data give predictable loss reduction. Capability emergence, reasoning, generalization: those are not direct outputs of the loss curve. Loss going down does not mean every capability scales smoothly.

Apply now (5 min)

Take a 7B-parameter pile. Compute-optimal data?

N = 7 · 10⁹
D = 20 · N = 1.4 · 10¹¹ tokens = 140B tokens
Compute ≈ 6 · N · D = 6 · 7·10⁹ · 1.4·10¹¹ ≈ 5.9 · 10²¹ FLOPs

So a 7B compute-optimal pile wants ~140B tokens and ~6 · 10²¹ FLOPs. Now answer — if your serving traffic is high, would you train it on 140B tokens or push to 1T+? Which cost dominates?

Then — without looking — sketch from memory:

The log-log loss curve. Two axes. Power-law ramp.
The balance scale: parameters vs tokens, ratio 1:20.
One sentence: why GPT-3 was top-heavy.

If you can reproduce all three in under 90 seconds, you own this idea.

Bridge. Scaling laws give a ramp, not a magic ladder. Loss drops predictably — but capability, generalization, the things we actually care about, do not always follow the ramp. Before we admit what is still mysterious, one more practical topic: the normalization layers that keep deep networks trainable. Read 12-batch-norm-layer-norm.md.