11. Scaling laws — bigger pile, more photos, in the right ratio

Why doubling the rule pile predictably lowers loss — and why photos must double with it.

Built on the ELI5 in 00-eli5.md. "Bigger pile + more photos = smarter robot" is exactly this chapter. We now put numbers on it.


The promise that almost worked again

You have a working cat-robot. ReLU bends. Adam smart-nudge. Dropout keeping it honest. It runs.

Now somebody hands you a 10× bigger compute budget. What to do?

The naive answer — make the rule pile 10× bigger. Same photos. Train longer.

You try. Loss drops a little. Then stalls way above where it should be. The bigger robot is barely smarter than the smaller one. You wasted the budget.

Something is structurally wrong. Not a tuning problem.


The picture before the formula — a ramp on log-log paper

See. Loss does not drop on a normal graph in a clean way. It drops in a clean way only when you plot log of loss against log of model size (or log of data, or log of compute).

On log-log paper, loss vs parameters is a straight downward ramp. Same shape for loss vs tokens. The slope of that ramp is small, but across the range labs have measured it does not go flat. Diminishing returns, yes. But not zero returns.

log(loss)
   │  ●
   │   ●
   │    ●●
   │      ●●            ← slope = −α  (small, negative, steady)
   │        ●●
   │          ●●●
   │             ●●
   │                ●●
   │                   ●●
   └─────────────────────────→ log(parameters)

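Now flip the x-axis to tokens. Same picture. The slope is fitted separately, so it need not equal the parameter slope, but it is the same kind of small, steady number. Loss drops as a power law in both parameters and tokens.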

log(loss)
   │  ●
   │   ●●
   │     ●●
   │       ●●          ← also a ramp, with its own small slope
   │         ●●●
   │            ●●
   │               ●●
   └─────────────────────────→ log(tokens)

Two ramps. Same shape. The catch — both must scale together. If you push parameters way ahead of tokens, the parameter ramp flattens early. Pile starves for photos. If you push tokens way ahead of parameters, the token ramp flattens early. Pile cannot absorb more photos.

This is the rule pile from the ELI5, with a balance scale on top:

        compute budget (fixed)
             /        \
       parameters    tokens
       (rule pile)   (cat photos)
              ratio ≈ 1 : 20

The 20:1 ratio (Chinchilla) is the empirical sweet spot. Each parameter wants ~20 tokens to be fed properly.


The formula in one line

L(N) ≈ a · N^(−α) with α ≈ 0.05 to 0.1.

Take logs. log L = log a − α · log N. Straight line. Slope is −α. That slope is the scaling exponent.
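The "take logs, read the slope" step can be sketched in a few lines. This is a minimal sketch, not a fitting recipe from any paper: the function name `fit_power_law` and the synthetic data points are made up for illustration, and real runs are noisy where these points are not.

```python
import math

def fit_power_law(n_params, losses):
    """Least-squares line through (log N, log L); returns (a, alpha) for L ≈ a * N**(-alpha)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                    # slope of the ramp on log-log paper
    intercept = mean_y - slope * mean_x  # equals log(a)
    return math.exp(intercept), -slope

# Hypothetical, noise-free points generated from a known law, so the fit can be checked.
TRUE_A, TRUE_ALPHA = 10.0, 0.07
sizes = [1e7, 1e8, 1e9, 1e10]
synthetic_losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, synthetic_losses)
print(f"a ≈ {a_hat:.3f}, alpha ≈ {alpha_hat:.3f}")
```

The fit works on logs because that is where the relationship is a straight line. Fitting the raw numbers directly would let the largest losses dominate.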

Same form for tokens D and total compute C. Three power laws. One picture.
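One way to see the two ramps interact is to add a parameter term and a token term on top of a floor. The sketch below assumes that additive form and uses constants of roughly the size reported in the Chinchilla fit; treat both the form and the numbers as illustrative assumptions, not as this chapter's own fit. The exponents here apply only to the reducible part of the loss, so they are not directly comparable to the 0.05 to 0.1 quoted above.

```python
# Assumed, illustrative constants (roughly the magnitude of the Chinchilla parametric fit).
E, A, B = 1.69, 406.4, 410.7   # floor, parameter-term scale, token-term scale
ALPHA, BETA = 0.34, 0.28       # exponents on the reducible terms

def loss(n_params, n_tokens):
    """Assumed form: floor + parameter ramp + token ramp."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

FIXED_TOKENS = 2e10            # photos held at 20B while the pile grows
sizes = [1e9, 1e10, 1e11]

params_only = [loss(n, FIXED_TOKENS) for n in sizes]   # bigger pile, same photos
both_scaled = [loss(n, 20 * n) for n in sizes]         # photos grow with the pile
token_floor = E + B / FIXED_TOKENS ** BETA             # best case with photos fixed

for n, fixed_d, scaled_d in zip(sizes, params_only, both_scaled):
    print(f"N={n:.0e}  D fixed: {fixed_d:.3f}   D=20N: {scaled_d:.3f}")
print(f"floor when D stays at 20B: {token_floor:.3f}")
```

With the photos fixed, the loss can never drop below the token term's contribution no matter how large the pile gets. That is the "parameter ramp flattens early" behaviour in numbers.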


Worked numerical: three scenarios at the same pile size

Take a 1B-parameter rule pile. The compute budget for training scales roughly as C ≈ 6 · N · D FLOPs (this is the standard transformer training-compute estimate, six FLOPs per parameter per token).

So C ≈ 6 · 10⁹ · D. For D = 20B tokens, C ≈ 1.2 · 10²⁰ FLOPs. Treat that as the reference budget. The three scenarios keep the pile at 1B parameters and change only the token count, so the compute changes with it.

Scenario A — under-trained (1B params, 1B tokens)

Compute used: 6 · 10⁹ · 10⁹ = 6 · 10¹⁸ FLOPs. Tiny fraction of the budget. Pile is huge for the photos. Photos run out before the pile is fed. Loss sits at, say, 2.40 nats/token — leaves loss on the table because we stopped early.

Scenario B — compute-optimal (1B params, 20B tokens)

Compute used: 6 · 10⁹ · 2 · 10¹⁰ = 1.2 · 10²⁰ FLOPs. Ratio 1:20. Both ramps still descending at the same rate. Loss lands at, say, 2.05 nats/token. This is the Chinchilla balance.

Scenario C — over-trained (1B params, 200B tokens)

Compute used: 6 · 10⁹ · 2 · 10¹¹ = 1.2 · 10²¹ FLOPs. 10× the budget of B. We bought 10× the compute, and the photos ramp keeps descending, but slowly. Loss reaches, say, 1.95 nats/token. Improvement over B: 0.10 nats. For 10× the cost.

The cat-robot picture. A is a rule pile that never saw enough photos. C is a small pile drowning in photos with most of them wasted. B is the balanced robot. 20× the compute of A, and a much lower loss. One-tenth the cost of C. Almost the same loss.

So what to do? Match the photos to the pile. Twenty per parameter is the rule.
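The compute arithmetic for the three scenarios is easy to check by machine. A minimal sketch using only the C ≈ 6 · N · D estimate from this section; the helper name `training_flops` is made up here, and the losses are left out because the values above are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Training compute from the rule of thumb C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

N = 1e9  # the 1B-parameter pile used in all three scenarios
scenarios = {
    "A (under-trained)": 1e9,
    "B (20 tokens per parameter)": 2e10,
    "C (over-trained)": 2e11,
}
reference = training_flops(N, 2e10)  # scenario B is the reference budget

for name, n_tokens in scenarios.items():
    flops = training_flops(N, n_tokens)
    print(f"{name}: {n_tokens / N:.0f} tokens/param, "
          f"{flops:.2e} FLOPs, {flops / reference:.2f}x B")
```

A lands at 0.05× the compute of B and C at 10× the compute of B, which is why A is cheap but starved and C is expensive for its small gain.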


"Bigger model > more data" — why that claim is wrong

The Kaplan paper (early LLM scaling work) said one thing. Chinchilla disproved it. The early claim — for a fixed compute budget, bigger model and fewer tokens wins. Frontier labs believed this. Built giant under-trained piles.

Show the math. Three concrete budget comparisons.

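Comparison 1: GPT-3-style vs Chinchilla-style

Model            | Params N | Tokens D | Compute (6·N·D) | Params : tokens | Reported loss
GPT-3 style      | 175B     | 300B     | 3.15 · 10²³     | 1 : 1.7         | higher
Chinchilla style | 70B      | 1.4T     | 5.88 · 10²³     | 1 : 20          | lower

Chinchilla used about 1.9× GPT-3's compute, so this pair is not a strict equal-budget test. The equal-budget test in the Chinchilla paper was against Gopher (280B parameters on 300B tokens, about 5 · 10²³ FLOPs), and the 70B pile won. At GPT-3's own budget, the same 6·N·D arithmetic gives a 70B pile about 750B tokens, roughly 11 per parameter, and the Chinchilla fit still predicts it beats the 175B-on-300B pile. Smaller pile, more photos, wins. At a fixed budget, the ratio matters more than the parameter count.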

Comparison 2 — the same lesson at smaller scale

Two piles, both targeting ~10²¹ FLOPs:

Pile      | Params | Tokens | Params : tokens | Outcome
Top-heavy | 10B    | 17B    | 1 : 1.7         | starves for photos
Balanced  | 3.3B   | 50B    | 1 : 15          | lower loss

Balanced wins again. Same compute. Smaller pile fed more.
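Going the other way, from a budget to a balanced pile, takes one square root. A minimal sketch assuming the C ≈ 6 · N · D estimate and a target of about 20 tokens per parameter; the function name `compute_optimal_split` and its default are illustrative, not a published recipe.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal_split(1e21)
print(f"N ≈ {n_params:.2e} params, D ≈ {n_tokens:.2e} tokens")
```

For a 10²¹ FLOP budget this gives roughly 2.9B parameters and 58B tokens, close to the "Balanced" row above. Because N and D both grow as the square root of C, a 10× budget buys only about a 3.2× bigger pile.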

Comparison 3 — over-trained inference-economy pile

Now flip the goal. We will serve this model to a billion users. Inference cost matters more than training cost.

Pile               | Params | Tokens | Params : tokens | Train compute (6·N·D) | Inference cost   | Quality
Chinchilla-optimal | 70B    | 1.4T   | 1 : 20          | 5.9 · 10²³            | high (big pile)  | good
LLaMA 3 8B style   | 8B     | 15T    | 1 : 1875        | 7.2 · 10²³            | low (small pile) | comparable

LLaMA 3 8B is wildly past Chinchilla. The training compute is far from optimal. But the small pile is forever cheap to serve. Over-training small models is the inference-economy move. Chinchilla optimizes training. Production optimizes serving.

The rule pile from the ELI5, picked for the right job.
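The training-versus-serving trade can be put into rough numbers. The sketch below assumes inference costs about 2 FLOPs per parameter per served token (a common forward-pass rule of thumb, not a figure from this chapter) and takes the table's "comparable quality" at face value. The function names are made up for illustration.

```python
def training_flops(n_params, n_tokens):
    """Training compute from C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

def lifetime_flops(n_params, n_tokens, served_tokens, flops_per_param_token=2):
    """Training compute plus assumed inference compute over the deployment."""
    return (training_flops(n_params, n_tokens)
            + flops_per_param_token * n_params * served_tokens)

def break_even_served_tokens(big, small, flops_per_param_token=2):
    """Served tokens beyond which the small pile has the lower lifetime total."""
    extra_training = training_flops(*small) - training_flops(*big)
    saving_per_served_token = flops_per_param_token * (big[0] - small[0])
    return extra_training / saving_per_served_token

BIG_PILE = (70e9, 1.4e12)    # (params, training tokens), Chinchilla-style
SMALL_PILE = (8e9, 15e12)    # (params, training tokens), LLaMA 3 8B style

break_even = break_even_served_tokens(BIG_PILE, SMALL_PILE)
print(f"break-even at about {break_even:.2e} served tokens")
```

Under these assumptions the small pile's extra training bill is repaid after roughly a trillion served tokens. A service with heavy traffic passes that point quickly, which is the whole case for over-training.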


Diminishing returns but never flat

The slope on log-log paper is the scaling exponent. With α = 0.05, doubling parameters multiplies loss by 2⁻⁰·⁰⁵ ≈ 0.966, a 3.4% improvement. Tiny per doubling.

But power laws compound. Twenty doublings (1M× more parameters) → 0.966²⁰ ≈ 0.50. Loss halves. The ramp keeps descending. No magic wall.

This is why labs scale. Each step looks small. The compounded gain is huge. Until the ramp eventually does flatten — at the entropy floor of the data, roughly. You cannot predict text below the inherent randomness of text. That is the asymptote, not the slope.
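The compounding is quick to verify. A minimal sketch assuming a pure power law with no floor, which holds only on the straight part of the ramp; the helper names are made up for illustration.

```python
def loss_multiplier(alpha, doublings=1):
    """Factor applied to the loss after doubling parameters `doublings` times."""
    return 2.0 ** (-alpha * doublings)

def doublings_to_halve(alpha):
    """Doublings needed before the loss multiplier reaches 0.5."""
    return 1.0 / alpha

for alpha in (0.05, 0.1):
    print(f"alpha={alpha}: one doubling -> x{loss_multiplier(alpha):.3f}, "
          f"halving needs {doublings_to_halve(alpha):.0f} doublings")
```

With α = 0.05 it takes twenty doublings to halve the loss, and with α = 0.1 it takes ten. A small change in the exponent changes the cost of progress enormously.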


Pause and recall before moving on. Without scrolling: what is the ratio? On log-log paper, what is the shape of loss vs parameters? Why is "bigger pile, fewer photos" wrong at fixed compute?


Where this lives in the wild

  • Chinchilla / DeepMind scaling work. Established the 20:1 ratio. Showed that 70B params trained on 1.4T tokens beats Gopher's 280B params trained on 300B tokens at about the same compute, and also beats the 175B-on-300B GPT-3. Reset frontier-lab strategy.
  • LLaMA 3 (Meta). The 8B and 70B variants were trained on 15T tokens. Ratio of ~1:1875 for the 8B — wildly over-trained relative to Chinchilla. Why? Inference economics. Smaller pile is cheaper to serve forever. The training over-spend pays back across billions of queries.
  • GPT-4 (rumored). Industry analysis suggests its training mix moved toward Chinchilla-style or past-Chinchilla ratios after GPT-3's lesson. Mixture-of-experts adds another knob — active parameters per token differ from total parameters.
  • Mosaic-BERT (MosaicML / Databricks). Encoder-style training where data-to-parameter tradeoffs were explicitly studied. Reported that BERT-class models could be pretrained for ~$20 on a small rented GPU cluster, through a tuned architecture and training recipe. Same ramp, smaller pile, sized correctly.
  • Gemini / Gemini Nano (Google). The Nano variants are deliberately over-trained small piles for on-device inference. Same playbook as LLaMA 3 8B — sacrifice training-compute optimality for serving cost.

The pattern. Every modern frontier-lab pile sits somewhere on the parameter-vs-token curve. The choice of where depends on whether training cost or inference cost dominates.


Honest admission

Scaling laws are empirical, not theoretical. The slope α is fitted from runs. It varies with architecture, dataset, tokenizer, training recipe. The 20:1 ratio is a fit on one family of decoder-only transformers. Encoder models scale differently. Mixture-of-experts piles scale differently. Multi-modal piles scale differently.

And the ramp eventually flattens. Where? Nobody knows precisely. Frontier labs hit the wall of data quality, not compute. There are only so many high-quality tokens on the internet. Past that, the cat photos are blurry — and a bigger pile cannot extract a sharper signal from blurry photos.

This is the modern bottleneck. Not GPUs. Photos.


Interview Q&A

Q: What does "compute-optimal" mean exactly?
A: For a fixed compute budget C, compute-optimal means choosing parameters N and tokens D such that loss is minimized. Empirically, this happens near D / N ≈ 20. It is the bottom of the loss-vs-(N, D) surface at constant compute. Not the smallest model, not the biggest model: the balanced one.
Common wrong answer to avoid: "compute-optimal = lowest possible loss". No. Compute-optimal is lowest loss for a given compute. With infinite compute you train a bigger pile on more photos and get lower loss. The optimum is conditional on the budget.

Q: Why is the 20:1 ratio not always the right answer?
A: Because Chinchilla optimizes training compute. Production-served models (LLaMA 3 8B, Gemini Nano) are over-trained on purpose, far past 1:20, because inference cost dominates total lifetime cost when the model serves billions of queries. A smaller, over-trained pile is forever cheaper to serve.
Common wrong answer to avoid: "20:1 is the law". It is the training-optimal point. Inference economics, data availability, and architecture choices shift the optimum.

Q: Why do labs train past the Chinchilla optimum?
A: Inference cost. Training is a one-time bill. Inference is forever. Halving parameters halves serving FLOPs forever. Trading 10× more training compute for a smaller pile of equal quality pays back across the deployment lifetime. The math shifts when usage is high.

Q: Where does scaling break down?
A: Three places. One: data quality runs out. Past trillions of high-quality tokens, you start training on noise and gains shrink. Two: architecture saturates. Decoder-only transformers may not have an infinite slope; new architectures may have a different α. Three: the entropy floor of language. You cannot predict below the inherent randomness of the data. The ramp does eventually flatten.
Common wrong answer to avoid: "scaling laws prove AGI is inevitable". They prove only that on the current curve, more compute and more clean data give predictable loss reduction. Capability emergence, reasoning, generalization: those are not direct outputs of the loss curve. Loss going down does not mean every capability scales smoothly.


Apply now (5 min)

Take a 7B-parameter pile. Compute-optimal data?

N = 7 · 10⁹
D = 20 · N = 1.4 · 10¹¹ tokens = 140B tokens
Compute ≈ 6 · N · D = 6 · 7·10⁹ · 1.4·10¹¹ ≈ 5.9 · 10²¹ FLOPs

So a 7B compute-optimal pile wants ~140B tokens and ~6 · 10²¹ FLOPs. Now answer — if your serving traffic is high, would you train it on 140B tokens or push to 1T+? Which cost dominates?

Then — without looking — sketch from memory:

  1. The log-log loss curve. Two axes. Power-law ramp.
  2. The balance scale: parameters vs tokens, ratio 1:20.
  3. One sentence: why GPT-3 was top-heavy.

If you can reproduce all three in under 90 seconds, you own this idea.


Bridge. Scaling laws give a ramp, not a magic ladder. Loss drops predictably — but capability, generalization, the things we actually care about, do not always follow the ramp. Before we admit what is still mysterious, one more practical topic: the normalization layers that keep deep networks trainable. Read 12-batch-norm-layer-norm.md.