13. Honest admission: what we still do not know
The previous eleven files were "this fails, here is the fix". This one is different. We sit with three things that work — and admit nobody fully knows why.
Built on the cat-robot ELI5 in 00-eli5.md. The "rule pile" comes back. So does the cat-robot. The point of this file is to be honest about the limits of the picture.
Why this file exists
See. So far every chapter has been a story with a clean ending. Something fails. We see why. We fix it. We move on.
That neatness is partly a teacher's trick. It hides the parts we cannot explain.
A Lead AI Engineer who pretends everything is solved gets caught the moment a serious question lands. So we name three open puzzles. We say "we do not know" out loud. That is not weakness. That is the position of every research lab on the planet right now.
The three puzzles. Generalization without overfitting. The lottery ticket. Grokking.
Puzzle 1: the generalization mystery
What classical statistics predicts
Picture an old textbook. It says — if your model has more knobs than data points, it will memorize the data and fail on anything new. This is overfitting. It is the central warning of every statistics course taught for the last hundred years.
The picture in your head:
test error
  ↑
  | \                          /
  |  \                        /   overfitting
  |   \                      /
  |    \                  __/
  |     \______    ______/
  |            \__/
  |         sweet spot
  +────────────────────────────→ model size
     small      big      huge
Classical theory. As you grow the model past the sweet spot, test error climbs. The bigger the model, the worse it generalizes.
What neural nets actually do
Now look at a real network. A ResNet-50 has roughly 25 million parameters. ImageNet has about 1.3 million training images. That is roughly twenty parameters per image.
Classical theory screams — this should overfit catastrophically.
It does not. It generalizes. Top-5 accuracy above 90% on images the network has never seen.
Stranger still. Make the network bigger. Twice as big. Ten times as big. Test error often keeps dropping. The actual curve looks like this:
test error
  ↑
  | \         classical
  |  \          bump
  |   \          __
  |    \        /  \          modern descent
  |     \______/    \
  |                  \______________
  +──────────────────────────────────→ model size
      sweet   interpolation     huge
      spot     threshold      overparam
This is called double descent. Test error rises near the interpolation threshold (the model just barely big enough to memorize the data) — then falls again as the model gets bigger.
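The shape can be reproduced in a toy setting. The sketch below is not a neural network. It assumes a linear model that sees only the first `p` of 100 random features, fitted by plain least squares below the interpolation threshold and by the minimum-norm interpolating solution above it. All names (`solve`, `fit`, `median_test_error`) and all sizes (20 training points, noise level 0.5) are illustrative choices, and in practice a linear-algebra library would replace the hand-written solver.

```python
import random
import statistics

def solve(A, b):
    """Solve the square system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit(X, y):
    """Least squares for p < n; minimum-norm interpolation for p >= n."""
    n, p = len(X), len(X[0])
    if p < n:
        # Normal equations: (X^T X) w = X^T y
        XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
        Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
        return solve(XtX, Xty)
    # Minimum-norm solution: w = X^T (X X^T)^{-1} y
    XXt = [[sum(a * b for a, b in zip(ri, rj)) for rj in X] for ri in X]
    alpha = solve(XXt, y)
    return [sum(alpha[k] * X[k][i] for k in range(n)) for i in range(p)]

def median_test_error(p, n=20, total=100, noise=0.5, trials=20, seed=0):
    """Median test MSE when the model sees only the first p of `total` features."""
    rng = random.Random(seed)  # same seed for every p, so every p sees the same data
    errors = []
    for _ in range(trials):
        beta = [rng.gauss(0, 1) / total ** 0.5 for _ in range(total)]

        def sample(m):
            X = [[rng.gauss(0, 1) for _ in range(total)] for _ in range(m)]
            y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, noise) for row in X]
            return X, y

        X_train, y_train = sample(n)
        X_test, y_test = sample(100)
        w = fit([row[:p] for row in X_train], y_train)
        squared = [(sum(wi * xi for wi, xi in zip(w, row[:p])) - t) ** 2
                   for row, t in zip(X_test, y_test)]
        errors.append(sum(squared) / len(squared))
    return statistics.median(errors)

# Sweep model size p through the interpolation threshold p = n = 20.
curve = {p: median_test_error(p) for p in (2, 5, 10, 15, 20, 25, 40, 100)}
for p, err in curve.items():
    print(f"p={p:3d}  median test MSE={err:.2f}")
```

In this toy the error should spike around `p = 20`, where the model has just enough parameters to fit the 20 training points exactly, and fall again as `p` grows. It illustrates the shape only. It does not explain why deep networks behave this way.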
The honest part
Why does this happen? We have theories. Implicit regularization from SGD. Flat minima. The lottery ticket idea (next puzzle). Neural tangent kernels.
None of them is complete. Each explains a piece. None predicts which architecture will generalize and by how much.
This is the rule pile behaving in a way the math books did not warn us about. A pile with billions of rules trained on millions of cat photos generalizes. Classical learning theory says it should not. It does. We do not know why.
Puzzle 2: the lottery ticket hypothesis
The puzzle in one experiment
Train a network. It works. Now look at the trained weights — most are tiny. Near zero. A small fraction carry almost all the signal.
Question. What if those small weights never mattered? What if the network was actually just one small subnetwork doing the work, and the rest was dead weight?
The experiment that made this concrete:
- Train a big network normally. Save the weights.
- Find the top 10% of weights by magnitude. Throw the rest away.
- Reset the surviving weights to their original random initialization. Not the trained values — the values they had on day zero.
- Retrain just this small subnetwork from those original random values.
Result. The small subnetwork — about 10% of the original size — trains to roughly the same accuracy. Sometimes faster. Sometimes better.
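The prune-and-reset part of the four steps above can be sketched on a flat list of weights. This is a minimal illustration, assuming one-shot magnitude pruning. The helper names `magnitude_mask` and `rewind_to_init` are hypothetical, the weight values are made up, and the two training runs (steps one and four) are left out.

```python
def magnitude_mask(trained_weights, keep_fraction=0.1):
    """Step 2: mark the largest-magnitude weights as survivors."""
    count = max(1, round(len(trained_weights) * keep_fraction))
    ranked = sorted(range(len(trained_weights)),
                    key=lambda i: abs(trained_weights[i]), reverse=True)
    survivors = set(ranked[:count])
    return [i in survivors for i in range(len(trained_weights))]

def rewind_to_init(initial_weights, mask):
    """Step 3: survivors get their day-zero values back; the rest are zeroed."""
    return [w if keep else 0.0 for w, keep in zip(initial_weights, mask)]

# Made-up numbers: ten weights at initialization and after training.
initial = [0.5, -0.2, 0.1, 0.3, -0.4, 0.05, 0.6, -0.1, 0.2, -0.3]
trained = [0.01, -1.8, 0.02, 0.03, -0.04, 0.0, 0.05, -0.02, 0.01, -0.03]

mask = magnitude_mask(trained, keep_fraction=0.1)
ticket = rewind_to_init(initial, mask)
print(mask)
print(ticket)
# Step 4 (not shown): retrain `ticket`, keeping masked-out weights fixed at zero.
```

Note that the mask is chosen from the trained weights, but the values that survive come from `initial`. That detail is what separates the lottery ticket experiment from ordinary pruning.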
What this means
The big network was, in some sense, a lottery. Most of its weights were noise. A few of them — the winning ticket — happened to start near a good solution. Training mostly served to grow that one ticket while the rest faded.
The cat-robot picture. Imagine the rule pile had a million rules, but only a hundred thousand of them were ever doing real work. The rest were silent passengers. We trained them all. We paid GPU time for all of them. Only a small subnetwork mattered.
Why this is a puzzle, not a solution
If we knew which 10% would win, we could train only that subnetwork. We do not. The only reliable way to find the winning ticket is to train the full network first and then look back. There is no dependable a-priori test for "which random initialization is a winning ticket".
So we burn 10x the compute. We throw away 90% of the trained weights. We do not understand what made the surviving 10% special, beyond "they happened to start in the right neighborhood".
This is not a solved problem dressed up as a puzzle. We genuinely do not know.
Puzzle 3: grokking
The expected learning curve
Train a network on a small task — say, modular arithmetic. Compute (a + b) mod 97 for pairs of numbers. Cute toy problem.
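For concreteness, the whole task fits in a few lines. The sketch below builds every pair for addition mod 97 and holds most of them out. The function name `modular_addition_dataset` and the 30% training fraction are assumptions chosen for illustration, not values taken from a specific experiment.

```python
import random

def modular_addition_dataset(p=97, train_fraction=0.3, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples."""
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

train, test = modular_addition_dataset()
print(len(train), len(test))                # sizes of the two splits
print(f"chance accuracy: {1 / 97:.3f}")     # guessing one of 97 classes
```

With only 97 × 97 = 9,409 possible examples, a network can memorize the training split easily. The held-out pairs are what reveal whether it learned the rule.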
The expected picture:
accuracy
  ↑
1 |              ____________
  |          ___/      ← train and test rise together
  |       __/
  |    __/
  |   /
0 +───────────────────────→ training steps
Train accuracy climbs. Test accuracy climbs alongside it. Both reach a plateau. Story over.
What actually happens: the grokking curve
Sometimes — especially with weight decay and small data — you see this instead:
accuracy
  ↑
1 |    train ______________________________
  |        /                    ┌── test ──
  |       /                    /
  |      /    ← long          /
  |     /     memorization   /
  |    /      plateau       /
  |   /  __________________/
0 +──────────────────────────────────────→ training steps
     memorize     do nothing     generalize
      (fast)        (long)       (suddenly)
The network memorizes the training set quickly. Test accuracy stays at chance. You wait. Nothing visible happens for thousands of steps. Tens of thousands. Sometimes a hundred times longer than memorization took.
Then — suddenly, with no obvious trigger — test accuracy snaps from chance to near-perfect.
The network "groks" the underlying rule. It stops memorizing and starts generalizing.
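One way to put a number on that delay is to compare the first step where training accuracy crosses a threshold with the first step where test accuracy does. The sketch below assumes a log of `(step, metrics)` pairs. The log format, the helper names, the 0.99 threshold, and the numbers in `log` are all hypothetical.

```python
def first_step_reaching(log, key, threshold):
    """Return the first logged step where metrics[key] >= threshold, else None."""
    for step, metrics in log:
        if metrics[key] >= threshold:
            return step
    return None

def grokking_report(log, threshold=0.99):
    """Summarize when the run memorized, when it generalized, and the ratio."""
    memorized_at = first_step_reaching(log, "train_acc", threshold)
    generalized_at = first_step_reaching(log, "test_acc", threshold)
    ratio = None
    if memorized_at is not None and generalized_at is not None and memorized_at > 0:
        ratio = generalized_at / memorized_at
    return {"memorized_at": memorized_at,
            "generalized_at": generalized_at,
            "delay_ratio": ratio}

# Made-up evaluation log shaped like the curve described above.
log = [
    (100,   {"train_acc": 0.50, "test_acc": 0.01}),
    (500,   {"train_acc": 1.00, "test_acc": 0.01}),
    (5000,  {"train_acc": 1.00, "test_acc": 0.02}),
    (50000, {"train_acc": 1.00, "test_acc": 0.35}),
    (60000, {"train_acc": 1.00, "test_acc": 0.995}),
]
report = grokking_report(log)
print(report)
```

A `delay_ratio` far above 1 is the signature of the curve above. A `generalized_at` of `None` means only that the run has not generalized yet, not that it never will.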
Why this is a puzzle
Standard training-curve intuition says — if the loss has stopped decreasing for ten thousand steps, you are done. Stop training. Move on.
Grokking says — sometimes you are not done. The network is silently reorganizing. The weights are drifting toward a more general solution while training accuracy stays pinned at 100%. Eventually the rearrangement crosses a threshold and test accuracy jumps.
We do not know:
- When grokking will happen versus when it will not.
- What architecture or hyperparameter choices induce it.
- Whether something grokking-shaped is happening invisibly inside large language models too — and if so, on what timescale.
The cat-robot is silently rebuilding itself in the dark. We cannot see it from the loss curve. We only notice when test accuracy moves.
Where these puzzles show up in real systems
Four real artifacts. Each is connected to one of these puzzles.
- GPT-3's emergent abilities. Certain tasks, like multi-step arithmetic or analogical reasoning, show sharp jumps at specific parameter counts. Schaeffer et al. (2023) argued some of those jumps are partly metric artifacts from nonlinear evaluation, but the practical pattern still holds: some capabilities seem to need a minimum pile size. This looks more like grokking-on-scale than grokking-on-time. We still do not have a theory that predicts which abilities will appear at which scale.
- Lottery ticket pruning at production scale. Several labs have shown that large vision and language models can be pruned to 10–30% of their original size with almost no quality loss, if you find the right ticket. Hardware-aware pruning (structured sparsity on NVIDIA Ampere/Hopper) ships in production inference today. The pruning works. The theory of which tickets win does not.
- Modular arithmetic toy networks. The cleanest grokking demonstrations are small networks (tiny by modern standards) learning addition mod p. Researchers can watch the weights reorganize into Fourier-like structures during the long silent plateau. This is the only setting where we can almost see grokking happen mechanically, and even there the trigger is unclear.
- Double descent in vision benchmarks. CIFAR-10 and ImageNet can both show the double-descent shape when you sweep model size, most clearly when the training labels are noisy. Production teams use this to justify "just train a bigger model" as a default move: the bigger model often generalizes better, against classical intuition. The empirical curve drives the decision; the theory follows behind.
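The metric-artifact argument in the first item can be made concrete with a little arithmetic. The sketch below assumes that an answer counts as correct only if all of its tokens are correct, and that token errors are independent. Both are simplifying assumptions, and `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right, assuming independence."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly; exact match on a 10-token answer does not.
for accuracy in (0.50, 0.70, 0.90, 0.99):
    rate = exact_match_rate(accuracy, 10)
    print(f"per-token {accuracy:.2f} -> exact match {rate:.3f}")
```

Under these assumptions, per-token accuracy climbing steadily from 0.50 to 0.99 moves exact match from about 0.001 to about 0.904. Most of that gain arrives at the very end, so a smooth improvement can look like a sudden jump when it is scored with an all-or-nothing metric.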
Pause and recall. Without scrolling: what does classical statistics predict about an over-parameterized network? What is the lottery ticket experiment in four steps? Sketch the grokking curve in your head. If any of the three is fuzzy, scroll back.
What this means for your job
You are not going to solve these puzzles. Nobody on your team will. The frontier labs have not solved them.
What you do instead.
- Default to over-parameterization. Classical "bigger model = more overfitting" is wrong for deep nets in most regimes. Reach for the bigger model first. Regularize after.
- Do not stop training too early on small algorithmic tasks. If the loss has plateaued but the task has structure to discover, give it more steps. Grokking happens.
- Plan for emergence. Capabilities can appear at scale that did not exist below it. Evaluate at every checkpoint. The model you have at 7B is not the same animal as the model you have at 70B.
- Be honest in interviews. "We do not know exactly why" is the correct answer to "why does this generalize". Anyone who claims a complete theory is bluffing.
The cat-robot still works. We can build it. We can ship it. We can debug it. We just cannot fully explain why it generalizes when classical theory says it should not.
That is okay. Engineering moves faster than theory in this field. Has for a decade. Probably will for another.
Interview Q&A
Q: Why does over-parameterization not destroy generalization?
A: Honestly — we do not have a complete answer. Partial explanations include implicit regularization from SGD favoring flat minima, the lottery ticket effect making the effective model much smaller than the parameter count suggests, and double descent showing test error falls again past the interpolation threshold. Each piece explains some of the picture. None is complete.
Common wrong answer to avoid: "more parameters = more capacity = always better." That confuses parameter count with effective capacity. The real story is that SGD on deep nets does not search the full hypothesis space the parameter count would suggest.
Q: Is grokking a real phenomenon or a training artifact?
A: It is real and reproducible. It is cleanly observed in small algorithmic tasks like modular arithmetic, with weight decay and small data. Whether something grokking-shaped happens at scale in large language models, on what timescale, and whether emergent abilities are the same phenomenon or a different one: those are open questions. The toy version is solid science. The big-model analogue is a hypothesis.
Common wrong answer to avoid: "grokking just means the optimizer was stuck and finally escaped." It is more specific — train accuracy is already at 100% during the silent plateau, so the optimizer is not stuck on training loss. Something else is reorganizing.
Q: If most weights do not matter, why train them?
A: Because we do not know in advance which weights will matter. The lottery ticket can only be identified reliably after training the full network. There is no current method that dependably picks the winning subnetwork from a fresh random initialization. So we pay the full compute cost to discover which 10% was load-bearing.
Common wrong answer to avoid: "we train all weights because every weight contributes a little." For trained networks the weight magnitudes are extremely heavy-tailed — a small fraction carry almost all the signal.
Q: Why does this matter for a Lead AI Engineer?
A: Three reasons. One — your defaults change. Bigger model first, regularize after, do not trust classical "fewer parameters generalize better" intuition. Two — your evaluation cadence changes. Capabilities can appear at scale or after long plateaus, so you evaluate continuously, not at one fixed checkpoint. Three — your honesty changes. When something generalizes mysteriously well or fails mysteriously, you do not pretend to have a clean theory. You measure, you ablate, you ship.
Apply now (5 min)
No code this time. Get a piece of paper and a pen.
- Write one sentence describing what classical statistics predicts about a 100M-parameter network trained on 1M examples. Use the word "overfit".
- Write one sentence describing what actually happens when you train such a network on something like ImageNet. Use the word "generalize".
- Hold those two sentences side by side. Notice the contradiction. That gap is the field's open question.
- Sketch the grokking curve from memory — train accuracy hits 100% fast, test accuracy stays at chance for a long plateau, then jumps. Label the three regions: memorize, do nothing, generalize.
Keep the paper. The next time someone tells you "deep learning is just statistics", show it to them.
The end of this module
Eleven failures. Eleven fixes. One honest admission. That is the cat-robot's complete story for now.
The rule pile bends. The nudge flows. The smart nudge remembers. The pile scales. And somehow, against the textbook, it generalizes.
You can build it. You can ship it. You can debug it. You can interview about it. The pieces you do not fully understand — you can name them honestly.
That is the bar.
Bridge. The rule pile processes tokens. But what is a token? The next module tears language into pieces, turns those pieces into vectors, and shows how each token attends to the others that matter. Start at ../02_tokens_embeddings_context/00-eli5.md.