14. Honest admission — what still feels unsolved

The clean module helps. The rough edges still matter. Good engineers can hold both.

Built on the ELI5 in 00-eli5.md. The spotlight beam — attention over context — was the module's clean picture, and this file names where that picture still bends.


Mental model — a useful map with rough roads

See. A teaching module gives you the stable skeleton. The splitter. The badge board. The seat number. The spotlight beam. The scorecard. That skeleton is real. It is also incomplete. Production systems keep running into rough edges. Some are cost problems. Some are fairness problems. Some are interpretation problems. Some are still research problems. So what do you do? Do not throw away the clean picture. Keep it. But mark the cracks on the map. That is what this file does.

Formula snapshot — where the pain shows up

A few small formulas reveal most of the pain.

Attention cost

attention work ~ L x L

Double the sequence length and the score matrix has four times as many cells.

Token inflation across languages

token burden = tokens needed / meaning carried

Same meaning. Different languages. Different token counts.

Position stretch

reliable context <= advertised context

A model may accept a long window. That does not mean it uses it well.

Attention weights as evidence

attention map != full causal explanation

Useful clue. Not final proof.

Worked numerical examples — where intuition starts to wobble

Example 1 — long context gets expensive quickly

Suppose prompt length L = 512. Score cells are:

512 x 512 = 262,144

Now move to L = 4,096.

4,096 x 4,096 = 16,777,216

That is not 8 times bigger. It is 64 times more score cells. ASCII picture (each # stands for 262,144 score cells):

L=512    #
L=4096   ################################################################

So long-context serving remains costly.
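
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: `score_cells` is a hypothetical helper name, and it counts cells for one head in one layer only, ignoring batch size and causal masking.

```python
def score_cells(seq_len: int) -> int:
    """Number of entries in one L x L attention score matrix."""
    return seq_len * seq_len


short_len, long_len = 512, 4096

length_ratio = long_len // short_len
cell_ratio = score_cells(long_len) // score_cells(short_len)

print(f"L={short_len}: {score_cells(short_len):,} cells")
print(f"L={long_len}: {score_cells(long_len):,} cells")
print(f"length grew {length_ratio}x, score cells grew {cell_ratio}x")
```

The same helper answers the doubling question: going from 512 to 1,024 multiplies the cell count by 4, not 2.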

Example 2 — tokenization is not equally kind to every language

Imagine two short messages with similar meaning.

Message A meaning units -> 10 tokens
Message B meaning units -> 16 tokens

If billing and context both count tokens, then message B pays 60% more for the same meaning. ASCII picture:

same idea
English-ish   : ||||||||||
Language B    : ||||||||||||||||

That affects latency. That affects cost. That affects who gets squeezed by context limits.
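
A rough sketch of what that gap does to a context budget, using the 10 and 16 token counts from the example. The window size `CONTEXT_WINDOW` is a made-up number for illustration, and `relative_burden` is a hypothetical helper, not part of any library.

```python
def relative_burden(tokens: int, baseline_tokens: int) -> float:
    """How many times more tokens a message needs than the baseline message."""
    return tokens / baseline_tokens


# Token counts from the example: same meaning, two languages.
tokens_a, tokens_b = 10, 16

burden_b = relative_burden(tokens_b, tokens_a)

# Hypothetical context window, chosen only to make the arithmetic easy.
CONTEXT_WINDOW = 8_000

fit_a = CONTEXT_WINDOW // tokens_a  # messages like A that fit in the window
fit_b = CONTEXT_WINDOW // tokens_b  # messages like B that fit in the window

print(f"Language B uses {burden_b:.1f}x the tokens of the baseline")
print(f"Messages that fit in the window: A={fit_a}, B={fit_b}")
```

If billing is per token, the same multiplier applies to cost. The window holds fewer Language B messages for the same amount of meaning.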

Example 3 — attention maps help, but not perfectly

Suppose one head for token bank gives:

river  0.52
loan   0.10
the    0.08
bank   0.30

Good. We learned the model looked at river. But we still do not know the whole causal chain. Another head may matter more. The FFN may transform the signal later. Residual paths may dominate. So the map is useful. Not complete.
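
One concrete reason the weights alone are incomplete: a head's output is the weighted sum of value vectors, so what a token contributes depends on its value vector as well as its weight. The sketch below reuses the weights from the example. The 2-d value vectors are invented purely for illustration and do not come from any real model.

```python
import math

# Attention weights from the example above (they sum to 1).
weights = {"river": 0.52, "loan": 0.10, "the": 0.08, "bank": 0.30}

# Hypothetical value vectors, invented for illustration only.
values = {
    "river": (0.1, 0.1),
    "loan": (0.2, -0.1),
    "the": (0.0, 0.1),
    "bank": (1.5, -2.0),
}


def contribution_norm(token: str) -> float:
    """Size of this token's term in the head output: weight * ||value||."""
    return weights[token] * math.hypot(*values[token])


top_by_weight = max(weights, key=weights.get)
top_by_contribution = max(weights, key=contribution_norm)

print("highest attention weight:", top_by_weight)
print("largest contribution to head output:", top_by_contribution)
```

With these made-up values, river has the highest weight but bank contributes the larger vector to the head output. The heatmap would show river; the output would be shaped mostly by bank.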

What still feels unsolved

1) Tokenization is still a compromise

There is no universal splitter. Subword tokenization is practical. It is not perfect. Some languages and scripts get awkward splits. Code, emojis, transliteration, and mixed-language text stress the vocabulary. So the splitter is still a compromise. Not a solved endpoint.
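
Part of the unevenness shows up before any vocabulary is learned. Byte-level BPE tokenizers start from UTF-8 bytes, and UTF-8 spends more bytes per character on some scripts than on others. The sketch below only counts characters and bytes. It is not a tokenizer, and real token counts depend on the learned merges of a specific vocabulary.

```python
samples = {
    "english": "hello",
    # "namaste" in Devanagari, written with escapes so the source stays ASCII.
    "hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
    "emoji": "\U0001F642",
}


def utf8_stats(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


for name, text in samples.items():
    chars, nbytes = utf8_stats(text)
    print(f"{name:8s} code points={chars:2d} utf8 bytes={nbytes:2d}")
```

A script that starts at three bytes per character needs more merges to reach the same tokens-per-word as English, and a vocabulary trained mostly on English text may not spend its merges there.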

2) Attention is expensive

The L x L cost is still the headline problem. Flash-style kernels help. Caching helps during generation. Sparse and linear ideas help in pockets. Still, long-context serving is expensive in both memory and latency. That pain has not gone away.
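
To see what caching does and does not buy, the sketch below counts query-key score computations during generation under a simplified model: one head, one layer, decode steps only (prefill is not counted), and no causal-mask savings in the uncached case. `decode_score_cells` is a hypothetical helper written for this illustration.

```python
def decode_score_cells(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Total score cells computed while generating new_tokens after a prompt.

    With a key/value cache, each step scores only the newest query against
    all t keys. Without one, each step recomputes the full t x t grid.
    """
    total = 0
    for step in range(1, new_tokens + 1):
        t = prompt_len + step  # sequence length once this token is included
        total += t if use_cache else t * t
    return total


for prompt_len in (512, 1024):
    cached = decode_score_cells(prompt_len, 128, use_cache=True)
    uncached = decode_score_cells(prompt_len, 128, use_cache=False)
    print(f"prompt={prompt_len}: cached={cached:,} uncached={uncached:,}")
```

The cache removes the repeated work, but each new token still pays a cost proportional to everything before it, and the cached keys and values themselves take memory that grows with context length.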

3) Attention weights are not perfect explanations

The scorecard shows information flow hints. That is valuable. But it is not the full causal story of a prediction. Attention may point somewhere important. Or somewhere only loosely correlated. So you can inspect attention. You should not worship it.

4) Long-context generalization is messy

Models often advertise very long windows. Real quality across that whole window is uneven. Lost-in-the-middle behavior still appears. RoPE scaling tricks help. ALiBi helps in some settings. None is a perfect fix. Reliable retrieval over very long context is still messy.
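
One common way to check a window rather than trust the advertised number is a needle-at-depth probe: bury one known fact at different positions in filler text and ask for it back. The sketch below only builds the prompts. The filler, the needle sentence, and the helper name `build_probe` are all invented for illustration, and the call to the model under test is deliberately left out.

```python
def build_probe(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert needle at a relative depth: 0.0 is the start, 1.0 is the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    parts = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(parts)


# Invented filler and an invented fact to retrieve.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code is 7341."

depths = (0.0, 0.25, 0.5, 0.75, 1.0)
probes = {depth: build_probe(filler, needle, depth) for depth in depths}

for depth, prompt in probes.items():
    position = prompt.index(needle) / len(prompt)
    print(f"depth={depth:.2f} needle starts at {position:.0%} of the prompt")
```

Each probe would then be sent to the model with a question about the access code, and accuracy plotted against depth. A dip at the middle depths is the lost-in-the-middle pattern.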

5) Multilingual fairness is hard

Some languages consume more tokens for the same meaning. That means higher cost for users. It can also mean less effective context budget. So fairness is not just about model quality. It is also about token budget shape. That is a systems issue and a product issue.

Why these cracks matter in real work

A Lead AI Engineer gets asked practical questions. Why is this language more expensive to serve? Why did quality drop in a 100k-token prompt? Why does the attention heatmap look sensible, but the answer is still wrong? Why did the tokenizer split this product name into nonsense pieces? These are not classroom edge cases. They hit product, infra, and trust.

Where this lives in the wild

  • GitHub Copilot sees tokenization pain when code, comments, and strange identifiers mix in one file.
  • OpenAI ChatGPT users feel long-context cost because bigger prompts mean slower and pricier attention.
  • Anthropic Claude markets long context, yet real retrieval quality across huge documents still needs careful evaluation.
  • Google Gemini multilingual use makes token fairness and context budgeting product-level concerns.
  • Meta Llama open-weight experiments expose how position extension and long-context tricks help, but do not fully solve reliability.

Interview Q&A

Q: Is tokenization solved now that subword methods work well? A: No. They are practical compromises, not universal language-neutral solutions. Common wrong answer to avoid: "BPE solved tokenization for every language." It solved enough for deployment, not everything.

Q: Why is long context still expensive even with modern kernels? A: Because pairwise attention still scales quadratically with sequence length. Common wrong answer to avoid: "Once you use FlashAttention, context cost is basically linear." FlashAttention avoids materializing the full L x L score matrix, so memory traffic drops. The amount of compute is still quadratic in L.

Q: Can I use attention maps as explanations? A: Use them as evidence of information flow, not as the full causal proof.

Q: Why is multilingual fairness tied to tokenization? A: Because the same meaning may consume different token budgets across languages.

Apply now (5 min)

Take one prompt of 100 tokens. Now imagine the same meaning taking 160 tokens in another language. Write three consequences for cost, latency, and context budget. Then double a prompt length from 512 to 1,024. Say how the raw attention score grid changes. Sketch from memory: Draw a clean map of the module. Then mark five cracks on it.

The end of this module

You now have the usable picture. The splitter turns text into pieces. The badge board turns IDs into vectors. The seat number restores order. The spotlight beam lets tokens consult each other. The scorecard decides who matters how much. Causal masking keeps decoders honest. Multiple heads divide the routing work. The full pipeline turns raw text into contextual vectors. And this final file reminds you where the picture is still incomplete. That is enough to walk into interviews with honesty. That is enough to debug many real systems. That is enough for this module.


Bridge. This module ends here. Next, move into transformer blocks, residual paths, and system structure in ../03_transformer_mechanics/00-eli5.md.