06. Code completion models: filling the missing code without wasting tokens
~13 min read. Code models look magical until you notice how much their success depends on context shape and cursor position.
Built on the ELI5 in 00-eli5.md. The translator is now speaking to a compiler-style market vendor. The phrasebook becomes nearby code, types, APIs, and the hole you want to fill.
Code completion is structured prediction, not free essay writing
Normal prose has many acceptable next sentences. Code has far fewer acceptable next tokens. One wrong bracket breaks parsing. One wrong variable name breaks execution. One wrong type assumption breaks the whole function. So code completion is a stricter market.
That strictness is helpful. Tests, parsers, and type checkers give sharp feedback. It is also challenging. The model must track indentation, scope, imports, naming conventions, and local intent. That means context quality matters a lot more than in casual prose chat.
Look at the mental picture. You are not asking for a whole essay. You are pointing at a hole in a machine. The translator must read the surrounding bolts first. Then it can suggest the missing part. Simple, no?
This is why code completion models often outperform general chat models inside editors. They are trained to predict code around structured gaps. They learn naming reuse. They learn indentation habits. They learn common API trajectories.
Prefix-only completion versus fill-in-the-middle
Classic completion sees only the prefix. Cursor is at the end. The model predicts what comes next. That works for many editor cases. It fails when the best completion depends on code after the hole.
Fill-in-the-middle fixes that.
Now the model sees both prefix and suffix.
It knows what must come before the next line and what must still fit after it.
This is huge for refactors and partial edits.
If the suffix already returns result, the missing span probably defines result.
If the suffix closes a try, the missing span probably starts risky work.
See the extra structure.
┌──────────────┐   hole   ┌──────────────┐
│ prefix code  │   ───→   │ suffix code  │
└──────┬───────┘          └──────┬───────┘
       │                         │
       └──────────┬──────────────┘
                  ▼
          translator infill
This also changes token efficiency. You need not resend an entire file if the useful clue is local. You send the sharp prefix, the sharp suffix, and maybe a few retrieved helpers. That is a better phrasebook than one giant dump.
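The prefix-hole-suffix picture can be sketched as a small prompt builder. This is a minimal illustration, not any particular model's format: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are hypothetical placeholders (real fill-in-the-middle models define their own special tokens), and the character limits stand in for a real token budget.

```python
# Hypothetical sentinel strings; each real model defines its own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_fim_prompt(source: str, hole_start: int, hole_end: int,
                     max_prefix_chars: int = 400,
                     max_suffix_chars: int = 300) -> str:
    """Assemble a prefix/suffix prompt around source[hole_start:hole_end]."""
    # Keep only the text nearest the hole on each side, not the whole file.
    prefix = source[max(0, hole_start - max_prefix_chars):hole_start]
    suffix = source[hole_end:hole_end + max_suffix_chars]
    # The model is expected to generate the missing span after MID.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


src = "def f():\n    HOLE\n    return result\n"
start = src.index("HOLE")
print(build_fim_prompt(src, start, start + len("HOLE")))
```

Trimming from the side nearest the hole is the design choice that matters here: the lines adjacent to the cursor usually carry the sharpest clues.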
Worked example: infilling a helper function
Suppose the editor shows this Python snippet.
def normalize_email(raw: str) -> str:
    # <hole: the missing span goes here>

def create_user(email: str) -> dict:
    clean = normalize_email(email)
    return {"email": clean}
What should go in the hole?
A good infill reads both sides.
The prefix says the function takes raw: str and returns str.
The suffix says create_user expects a cleaned email string.
So the missing span should strip spaces and lowercase.
One reasonable fill is:
    return raw.strip().lower()
Now test with numbers and concrete strings.
Input "  A@Example.COM  " (two spaces on each side) becomes "a@example.com".
Length before strip is 17 characters: 13 for the address plus 4 spaces.
After stripping spaces, length becomes 13.
Lowercasing changes letters, not length.
The returned value is stable and fits the suffix use.
A prefix-only model might still guess this. A fill-in-the-middle model gets extra confidence because it sees how the helper is consumed. That suffix is part of the phrasebook. Yes?
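The numbers in the worked example can be checked with a short script. It is a sketch that assumes the infill is `raw.strip().lower()` and that the input carries two spaces on each side, which is what a length of 17 implies.

```python
def normalize_email(raw: str) -> str:
    # Candidate infill: trim surrounding whitespace, then lowercase.
    return raw.strip().lower()


def create_user(email: str) -> dict:
    clean = normalize_email(email)
    return {"email": clean}


raw = "  A@Example.COM  "  # assumed: two spaces on each side
print(len(raw))            # 17
print(len(raw.strip()))    # 13
print(create_user(raw))    # {'email': 'a@example.com'}
```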
Context windows and token efficiency decide quality more than people admit
Bigger context windows help, but only to a point. If you stuff 50 unrelated files into the prompt, the model has more tokens and less focus. If you send the two caller functions, the type definitions, and one test, the model often does better. Relevant context density matters. Not raw size alone.
Do the arithmetic. Suppose a full file is 1,200 tokens. The local prefix is 120 tokens. The local suffix is 80 tokens. One retrieved type definition is 60 tokens. Total focused context is 260 tokens. That is far cheaper than 1,200 and often more useful. You saved 940 tokens while improving locality. Simple math. Good engineering.
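The same arithmetic as a few lines of Python, using only the token counts stated above. The dictionary keys are illustrative labels.

```python
full_file_tokens = 1200
focused = {"prefix": 120, "suffix": 80, "type_definition": 60}

focused_total = sum(focused.values())
saved = full_file_tokens - focused_total

print(focused_total)                              # 260
print(saved)                                      # 940
print(f"{focused_total / full_file_tokens:.0%}")  # 22%
```

The focused context is roughly a fifth of the full file.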
This is why editor systems rank context aggressively. Current file. Nearby lines. Open tabs. Definitions of referenced symbols. Recent edits. The translator writes better code when the phrasebook is curated around the cursor.
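One way to picture that ranking is a greedy selector under a token budget. This is only a sketch: the `Snippet` type, the source labels, and the priority order are all assumptions made for illustration, and real editors use their own (usually more elaborate) scoring.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # e.g. "nearby_lines", "open_tab" (hypothetical labels)
    text: str
    tokens: int  # assumed to be precomputed by the editor's tokenizer


# Assumed priorities: a lower number means closer to the cursor's intent.
PRIORITY = {
    "nearby_lines": 0,
    "current_file": 1,
    "referenced_definition": 2,
    "recent_edit": 3,
    "open_tab": 4,
}


def select_context(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep the highest-priority snippets that fit in the budget."""
    chosen, used = [], 0
    ordered = sorted(snippets, key=lambda s: (PRIORITY.get(s.source, 99), s.tokens))
    for snippet in ordered:
        if used + snippet.tokens <= budget:
            chosen.append(snippet)
            used += snippet.tokens
    return chosen


candidates = [
    Snippet("open_tab", "...unrelated module...", 900),
    Snippet("nearby_lines", "...prefix and suffix...", 200),
    Snippet("referenced_definition", "...type definition...", 60),
]
print([s.source for s in select_context(candidates, budget=300)])
# ['nearby_lines', 'referenced_definition']
```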
Completion quality depends on validation, not just prediction
A model can suggest beautiful-looking code that is subtly wrong. Wrong import. Wrong async usage. Wrong null handling. Wrong library version. So even completion features benefit from lightweight validation. Run a parser. Run a type checker. Run quick tests if available.
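The cheapest of those checks, running a parser, can be sketched with Python's standard `ast` module. The function name `parses_after_infill` is made up for this example. A parse check catches syntax errors only, so a wrong import or a type mismatch would still need a type checker or tests.

```python
import ast


def parses_after_infill(prefix: str, completion: str, suffix: str) -> bool:
    """Return True if prefix + completion + suffix is syntactically valid Python."""
    try:
        ast.parse(prefix + completion + suffix)
        return True
    except SyntaxError:
        return False


prefix = "def normalize_email(raw: str) -> str:\n"
suffix = (
    "\ndef create_user(email: str) -> dict:\n"
    "    clean = normalize_email(email)\n"
    "    return {\"email\": clean}\n"
)

print(parses_after_infill(prefix, "    return raw.strip().lower()\n", suffix))  # True
print(parses_after_infill(prefix, "    return raw.strip(.lower()\n", suffix))   # False
```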
One more subtle point. Code completion should usually optimize for token efficiency and latency, not maximum novelty. The best completion is often the boring local one. Reuse existing helper names. Reuse nearby style. Reuse current abstractions. That is what users want in editors. Not a brand-new architecture every time.
Look. Completion is the smallest unit of code generation. Once you understand holes, suffixes, and token-efficient phrasebooks, the jump to full-file generation becomes much easier.
Where this lives in the wild
- GitHub Copilot — backend engineer: predicts the next lines or infills a helper while reusing nearby symbols and imports.
- Cursor Tab — full-stack engineer: uses local file context and recent edits to make low-latency completion feel accurate.
- JetBrains AI Assistant — Kotlin developer: leverages IDE type information so suggested code fits the current method signature.
- Amazon Q Developer — Java maintainer: fills in methods while respecting enterprise library usage and current project patterns.
- Replit Agent — solo builder: mixes prefix, suffix, and live execution feedback to complete code inside browser IDE flows.
Pause and recall
- Why is code completion a stricter prediction problem than prose continuation?
- What extra information does fill-in-the-middle use that prefix-only completion misses?
- In the worked example, why did the suffix help justify strip().lower()?
- Why is relevant context density more important than blindly sending huge files?
Interview Q&A
Q: Why prefer fill-in-the-middle over prefix-only completion for many editing tasks?
A: Because the suffix constrains what the missing span must define, return, or preserve, which sharply reduces ambiguity.
Common wrong answer to avoid: "Because suffix tokens are easier for the model than prefix tokens."
Q: Why can smaller context outperform a giant context window in code completion?
A: Focused local context preserves the symbols, types, and nearby intent that actually govern the hole, while irrelevant files dilute attention.
Common wrong answer to avoid: "More tokens always mean more accuracy."
Q: Why is completion validation still useful when the suggestion is only a few lines long?
A: Tiny suggestions can still introduce parse errors, type mismatches, or semantic bugs that become expensive once accepted into the codebase.
Common wrong answer to avoid: "Validation matters only for full generated files."
Q: Why do editor users usually prefer boring completions over clever ones?
A: Because local consistency, low latency, and reuse of existing abstractions beat novelty in day-to-day engineering workflows.
Common wrong answer to avoid: "The best completion is the most sophisticated code the model can invent."
Apply now (5 min)
Exercise. Take one small helper from your codebase and blank out the middle line. Write the prefix and suffix on paper, then propose one fill that keeps names, types, and local intent consistent. Also do the token math: 120 + 80 + 60 = 260 focused tokens.
Sketch from memory. Draw the prefix box, the suffix box, and the hole in the middle. Under it, write parser and tests as the final check. See. That is code completion in one picture.
Bridge. A few missing lines are the easy case. Next we move to whole tasks where the translator must generate, validate, and test larger code changes. → 07-code-generation-pipeline.md