11. Multi-file code understanding — when the phrasebook is a whole repository¶

~14 min read. Serious code tasks rarely live in one file. The translator must gather repository context without drowning in it.

Built on the ELI5 in 00-eli5.md. The phrasebook is now spread across many stalls in the market. The translator must fetch the right repository pages before speaking to the compiler market vendor.

One-file intelligence is not enough¶

A bug may appear in one endpoint and originate in another file. A generated function may compile locally and still break a shared interface. A review comment may sound reasonable until you inspect the caller chain. This is why multi-file understanding matters.

Repository code is a graph. Imports connect modules. Calls connect functions. Configs influence runtime behavior. Tests encode assumptions. Docs name the intended contracts. No single file is the whole truth. Simple, no?

route.py ──→ service.py ──→ repo.py
    │             │             │
    └──→ config.py│             │
                  └──→ tests.py ◀┘
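
The diagram above can be written down as a plain adjacency map. This is only a minimal sketch: the edge list mirrors the arrows as drawn, and the helper name `dependents_of` is made up for illustration.

```python
# Edges mirror the diagram: each file maps to the files it points at.
REPO_GRAPH = {
    "route.py": ["service.py", "config.py"],
    "service.py": ["repo.py", "tests.py"],
    "repo.py": ["tests.py"],
    "config.py": [],
    "tests.py": [],
}


def dependents_of(graph, target):
    """Return the files that have an edge pointing at `target`, sorted by name."""
    return sorted(src for src, dsts in graph.items() if target in dsts)


print(dependents_of(REPO_GRAPH, "tests.py"))    # ['repo.py', 'service.py']
print(dependents_of(REPO_GRAPH, "service.py"))  # ['route.py']
```

Asking "who points at this file?" is the reverse lookup a reviewer needs before changing a shared module.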

The challenge is not only finding more context. It is finding the right context. Too little and you miss the dependency. Too much and the signal disappears. That is classic phrasebook retrieval.

Dependency graphs and codebase RAG help focus¶

A strong system starts from the task anchor. Maybe a function name. Maybe a stack trace. Maybe a changed file. Then it expands outward along useful edges. Imports. Call sites. Type definitions. Tests. Configuration values. This is repository retrieval.
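
Import edges are among the cheapest edges to recover. The sketch below uses Python's standard `ast` module to list the modules a source file imports. The contents of `ROUTE_SOURCE` and the function name `imported_modules` are assumptions made up for illustration, not part of any particular tool.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect module names imported by a piece of Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from . import x` has no module name, so it is skipped here.
            modules.add(node.module)
    return modules


# Hypothetical contents of route.py, used only for illustration.
ROUTE_SOURCE = """
import config
from service import fetch_user

def handler(user_id):
    return fetch_user(user_id, timeout=config.TIMEOUT)
"""

print(sorted(imported_modules(ROUTE_SOURCE)))  # ['config', 'service']
```

Starting from the anchor file, each returned name is a candidate next stop along the graph.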

You can think of it as codebase RAG. Instead of retrieving text chunks by meaning alone, you retrieve code artifacts by both meaning and graph structure. Call graph edges matter. Import paths matter. Symbol definitions matter. Recent edits may matter too. Look. The phrasebook for code is part semantic, part structural.
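
One way to picture "part semantic, part structural" is a blended score. The sketch below is an assumption, not a description of how any specific product ranks files: the 50/50 weighting, the `1 / (1 + hops)` closeness formula, and the candidate numbers are all invented for illustration.

```python
def hybrid_score(semantic, hops, structural_weight=0.5):
    """Blend a semantic similarity in [0, 1] with graph closeness.

    `hops` is the graph distance from the task anchor; closer files score higher.
    """
    closeness = 1.0 / (1 + hops)
    return (1 - structural_weight) * semantic + structural_weight * closeness


# Hypothetical candidates: (file, semantic similarity, hops from the anchor).
candidates = [
    ("client.py", 0.40, 1),
    ("README.md", 0.65, 4),
    ("service.py", 0.60, 0),
]

ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])
# ['service.py', 'client.py', 'README.md']
```

By semantic similarity alone, README.md would rank first. Once graph distance counts, client.py, which sits one hop from the anchor, moves ahead of it.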

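A helpful rule is this. First retrieve the directly connected files. Then retrieve one hop of validating context, such as the tests or type definitions that check those files. Stop unless the receipts (test results, logs, review feedback) show you need more. That prevents giant context dumps. Yes?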
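
The rule above amounts to a breadth-first expansion with a hop limit. Here is a minimal sketch under assumed inputs: the `GRAPH` edges are hypothetical, and a real system would derive them from import or call analysis.

```python
from collections import deque


def expand(graph, anchor, max_hops):
    """Breadth-first expansion from `anchor`, stopping after `max_hops` edges."""
    seen = {anchor: 0}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not look past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


# Hypothetical edges, loosely following the route/service/repo picture.
GRAPH = {
    "route.py": ["service.py", "config.py"],
    "service.py": ["repo.py", "tests.py"],
    "repo.py": ["db.py"],
}

print(expand(GRAPH, "route.py", max_hops=1))
# {'route.py': 0, 'service.py': 1, 'config.py': 1}
print(sorted(expand(GRAPH, "route.py", max_hops=2)))
# ['config.py', 'repo.py', 'route.py', 'service.py', 'tests.py']
```

With a limit of two hops, `db.py` never enters the context. Raising the limit is a deliberate decision, not a default.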

Worked numerical example: timeout bug across files¶

Suppose a web route calls a service, and the service calls an HTTP client. The route passes timeout=30. The client expects milliseconds, not seconds. Now the request times out almost instantly.

Trace the files. route.py sends timeout=30. service.py forwards the same value. client.py interprets timeout as milliseconds. So the actual timeout becomes 30 ms. Expected behavior was 30 seconds. That is 30,000 ms. The bug is a factor of 1,000.
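
The trace can be reproduced in a few lines. The three functions below are hypothetical stand-ins for the three files. The names, the URL, and the returned dictionary are invented for illustration.

```python
# Hypothetical stand-ins for client.py, service.py, and route.py.

def client_get(url, timeout):
    """client.py: `timeout` is interpreted as milliseconds."""
    return {"url": url, "effective_timeout_ms": timeout}


def service_fetch(url, timeout):
    """service.py: forwards the value untouched."""
    return client_get(url, timeout=timeout)


def route_handler():
    """route.py: the author meant 30 seconds."""
    return service_fetch("https://example.com/users", timeout=30)


result = route_handler()
intended_ms = 30 * 1000
print(result["effective_timeout_ms"])                 # 30
print(intended_ms // result["effective_timeout_ms"])  # 1000
```

Each function looks reasonable when read alone. The mismatch only shows up when the route's intent is compared with the client's docstring.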

Without multi-file context, the translator may stare at route.py and see nothing wrong. With the graph, it sees the unit mismatch. 30 looked fine locally. It becomes obviously wrong when matched against the client contract. That is the whole point.

Now do the arithmetic explicitly. Expected timeout = 30 seconds. Convert to milliseconds: 30 × 1,000 = 30,000 ms. The value actually passed is 30, which the client reads as 30 ms. So the request gets only one-thousandth of the intended wait time. See the power of cross-file reasoning.
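
One possible way to keep this conversion from being skipped is to carry the unit in the type. The sketch below assumes the caller may pass a `datetime.timedelta`. The function `to_client_ms` is hypothetical and is shown as one design option, not as the fix any particular codebase uses.

```python
from datetime import timedelta


def to_client_ms(timeout: timedelta) -> int:
    """Convert a duration to the whole milliseconds the client is assumed to expect."""
    return int(timeout / timedelta(milliseconds=1))


intended = timedelta(seconds=30)
print(to_client_ms(intended))        # 30000
print(to_client_ms(intended) // 30)  # 1000
```

Passing a bare `30` now has to go through a `timedelta`, so the caller must state the unit at the boundary.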

Retrieval quality matters more than raw repository size¶

Large repositories scare people. They should not. The real problem is not size itself. It is retrieval quality. If a 10,000-file repo yields the 4 relevant files, the task is manageable. If a 200-file repo yields 80 noisy files, the task is hard. Simple, no?

Work out the token picture. Imagine 30 candidate files at 400 tokens each. That is 30 × 400 = 12,000 tokens. Now graph ranking narrows it to 4 files. That is 4 × 400 = 1,600 tokens. You saved 10,400 tokens and raised context density. Better focus. Lower cost. Usually better answers.
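
The same arithmetic as a tiny script, using only the figures from the paragraph above. The helper name `context_tokens` is made up for illustration.

```python
TOKENS_PER_FILE = 400  # figure used in the text


def context_tokens(file_count, tokens_per_file=TOKENS_PER_FILE):
    """Total tokens when every file costs the same number of tokens."""
    return file_count * tokens_per_file


before = context_tokens(30)
after = context_tokens(4)
print(before, after, before - after)  # 12000 1600 10400
print(f"{after / before:.1%} of the original context")  # 13.3% of the original context
```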

This is why tooling matters. Symbol search. Definition jumps. Call graphs. Import graphs. Repo embeddings. File summaries. These are not luxuries. They are the retrieval stack that makes repository-scale translation practical.
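
As a rough illustration of what a symbol index does, the sketch below maps function and class names to the place they are defined, again with the standard `ast` module. The file contents in `FILES` are hypothetical. Production tools resolve scopes, methods, and other languages, which this sketch does not attempt.

```python
import ast


def index_definitions(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each function or class name to the (file, line) pairs where it is defined."""
    index: dict[str, list[tuple[str, int]]] = {}
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index.setdefault(node.name, []).append((path, node.lineno))
    return index


# Hypothetical file contents, kept tiny for illustration.
FILES = {
    "service.py": "def fetch(url, timeout):\n    return get(url, timeout)\n",
    "client.py": "class Client:\n    pass\n\ndef get(url, timeout):\n    return url\n",
}

index = index_definitions(FILES)
print(index["get"])    # [('client.py', 4)]
print(index["fetch"])  # [('service.py', 1)]
```

A "definition jump" from the call `get(url, timeout)` in service.py is then a dictionary lookup that lands in client.py.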

Good repository understanding stays grounded in receipts¶

Multi-file retrieval is only the first half. You still need execution or review receipts to validate the hypothesis. Maybe the timeout mismatch is real. Maybe the route already multiplies by 1000 in a hidden helper. Maybe tests cover the contract. So what should you do? Retrieve, hypothesize, verify. Do not stop at a plausible graph story.

A mature loop looks like this. Start from the failing symptom. Fetch nearby graph context. Draft a hypothesis. Run tests or inspect runtime logs. Patch and verify. That keeps repository reasoning tethered to evidence. The translator reads the phrasebook, but the vendor still decides. Look. The same market rule keeps returning.
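
The "run tests" step can be as small as a fake client that records what it receives. Everything in this sketch is hypothetical: the fake client, both service variants, and the check itself. It assumes the timeout hypothesis from the worked example.

```python
recorded = []  # plays the role of the execution receipt


def fake_client_get(url, timeout_ms):
    """Records the value the client layer actually receives."""
    recorded.append(timeout_ms)
    return "ok"


def service_fetch(url, timeout):
    """Current behavior: forwards the value unchanged."""
    return fake_client_get(url, timeout)


def patched_service_fetch(url, timeout):
    """Candidate patch: converts seconds to milliseconds at the boundary."""
    return fake_client_get(url, timeout * 1000)


def mismatch_confirmed(fetch, intended_seconds):
    """Run `fetch` and compare what the client received with the intended wait."""
    recorded.clear()
    fetch("https://example.com/users", intended_seconds)
    return recorded[-1] != intended_seconds * 1000


print(mismatch_confirmed(service_fetch, 30))          # True
print(mismatch_confirmed(patched_service_fetch, 30))  # False
```

The first result confirms the hypothesis. The second shows that the candidate patch removes the mismatch. Both conclusions come from recorded values rather than from reading the graph.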

Where this lives in the wild¶

Sourcegraph Cody — staff engineer: retrieves symbol definitions, call sites, and documentation across large repositories before suggesting edits.
Cursor codebase chat — full-stack engineer: answers repo questions by combining semantic retrieval with nearby file structure.
GitHub Copilot Chat in VS Code — API maintainer: pulls related files and tests when asked to explain or modify a feature end to end.
Amazon Q Developer — enterprise developer: benefits from project-wide context retrieval when tracing service contracts across modules.
JetBrains AI Assistant — platform engineer: uses IDE symbol knowledge so edits respect interfaces defined far from the current cursor.

Pause and recall¶

Why is repository understanding fundamentally a retrieval problem as much as a modeling problem?
In the timeout example, where did the factor-of-1000 bug come from?
Why can four highly relevant files beat thirty loosely relevant files?
What should happen after the model forms a multi-file hypothesis?

Interview Q&A¶

Q: Why is codebase RAG better than naively pasting many files into one prompt?
A: Because retrieval guided by graph structure and relevance preserves the few files that govern the task instead of flooding the model with weak context.
Common wrong answer to avoid: "Because large prompts are impossible for current models."

Q: Why can a bug look correct in one file and still be wrong at repository level?
A: Local code may satisfy its own syntax and style while violating a contract, unit, or invariant defined elsewhere in the dependency graph.
Common wrong answer to avoid: "If a file looks fine in isolation, the bug must be in runtime infrastructure."

Q: Why are call graphs and symbol lookups so important for AI coding systems?
A: They provide structural retrieval signals that semantic similarity alone often misses, especially for typed APIs and indirect dependencies.
Common wrong answer to avoid: "Embeddings are enough for repository understanding."

Q: Why should repository reasoning still be validated with tests or runtime evidence?
A: Because a plausible cross-file narrative can still be false unless execution receipts confirm that the inferred dependency actually drives behavior.
Common wrong answer to avoid: "Once the graph explains the symptom, validation is optional."

Apply now (5 min)¶

Exercise. Pick one bug you know that crossed file boundaries. List the entry file, one dependent file, one config or type file, and one test file. Then compute one concrete unit conversion or invariant, like 30 × 1000 = 30000.

Sketch from memory. Draw a four-node dependency graph with arrows between route, service, client, and test. Write one sentence under it: retrieve the graph before editing the node. See. That is repository understanding.

Bridge. Once systems can answer SQL questions and edit multi-file code, we need a way to measure whether they are actually good. Next we study benchmarks and evaluation. → 12-evaluation-benchmarks.md