07. Code generation pipeline: prompt, generate, validate, test, repeat

~14 min read. Whole-file generation is just text completion until you wrap it in checks, ranking, and repair loops.

Built on the ELI5 in 00-eli5.md. The translator now drafts a larger bargain for the compiler market vendor. The receipt is no longer a row set. It is parse output, test results, and logs.

Whole-task generation needs a control loop

People sometimes imagine code generation as one prompt and one answer. That is demo thinking. Real engineering tasks have constraints. Existing interfaces must stay stable. Tests must pass. Performance may matter. Security may matter. Style may matter a little. Correct behavior matters most.

So what to do? Turn generation into a pipeline. Give the model the task, context, and constraints. Generate one or more candidates. Run validators. Run tests. Rank outcomes. Retry with feedback when needed. This is disciplined haggling with the compiler and test runner.

shopping list ──→ translator ──→ candidate code
                                     │
                                     ├──→ parser / type check
                                     ├──→ tests
                                     └──→ ranking / retry

The key idea is separation. Natural language defines intent. Code defines one attempt. Execution receipts decide which attempt survives. Simple, no?
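The loop above can be sketched in a few lines. This is a minimal sketch, not a reference implementation: `generate` stands in for a model call and `check` stands in for the parser, type checker, and tests combined. Both names, their signatures, and the stub functions are assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical signatures, not a real library API.
Generate = Callable[[str, str], str]   # (task, feedback) -> candidate code
Check = Callable[[str], List[str]]     # candidate code -> list of problems


def generation_loop(task: str, generate: Generate, check: Check,
                    max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate with no reported problems, else None."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        problems = check(candidate)
        if not problems:
            return candidate
        # The receipt from this round becomes context for the next attempt.
        feedback = "\n".join(problems)
    return None


# Stub translator for illustration: it only gets it right after feedback.
def stub_generate(task: str, feedback: str) -> str:
    return "x = 2" if feedback else "x = 1"


# Stub vendor for illustration: runs the candidate and inspects the result.
def stub_check(code: str) -> List[str]:
    scope: dict = {}
    exec(code, scope)
    return [] if scope["x"] == 2 else ["expected x == 2, got %r" % scope["x"]]


result = generation_loop("set x to 2", stub_generate, stub_check)
```

The first round fails, the failure text is fed back, and the second round survives. Intent, attempt, and receipt stay in separate variables.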

Good prompts reduce waste before testing starts

Strong prompts for code generation are specific. Function signature. Input assumptions. Expected edge cases. Allowed libraries. Output format. Known failure modes. Existing code style if important. This is the code-generation phrasebook.

Look at the difference. "Write a validator" is vague. "Write normalize_phone(raw: str) -> str that strips spaces, keeps a leading plus, rejects letters, and returns digits-only local numbers" is much stronger. The second prompt gives the translator fewer places to drift.
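One way to keep prompts this specific is to fill in a small structure before rendering the text. The sketch below is an assumption, not a standard format: the `CodePrompt` class, its field names, and the default strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodePrompt:
    """Hypothetical container for the phrasebook items of one task."""
    signature: str
    behavior: List[str]
    allowed_libraries: List[str] = field(default_factory=list)
    output_format: str = "Return only the function definition."

    def render(self) -> str:
        lines = [f"Write `{self.signature}`.", "Required behavior:"]
        lines += [f"- {rule}" for rule in self.behavior]
        # Assumed default when no libraries are listed.
        libs = ", ".join(self.allowed_libraries) or "standard library only"
        lines.append(f"Allowed libraries: {libs}")
        lines.append(self.output_format)
        return "\n".join(lines)


prompt = CodePrompt(
    signature="normalize_phone(raw: str) -> str",
    behavior=[
        "strip spaces",
        "keep a leading plus",
        "reject letters",
        "return digits-only local numbers",
    ],
)
print(prompt.render())
```

An empty field is now visible before any model call. That is cheaper than discovering the gap in a failed test.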

But even a great prompt does not remove validation. Why? Because many wrong programs still satisfy the wording loosely. One may pass happy paths and fail edge cases. One may use the wrong library. One may mutate global state. This is why execution must arbitrate. Yes?

Worked numerical example: tests rescue a buggy draft

Suppose we want this function: discounted_total(price, discount_pct). Its behavior should be price * (1 - discount_pct / 100).

Take a concrete case. Price is 200. Discount is 15. Correct multiplier is 1 - 15/100 = 0.85. So total should be 200 * 0.85 = 170.

The translator's first draft is:

def discounted_total(price, discount_pct):
    return price - discount_pct

Run the test. discounted_total(200, 15) returns 185. Expected is 170. Difference is 15. Clear failure.

Now add a second test. discounted_total(80, 25) should be 80 * 0.75 = 60. The buggy draft returns 55. Now the pattern is obvious. The code is subtracting points, not percent of price.

The repair prompt can now say: "The function subtracts the discount value directly. It should subtract the percentage of the price." That feedback is much sharper than the original natural-language task. The new draft becomes:

def discounted_total(price, discount_pct):
    return price * (1 - discount_pct / 100)

Now both tests pass. The receipt changed the conversation. See how execution teaches the translator.
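The same example can be run as a tiny harness that turns failing cases into repair messages. This is a sketch under assumptions: the names `buggy_draft`, `fixed_draft`, `CASES`, and `receipts` are invented here, and the float tolerance is an arbitrary choice.

```python
def buggy_draft(price, discount_pct):
    return price - discount_pct                # first draft from the text


def fixed_draft(price, discount_pct):
    return price * (1 - discount_pct / 100)    # repaired draft from the text


# The two test cases from the worked example: (arguments, expected total).
CASES = [((200, 15), 170), ((80, 25), 60)]


def receipts(func):
    """Return one failure message per failing case (empty list means pass)."""
    failures = []
    for args, expected in CASES:
        got = func(*args)
        if abs(got - expected) > 1e-9:         # tolerance for float rounding
            failures.append(
                f"discounted_total{args} returned {got}, expected {expected}"
            )
    return failures


for line in receipts(buggy_draft):
    print(line)
```

The printed lines are concrete enough to paste into a repair prompt. For `fixed_draft` the list is empty, which is the passing receipt.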

Validation layers should be cheap-to-expensive

Parser and linter checks are cheap. Type checks are often cheap. Unit tests are a bit heavier. Integration tests are heavier still. Sandboxed runtime tests may be heaviest. So order them sensibly. Do not spend minutes on integration if the file does not parse.
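Ordering can be expressed as a list of layers that stops at the first failure. The sketch below assumes only two layers, a parse check and one unit test. The names `parse_check`, `unit_check`, `LAYERS`, and `validate` are hypothetical, and the candidate is assumed to define `discounted_total`.

```python
import ast
from typing import Callable, List, Optional, Tuple


def parse_check(source: str) -> Optional[str]:
    """Cheapest layer: does the candidate parse as Python at all?"""
    try:
        ast.parse(source)
        return None
    except SyntaxError as err:
        return f"syntax error: {err.msg}"


def unit_check(source: str) -> Optional[str]:
    """Heavier layer: execute the candidate and run one unit test."""
    scope: dict = {}
    exec(source, scope)  # run untrusted code only inside a sandbox
    got = scope["discounted_total"](200, 15)
    return None if abs(got - 170) < 1e-9 else f"expected 170, got {got}"


# Layers listed from cheap to expensive.
LAYERS: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("parse", parse_check),
    ("unit", unit_check),
]


def validate(source: str) -> Tuple[str, Optional[str]]:
    """Return (layer, error) for the first failing layer, else ("ok", None)."""
    for name, check in LAYERS:
        error = check(source)
        if error is not None:
            return name, error   # later, costlier layers never run
    return "ok", None
```

A file that does not parse never reaches `unit_check`. Integration or sandbox layers would be appended to the end of `LAYERS` in the same way.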

A common production pattern is candidate ranking. Sample three or five solutions. Discard broken ones quickly. Keep the candidates that parse and pass the most tests. Then ask a model or heuristic ranker to pick the most idiomatic passing option. This often beats trusting one sample.
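A heuristic version of this ranking can be sketched by counting passed cases per candidate. Everything here is illustrative: the candidate names, the third test case, and the rule that a crash counts as a failed case are assumptions, and the final "most idiomatic" pick is left out.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[tuple, float]   # (arguments, expected result)


def score(func: Callable, cases: List[Case]) -> int:
    """Count how many cases the candidate passes."""
    passed = 0
    for args, expected in cases:
        try:
            if abs(func(*args) - expected) < 1e-9:
                passed += 1
        except Exception:
            pass  # assumed policy: a crash counts as a failed case
    return passed


def rank(candidates: Dict[str, Callable], cases: List[Case]) -> List[str]:
    """Return candidate names sorted by cases passed, best first."""
    return sorted(candidates,
                  key=lambda name: score(candidates[name], cases),
                  reverse=True)


cases = [((200, 15), 170), ((80, 25), 60), ((100, 0), 100)]
candidates = {
    "subtracts_points": lambda p, d: p - d,
    "percent_of_price": lambda p, d: p * (1 - d / 100),
    "crashes": lambda p, d: p / 0,
}
ranking = rank(candidates, cases)
```

Note that `subtracts_points` still passes the zero-discount case. Weak tests let wrong drafts score points, which is why the case list matters as much as the ranker.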

Do the simple probability picture. If one sample passes at 30%, then three independent samples contain at least one passing candidate with probability 1 - 0.7^3, about 66%. You are not magically solving the task. You are widening search. That matters for code where many local optima exist. Look. Search plus validation is the whole trick.
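How much better the shot gets can be computed directly. The sketch assumes samples are fully independent and share one pass rate, which real samples from one model rarely do, so treat the numbers as an upper-end estimate. The function name is invented here.

```python
def chance_any_passes(p: float, k: int) -> float:
    """Probability that at least one of k independent samples passes,
    given each sample passes with probability p."""
    return 1 - (1 - p) ** k


for k in (1, 3, 5):
    print(k, round(chance_any_passes(0.30, k), 3))
```

The gain shrinks with each extra sample, so sampling cost and validation cost set a practical limit on k.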

Pipelines fail in recognizable ways

Sometimes the prompt is underspecified. Sometimes the tests are weak. Sometimes the model overfits visible tests and misses hidden ones. Sometimes generated code passes but is hard to maintain. Sometimes it edits the wrong file. These are all different failure classes.

So what to do? Log them separately. Prompt failure. Search failure. Validation gap. Context gap. That turns code generation from magic theatre into an engineerable system. It also tells you whether to improve prompts, retrieval, tests, or ranking.
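Separate logging can be as small as an enum and a counter. In this sketch the four class names come from the text, but the mapping from class to improvement area is an assumed pairing, and the event list is made-up data for illustration.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    PROMPT = "prompt failure"
    SEARCH = "search failure"
    VALIDATION_GAP = "validation gap"
    CONTEXT_GAP = "context gap"


# Assumed mapping from failure class to the part of the pipeline to improve.
NEXT_STEP = {
    FailureClass.PROMPT: "prompts",
    FailureClass.SEARCH: "ranking",
    FailureClass.VALIDATION_GAP: "tests",
    FailureClass.CONTEXT_GAP: "retrieval",
}


def summarize(events):
    """Count failures per class and name the area to improve first."""
    counts = Counter(events)
    worst, _ = counts.most_common(1)[0]
    return counts, NEXT_STEP[worst]


# Made-up run log, for illustration only.
events = [
    FailureClass.VALIDATION_GAP,
    FailureClass.PROMPT,
    FailureClass.VALIDATION_GAP,
]
counts, improve_first = summarize(events)
print(improve_first)
```

With counts per class, "the model is weak" stops being the default explanation. The log points at the component to fix.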

Look at the market analogy one more time. The shopping list gives the task. The phrasebook gives the APIs and files. The translator drafts a candidate. The compiler and test runner act like strict vendors. Their receipts decide the next haggling turn. Simple, no?

Where this lives in the wild

GitHub Copilot Chat — backend engineer: proposes whole functions, then the engineer validates them with local tests and review.
Cursor Composer — product engineer: generates multi-file changes but leans on terminal feedback and test results to repair failures.
Amazon Q Developer — Java maintainer: creates code transformation candidates that still need compilation and unit-test checks.
Replit Agent — startup founder: turns prompts into code, then immediately runs the app to collect runtime receipts.
Vercel v0 — frontend engineer: generates UI code fast, but preview renders and lint output still govern whether it is acceptable.

Pause and recall

Why is one-prompt code generation a weak production mental model?
In the worked example, what specifically revealed the percent-versus-points bug?
Why order validation from cheap checks to expensive checks?
Why does multi-sampling help even when each single sample is imperfect?

Interview Q&A

Q: Why wrap code generation in tests instead of relying on a carefully written prompt?
A: Because prompts define intent, but only execution receipts reveal whether the generated program actually satisfies that intent under concrete cases.
Common wrong answer to avoid: "A strong enough prompt removes the need for validation."

Q: Why can candidate ranking outperform trusting the first passing sample?
A: Different samples may all pass visible tests yet vary in generality, maintainability, and hidden-bug risk, so ranking adds another quality filter.
Common wrong answer to avoid: "Once something passes tests, all passing solutions are equivalent."

Q: Why separate prompt failure from validation-gap failure in logs?
A: Because the fix differs: one needs better task specification or context, while the other needs stronger tests or static analysis.
Common wrong answer to avoid: "A failed run just means the model is weak."

Q: Why is code generation really a search problem as much as a modeling problem?
A: Multiple candidate programs may satisfy pieces of the prompt, so sampling, filtering, and repairing are central to reaching a robust final solution.
Common wrong answer to avoid: "The best model should output the single correct program immediately."

Apply now (5 min)

Exercise. Pick one tiny utility from your world and write three concrete tests before writing any code. Then imagine a wrong draft and use the test outputs as the repair message, just as the worked example did: 200 with a 15% discount should be 170, not 185.

Sketch from memory. Draw the pipeline from prompt to candidate code to parser to tests to retry. Add one note saying receipts rank the candidates. See. That is the generation loop.

Bridge. Generating from a clear signature is still easier than generating from a loose spec. Next we study full program synthesis from requirements and tests. → 08-program-synthesis.md