08. Program synthesis — from vague spec to executable program¶

~14 min read. Program synthesis asks for more than completion. It asks the model to search the space of programs that satisfy a specification.

Built on the ELI5 in 00-eli5.md. The shopping list is now a behavioral spec, not a half-written function. The translator must infer a full program that satisfies the market vendor and its tests.

Program synthesis is a harder claim than code completion¶

Completion starts with code already on the screen. Synthesis may start with only a description. Sometimes you get a few examples. Sometimes you get tests. Sometimes you get a formal grammar. Sometimes you get only intent and constraints. This is a much larger search space.

See the difference. Completion asks, "What code fits here?" Synthesis asks, "What program satisfies this behavior?" That means multiple candidate algorithms may compete. A surface-level pattern match is often not enough. The model must search, test, and revise. Simple, no?

One useful mental picture is program space. Every possible program is a stall in the market. Most stalls sell junk. A few satisfy the visible examples. Even fewer satisfy the hidden ones. So synthesis needs strong receipts to guide search.

spec / tests
    │
    ▼
┌──────────────┐
│ translator   │  proposes program candidates
└──────┬───────┘
       ▼
┌──────────────┐
│ market vendor│  run tests / constraints
└──────┬───────┘
       ▼
   passing or failing receipt

Tests are executable specifications¶

Natural language specs are often ambiguous. "Clamp a value" sounds clear until you meet edge cases. Does clamp(10, 10, 20) return 10? Yes. What about if lo > hi? Should we swap them, raise, or assume the caller is wrong? The words alone may not decide.

Tests sharpen the phrasebook. They tell the model what behavior actually matters. Positive examples help. Negative examples help more. Boundary cases help most. That is why test-driven synthesis is powerful. The vendor's receipt becomes objective.

Constrained decoding is another tool. Maybe the output must obey JSON, a DSL grammar, or a typed AST. If you restrict the search space, syntax errors fall. Semantic errors remain, but at least the program is well formed. Look. Good constraints turn a giant search problem into a smaller one.

Worked numerical example: synthesizing `clamp`¶

Spec: write clamp(x, lo, hi). It should return lo if x < lo. It should return hi if x > hi. Otherwise return x.

Test with numbers. clamp(12, 0, 10) should be 10. clamp(-3, 0, 10) should be 0. clamp(7, 0, 10) should be 7. These three cases cover upper bound, lower bound, and inside range.

Candidate 1:

def clamp(x, lo, hi):
    return x

This passes the inside-range test. It fails the other two. For 12, it returns 12 instead of 10. For -3, it returns -3 instead of 0. So the receipt says the program ignores bounds.

Candidate 2:

def clamp(x, lo, hi):
    return max(x, lo)

Now clamp(-3, 0, 10) becomes 0. Good. But clamp(12, 0, 10) becomes 12. Still wrong. The upper bound is missing.

Candidate 3:

def clamp(x, lo, hi):
    return min(max(x, lo), hi)

Now test all three. max(12, 0) = 12, then min(12, 10) = 10. max(-3, 0) = 0, then min(0, 10) = 0. max(7, 0) = 7, then min(7, 10) = 7. All pass. See how receipts guide search.

Constrained decoding and search strategies¶

One approach is direct generation. Ask for one candidate. Run tests. Repair if needed. Another approach is sample-and-rank. Generate several programs. Keep the ones that parse. Run tests. Pick the strongest. A third approach is grammar-constrained decoding. Force the output to stay inside the allowed syntax from the start.

Which is best? Depends on the task. If syntax mistakes are common, constraints help a lot. If multiple algorithms may work, sampling helps. If feedback is fast, repair loops help. In practice, good systems mix all three. That is why program synthesis feels more like search than chat.

There is also the hidden-test problem. A model can overfit the visible examples. It learns the receipt instead of the real behavior. So synthesis systems often need diverse tests, property checks, or symbolic constraints. Otherwise the translator may learn to game the market vendor. Bad bargain.

Honest limits of synthesis¶

Even with tests, synthesis can fail from ambiguous specs, missing constraints, or huge state spaces. The model may choose an algorithm that passes examples but explodes on larger inputs. It may misunderstand performance needs. It may ignore side effects. It may use a banned library. So what to do? Keep the spec tight. Keep the tests meaningful. Keep the search bounded.

The best working mental model is this. Natural language starts the search. Tests prune the search. Constraints fence the search. Execution receipts steer the search. Simple, no? That is program synthesis in one line.

Where this lives in the wild¶

GitHub Copilot Chat — interview candidate: uses tests and examples to synthesize helpers from problem statements instead of raw code stubs.
Cursor Agent — application engineer: explores several implementation candidates, then keeps the one that survives execution feedback.
OpenAI structured outputs — platform engineer: benefits from constrained decoding when emitted code or DSL must stay syntactically valid.
Semgrep Assistant — security engineer: writes rule snippets from behavior specs under grammar-like constraints.
Replit Agent — prototype builder: turns app requirements into working code faster when tests act as the real phrasebook.

Pause and recall¶

Why is program synthesis a harder problem than code completion?
In the clamp example, what exactly did candidate 2 still miss?
Why do tests act like executable specifications?
When does constrained decoding help more than plain sampling?

Interview Q&A¶

Q: Why are tests so central to program synthesis instead of being just a final check? A: Because they define the operational meaning of the spec and prune the search space toward programs that actually satisfy behavior. Common wrong answer to avoid: "Tests matter only after the model has already found the right algorithm."

Q: Why can constrained decoding improve synthesis even if it does not solve semantics? A: It shrinks the candidate space to well-formed programs, so the search budget is spent on behavior rather than syntax repair. Common wrong answer to avoid: "Constraints mainly make the output look prettier."

Q: Why is sample-and-rank useful when one candidate already passes visible tests? A: Another passing candidate may generalize better, be simpler, or avoid hidden bugs that the visible suite does not expose. Common wrong answer to avoid: "Visible-test pass means the search is finished."

Q: Why do synthesis systems overfit visible examples so easily? A: Because small example sets underconstrain the true program, letting shortcut implementations mimic the receipt without learning the full behavior. Common wrong answer to avoid: "Overfitting is only a training-time phenomenon, not an inference-time search issue."

Apply now (5 min)¶

Exercise. Write a tiny behavioral spec like clamp, is_even, or safe_divide and give it three tests. Then imagine two wrong candidates and one correct candidate, and compute which tests each one passes.

Sketch from memory. Draw spec → candidates → tests → passing receipt. Under it, write search, prune, repair as the three verbs. See. That is synthesis.

Bridge. Even synthesized code that passes tests may still hide bugs or risky changes. Next we study AI-assisted code review over diffs. → 09-code-review-ai.md