10. Execution feedback loops — let the runtime argue back¶

~13 min read. The fastest way to improve generated code is to stop debating in prose and start reading concrete runtime receipts.

Built on the ELI5 in 00-eli5.md. The haggling loop becomes explicit here. The translator proposes code, the runtime (the market vendor) returns a receipt, and the next draft learns from it.

Runtime feedback is sharper than prompt advice¶

A prompt can say, "Handle edge cases carefully." A runtime failure can say, "IndexError on empty list at line 14." Which one is easier to repair from? Obviously the second. Execution turns vague guidance into concrete evidence. That is why self-debugging loops work as well as they do.

Look at the pattern. Generate code. Run it. Observe the receipt. Repair the code. Run again. This is just engineering discipline wrapped around the translator. Simple, no?

translator draft
      │
      ▼
 run / test / REPL
      │
      ▼
 runtime receipt
      │
      ▼
 repair prompt
      │
      └──→ next draft

The key is that receipts are typed. Assertion failure. Exception trace. Wrong output value. Timeout. Each type points at a different kind of repair, which makes it far more informative than a generic "improve the code" instruction.

REPL-driven generation makes small loops cheap¶

For exploratory tasks, a REPL is perfect. You ask the model for a helper. You run one function call. You inspect the value. You patch the code. Then you rerun. That short loop makes iteration cheap.

This works especially well for data transformations, parsers, and tiny utilities. The model does not need to imagine the whole world. It can test assumptions immediately. That is like checking the market bargain on the spot instead of waiting until next week. The receipt comes fast. The haggling stays grounded.

A practical rule is to surface the smallest failing example. Not the whole integration log. One concise trace or one wrong input-output pair often helps more. If the receipt is noisy, the repair prompt gets noisy too. So curate it. Yes?
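One way to surface the smallest failing example is to run every case and keep only the failure with the shortest input. The sketch below assumes a hypothetical buggy helper `mean` and a hypothetical spec that an empty list should give 0.0. "Smallest" here just means the shortest `repr` of the arguments, which is a rough heuristic rather than a real test-case minimizer.

```python
def smallest_failure(func, cases):
    """Return (args, expected, outcome) for the failing case with the shortest input, or None."""
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append((args, expected, f"raised {type(exc).__name__}: {exc}"))
            continue
        if actual != expected:
            failures.append((args, expected, f"returned {actual!r}"))
    if not failures:
        return None
    # Shortest input representation wins: rough, but cheap and usually good enough.
    return min(failures, key=lambda failure: len(repr(failure[0])))


def mean(xs):
    # Hypothetical buggy draft: integer division, and no guard for an empty list.
    return sum(xs) // len(xs)


cases = [
    (([1, 2, 3],), 2.0),   # passes: 6 // 3 == 2, and 2 == 2.0 in Python
    (([1, 2],), 1.5),      # fails: returns 1
    (([],), 0.0),          # fails: raises ZeroDivisionError
]

args, expected, outcome = smallest_failure(mean, cases)
print(f"mean{args!r} should give {expected!r} but {outcome}")
```

Only the empty-list case goes into the repair prompt. The other failure gets rechecked on the next run.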

Worked numerical example: fixing Fibonacci indexing¶

Suppose the translator drafts this function.

def fib(n):
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

What should fib(6) be? Indexing the sequence 0, 1, 1, 2, 3, 5, 8 from zero, so that fib(0) = 0, the answer should be 8. Now run the draft. Start with a = 1, b = 1. After one loop, (a, b) is (1, 2). After two loops, (2, 3). After three, (3, 5). After four, (5, 8). After five, (8, 13). After six, (13, 21). The function returns a, which is 13. Wrong.

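The receipt is useful. It says the indexing convention is off: the draft returns the next Fibonacci number, so it behaves like a one-shifted sequence. So the repair prompt can be precise. One option is to initialize a, b = 0, 1. Another is to shorten the loop by one, but that still leaves fib(0) returning 1. Initializing from 0, 1 is the cleaner fix. The corrected version is: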

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Now fib(6) returns 8. One concrete receipt fixed the bug. See how much stronger that is than a vague instruction.
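The same walkthrough can be turned into a small check. This is a sketch: the names `fib_draft`, `fib_fixed`, and `first_mismatch` are introduced here only to hold both versions side by side. It compares each version against the zero-indexed sequence and reports the first disagreement as a receipt.

```python
def fib_draft(n):
    a, b = 1, 1  # the original draft: starts one step too far along
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_fixed(n):
    a, b = 0, 1  # the repaired initialization
    for _ in range(n):
        a, b = b, a + b
    return a


EXPECTED = [0, 1, 1, 2, 3, 5, 8]  # fib(0) through fib(6)


def first_mismatch(func):
    """Return a one-line receipt for the first wrong value, or None if all match."""
    for n, want in enumerate(EXPECTED):
        got = func(n)
        if got != want:
            return f"fib({n}): expected {want}, got {got}"
    return None


print([fib_draft(n) for n in range(7)])  # [1, 1, 2, 3, 5, 8, 13]
print(first_mismatch(fib_draft))         # fib(0): expected 0, got 1
print(first_mismatch(fib_fixed))         # None
```

Checking from fib(0) upward exposes the shift at the smallest input, which points straight at the initialization.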

Self-debugging works best when the loop is bounded and honest¶

Execution feedback can still mislead. A model may patch only the visible failure. It may overfit the current test. It may introduce a new bug elsewhere. So what do you do? Keep the loop bounded. Track what changed. Rerun old tests as well as new ones. Do not trust the latest draft just because the newest failure disappeared.
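A bounded loop with regression checks might look like the sketch below. Everything here is illustrative. `repair_loop` and `failing_cases` are hypothetical names, and `stub_propose_fix` stands in for the model call, which in a real system would receive the failures as part of a repair prompt.

```python
def failing_cases(func, tests):
    """Run the full suite and return one receipt line per failure."""
    failures = []
    for args, expected in tests:
        call = f"{func.__name__}({', '.join(map(repr, args))})"
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{call} raised {type(exc).__name__}: {exc}")
            continue
        if actual != expected:
            failures.append(f"{call}: expected {expected!r}, got {actual!r}")
    return failures


def repair_loop(draft, propose_fix, tests, max_rounds=3):
    """Return (final_draft, remaining_failures, rounds_used)."""
    current = draft
    for rounds_used in range(max_rounds + 1):
        failures = failing_cases(current, tests)  # old and new tests, every round
        if not failures or rounds_used == max_rounds:
            return current, failures, rounds_used  # stop: clean, or budget spent
        current = propose_fix(current, failures)


def fib_draft(n):
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_fixed(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def stub_propose_fix(current, failures):
    # Stand-in for a model call; a real one would read `failures`.
    return fib_fixed


tests = [((0,), 0), ((1,), 1), ((6,), 8)]
final, remaining, rounds = repair_loop(fib_draft, stub_propose_fix, tests)
print(final.__name__, remaining, rounds)  # fib_fixed [] 1
```

If the budget runs out, the loop hands back the remaining failures instead of claiming success, which keeps the receipts honest.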

Another important rule is honesty. If the runtime receipt says permission denied, do not ask the model to invent access. If the receipt says package missing, do not hallucinate success. The loop should refine code, not fabricate environment state. Look. Execution makes the translator smarter only if the receipts stay real.

One more design choice. Should the model see raw logs or summarized logs? Raw logs preserve detail. Summaries reduce noise. In practice, a hybrid works well. Keep the failing snippet, the exception type, and the exact wrong output. Drop the irrelevant noise. That keeps the phrasebook sharp.
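As a rough illustration of the hybrid, the sketch below keeps the failing call, the exception type and message, and the innermost frame, and drops the rest of the traceback. The helper name `hybrid_receipt` and its output layout are assumptions made for this example. It uses only the standard library's `traceback.extract_tb`.

```python
import traceback


def hybrid_receipt(exc, call_text):
    """Summarize an exception as: failing call, error type and message, innermost frame."""
    lines = [
        f"call: {call_text}",
        f"error: {type(exc).__name__}: {exc}",
    ]
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        innermost = frames[-1]  # frames run from outermost to innermost
        lines.append(f"where: {innermost.name}, line {innermost.lineno}")
    return "\n".join(lines)


def first_item(xs):
    return xs[0]


try:
    first_item([])
except IndexError as exc:
    receipt = hybrid_receipt(exc, "first_item([])")

print(receipt)
```

For wrong-output failures there is no traceback to trim. The exact input, the expected value, and the value returned are the whole receipt.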

Runtime loops complement static review¶

Static review sees patterns before execution. Runtime loops see actual behavior after execution. These are complementary. A static checker may warn about possible null access. A runtime receipt shows the exact null input that crashed. A review model may suspect an off-by-one. A failing test proves it. Together they are much stronger than either alone.

The mature picture is simple. Generate. Run. Inspect. Repair. Rerun. Stop when evidence is good enough. That is how the translator learns from the vendor without pretending to be the vendor. Simple, no?

Where this lives in the wild¶

Cursor Agent — app developer: generates a patch, runs terminal commands, reads the failure, and proposes the next edit.
Replit Agent — solo founder: iterates against live app previews and runtime traces rather than static prompts alone.
GitHub Copilot CLI — backend engineer: uses terminal receipts to repair scripts and commands in short loops.
OpenHands — debugging engineer: chains code edits with execution feedback to converge on a working fix.
Amazon Q Developer — enterprise maintainer: benefits from compiler and test receipts when applying larger refactors.

Pause and recall¶

Why is runtime feedback usually sharper than generic prompt advice?
In the Fibonacci example, what exactly made the first draft return 13 for fib(6)?
Why should execution loops stay bounded?
What information from logs is most valuable to feed back into the next repair step?

Interview Q&A¶

Q: Why do execution feedback loops often outperform prompt-only self-correction?
A: Because concrete runtime receipts expose exact failure modes, while prompt-only self-correction relies on the model guessing what went wrong.
Common wrong answer to avoid: "Because the model becomes smarter after each prompt."

Q: Why can runtime feedback still produce bad repairs if you are careless?
A: The model may overfit the visible failure, misread noisy logs, or introduce regressions unless the loop is curated and bounded.
Common wrong answer to avoid: "Execution feedback makes the loop automatically reliable."

Q: Why keep old tests in the loop after a repair seems to work?
A: Because a local fix can break previously correct behavior, and only regression checks reveal whether the patch generalized.
Common wrong answer to avoid: "Once the current failing test passes, the bug is solved."

Q: Why are REPL loops especially effective for small utilities and data transforms?
A: They shorten the cycle from draft to receipt, making it cheap to isolate assumptions and repair them with concrete evidence.
Common wrong answer to avoid: "REPL loops matter only for notebooks, not engineering work."

Apply now (5 min)¶

Exercise. Write a tiny buggy function or copy the Fibonacci example. Run one failing input, write the exact receipt, and draft the repair prompt you would feed back. Use the numbers again: expected 8, got 13.

Sketch from memory. Draw draft → run → receipt → repair → rerun. Under the receipt box, write one example like assertion failure or stack trace. Look. That loop is the whole file.

Bridge. Small runtime loops are great, but many real bugs span routers, services, configs, and tests across many files. Next we study repository-level understanding. → 11-multi-file-code-understanding.md