
13. Honest admission — what we still do not fully solve in structured data and code generation

~14 min read. Good translators exist. Perfect translators do not. This field still has sharp unsolved edges.

Built on the ELI5 in 00-eli5.md. Even with a strong phrasebook, the translator still mishears. Even with a valid receipt, the next bargaining step can still go wrong.


End-to-end systems fail by multiplication

Many substeps look individually strong. Schema linking may work often. SQL generation may parse often. Execution may succeed often. Result reasoning may work often. But every stage has to succeed for the answer to be right. If the failures are roughly independent, multiply those probabilities and the end-to-end number drops fast. That is the first honest lesson.

Take one simple chain. Schema linking succeeds 90% of the time. SQL drafting succeeds 95% of the time. Execution and recovery succeed 90% of the time. Result reasoning succeeds 92% of the time. Now multiply. 0.90 × 0.95 = 0.855. 0.855 × 0.90 = 0.7695. 0.7695 × 0.92 ≈ 0.7079. So end-to-end success is about 70.8%. That is far below each individual component score. Simple, no?
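The same arithmetic can be checked in a few lines. This is a minimal sketch: the stage names are labels chosen here for illustration, the rates are the ones from the example above, and it assumes every stage must succeed and that stages fail independently.

```python
from math import prod

# Stage success rates from the worked example above.
# The dictionary keys are illustrative labels, not names from any real system.
stage_success = {
    "schema_linking": 0.90,
    "sql_drafting": 0.95,
    "execution_and_recovery": 0.90,
    "result_reasoning": 0.92,
}


def end_to_end_success(rates):
    """Multiply per-stage success rates (assumes independent stages that must all succeed)."""
    return prod(rates)


overall = end_to_end_success(stage_success.values())
print(f"end-to-end success: {overall:.4f}")  # prints 0.7079
print(f"weakest single stage: {min(stage_success.values()):.2f}")
```

The product sits well below the weakest single stage, which is the whole point of the example.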

This same multiplication hurts code systems too. Good retrieval. Good generation. Good compile pass. Good tests. One weak link drags the whole pipeline. Demos hide this. Production punishes it.

Verification is much better than before and still incomplete

People sometimes say, "Just verify the output." Yes, verify it. But verify with what? SQL can execute correctly and still violate policy intent. Code can pass visible tests and still break hidden assumptions. Review comments can sound precise and still be false alarms. Verification itself has blind spots.

Look at code generation. Unit tests are strong. They are not omniscient. If the tests miss concurrency, performance, or security, the translator may still ship a harmful patch. Similarly, text-to-SQL execution can verify a result on today's data while missing that the query was overbroad or leaked sensitive columns. So what to do? Layer checks. Do not worship one metric.
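Here is one way "layer checks" might look for a text-to-SQL result. It is only a sketch using Python's built-in sqlite3 module. The table, the sensitive-column list, and the row cap are all hypothetical, and a real policy layer would need far more than this.

```python
import sqlite3

SENSITIVE_COLUMNS = {"ssn", "email"}  # hypothetical policy list
MAX_ROWS = 100                        # hypothetical cap for "overbroad"


def layered_check(conn, sql):
    """Return a list of failed checks; an empty list means every layer passed."""
    # Layer 1: does the query execute at all?
    try:
        cursor = conn.execute(sql)
    except sqlite3.Error as exc:
        return [f"execution: {exc}"]

    failures = []
    # Layer 2: does the result expose columns the policy marks as sensitive?
    columns = {desc[0].lower() for desc in (cursor.description or ())}
    leaked = sorted(columns & SENSITIVE_COLUMNS)
    if leaked:
        failures.append(f"policy: sensitive columns returned: {leaked}")

    # Layer 3: is the result suspiciously broad?
    rows = cursor.fetchmany(MAX_ROWS + 1)
    if len(rows) > MAX_ROWS:
        failures.append(f"breadth: more than {MAX_ROWS} rows returned")
    return failures


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, ssn TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, f"user{i}", f"000-00-{i:04d}") for i in range(150)],
)

print(layered_check(conn, "SELECT * FROM customers"))
print(layered_check(conn, "SELECT name FROM customers WHERE id < 10"))
```

The first query executes fine and still fails two layers. Execution alone would have waved it through.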

This is where honest engineering matters. We can say execution-based eval is the backbone. We should also say it is not the whole skeleton. Yes?

Open debates are real, not cosmetic

Debate one. Should we invest more in bigger models or better toolchains? Bigger models improve raw translation and pattern coverage. Better toolchains improve grounding, validation, and retrieval. Both camps have evidence. In practice, strong systems usually need both.

Debate two. Should structured outputs rely more on constrained decoding or on free generation plus repair? Constraints cut syntax errors and narrow search. Repair loops preserve flexibility and can recover from surprising cases. Again, both matter. The right balance depends on the domain risk.
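To make the "free generation plus repair" side concrete, here is a small sketch of a repair loop. Everything in it is an assumption for illustration: the required keys, the generate(prompt, feedback) calling convention, and the fake_model stand-in that replaces a real model call. Constrained decoding is not shown, because it lives inside the decoder rather than around it.

```python
import json

REQUIRED_KEYS = {"table", "columns"}  # hypothetical output contract


def validate(raw):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "expected a JSON object"
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return parsed, None


def generate_with_repair(generate, prompt, max_attempts=3):
    """Call generate(prompt, feedback) until the output validates or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt, feedback)
        parsed, error = validate(raw)
        if error is None:
            return parsed, attempt
        feedback = error  # hand the validator's complaint back to the generator
    raise ValueError(f"no valid output after {max_attempts} attempts: {feedback}")


def fake_model(prompt, feedback):
    # Stand-in for a model call: the first draft is truncated, the repair is valid.
    if feedback is None:
        return '{"table": "orders", "columns": '
    return '{"table": "orders", "columns": ["id", "total"]}'


result, attempts = generate_with_repair(fake_model, "list order totals")
print(result, attempts)
```

Note the attempt cap. A repair loop with no budget is just a slower way to fail.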

Debate three. Should code agents act autonomously over many steps or stay tightly supervised? Autonomy improves speed and search. Supervision improves trust and governance. The answer changes with task stakes. A toy app tolerates more freedom. A production billing system does not. Look. These are not settled arguments. They are design tradeoffs.

A simple failure map and honest interview answer

When a structured-data or coding system fails, ask four questions. Did the translator misunderstand the shopping list? Did the phrasebook omit the key schema, type, or file? Did the vendor receipt miss a wrong execution or rest on a weak test? Did the haggling loop stop too early? Those four buckets catch many real failures.

start
 ├── wrong task read? ──→ translator problem
 ├── missing context? ──→ phrasebook problem
 ├── bad execution check? ──→ receipt problem
 └── weak repair loop? ──→ haggling problem

Look. This little map keeps postmortems honest. It stops teams from blaming the model for every failure. Sometimes the retrieval was weak. Sometimes the tests were weak. Sometimes the model really was the weak link.
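The map is small enough to encode directly, which helps when you want to tally postmortems instead of arguing about them. A minimal sketch follows. The question keys and the sample postmortems are made up for illustration, and one failure is allowed to land in several buckets because the map does not rank them.

```python
from collections import Counter

# The four questions from the map, paired with the bucket each one points to.
FAILURE_MAP = [
    ("wrong_task_read", "translator problem"),
    ("missing_context", "phrasebook problem"),
    ("bad_execution_check", "receipt problem"),
    ("weak_repair_loop", "haggling problem"),
]


def classify_failure(answers):
    """Return every bucket whose question was answered yes, in map order."""
    buckets = [bucket for question, bucket in FAILURE_MAP if answers.get(question, False)]
    return buckets or ["unclassified"]


# Hypothetical postmortem notes, one dict of yes/no answers per incident.
postmortems = [
    {"missing_context": True},
    {"missing_context": True, "weak_repair_loop": True},
    {"wrong_task_read": True},
    {},
]

tally = Counter(bucket for answers in postmortems for bucket in classify_failure(answers))
for bucket, count in tally.most_common():
    print(f"{bucket}: {count}")
```

In this made-up sample the phrasebook, not the translator, is the most frequent culprit. That is exactly the kind of finding the map is meant to surface.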

A strong answer sounds like this. LLMs are excellent translators between plain requests and strict systems. They become powerful when paired with schemas, tests, execution, retrieval, and review loops. They are still brittle on ambiguous specs, weak tests, hidden dependencies, and long-horizon autonomous changes. That is the honest core.

If asked about text-to-SQL limits, say schema ambiguity, governance, and evaluation leakage still matter. If asked about code generation limits, say repository retrieval, verification gaps, and overfitting visible tests still matter. If asked whether agents solve software engineering, say they accelerate parts of it but do not replace disciplined validation. Simple, no?

One more honest sentence. The field progresses by building better receipts, better phrasebooks, and better haggling loops. Not by pretending the translator became the vendor. That line usually lands well.

The right takeaway is confidence with humility

You should absolutely use these systems. You should absolutely design around their strengths. You should absolutely avoid their known traps. That is mature optimism. Not hype. Not cynicism.

See the final market picture. The translator is fast. The phrasebook can be rich. The vendor can return receipts. The bargaining loop can repair mistakes. Still, someone must decide whether the whole transaction is trustworthy enough for the situation. That judgment layer remains a live engineering problem. Yes?

That is a good place to stop. Not with fear. Not with blind confidence. With clear-eyed systems thinking.


Where this lives in the wild

  • Snowflake Cortex Analyst — data platform lead: sees strong demos and still has to worry about governance, ambiguity, and enterprise edge cases.
  • GitHub Copilot for Business — staff engineer: enjoys speed gains while still reviewing repository-scale changes for hidden regressions.
  • Cursor Agent — startup CTO: loves autonomous edits for prototypes but tightens supervision for production paths.
  • Code review platform teams — security lead: pairs AI findings with human judgment because false positives and false negatives both matter.
  • Applied AI teams shipping SQL copilots — product manager: learns quickly that component metrics can look great while end-to-end success still feels uneven.

Pause and recall

  • Why does end-to-end success drop so quickly even when each component looks strong?
  • In the multiplication example, how did strong local numbers collapse to about 70.8%?
  • Why is execution-based verification necessary but not sufficient?
  • Name one real debate in the field and the tradeoff on each side.

Interview Q&A

Q: Why do strong component metrics still produce disappointing end-to-end systems? A: Because structured-data and coding pipelines compound errors multiplicatively, so moderate misses across stages quickly reduce overall success. Common wrong answer to avoid: "If each component is above 90%, the full system will also feel above 90%."

Q: Why is verification not a complete answer to model unreliability? A: Verification itself depends on what checks you run, and those checks may miss policy intent, hidden behavior, security, or long-tail edge cases. Common wrong answer to avoid: "Once you have tests or execution, the trust problem is solved."

Q: Why is the bigger-models-versus-better-tools debate still alive? A: Bigger models improve raw translation, while better tools improve grounding and control, and modern systems usually need both rather than one silver bullet. Common wrong answer to avoid: "Tooling matters only because current models are small."

Q: Why is human oversight still rational even for impressive coding and SQL agents? A: Task stakes, hidden dependencies, and verification gaps mean the final trust decision often exceeds what current automated receipts can guarantee. Common wrong answer to avoid: "Human review remains only because teams are conservative."


Apply now (5 min)

Exercise. Multiply four stage success rates from any pipeline you know, just like 0.90 × 0.95 × 0.90 × 0.92 ≈ 0.7079. Then write one sentence on which stage you would improve first and why.
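If you want a quick way to test your answer, this what-if sketch bumps each stage by the same amount and recomputes the product. The 0.03 bump and the stage labels are arbitrary choices for illustration. It assumes independent stages and ignores how much each improvement would actually cost, which is often the deciding factor in practice.

```python
from math import prod


def what_if(rates, bump=0.03):
    """End-to-end success if exactly one stage improves by `bump` (capped at 1.0)."""
    results = {}
    for name in rates:
        trial = dict(rates)
        trial[name] = min(1.0, trial[name] + bump)
        results[name] = prod(trial.values())
    return results


# Swap in the four rates from your own pipeline; these are the ones from the exercise.
rates = {
    "schema_linking": 0.90,
    "sql_drafting": 0.95,
    "execution_and_recovery": 0.90,
    "result_reasoning": 0.92,
}

baseline = prod(rates.values())
scenarios = what_if(rates)
for name, value in sorted(scenarios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"improve {name}: {baseline:.4f} -> {value:.4f}")
```

For an equal bump, the weakest stages give the largest end-to-end gain, because the same absolute improvement is a bigger relative improvement on a smaller number.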

Sketch from memory. Draw translator, phrasebook, vendor, receipt, and haggling in one final loop. Write one warning under it: never confuse a good translator with the source of truth. Look. That is the module's final lesson.


Bridge. You now have the core systems picture for structured data and code generation. Next: put it all together in the capstone. → ../33_capstone_project/00-eli5.md