04. SQL validation and execution — trust the receipt, not the first draft¶

~14 min read. Generated SQL becomes useful only after policy checks, safe execution, and recovery loops.

Built on the ELI5 in 00-eli5.md. The translator can draft a bargain, but the market vendor decides what is legal. The haggling loop starts when the first draft is unsafe, blocked, or wrong.

Never run first-draft SQL on faith¶

A strong text-to-SQL system does not stop at generation. It stops at controlled execution. This is a big difference. A language model can produce SQL that looks elegant and still be dangerous. Danger can mean data leakage. Danger can mean a full-table scan on a giant warehouse. Danger can mean DELETE or UPDATE when the product promised read-only answers.

So what to do? Insert validators before the query touches real data. Think like the vendor. Is the syntax valid? Is the query read-only? Does it touch allowed schemas? Is the estimated cost sane? Will row-level permissions still hold? Simple, no?

SQL draft
   │
   ▼
┌──────────────┐
│ parser check │
└──────┬───────┘
       ▼
┌──────────────┐
│ policy check │  read-only? allowed tables?
└──────┬───────┘
       ▼
┌──────────────┐
│ cost check   │  row count, joins, limits
└──────┬───────┘
       ▼
┌──────────────┐
│ sandbox run  │
└──────┬───────┘
       ▼
    receipt or error

Notice the order. We reject cheap failures early. Syntax first. Policy second. Cost third. Execution last. That keeps systems safer and cheaper.

The validator stack has different jobs¶

Parser validation answers, "Is this SQL even well formed?" This catches broken commas, missing aliases, bad grouping, and invalid functions. Good. Cheap. Necessary. Not sufficient.

Policy validation answers, "Should this query ever be allowed?" Maybe your product is read-only. Maybe it may access only analytics tables. Maybe PII tables are blocked. Maybe each caller gets a different allowlist. This is where governance lives.

Cost validation answers, "Even if allowed, is this query reasonable?" A user asked for top five categories. The translator drafts a scan over five billion rows without a partition filter. That may be technically valid and still operationally foolish. So we enforce limits, time windows, row caps, or sampled previews. Look. Production systems pay for bad SQL in money, latency, and trust.

Then comes sandboxed execution. Run on a constrained role. Apply timeouts. Capture logs. Trim oversized results. Return structured errors. The receipt should include metadata, not only rows. Metadata tells you how to phrase the answer and whether to retry.

Worked example: safe correction loop¶

Question: "What is average premium order value in April?" Suppose the relevant table is:

orders
┌──────────┬──────────┬────────┬──────────┐
│ order_id │ tier     │ amount │ month    │
├──────────┼──────────┼────────┼──────────┤
│ 21       │ premium  │ 120    │ 2025-04  │
│ 22       │ premium  │ 180    │ 2025-04  │
│ 23       │ premium  │ 240    │ 2025-04  │
│ 24       │ basic    │  90    │ 2025-04  │
└──────────┴──────────┴────────┴──────────┘

The translator's first SQL draft is: SELECT AVG(amounts) FROM orders WHERE tier = 'premium' AND month = '2025-04';

Problem one. Column amounts does not exist. Parser or execution returns an error. Good. Visible failure. No fake receipt.

Now the haggling loop feeds back the error. The translator retries with: SELECT AVG(amount) FROM orders WHERE tier = 'premium' AND month = '2025-04';

Now compute the answer. Premium April amounts are 120, 180, and 240. Add them first. 120 + 180 + 240 = 540. There are 3 rows. 540 / 3 = 180. So the safe receipt value is 180.

One more guard. If the draft had been SELECT * FROM orders;, policy may allow it but cost or product rules should reject it for this question. The user asked for an average. A full raw dump is unnecessary exposure. See why validators matter.

Error recovery is a design feature¶

Many teams treat failures as embarrassment. Bad idea. Failure messages are supervision. A missing column error teaches the model which field names are real. A permission denial teaches which tables are blocked. An empty result teaches that the filter may be wrong. A timeout teaches that the join path is too heavy.

So what to do? Build explicit retry classes. Syntax error retry. Schema mismatch retry. Empty receipt retry. Blocked query refusal. Each class gets a different response. That is smarter than dumping the whole error log into one desperate prompt.

start
 ├── syntax error? ──→ retry with parser message
 ├── unknown column? ──→ refresh phrasebook slice
 ├── blocked table? ──→ refuse or ask narrower question
 ├── empty receipt? ──→ confirm filter assumptions
 └── good receipt? ──→ summarize for user

The important discipline is bounded haggling. Do not retry forever. Maybe allow two or three controlled attempts. After that, surface the limitation. The system should be honest before it becomes chaotic. Yes?

Sandboxes preserve trust¶

Execution should happen in a sandboxed environment. Read-only credentials. Statement timeout. Result size cap. Audited query logs. Optional query approval for sensitive scopes. These are boring controls. They are also the difference between a fun demo and an acceptable enterprise tool.

A good sandbox also normalizes results. Maybe the receipt returns decimal values and nulls. Maybe it returns a thousand rows. Maybe it returns column types. The answer layer should know all this. If a mean is null because there were zero rows, say that. Do not turn an empty receipt into a fictional zero.

Look. The translator's draft is not truth. The vendor's receipt is closer to truth. The validated, sandboxed, permission-checked receipt is the one you should trust. That is the production mindset.

Where this lives in the wild¶

Snowflake Cortex Analyst — data governance lead: needs read-only, policy-aware execution before any natural-language SQL answer is shown.
Databricks SQL AI tools — platform engineer: adds Unity Catalog permissions and query limits so lakehouse chat does not become a runaway scan machine.
Power BI Copilot — BI admin: must respect semantic-model access rules before returning grounded dashboard answers.
Airtable AI — operations manager: benefits from safe scoped execution because row dumps are rarely the right product response.
Hex Magic — analytics engineer: uses notebook execution and error feedback loops instead of trusting first-draft queries blindly.

Pause and recall¶

Why is parser validation necessary but still not enough?
In the worked example, how did the system recover from amounts to amount?
Why can SELECT * be a policy or product failure even when it is valid SQL?
What does bounded haggling protect you from?

Interview Q&A¶

Q: Why validate generated SQL before execution instead of relying on database errors alone? A: Because syntax, policy, and cost failures have different meanings, and catching them earlier improves safety, latency, and governance. Common wrong answer to avoid: "Because databases are bad at error messages."

Q: Why keep retries bounded instead of looping until the model fixes the query? A: Unbounded retries increase cost, latency, and instability, and they can turn a recoverable miss into uncontrolled behavior. Common wrong answer to avoid: "More retries always increase accuracy."

Q: Why is sandboxed execution a core product feature and not only an infra detail? A: It enforces permissions, timeouts, result limits, and auditability, which directly determine whether users can trust the returned receipt. Common wrong answer to avoid: "Sandboxes matter only for write queries."

Q: Why treat empty results differently from syntax errors in recovery loops? A: Because empty results often indicate semantic mismatch or wrong filters, while syntax errors indicate malformed translation and need different corrective prompts. Common wrong answer to avoid: "Both just mean the SQL failed."

Apply now (5 min)¶

Exercise. Write one harmless read-only SQL query and one unsafe or overbroad query. Label which validator should catch each one: parser, policy, cost, or sandbox timeout. Then do the tiny average by hand: 120 + 180 + 240 = 540, and 540 / 3 = 180.

Sketch from memory. Draw the validator stack from SQL draft down to receipt. Add one branch for syntax retry and one branch for blocked-table refusal. Simple, no? That sketch is your control plane.

Bridge. Once the vendor sends back a clean receipt, we still need the translator to reason over tables and aggregates without getting confused. → 05-tabular-reasoning.md