02. Text-to-SQL pipeline — from question to query to grounded answer¶
~14 min read. The interesting work is not one SQL prompt. It is the full controlled chain around it.
Built on the ELI5 in 00-eli5.md. The shopping list is plain language, but the market vendor accepts only SQL. The translator needs a process, not vibes, to turn one into the other.
See the whole pipe before any details¶
People often say, "We will use an LLM for text-to-SQL." That sentence hides five subproblems: question understanding, schema linking, SQL generation, execution, and result formatting. Miss one and the system feels magical in demo, brittle in production.
Picture the flow first. The user speaks in business words, the schema lives in table words, and the answer layer must turn returned rows back into business words. This is a translation sandwich. Simple, no?
shopping list
│
▼
┌──────────────┐
│ translator │ understand intent
└──────┬───────┘
▼
┌──────────────┐
│ phrasebook │ map words to tables and columns
└──────┬───────┘
▼
┌──────────────┐
│ SQL draft │ build executable query
└──────┬───────┘
▼
┌──────────────┐
│ market vendor│ run query safely
└──────┬───────┘
▼
┌──────────────┐
│ receipt │ rows, numbers, metadata
└──────┬───────┘
▼
natural-language answer
Now what is the engineering lesson? The LLM sits near the top and bottom. The database sits in the middle. Grounding comes from the middle. Clarity comes from the top and bottom. Do not confuse them.
Schema linking is the first real bottleneck¶
A user rarely says the exact column name.
They say "customers," not account_dim.
They say "revenue," not net_amount_after_discount.
They say "last quarter," not invoice_date BETWEEN ....
So the first job is mapping language to schema.
This is schema linking.
We resolve nouns, metrics, filters, time ranges, and join paths.
If the question says "enterprise customers with unpaid invoices," we probably need customers, contracts, and invoices.
That choice is already half the answer.
Look at a tiny example.
Tables: customers(customer_id, segment, city) and orders(order_id, customer_id, amount, month).
Question: "What was March revenue from enterprise customers in Pune?"
The system must map revenue → orders.amount, enterprise customers → customers.segment = 'enterprise', Pune → customers.city = 'Pune', and March → the relevant month field.
Without good linking, SQL generation is already doomed. The translator may produce perfect syntax on the wrong tables. That is the worst kind of wrong. Looks valid. Semantically false. Yes?
Worked numerical example end to end¶
Use this tiny data. We will walk the whole pipe.
customers
┌─────────────┬─────────────┬───────┐
│ customer_id │ segment │ city │
├─────────────┼─────────────┼───────┤
│ 1 │ enterprise │ Pune │
│ 2 │ enterprise │ Delhi │
│ 3 │ self_serve │ Pune │
└─────────────┴─────────────┴───────┘
orders
┌──────────┬─────────────┬────────┬──────────┐
│ order_id │ customer_id │ amount │ month │
├──────────┼─────────────┼────────┼──────────┤
│ 11 │ 1 │ 120 │ 2025-03 │
│ 12 │ 1 │ 80 │ 2025-03 │
│ 13 │ 2 │ 300 │ 2025-03 │
│ 14 │ 3 │ 50 │ 2025-03 │
└──────────┴─────────────┴────────┴──────────┘
Question: "What was March revenue from enterprise customers in Pune?"
Step 1. Link entities.
Enterprise customers in Pune means customers.segment = 'enterprise' and customers.city = 'Pune', so we keep only customer 1.
Step 2. Join orders to customer 1.
That keeps order 11 and order 12 with amounts 120 and 80.
Step 3. Aggregate.
120 + 80 = 200, so the correct receipt value is 200.
A valid SQL draft is:
SELECT SUM(o.amount) AS march_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.segment = 'enterprise'
AND c.city = 'Pune'
AND o.month = '2025-03';
Step 4. Format for the user. Answer: "March revenue from enterprise customers in Pune was 200." Now the answer is grounded. The wording is soft, and the number is hard.
Generation, execution, and formatting are separate jobs¶
Many teams bundle SQL generation and answer generation into one prompt. This works for toy demos. It becomes messy in production. Why? Because failures happen at different layers. A missing join is not the same as a poor answer style. A timeout is not the same as a wrong aggregation. A verbose summary is not the same as leaked rows.
So what to do? Keep explicit pipeline states. Store the linked schema objects. Store the SQL draft. Store execution metadata. Store the returned rows. Store the final natural-language answer. Now you can debug each step.
The formatter should also know when not to freestyle.
If the receipt says NULL, say no data was found.
If the receipt returns five rows, summarize and maybe show a table.
If execution failed, surface that instead of pretending success.
The translator should never hide a bad receipt.
A useful production pattern is this. Ask the LLM for structured output first: chosen tables, chosen columns, filters, groupings, and SQL. Then run validators and execute. Then ask a second prompt to explain the receipt. One giant prompt can do this. Two smaller controlled prompts are usually easier to trust. Look. Pipelines beat monoliths here.
Error handling is part of the pipeline, not an afterthought¶
Text-to-SQL fails in recognizable ways: wrong table, wrong join, wrong time filter, correct SQL on the wrong metric, permission denial, too many returned rows, or ambiguous business terms.
Strong systems do not assume first-pass perfection. They capture parse errors, execution errors, and empty receipts. Then they enter controlled haggling. Maybe refine the schema context, ask for clarification, or retry with an error message.
question
├── bad schema link? ──→ wrong SQL, valid syntax
├── bad SQL syntax? ──→ parser error
├── blocked query? ──→ governance rejection
├── empty receipt? ──→ maybe wrong filter
└── good receipt? ──→ format answer
See why the word pipeline matters. It implies typed stages. It implies observability. It implies retries with boundaries. That is how a translator becomes a reliable data product component.
Where this lives in the wild¶
- Snowflake Cortex Analyst — BI analyst: turns a business question into governed SQL, then summarizes the result for non-SQL users.
- Databricks Genie — operations manager: maps plain requests onto lakehouse tables and explains the returned metrics.
- Google BigQuery with Gemini — data analyst: drafts SQL against warehouse schemas and then formats the receipt back into business language.
- ThoughtSpot Sage — sales leader: asks top-line questions whose answers depend on schema linking before SQL ever runs.
- Microsoft Excel Copilot — finance manager: converts natural requests into workbook operations and grounded summaries from current cells.
Pause and recall¶
- What are the five main stages in a production text-to-SQL pipeline?
- Why can syntactically valid SQL still be a serious failure?
- In the worked example, which schema-linking decisions determined the final sum of 200?
- Why is result formatting best treated as a separate step from SQL generation?
Interview Q&A¶
Q: Why separate schema linking from SQL generation instead of prompting for raw SQL directly? A: Because schema linking is the semantic alignment step; surfacing it explicitly makes table selection, metric choice, and debugging far more reliable. Common wrong answer to avoid: "Because LLMs are bad at writing SQL syntax."
Q: Why not ask one prompt to both generate SQL and explain the answer in one shot? A: Because generation, execution, and explanation fail differently, and separated stages give you control, observability, and safer retries. Common wrong answer to avoid: "Because two prompts are always cheaper than one."
Q: Why is execution metadata part of the product, not just backend plumbing? A: Row counts, nulls, timeouts, and permission failures explain whether the receipt is trustworthy and how the final answer should be phrased. Common wrong answer to avoid: "Users only care about the final sentence."
Q: Why can a valid SQL query still be worse than a syntax error? A: A syntax error is visible and recoverable, but valid SQL on the wrong join or metric produces confidently wrong business decisions. Common wrong answer to avoid: "If it runs, it is basically correct."
Apply now (5 min)¶
Exercise. Take one business question from your domain and decompose it into intent, linked tables, linked columns, filters, SQL, and answer format. Then do the arithmetic manually on a toy table, like 120 + 80 = 200, so you can feel where the receipt becomes exact.
Sketch from memory. Draw the six-box pipeline. Label where the translator works, where the phrasebook enters, where the market vendor executes, and where the receipt gets turned back into plain language. See. If you can draw that, you can design the system.
Bridge. The pipeline only works if the phrasebook is good enough for the translator to choose the right tables, keys, and metrics. → 03-schema-representation.md