01. Natural-language database failure — why direct asking sounds smart and acts dumb¶
~12 min read. The first trap is treating the model like it already has your rows inside its head.
Built on the ELI5 in 00-eli5.md. The translator helps only when it actually talks to the market vendor. If it skips the vendor and invents a receipt, you get fluent nonsense.
The wrong mental picture¶
Many people imagine a big model as a magical analyst. You ask, "How many premium users churned last week?" The model answers in one shot, and it sounds fast and elegant. It is still the wrong picture.
A database is not prose. It is a live state machine full of exact rows. The model does not hold those rows directly. It holds compressed language patterns about tables, reports, and likely answers. That is useful for translation. It is weak for exact counting.
See the market analogy. The shopping list says, "Count last week's churned premium users." The market vendor can open the ledger and count. The translator cannot honestly do that from memory. Not unless the ledger was tiny and memorized exactly. Enterprise data is never like that.
user question
│
▼
┌───────────────┐
│ translator │
│ weights only │
└──────┬────────┘
│
├──→ if asked for style or rough wording
│ ▼
│ maybe useful
│
└──→ if asked for live counts or joins
▼
polished guess
│
▼
wrong imaginary receipt
Now what is the real issue? Freshness, arithmetic, joins, and permissions all matter together. The model can fail on all four at once. Simple, no?
Why live rows beat memorized patterns¶
A trained model may have seen many sales tables and common business patterns. It may know refunds often dip on weekends and active users often exceed paid users. Those are patterns, not your ledger.
Suppose your finance table changes every hour. A new refund, a corrected invoice, and a fixed region tag may all land within minutes. The database knows this immediately. The model does not. Its internal memory is stale by design.
Even worse, many questions are relational. "Top three sellers in cities where churn rose after discount removal" needs joins, filters, ordering, and aggregation. That is a procedural job. The phrasebook must name the tables and keys. The market vendor must execute the plan. The receipt must come back with exact rows.
Look at the contrast. A model can explain churn and draft a polite summary of churn risk. A model cannot responsibly invent yesterday's churn count. That is like a translator guessing your grocery bill without asking the cashier.
So what to do? Use the model as a mapper into executable operations. Do not let it impersonate the execution engine. Yes?
A worked failure with numbers¶
Take this tiny orders table.
We will ask one plain question.
Then we will compare guessing with execution.
orders
┌────────┬──────────┬────────┬─────────┐
│ order │ tier │ region │ amount │
├────────┼──────────┼────────┼─────────┤
│ 101 │ premium │ west │ 120 │
│ 102 │ premium │ west │ 180 │
│ 103 │ basic │ east │ 90 │
│ 104 │ premium │ west │ 210 │
└────────┴──────────┴────────┴─────────┘
Question: "What is total premium revenue in the west?"
The exact computation is simple. Filter premium rows in west. That keeps amounts 120, 180, and 210. Now add them. 120 + 180 = 300. 300 + 210 = 510. So the correct answer is 510.
A direct-answer model might say 500, 540, or "about 5 hundred."
Why?
Because it is pattern-completing.
It sees premium revenue, west region, and a familiar small-table style.
It does not see a verified SUM result.
Now add one fresh row. Order 105, tier premium, region west, amount 60. The new total is 570. The database knows instantly. The model still has no built-in path to that new row. Without execution, freshness is broken. Without execution, arithmetic is guesswork.
See how brutal this is. One absent vendor call, and the translator invents a receipt.
The second failure is hidden: false authority¶
Direct answers are dangerous because they do not look broken. There is no stack trace, syntax error, or red light saying, "not grounded." You just get a neat sentence.
That neatness changes user behavior. If the answer is fluent, people trust it early. If the answer names a table or metric, people trust it more. If the answer carries a business tone, people trust it even more. Still wrong. This is why database hallucination is expensive.
There is also a governance issue. A real database checks permissions. The model alone does not know which rows the caller may access. A support agent may see only one account. A finance lead may see all accounts. The same natural-language question can require different SQL scopes. Only the controlled execution path can enforce that.
Look. The translator is good at intent. The market vendor is good at truth. Mix the jobs and you lose both. The translator becomes overconfident and the vendor becomes unused. Bad system design.
What the system must do instead¶
The fix is not, "use a bigger model." The fix is, "insert a proper translation pipeline." First understand the request. Then inspect the phrasebook. Then generate SQL. Then validate it. Then let the market vendor execute it. Then return the receipt in user language.
That pipeline separates concerns. Language understanding stays with the model. Data truth stays with the database. Formatting stays with the answer layer. Permissions stay with the execution environment. Retries stay inside controlled haggling loops.
A useful decision rule is this. If the answer depends on live rows, joins, filters, counts, or permissions, do not answer directly. Translate. Execute. Verify. Then speak. Simple, no?
shopping list ──→ translator ──→ SQL plan ──→ market vendor
│
▼
checked execution
│
▼
exact receipt
│
▼
natural-language answer
Next we build that pipeline carefully. Because once you agree not to guess, the real engineering starts. Schema linking starts. SQL generation starts. Execution control starts. And now the phrasebook quality suddenly matters a lot.
Where this lives in the wild¶
- Snowflake Cortex Analyst — finance analyst: asks warehouse questions in English, but the product still has to run governed SQL on live tables.
- Databricks Genie — operations analyst: chats with lakehouse data, where freshness and joins matter more than fluent wording.
- Microsoft Excel Copilot — sales manager: asks workbook questions whose answers depend on current cells, not generic spreadsheet patterns.
- Salesforce Einstein Copilot — revenue ops lead: wants CRM summaries that must respect the latest account rows and permissions.
- ThoughtSpot Sage — business analyst: gets natural-language BI, but trusted answers still come from database execution.
Pause and recall¶
- Why is a model's memorized pattern space a bad replacement for live table rows?
- In the worked example, why does adding one new order immediately break direct answering?
- Why is a fluent wrong answer more dangerous than a visible SQL error?
- What job should stay with the translator, and what job should stay with the market vendor?
Interview Q&A¶
Q: Why use text-to-SQL and execution instead of asking the LLM directly for the answer? A: Because live rows, joins, arithmetic, permissions, and freshness belong to the database engine, while the LLM only maps intent into an executable form. Common wrong answer to avoid: "A bigger model will eventually memorize the warehouse."
Q: Why is database hallucination often worse than a compiler error? A: Because the hallucinated answer looks polished and final, so users may act on it without noticing there was no grounded execution step. Common wrong answer to avoid: "Because SQL is harder than Python."
Q: Why is freshness not the only reason direct natural-language database answers fail? A: Exact aggregation, relational joins, row-level permissions, and policy enforcement also require controlled execution beyond model memory. Common wrong answer to avoid: "The only issue is that training data gets old."
Q: Why keep the LLM in the system at all if the database does the real computation? A: The LLM is still valuable for understanding intent, mapping business words to schema terms, drafting SQL, and summarizing the receipt clearly. Common wrong answer to avoid: "If execution matters, the LLM adds no value."
Apply now (5 min)¶
Exercise. Write one business question that needs a count, one that needs a join, and one that needs permissions. For each, mark which part is language understanding and which part must be true database execution. Then compute one tiny answer by hand, like 120 + 180 + 210 = 510, to feel the receipt step.
Sketch from memory. Draw the bad flow and the good flow. Bad flow: shopping list → translator → imaginary receipt. Good flow: shopping list → translator → market vendor → real receipt. Look. That picture is the whole file.
Bridge. Once we stop asking the model to fake the answer, we need a disciplined path from plain language to executable SQL. → 02-text-to-sql-pipeline.md