03. Schema representation — how the phrasebook shapes the SQL¶
~13 min read. Before the model writes good SQL, it must first see the schema in a form it can actually use.
Built on the ELI5 in 00-eli5.md. The phrasebook decides whether the translator maps a word to the right vendor term. A thin phrasebook produces confident mistranslation.
The same database can look easy or impossible¶
A schema exists whether you show it well or badly. That is the first lesson. The model does not see your warehouse the way your data team sees it. It sees only the text you provide. Table names, column names, descriptions, examples, and relationships are all prompt design choices.
Imagine entering the market with a useless phrasebook.
It says acct_dim, gmv_adj, and st_cd.
You know these mean accounts, adjusted revenue, and state code.
The translator does not know that.
It only sees cryptic tokens.
So what happens?
Wrong mappings happen early.
Now imagine a richer phrasebook.
It says accounts_dim — one row per customer account.
It says gmv_adjusted — gross merchandise value after refunds.
It says state_code — two-letter shipping state.
Suddenly the same model looks smarter.
Nothing mystical changed.
Representation changed.
Simple, no?
raw schema text enriched schema text
┌──────────────────────┐ ┌──────────────────────────────┐
│ acct_dim │ │ accounts_dim │
│ st_cd │ │ state_code: shipping state │
│ gmv_adj │ │ gmv_adjusted: after refunds │
└──────────┬───────────┘ └──────────────┬───────────────┘
│ │
▼ ▼
weak linking stronger linking
Look. The database did not improve. The model prompt improved. That is why schema representation deserves its own file.
Three useful layers of schema context¶
Layer one is structural DDL. Table names, column names, types, primary keys, and foreign keys. This is the minimum phrasebook. It tells the translator what exists. It also tells join paths. Without it, SQL generation is blind.
Layer two is semantic description.
Short natural-language notes explain business meaning.
For example, booked_revenue may exclude pending orders.
active_subscription may mean status in billing, not usage.
These tiny notes prevent expensive misreads.
Layer three is examples.
Show one or two example rows.
Show common filters.
Show canonical joins.
Examples anchor ambiguous columns quickly.
If a column named status appears in five tables, example values help the translator choose the right one.
status = 'paid' belongs to invoices.
status = 'active' may belong to subscriptions.
Examples are very powerful for LLMs.
So what to do in practice? Do not dump the full warehouse if it is huge. Retrieve the relevant slice. But inside that slice, include all three layers when possible. Structure. Meaning. Examples. Yes?
Worked example: bare schema versus enriched schema¶
Question: "How many active enterprise subscriptions renewed in April?" Suppose we have these tables.
Bare version:
subs(sub_id, acct_id, st, renew_dt)
acct(acct_id, seg)
Enriched version:
subscriptions(sub_id, account_id, status, renewal_date)
accounts(account_id, segment)
Description: subscriptions.status uses values active, paused, cancelled.
Description: accounts.segment uses values enterprise, self_serve.
Canonical join: subscriptions.account_id = accounts.account_id.
Now add tiny rows.
subscriptions
┌────────┬────────────┬─────────┬──────────────┐
│ sub_id │ account_id │ status │ renewal_date │
├────────┼────────────┼─────────┼──────────────┤
│ 1 │ 10 │ active │ 2025-04-02 │
│ 2 │ 11 │ paused │ 2025-04-05 │
│ 3 │ 12 │ active │ 2025-04-09 │
└────────┴────────────┴─────────┴──────────────┘
accounts
┌────────────┬─────────────┐
│ account_id │ segment │
├────────────┼─────────────┤
│ 10 │ enterprise │
│ 11 │ enterprise │
│ 12 │ self_serve │
└────────────┴─────────────┘
What is the answer? Active enterprise subscriptions in April means only sub 1. Sub 2 is paused. Sub 3 is active but self_serve. So the count is 1.
With the bare schema, the translator must guess that st means status and seg means segment.
With the enriched schema, the phrasebook states it clearly.
One representation produces ambiguity.
The other produces a clean count of 1 with a far better chance.
See the leverage.
Compact context beats giant dumps¶
A common beginner move is dumping every table into the prompt. This feels safe. It is usually bad. Huge schema dumps bury the relevant objects. Token cost rises. Attention diffuses. Important join hints get lost. The translator becomes slower and strangely more confused.
Better pattern. Use retrieval or rules to shortlist candidate tables. Then send rich descriptions only for that slice. Maybe 5 tables instead of 500. Maybe 30 columns instead of 4,000. Within that slice, include key descriptions and one example row. That is a much better phrasebook.
Look at the token tradeoff. Suppose a warehouse has 200 tables. If each table takes 20 tokens, raw listing alone is 4,000 tokens. Add columns and you may hit 20,000 tokens fast. Now shortlist to 6 tables. At 120 tokens per enriched table, you spend 720 tokens. That is both cheaper and more informative. Simple arithmetic. Much better context density.
The rule is not "always smaller." The rule is "smaller and more relevant." If you omit the join table, failure returns. So shortlist carefully. Compactness is good only when recall stays high.
Practical schema package design¶
A strong schema package usually includes five things. Table purpose. Column purpose. Key relationships. Common filters. One example row or value distribution. That is enough for many production systems.
You can also include business synonyms.
MRR maps to monthly_recurring_revenue.
renewal maps to invoice_type = 'renewal' in some systems.
churned may map to status = 'cancelled' plus a date window.
These synonyms are gold.
They connect user language to warehouse language.
That is the heart of the phrasebook.
A useful checklist is this. If a smart analyst new to the company could answer the question from the schema package alone, it is probably good enough for the translator too. If the analyst would ask, "what does this column actually mean?" then your LLM will struggle the same way. Look. Prompting is not magic. It is documentation engineering.
Where this lives in the wild¶
- dbt docs — analytics engineer: enriches tables with descriptions and tests so downstream SQL assistants map business words correctly.
- Looker semantic model — BI developer: encodes metrics, joins, and field descriptions to stop users from freehanding broken SQL.
- Snowflake semantic views — data platform lead: exposes governed business meaning instead of raw cryptic warehouse columns.
- Databricks Unity Catalog — data steward: adds column descriptions and lineage so AI query tools have a richer phrasebook.
- Microsoft Fabric semantic model — finance analyst: asks plain questions against curated metric definitions, not raw lakehouse names.
Pause and recall¶
- Why can the same database feel easy or impossible depending on schema representation?
- What are the three useful layers of schema context for LLMs?
- In the worked example, why does the enriched schema make the count of 1 easier to reach?
- Why is a relevant six-table slice often better than a full 200-table dump?
Interview Q&A¶
Q: Why provide enriched schema descriptions instead of raw DDL alone? A: Because structure tells the model what exists, but semantic notes and examples tell it what those fields actually mean in business language. Common wrong answer to avoid: "Because LLMs cannot read SQL table definitions."
Q: Why not always dump the full warehouse schema for maximum recall? A: Because excessive irrelevant context dilutes attention, raises token cost, and can still hide the few tables that really matter. Common wrong answer to avoid: "More context is always safer."
Q: Why are example rows often more helpful than another paragraph of prose? A: Example values quickly resolve ambiguous columns, categorical meanings, and canonical filter patterns in a way models latch onto reliably. Common wrong answer to avoid: "Rows are unnecessary once you have column names."
Q: Why is schema representation really a product design problem, not just a prompt tweak? A: Because it encodes governed business meaning, synonyms, and join conventions that determine whether the generated SQL matches organizational truth. Common wrong answer to avoid: "Prompt cleverness matters more than documentation quality."
Apply now (5 min)¶
Exercise. Pick one table from your world and rewrite it twice. First as raw DDL only. Then as an enriched phrasebook with descriptions, key relationships, and one example row. Notice which version would help a translator map your shopping list faster.
Sketch from memory. Draw the side-by-side boxes for raw schema and enriched schema. Under them, write one sentence: structure tells what exists, semantics tell what it means. See. That sentence is the file.
Bridge. A good phrasebook helps the translator draft better SQL, but we still cannot let that draft hit the vendor blindly. → 04-sql-validation-execution.md