05. Tabular reasoning — when the receipt itself becomes a small puzzle

~13 min read. Getting rows back is not the end. The model still has to read the table without drifting.

Built on the ELI5 in 00-eli5.md. The receipt from the market vendor may be a table, not one neat number. Now the translator must reason over rows and columns carefully.

A table is not just text with pipes

LLMs read tokens in sequence. Tables are grids. That mismatch matters. In a grid, columns define stable meaning. Rows define entities or events. Aggregation jumps across rows. Comparison jumps across columns. Text models do not get this structure for free.

So what happens in practice? A model often reads the top row well. It may confuse later rows. It may compare the wrong column. It may average values that should be summed. It may ignore a filter hidden in one header. See. The problem is not intelligence alone. The problem is representation plus attention.
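To make the mismatch concrete, here is a minimal sketch. The `linearize` helper and the sample rows are hypothetical, and "position" here means cell position, not real model tokens. It flattens a grid into the pipe-delimited text a model reads, then measures how far a value sits from its own header once the grid is gone.

```python
# Hypothetical sample table, used only for illustration.
HEADERS = ["acct", "region", "churned"]
ROWS = [
    ["A1", "APAC", "yes"],
    ["A2", "APAC", "no"],
    ["A3", "EMEA", "yes"],
]

def linearize(headers, rows):
    """Flatten a grid into the pipe-delimited text a model reads left to right."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)

flat = linearize(HEADERS, ROWS)
print(flat)

# Treat each cell as one position in the sequence (a simplification).
cells = flat.replace("\n", " | ").split(" | ")

# In the grid, "churned" sits directly above A3's "yes".
# In the sequence, they are separated by every cell in between.
header_pos = cells.index("churned")
cell_pos = len(cells) - 1  # last cell: A3's churned value
print(cell_pos - header_pos)  # 9
```

Nothing in the flat string labels the last "yes" as a churned value. That link has to be recovered by counting delimiters, and the gap grows with every extra row.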

row view                        column view
┌──────┬────────┬─────────┐    ┌────────┐
│ acct │ region │ churned │    │ region │
├──────┼────────┼─────────┤    ├────────┤
│ A1   │ APAC   │ yes     │    │ APAC   │
│ A2   │ APAC   │ no      │    │ APAC   │
│ A3   │ EMEA   │ yes     │    │ EMEA   │
└──────┴────────┴─────────┘    └────────┘
    one entity at a time          one attribute down the grid

If you want a model to reason reliably, tell it which axis matters. Is this a lookup task? A sort task? A group-by task? A ratio task? Simple, no?

Row reasoning, column reasoning, and aggregation are different jobs

Row reasoning asks, "Which record matches these conditions?" Column reasoning asks, "What does this field mean across all rows?" Aggregation asks, "What happens when we combine many rows into one number?" These three jobs feel similar in English. They are different computationally.

Consider a support table. If you ask, "Which ticket is assigned to Priya and marked urgent?" that is row selection. If you ask, "What values appear in the priority column?" that is column inspection. If you ask, "How many urgent tickets are in each queue?" that is grouped aggregation. A strong prompt or tool path should distinguish them.
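The three jobs can be sketched side by side. The ticket data, field names, and queue names below are invented for illustration; only the three question types come from the text.

```python
from collections import Counter

# Hypothetical support tickets; field names and values are illustrative only.
tickets = [
    {"id": "T1", "assignee": "Priya", "priority": "urgent", "queue": "billing"},
    {"id": "T2", "assignee": "Priya", "priority": "low", "queue": "billing"},
    {"id": "T3", "assignee": "Omar", "priority": "urgent", "queue": "billing"},
    {"id": "T4", "assignee": "Omar", "priority": "urgent", "queue": "login"},
]

# Row selection: which records match every condition?
priya_urgent = [
    t["id"] for t in tickets
    if t["assignee"] == "Priya" and t["priority"] == "urgent"
]

# Column inspection: which values does one field take across all rows?
priority_values = sorted({t["priority"] for t in tickets})

# Grouped aggregation: combine many rows into one number per group.
urgent_per_queue = Counter(
    t["queue"] for t in tickets if t["priority"] == "urgent"
)

print(priya_urgent)            # ['T1']
print(priority_values)         # ['low', 'urgent']
print(dict(urgent_per_queue))  # {'billing': 2, 'login': 1}
```

Each answer has a different shape: a list of rows, a set of values, and a mapping from group to count. That difference in shape is a quick check on whether the right job was done.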

The shopping list often hides the operation word. Users say, "Show me the slow cities." Do they want the city with the single worst delay? The average delay by city? Every city above a threshold? The translator must infer the right table operation. This is why direct answer generation over tables can still wobble.
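A small sketch shows why the ambiguity matters. The delay data and the 30-minute threshold are assumptions made up for this example. The same request, read three ways, gives three different answers.

```python
from collections import defaultdict

# Hypothetical delivery delays in minutes; data and threshold are assumptions.
deliveries = [
    ("Pune", 10), ("Pune", 50),
    ("Delhi", 35), ("Delhi", 35),
    ("Kochi", 60), ("Kochi", 5),
]

by_city = defaultdict(list)
for city, delay in deliveries:
    by_city[city].append(delay)

# Reading 1: the city that contains the single worst delay.
worst_single = max(deliveries, key=lambda d: d[1])[0]

# Reading 2: the city with the highest average delay.
avg_by_city = {city: sum(v) / len(v) for city, v in by_city.items()}
worst_average = max(avg_by_city, key=avg_by_city.get)

# Reading 3: every city whose average delay exceeds a threshold.
THRESHOLD = 30
above_threshold = sorted(c for c, avg in avg_by_city.items() if avg > THRESHOLD)

print(worst_single)     # Kochi
print(worst_average)    # Delhi
print(above_threshold)  # ['Delhi', 'Kochi']
```

When readings disagree like this, it is usually safer to ask the user or to state the chosen reading in the answer.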

Look. One helpful practice is converting the receipt into an explicit intermediate form. Maybe restate the headers. Maybe normalize types. Maybe ask the model to reason step by step. Maybe offload the actual aggregation back to Python or SQL. The right split depends on risk.
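One possible intermediate form is a list of typed records with clean header names. This is only a sketch: the messy raw input, the `normalize_header` and `to_bool` helpers, and the `CONVERTERS` mapping are all hypothetical.

```python
# Hypothetical raw rows as a tool might return them: every cell is a string.
raw_headers = [" Acct ", "Region", "Churned?"]
raw_rows = [["A1", "apac", "Yes"], ["A2", "APAC ", "no"]]

def normalize_header(name):
    """Trim, lowercase, and drop punctuation so headers become stable keys."""
    return "".join(ch for ch in name.strip().lower() if ch.isalnum() or ch == "_")

def to_bool(text):
    """Map yes/no text to a real boolean; fail loudly on anything else."""
    value = text.strip().lower()
    if value not in {"yes", "no"}:
        raise ValueError(f"unexpected churn flag: {text!r}")
    return value == "yes"

# One converter per normalized header (an assumed schema for this table).
CONVERTERS = {
    "acct": str.strip,
    "region": lambda s: s.strip().upper(),
    "churned": to_bool,
}

headers = [normalize_header(h) for h in raw_headers]
records = [
    {h: CONVERTERS[h](cell) for h, cell in zip(headers, row)}
    for row in raw_rows
]

print(headers)  # ['acct', 'region', 'churned']
print(records)
```

After this step, "apac" and "APAC " are the same region, and churn is a boolean that code can count directly.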

Worked numerical example: regional churn rate

Use this tiny receipt. We will compute one grouped rate.

accounts
┌──────┬────────┬─────────┐
│ acct │ region │ churned │
├──────┼────────┼─────────┤
│ A1   │ APAC   │ yes     │
│ A2   │ APAC   │ no      │
│ A3   │ APAC   │ yes     │
│ A4   │ EMEA   │ no      │
│ A5   │ EMEA   │ yes     │
└──────┴────────┴─────────┘

Question: "What is the churn rate in APAC?"

Step 1. Select APAC rows. That gives A1, A2, and A3. So the denominator is 3 accounts.

Step 2. Count churned = yes inside APAC. A1 is yes. A2 is no. A3 is yes. So the numerator is 2.

Step 3. Compute the rate. 2 / 3 = 0.666... Multiply by 100 if you want percent. That gives 66.7% after rounding.

Now notice the likely model mistakes. It may divide by all 5 rows and say 40%. It may count every yes in the table and use 3 as the numerator. It may miss the APAC filter entirely and answer global churn, 3 / 5 = 60%. This is exactly why receipt reasoning needs structure. Yes?
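The three steps, and the slips just listed, can be reproduced in a few lines. This is a plain-Python sketch over the five-row receipt; the variable names are illustrative.

```python
accounts = [
    ("A1", "APAC", "yes"),
    ("A2", "APAC", "no"),
    ("A3", "APAC", "yes"),
    ("A4", "EMEA", "no"),
    ("A5", "EMEA", "yes"),
]

# Step 1: select APAC rows (this fixes the denominator).
apac = [row for row in accounts if row[1] == "APAC"]
# Step 2: count churned = yes inside that subset (the numerator).
apac_churned = [row for row in apac if row[2] == "yes"]
# Step 3: compute the rate.
correct = len(apac_churned) / len(apac)

# The slips, reproduced on purpose.
wrong_denominator = len(apac_churned) / len(accounts)    # 2 / 5
all_yes = sum(1 for row in accounts if row[2] == "yes")  # counts EMEA too
global_churn = all_yes / len(accounts)                   # filter dropped

print(f"{correct:.1%}")            # 66.7%
print(f"{wrong_denominator:.0%}")  # 40%
print(all_yes)                     # 3
print(f"{global_churn:.0%}")       # 60%
```

Every wrong answer comes from a valid-looking computation. Only the filter or the denominator changed.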

What LLMs do well and where they still slip

LLMs are decent at small-table lookup. They are often decent at simple comparisons. They can summarize obvious trends. They can explain what a pivot table means in plain language. That is useful. Do not underrate it.

They slip when tables get wider, longer, or more nested. Merged headers hurt. Sparse cells hurt. Percentages plus counts hurt. Multi-step filters hurt. Ratios over subsets hurt. Sorting with ties hurts. And if the receipt is one screenshot instead of structured data, errors jump again.

So what to do? Let models explain. Let deterministic tools calculate when stakes rise. For instance, if the answer is one exact KPI, compute it in SQL or Python first. Then ask the model to narrate the result. That keeps the translator from freehanding arithmetic.
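A minimal sketch of "compute first, narrate second", using Python's built-in `sqlite3` on the churn receipt. The prompt wording is an assumption, and no model API is called; the string is simply what would be handed to the translator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acct TEXT, region TEXT, churned TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("A1", "APAC", "yes"), ("A2", "APAC", "no"), ("A3", "APAC", "yes"),
     ("A4", "EMEA", "no"), ("A5", "EMEA", "yes")],
)

# The exact KPI is computed by the database, not by the model.
# In SQLite, (churned = 'yes') evaluates to 1 or 0, so SUM counts the matches.
churned, total = conn.execute(
    "SELECT SUM(churned = 'yes'), COUNT(*) FROM accounts WHERE region = ?",
    ("APAC",),
).fetchone()
conn.close()

rate_pct = round(100 * churned / total, 1)

# The model only receives finished numbers and is asked to narrate them.
narration_prompt = (
    f"APAC churn: {churned} of {total} accounts churned ({rate_pct}%). "
    "Explain this result in one plain sentence. Do not recompute any number."
)
print(narration_prompt)
```

The model never sees the raw rows here, so it has no chance to pick the wrong denominator.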

Another safe pattern is explicit decomposition. Ask for subset rows first. Then ask for numerator and denominator. Then ask for the final statement. This feels slower. It is often much more reliable. See how the back-and-forth of haggling becomes a reasoning scaffold here.
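Staged answers are also easier to check. Below is a sketch of one way to audit them, assuming the model returns its subset, counts, and rate as separate fields. The `audit_stages` function and the two sample answers are hypothetical.

```python
# Account -> (region, churned), copied from the churn receipt.
ACCOUNTS = {
    "A1": ("APAC", "yes"),
    "A2": ("APAC", "no"),
    "A3": ("APAC", "yes"),
    "A4": ("EMEA", "no"),
    "A5": ("EMEA", "yes"),
}

def audit_stages(table, region, claimed):
    """Check each stage in order and name the first one that fails."""
    subset = sorted(a for a, (r, _) in table.items() if r == region)
    if sorted(claimed["subset"]) != subset:
        return "subset"
    numerator = sum(1 for a in subset if table[a][1] == "yes")
    if claimed["numerator"] != numerator or claimed["denominator"] != len(subset):
        return "counts"
    if abs(claimed["rate"] - numerator / len(subset)) > 1e-3:
        return "rate"
    return "ok"

# Hypothetical staged answers, as a model might return them.
good = {"subset": ["A1", "A2", "A3"], "numerator": 2, "denominator": 3, "rate": 0.667}
bad = {"subset": ["A1", "A2", "A3"], "numerator": 2, "denominator": 5, "rate": 0.4}

print(audit_stages(ACCOUNTS, "APAC", good))  # ok
print(audit_stages(ACCOUNTS, "APAC", bad))   # counts
```

With a one-shot answer of "40%", there is nothing to inspect. With stages, the audit points straight at the denominator.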

Design patterns for table-aware systems

Pattern one is schema-plus-table prompting. Restate headers, types, and units before values. That helps the model orient itself. Pattern two is program-aided reasoning. Use the LLM to write the operation. Use code to execute it. Pattern three is cell citation. Have the model point to rows or cells that support the answer.
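Patterns one and three can be sketched together. The layout below is one possible rendering, not a standard format. The `mrr` column, the `[r1]` row ids, and both helper functions are assumptions added for illustration.

```python
# (name, type, unit) per column. Unit is None when it does not apply.
COLUMNS = [
    ("acct", "text", None),
    ("region", "text", None),
    ("churned", "yes/no", None),
    ("mrr", "number", "USD per month"),  # hypothetical column, added to show a unit
]
ROWS = [
    ("A1", "APAC", "yes", 120),
    ("A2", "APAC", "no", 80),
    ("A3", "APAC", "yes", 200),
]

def schema_plus_table(name, columns, rows):
    """Render headers, types, and units first, then rows with citable ids."""
    lines = [f"Table: {name}", "Columns:"]
    for col, typ, unit in columns:
        suffix = f", unit: {unit}" if unit else ""
        lines.append(f"- {col} ({typ}{suffix})")
    lines.append("Rows:")
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(
            f"{col}={value}" for (col, _, _), value in zip(columns, row)
        )
        lines.append(f"[r{i}] {cells}")
    return "\n".join(lines)

def resolve_citation(columns, rows, row_id, column):
    """Look up a cited cell such as ("r3", "churned") so a claim can be checked."""
    names = [col for col, _, _ in columns]
    return rows[int(row_id[1:]) - 1][names.index(column)]

prompt = schema_plus_table("accounts", COLUMNS, ROWS)
print(prompt)
print(resolve_citation(COLUMNS, ROWS, "r3", "churned"))  # yes
```

Repeating the column name in every cell costs tokens, but it means the model never has to count pipes to know what a value is. The row ids give it something concrete to cite.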

A tiny guardrail is unit normalization. If one column is rupees and another is dollars, say it loudly. If one value is percentage points and another is percent, say it loudly. Many table mistakes are really unit mistakes. That is not deep reasoning failure. That is sloppy receipt design.
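A short sketch of the rupees-and-dollars case. The exchange rate is a placeholder chosen to keep the arithmetic easy, not a real quote, and the amounts are invented.

```python
# Placeholder rate for illustration only; not a real exchange rate.
ASSUMED_INR_PER_USD = 80.0

def to_usd(amount, unit):
    """Convert an amount to USD, refusing units that are not declared."""
    if unit == "USD":
        return amount
    if unit == "INR":
        return amount / ASSUMED_INR_PER_USD
    raise ValueError(f"unknown unit: {unit!r}")

# Hypothetical revenue cells, each carrying its unit explicitly.
revenue = [(4000.0, "INR"), (25.0, "USD")]

# Naive sum: adds rupees to dollars and produces a meaningless number.
naive_total = sum(amount for amount, _ in revenue)

# Normalized sum: convert first, then add.
total_usd = sum(to_usd(amount, unit) for amount, unit in revenue)

print(naive_total)  # 4025.0
print(total_usd)    # 75.0
```

The naive total is the kind of answer a model gives when units live only in someone's head. Carrying the unit next to each value makes the mistake hard to commit silently.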

receipt table ──→ normalize headers ──→ choose operation
                                      │
                                      ├──→ lookup? use row selection
                                      ├──→ aggregate? use code or SQL
                                      └──→ summary? use translator prose
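The flow above can be sketched as a small dispatcher. The handler names and the summary prompt are hypothetical, and the operation label is assumed to be already chosen, whether by a user, a rule, or a model.

```python
ACCOUNTS = [
    {"acct": "A1", "region": "APAC", "churned": True},
    {"acct": "A2", "region": "APAC", "churned": False},
    {"acct": "A3", "region": "APAC", "churned": True},
    {"acct": "A4", "region": "EMEA", "churned": False},
    {"acct": "A5", "region": "EMEA", "churned": True},
]

def lookup(table, acct):
    """Row selection: return the single matching record, or None."""
    return next((row for row in table if row["acct"] == acct), None)

def aggregate(table, region):
    """Deterministic arithmetic: churn rate for one region."""
    subset = [row for row in table if row["region"] == region]
    if not subset:
        raise ValueError(f"no rows for region {region!r}")
    return sum(row["churned"] for row in subset) / len(subset)

def summary(table):
    """Prose branch: build a prompt for the model. No model is called here."""
    return f"Summarize the main pattern in this {len(table)}-row accounts table."

HANDLERS = {"lookup": lookup, "aggregate": aggregate, "summary": summary}

def route(operation, table, **kwargs):
    """Send each operation type down its own branch; reject unknown ones."""
    if operation not in HANDLERS:
        raise ValueError(f"unknown operation: {operation!r}")
    return HANDLERS[operation](table, **kwargs)

print(route("lookup", ACCOUNTS, acct="A4"))
print(route("aggregate", ACCOUNTS, region="EMEA"))  # 0.5
print(route("summary", ACCOUNTS))
```

Only the summary branch produces text for the model. The other two return values that code has already pinned down.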

So the mature view is balanced. LLMs are useful table readers. They are not trustworthy spreadsheet engines by default. Use them where language helps. Use deterministic computation where exactness matters. Simple, no?

Where this lives in the wild

ChatGPT Advanced Data Analysis — strategy analyst: uploads CSVs and gets summaries, but exact metrics still benefit from Python execution.
Microsoft Excel Copilot — finance lead: asks workbook questions where row filters and percentage calculations must be grounded.
Tableau Pulse — revenue manager: gets narrative insights layered on top of deterministic dashboard computations.
Power BI Copilot — operations analyst: turns table selections into natural-language explanations of grouped metrics.
Hex notebooks with AI — data scientist: mixes model explanations with code execution when tables become too complex for freehand reasoning.

Pause and recall

Why are row reasoning, column reasoning, and aggregation different jobs for an LLM?
In the churn example, which numerator and denominator produced 66.7%?
Name two ways a model could answer the APAC churn question incorrectly.
Why is program-aided table reasoning often safer than pure free-text reasoning?

Interview Q&A

Q: Why use an LLM to explain a table but not trust it as the calculator for every KPI?
A: Because language interpretation and narrative synthesis are strengths, while exact grouped arithmetic and subset accounting are better handled deterministically.
Common wrong answer to avoid: "Because LLMs cannot read tables at all."

Q: Why can a tiny five-row table still fool a strong model?
A: The model may still confuse the operative filter, denominator, or aggregation axis because token order does not naturally encode grid structure.
Common wrong answer to avoid: "Errors only happen on million-row datasets."

Q: Why is explicit decomposition often better than one-shot table reasoning?
A: Breaking the task into subset selection, counting, and final synthesis reduces hidden leaps and makes errors auditable.
Common wrong answer to avoid: "One-shot is always better because it preserves more context."

Q: Why are unit annotations and header normalization part of reasoning quality?
A: Many apparent reasoning failures come from ambiguous fields, mixed units, or unclear header semantics rather than weak model logic alone.
Common wrong answer to avoid: "Reasoning quality depends only on model size."

Apply now (5 min)

Exercise. Make a five-row table from your domain and ask one lookup, one group-by, and one ratio question. Compute the ratio by hand, just as 2 divided by 3 gives 66.7%. Then mark which questions you would trust to free-text reasoning and which you would route to code.

Sketch from memory. Draw the flow from receipt table to normalized headers to operation choice. Under it, write lookup, aggregate, and summary as three separate branches. Look. That is your table reasoning map.

Bridge. Tables are one strict language. Code is another. Next we study how translators fill gaps inside code itself. → 06-code-completion-models.md