05. Indexing and Query Plans — Build the right card catalog for the right search¶

~16 min read. A bad index turns one quick lookup into a full hallway walk.

Built on the ELI5 in 00-eli5.md. The card catalog — building the right catalog for the right queries — now becomes index design and EXPLAIN reading.

1) An index is a card catalog, not a magic speed sticker¶

See. A table stores the full books. The card catalog stores shortcuts to find them quickly. Without an index, the database may scan every row. With the right index, it jumps near the answer. Most common database indexes are B-tree indexes. They support equality, range, ordering, and prefix matching well. Hash indexes are narrower. They help equality lookups only. That is the first distinction.

query without catalog
row 1 → row 2 → row 3 → ... → row 10,000,000

query with card catalog
search key → index leaf → matching row location

Worked example. Suppose orders has 10 million rows. A support query asks for order_id = 8827319. A full scan may inspect millions of rows. A B-tree index finds the matching leaf in a few page reads. Latency may drop from 1.8 seconds to 6 milliseconds. That is not small. That is product-visible. But do not worship indexes blindly. Every index adds storage and write cost. Insert one row. You now update table data plus every relevant index. So what to do? Index for repeated important queries, not for emotional comfort.

2) B-tree, hash, composite, and covering indexes solve different search paths¶

A B-tree index on email helps WHERE email = ?. It also helps ranges like WHERE created_at >= ?. A hash index mainly helps exact equality. Now composite indexes matter even more in real systems. An index on (user_id, created_at) is not the same as two single indexes. Why? Because order matters. The leftmost prefix rule matters. That index helps queries on user_id alone. It also helps user_id plus created_at range queries. It usually does not help created_at alone enough. Simple, no?

index: (user_id, created_at)
works well for:
- user_id = 42
- user_id = 42 AND created_at > t1
less useful for:
- created_at > t1

Worked example. Suppose a feed table stores 200 million events. Common query is “latest 20 events for user 42”. Index (user_id, created_at desc) is excellent. Two separate indexes force extra work. The composite index matches the sort and filter together. Now covering indexes. If the index contains all needed columns, the engine may avoid table lookups. Example index (user_id, created_at desc, status). Query asks only for created_at and status. That can be satisfied from index pages alone in some engines. Reads become faster because fewer pages are touched. Again, there is a tax. The wider the index, the heavier the writes. That is why index design is a workload trade, not a purity contest.

3) EXPLAIN tells you how the database plans the search¶

Many engineers create indexes but never read plans. That is like buying maps and never checking the route. EXPLAIN shows whether the optimizer scans, seeks, sorts, or joins expensively. Important words differ by database, but the logic is similar. Look for these clues. - sequential scan or full table scan - index scan or index seek - rows estimated - filter versus index condition - sort step or filesort - nested loop versus hash join shape Worked example. Query:

SELECT order_id, status
FROM orders
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;

Bad plan signs. The optimizer scans 8 million rows, then sorts a huge result. Good plan signs. The optimizer uses index (user_id, created_at desc) and stops after 20 rows.

Bad EXPLAIN idea
Seq Scan on orders  rows=8000000
Sort on created_at

Better EXPLAIN idea
Index Scan using idx_orders_user_created on orders rows=20

Why do estimates matter? Because wrong estimates produce wrong plans. If statistics think user_id = 42 returns half the table, a scan may seem cheaper. If reality is 20 rows, that plan is terrible. So what to do? Read EXPLAIN before and after adding indexes. Check actual rows when the database supports EXPLAIN ANALYZE. Plans are the ground truth for optimizer behavior.

4) Wrong indexes quietly hurt write throughput and memory budgets¶

Here is the trap. Teams add indexes for every dashboard, export, and curious analyst. Soon inserts slow down and storage explodes. Why? Because each index is another maintained data structure. Suppose orders receives 5,000 inserts per second. Each row is 1 KB. Base write volume is 5 MB per second. Now add five secondary indexes averaging 300 bytes each. Effective indexed write volume becomes about 6.5 MB per second before engine overhead. That is already a 30% increase. Add page splits, WAL, and vacuum cost. The real write burden climbs further. Worked example. A team adds indexes on status, city, created_at, coupon_code, and salesperson_name separately. Only two are used daily. Three sit idle but still tax every insert and update. This is why unused-index review matters. Another trap is low-selectivity indexes. An index on status with values only OPEN and CLOSED may help little. If half the table matches, scanning may still be cheaper. Better design might be (status, created_at) for a frequent dashboard. Or maybe a partial index on status = 'OPEN'. Simple, no? The right index matches the exact query shape. The wrong index matches a column name only.

5) Index selection is really query selection and discipline¶

Before creating an index, say the query aloud. What is the filter? What is the sort? What columns are returned? How selective is the predicate? How often does this query run? How often does the table get written? That checklist prevents lazy indexing. Use one index to serve several frequent shapes when possible. Prefer composite indexes matching your most valuable path. Drop indexes that nobody uses. Measure before and after.

index checklist
filter → sort → returned columns → selectivity → write cost

One more worked example. Table size is 50 million support tickets. Hot query is “open tickets for tenant 88 by newest first”. Index on (tenant_id, status, created_at desc) is strong. Index on created_at alone is weak. Index on status alone is noisy. Index on (status, tenant_id) may still force sorting. The best index follows the exact search path. That is the card catalog mindset. Catalog by how people actually search. Not by how columns look in a schema diagram.

6) Partial and covering indexes are precision tools¶

Sometimes the table is huge, but the hot slice is tiny. Then a partial index can be beautiful. Example. Only 3% of support tickets are OPEN. A partial index on open tickets is smaller and cheaper than indexing everything. Covering indexes are similar precision tools. They make a hot query cheap by carrying exactly the needed columns. Use them when one query repeats constantly and latency really matters. Do not create them casually for every report. The card catalog should stay purposeful, not bloated.

Where this lives in the wild¶

GitHub backend engineer needs strong indexes for issue lists, repository lookups, and newest-first activity views under constant writes.
Stripe support-tools engineer depends on composite indexes so user, payment, and dispute screens open quickly during live support calls.
Swiggy operations engineer tunes order and delivery indexes because wrong plans hurt dispatch dashboards immediately.
LinkedIn feed engineer benefits from indexes aligned to member, time, and ranking filters rather than random column popularity.
Amazon marketplace engineer reviews unused indexes aggressively because catalog writes and updates never stop.

Pause and recall¶

Why is an index more like a card catalog than a duplicate table?
What does the leftmost prefix rule mean for a composite index?
What EXPLAIN clues usually reveal a bad plan quickly?
Why can an unused index still be harmful every single day?

Interview Q&A¶

Q: Why choose a composite index instead of two single-column indexes? A: Because one composite structure can match filtering and ordering together, reducing extra sorting and row lookups for the real query path. Common wrong answer to avoid: “More separate indexes are always more flexible.” Q: Why might an index on status alone be weak? A: Because low-selectivity predicates return too much of the table, so the optimizer may still prefer scanning instead of bouncing through the index. Common wrong answer to avoid: “Any indexed column automatically speeds every query using it.” Q: Why read EXPLAIN after adding an index? A: Because the optimizer may still ignore it, estimate rows badly, or choose a scan plus sort plan you did not expect. Common wrong answer to avoid: “If the index exists, the database will definitely use it.” Q: What is a covering index in practical terms? A: It is an index carrying enough columns to answer the query without visiting the main table for additional values. Common wrong answer to avoid: “It just means the index includes every column in the table.”

Apply now (5 min)¶

Exercise: Take one familiar query from work or practice. Write its filter columns, sort columns, and returned columns. Then propose one index and defend the column order. Sketch from memory: Draw a table on one side and a card catalog on the other. Show how one bad query scans everything and one good query jumps directly. Then write one sentence on the write cost of that catalog.

Bridge. Fast lookup is not enough when many transactions race together. Next we study isolation, locks, and why correctness under concurrency is hard. → 06-transactions-and-isolation.md