04. Storage Engines: B-Tree and LSM — How the shelf is physically built changes everything¶

~16 min read. The data model is visible; the storage engine silently decides performance.

Built on the ELI5 in 00-eli5.md. The shelf — how the shelf is physically built changes everything — now becomes B-tree and LSM behavior on disk.

1) B-trees keep data in ordered pages for efficient reads and updates¶

See. A B-tree organizes keys in sorted pages. Each page points to child pages below it. You walk from root to leaf until the key range matches. Because pages hold many keys, tree height stays small. That keeps reads efficient on disk.

        [40 | 80]
        /    |     [10|20|30] [50|60] [90|100]

Databases like PostgreSQL and InnoDB rely heavily on B-tree structures. Why do they work well? Ordered keys support equality lookup, range scans, and sorted traversal. That is a huge practical advantage. Worked example. Suppose one page stores 200 keys. A table has 8 million rows. Level one holds 200 pointers. Level two covers 40,000 keys. Level three covers 8 million keys. So a three-level tree can already reach the full dataset. That means a lookup touches only a few pages. Simple, no? B-trees also allow in-place updates. If one row value changes, the engine can modify the relevant page. That feels natural for read-heavy transactional systems. But there is pressure. Pages split when they fill. Random writes can scatter I/O. Maintaining sorted order costs work during heavy write bursts. Still, for many OLTP systems, the read behavior is excellent.

2) LSM trees absorb writes first, then organize later¶

An LSM tree thinks differently. It says writes should be cheap now, cleanup later. New writes go to an in-memory memtable first. They also go to the WAL for durability. When the memtable fills, it flushes as an immutable sorted file. Those files are called SSTables in many systems.

write path
app write → WAL append → memtable
memtable full → flush SSTable
many SSTables → compaction merge

This is great for write-heavy workloads. Appending to a log and memory is cheaper than random in-place page edits. That is why RocksDB, Cassandra, LevelDB, and similar systems love this family. Worked example. Assume each second brings 50,000 writes. Each write is 1 KB. Raw ingest is 50 MB per second. Appending 50 MB sequentially is disk-friendly. Doing 50,000 random page updates is harder. That is the heart of the LSM advantage. But nothing is free. Now reads may need to check memtable plus multiple SSTables. Old versions must be cleaned through compaction. Space temporarily grows during merges. Bloom filters reduce unnecessary file checks for many point reads. That is why LSM systems often pair sorted files with probabilistic helpers. So what to do? Choose LSM when write throughput and sequential I/O matter more than pristine point-read simplicity. Choose B-tree when balanced reads, updates, and range queries dominate.

3) Write amplification and read amplification are the hidden taxes¶

Interviewers love these phrases. You should explain them concretely. Write amplification means one logical write causes several physical writes. Read amplification means one logical read touches several places. In B-trees, write amplification appears through page splits, index maintenance, and WAL writes. In LSM trees, write amplification appears through repeated compaction across levels. In LSM trees, read amplification appears because one key may exist in many files. Worked example. Suppose an LSM system ingests 100 GB daily. During compaction, each byte may be rewritten 8 times overall. Then physical write volume becomes about 800 GB daily. That is write amplification. Now a point lookup checks memtable, cache, and four SSTables. That is more read work than one tidy B-tree leaf access. Conversely, a B-tree update may touch data page, index page, and WAL. A single 1 KB logical update might trigger 12 KB physical writes. Different engine. Different tax. Neither tax is imaginary. They decide SSD wear, latency variance, and cost.

engine      good at                 pays with
B-tree      reads, ranges, updates  random-write pressure
LSM         high write ingest       compaction and multi-file reads

When people say one engine is faster, ask faster at what exactly. That question saves many wrong decisions.

4) Compaction is housekeeping, and housekeeping can dominate the bill¶

Compaction merges SSTables, removes overwritten entries, and reclaims tombstones. Without compaction, reads get slower and storage bloats. With compaction, background I/O grows. Again, trade-off. Suppose an LSM system receives 2 TB daily. If average write amplification is 6x, the cluster writes 12 TB physically. Now imagine SSD bandwidth is limited. Background compaction can compete with foreground traffic. That is why tuning levels, file sizes, and compaction policy matters. Leveled compaction reduces read amplification but rewrites data more often. Size-tiered compaction lowers immediate rewrite cost but can increase read work. Simple, no? You are choosing which pain to pay more often. Databases expose different knobs because workloads differ. Cassandra tuning feels different from RocksDB tuning for exactly this reason. B-tree engines also have housekeeping. Vacuum, page cleanup, and index maintenance all matter. But the operational smell is different. LSM systems often show performance cliffs when compaction falls behind. Learn that sentence well. It is very interview useful.

5) Picking the shelf build from workload signals¶

Now combine the intuition. If a workload serves many range scans and balanced point reads, B-tree is comfortable. If a workload ingests huge write streams with sequential persistence, LSM is comfortable. Examples help. An order database with frequent lookups by order ID and date ranges fits B-tree nicely. A metrics pipeline ingesting device events every second often fits LSM better. Worked comparison. Workload A does 5,000 reads per second and 500 writes per second. Queries include ORDER BY created_at LIMIT 50 and date ranges. B-tree likes that mix. Workload B does 2,000 reads per second and 40,000 writes per second. Most reads are latest value by key. LSM likes that mix. Now add latency intuition. If one read must visit five files, tail latency can widen. If one write triggers page splits, write spikes can widen. Storage engines are below the SQL or NoSQL label. PostgreSQL is relational with B-tree-heavy indexing. Cassandra is wide-column with LSM-style internals. RocksDB can sit under many higher-level systems. So what to do? When diagnosing performance, do not stop at query language. Ask how data is laid out physically. That physical story explains surprising bottlenecks.

6) Practical interview shorthand for engine choice¶

Say this clearly. B-tree is read-optimized and range-friendly. LSM is write-optimized and compaction-heavy. Then add the workload sentence. “Orders feel B-tree-ish because reads, ranges, and updates dominate.” “Telemetry feels LSM-ish because append rate dominates.” That is crisp and useful. Also mention operational follow-through. B-tree systems need index health and vacuum awareness. LSM systems need compaction, tombstone, and file-count awareness. The shelf design is not abstract theory. It changes alerts, dashboards, and SSD bills. It also changes capacity planning. More write amplification means more SSD budget. More read amplification means more cache pressure. Those are real infrastructure consequences.

Where this lives in the wild¶

GitHub database engineer depends on B-tree-heavy MySQL or PostgreSQL behavior for issue lookups, ordered queries, and transactional updates.
Stripe storage engineer values B-tree-style range scans and predictable transactional reads for financial records.
Discord backend engineer benefits from LSM-style write paths when message or event ingestion grows relentlessly.
Meta infrastructure engineer works with RocksDB-backed systems where compaction tuning directly affects latency and SSD cost.
Cassandra platform engineer lives inside tombstones, compaction strategies, and write amplification trade-offs every day.

Pause and recall¶

Why can a small-height B-tree serve very large tables efficiently?
What exact problem is an LSM tree trying to make cheaper first?
How would you explain write amplification without using vague buzzwords?
Why can compaction become the main operational headache in LSM systems?

Interview Q&A¶

Q: Why does a B-tree work well for range queries? A: Because keys remain ordered in pages, so adjacent values live close together and can be scanned without jumping across many unrelated files. Common wrong answer to avoid: “Because B-trees are always in memory.” Q: Why do LSM trees handle heavy writes so well? A: Because writes land as sequential log appends and memory updates first, avoiding immediate random in-place page rewrites. Common wrong answer to avoid: “Because LSM trees never rewrite data.” Q: What is write amplification in simple terms? A: It is the difference between logical data written by the application and larger physical data rewritten by the engine underneath. Common wrong answer to avoid: “It only means duplicate application requests.” Q: Why can read latency vary more in an LSM system? A: Because reads may check memtables, caches, bloom filters, and multiple SSTables while compaction changes the file landscape in the background. Common wrong answer to avoid: “LSM systems are bad at reads by definition.”

Apply now (5 min)¶

Exercise: Take two workloads you know. One should be read-heavy. One should be write-heavy. For each, decide whether B-tree or LSM feels more natural and why. Sketch from memory: Draw a three-level B-tree. Then draw an LSM write path from WAL to memtable to SSTables. Underline one hidden tax for each engine.

Bridge. The shelf build decides raw behavior, but users still need faster lookup paths. Next we build the card catalog and read query plans properly. → 05-indexing-and-query-plans.md