00. Databases, Storage, and Caching: The Five-Year-Old Version
Every system produces data. Where it sits, how you find it, and how fast you reach it — that decides everything.
Imagine a huge library. Not a small one. One with millions of books, thousands of visitors per hour, and branches across the country.
Every book sits on a shelf. The shelf is physical storage. Big shelves hold many books but finding one takes time — you walk down long aisles, scan labels, pull the book. Small shelves near the front desk hold popular books for quick access.
To find a book, you check the card catalog. The card catalog is an index. It tells you exactly which shelf, which row, which position. Without it, you scan every book. With it, you walk straight to the right spot.
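A minimal sketch of the difference, assuming a toy three-book library. The book list and the function names are hypothetical, and a Python dict stands in for the card catalog; real databases typically use structures such as B-trees for this.

```python
# Hypothetical toy library: each book is a (title, shelf) pair.
books = [("Dune", "A3"), ("Emma", "B1"), ("Ulysses", "C7")]


def find_by_scan(title):
    """No card catalog: check every book until the title matches."""
    for book_title, shelf in books:
        if book_title == title:
            return shelf
    return None


# The card catalog: built once, maps title -> shelf for direct lookup.
catalog = {book_title: shelf for book_title, shelf in books}


def find_by_index(title):
    """With the card catalog: go straight to the right shelf."""
    return catalog.get(title)
```

Both functions return the same shelf. The scan gets slower as the library grows, while the catalog lookup stays roughly constant, at the cost of keeping the catalog up to date whenever a book is added or moved.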
Some books are so popular that every visitor wants them. You make copies and put them at the reservation desk — right by the entrance. No walking to the shelves. Grab and go. That is caching. The reservation desk is faster but smaller than the main shelves.
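The reservation desk can be sketched as a small cache in front of a slower store. This is one possible illustration, assuming a least-recently-used eviction rule that the text above does not specify; `ReservationDesk` and `main_shelves` are hypothetical names.

```python
from collections import OrderedDict

# Hypothetical main shelves: hold everything, but are slow to reach.
main_shelves = {
    "Dune": "text of Dune",
    "Emma": "text of Emma",
    "Ulysses": "text of Ulysses",
}


class ReservationDesk:
    """A tiny cache that keeps only the most recently requested books."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.copies = OrderedDict()
        self.trips_to_shelves = 0  # counts cache misses

    def get(self, title):
        if title in self.copies:
            self.copies.move_to_end(title)  # mark as recently used
            return self.copies[title]
        # Cache miss: walk to the main shelves and keep a copy at the desk.
        self.trips_to_shelves += 1
        book = main_shelves[title]
        self.copies[title] = book
        if len(self.copies) > self.capacity:
            self.copies.popitem(last=False)  # evict the least recently used
        return book
```

With `ReservationDesk(2)`, asking for the same title twice costs only one trip to the shelves. Asking for a third distinct title pushes the least recently used copy off the desk, which is the "faster but smaller" trade-off in code form.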
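Not every visitor comes to your main library. Some live across the city. So you build branch libraries that hold copies of the main collection in different neighborhoods, and visitors read from the branch nearest them. When the main library adds or replaces a book, the branches receive the change too, sometimes a little late. That is replication. A branch that stocks only the most popular books and requests rare ones from the main collection is acting more like a reservation desk than a true copy.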
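As a rough sketch of the branch-library idea, assume every book added to the main library is copied to every branch immediately. That is a simplification, since real replicas often lag behind, and the class and function names here are hypothetical.

```python
class Library:
    """One copy of the collection: title -> shelf location."""

    def __init__(self, name):
        self.name = name
        self.books = {}


class MainLibrary(Library):
    """Accepts new books and copies each one to every branch."""

    def __init__(self, name, branches):
        super().__init__(name)
        self.branches = branches

    def add_book(self, title, shelf):
        self.books[title] = shelf
        for branch in self.branches:
            # Copied right away in this sketch; real systems often lag.
            branch.books[title] = shelf


def read_nearby(branch, title):
    """Visitors read from their local branch, not the main library."""
    return branch.books.get(title)
```

Writes go to one place (the main library) and reads can be served from many. The sketch skips the hard parts: what happens when a branch is unreachable, or when a visitor reads before the copy arrives.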
Books don't last forever. Outdated editions need replacing. Damaged books need repair. Lost books need tracking. You keep an overdue list — a record of every book that's been checked out, returned late, or gone missing. The overdue list is your transaction log. It tracks every change to the library's state.
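A minimal sketch of the overdue list as a log, assuming an in-memory Python list in place of a durable file on disk. The names `record_and_apply`, `apply_entry`, and `replay` are hypothetical.

```python
log = []  # append-only record of every change, in order


def apply_entry(state, entry):
    """Apply one logged change to the library's state."""
    action, title = entry
    if action == "check_out":
        state[title] = "out"
    elif action == "return":
        state[title] = "in"


def record_and_apply(state, action, title):
    entry = (action, title)
    log.append(entry)          # write the log entry first
    apply_entry(state, entry)  # then change the state


def replay(entries):
    """Rebuild the state from scratch by re-applying the log in order."""
    state = {}
    for entry in entries:
        apply_entry(state, entry)
    return state
```

Because every change is written to the log before it is applied, replaying the log from the beginning reproduces the current state. A write-ahead log uses the same idea to recover after a crash.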
Database design is library design. You choose shelves (storage engines), build card catalogs (indexes), set up reservation desks (caches), open branch libraries (replicas), and maintain overdue lists (transaction logs). Get any one wrong and the library either loses books or makes visitors wait in line for hours.
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| shelf | the storage engine — where data physically lives (B-tree, LSM, heap) |
| card catalog | the index — the data structure that makes lookups fast |
| reservation desk | the cache layer — Redis, Memcached, or in-memory structures for hot data |
| branch library | replicas (full copies of the data in other locations) and partitions (slices of the data split across locations) |
| overdue list | the write-ahead log (WAL) / transaction log — record of every change |
Top resources
- Designing Data-Intensive Applications by Martin Kleppmann — the bible of storage, replication, and distributed data; read chapters 3, 5, 7 first
- Use The Index, Luke — free, visual guide to SQL indexing and query performance
- Redis University — free courses on caching patterns, data structures, and Redis internals
- PostgreSQL Documentation — surprisingly readable; the EXPLAIN and indexing sections are essential
- CMU Database Course (Andy Pavlo) — the gold standard academic database course with practical focus
- DynamoDB Best Practices — single-table design and partition key strategies from AWS
What's coming
- 01-relational-data-modeling.md — tables, normalization, foreign keys, and when to denormalize
- 02-nosql-document-keyvalue.md — MongoDB, DynamoDB, Redis — schema flexibility and access-pattern-first design
- 03-wide-column-and-graph.md — Cassandra, HBase, Neo4j — specialized shelves for specialized goods
- 04-storage-engines-btree-lsm.md — how data physically sits on disk and why it matters
- 05-indexing-and-query-plans.md — building the card catalog and reading EXPLAIN output
- 06-transactions-and-isolation.md — ACID, isolation levels, and what "consistent" really means
- 07-replication-strategies.md — leader-follower, multi-leader, leaderless, and the lag problem
- 08-partitioning-and-sharding.md — splitting the library by wing, floor, or letter
- 09-caching-patterns-deep-dive.md — cache-aside, write-through, invalidation, stampede, and TTL math
- 10-object-storage-and-data-lakes.md — S3, GCS, Parquet, and the data lake pattern
- 11-search-and-vector-stores.md — Elasticsearch, Pinecone, pgvector — full-text and semantic search
- 12-connection-pooling.md — PgBouncer, HikariCP, and why connections are expensive
- 13-cap-theorem-in-practice.md — what CAP actually means for real database choices
- 14-honest-admission.md — what we don't fully understand about data storage
Bridge. The library starts with its most fundamental design choice: how to describe the books and how they relate to each other. That means tables, rows, and relationships, which is relational modeling. → 01-relational-data-modeling.md
One more thing. A smart librarian doesn't organize books randomly. High-demand books go on eye-level shelves near the entrance. Rare manuscripts go in the basement vault. The organization strategy depends on who's visiting and what they need. Read-heavy libraries look different from write-heavy archives. Understanding these access patterns is what separates a library from a pile of books on the floor.