00. Databases, Storage, and Caching: The Five-Year-Old Version

Every system produces data. Where it sits, how you find it, and how fast you reach it decide whether the system feels fast or slow, and whether the data stays safe.

Imagine a huge library. Not a small one. One with millions of books, thousands of visitors per hour, and branches across the country.

Every book sits on a shelf. The shelf is physical storage. Big shelves hold many books but finding one takes time — you walk down long aisles, scan labels, pull the book. Small shelves near the front desk hold popular books for quick access.

To find a book, you check the card catalog. The card catalog is an index. It tells you exactly which shelf, which row, which position. Without it, you scan every book. With it, you walk straight to the right spot.
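A minimal sketch of the difference, in Python. The three books and the names `books`, `catalog`, `find_by_scan`, and `find_by_catalog` are made up for illustration. A plain dictionary stands in for the card catalog here; real databases usually build indexes as B-trees, but the idea is the same.

```python
# Hypothetical mini-library: (title, shelf position) in arrival order.
books = [("Dune", "A3"), ("Emma", "C1"), ("Ulysses", "B7")]


def find_by_scan(title):
    """No catalog: look at every book until the title matches."""
    for book_title, shelf in books:
        if book_title == title:
            return shelf
    return None


# The card catalog: built once, maps title -> shelf position.
catalog = {title: shelf for title, shelf in books}


def find_by_catalog(title):
    """With a catalog: one lookup, no walking the aisles."""
    return catalog.get(title)
```

Both functions return the same shelf. The scan gets slower as the library grows; the catalog lookup barely notices. The price is that the catalog must be updated every time a book is added or moved.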

Some books are so popular that every visitor wants them. You make copies and put them at the reservation desk — right by the entrance. No walking to the shelves. Grab and go. That is caching. The reservation desk is faster but smaller than the main shelves.
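The reservation desk can be sketched in a few lines of Python. This is an illustration under assumptions, not how Redis or Memcached work inside: the names `MAIN_SHELVES`, `desk`, `DESK_CAPACITY`, and `get_book` are hypothetical, and the desk throws out the least recently used book when it is full, which is one common policy among several.

```python
from collections import OrderedDict

# Hypothetical main collection: slow to reach, holds everything.
MAIN_SHELVES = {"Dune": "text of Dune", "Emma": "text of Emma",
                "Ulysses": "text of Ulysses"}

DESK_CAPACITY = 2          # the desk is small on purpose
desk = OrderedDict()       # title -> copy, least recently used first
stats = {"shelf_trips": 0}  # counts the slow walks to the shelves


def fetch_from_shelves(title):
    """The slow path: walk to the main shelves (raises KeyError if unknown)."""
    stats["shelf_trips"] += 1
    return MAIN_SHELVES[title]


def get_book(title):
    """Check the desk first; only walk to the shelves on a miss."""
    if title in desk:
        desk.move_to_end(title)      # mark as recently used
        return desk[title]
    copy = fetch_from_shelves(title)
    desk[title] = copy               # keep a copy at the desk
    if len(desk) > DESK_CAPACITY:
        desk.popitem(last=False)     # evict the least recently used book
    return copy
```

Asking for the same title twice costs one trip to the shelves, not two. Asking for a third title pushes the oldest one off the desk.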

Not every visitor comes to your main library. Some live across the city. So you build branch libraries: smaller copies of the main collection in different neighborhoods. Each branch holds the most popular books. If a visitor needs a rare book, the branch requests it from the main collection. That is replication: the same data kept in more than one place, so visitors can read from the nearest copy.
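Here is the branch idea as a small Python sketch. The `Library` class and the `add_book` helper are hypothetical names for this example. It follows the story above, where a branch only holds popular books and asks the main library for the rest; real database replicas usually receive every write, which a later section covers.

```python
class Library:
    """A main library (main=None) or a branch that falls back to a main one."""

    def __init__(self, name, main=None):
        self.name = name
        self.main = main
        self.books = {}

    def read(self, title):
        """Return (text, name of the library that served it)."""
        if title in self.books:
            return self.books[title], self.name
        if self.main is not None:
            return self.main.read(title)   # rare book: ask the main collection
        return None, None


def add_book(main, branches, title, text, popular):
    """Every new book goes to the main library; popular ones are copied out."""
    main.books[title] = text
    if popular:
        for branch in branches:
            branch.books[title] = text
```

A visitor at the branch gets popular books locally and rare books from the main library, without needing to know which is which.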

Books don't last forever. Outdated editions need replacing. Damaged books need repair. Lost books need tracking. You keep an overdue list — a record of every book that's been checked out, returned late, or gone missing. The overdue list is your transaction log. It tracks every change to the library's state.
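A rough Python sketch of why the list matters. The entry format and the names `log`, `apply_entry`, `record_and_apply`, and `replay` are assumptions for this example. A real write-ahead log is an append-only file flushed to disk; a plain list stands in for it here.

```python
import json

log = []  # stand-in for an append-only file on disk


def apply_entry(state, entry):
    """Change the library's state according to one log entry."""
    if entry["op"] == "checkout":
        state[entry["title"]] = entry["visitor"]
    elif entry["op"] == "return":
        state.pop(entry["title"], None)


def record_and_apply(state, entry):
    """Write the change to the log first, then make the change."""
    log.append(json.dumps(entry))
    apply_entry(state, entry)


def replay(log_lines):
    """Rebuild the state from nothing by re-reading the log in order."""
    state = {}
    for line in log_lines:
        apply_entry(state, json.loads(line))
    return state
```

If the library's memory of who holds which book is lost, `replay(log)` rebuilds it. That only works because every change was written to the log before it was applied.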

Database design is library design. You choose shelves (storage engines), build card catalogs (indexes), set up reservation desks (caches), open branch libraries (replicas), and maintain overdue lists (transaction logs). Get any one wrong and the library either loses books or makes visitors wait in line for hours.

The placeholders you will see called back

Placeholder	Meaning
shelf	the storage engine — where data physically lives (B-tree, LSM, heap)
card catalog	the index — the data structure that makes lookups fast
reservation desk	the cache layer — Redis, Memcached, or in-memory structures for hot data
branch library	replicas (full copies) and partitions (slices) of the data, kept in different locations
overdue list	the write-ahead log (WAL) / transaction log — record of every change

Top resources

Designing Data-Intensive Applications by Martin Kleppmann — the bible of storage, replication, and distributed data; read chapters 3, 5, 7 first
Use The Index, Luke — free, visual guide to SQL indexing and query performance
Redis University — free courses on caching patterns, data structures, and Redis internals
PostgreSQL Documentation — surprisingly readable; the EXPLAIN and indexing sections are essential
CMU Database Course (Andy Pavlo) — the gold standard academic database course with practical focus
DynamoDB Best Practices — single-table design and partition key strategies from AWS

What's coming

01-relational-data-modeling.md — tables, normalization, foreign keys, and when to denormalize
02-nosql-document-keyvalue.md — MongoDB, DynamoDB, Redis — schema flexibility and access-pattern-first design
03-wide-column-and-graph.md — Cassandra, HBase, Neo4j — specialized shelves for specialized goods
04-storage-engines-btree-lsm.md — how data physically sits on disk and why it matters
05-indexing-and-query-plans.md — building the card catalog and reading EXPLAIN output
06-transactions-and-isolation.md — ACID, isolation levels, and what "consistent" really means
07-replication-strategies.md — leader-follower, multi-leader, leaderless, and the lag problem
08-partitioning-and-sharding.md — splitting the library by wing, floor, or letter
09-caching-patterns-deep-dive.md — cache-aside, write-through, invalidation, stampede, and TTL math
10-object-storage-and-data-lakes.md — S3, GCS, Parquet, and the data lake pattern
11-search-and-vector-stores.md — Elasticsearch, Pinecone, pgvector — full-text and semantic search
12-connection-pooling.md — PgBouncer, HikariCP, and why connections are expensive
13-cap-theorem-in-practice.md — what CAP actually means for real database choices
14-honest-admission.md — what we don't fully understand about data storage

Bridge. The library starts with its most fundamental design choice: how to organize the books. Tables, rows, and relationships. That is relational modeling. → 01-relational-data-modeling.md

One more thing. A smart librarian doesn't organize books randomly. High-demand books go on eye-level shelves near the entrance. Rare manuscripts go in the basement vault. The organization strategy depends on who's visiting and what they need. Read-heavy libraries look different from write-heavy archives. Understanding these access patterns is what separates a working library from a pile of books on the floor.