04. Data Modeling and Storage — Pick the right warehouse before traffic starts¶
~14 min read. One wrong storage choice can bend your whole architecture the wrong way.
Built on the ELI5 in 00-eli5.md. The warehouse — the storage building in the city — now becomes the place where each kind of data must live wisely.
First see the city map, not the table columns¶
At HLD level, you do not start with column names. You start with data shapes, access paths, and failure cost. See. The question is not, "Which schema looks elegant?" The question is, "Which warehouse fits these goods and this traffic?" If you store the wrong goods in the wrong building, every later decision becomes expensive. Your APIs slow down. Your scaling story becomes awkward. Your backup plan becomes messy. Simple, no? So what do you inspect first? 1. What is the main read pattern? 2. What is the main write pattern? 3. Do records depend on strong relationships? 4. Do fields change shape often? 5. Do you need joins across many entities? 6. Do you need millisecond lookups by key? 7. Do you need graph traversal across connections? 8. Do you need huge write throughput across partitions? That is the HLD lens. You are placing warehouses on the city map, not deciding every shelf label. Here is the first rough chooser: ┌──────────────────────┬──────────────────────────────┐ │ Data need │ Likely storage choice │ ├──────────────────────┼──────────────────────────────┤ │ Transactions + joins │ Relational database │ │ Flexible JSON shape │ Document database │ │ Fast key lookup │ Key-value store │ │ Huge partitioned rows│ Wide-column database │ │ Deep relationships │ Graph database │ └──────────────────────┴──────────────────────────────┘ Do not misuse this table. It is a first filter, not divine truth. Sometimes one system uses two or three storage engines. That is polyglot persistence, and at HLD level it can be the correct move.
SQL first: when the business cares about correctness across entities¶
Use a relational database when the data has strong structure. Orders, payments, invoices, subscriptions, ledgers, and inventory reservations fit this pattern. These are not loose notes. These are business facts. Now what is the signal? The signal is cross-entity correctness. If order creation, payment capture, and stock reservation must stay coordinated, a relational warehouse usually wins. Why? - transactions across related records - strong constraints - joins for reporting and operations - mature indexing and query tooling - clearer audit trails A small HLD pattern looks like this: ┌──────────┐ write order ┌──────────────┐ │ Checkout │ ───────────────→ │ Orders SQL │ └──────────┘ └──────────────┘ │ │ │ read payment state │ join order, payment, item ▼ ▼ ┌──────────┐ ┌──────────────┐ │ Support │ ───────────────→ │ Reports SQL │ └──────────┘ └──────────────┘ This does not mean SQL is always slow or old-fashioned. That is interview mythology. A properly indexed relational system can serve massive traffic. The real question is whether your main pain is correctness or flexibility. Worked example. Suppose you are designing a ticketing platform. - 8 million users - 120,000 events - 2 million ticket bookings per day - each booking changes seat availability and payment state Now calculate the pressure. Step 1: booking writes per second. 2,000,000 bookings per day ÷ 86,400 seconds = about 23 bookings per second on average. Step 2: peak multiplier. Assume big concert launches create 20x peak. 23 × 20 = 460 booking flows per second. Step 3: per booking row changes. One booking touches booking row, payment row, seat inventory row, and audit row. That is about 4 coordinated writes. 460 × 4 = 1,840 row operations per second at peak. This is not absurd for a modern relational cluster. But correctness matters a lot. If two people buy the same seat, your brand takes the hit. So the architecture-level choice is simple: put booking and payment truth in a relational warehouse.
NoSQL is not one thing: separate the families properly¶
People say "Use NoSQL" as if it is one bucket. That is too fuzzy. At HLD level, split it into families. Each family solves a different pain.
Document database¶
Choose a document store when each record is mostly self-contained and fields vary between entities. Product catalogs, CMS pages, user-generated profiles, and merchant settings are common fits. Good signals: - one record fetched as one document - nested attributes make sense together - schema changes happen often - joins are rare or can be precomputed Example. An online marketplace has 40 million products. Fashion items have size and fabric. Electronics have battery and wattage. Furniture has dimensions and wood type. If you force all of this into one rigid table, you get many nullable columns and awkward migrations. A document warehouse may fit better.
Key-value store¶
Choose key-value when the access pattern is brutally simple. You know the key, and you need the value fast. Session store, cart snapshot, feature flag lookup, and rate-limit counters fit here. Signals: - read by primary key only - very low latency matters - TTL expiry matters - joins do not matter Example with numbers. Suppose 3 million active shopping carts exist. Each cart snapshot averages 2 KB. 3,000,000 × 2 KB = 6,000,000 KB, which is about 6 GB of raw cart data. Add replication factor 3. 6 GB × 3 = 18 GB. That comfortably fits a distributed key-value tier. So what do you do? Keep carts in a fast key-value warehouse, not in the same place as long-term order history.
Wide-column database¶
Choose wide-column when data arrives at huge scale, is partitioned naturally, and queries mostly stay inside one partition. Time-series metrics, event logs, feed activity by user, and IoT ingestion fit this rhythm. Signals: - very high write throughput - queries by partition key and time range - sparse columns across massive datasets - horizontal scaling matters more than joins Think of it this way. You are not browsing a neat spreadsheet. You are pouring trucks of events into grouped lanes. A wide-column warehouse is built for that rhythm.
Graph database¶
Choose graph when the relationship is the product. Fraud rings, friend suggestions, dependency maps, route finding, and knowledge graphs fit this pattern. Signals: - repeated multi-hop traversal - answers depend on connection depth - joins become recursive and painful - "friends of friends" is a normal query, not a rare one If your core question is, "How are these entities connected over two, three, or five hops?" then graph becomes an architecture choice, not an afterthought.
A worked HLD decision: one product, five data types¶
Let us design a grocery delivery platform. One product. Multiple kinds of goods. Multiple warehouses. Assume: - 12 million registered users - 1.5 million daily active users - 300,000 orders per day - 80,000 catalog items - 50,000 delivery partners - 40 million app events per day Now classify the data.
1) Orders and payments¶
Need strong correctness, refunds, audit trails, and joins between order, payment, store, and line items. Use relational SQL.
2) Product catalog¶
Each item has flexible attributes. Milk has fat percentage. Rice has grain type. Detergent has fragrance and pack size. Updates are document-shaped. Use document storage.
3) User cart and session¶
Need millisecond lookup by user ID. Can expire after inactivity. No multi-record joins needed. Use key-value storage.
4) Delivery partner location history¶
App sends frequent pings. Queries ask for partner X between 9:00 and 9:30. Partition by partner and time. Use wide-column storage.
5) Referral and fraud network¶
Need to find linked accounts sharing cards, addresses, or devices. That is relationship-heavy. Use graph storage. Here is the map: ┌──────────────┐ orders/payments ┌──────────────┐ │ App + APIs │ ───────────────────→ │ SQL cluster │ ├──────────────┤ catalog docs ├──────────────┤ │ │ ───────────────────→ │ Document DB │ │ │ cart/session ├──────────────┤ │ │ ───────────────────→ │ Key-value DB │ │ │ location events ├──────────────┤ │ │ ───────────────────→ │ Wide-column │ │ │ fraud links ├──────────────┤ │ │ ───────────────────→ │ Graph DB │ └──────────────┘ └──────────────┘ Now the number check. Orders per day = 300,000. If each order record plus line items averages 4 KB, then raw daily order storage is 300,000 × 4 KB = 1,200,000 KB. That is about 1.2 GB per day. One year of raw order data: 1.2 GB × 365 = 438 GB. Add indexes and replicas. Assume 2.5x overhead. 438 GB × 2.5 = 1,095 GB, which is roughly 1.1 TB. So your SQL tier must be designed for terabyte-scale truth. Now events. 40 million events per day. Assume each event is 500 bytes. 40,000,000 × 500 bytes = 20,000,000,000 bytes, or about 20 GB per day raw. In one year, raw is about 7.3 TB. This is exactly why event-style data should not sit in the same relational path as payments. Different goods. Different warehouses. Different scaling story.
What architects usually get wrong at this stage¶
Mistake one. Using one database type for everything because the team already knows it. Comfort is not architecture. Fit is architecture. Mistake two. Choosing a document database just to avoid schema thinking. You still need data discipline. You are only postponing the pain if relationships are strong. Mistake three. Putting ephemeral fast-changing state in the system of record. Carts, sessions, counters, and cacheable flags often need a separate home. Mistake four. Picking graph because it sounds clever. If you run one graph query per month, a relational solution may be enough. Graph should earn its cost. Mistake five. Confusing schema design with storage strategy. Schema design is LLD detail. HLD decides which warehouse class stores which business capability. That choice shapes scaling, failure isolation, and team boundaries. A good final question is this. If one storage engine is down, which business capability should stop, and which should continue? That is how you know the boundaries are clean.
Where this lives in the wild¶
- Netflix — viewer profiles, playback state, and personalization features sit across Cassandra-style wide-column storage because global scale and partitioned access matter more than cross-row joins.
- Amazon DynamoDB — shopping cart, session, and metadata workloads often use key-value access where the application already knows the primary key and needs single-digit millisecond reads.
- Stripe — payments, balances, ledger entries, and dispute records rely on relational correctness because financial truth must survive retries, audits, and multi-step workflows.
- MongoDB in Adobe Experience Manager deployments — flexible document structures fit pages, assets, and variable metadata without rigid row expansion.
- LinkedIn — relationship-heavy features like social graph traversal and second-degree connection reasoning depend on graph-style modeling, even when other product data lives elsewhere.
Pause and recall¶
- What is the HLD question for storage selection: schema elegance or data shape plus access pattern?
- Why do orders and payments usually prefer a relational system even when average QPS is moderate?
- When does a key-value store beat a document store for the same product capability?
- Why should high-volume event data usually live outside the transactional SQL path?
Interview Q&A¶
Q: Why choose SQL for bookings and payments instead of a document database with flexible JSON fields? A: Because the core problem is coordinated correctness across multiple entities. Bookings, seats, payments, and refunds must stay consistent under retries and concurrency. Relational systems give transactions, constraints, and joinable truth, which matches that risk profile better. Common wrong answer to avoid: "SQL is only for structured data" — the real reason is transactional integrity across related business facts, not just whether columns look tidy. Q: Why use a document store for product catalog but not for ledger entries? A: Catalog items vary by category and are often fetched as self-contained records, so flexible document shape helps. Ledger entries need strict invariants, ordered reconciliation, and reliable cross-record reasoning, so relational storage is safer. Common wrong answer to avoid: "Documents are more scalable, so use them everywhere" — scalability is not the only axis; correctness and access pattern decide the fit. Q: Why use key-value for carts instead of putting carts in the primary relational database? A: Cart access is usually by user or cart ID, latency matters, and data can expire. That pattern rewards simple fast lookups and TTL support. Keeping carts separate also prevents bursty session traffic from disturbing transactional order storage. Common wrong answer to avoid: "Because SQL cannot handle reads" — SQL can handle many reads; the issue is workload isolation and a simpler access path. Q: Why pick graph for fraud-link analysis and not just keep adding joins in SQL? A: Because the question itself is graph-shaped. You repeatedly traverse accounts, cards, devices, and addresses across multiple hops. When connection depth becomes the product logic, graph databases reduce query complexity and operational pain. Common wrong answer to avoid: "Graph databases are always faster" — they are faster for relationship traversal, not automatically for every workload.
Apply now (5 min)¶
Take one familiar product. Maybe food delivery, edtech, banking, or social media. List five data types inside it. For each one, write three things: 1. main read pattern 2. main write pattern 3. failure cost if stale or wrong Now assign each data type to one storage family. Relational, document, key-value, wide-column, or graph. If you choose only one family for all five, force yourself to justify why. See if that reason is comfort or true fit. Sketch from memory: draw one box for the app, then draw four or five warehouses around it. Label each with the data type it stores and one reason it belongs there. Do not look back while sketching.
Bridge. The warehouses are placed. But delivery style still matters. Some requests must wait for a reply, while others should be dropped and processed later. → 05-sync-vs-async-communication.md