03. Wide-Column and Graph: Specialized shelves win when the goods are unusual

~15 min read. General-purpose databases struggle when partitioned writes or traversals dominate.

Built on the ELI5 in 00-eli5.md. The shelf — specialized shelves for specialized goods — now becomes wide-column partitions and graph edges.

1) Wide-column stores group writes by partition, not by neat spreadsheet habits

See. Cassandra and HBase are not bigger SQL tables with extra columns. They are storage models built around partitioned access and massive throughput. A wide-column row lives under a partition key. Inside that partition, clustering columns order related entries. This is superb for time-ordered events and append-heavy histories.

Suppose we track reading activity by user. Queries are mostly these.

Get recent events for user 42.
Get events for user 42 between 9:00 and 10:00.
Count failures for user 42 today.

That maps beautifully to partition plus time.

Partition key: user_id
Clustering key: event_time desc

USER#42
├── 10:05 opened_book_91
├── 10:02 closed_book_91
├── 09:58 turned_page
└── 09:55 opened_book_17
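
The layout above can be approximated with a small in-memory model. This is only a sketch of the idea, not how Cassandra or HBase store data internally: the `EventTable` class, its method names, and the sample date are all hypothetical. Each partition is a list kept sorted by the clustering key, so every read stays inside one user's lane.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime


class EventTable:
    """Toy wide-column table: partition key -> rows ordered by clustering key."""

    def __init__(self):
        # Each partition holds (event_time, action) tuples in ascending time order.
        self._partitions = defaultdict(list)

    def write(self, user_id, event_time, action):
        # insort keeps the partition sorted by event_time on every write.
        insort(self._partitions[user_id], (event_time, action))

    def recent(self, user_id, limit):
        # Newest first, mirroring "event_time desc"; only one partition is read.
        rows = self._partitions.get(user_id, [])
        return list(reversed(rows[-limit:]))

    def between(self, user_id, start, end):
        # Time-window read inside a single partition, newest first.
        rows = self._partitions.get(user_id, [])
        return [row for row in reversed(rows) if start <= row[0] <= end]


table = EventTable()
day = datetime(2024, 1, 1)  # arbitrary sample date
table.write(42, day.replace(hour=9, minute=55), "opened_book_17")
table.write(42, day.replace(hour=10, minute=5), "opened_book_91")
table.write(42, day.replace(hour=9, minute=58), "turned_page")
table.write(42, day.replace(hour=10, minute=2), "closed_book_91")

latest = table.recent(42, limit=2)
window = table.between(42, day.replace(hour=9), day.replace(hour=10))
print([action for _, action in latest])  # opened_book_91, closed_book_91
print([action for _, action in window])  # turned_page, opened_book_17
```

Writes arrive out of order in the example on purpose: the clustering order is maintained by the table, not by the caller.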

Why is this strong? Because writes for one user append into one logical lane. Reads stay targeted inside that lane. The system avoids global joins and wide random scans.

Worked example. Assume 8 million daily active readers. Each reader creates 30 events daily. That is 240 million events every day. Average event size is 300 bytes. Raw daily ingest becomes about 72 GB. Replication factor 3 lifts raw storage need near 216 GB daily. A relational design can store this. But wide-column stores handle this append-heavy rhythm more naturally. Cassandra especially likes predictable partition-key queries at huge scale.

The warning is equally important. If you need arbitrary filters across all columns, pain begins. Wide-column modeling starts from queries, not from elegance.
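
The sizing arithmetic in the worked example can be checked with a few lines. This is a back-of-envelope sketch: the function name `daily_ingest_gb` is made up for illustration, and it assumes decimal gigabytes (1 GB = 10^9 bytes) and ignores compression and storage overhead.

```python
def daily_ingest_gb(active_users, events_per_user, bytes_per_event, replication_factor=1):
    """Return daily storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = active_users * events_per_user * bytes_per_event * replication_factor
    return total_bytes / 10**9


events_per_day = 8_000_000 * 30
raw_gb = daily_ingest_gb(8_000_000, 30, 300)
replicated_gb = daily_ingest_gb(8_000_000, 30, 300, replication_factor=3)

print(events_per_day)  # 240000000
print(raw_gb)          # 72.0
print(replicated_gb)   # 216.0
```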

2) Cassandra and HBase solve similar volume pain with different operating assumptions

Cassandra is peer-to-peer: any node can accept writes, with no single primary. It is comfortable with multi-node writes and high availability. HBase sits closer to the Hadoop ecosystem and stores its data on HDFS. It shines when you already live in HDFS-heavy data infrastructure. Both reward sparse, massive tables. Both dislike join-heavy application logic. Both expect you to know the hot access path clearly.

So how do you choose? Ask where the workload lives. If the serving system needs always-on writes across regions, Cassandra feels natural. If the workload sits near batch analytics and large scans, HBase can fit.

Worked example. A telemetry platform ingests 150,000 device updates per second. Each update is 500 bytes. That is 75 MB every second. In one hour, raw ingest reaches about 270 GB. The main query is “show device history by device and recent window.” That is partition-shaped. A wide-column store earns its place here. Now compare with a support dashboard asking, “Find all failing devices in Pune with firmware 5.1.” That cross-cutting filter is not a natural wide-column query. You may need a separate index, search system, or warehouse export. Simple, no? Use specialized stores for their natural path. Do not demand every unnatural path from them too.

One more signal helps. If the application can tolerate eventual consistency on some reads, Cassandra becomes easier operationally. If the organization already runs Hadoop deeply, HBase reuse can be attractive. Technology choice follows the surrounding ecosystem too.
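
The difference between the two telemetry queries can be made concrete by counting how many partitions each one touches. The sketch below uses plain Python dictionaries as a stand-in for a partitioned store; the device records, field names, and function names are all hypothetical, and the `index` dictionary only imitates the role of a separate index or search system.

```python
from collections import defaultdict

# Hypothetical device updates: (device_id, city, firmware, status)
updates = [
    ("dev-1", "Pune", "5.1", "failing"),
    ("dev-2", "Pune", "5.0", "ok"),
    ("dev-3", "Delhi", "5.1", "failing"),
    ("dev-4", "Pune", "5.1", "ok"),
]

# Data is partitioned by device_id, as in the worked example.
partitions = defaultdict(list)
for device_id, city, firmware, status in updates:
    partitions[device_id].append({"city": city, "firmware": firmware, "status": status})


def device_history(device_id):
    """Partition-shaped query: reads exactly one partition."""
    return partitions.get(device_id, []), 1


def failing_devices_by_scan(city, firmware):
    """Cross-cutting filter: without an index, every partition is inspected."""
    touched = 0
    matches = []
    for device_id, rows in partitions.items():
        touched += 1
        latest = rows[-1]
        if (latest["city"], latest["firmware"], latest["status"]) == (city, firmware, "failing"):
            matches.append(device_id)
    return matches, touched


# A separate lookup keyed by the filter columns, maintained outside the main table.
index = defaultdict(set)
for device_id, rows in partitions.items():
    latest = rows[-1]
    index[(latest["city"], latest["firmware"], latest["status"])].add(device_id)


def failing_devices_by_index(city, firmware):
    """Same question answered from the side index with one lookup."""
    return sorted(index.get((city, firmware, "failing"), set()))


history, history_touched = device_history("dev-1")
scan_matches, scan_touched = failing_devices_by_scan("Pune", "5.1")
print(history_touched, scan_touched)             # 1 4
print(failing_devices_by_index("Pune", "5.1"))   # ['dev-1']
```

With four devices the scan is harmless. The point is that its cost grows with the number of partitions, while the history query does not.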

3) Graph stores make relationships first-class, not accidental join chains

Graph databases flip the modeling center. Entities become nodes. Relationships become edges. Traversal becomes the star query. Neo4j and Neptune are useful when connections are the product logic. Fraud rings, social graphs, recommendation paths, and dependency maps fit this shape.

Suppose a library app wants to detect suspicious accounts. Questions are these.

Which members share devices, addresses, and payment methods?
Which of those members borrowed the same rare books within two days?
Which members connect to banned accounts within two hops?

That is a graph question.

(member)-[USED]->(device)
(member)-[PAID_WITH]->(card)
(member)-[BORROWED]->(book)
(member)-[LIVES_AT]->(address)

Now traversal can walk two or three hops naturally.

Worked example. Assume 5 million members exist. Each member has 2 devices on average, 1.2 cards, and 40 borrow edges yearly. That is roughly 43 edges per member, so about 216 million edges after one year, before counting address edges. The graph quickly grows beyond 200 million edges. SQL can still store these facts. But repeated multi-hop traversals become awkward and expensive. A graph store keeps the edge lookup close to the model itself. That is the core win. Not magic speed for every query. Just better fit for connection-heavy reasoning.

Now remember the caution. Graph stores are wonderful for traversals. They are usually not your cheapest place for giant analytical scans. So what to do? Use graph when “how are these connected?” is the core product question. Do not use graph because the schema diagram looked cool.
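
The "within two hops" question can be sketched as a breadth-first walk over an adjacency map. This is a minimal illustration in plain Python, not Neo4j or Neptune query syntax. The member names, the `banned` set, and the function names are invented for the example, and edges are treated as undirected so that two members sharing a device sit two hops apart.

```python
from collections import defaultdict, deque

# Hypothetical edges: (member, relationship, target)
edges = [
    ("member:alice", "USED", "device:d1"),
    ("member:bob", "USED", "device:d1"),
    ("member:bob", "PAID_WITH", "card:c9"),
    ("member:carol", "PAID_WITH", "card:c9"),
    ("member:dave", "LIVES_AT", "address:a7"),
]
banned = {"member:bob"}

# Build an undirected adjacency map so shared devices and cards link members.
graph = defaultdict(set)
for source, _label, target in edges:
    graph[source].add(target)
    graph[target].add(source)


def within_hops(start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def linked_to_banned(member, max_hops=2):
    """Banned accounts reachable from this member within max_hops edges."""
    reachable = within_hops(member, max_hops)
    return sorted(banned & (reachable.keys() - {member}))


print(linked_to_banned("member:alice"))  # ['member:bob'] via the shared device
print(linked_to_banned("member:dave"))   # []
```

In SQL the same question needs one self-join per hop. Here the hop limit is just a parameter.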

4) Specialized stores beat general-purpose ones when their hardest query is truly different

Here is the decision test. Would a general relational store spend most of its effort fighting the data shape? If yes, specialization may pay. If not, stay general-purpose longer. Wide-column beats relational when writes are massive and queries stay inside partitions. Graph beats relational when repeated traversals across many hops become normal.

best fit chooser
┌──────────────────────────────┬──────────────────────┐
│ dominant question            │ likely store         │
├──────────────────────────────┼──────────────────────┤
│ append events by partition   │ Cassandra / HBase    │
│ traverse connected entities  │ Neo4j / Neptune      │
│ join business facts safely   │ Relational database  │
└──────────────────────────────┴──────────────────────┘

Worked comparison. A reading app has two separate needs. Need one: 300 million page-turn events daily. Need two: detect fraud rings around shared devices and cards. Trying one database for both sounds clean. In practice, it bends both workloads awkwardly. Cassandra handles the event firehose better. Neo4j handles suspicious connection walks better. The application can still keep payments and memberships in relational source-of-truth tables. That is not chaos. That is honest workload separation.

See the deeper lesson. Specialized stores are not replacements for design thinking. They are rewards for precise workload diagnosis.

5) Model the query path first, then the schema picture

This is the common interview rescue point. When confused, write the dominant query on paper first. For wide-column, write partition key, clustering key, and expected row count. For graph, write starting node, hop count, and stopping rule.

Worked mini-example for wide-column. Partitioning by library_id alone sounds easy. But one giant city library may create a monster partition. A better key may be (library_id, yyyy_mm). Now one month stays bounded.

Worked mini-example for graph. Starting from a member node, traverse device and card edges up to depth two. Stop after 500 candidate nodes to keep latency controlled. That is real modeling. Not vague nouns.

Also watch data lifecycle. Wide-column tables need TTL and tombstone awareness. Graph stores need edge cleanup when relationships expire. Operational hygiene is part of design. Simple, no?
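
Both mini-examples can be written down as small functions. This is a sketch under stated assumptions: `partition_key` and `bounded_traversal` are hypothetical helper names, the graph is a plain dictionary of adjacency lists, and the default limits of depth two and 500 candidates come from the text above.

```python
from collections import deque
from datetime import date


def partition_key(library_id, event_date):
    """Composite key (library_id, yyyy_mm): one partition per library per month."""
    return (library_id, event_date.strftime("%Y_%m"))


def bounded_traversal(graph, start, max_depth=2, max_candidates=500):
    """Breadth-first walk that stops at max_depth hops or max_candidates nodes."""
    depth = {start: 0}
    queue = deque([start])
    candidates = []
    while queue and len(candidates) < max_candidates:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # hop limit reached, do not expand further
        for neighbor in graph.get(node, ()):
            if neighbor in depth:
                continue
            depth[neighbor] = depth[node] + 1
            candidates.append(neighbor)
            if len(candidates) == max_candidates:
                break  # stopping rule: candidate budget exhausted
            queue.append(neighbor)
    return candidates


# Hypothetical adjacency lists: a member, its device and card, and co-users.
sample_graph = {
    "member:m1": ["device:d1", "card:c1"],
    "device:d1": ["member:m1", "member:m2"],
    "card:c1": ["member:m1", "member:m3"],
    "member:m2": ["device:d1", "device:d9"],
}

march_key = partition_key("lib-7", date(2024, 3, 15))
april_key = partition_key("lib-7", date(2024, 4, 2))
full_walk = bounded_traversal(sample_graph, "member:m1")
capped_walk = bounded_traversal(sample_graph, "member:m1", max_candidates=3)

print(march_key)    # ('lib-7', '2024_03')
print(full_walk)    # ['device:d1', 'card:c1', 'member:m2', 'member:m3']
print(capped_walk)  # ['device:d1', 'card:c1', 'member:m2']
```

The month bucket bounds partition size at the cost of reading several partitions for a multi-month query, so the bucket width should follow the dominant time window.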

Where this lives in the wild

A Netflix data platform engineer uses Cassandra-style wide-column storage when member-scoped timelines and huge write throughput dominate.
An Uber telemetry engineer benefits from partitioned event history stores for rider, driver, and trip updates arriving continuously.
A LinkedIn trust engineer relies on graph-style traversal for connection abuse, fake-account clustering, and relationship-driven investigations.
An Amazon fraud platform engineer can use Neptune-like graph queries to follow shared cards, devices, and addresses across rings.
A Hadoop ecosystem data engineer chooses HBase when sparse giant tables must sit near batch pipelines and large scans.

Pause and recall

Why do wide-column stores start from partition keys instead of normalized tables?
What kind of question makes graph storage feel natural immediately?
When would Cassandra feel better than HBase operationally?
Why is one-database-for-everything often the wrong goal here?

Interview Q&A

Q: Why choose Cassandra for activity events instead of a relational database?
A: Because the workload is append-heavy, naturally partitioned, and usually queried within a key plus time window. Cassandra matches that rhythm far better.
Common wrong answer to avoid: “Because Cassandra is always faster than SQL.”

Q: When does a graph database beat repeated SQL joins?
A: When multi-hop traversal is routine and relationship depth drives product logic. Graph stores keep edges central instead of simulated through many join steps.
Common wrong answer to avoid: “Use graph whenever tables have foreign keys.”

Q: Why can specialized stores coexist with relational truth tables?
A: Because different workloads ask fundamentally different questions. Transactional correctness, partitioned ingest, and relationship traversal do not always belong together.
Common wrong answer to avoid: “Multiple databases mean the architecture is broken.”

Q: What is the first modeling choice in Cassandra?
A: The access pattern and partition key. Table shape follows the query path, not abstract entity beauty.
Common wrong answer to avoid: “Start by listing all business entities like SQL tables.”

Apply now (5 min)

Exercise: Pick one product that emits many events and also has many relationships. Write one query that is partition-shaped. Write one query that is traversal-shaped. Then assign each to a better storage family.

Sketch from memory: Draw one wide-column partition with time-ordered rows. Next to it, draw three nodes connected by two edge labels. Then explain aloud why those pictures answer different questions.

Bridge. These shelf choices differ not only by model, but by physical construction. The way data sits on disk changes write cost and read cost completely. → 04-storage-engines-btree-lsm.md