Skip to content

08. Partitioning and Sharding — Splitting Data That Is Too Big for One Machine

~13 min read. Ten billion rows. One server chokes. Partitioning is how you breathe again.

Built on the ELI5 in 00-eli5.md. The branch library catalogue split across locations is perfect: some books live at the north branch, others at the south branch, none duplicated.


1. What Partitioning Is and Why It Differs From Replication

Replication copies the same data to multiple nodes. Partitioning splits data so each node holds a different subset. Used together, you get both scale and resilience.

Without partitioning:           With partitioning:
┌────────────────────┐          ┌──────────┐  ┌──────────┐  ┌──────────┐
│  All 10B rows on   │          │ Shard 1  │  │ Shard 2  │  │ Shard 3  │
│  one giant server  │          │ rows 0-3B│  │ rows 3-6B│  │ rows 6-9B│
└────────────────────┘          └──────────┘  └──────────┘  └──────────┘

Each partition (shard) can be replicated independently. Most large systems combine both.


2. Hash Partitioning

Compute hash(partition_key) mod N to assign each row to a partition.

hash("patron_A") mod 3 = 0  →  Shard 0
hash("patron_B") mod 3 = 1  →  Shard 1
hash("patron_C") mod 3 = 0  →  Shard 0

Advantages: - Even data distribution. No hot spots from natural clustering. - Simple routing logic. Any node can compute the target shard instantly.

Disadvantages: - Range queries are impossible. WHERE patron_id BETWEEN 100 AND 200 must hit all shards. - Adding or removing shards changes the modulo. Most or all data must move. This is catastrophic at scale.

Consistent hashing solves the rebalancing problem. Each node owns an arc on a ring. Adding a node takes over only the adjacent arc, moving roughly 1/N of the data.

           0
          / \
    270  ─   ─  90
          \ /
          180

Node A owns: 0–120
Node B owns: 120–240
Node C owns: 240–360
Add Node D at 60: takes 0–60 from Node A only.

3. Range Partitioning

Assign rows based on sorted ranges of the partition key.

Partition key: book_id (sorted)
Shard 0: book_id  0 – 999,999
Shard 1: book_id  1,000,000 – 1,999,999
Shard 2: book_id  2,000,000 – 2,999,999

Advantages: - Range queries hit one or a few contiguous shards. - Efficient for time-series data partitioned by date. Recent data is on recent shards.

Disadvantages: - Uneven distribution if data is skewed. All January traffic hits the January shard. - Hot shard problem: monotonically increasing keys (auto-increment IDs, timestamps) send all new writes to the last shard.

The branch library example: partitioning by acquisition date works great for historical queries but creates a hot-shard problem when every new book arrives in one shard.

Fix for hot shards: Prefix the key with a random salt (shard suffix).

Original key:  2024-01-15
Salted key:    2024-01-15#3   (random suffix 0–9)
Result: writes spread across 10 shards for the same date.
Cost:   range queries must now query all salt values.

4. Partition Key Selection — The Most Important Decision

A poor partition key causes uneven load, hot spots, or expensive cross-shard queries. Evaluate every candidate key on three dimensions:

Dimension         Good key                Bad key
──────────────────────────────────────────────────────────────
Cardinality       High (user_id, order_id) Low (status: 3 values)
Access pattern    Most queries include it  Queries rarely filter by it
Skew              Even distribution        Viral user gets 80% of traffic

In the branch library system, partition by patron_id if most queries are per-patron. Partition by branch_id only if most queries are per-branch and you have many branches.


5. Cross-Partition Queries and Secondary Indexes

Cross-Partition Queries (Scatter-Gather)

A query that cannot be answered by one shard must be broadcast to all shards. Each shard executes locally and returns partial results. A coordinator merges them.

SELECT * FROM loans WHERE due_date < '2024-12-01'
  → sent to all 10 shards
  → each returns subset
  → coordinator sorts and returns top N

Scatter-gather works but is slow. At 100 shards, one slow shard delays the entire query. Design your access patterns to avoid it.

Secondary Indexes on Partitioned Data

Two approaches:

Local (partition-local) secondary index: Each shard maintains its own index covering only its rows.

Shard 0 index: genre=mystery  →  [book_id: 3, 7, 21]
Shard 1 index: genre=mystery  →  [book_id: 1002, 1045]
Writes are fast (update only local index). Reads on the indexed column require scatter-gather.

Global secondary index: A separate partitioned index covers all shards but is itself partitioned by index value.

Global genre index:
  mystery shard  →  [book_id: 3, 7, 21, 1002, 1045, ...]
  fiction shard  →  [book_id: 9, 14, ...]
Reads on the indexed column are fast (one index shard). Writes are slow (must update the remote index shard atomically). DynamoDB GSIs work this way and are eventually consistent.


6. Rebalancing Partitions

When you add capacity, you must move data between nodes without downtime.

Naive approach (mod N): add 1 node, rehash everything → 90%+ data moves
Consistent hashing:     add 1 node, move ~1/N data  → 10% data moves
Fixed partitions:       pre-create 1000 logical shards, assign to 10 nodes
                        add 10 nodes → move 5 logical shards each

Fixed partition rebalancing (used by Cassandra, Elasticsearch): - Create far more logical partitions than physical nodes at startup. - Each node owns multiple partitions. - Adding a node: steal a few partitions from existing nodes. Data in those partitions moves; other data stays put. - No rehashing of keys. Partition-to-node assignment changes, not the partition assignment of each row.


Where this lives in the wild

Cassandra at Discord (Infrastructure) — Messages partitioned by (channel_id, bucket) where bucket is a time-based window. This prevents hot shards on active channels and enables efficient range reads per channel.

DynamoDB at Amazon (Platform) — Hash partitioning on the primary key. Engineers explicitly warned about "hot key" anti-patterns in AWS documentation. Adaptive capacity re-allocates throughput automatically to compensate.

MySQL at Uber (Storage) — Migrated from Postgres to MySQL and sharded by city_id. Cross-city queries require application-level joins across shards. Partition key selection drove the entire data model.

Elasticsearch at LinkedIn (Search) — Full-text index sharded across nodes. Each shard is a Lucene index. Rebalancing is automatic but pauses indexing throughput while moving shards.

Vitess at Slack (Database) — MySQL sharding middleware. Slack shards by team_id. The branch library analogy holds: each workspace is isolated to its own shard group, preventing noisy-neighbour issues.


Pause and recall

  1. What is the hot-shard problem with range partitioning on timestamps? How do you fix it?
  2. Consistent hashing solves the rebalancing problem with modulo hashing. Explain the mechanism in two sentences.
  3. What is the trade-off between local and global secondary indexes on partitioned data?
  4. You are designing the branch library system with 50 million loan records. The most common query is "all active loans for patron X." Choose a partition key and justify it.

Interview Q&A

Q: What is the difference between hash partitioning and range partitioning? When do you use each?

Hash partitioning applies a hash function to distribute rows evenly. Range partitioning assigns rows based on sorted key ranges. Use hash when distribution matters most and range queries are rare. Use range when queries filter on ranges or time windows and you can tolerate hot-shard risk.

Common wrong answer to avoid: Saying hash partitioning is always better because it is more even. Hash partitioning makes range queries require scatter-gather across all shards. For time-series databases and anything with date range filters, range partitioning is often the right choice.


Q: How does consistent hashing help when you add or remove nodes?

Consistent hashing places nodes and data keys on a ring using the same hash space. Each key belongs to the nearest node clockwise. Adding a node only takes over the adjacent arc, moving roughly 1/N of data. Removing a node transfers its arc to the next node. Other nodes are unaffected.

Common wrong answer to avoid: Saying consistent hashing means "no data moves." Adding a node always moves some data. The advantage is that only 1/N moves instead of most of the dataset as with modulo hashing.


Q: What is a cross-partition (scatter-gather) query and why is it slow at scale?

A scatter-gather query broadcasts to all shards, waits for each to respond, then merges results. It is slow because the total latency equals the slowest shard response, and at 100+ shards there is always a straggler. It also multiplies read load across the entire cluster.

Common wrong answer to avoid: Saying scatter-gather is fine if you have fast hardware. The straggler problem is statistical, not hardware-dependent. One GC pause or network blip on any one shard delays all 100 results. Design queries to avoid scatter-gather for latency-sensitive paths.


Q: Why is partition key selection the most critical decision in a sharded system?

The partition key determines data distribution, query routing, and hot-spot risk. A bad key causes skewed load, forces scatter-gather for common queries, or creates migration costs later. Changing the partition key requires moving all data — an expensive operation on a live system.

Common wrong answer to avoid: Treating partition key selection as a performance tuning detail. It is a schema design commitment. Changing it post-launch is closer to a database migration than an index change.


Apply now (5 min)

Exercise: You are designing a sharded database for the branch library loan system. There are 20 branches, 500,000 patrons, and 2 million loan records growing at 10,000 per day. The three most common queries are: (1) all loans for a patron, (2) all overdue loans across all branches, (3) all loans at a specific branch today. Choose a partition key, identify which query becomes a scatter-gather, and suggest a mitigation strategy for it.

Sketch from memory: Draw consistent hashing with four nodes on a ring. Add a fifth node. Show which data moves, which stays, and why modulo hashing would have been worse.


Bridge. Your data is split across shards and replicated for resilience. Now: how do you make reads blazingly fast without hitting the database at all? → 09-caching-patterns-deep-dive.md