09. Scaling the Write Path — split the loading dock¶
~16 min read. Reads can fan out. Writes usually collide at one narrow door.
Built on the ELI5 in 00-eli5.md. The warehouse — the storage layer — now gets split into multiple loading docks so one overloaded door does not block every incoming write.
Why writes are harder than reads¶
Reads can use copies. Writes create truth. That is the whole headache. If ten users read the same page, many copies can answer. If ten users update the same order, one system must decide the final state. So write scaling is not only about throughput. It is also about correctness. See. A single database writer works well until all writes land on one machine. Then CPU saturates, disk flush queue grows, locks wait, transaction latency climbs, and tail latency becomes ugly. Now what is the problem? You cannot solve write pressure with only more caches. A cache can hide reads. It cannot invent committed writes. So we change the write path itself. Think of one giant warehouse with a single loading dock. All trucks line up there. Orders, messages, payments, inventory updates, logs, everything. At 1,000 trucks per hour, fine. At 100,000 trucks per hour, jam. So what to do? Split one loading dock into many. That is sharding or partitioning. Add a durable log before deeper storage. That is the write-ahead log. Combine small writes into larger chunks. That is batching. Each technique reduces pressure on the central warehouse.
writes
│
▼
┌────────────────────┐
│ single loading dock│
│ primary DB │
└─────────┬──────────┘
│ overload
┌─────────┴──────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Shard A │ │Shard B │ │Shard C │
│dock 1 │ │dock 2 │ │dock 3 │
└────────┘ └────────┘ └────────┘
Sharding strategies: hash, range, and geo¶
Sharding means splitting data across multiple storage nodes. Each shard owns only part of the key space. Writes route to one shard based on a rule. If the rule is good, load spreads. If the rule is bad, one shard burns while others nap.
Hash-based sharding¶
Take a key.
Run a hash.
Map hash to a shard.
Example.
Suppose we have 4 shards.
Use user_id % 4 as a simple stand-in for hashing.
User 101 goes to shard 1 because 101 % 4 = 1.
User 102 goes to shard 2 because 102 % 4 = 2.
User 103 goes to shard 3 because 103 % 4 = 3.
User 104 goes to shard 0 because 104 % 4 = 0.
Good news.
Distribution is usually even.
Bad news.
Range queries become painful.
Finding users 100 through 200 may touch every shard.
Range-based sharding¶
Each shard owns a value range. Shard A stores ids 1-1,000,000. Shard B stores ids 1,000,001-2,000,000. Good for range scans. Good for time windows. Bad when new writes all land in the newest range. That creates a hot tail.
Geo-based sharding¶
Users in India go to an India shard. Users in Europe go to a Europe shard. Good when laws, latency, or business boundaries follow geography. Bad when one region suddenly grows far faster than others. Now let us compare with numbers. You need 120,000 writes per second. One node safely handles 15,000 writes per second. Minimum shards needed = 120,000 / 15,000 = 8. So 8 shards is the floor. Do not stop there. Leave headroom. If you want 25% spare capacity, divide by 0.75. 120,000 / 0.75 = 160,000 effective needed capacity. 160,000 / 15,000 = 10.67. Round up. You provision 11 shards. Now each shard at steady state handles about 120,000 / 11 = 10,909 writes per second. That leaves about 4,091 writes per second spare on each node. Simple, no? We sized for peak plus safety. Routing picture looks like this.
┌────────────┐ hash/range/geo ┌────────┐
│ Write API │──────────────────→ │Shard A │
└─────┬──────┘ └────────┘
├─────────────────────────→ ┌────────┐
│ │Shard B │
│ └────────┘
└─────────────────────────→ ┌────────┐
│Shard C │
└────────┘
Write-ahead logs and batching¶
Before updating the final data pages, many systems append intent to a log. That is the write-ahead log, or WAL. Why first to a log? Because append is simple and durable. Random page updates are slower and riskier. If the system crashes after the log write but before page update, recovery can replay the log. That is why the WAL is like a signed truck register at the loading dock. Every truck entry is recorded before goods move inside the warehouse. Worked example. Suppose a disk can persist 5,000 random page writes per second. The same disk can append 40,000 sequential log entries per second. Your system receives 32,000 writes per second. Direct random writes will fail because 32,000 > 5,000. Sequential WAL appends fit because 32,000 < 40,000. Now add batching. Assume each final page flush can pack 200 writes together. 32,000 incoming writes / 200 per batch = 160 page flushes. If each batch flush takes 4 ms, total flush work is 160 × 4 ms = 640 ms per second. That fits inside one second of wall-clock time. See the sequence. Step 1: append 32,000 intents to WAL. Step 2: acknowledge when durability rule is met. Step 3: group many writes by page or shard. Step 4: flush compact batches to storage. So what did we gain? We turned scattered random work into ordered append plus grouped flushes. That is why WAL plus batching is everywhere. Now a smaller business example. Suppose checkout service gets 9,000 order updates per second. Without batching, each update costs 1.2 ms of downstream storage work. Total work = 9,000 × 1.2 ms = 10,800 ms per second. Impossible. Now batch 300 updates. 9,000 / 300 = 30 batches. Suppose each batch commit costs 18 ms. 30 × 18 ms = 540 ms per second. Work drops from 10,800 ms to 540 ms. Saved work = 10,260 ms each second. Reduction percentage = 10,260 / 10,800 = 95%. Batching is not a micro-optimization. It changes feasibility. Tradeoff? Per-item latency can rise while you wait for a batch to fill. If batch window is 50 ms, some writes wait nearly 50 ms before commit. So choose batch size and batch time together.
Hot partitions and rebalancing¶
Here is where people get hurt.
Even with many shards, one key or range can go hot.
That is a hot partition.
Example one.
Hash sharding by seller_id sounds fine.
Then one celebrity seller runs a flash sale.
Their shard gets 25,000 writes per second.
Other 10 shards get 9,500 writes per second each.
Average looks safe.
One shard still melts.
Example two.
Range sharding by timestamp sends all new events to the latest range.
Yesterday's shard is idle.
Today's shard catches fire.
So what to do?
First, detect skew.
Do not only monitor global QPS.
Monitor per-shard QPS, CPU, queue depth, and p99 latency.
Second, choose keys carefully.
A bad partition key creates permanent pain.
Third, rebalance.
Rebalancing means moving ranges, remapping hashes, or splitting hot shards.
Worked example.
Suppose shard H handles 24,000 writes per second.
Safe limit is 12,000.
It is 2x overloaded.
If you split shard H into H1 and H2 evenly, each gets 12,000 writes per second.
Calculation is 24,000 / 2 = 12,000.
Now it fits exactly.
Better to leave headroom and split earlier.
Maybe split when a shard crosses 8,000 of a 12,000 limit.
Now what is the challenge?
Moving data while traffic continues.
During rebalance, routers need old and new maps.
Background copy must happen.
Dual-write or forwarding may happen briefly.
Consistency rules must stay clear.
Look at the flow.
before: after split:
┌────────┐ ┌────────┐ ┌────────┐
│Hot H │ 24k writes/s │ H1 │ │ H2 │
└────────┘ │12k/s │ │12k/s │
└────────┘ └────────┘
Where this lives in the wild¶
- Amazon DynamoDB — partition keys spread writes across partitions, and hot keys can throttle a single partition even when table-level capacity looks healthy.
- LinkedIn Kafka — member activity writes land in partitioned append-only logs, which is the practical face of WAL-style ordered ingestion at huge scale.
- Google Cloud Bigtable — row-key ranges split into tablets, and hot ranges are moved or split to rebalance sustained write pressure.
- Shopify — high-volume merchants create bursty order and inventory writes, so partitioning by tenant-like keys helps isolate traffic instead of forcing one central write queue.
- Slack — message and event pipelines batch writes through log-based infrastructure before search, analytics, and downstream stores consume them.
Pause and recall¶
- Why does write scaling need correctness thinking, not only more throughput?
- When is hash sharding better than range sharding, and when is it worse?
- What exact durability problem does a WAL solve?
- Why can average shard load look safe while one partition still fails?
Interview Q&A¶
Q: Why hash sharding and not range sharding for user-generated writes? A: Hash sharding usually spreads unpredictable user traffic more evenly, which is great when hotspot avoidance matters most. Range sharding is better for ordered scans and time windows, but it can create a hot edge if new writes cluster in one range. Common wrong answer to avoid: "Hash sharding is always better" — it is better for balance, not for every query shape. Range queries and locality may get worse. Q: Why use a WAL and not write data pages directly first? A: A WAL turns the first durable step into an append, which is faster and easier to recover from after crashes. Data pages can be updated later in a controlled way. Without the log-first step, partial writes are harder to reason about and replay. Common wrong answer to avoid: "Because logs are cheaper storage" — storage price is not the main point. Ordered durability and crash recovery are the real reasons. Q: Why batch writes and not commit each item immediately at peak traffic? A: Immediate single-item commits maximize per-item freshness but destroy throughput when storage overhead per commit is high. Batching amortizes that fixed cost across many items. The system trades a bounded delay for a massive capacity gain. Common wrong answer to avoid: "Batching is only for analytics" — no. Many operational write paths batch safely when strict per-item immediacy is not required. Q: Why shard and not just buy a bigger database box? A: Bigger boxes help for a while, and you should usually exhaust that path first. But every machine has a ceiling on IOPS, CPU, memory bandwidth, and failover blast radius. Sharding adds horizontal capacity when one box is no longer enough. Common wrong answer to avoid: "Vertical scaling never works" — it often works very well until it does not. The real issue is limits and risk concentration, not ideology.
Apply now (5 min)¶
Your system must handle 96,000 writes per second. One shard safely handles 12,000 writes per second. Calculate the minimum shard count. Then add 20% spare capacity and recalculate. Now imagine one hot customer generates 30,000 writes per second on a single key. Say in two lines why adding more shards may still not fix that hotspot. Finally, choose one partition rule for each case: chat messages, daily logs, and country-specific compliance data. Sketch from memory: draw one write API, one WAL, three shards, and mark where batching happens before final storage flush.
Bridge. Writes are distributed. But now the same data lives in multiple warehouses, logs, or replicas. What if they disagree after delay, failure, or retry? → 10-consistency-and-replication.md