06. Data Flow First Design: Follow the ticket, not your panic
~15 min read. When the whiteboard feels blank, trace one request and let the system reveal itself.
Built on the ELI5 in 00-eli5.md. The order ticket — one user request moving through the restaurant — shows us which prep station touches data, and when.
1) Start with movement, not boxes
See. Most candidates freeze because they start with nouns. Load balancer, cache, queue, database. Nice words. No flow. So the page stays blank. Data flow fixes that. We start with one user action. We follow the order ticket end to end. We ask four things. What data arrives? What changes? What must persist? What goes back to the user?
┌────────┐   tap place order   ┌──────────┐   validate   ┌─────────────┐
│ client │ ──────────────────→ │ API tier │ ───────────→ │ order logic │
└────────┘                     └──────────┘              └──────┬──────┘
                                                                │
                                     write rows ────────────────┤
                                     publish event ─────────────┤
                                     send response ◀────────────┘
2) The write path is about state change
Take a food-delivery app. A user taps "Place order." That single tap becomes multiple internal steps. Look.
┌────────┐  ┌────────────┐  ┌───────────────┐  ┌───────────┐
│ client │→ │ auth check │→ │ order service │→ │ SQL store │
└────────┘  └────────────┘  └──────┬────────┘  └─────┬─────┘
                                   │                 │
                                   ├──→ outbox table │
                                   │                 │
                                   └──→ response     │
                                                     │
                                outbox poller ───────┘
                                      │
                                      └──→ queue ──→ payment/inventory/notification
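The write path above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes SQLite as a stand-in for the SQL store, and the table names (`orders`, `outbox`), the event shape, and the `publish` callback are all hypothetical. The point it shows is that the order row and the outbox row commit in the same transaction, and a separate poller moves events to the queue afterwards.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory store with hypothetical orders and outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT, "
        "total_cents INTEGER, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT, published INTEGER)"
    )
    return conn


def place_order(conn, order_id, user_id, total_cents):
    """Commit the order row and its outbox event in one transaction."""
    event = {"type": "order_accepted", "order_id": order_id}
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO orders (id, user_id, total_cents, status) "
            "VALUES (?, ?, ?, 'ACCEPTED')",
            (order_id, user_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (payload, published) VALUES (?, 0)",
            (json.dumps(event),),
        )
    return {"order_id": order_id, "status": "ACCEPTED"}


def poll_outbox(conn, publish, batch_size=100):
    """Publish pending events in order, marking each one after it is sent.

    A crash between publish and mark would resend the event, so this is
    at-least-once delivery; consumers are assumed to be idempotent.
    """
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

In this sketch a plain list can play the queue: `poll_outbox(conn, queue.append)` drains pending events into it. The response to the client depends only on `place_order`, so slow consumers never sit on the checkout path.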
3) The read path is about serving a question fast
Now the same user opens "Track order." Very different goal. No new truth is being created. We are shaping existing truth.
┌────────┐  ask for status   ┌──────────┐   fan-in   ┌─────────────────┐
│ client │ ────────────────→ │ API tier │ ─────────→ │ order view svc  │
└────────┘                   └──────────┘            └──────┬──────────┘
                                                            │
                       cache lookup ────────────────────────┤
                       read model query ────────────────────┤
                       ETA service call ────────────────────┤
                       response shaping ◀───────────────────┘
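The four steps in the read diagram can be sketched as one function. This is an illustrative sketch under stated assumptions: the cache and the read model are plain dicts, `eta_service` is a hypothetical callable, and the field names are invented for the example. Cache expiry is left out to keep it short.

```python
def get_order_view(order_id, cache, read_model, eta_service):
    """Serve "Track order" by shaping existing truth; nothing is written
    to the primary store on this path."""
    # 1. Cache lookup: the cheapest possible answer.
    view = cache.get(order_id)
    if view is None:
        # 2. Read model query: a pre-joined row, not the write tables.
        row = read_model[order_id]
        view = {
            "order_id": order_id,
            "status": row["status"],
            "restaurant": row["restaurant"],
        }
        cache[order_id] = view
    # 3. ETA service call: fetched fresh, and allowed to fail without
    #    failing the whole screen.
    try:
        eta_minutes = eta_service(order_id)
    except Exception:
        eta_minutes = None
    # 4. Response shaping: only the fields the screen needs.
    return {**view, "eta_minutes": eta_minutes}
```

The design choice worth noticing is the degraded response: if the ETA call times out, the user still sees the order status, which is the question this path exists to answer.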
4) Worked example: trace one order with numbers
Suppose we are designing order placement for a quick-commerce app. Dinner peak is 500 order attempts per second. Eighty percent pass payment and become accepted orders. So accepted writes are: 500 attempts/sec × 0.8 = 400 committed orders/sec.
Each accepted order writes these items.
- order row = 2 KB
- payment reference row = 0.5 KB
- inventory reservation row = 0.5 KB
- outbox event row = 1 KB
Total durable write per accepted order is: 2 + 0.5 + 0.5 + 1 = 4 KB. Total durable write throughput is: 400 orders/sec × 4 KB = 1600 KB/sec. 1600 KB/sec is about 1.6 MB/sec. Per minute, storage ingest is: 1.6 MB/sec × 60 = 96 MB/min. Per hour, storage ingest is: 96 MB/min × 60 = 5760 MB/hour. That is about 5.76 GB/hour.
Now latency. Our house rules say checkout confirmation must return in under 700 ms at p95. We budget the path.
- client to edge = 60 ms
- edge to API + auth/validation = 35 ms
- pricing call = 40 ms
- payment authorization = 260 ms
- SQL transaction plus outbox write = 35 ms
- response serialization and network back = 70 ms
Total estimated latency is: 60 + 35 + 40 + 260 + 35 + 70 = 500 ms. So we have 200 ms headroom. Good.
Now what should stay synchronous? Everything needed to safely say, "Order accepted." That includes the SQL commit. That includes the payment authorization result. That includes the outbox record for downstream work.
What should move async?
- sending SMS
- pushing restaurant tablet notification
- analytics counters
- search index updates
If each of those took even 80 ms extra, four side effects add: 80 × 4 = 320 ms. Our 500 ms path becomes 820 ms. That breaks the p95 target.
See. The numbers tell us where to cut. This is the real power of data-flow-first design. You stop arguing in vague words. You start protecting the critical path.
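The arithmetic in this worked example is easy to re-check in code. The sketch below only restates the numbers from the text; the constant and variable names are made up for the example, and it uses decimal units (1 MB = 1000 KB) as the text does.

```python
ATTEMPTS_PER_SEC = 500
ACCEPT_RATE = 0.8
P95_TARGET_MS = 700
SIDE_EFFECT_COST_MS = 80

# Durable rows written per accepted order, in KB.
ROW_SIZES_KB = {
    "order": 2.0,
    "payment_reference": 0.5,
    "inventory_reservation": 0.5,
    "outbox_event": 1.0,
}

# Steps that must finish before we can say "Order accepted", in ms.
SYNC_STEPS_MS = {
    "client_to_edge": 60,
    "edge_to_api_auth_validation": 35,
    "pricing_call": 40,
    "payment_authorization": 260,
    "sql_transaction_plus_outbox": 35,
    "response_and_network_back": 70,
}

# Work that can ride behind the outbox event.
ASYNC_SIDE_EFFECTS = ["sms", "tablet_notification", "analytics", "search_index"]

accepted_per_sec = ATTEMPTS_PER_SEC * ACCEPT_RATE        # 400 orders/sec
kb_per_order = sum(ROW_SIZES_KB.values())                # 4 KB
kb_per_sec = accepted_per_sec * kb_per_order             # 1600 KB/sec
mb_per_hour = kb_per_sec * 3600 / 1000                   # 5760 MB/hour

sync_ms = sum(SYNC_STEPS_MS.values())                    # 500 ms
headroom_ms = P95_TARGET_MS - sync_ms                    # 200 ms
inline_ms = sync_ms + SIDE_EFFECT_COST_MS * len(ASYNC_SIDE_EFFECTS)  # 820 ms
slowest_step = max(SYNC_STEPS_MS, key=SYNC_STEPS_MS.get)

print(f"write throughput: {kb_per_sec:.0f} KB/sec, {mb_per_hour:.0f} MB/hour")
print(f"sync path: {sync_ms} ms, headroom: {headroom_ms} ms")
print(f"with side effects inline: {inline_ms} ms, slowest step: {slowest_step}")
```

Keeping the budget as a table like this makes the trade-off concrete: changing one step's cost or moving one side effect between the two lists shows at once whether the path still fits under the target.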
5) A practical whiteboard checklist
When you feel stuck, say this out loud. What is the user action? What is the minimum durable commit? What events leave that commit? What question does the first read path answer? Which parts can be stale? Which parts cannot? Which component owns each transition? That is enough to begin.
A good first-pass design often fits this pattern.
- API receives the menu request
- one service owns the business transition
- one primary store commits truth
- one queue carries deferred side effects
- one read model serves the common query
Now what is the problem after that? Hot reads. Cold writes. Rebuild lag. Fanout cost. Search indexing delay. Good. Those are solvable second-order problems. Blank-whiteboard syndrome is worse. Kill that first. Follow the data. Let the path draw the system for you.
Where this lives in the wild
- Swiggy checkout — backend engineer separates the order-accept write path from the order-tracking read path so dinner spikes do not corrupt truth.
- Uber trip request screen — rider platform engineer commits trip creation first, then serves live driver ETA through a different read flow.
- Stripe PaymentIntent — payments engineer treats confirmation writes and dashboard/reporting reads as different data paths with different latency needs.
- LinkedIn feed publishing — feed engineer handles post creation, fanout, and timeline reads as separate flows because each path has different bottlenecks.
- Shopify order admin — merchant tools engineer keeps order writes strict while search-heavy merchant views use read-optimized indexes.
Pause and recall
- Why does tracing one order ticket reduce blank-whiteboard syndrome?
- What belongs in the minimum safe commit, and what can move async?
- Why can the read path tolerate a different schema from the write path?
- In the worked example, which step dominated latency, and what design move followed from that?
Interview Q&A
Q: Why design the write path and read path separately instead of drawing one generic request flow?
A: They optimize for different things. Writes protect correctness and durability. Reads protect latency and response shape. One diagram hides that tension.
Common wrong answer to avoid: "Because reads are usually more than writes." Volume matters, but the real reason is different constraints on truth versus serving.
Q: Why trace one request end to end before listing components?
A: Flow exposes ownership, state transitions, and critical latency edges. Component lists sound smart but miss causality.
Common wrong answer to avoid: "Because interviewers like storytelling." The value is operational clarity, not presentation style.
Q: Why push notifications and analytics behind async events instead of keeping them in the checkout call?
A: They are not part of the minimum safe commit. Keeping them synchronous burns latency budget and couples availability to non-critical work.
Common wrong answer to avoid: "Because queues are faster." Queues help decouple and smooth traffic, but they do not magically make work free.
Q: Why build a read model instead of querying normalized write tables directly for every screen?
A: Write schemas preserve clean mutations. Read screens need pre-joined, pre-shaped answers. Forcing one model to do both usually hurts both.
Common wrong answer to avoid: "Because denormalization is always faster." Speed helps, but the deeper issue is matching storage shape to the question asked.
Apply now (5 min)
Exercise: Pick one flow from BookMyShow, Zomato, or UPI. Write the first user action, the minimum safe commit, and three async side effects.
Sketch from memory: Draw one write path and one read path. Mark the primary store, one queue, one cached read model, and the slowest synchronous step.
Bridge. Data flows through the system. But where does it rest? The storage decision — SQL or NoSQL — shapes everything downstream. → 07-storage-decision-framework.md