01. Batch vs Stream for Data Platforms
⏱️ Estimated time: 22 min | Level: intermediate
ELI5 callback: In the car factory, the loading dock sets arrival rhythm, the conveyor belt sets work rhythm, the showroom exposes finished output, the reject bin protects trust, and the manifest explains every move. This file teaches shift work versus nonstop assembly.
Start with latency, not fashion
See. Batch means waiting, then crunching a big pile together.
Streaming means every event moves almost immediately after arrival.
Neither mode is superior by default.
The right choice depends on business pain.
A payroll report can wait until morning.
Fraud blocking cannot wait until morning.
Batch gives cheap scans and simpler recovery.
Streaming gives freshness and faster reaction loops.
So what to do?
Write the required freshness in numbers.
Then write the acceptable compute bill in numbers.
Now watch.
Low latency usually increases operational complexity.
High throughput usually prefers larger grouped work.
Windowing, checkpointing, and replay matter mostly in streams.
Partition planning and file sizing matter mostly in batches.
Start from the consumer, not the producer.
Design for late data before the first production incident.
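Writing freshness and budget "in numbers" can be as small as a record the consumer signs off on. A minimal sketch follows. The class name, field names, and the 5-minute threshold are all assumptions for illustration, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessRequirement:
    """Numbers a consumer agrees to before any engine is chosen."""
    metric: str
    max_staleness_seconds: int       # how old the answer may be when read
    max_monthly_compute_usd: float   # the bill the business will accept


def suggest_mode(req: FreshnessRequirement, stream_threshold_seconds: int = 300) -> str:
    """Rule of thumb only: the default threshold is an assumed value."""
    if req.max_staleness_seconds <= stream_threshold_seconds:
        return "stream"
    return "batch"


# Illustrative values, echoing the payroll and fraud examples above.
payroll = FreshnessRequirement("payroll_report", 12 * 3600, 200.0)
fraud = FreshnessRequirement("fraud_block", 2, 5000.0)

print(payroll.metric, suggest_mode(payroll))
print(fraud.metric, suggest_mode(fraud))
```

The function is deliberately crude. Its value is that the staleness number is written down and can be argued about.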
Architecture names matter less than recovery
Lambda keeps one batch path and one streaming path.
That sounds flexible, but duplication appears quickly.
Business rules drift between the two paths.
Testing becomes annoying because answers disagree.
Kappa says one streaming path should handle everything.
Reprocessing then becomes replay, not separate batch logic.
Simple, no?
Only if storage retains enough history.
Only if stream jobs can rebuild safely.
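The Kappa idea of "reprocessing is replay" can be sketched with an in-memory list standing in for a retained log. The log contents and function names here are hypothetical. A real system would read from a durable log with enough retention.

```python
from typing import Iterable

# Hypothetical retained log: (offset, user_id, amount).
LOG = [
    (0, "ann", 10),
    (1, "bob", 5),
    (2, "ann", 7),
    (3, "bob", 1),
]


def apply_events(events: Iterable[tuple[int, str, int]]) -> dict[str, int]:
    """The single code path: fold events into per-user totals."""
    totals: dict[str, int] = {}
    for _offset, user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals


def replay(log, from_offset: int = 0) -> dict[str, int]:
    """Reprocessing runs the same function over retained history."""
    return apply_events(e for e in log if e[0] >= from_offset)


live_view = apply_events(LOG)               # what the running job computed
rebuilt_view = replay(LOG, from_offset=0)   # what a rebuild computes
print(rebuilt_view == live_view)
```

The rebuild is only safe because `apply_events` starts from empty state and the log still holds offset 0. If retention had trimmed early offsets, the rebuilt totals would silently differ.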
┌─────────┐   ┌──────────────┐   ┌────────────┐
│ Sources │──▶│ Batch engine │──▶│ Daily truth│
└─────────┘   └──────────────┘   └────────────┘
     │
     └────────▶┌──────────────┐──▶┌────────────┐
               │ Stream engine│   │ Live views │
               └──────────────┘   └────────────┘
The diagram shows why teams fight about maintenance.
Lambda suits mixed workloads and legacy estates.
Kappa suits teams that want one code path and fewer branches.
Many companies run a practical hybrid, not pure theory.
They stream hot metrics and batch cold backfills.
That is often the sane answer.
See.
Ask how the system replays six bad hours.
That answer is more valuable than architecture labels.
Correctness, cost, and backpressure
Throughput improves when you amortize startup across bigger chunks.
Latency improves when you commit smaller units sooner.
Those goals pull in opposite directions.
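The pull between the two goals shows up in a toy model. Assume every batch pays a fixed startup cost, and the first record in a batch waits for the batch to fill. The model and all parameter values are assumptions chosen for illustration.

```python
def batch_tradeoff(batch_size: int, startup_s: float,
                   per_record_s: float, arrival_rate: float):
    """Return (throughput in records/s, worst-case latency in s).

    Toy model: fixed startup cost per batch, linear per-record cost,
    and the first record waits for the batch to fill before work starts.
    """
    processing_s = startup_s + batch_size * per_record_s
    throughput = batch_size / processing_s
    fill_wait_s = batch_size / arrival_rate
    worst_latency_s = fill_wait_s + processing_s
    return throughput, worst_latency_s


for n in (1, 100, 10_000):
    tput, lat = batch_tradeoff(n, startup_s=2.0, per_record_s=0.001,
                               arrival_rate=50.0)
    print(f"batch={n:>6}  throughput={tput:8.1f} rec/s  worst latency={lat:8.1f} s")
```

Bigger batches spread the startup cost over more records, so throughput climbs. The same bigger batches make the earliest record wait longer, so latency climbs too.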
Exactly-once marketing language can confuse beginners.
In practice, end-to-end correctness usually means handling duplicates, ordering, and idempotent sinks yourself.
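An idempotent sink is the simplest of these tools to picture. A minimal sketch, assuming each event carries a unique `event_id` from the producer (a hypothetical field name):

```python
class IdempotentSink:
    """Keyed upsert: writing the same event twice leaves the same state."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> None:
        # Upsert by key instead of appending, so redelivery is harmless.
        self.rows[event["event_id"]] = event

    def total(self) -> int:
        return sum(row["amount"] for row in self.rows.values())


sink = IdempotentSink()
delivered = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},  # redelivery after a retry
]
for event in delivered:
    sink.write(event)

print(sink.total())  # 15: the duplicate did not double-count
```

An append-only sink fed the same three deliveries would report 25.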
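A watermark is the stream's estimate of how far event time has progressed; it decides when a result is final enough to emit.
A watermark that waits too long delays results, and one that fires too early drops late events and gives wrong counts.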
Batch jobs hide this pain by closing the input.
Streams keep input open, so ambiguity stays alive.
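The delay-versus-wrong-counts tension can be seen in a small simulation. This is a sketch of one simple heuristic (watermark = max event time seen minus an allowed lateness), not how any particular engine implements it. The function name and sample timestamps are made up.

```python
from collections import defaultdict


def windowed_counts(events, window_s: int, allowed_lateness_s: int):
    """Count events per tumbling window, finalizing windows behind the watermark.

    events: event timestamps (seconds) in arrival order.
    Returns (finalized counts, still-open counts, number of dropped events).
    """
    open_windows = defaultdict(int)   # window start -> count so far
    final = {}                        # window start -> emitted count
    dropped = 0
    max_event_time = float("-inf")

    for ts in events:
        start = (ts // window_s) * window_s
        if start in final:
            dropped += 1              # too late: window already emitted
        else:
            open_windows[start] += 1
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness_s
        # Finalize every open window that ends at or before the watermark.
        for w in sorted(open_windows):
            if w + window_s <= watermark:
                final[w] = open_windows.pop(w)
    return final, dict(open_windows), dropped


arrivals = [1, 4, 12, 3, 8, 25]       # 3 and 8 arrive after 12
eager = windowed_counts(arrivals, window_s=10, allowed_lateness_s=0)
patient = windowed_counts(arrivals, window_s=10, allowed_lateness_s=15)
print("eager:  ", eager)
print("patient:", patient)
```

With zero lateness, the first window closes as soon as timestamp 12 arrives, so 3 and 8 are dropped and the count is wrong. With 15 seconds of lateness the first window is correct, but the second window is still waiting when the input ends.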
Now watch.
A slow downstream table can throttle the whole path.
Backpressure is a feature, not an embarrassment.
It protects memory before the cluster collapses.
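A bounded queue is the smallest working model of backpressure. In this sketch the producer blocks whenever the buffer is full, so memory stays capped while the slow consumer catches up. The buffer size and sleep time are arbitrary illustration values.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)   # bounded buffer: the backpressure mechanism
consumed = []
max_depth = 0


def producer(n: int) -> None:
    global max_depth
    for i in range(n):
        buffer.put(i)             # blocks while the buffer is full
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(None)              # sentinel: no more events


def slow_consumer() -> None:
    while True:
        item = buffer.get()
        if item is None:
            break
        time.sleep(0.001)         # stand-in for a slow downstream table
        consumed.append(item)


t_prod = threading.Thread(target=producer, args=(50,))
t_cons = threading.Thread(target=slow_consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()

print(f"consumed={len(consumed)} max buffer depth={max_depth}")
```

Swap `maxsize=5` for an unbounded queue and the producer never slows down. The backlog then lives in memory instead.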
Streaming costs often look small until you multiply them by round-the-clock uptime.
Batch costs spike harder, but only during runs.
So what to do?
Price the steady state and the replay day.
Then choose the cheaper regret.
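Pricing both days is plain arithmetic. The sketch below uses made-up hourly rates and run lengths, not vendor prices, and a deliberately simple cost model. Substitute your own numbers.

```python
HOURS_PER_MONTH = 730  # average month, an approximation


def monthly_cost_streaming(hourly_rate: float, replay_hours: float = 0.0,
                           replay_multiplier: float = 1.0) -> float:
    """Always-on cluster plus an optional replay on extra capacity."""
    steady = hourly_rate * HOURS_PER_MONTH
    replay = hourly_rate * replay_multiplier * replay_hours
    return steady + replay


def monthly_cost_batch(hourly_rate: float, hours_per_run: float,
                       runs_per_month: int, rerun_count: int = 0) -> float:
    """Cluster exists only during runs; a replay day is extra runs."""
    return hourly_rate * hours_per_run * (runs_per_month + rerun_count)


# Illustration values only.
stream_steady = monthly_cost_streaming(hourly_rate=2.0)
stream_bad_month = monthly_cost_streaming(2.0, replay_hours=6, replay_multiplier=4)
batch_steady = monthly_cost_batch(hourly_rate=12.0, hours_per_run=1.5, runs_per_month=30)
batch_bad_month = monthly_cost_batch(12.0, 1.5, 30, rerun_count=3)

print(f"stream: steady={stream_steady:.0f} bad month={stream_bad_month:.0f}")
print(f"batch:  steady={batch_steady:.0f} bad month={batch_bad_month:.0f}")
```

The comparison only means something next to the freshness requirement. A cheaper bill does not help if the answer arrives too late.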
Practical design rules
Start with one question: how stale can the answer be?
If the answer is hours, batch is usually enough.
If the answer is seconds, stream the narrowest path.
Keep the expensive joins off the real-time path.
Precompute dimensions when possible.
Use batch to rebuild truth and stream to update views.
That pattern reduces panic during outages.
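The "batch rebuilds truth, stream updates views" pattern can be sketched in a few lines. The in-memory list and counter are stand-ins for a durable history table and a serving view, and the function names are hypothetical.

```python
from collections import Counter

events_log = []            # canonical, append-only history
live_view = Counter()      # fast view updated per event


def on_event(user: str) -> None:
    """Stream path: append to history and bump the live view."""
    events_log.append(user)
    live_view[user] += 1


def nightly_rebuild() -> Counter:
    """Batch path: recompute the view from the full history."""
    return Counter(events_log)


for u in ["ann", "bob", "ann"]:
    on_event(u)

live_view["bob"] += 5            # simulate a bug inflating the live view
live_view = nightly_rebuild()    # batch run overwrites the drifted view
print(dict(live_view))
```

The live view is allowed to drift because it is disposable. The history is what must stay correct.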
Define replay steps before launch.
Define lag alerts before launch.
Define ownership before launch.
Think again using the factory analogy.
The loading dock sets arrival rhythm, the conveyor belt sets processing rhythm, the showroom exposes freshness, the reject bin protects trust, and the manifest explains replays.
Simple, no?
Do not chase real-time because it sounds senior.
Chase the smallest system that meets the decision deadline.
Mature teams choose the mode per use case, not company-wide.
One platform can host both patterns peacefully.
Good design is disciplined compromise.
Where this lives in the wild
- Streaming personalization at Netflix coexists with daily finance closes.
- Payment companies stream fraud signals but batch settlement truth.
- Logistics platforms stream tracking events and batch route analytics.
- Ad platforms stream bidding and batch invoicing reconciliation.
Pause and recall
- When does freshness justify permanent cluster cost?
- Why does lambda risk duplicating business logic?
- What problem do watermarks solve in streams?
- When is replay easier in kappa than lambda?
Interview Q&A
Q: When would you choose batch over streaming?
A: When decisions tolerate staleness and scan economics dominate.
Common wrong answer to avoid: Streaming is always better because it is modern.
Q: What makes lambda hard to maintain?
A: Two code paths often drift semantically under deadline pressure.
Common wrong answer to avoid: Only infra cost matters.
Q: Why is backpressure useful?
A: It protects the system by slowing intake before memory collapses.
Common wrong answer to avoid: It means the platform is broken.
Q: How do you mix batch and stream safely?
A: Keep canonical truth and real-time views explicitly separated.
Common wrong answer to avoid: Use the same table for everything.
Apply now (5 min)
Pick one product metric, like daily active users or fraud blocks. Write its freshness target and tolerance for duplicates. Sketch batch, lambda, and kappa options on paper. Mark where replay happens and who gets paged. Choose one design and justify the regret you accept.
Bridge. Processing mode chosen. But how does raw data get in? → 02