08. Where the pipe hits its limits: open problems, contested practices, and when streaming is the wrong tool
~25 min read. A staff engineer reviews your design and asks one question the whole module has been circling: "You built seven layers of real-time machinery. The product is a copilot that answers within a human's patience. Could a thirty-minute micro-batch have done the same job at a tenth the cost — and would the user have noticed?" If you can't answer honestly, you didn't design the platform; you defaulted to it. This file is the honest answer, and every place the clean architecture frays.
Pulls together every recurring concept from 00-first-principles.md — the freshness gap, replay log, modality cost asymmetry, derived artifact, exactly-once vs at-least-once, incremental indexing, lambda vs kappa, and lineage across modalities — and stresses each at the boundary where it stops being obviously right.
What the module built, and the questions it deliberately left open
The module built a complete streaming multimodal platform. Chapter 01 turned "fresh enough" into a latency budget. Chapter 02 absorbed bursts with a durable replay log. Chapter 03 tiered storage so raw stays cheap and derived stays queryable. Chapter 04 ran models in-stream with idempotent writes. Chapter 05 served fresh cross-modal retrieval. Chapter 06 spent freshness only where it's worth it. Chapter 07 made the whole flow traceable and erasable. Each chapter resolved its pressure cleanly enough to teach.
Real systems are not clean. Every mechanism in this module sits on a tradeoff that the teaching version smoothed over, and several sit on practices the industry actively disagrees about. The freshness budget assumes you can measure the value of a second; often you can't. Exactly-once is advertised everywhere and end-to-end almost nowhere. Kappa is the default for new builds and lambda refuses to die. And the largest open question hangs over the whole thing: a lot of streaming platforms are over-engineered batch jobs wearing a Flink cluster. This file walks each boundary, gives both sides where the industry disagrees, and ends where the module should — closer to operational ambiguity than to a clean answer.
What this file solves
A reader who has the seven mechanisms can still over-build, because the module taught each one as the right answer to its pressure without marking where the pressure is too weak to justify the machinery. This file shows where each mechanism hits its honest limit, which practices are genuinely contested (exactly-once, lambda-vs-kappa, in-stream vs async model calls, multimodal-shared-space vs caption-and-index), and the single most useful judgment in this domain: recognizing when a streaming platform is solving a problem a scheduled batch job already solved more cheaply.
1) Why "is it fresh enough?" is the wrong first question: the need to question the premise
Every chapter optimized the freshness gap. The unexamined assumption underneath all of them is that freshness has measurable value worth its cost. For the live copilot mid-call, it plainly does — a 90-second-old chat changes the answer. But that single strong case quietly justified seconds-fresh machinery across a platform where most data is not consumed by a human waiting in real time.
The honest first question is not "how do we make it fresh?" It is "who perceives the freshness, and what does a slower answer actually cost them?" Run it against the real consumers:
| Consumer | Perceives a faster answer? | Value of seconds-fresh |
|---|---|---|
| copilot, customer mid-call | yes, directly | high: changes the answer |
| supervisor glancing at a wall | barely (refresh ~30–60 s) | low: UI hides the freshness |
| nightly CSAT report | no | zero |
| retraining export | no | zero |
| quarterly audit | no | zero |
The symptom of getting this wrong is not a failure — it's a bill with no complaint behind it. Nothing breaks. The dashboards are green. You are simply paying the steep end of the freshness curve for latency four of five consumers cannot perceive. So the root cause of most streaming over-spend is not bad engineering; it is answering "how fresh can we make it?" before answering "who is waiting?" The natural question that should come first: which consumer's perception, if any, justifies the always-on machinery — and for everyone else, would a batch job be indistinguishable?
Why this question comes first. Freshness is the module's anchor, but an anchor optimized without a value function becomes a cost sink. The freshness gap is worth closing only where a human or a deadline perceives the closing. Asking "who waits?" before "how fresh?" is the one habit that separates a platform sized to its product from a platform sized to its tooling.
2) The core picture: the over-engineering boundary
VALUE OF FRESHNESS (per query)
   high   │   ● live copilot        ← streaming earns its cost here
          │     (human mid-call)
          │
          │   ● live fraud block
          │
──────────┼───────────────────────────────────────────▶ QUERY RATE
          │                          (how often the path is hit)
   low    │   ● supervisor wall      ● nightly report
          │     (UI-throttled)       ● audit   ● export
          │                          ┌─────────────────────────┐
          │                          │ STREAMING HERE = the    │
          │                          │ over-engineering trap   │
          │                          │ (scheduled batch wins)  │
          └──────────────────────────┴─────────────────────────┘
Streaming is justified ONLY in the high-freshness-value region.
Most data lives in the low-value region and should be batch.
The picture the module never drew explicitly: a plane of freshness-value against query-rate, with a small region where streaming earns its always-on cost and a large region where it doesn't. The live copilot sits in the high-value corner — that one case is real and justifies everything chapters 02–05 built for that path. The mistake is letting the copilot's strong case license streaming for the whole plane. Most paths sit in the low-value region where a scheduled batch over landed data (chapter 06) is correct, cheaper, and indistinguishable to the user.
3) The running example, stress-tested: was the platform over-built?
Apply the boundary to the support-copilot platform honestly. Score each path:
| Path | Freshness value | Query rate | Verdict |
|---|---|---|---|
| live copilot retrieval | high | high | streaming justified (the anchor case) |
| live call sentiment | high | high | streaming justified (shares the pipe) |
| supervisor dashboard | low (UI 60 s) | medium | 1-min micro-batch, not per-event stream |
| daily CSAT report | zero | low | scheduled batch query (ch.06); streaming = waste |
| retraining export | zero | low | batch read of landed data; streaming = waste |
| quarterly audit | zero | very low | cold-tier query (ch.03); streaming = waste |
Two of six paths justify the streaming pipe. The other four were riding it because it existed. The platform was correctly built for the copilot and over-built for everything else — which is the common real outcome, not a design error so much as a failure to draw this line. The fix is not to tear out streaming; it's to let the four patient paths read landed data on a schedule (chapter 06's kappa-with-tiering), keeping the streaming machinery scoped to the two paths a human waits on.
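The scoring above can be written down as a small rule so it is applied the same way to every path. This is a minimal sketch, not the module's method: the `ReadPath` type, the string labels, and the `verdict` rule are all hypothetical, chosen only to reproduce the verdict column of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPath:
    name: str
    freshness_value: str  # "high", "low", or "zero" (hypothetical labels)
    query_rate: str       # "high", "medium", "low", or "very low"


def verdict(path: ReadPath) -> str:
    """Coarse recommendation for one read path (illustrative rule only)."""
    if path.freshness_value == "high" and path.query_rate == "high":
        return "streaming"
    if path.freshness_value == "high":
        # Valuable but rarely hit: the text gives no rule, so flag it.
        return "measure before deciding"
    if path.freshness_value == "low":
        return "micro-batch"
    return "scheduled batch"


# The six paths from the table, with the table's own scores.
PATHS = [
    ReadPath("live copilot retrieval", "high", "high"),
    ReadPath("live call sentiment", "high", "high"),
    ReadPath("supervisor dashboard", "low", "medium"),
    ReadPath("daily CSAT report", "zero", "low"),
    ReadPath("retraining export", "zero", "low"),
    ReadPath("quarterly audit", "zero", "very low"),
]

streaming_paths = [p.name for p in PATHS if verdict(p) == "streaming"]

for p in PATHS:
    print(f"{p.name:26s} -> {verdict(p)}")
```

The value of writing the rule down is less the code than the forced input: each path needs an explicit freshness-value label before it can claim the streaming pipe.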
The sharper stress test: even the copilot's freshness budget is contestable. Chapter 01 set ≤10 s. But a customer who just spoke is mid-conversation; would a 30-second-fresh retrieval actually feel worse? Possibly not, because the human's turn-taking gives slack. If the true perceptible budget is 30 s, a 30-second micro-batch (Spark Structured Streaming, chapter 04) could serve the copilot at a fraction of continuous-Flink cost. The honest answer is to measure it: A/B the copilot at 10 s vs 30 s freshness and see if CSAT moves. Most teams never run that test and pay for 10 s on faith.
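A minimal sketch of how the 10 s vs 30 s comparison might be read, assuming per-conversation CSAT scores have been collected for each arm. The function name `welch_z` and the sample scores are hypothetical and the data is made up. The p-value uses a normal approximation, which is rough for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_z(a, b):
    """Welch-style statistic for two samples with unequal variances,
    plus a two-sided p-value from a normal approximation."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Synthetic CSAT scores (1-5), invented purely for illustration.
csat_10s = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
csat_30s = [4, 4, 5, 4, 5, 3, 4, 4, 5, 4]

stat, p_value = welch_z(csat_10s, csat_30s)
print(f"difference in means: {mean(csat_10s) - mean(csat_30s):+.2f}")
print(f"statistic: {stat:.2f}, approx p-value: {p_value:.2f}")
```

A non-significant result is not proof that 30 s is imperceptible. A real test would fix a minimum CSAT drop worth detecting and size the experiment for it before reading the result.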
4) The contested practices: both sides, fairly
These are the places the industry genuinely disagrees. A senior engineer should hold both sides, not recite a winner.
Exactly-once: real guarantee or marketing word?
One camp treats end-to-end exactly-once as achievable and worth designing for (Flink + transactional sinks, Kafka transactions). The other camp (Kleppmann-flavored) argues exactly-once is effectively-once — at-least-once delivery plus idempotent or transactional writes — and that "exactly-once delivery" over a network with side effects is impossible; the model call's GPU-second can't be un-spent (chapter 04). The practical resolution most shops reach: don't chase exactly-once delivery; make the sink idempotent and accept at-least-once. The disagreement persists because the frameworks advertise "exactly-once" and the marketing outruns the guarantee's actual end-to-end scope.
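The "effectively-once" resolution can be illustrated with a toy sink. This is a sketch under stated assumptions: `IdempotentSink` and `deliver_at_least_once` are hypothetical names, the sink is an in-memory dict standing in for a keyed upsert, and redelivery is simulated by hand instead of coming from a real broker.

```python
class IdempotentSink:
    """Keyed upsert: writing the same event twice leaves exactly one row."""

    def __init__(self):
        self.rows = {}
        self.writes_attempted = 0

    def write(self, event_id, payload):
        self.writes_attempted += 1
        self.rows[event_id] = payload  # overwrite by key, never append


def deliver_at_least_once(events, sink, redeliver_ids):
    """Deliver every event once, and some twice, as a retry after a lost ack would."""
    for event_id, payload in events:
        sink.write(event_id, payload)
        if event_id in redeliver_ids:
            sink.write(event_id, payload)  # duplicate delivery


events = [("e1", {"text": "hello"}), ("e2", {"text": "refund"}), ("e3", {"text": "bye"})]
sink = IdempotentSink()
deliver_at_least_once(events, sink, redeliver_ids={"e2"})

print(len(sink.rows), "rows after", sink.writes_attempted, "write attempts")
```

The row count stays at three even though four writes were attempted. The duplicated attempt is still counted, which mirrors the point about the model call: the stored result is deduplicated, but the work spent producing the duplicate is not recovered.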
Lambda is dead vs lambda refuses to die
The 2026 consensus is kappa-by-default for new builds: one logic path, replay for reprocessing, no drift (chapter 06). But lambda persists for two unglamorous reasons — many platforms predate durable replayable logs and lakehouse table formats, so their batch layer is still load-bearing; and some regulated shops want an independent batch recomputation as an audit cross-check. "Lambda is dead" is true for greenfield and false for the installed base. The honest position: kappa for new, but don't sneer at a lambda you inherited that's correctly serving a real reprocessing-authority need.
Models in the stream vs async side-path
Chapter 04 ran ASR and embedding inline. But a heavy model (multi-minute document understanding, large re-ranking) inline stalls the operator and blows checkpoints. The contested line is how heavy is too heavy for inline. One side keeps everything in-stream for simplicity; the other pushes anything slow to an async side-path (emit a job, process off-stream, write back). There is no fixed threshold — it depends on the model's latency relative to the freshness budget and the operator's checkpoint cadence. The judgment is per-model, and reasonable engineers draw the line differently.
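One way to make the per-model judgment explicit is to compare the model's latency with both the freshness budget and the checkpoint interval. The sketch below is an assumption-laden heuristic, not a rule from this module: the function `placement` and the two fraction parameters are hypothetical knobs, and the text is clear that no fixed threshold exists.

```python
def placement(model_latency_s, freshness_budget_s, checkpoint_interval_s,
              budget_fraction=0.5, checkpoint_fraction=0.5):
    """Suggest "inline" or "async" for one model (illustrative heuristic).

    budget_fraction: share of the freshness budget one model call may consume.
    checkpoint_fraction: share of the checkpoint interval one call may block.
    Both defaults are arbitrary starting points, to be tuned per workload.
    """
    fits_budget = model_latency_s <= budget_fraction * freshness_budget_s
    fits_checkpoint = model_latency_s <= checkpoint_fraction * checkpoint_interval_s
    return "inline" if fits_budget and fits_checkpoint else "async"


# Hypothetical numbers: a fast embedding call vs a multi-minute document model,
# both against a 10 s freshness budget and a 60 s checkpoint interval.
print(placement(0.3, 10, 60))
print(placement(120, 10, 60))
```

Two engineers who disagree about where the line sits are, in these terms, disagreeing about the two fractions, which is a more productive argument than inline versus async in the abstract.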
Shared multimodal space vs caption-and-index
Chapter 05 favored a shared multimodal embedding space. But caption-and-index (caption every image, embed the caption as text) is simpler, keeps a pure-text stack, and is sometimes better when images are simple. The shared space retains visual detail a caption drops, but a general multimodal model can underperform a specialist text model on text-to-text relevance. The disagreement is real and workload-dependent: error screenshots favor shared space; simple diagrams favor captions.
Teacher voice. The mark of a junior answer is picking the "modern" side of each of these and defending it as obviously correct. The mark of a senior answer is naming the workload condition that flips the choice: exactly-once matters when the sink can't dedupe; lambda survives when replay can't reach the history or an audit demands a second computation; inline models work until their latency threatens the checkpoint; shared space wins when visual detail matters. The contested practices aren't unsolved because people are confused — they're contested because the right answer genuinely depends on the workload.
5) The empirical-but-untheorized parts: what works without a clean reason
Several things in this module work in practice and lack a tidy theory telling you the right setting.
- Window and batch sizes (chapter 04). "Flush at 200 messages or 1 second" works, but the 200 and the 1 s are tuned by observation, not derived. The right window is workload- and provider-dependent; there's no formula, only a measured cost/latency curve.
- Compaction cadence (chapter 05). How often to compact sealed segments to hold recall is empirical — too rare and recall rots, too frequent and CPU thrashes. The "right" cadence is found by watching recall@k against ground truth, not computed.
- Confidence thresholds for quality gates (chapter 07). The τ below which an ASR transcript is quarantined is a tuned dial; too high quarantines good data, too low admits garbage. No theory fixes it.
- Freshness budgets themselves (chapter 01). "≤10 s for the copilot" is a defensible guess, rarely an A/B-measured value. The perceptible threshold is empirical and usually un-measured.
These aren't flaws — they're the honest state of the art. The danger is treating a tuned constant as a law. A number that worked at last quarter's volume and provider pricing can be wrong this quarter, and nothing alerts you because the system still runs.
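The "flush at 200 messages or 1 second" window from the first bullet can be sketched as a count-or-time batcher. The class name `MicroBatcher`, its parameters, and the `tick` method are hypothetical, and the injectable clock exists only so the behavior can be checked without sleeping. The defaults mirror the numbers in the text and are exactly the kind of tuned constant that should be re-measured.

```python
import time


class MicroBatcher:
    """Buffer items and flush when the batch is full or the oldest item is too old."""

    def __init__(self, flush, max_items=200, max_wait_s=1.0, clock=time.monotonic):
        self._flush = flush          # callback receiving a list of items
        self.max_items = max_items   # tuned by observation, not derived
        self.max_wait_s = max_wait_s
        self._clock = clock
        self._buf = []
        self._first_at = None        # arrival time of the oldest buffered item

    def add(self, item):
        if not self._buf:
            self._first_at = self._clock()
        self._buf.append(item)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a quiet stream still flushes on time."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._buf:
            return
        full = len(self._buf) >= self.max_items
        stale = self._clock() - self._first_at >= self.max_wait_s
        if full or stale:
            batch, self._buf = self._buf, []
            self._flush(batch)
```

Keeping `max_items` and `max_wait_s` as constructor arguments, instead of literals buried in the loop, makes it cheaper to re-tune them when volume or provider pricing shifts.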
6) The property that breaks intuition: cost scales with idle readiness, not work done
The dimension that surprises people is that a streaming platform's dominant cost is often not the work it does — it's the readiness it maintains while doing nothing. The consumers, the warm GPU pool, the always-on index, the compaction workers all bill continuously whether or not a query arrives. A batch job bills only when it runs.
| Cost driver | Batch | Streaming |
|---|---|---|
| compute | only during the run | continuous (idle readiness) |
| when no queries | ~zero | full always-on cost |
| cost ∝ | work performed | readiness maintained |
| cheap when | queries are rare/patient | queries are frequent/urgent |
This inverts the naive cost intuition. People assume "streaming costs more because it does more work." It often does less work than a nightly batch (incremental updates vs full recompute) yet costs more, because it pays to be ready every second. So the cost question is never "how much work?" — it's "how much readiness, and is anyone using it?" A streaming path serving one query a day pays 86,400 seconds of readiness for one second of value. The pressure evolution across the whole module: every freshness mechanism relieved latency by converting occasional batch work into continuous readiness, and the bill is the readiness, absorbed whether or not a reader shows up.
Mini-FAQ. "If streaming does less work, why is my Flink bill higher than the old nightly Spark job?" Because the Spark job ran for 40 minutes and stopped; the Flink job runs 1,440 minutes a day to be ready. You traded total work for total readiness, and you pay for readiness by the second. That trade is worth it exactly when a reader is waiting often enough that the readiness is used — and a waste when it isn't.
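The arithmetic behind the Mini-FAQ is short enough to write out. This sketch uses only the figures already in the text (a 40-minute batch run, 1,440 always-on minutes, 86,400 seconds in a day); the function names are hypothetical, and it deliberately compares billed time instead of dollars, since real per-minute rates differ between a batch cluster and a streaming one.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def billed_minute_ratio(batch_run_minutes):
    """How many times more minutes an always-on job bills than a daily batch run."""
    return MINUTES_PER_DAY / batch_run_minutes


def readiness_utilization(busy_seconds_per_day):
    """Fraction of the day's paid-for readiness that actually served a reader."""
    return busy_seconds_per_day / SECONDS_PER_DAY


print(f"always-on vs 40-minute batch: {billed_minute_ratio(40):.0f}x the billed minutes")
print(f"one 1-second query per day uses {readiness_utilization(1):.6%} of the readiness")
```

At equal per-minute rates the always-on job bills 36 times the minutes of the 40-minute run, which is why doing less total work does not translate into a smaller bill.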
7) Cost table: where each mechanism stops being worth it
Order-of-magnitude judgment, not provider-exact. The point is the boundary, not the number.
| Mechanism | Worth it when | Stops being worth it when | Cheaper alternative past the boundary |
|---|---|---|---|
| Replay log (02) | bursty producers, replay needed | steady low volume, no replay | a queue or direct writes |
| Storage tiering (03) | large raw, mixed access | small data, uniform access | one bucket, no tiering ops |
| In-stream models (04) | per-event freshness matters | heavy model, no freshness need | async batch inference |
| Incremental index (05) | interactive fresh reads | reads are rare/patient | periodic rebuild, even nightly |
| Kappa one-path (06) | durable log + lakehouse exist | regulator wants 2nd computation | lambda's batch as audit cross-check |
| Per-event streaming (all) | human waits, frequent queries | no human waits / rare queries | scheduled batch over landed data |
| Full lineage (07) | PII, audit, erasure required | throwaway internal stream | back-pointer only |
Read the "Stops being worth it when" column as the over-engineering detector. The moment a path matches one of its conditions, the streaming machinery for that path is cost without perceived value, and the alternative in the rightmost column is the honest design. The whole-platform version: per-event streaming is worth it only where a human waits frequently; everywhere else, scheduled batch over the landed lakehouse (chapter 06) is correct and far cheaper.
Concrete: the platform's ~$41k/month (chapter 06) is justified for the copilot's two paths; the four patient paths riding the same machinery were perhaps $8–12k/month of readiness serving queries a nightly job would have served for a few hundred dollars. The over-spend is invisible because nothing fails — it's readiness nobody uses.
8) Operational signals: detecting over-engineering and boundary stress
- Healthy: always-on cost concentrated on the paths with high query rate and high freshness value (the copilot); patient paths run as scheduled batch and idle between runs; A/B evidence exists for the freshness budgets that cost the most.
- First metric to degrade: cost-per-perceived-second-of-freshness — always-on spend on a path divided by how much a user perceives that path's freshness. It climbs silently when streaming machinery serves UI-throttled or batch-suitable consumers; no SLO catches it because nothing is failing. This is the over-engineering smell.
- Misleading metric people watch: uptime and latency SLOs, both green. A platform can be 100% available, p99 excellent, and 60% over-built — the green board is the comfort that hides the waste. Great SLOs on a never-perceived path are the symptom of over-engineering, not health.
- First graph an expert opens: cost per path overlaid with query rate and a freshness-value estimate (who waits, can they perceive it). They hunt for the high-cost, low-perceived-value paths — the streaming machinery serving consumers a batch job would satisfy invisibly — and for freshness budgets set on faith with no A/B behind them.
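The text defines cost-per-perceived-second-of-freshness only loosely, so any formula is a choice. The sketch below is one possible operationalization, stated as an assumption: perceived gain per query is the staleness a batch alternative would add, counted only above a `perceptible_floor_s` below which the consumer cannot tell (for example a 60 s UI refresh). All names and example numbers are hypothetical.

```python
import math


def cost_per_perceived_second(monthly_cost, queries_per_month,
                              stream_staleness_s, batch_staleness_s,
                              perceptible_floor_s):
    """Always-on spend divided by the freshness consumers can actually perceive.

    Returns math.inf when no query perceives any gain, which is the
    over-engineering signal: spend with nothing in the denominator.
    """
    effective_staleness_s = max(stream_staleness_s, perceptible_floor_s)
    gain_per_query_s = max(0.0, batch_staleness_s - effective_staleness_s)
    total_gain_s = gain_per_query_s * queries_per_month
    if total_gain_s == 0:
        return math.inf
    return monthly_cost / total_gain_s


# Hypothetical inputs: a dashboard throttled to 60 s gains nothing from a
# 2 s stream over a 60 s micro-batch; a copilot gains a lot over a 30-minute batch.
dashboard = cost_per_perceived_second(2_000, 50_000, 2, 60, 60)
copilot = cost_per_perceived_second(30_000, 400_000, 10, 1_800, 10)
print("dashboard:", dashboard)
print("copilot:", copilot)
```

The exact formula matters less than plotting the result per path next to its cost: paths that come out infinite or very large are the ones to audit first.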
9) Boundary: where this whole module's architecture fits, and where it's the wrong tool
- Strong fit: a product with a genuine real-time human-in-the-loop consumer over multimodal data — the live support copilot, live fraud blocking, live trading signals. The freshness gap is perceived, the query rate is high, and the always-on cost buys perceived value. Here the full module applies and earns its keep.
- Pathological: a "real-time platform" built because streaming is fashionable, serving dashboards refreshed every minute and reports read once a day. Every mechanism is correct in isolation and the sum is an expensive batch job. The pathology is invisible — nothing fails, the board is green, and the bill is the only symptom.
- Scale/workload limit that breaks intuition: at low query rates or with patient consumers, the entire streaming apparatus is the wrong tool, and a scheduled batch over a lakehouse beats it on cost and simplicity. The intuition "real-time is the modern, superior architecture" fails hardest exactly where it's most confidently applied: most data does not have a human waiting on it, and for that data, batch was never legacy; it was always the right answer.
10) Wrong model to drop: "real-time is strictly better than batch"
The seductive idea, reinforced by every vendor and conference talk, is that streaming is the modern superior architecture and batch is legacy to be migrated away from. It feels like progress. The correct model: streaming and batch are different cost/latency tradeoffs, and the right one depends entirely on whether a consumer perceives the latency. Streaming converts occasional work into continuous readiness — worth it when a reader is waiting often, pure waste when they aren't. Batch pays only for work performed — wasteful when a human is blocked on staleness, optimal when nobody is. "Real-time is better" leads directly to the over-engineering trap: building always-on machinery for consumers who can't perceive the freshness it delivers. The senior model is freshness-value-per-query, not real-time-as-default.
11) Other boundary and tradeoff failure shapes
- Cargo-cult streaming — adopting Kafka+Flink because the architecture diagram looks modern, for a workload a cron job served fine; the bill is the only feedback.
- Freshness budget on faith — the most expensive path's latency target was never A/B-measured against a slower one; you pay for 10 s when 30 s was imperceptible.
- Exactly-once theater: an "exactly-once" Flink job pointed at a non-transactional sink; the guarantee was never end-to-end and duplicates appear (chapter 04).
- Inherited-lambda guilt — ripping out a working lambda that genuinely serves an audit-recomputation need, in the name of "kappa is modern."
- Tuned-constant rot — a window size or compaction cadence tuned last quarter is wrong at this quarter's volume/pricing; nothing alerts because nothing fails.
- Shared-space cargo cult: forcing multimodal embeddings onto simple images that caption-and-index would handle more cheaply and accurately.
- Idle-readiness blindness — sizing cost by work done, not readiness maintained; the platform serving one query a day looks "efficient" per-query while burning continuous spend.
- SLO comfort — declaring health from green uptime/latency while 60% of the machinery serves consumers who can't perceive its output.
12) Pattern transfer
- Freshness-value-per-query = caching's working-set principle — spend on what's accessed often enough to matter; the over-engineering boundary is the same shape as caching cold data nobody reads (chapter 06's tiering, chapter 03's storage tiers).
- Idle readiness vs work-performed = provisioned vs on-demand everywhere — the streaming/batch cost inversion is the same as reserved-vs-spot, always-on-server-vs-lambda, and connection-pool-vs-per-request; readiness is a cost you pay whether or not it's used.
- Contested-practice judgment = the workload-flips-the-answer pattern — exactly-once, lambda-vs-kappa, inline-vs-async all resolve by naming the workload condition, the same move as every "why X not Y under this workload" comparison across the module.
- Effectively-once = idempotency's universal reach (chapter 04) — the exactly-once debate resolves to the same idempotent-write antidote that recurs in payments, queues, and retries; the guarantee lives at the write, not the delivery.
13) Design test
- For each read path, can you name who perceives its freshness and what a slower answer costs them — before defending its streaming machinery?
- Have the most expensive freshness budgets been A/B-measured against a slower, cheaper alternative, or are they set on faith?
- For each contested choice (exactly-once, lambda/kappa, inline/async, shared-space/caption), can you name the workload condition that would flip your answer?
- Is your cost reasoning based on readiness maintained (always-on) rather than work performed, and is every always-on path serving a consumer who perceives its freshness?
- Could any path be moved to a scheduled batch over landed data with no perceptible difference to its consumer — and if so, why hasn't it been?
Where this appears in production
The over-engineering debate and batch's enduring role:
- Many "real-time" platforms: built on Kafka+Flink for dashboards and reports that a nightly job served, the most common over-engineering pattern in the field.
- Scheduled-query reporting (Snowflake / BigQuery / dbt): patient analytics deliberately kept batch because no human waits, the cost-correct default for the low-freshness-value region.
- Stripe / fraud platforms: streaming reserved for the scoring path that blocks a transaction (perceived freshness), batch for the analytics no one waits on.
- Netflix / Uber: streaming for personalization and real-time features, batch for the large reporting and training paths; freshness-value-per-query in practice.
Contested practices, both sides in real systems:
- Flink + transactional sinks: the exactly-once-is-achievable camp, end-to-end with participating sinks.
- Kafka idempotent producers + idempotent consumers: the effectively-once camp, at-least-once plus idempotent writes (chapter 04).
- RisingWave / Materialize: streaming databases collapsing speed + serve, pushing kappa further toward one system.
- Inherited lambda at large enterprises: batch layer kept as an independent audit recomputation, the legitimate surviving lambda.
- ColPali vs caption-and-index pipelines: ColPali embeds page images directly and so sits on the visual-embedding side; caption-and-index is the simpler text-stack alternative to shared multimodal embeddings, chosen when images are simple.
- Async inference side-paths (SageMaker async, batch transform): heavy models pushed off-stream rather than run inline, the async-side-path camp.
- OpenLineage / Iceberg lineage adopters: building provenance in at write time precisely because retrofit is impossible (chapter 07).
Pause and recall
- Why is "who perceives the freshness?" the question that should come before "how fresh can we make it?"
- Draw the over-engineering boundary (freshness-value vs query-rate). Where does streaming earn its cost, and where is batch the honest answer?
- Stress-test the running platform: how many of its six read paths actually justify the streaming pipe, and why were the others riding it?
- Take the exactly-once debate: state both sides and the workload condition that decides it.
- Why does a streaming platform's cost scale with idle readiness rather than work performed, and how does that invert the naive intuition?
- Name two tuned constants in this module that work empirically but lack a theory, and the danger of treating them as laws.
- Which metric reveals over-engineering, and why do green uptime/latency SLOs hide it?
- Why is "real-time is strictly better than batch" the wrong model, and what replaces it?
Interview Q&A
Q1. You inherit a Kafka+Flink platform serving dashboards and a nightly report. How do you tell if it's over-engineered? A. Score each read path by freshness-value (who perceives a faster answer, what a slower one costs them) and query rate. Paths with low perceived value and low/medium query rate — UI-throttled dashboards, once-a-day reports — don't justify always-on machinery; a scheduled batch over landed data would be indistinguishable and far cheaper. Compute cost-per-perceived-second-of-freshness; high cost on imperceptible paths is the over-engineering signal. The green SLO board won't show it because nothing is failing. Common wrong answer to avoid: "It's fine, uptime and latency are green." Green SLOs are consistent with 60% over-build; the waste is readiness nobody perceives, not a failure.
Q2. Is end-to-end exactly-once real or marketing? Defend a position you'd hold under scrutiny. A. Both, with a condition. Exactly-once delivery over a network with side effects is effectively impossible: the model call's GPU-second can't be rolled back. What's achievable is effectively-once: at-least-once delivery plus idempotent or transactional sinks. When frameworks advertise "exactly-once," they mean the end-to-end pipeline with a participating transactional sink; point the same job at a non-transactional sink and the guarantee evaporates. So design for idempotent writes (chapter 04) and reserve transactional exactly-once for sinks that can't dedupe. Common wrong answer to avoid: "Just enable exactly-once in Flink." The guarantee is end-to-end only if the sink participates, and even then it never makes the model call itself exactly-once.
Q3. The copilot's freshness budget is 10 s and it's the most expensive path. How do you know 10 s is right? A. You usually don't — it's a defensible guess, rarely A/B-measured. The honest move: A/B the copilot at 10 s vs 30 s freshness and watch CSAT. A customer mid-conversation has turn-taking slack, so 30-second-fresh retrieval (a cheaper micro-batch, chapter 04) may be imperceptible. If CSAT doesn't move, you've been paying for 10 s on faith. The most expensive freshness budgets are the ones most worth measuring. Common wrong answer to avoid: "10 s is the standard for copilots." There's no law; the perceptible threshold is empirical and product-specific — measure it.
Q4. When is lambda still the right choice in 2026, despite kappa being the default? A. Two cases. When the stream processor can't reprocess history correctly because there's no durable replayable log or lakehouse reaching the needed history — the batch layer is load-bearing. And when a regulator or audit requires an independent batch recomputation as a cross-check on the streaming path — the duplication is the control, not an accident. For greenfield with a durable log and lakehouse, kappa wins; for those two cases, lambda is correct, not legacy guilt. Common wrong answer to avoid: "Lambda is dead, always migrate to kappa." True for greenfield, false for an inherited lambda serving a real reprocessing-authority or audit need.
Q5. Why can a streaming platform cost more than the batch job it replaced while doing less work? A. Because cost scales with readiness maintained, not work performed. The batch job ran 40 minutes and stopped; the streaming platform runs 1,440 minutes a day to be ready — always-on consumers, warm GPU, continuous compaction. Incremental updates may do less total work than a nightly full recompute, yet the bill is higher because you pay for readiness by the second. The trade is worth it only when a reader is waiting often enough to use the readiness. Common wrong answer to avoid: "Streaming costs more because it processes more data." It often processes less; the cost is idle readiness, not work volume.
Q6. (Cumulative) A reviewer says "this should have been a 30-minute batch job." Walk the full-stack argument for whether they're right. A. Locate the consumers. If a human waits on retrieval mid-interaction (the copilot), a 30-minute gap is a wrong answer with a confident tone (chapter 01) — streaming is justified, and chapters 02–05 earn their cost. If the consumers are dashboards (chapter 06's budget says ~minutes), reports, and exports (zero perceived freshness), the reviewer is right: a scheduled batch over the landed lakehouse (chapter 06) serves them indistinguishably for a fraction of the idle-readiness cost. The answer is per-path: streaming for the perceived-freshness paths, batch for the rest. A blanket "it's all real-time" or "it should all be batch" both miss the boundary. Common wrong answer to avoid: "It's real-time, so streaming is correct" — or its inverse. The honest answer scores each path by freshness-value × query-rate; most platforms are correctly streaming for one path and over-built for the others.
Design/debug exercise (10 min)
Step 1 — Modeled example. Over-engineering audit for one path:
Path: daily CSAT report
Who perceives: nobody (read at 6am, by analysts later)
Freshness value: zero — a 24h-old number is identical in usefulness
Query rate: 1/day
Current build: always-on streaming aggregation (~$2k/month readiness)
Verdict: OVER-ENGINEERED — readiness serving 1 query/day
Honest design: scheduled query over landed Iceberg table (ch.06), cents/day
A/B needed?: no — zero perceived freshness, no test required
Step 2 — Your turn. Run the audit on the supervisor live dashboard (refreshes on screen every ~60 s, watched intermittently). Decide: who perceives the freshness and how much (note the 60 s UI throttle), the query rate, whether per-event streaming is justified or a 1-minute micro-batch suffices, and what A/B test (if any) would settle it. Then do the same for the live copilot and state which of the two you'd actually leave on the streaming pipe.
Step 3 — Reproduce from memory. Redraw the section-2 over-engineering boundary (freshness-value vs query-rate, with the streaming-justified corner and the batch region), place all six platform paths on it, and write one sentence connecting the idle-readiness cost model to chapter 06's freshness budgets and one connecting "who perceives it?" to chapter 01's freshness gap.
Operational memory
This chapter explained why a reader who has all seven mechanisms can still over-build: the module taught each one as the right answer to its pressure without marking where the pressure is too weak to justify the machinery. The important idea is freshness-value-per-query — streaming earns its always-on cost only where a consumer perceives the freshness and queries often enough to use the readiness, and most data lives in the low-value region where a scheduled batch over landed data is correct, cheaper, and indistinguishable.
You learned to ask "who perceives the freshness?" before "how fresh can we make it?", to score each read path by freshness-value × query-rate against the over-engineering boundary, to hold both sides of the contested practices (exactly-once, lambda/kappa, inline/async, shared-space/caption) by naming the workload condition that flips each, and to reason about cost as idle readiness maintained rather than work performed. That turns the seven mechanisms from a default you reach for into a toolset you size to the product.
Carry this diagnostic forward: when the bill climbs while every SLO is green, compute cost-per-perceived-second-of-freshness and find the always-on paths serving consumers who can't perceive their output; when defending a freshness budget, ask whether it was A/B-measured or set on faith; when arguing a contested practice, name the workload condition, not the modern winner. Real-time is not better than batch — it is a different tradeoff, right only where a consumer is waiting.
Remember:
- Ask "who perceives the freshness?" before "how fresh can we make it?" — most data has no human waiting.
- Streaming earns its cost only in the high-freshness-value, high-query-rate corner; the rest should be scheduled batch over landed data.
- A streaming platform's cost is idle readiness maintained, not work performed — one query a day pays for a full day of readiness.
- The contested practices (exactly-once, lambda/kappa, inline/async, shared/caption) resolve by naming the workload condition, not by picking the modern side.
- Green uptime and latency SLOs are consistent with being 60% over-built; the over-engineering smell is cost-per-perceived-second-of-freshness, not a failure.
Bridge. This module built and then honestly bounded one end-to-end system: live multimodal interactions ingested, transformed, indexed, served fresh, costed, and governed. In a real design interview these pieces never appear alone; they combine. "Design a support copilot," "design real-time fraud detection over calls and chats," "design a multimodal search platform" all force you to assemble the replay log, in-stream models, fresh cross-modal retrieval, the freshness-vs-cost budget, and the governance story under time pressure, defending each boundary out loud. The next module puts you in that chair: end-to-end system design case studies where this module's streaming and multimodal mechanisms recombine with everything earlier in the curriculum. (Its entry file is 00-eli5.md; the case studies follow.) → ../13_interview_case_studies/00-eli5.md