02. The replay log and the Monday flood: ingestion when consumers fall behind
~22 min read. A three-week backlog of weekend problems hits your support line at 09:00 Monday. The ASR consumers can transcribe 200 calls a minute; 600 are arriving. Where do the other 400 go? Your answer is the whole chapter.
Builds on the freshness gap and the replay log named in 00-first-principles.md, and on the Monday-surge failure from 01-batch-vs-streaming-pressure.md. Chapter 01 proved work must be event-triggered. This file builds the thing the events flow through, and confronts backpressure: what happens when producers outrun consumers.
What chapter 01 settled, and what it left dangerous
Chapter 01 forced two conclusions. First, a copilot is only as fresh as the last write into its index, so work must start on event arrival, not on a clock. Second, the moment you go event-triggered, the Monday flood stops queuing behind a schedule — but it does not vanish. The events still arrive faster than ASR can process them. Chapter 01 waved at "consumer throughput we can provision" and moved on. That wave hid the hardest operational question in the whole module: when 600 calls arrive in a minute and you can process 200, the other 400 events have to go somewhere, and what you do with them decides whether you lose data, lose freshness, or lose the source.
That decision has a name — backpressure — and getting it wrong is how streaming platforms drop customer calls silently and discover it in a CSAT post-mortem three weeks later.
What this file solves
When producers outrun consumers, a streaming platform must either buffer the excess, slow the producers down, or drop events — and a system that picks the wrong one drops customer interactions without anyone noticing. This file shows how a durable, ordered, replayable log decouples bursty producers from slower consumers, how partitions buy parallelism while keeping per-customer order, and how to choose a backpressure strategy on purpose — with the Monday-morning call flood as the concrete stress test.
1) Why a queue is not enough: the need for a log
The first instinct for decoupling producers from consumers is a message queue: producer pushes, consumer pops, the queue absorbs the difference. That works until you hit two requirements a copilot platform actually has.
First, replay. You ship an embedding bug on Tuesday; every screenshot embedded since Monday is wrong. With a queue, those messages were popped and gone — you cannot reprocess them because the queue deleted each message on consumption. You would have to re-fetch from the source systems, if they even kept the data. Second, multiple independent consumers. The audio stream feeds the transcription pipeline and a separate quality-monitoring system and a compliance archiver. A queue hands each message to one consumer; fan-out means either duplicating the queue or coupling the consumers.
So the real need is not "a buffer between producer and consumer." It is a durable, ordered, replayable record of every event that many consumers can read independently, each tracking its own position. That is a log.
QUEUE (pop = gone)                      LOG (read = position moves, data stays)
producer ─▶ [m3 m2 m1] ─▶ consumer      producer ─▶ append-only log
consume m1 → deleted                                [e1 e2 e3 e4 e5 e6 ...]
                                                        ▲        ▲
                                        ASR consumer offset=2    compliance offset=5
                                        (each reads independently, data retained)
The log is an append-only file, partitioned for parallelism, retained for a configured window (hours to days, or tiered to object storage for longer). Consumers do not consume in the destructive sense; they advance an offset — a bookmark. Replay is "reset the offset and read again." Fan-out is "each consumer keeps its own offset." Kafka, Amazon Kinesis Data Streams, and Apache Pulsar are all log-structured systems built on exactly this primitive. (Kafka 4.x is now KRaft-only — ZooKeeper is gone — and added Share Groups for queue-like competing consumers when you don't need per-partition order; we will use the classic partitioned-log model here because per-customer order is exactly what the copilot needs.)
Why this rule exists. A queue couples "delivered" with "deleted," which destroys your ability to reprocess and to fan out. A log separates delivery position (the offset, per consumer) from retention (how long the data stays). That separation is what makes a stream both replayable and multi-consumer — the two things a multimodal copilot platform cannot live without.
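A minimal in-memory sketch may make the offset idea concrete. The `Log` class and its `poll` and `seek` methods are illustrative names invented for this chapter, not the API of Kafka, Kinesis, or Pulsar, and the sketch has no partitions, durability, or retention.

```python
from collections import deque

# Queue: reading is destructive. Once m1 is popped, no other reader can see it.
queue = deque(["m1", "m2", "m3"])
first = queue.popleft()
queue_remaining = list(queue)  # ['m2', 'm3']


class Log:
    """Append-only record; each consumer keeps its own offset (a bookmark)."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def poll(self, consumer, max_events=1):
        """Return the next events for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Replay is just moving the bookmark back; the data never left."""
        self._offsets[consumer] = offset


log = Log()
for event in ["e1", "e2", "e3"]:
    log.append(event)

asr_first = log.poll("asr", max_events=2)            # ['e1', 'e2']
archive_all = log.poll("compliance", max_events=3)   # ['e1', 'e2', 'e3']
log.seek("asr", 0)                                   # replay after a bug fix
asr_replayed = log.poll("asr", max_events=3)         # ['e1', 'e2', 'e3']
```

The two consumers never interfere with each other, and the replay costs nothing more than resetting one integer. That is the separation of delivery position from retention described above.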
2) Picture: the log as the platform's spine
This is the core mental model for the entire module. Memorize this one in ASCII.
PRODUCERS                                     THE LOG (partitioned)                     CONSUMERS
                                                                                        (own offsets)
voice gateway  ──┐                            topic: interactions
chat widget    ──┤── ingest (key=customer_id) ┌─ partition 0 ─ e e e e ───────┐         ┌─▶ ASR pipeline
upload service ──┘                            ├─ partition 1 ─ e e e ─────────┤──fan──▶ ├─▶ image embedder
                                              ├─ partition 2 ─ e e e e e ─────┤  out    ├─▶ text indexer
                                              └─ partition 3 ─ e e ───────────┘         └─▶ compliance archive
                                                      ▲                                 each tracks its own
                                  same customer_id ───┘ always lands on the same        offset; replay = reset
                                                        partition
                                                        → per-customer order preserved

BURST ▶ arrivals spike ──▶ log absorbs (retained, durable) ──▶ consumers drain at their own rate
        the log is the shock absorber; freshness gap = how far behind the offsets fall
Three things this picture encodes. The key (customer_id) decides the partition, so all of one customer's events — call, chat, screenshot — land on the same partition in arrival order. Partitions are the unit of parallelism: four partitions means up to four ASR workers reading in parallel. And the offset gap — how far a consumer's offset trails the log head — is the freshness gap from chapter 01, now directly observable as a number called consumer lag.
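The key-to-partition mapping can be sketched in a few lines. This is an illustration under stated assumptions: `partition_for` is a hypothetical helper, and CRC32 stands in for whatever hash the log system uses (Kafka's default partitioner, for instance, hashes the key bytes with murmur2). Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import zlib

NUM_PARTITIONS = 4  # matches the four partitions in the picture


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions


events = [
    {"type": "chat", "customer": "88213"},
    {"type": "image", "customer": "88213"},
    {"type": "audio", "customer": "88213"},
    {"type": "chat", "customer": "10007"},
]

# Append each event to the partition chosen by its key, in arrival order.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for event in events:
    partitions[partition_for(event["customer"])].append(event)

home = partition_for("88213")
types_in_order = [e["type"] for e in partitions[home] if e["customer"] == "88213"]
```

Whichever partition `home` turns out to be, all three of customer 88213's events sit on it in arrival order, which is the per-customer ordering guarantee the picture encodes.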
3) The running example: three channels onto one log
Recall the platform: ~12k calls/day (audio), ~80k chat messages/day (text), ~9k screenshots/day (images). Put all three onto a log topic interactions, keyed by customer_id, so the copilot can read one customer's full cross-channel story in order.
14:32:01 produce {type:chat, customer:88213, body:"payment failed again"} → partition 1, offset 90412
14:32:03 produce {type:image, customer:88213, s3:"s3://raw/img/88213/..."} → partition 1, offset 90413
14:32:05 produce {type:audio, customer:88213, s3:"s3://raw/audio/88213/..."} → partition 1, offset 90414
All three on partition 1 (same customer), in order. The chat indexer reads offset 90412 and indexes it in ~2 seconds. The image embedder reads 90413. The ASR pipeline reads 90414 and starts transcribing. Each consumer advances independently. The copilot, querying at 14:32:10, can already see the chat; the transcript lands a couple of minutes later — exactly the modality freshness skew from chapter 01, now visible as different offsets advancing at different rates.
Notice what the key buys and costs. Buys: per-customer ordering, so the copilot never sees the screenshot before the chat that referenced it. Costs: if one customer (a power user, or a bot loop) floods a single partition, that partition becomes a hotspot — one partition's worth of throughput, no matter how many you have. We will hit that pathology in section 6.
4) Rule: the log absorbs bursts; backpressure decides what happens when it can't keep up downstream
The chapter's invariant: a durable log decouples producer rate from consumer rate, so a burst is absorbed as retained data and felt as growing consumer lag — until the lag exceeds what freshness or retention allows, at which point you must buffer more, slow producers, or drop. Backpressure is not a feature you turn on. It is the policy for what happens at that breaking point, and it is a choice you make on purpose or suffer by accident.
The flow of the pressure, end to end:
producers fast ─▶ log buffers (cheap, durable) ─▶ consumers slow ─▶ lag grows
                                                                        │
                                                    ┌───────────────────┼───────────────────┐
                                                    ▼                   ▼                   ▼
                                               BUFFER MORE        SLOW PRODUCERS          DROP
                                            (raise retention,   (reject / throttle    (shed load:
                                             scale storage)      at the gateway)       sample, TTL)
Teacher voice. The log buys you time, not infinity. People hear "Kafka decouples producers and consumers" and assume the problem is solved forever. It is not — it is deferred. The lag grows silently inside the log, and if you never look at the offset gap, the first signal you get is data falling off the retention window unprocessed: silent loss. The log turns a loud, immediate failure (queue full, requests rejected) into a quiet, delayed one. That trade is good only if you watch the lag.
5) The Monday flood through the three backpressure choices
Stress test: Monday 09:00. Three weeks of weekend problems. Calls arrive at 600/min. ASR consumers process 200/min. The log absorbs the difference at 400/min of growing backlog. Retention is 24 hours. What do you do?
Attempt A: buffer more (raise retention, hope it drains)
The log can hold the backlog: at 400/min surplus over a 2-hour surge, that is ~48k extra calls buffered. If retention is 24 h and the surge ends by 11:00, consumers drain the backlog over the afternoon. By 16:00 lag is back to seconds.
Helps: zero data loss, zero producer impact, no code change — just provision enough retention and storage.
Hurts: freshness collapses during the surge. If consumers drain in arrival order, a call at 09:30 sits in the log for about an hour before ASR reaches it, and a call arriving just before 11:00 waits close to four hours. The copilot on the 09:30 call sees no transcript, the very failure chapter 01 was about, precisely when load is highest. Buffering trades freshness for completeness.
Use when: completeness matters more than freshness for that modality, and surges are bounded. Compliance archival: buffer freely, nobody needs it fresh.
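The arithmetic behind Attempt A can be checked with a short calculation. The surge rates and duration come from the stress test; the post-surge arrival rate of 8 calls/min is an assumption (roughly 12k calls/day spread evenly), and the wait formula assumes consumers drain strictly in arrival order with an empty log at 09:00.

```python
ARRIVAL_PER_MIN = 600     # surge arrival rate (from the stress test)
SERVICE_PER_MIN = 200     # ASR throughput (from the stress test)
SURGE_MINUTES = 120       # 09:00 to 11:00
POST_SURGE_ARRIVAL = 8    # assumption: ~12k calls/day spread evenly


def backlog_at(minute: float) -> float:
    """Calls waiting in the log `minute` minutes after 09:00, up to surge end."""
    return (ARRIVAL_PER_MIN - SERVICE_PER_MIN) * min(minute, SURGE_MINUTES)


def fifo_wait_minutes(arrival_minute: float) -> float:
    """Wait for a call arriving during the surge, if ASR drains in order."""
    position = ARRIVAL_PER_MIN * arrival_minute   # calls that arrived before it
    served_at = position / SERVICE_PER_MIN        # minute at which ASR reaches it
    return served_at - arrival_minute


peak_backlog = backlog_at(SURGE_MINUTES)                                 # 48000
drain_minutes = peak_backlog / (SERVICE_PER_MIN - POST_SURGE_ARRIVAL)    # 250.0
wait_0930 = fifo_wait_minutes(30)                                        # 60.0
```

Under these assumptions the backlog clears about 250 minutes after 11:00, which is consistent with lag returning to seconds by 16:00, and the 09:30 call waits 60 minutes for its transcript.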
Attempt B: slow the producers (throttle / reject at the gateway)
Reject or throttle inbound events when lag crosses a threshold. The voice gateway gets a 429 and must retry or queue client-side.
Helps: lag stays bounded; the data that is accepted stays fresh.
Hurts: you cannot throttle a customer who is on a live call. The audio is happening in real time; "please call back later" is not a backpressure option for voice. Slowing producers works for pull-based or retryable sources (a batch uploader, a webhook with retry), not for live human interaction.
Use when: the producer can tolerate being slowed — internal batch feeds, retryable webhooks. Almost never for live customer audio.
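A sketch of lag-based admission at the gateway, assuming the gateway can see the current consumer lag. The `Source` type, the `admit` function, and the 300-second threshold are hypothetical; the only detail taken from the text is that a throttled producer receives a 429.

```python
from dataclasses import dataclass

LAG_THRESHOLD_SECONDS = 300  # assumption: start throttling past 5 minutes of lag


@dataclass
class Source:
    name: str
    retryable: bool  # can this producer hold the event and retry later?


def admit(source: Source, consumer_lag_seconds: float) -> int:
    """Return an HTTP-style status for one inbound event."""
    if consumer_lag_seconds <= LAG_THRESHOLD_SECONDS:
        return 202  # accepted for processing
    if source.retryable:
        return 429  # too many requests: caller should back off and retry
    return 202      # live traffic is never rejected; it needs a different lever


batch_status = admit(Source("batch-uploader", retryable=True), 600)   # 429
voice_status = admit(Source("voice-gateway", retryable=False), 600)   # 202
```

The `retryable` flag is the whole design decision: throttling is only safe for sources that can wait, so live audio is always admitted and handled by buffering or shedding instead.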
Attempt C: drop on purpose (load-shedding)
Decide explicitly what to drop and make it cheap to recover. For the surge: still ingest all raw audio to object storage (cheap, durable), but during the surge, ASR only processes calls flagged high-priority (escalations, VIP tier); the rest get transcribed later from the retained raw audio when the surge clears.
Helps: bounded lag for what matters, freshness preserved for priority traffic, no raw data lost (it is in object storage), backfill the rest off-peak.
Hurts: needs a priority signal and a backfill path; non-priority calls have a large freshness gap during the surge (transcribed at 14:00 instead of 09:30).
Use when: load is genuinely above sustainable capacity and not all events are equally urgent — which is the realistic Monday case.
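One way to express priority shedding is as a routing decision made per call. This is a sketch only: `route_call`, the `vip` and `escalation` flags, and the object-storage paths are hypothetical, and the five-minute lag threshold is an assumed definition of "during the surge". It presumes the raw audio has already landed in object storage.

```python
SURGE_LAG_SECONDS = 300  # assumption: "surge" means ASR lag above 5 minutes


def route_call(call: dict, asr_lag_seconds: float, live: list, backfill: list) -> None:
    """Send a call to live ASR or defer it; the raw audio is already stored."""
    in_surge = asr_lag_seconds > SURGE_LAG_SECONDS
    is_priority = call.get("vip", False) or call.get("escalation", False)
    if in_surge and not is_priority:
        backfill.append(call["s3"])  # transcribe later from retained raw audio
    else:
        live.append(call["s3"])      # transcribe now


live_queue, backfill_queue = [], []
calls = [
    {"s3": "s3://raw/audio/a", "vip": True},
    {"s3": "s3://raw/audio/b"},
    {"s3": "s3://raw/audio/c", "escalation": True},
]
for call in calls:
    route_call(call, asr_lag_seconds=420, live=live_queue, backfill=backfill_queue)
# live_queue     -> ['s3://raw/audio/a', 's3://raw/audio/c']
# backfill_queue -> ['s3://raw/audio/b']
```

Nothing is discarded here: the deferred call is only a pointer to audio that still exists, so "drop" means dropping processing for now, not dropping data.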
The honest answer
The realistic production answer is C layered on A: raw audio always lands in object storage (never dropped — chapter 03), the log absorbs the burst (A), and ASR processing is prioritized during the surge (C) with off-peak backfill. Producer throttling (B) is reserved for the few retryable sources. The skill is naming, per modality, which lever you pull — not assuming the log "handles it."
Mini-FAQ. "Why not just auto-scale the ASR consumers to 600/min for the surge?" You can, and you should up to a point — but ASR runs on GPUs, scaling them takes minutes (cold start, model load), the surge may be over before they warm, and the steady-state cost of provisioning for peak is brutal when peak is 3× the average for two hours a week. Auto-scaling is part of the answer; it does not remove the need for a deliberate backpressure policy for the minutes before capacity arrives.
6) The property that changes the design: partition count and key choice
Two knobs dominate ingestion behavior: how many partitions, and what key you partition by. They trade ordering against parallelism, and getting the key wrong creates a hotspot no amount of scaling fixes.
KEY = customer_id (our choice)             KEY = none / round-robin
+ per-customer order preserved             + perfect load balance across partitions
+ copilot reads one story in order         − NO ordering: chat may index after the
− a flooding customer = hot partition        screenshot that referenced it
  (one partition's max throughput)         − copilot may retrieve out-of-order context
Ordering in a log is per-partition only — this is true for Kafka, Kinesis (per-shard), and Pulsar alike. There is no global order across partitions, and you do not want one: global order means one partition, which means one consumer, which means no parallelism. So you buy the minimum ordering you need — per-customer — and accept that two different customers' events have no defined relative order, which is fine because the copilot only ever reasons within one customer's story.
Partition count sets the ceiling on consumer parallelism: N partitions → at most N consumers in a group reading that topic in parallel. Under-partition and you cannot scale ASR consumers past the partition count when the Monday flood hits. Over-partition and you pay coordination overhead and tiny files downstream. Kafka 4.x clusters can run far more partitions per broker than the old ~4k rule of thumb, but the discipline is the same: partition for your peak consumer parallelism plus headroom, not your average.
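Sizing for peak rather than average can be written as a small calculation. The per-worker throughput of 50 calls/min and the 1.5× headroom factor are assumptions chosen for illustration (50/min makes today's 200/min equal to four workers); `partitions_needed` is a hypothetical helper.

```python
import math


def partitions_needed(peak_per_min: float, per_consumer_per_min: float,
                      headroom: float = 1.5) -> int:
    """Smallest partition count that lets consumers match peak, plus headroom."""
    consumers_at_peak = math.ceil(peak_per_min / per_consumer_per_min)
    return math.ceil(consumers_at_peak * headroom)


# Assumption: one ASR worker sustains 50 calls/min.
needed_for_peak = partitions_needed(peak_per_min=600, per_consumer_per_min=50)    # 18
needed_for_today = partitions_needed(200, 50, headroom=1.0)                       # 4
```

With only four partitions, the group can never run more than four ASR workers in parallel, so no amount of auto-scaling reaches 600/min. Partitioning for the peak is what keeps that option open.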
The hotspot pathology: if customer 88213 is actually a misbehaving integration emitting 10k events/min, every one lands on partition 1 (same key), and partition 1's single-consumer throughput becomes the bottleneck while partitions 0, 2, 3 sit idle. The fix is not more partitions — it is a composite key (customer_id + session_id) or detecting and quarantining the abusive producer. Same shape, different layer: this is head-of-line blocking, the same geometry as a single slow request stalling a connection pool.
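A small simulation shows why the key, not the partition count, decides the hotspot. The traffic mix is invented for illustration (one noisy integration cycling through 50 sessions, plus 300 ordinary customers), and CRC32 is again a stand-in for the real partitioner's hash.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4


def partition_for(key: str) -> int:
    """Stable key-to-partition mapping (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical minute of traffic: (customer_id, session_id) pairs.
noisy = [("88213", f"session-{i % 50}") for i in range(10_000)]
normal = [(str(10_000 + i), "session-0") for i in range(300)]
events = noisy + normal

# Events per partition under each keying scheme.
by_customer = Counter(partition_for(cust) for cust, _ in events)
by_composite = Counter(partition_for(f"{cust}:{sess}") for cust, sess in events)

hottest_share_customer = max(by_customer.values()) / len(events)
hottest_share_composite = max(by_composite.values()) / len(events)
```

Keyed by `customer_id` alone, one partition carries at least 10,000 of the 10,300 events however many partitions exist. The composite key spreads the noisy producer across partitions, at the price that ordering is now guaranteed only within a session, not across all of one customer's sessions.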
7) Cost and behavior table: three log-structured systems under this workload
Order-of-magnitude guidance for the running platform (~100k events/day, peaks 5×). Verify against current pricing for your region and provider.
| Dimension | Kafka (self-managed / MSK) | Amazon Kinesis Data Streams | Apache Pulsar |
|---|---|---|---|
| Ordering unit | per partition | per shard | per partition |
| Scaling knob | add partitions / brokers | shards, or on-demand auto-scale | partitions; brokers and storage (BookKeeper) scale separately |
| Replay | offset reset, retention-bound | shard iterator, up to 365-day retention | cursor reset (seek); tiered storage to object store for long retention |
| Multi-tenancy | topic ACLs, manual | account/stream level | built-in namespaces/tenants |
| Best fit here | high throughput, full control, ecosystem | low-ops on AWS, bursty load (on-demand) | many tenants, long retention, geo-replication |
| Watch out for | you operate it (KRaft simplifies, doesn't remove ops) | per-shard limits; on-demand cost under sustained peak | more components (brokers + BookKeeper) to reason about |
| Rough cost driver | broker instances + storage | shard-hours or throughput | broker + bookie nodes + storage |
The decision is rarely about raw throughput: all three handle 100k/day trivially. It is about who operates it and how long you must retain for replay. Need long replay windows for reprocessing embeddings? Pulsar's tiered storage offloads old data to cheap object storage, and Kinesis's extended retention keeps it readable for up to 365 days. Want minimum ops on AWS with spiky load? Kinesis on-demand. Want maximum control and the broadest ecosystem (Flink, Spark connectors)? Kafka.
8) Operational signals: watching the offset gap
The single number that governs an ingestion layer is consumer lag: the offset gap between the log head and each consumer group's position, measured per partition.
- Healthy: lag oscillates near zero (consumers keep up); end-to-end lag flat; no partition skew (all partitions drain evenly).
- First metric to degrade: consumer lag on the ASR group climbs during the Monday surge — minutes before any stale-answer complaint. Lag is the leading indicator; the lagging indicator is a bad copilot answer.
- Misleading metric people watch: producer throughput / ingest rate. High ingest looks healthy ("we're processing tons of events!") while consumers silently fall behind. Ingest rate measures the producers; lag measures the gap. Watch the gap.
- First graph an expert opens: consumer lag per partition, overlaid with arrival rate. They look for (a) lag diverging from zero, signaling under-provisioned consumers, and (b) one partition's lag far above the others, signaling a hot key / head-of-line block. Even drain = healthy; one hot line = key problem.
A second expert check: data approaching the retention edge unprocessed. If lag on a partition implies events will age out before consumers reach them, that is imminent silent loss — page someone.
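The two expert checks above can be sketched as one function over per-partition status. The `PartitionStatus` fields, the thresholds (80% of retention, 10× the median lag), and the alert wording are all assumptions for illustration; in practice the offsets would come from the log system's own lag metrics.

```python
from dataclasses import dataclass


@dataclass
class PartitionStatus:
    partition: int
    head_offset: int            # latest offset written to the log
    group_offset: int           # the consumer group's position
    oldest_unread_age_s: float  # age of the oldest event not yet processed


def lag(p: PartitionStatus) -> int:
    return p.head_offset - p.group_offset


def diagnose(parts, retention_s=24 * 3600, edge_fraction=0.8, skew_ratio=10):
    """Return alert strings; every threshold here is an illustrative assumption."""
    alerts = []
    lags = sorted(lag(p) for p in parts)
    median = lags[len(lags) // 2]
    for p in parts:
        if p.oldest_unread_age_s > edge_fraction * retention_s:
            alerts.append(f"partition {p.partition}: nearing retention edge")
        if lag(p) > skew_ratio * max(median, 1):
            alerts.append(f"partition {p.partition}: hot (lag {lag(p)})")
    return alerts


parts = [
    PartitionStatus(0, 1_000, 990, 5),
    PartitionStatus(1, 90_000, 40_000, 75_000),
    PartitionStatus(2, 1_200, 1_195, 3),
    PartitionStatus(3, 800, 790, 4),
]
alerts = diagnose(parts)
```

Here partition 1 raises both alerts: its lag is far above the median (a hot key or head-of-line block), and its oldest unread event is older than 80% of the 24-hour retention window, which is the imminent-silent-loss case worth a page.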
9) Boundary: where the log helps unusually well, and where it hurts
- Strong fit: bursty, multi-consumer, replay-needed workloads — exactly the copilot platform. The log turns a flood into bounded lag and lets four consumers fan out from one stream.
- Pathological: a single producer hammering one key (hot partition), or using a log where a simple synchronous call would do. If one event has exactly one consumer and never needs replay, a log adds operational weight for no benefit — a direct call or a plain queue is simpler.
- Scale/workload limit that breaks intuition: the log's decoupling makes failure quiet. At low volume this is fine. At high volume with under-provisioned consumers, lag grows invisibly and data ages out of retention unprocessed — a silent-loss failure that looks healthy on every producer-side dashboard. The intuition "the log handles backpressure" is the trap: the log defers backpressure into lag, and lag is loss if you don't watch it and don't have a shed-load policy.
10) Wrong model to drop: "the log gives us infinite buffering"
The seductive idea is that putting events on a durable log means you never lose data and never have to think about consumer speed — the log "handles it." It feels true because the log does absorb bursts and does survive crashes. The correct model: the log converts an immediate, visible overload into a delayed, invisible one — growing lag that becomes silent data loss at the retention edge. Buffering is finite and time-bounded. You still must choose a backpressure policy (buffer more, slow producers, or shed load) and you must watch the offset gap, or the log's quietness hides the failure until a customer escalates.
11) Other ingestion failure shapes
- Hot partition — one key takes disproportionate traffic; that partition's single-consumer throughput caps the whole stream while others idle.
- Rebalance storm — consumers join/leave and the group reshuffles partition assignments repeatedly, pausing processing; KIP-848's redesigned protocol reduces this in Kafka 4.x but does not eliminate it under churn.
- Poison message — one malformed event a consumer cannot deserialize stalls the partition (head-of-line block) as it retries forever; needs a dead-letter path.
- Silent retention loss — lag exceeds retention; unprocessed events age out and vanish with no error.
- Duplicate delivery — at-least-once redelivery on consumer crash produces duplicate events; downstream must dedupe (chapter 04).
- Out-of-order across partitions: a customer's partition key (their customer_id) changes mid-session, so their events split across partitions, breaking per-customer order.
- Backfill stampede — replaying a large offset range to fix a bug floods the same consumers that serve live traffic, blowing live freshness; needs a separate replay consumer group.
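Of the shapes above, the poison message has the most mechanical fix: bound the retries and park the record on a dead-letter path. The sketch below assumes JSON-encoded events and an in-memory list as the dead-letter store; `consume`, `MAX_ATTEMPTS`, and the record layout are hypothetical.

```python
import json

MAX_ATTEMPTS = 3  # assumption: give up on a record after three tries


def consume(records, handle, dead_letters):
    """Process records in order; park unprocessable ones instead of stalling."""
    processed = 0
    for offset, raw in records:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handle(json.loads(raw))
                processed += 1
                break
            except (json.JSONDecodeError, KeyError) as exc:
                # A real consumer would retry only errors that might be transient.
                if attempt == MAX_ATTEMPTS:
                    dead_letters.append(
                        {"offset": offset, "raw": raw, "error": repr(exc)}
                    )
    return processed


records = [
    (0, '{"customer": "88213"}'),
    (1, "{not json"),                 # the poison message
    (2, '{"customer": "10007"}'),
]
seen, dlq = [], []
count = consume(records, lambda event: seen.append(event["customer"]), dlq)
```

The malformed record at offset 1 ends up in `dlq` with its error, and the record behind it is still processed, so one bad event no longer blocks the partition.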
12) Pattern transfer
- Backpressure as a recurring constraint — the producer-outruns-consumer geometry here is the same one a thread pool, a TCP window, or a Flink operator chain faces; the levers (buffer, slow source, drop) are universal, only the layer changes.
- Head-of-line blocking (chapter 04, and any pipeline) — the hot partition and the poison message are both head-of-line blocks: one slow/bad item stalls everything behind it on the same ordered channel.
- Replay = lineage root (chapter 07) — the log's retention is what makes lineage and reprocessing possible; "fix the embedding bug and replay" depends on the log still holding the raw events.
- Freshness gap (chapter 01) — consumer lag is the freshness gap made observable; chapter 01's "how stale" becomes this chapter's "how far behind the offset."
13) Design test
- Does every event have a clear partition key that gives the minimum ordering you need (usually per-entity, not global)?
- For each modality, have you named the backpressure policy — buffer, slow, or drop — before the surge, not during it?
- Is raw data durably stored before any lossy processing decision, so "drop" means drop processing, not drop data?
- Do you alarm on consumer lag and on data approaching the retention edge — not just on producer throughput?
- Is there a separate consumer group for replay/backfill so reprocessing doesn't starve live traffic?
Where this appears in production
- LinkedIn — originated Kafka to ingest activity and metrics streams; the canonical log-as-spine deployment fanning one stream to many consumers.
- Uber — runs one of the largest Kafka footprints, ingesting trip, location, and pricing events with per-entity partitioning for ordered processing.
- Netflix Keystone — ingests playback and interaction events at massive scale, fanning out to real-time and batch consumers from one durable log.
- Stripe — event ingestion where per-account ordering matters for correct ledger and fraud processing.
- Robinhood — Kafka-based market-data and order ingestion where per-symbol ordering and replay are non-negotiable.
- Datadog — ingests metric and trace streams; backpressure and shedding policies protect the platform during customer-side spikes.
- Amazon Kinesis (AWS service) — per-shard ordered ingestion with on-demand scaling, the low-ops choice for bursty AWS-native pipelines.
- Apache Pulsar at Yahoo / Tencent — multi-tenant log with tiered storage for long retention and geo-replication across data centers.
- Cloudflare — high-volume event ingestion with deliberate load-shedding under attack-scale spikes.
- DoorDash — order and courier event streams keyed for per-order ordering, feeding live dispatch consumers.
- Confluent Cloud customers — managed Kafka where consumer-lag dashboards are the primary operational signal.
- Shopify — ingests storefront and checkout events, keyed per shop, fanning to fraud, analytics, and fulfillment consumers.
- Slack — message and presence event ingestion where per-channel ordering keeps conversations coherent downstream.
- PayPal — transaction event streams with strict per-account ordering and replay for dispute reprocessing.
- Pinterest — ingests pin and engagement events, partitioned for parallel real-time ranking consumers.
Pause and recall
- Why is a log, not a queue, the right ingestion primitive for a copilot platform? Name the two requirements a queue fails.
- What does the offset represent, and how does it relate to the freshness gap?
- Ordering in a log is guaranteed at what granularity? Why don't you want global ordering?
- Name the three backpressure levers. Which one cannot be applied to a live customer call, and why?
- What is the realistic layered answer to the Monday flood?
- What is a hot partition, and why does adding partitions not fix it?
- Which metric is the leading indicator of trouble, and which producer-side metric misleads people?
- Why is "the log gives infinite buffering" a dangerous mental model?
Interview Q&A
Q1. Why use a partitioned log instead of a managed message queue (e.g., SQS) for ingesting support interactions? A. The copilot needs replay (reprocess after an embedding bug) and fan-out (ASR, quality monitoring, compliance all read the same stream), plus per-customer ordering. A queue couples delivery with deletion, so it can't replay, and hands each message to one consumer, so fan-out means duplication. A log separates per-consumer offset from retention, giving replay, independent fan-out, and per-partition order. Common wrong answer to avoid: "A log is just a more scalable queue." The defining difference is non-destructive reads (offsets) and retention, not scale.
Q2. Calls arrive at 600/min, ASR processes 200/min. Walk me through your backpressure decision. A. Raw audio always lands in object storage first, so no data is lost. The log absorbs the burst as bounded lag. During the surge, ASR prioritizes high-value calls (escalations, VIP) and backfills the rest off-peak from retained raw audio. Producer throttling isn't an option for live calls. So: shed processing by priority, never shed data, and backfill from durable storage. Common wrong answer to avoid: "Auto-scale ASR to 600/min." GPU scaling takes minutes, the surge may end first, and provisioning for 3×-peak steady-state is wasteful; auto-scaling helps but doesn't replace a deliberate shed policy.
Q3. Why partition by customer_id and not round-robin for maximum balance? A. The copilot reasons within one customer's cross-channel story, so it needs the chat, screenshot, and call to arrive in order — which requires same-key-same-partition. Round-robin gives perfect balance but zero ordering, so the copilot could retrieve a screenshot before the chat that referenced it. You buy the minimum ordering (per-customer) and accept hot-partition risk for power users. Common wrong answer to avoid: "Round-robin, because balance is always better." Balance without ordering corrupts the copilot's view of a single conversation.
Q4. Your producer-throughput dashboard is green but customers complain the copilot is stale. Where do you look? A. Consumer lag per partition. Producer throughput measures intake, not the gap; consumers can fall behind while ingest looks healthy. Check whether lag is climbing uniformly (under-provisioned consumers) or on one partition (hot key / head-of-line block), and whether any partition's lag implies data will age out of retention unprocessed. Common wrong answer to avoid: "Ingest is fine so the pipeline is fine." Healthy ingest with growing lag is exactly the silent-loss setup.
Q5. One partition's lag is 100× the others. What's happening and how do you fix it? A. A hot key — one customer_id (often a misbehaving integration) floods one partition, and that partition's single-consumer throughput caps the stream while others idle. Adding partitions doesn't help because the key still maps to one partition. Fix with a composite key (customer_id+session_id) to spread the load, or detect and quarantine the abusive producer. Common wrong answer to avoid: "Add more partitions." Same key → same partition; more partitions leaves the hotspot exactly where it was.
Q6. (Cumulative) The copilot missed a chat from 20 minutes ago during Monday's surge. Is this a chapter-01 freshness problem, a chapter-02 backpressure problem, or a chapter-05 indexing problem? A. Locate the lag. If the chat is in the log but the indexer's offset trails it by 20 minutes → backpressure: consumers fell behind the surge (chapter 02). If the chat was processed but never written to the vector index → indexing (chapter 05). If it never reached the log → ingestion. Chapter 01 gave the frame (measure end-to-end lag per modality); chapter 02 says: check consumer lag per partition first, because surges hit there first. Common wrong answer to avoid: "It's the index." During a surge the most likely culprit is consumer lag, not a write failure; check the offset gap before the index.
Design/debug exercise (10 min)
Step 1 — Modeled example. Backpressure plan for the audio modality:
Modality:     audio (calls)
Raw landing:  ALL audio → object storage immediately (never dropped)
Log:          keyed by customer_id, retention 24 h
Normal load:  ASR consumers keep up, lag ~seconds, transcript fresh ~90 s
Surge policy: if ASR lag > 5 min → process priority calls (VIP/escalation) live,
              flag the rest for off-peak backfill from retained raw audio
Alarm:        page if any partition's lag implies aging out before processing
Step 2 — Your turn. Write the equivalent backpressure plan for the chat modality (80k/day, very bursty). Decide: what is raw landing, what retention, can you slow this producer, and what's your surge policy? (Hint: chat is cheap to process, so the lever is different from audio.)
Step 3 — Reproduce from memory. Redraw the core mental-model diagram from section 2 — producers → keyed ingest → partitioned log → fan-out consumers with independent offsets — and label where the freshness gap shows up (the offset gap) and where a hot partition would appear. Connect it to chapter 01 by explaining how consumer lag is the freshness gap made observable.
Operational memory
This chapter explained what happens to events when a Monday flood makes producers outrun consumers, and how a durable, replayable log turns that flood into bounded, observable lag instead of immediate failure or silent loss. The important idea is that the log decouples producer rate from consumer rate and separates per-consumer offset from retention — it does not give infinite buffering; it defers overload into lag that you must watch.
You learned to put all modalities onto one keyed log, partition by customer_id to buy per-customer ordering at the cost of hot-partition risk, and choose a backpressure policy per modality before the surge — buffer, slow the producer, or shed processing while never dropping raw data. That solves the Monday-flood failure because raw audio lands durably, the log absorbs the burst, and ASR prioritizes what matters with off-peak backfill from retained data.
Carry this diagnostic forward: when the copilot is stale during load, look at consumer lag per partition, not producer throughput. Uniform lag means under-provisioned consumers; one hot line means a key problem; lag near the retention edge means imminent silent loss. The log is a shock absorber with a finite travel distance — instrument the offset gap or the failure stays invisible until a customer escalates.
Remember:
- A log beats a queue because reads are non-destructive (offsets) and retained — enabling replay and multi-consumer fan-out.
- Ordering is per-partition only; partition by the minimum entity you need ordered (per-customer), never globally.
- Backpressure has three levers — buffer, slow producers, drop; you can't slow a live call, so shed processing by priority and never drop raw data.
- Consumer lag is the freshness gap made observable; watch it and the retention edge, not producer throughput.
- The log defers overload into lag; lag is loss at the retention edge — "infinite buffering" is the trap.
Bridge. We can now ingest a multimodal flood into a durable, ordered log and survive the surge. But the log is a transit layer, not a home — it retains for hours, not forever, and it holds events, not the 6-minute audio file or the 4 MB screenshot those events point to. Before we can transcribe or embed anything, we need a place to put the raw audio, video, and images cheaply and durably, alongside the transcripts and embeddings we derive from them. The next file builds that storage layer and confronts where raw bytes live versus where queryable artifacts live. → 03-multimodal-unstructured-storage.md