00. Streaming Multimodal Data Platform: First-Principles Overview
A support copilot is only as smart as the freshest interaction it can see. This module builds the pipe that turns live calls, chats, and screenshots into something a model can retrieve within seconds — and pays the bill for that freshness honestly.
A customer calls your support line at 14:32. She is angry: the same payment failed three times, she already chatted with a bot an hour ago, and she uploaded a screenshot of the error. The copilot that picks up the call needs all of that — the chat transcript, the screenshot's contents, the words she is speaking right now — to avoid asking her to repeat herself. If the platform feeding that copilot runs on a nightly batch job, the chat from an hour ago is invisible until tomorrow morning, the screenshot is an unindexed blob in a bucket, and the live audio is not text yet. The copilot answers as if it knows nothing. She escalates. The CSAT score drops, and nobody can explain why the "AI" feels dumber than a human who simply scrolled up.
That gap — between when data arrives and when a model can use it — is the single pressure this module fights. Call it the freshness gap. For a dashboard, an 18-hour freshness gap is fine. For a copilot that must ground its answer in what happened ninety seconds ago, an 18-hour gap is a wrong answer with a confident tone. The whole architecture of a streaming multimodal platform is a set of mechanisms that shrink the freshness gap without letting the compute bill, the storage bill, or the operational burden explode.
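The freshness gap can be written down as a subtraction between two timestamps. The sketch below is a minimal illustration, not part of the module's code: the `EventTimes` type, its field names, the dates, and the 90-second budget are all assumptions chosen to mirror the story above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EventTimes:
    """Two timestamps for one interaction (hypothetical field names)."""
    arrived_at: datetime      # when the event reached the platform
    retrievable_at: datetime  # when a retriever could first return it


def freshness_gap(times: EventTimes) -> timedelta:
    """Delay between arrival and the moment a model can retrieve the event."""
    return times.retrievable_at - times.arrived_at


def within_budget(times: EventTimes, budget: timedelta) -> bool:
    """True when the event became retrievable inside the latency budget."""
    return freshness_gap(times) <= budget


# The earlier chat under two pipelines (illustrative numbers only).
chat_arrived = datetime(2025, 1, 6, 13, 30)
nightly = EventTimes(chat_arrived, chat_arrived + timedelta(hours=18))
streaming = EventTimes(chat_arrived, chat_arrived + timedelta(seconds=20))

copilot_budget = timedelta(seconds=90)  # assumed budget, not a recommendation

print(within_budget(nightly, copilot_budget))    # False
print(within_budget(streaming, copilot_budget))  # True
```

Treating the gap as a measured quantity with an explicit budget is what lets later chapters argue about "fresh enough" in numbers rather than adjectives.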
What makes this harder than ordinary stream processing is the word multimodal. The interactions are not tidy rows. They are call audio that must be transcribed before it means anything, chat text that is already searchable, and screenshots that must be captioned or embedded before a retriever can find them. Each modality has a different unit cost, a different latency to "usable," and a different failure mode. Transcribing one hour of audio with a self-hosted model costs cents; doing it with a streaming API at about $0.0077 per minute costs roughly $0.46 for the same hour. A screenshot embedding is one model call. A chat message is nearly free. The platform must accept all three from the same unbounded stream, route each to the right processor, and land them in indexes a copilot can query, fast enough that the freshness gap stays under the human's patience threshold.
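One way to picture "route each to the right processor" is a dispatch table keyed by modality. The sketch below assumes events are plain dicts with hypothetical `id` and `modality` fields, and the three processor functions are stand-ins that only label their output; a real pipeline would call an ASR model, a vision or embedding model, and a text indexer. The per-minute price is the figure quoted above.

```python
from typing import Callable, Dict

STREAMING_ASR_USD_PER_MIN = 0.0077  # streaming-API figure quoted in the text


def streaming_asr_cost_usd(audio_minutes: float) -> float:
    """Cost of transcribing audio at the quoted per-minute streaming rate."""
    return audio_minutes * STREAMING_ASR_USD_PER_MIN


# Stand-in processors (hypothetical): each returns a derived artifact record.
def transcribe_audio(event: dict) -> dict:
    return {"artifact": "transcript", "source": event["id"]}


def embed_image(event: dict) -> dict:
    return {"artifact": "embedding", "source": event["id"]}


def index_text(event: dict) -> dict:
    return {"artifact": "text", "source": event["id"]}


PROCESSORS: Dict[str, Callable[[dict], dict]] = {
    "audio": transcribe_audio,
    "image": embed_image,
    "text": index_text,
}


def route(event: dict) -> dict:
    """Send one event from the mixed stream to its modality's processor."""
    modality = event.get("modality")
    if modality not in PROCESSORS:
        raise ValueError(f"unsupported modality: {modality!r}")
    return PROCESSORS[modality](event)


stream = [
    {"id": "call-1", "modality": "audio"},
    {"id": "chat-1", "modality": "text"},
    {"id": "shot-1", "modality": "image"},
]
for event in stream:
    print(route(event))

print(round(streaming_asr_cost_usd(60), 3))  # 0.462
```

Rejecting unknown modalities loudly, rather than silently skipping them, keeps a new input type from disappearing into the stream unnoticed.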
The naive instinct is to bolt streaming onto the batch warehouse you already have. That instinct fails in a specific, repeatable way, and the first chapter walks you through the felt failure. From there the module builds the pipe one pressure at a time: how to ingest an unbounded stream without dropping data when consumers fall behind, where to store raw audio and video versus their derived artifacts, how to run ASR and embedding models inside the stream, how to keep vector and metadata indexes fresh without rebuilding them, how to decide how fresh is fresh enough, and how to govern PII and lineage across modalities that hide secrets in different places.
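The question of what happens "when consumers fall behind" comes down to what a full buffer does with the next event. The sketch below is a toy, single-process illustration using Python's standard `queue.Queue` as the bounded buffer; the `offer` helper, the `SlowDown` exception, and the policy names are hypothetical and do not come from any broker's API.

```python
import queue


class SlowDown(Exception):
    """Hypothetical signal telling the producer to pause and retry."""


def offer(buffer: queue.Queue, event: object, policy: str) -> bool:
    """Try to enqueue an event; the policy decides what a full buffer does."""
    if policy not in ("drop", "slow_source"):
        raise ValueError(f"unknown policy: {policy!r}")
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        if policy == "drop":
            return False  # data loss, made explicit to the caller
        raise SlowDown("buffer full; retry after consumers catch up") from None


buffer: queue.Queue = queue.Queue(maxsize=3)

# A burst of five events against a buffer of three, with the drop policy.
accepted = [offer(buffer, n, "drop") for n in range(5)]
print(accepted)  # [True, True, True, False, False]

# A consumer catches up by one event, so the next offer succeeds.
buffer.get_nowait()
print(offer(buffer, 5, "slow_source"))  # True

# The buffer is full again; this time the producer is told to wait.
try:
    offer(buffer, 6, "slow_source")
except SlowDown as signal:
    print(f"producer must wait: {signal}")
```

A durable log plays the role of a much larger buffer, which is why it can absorb bursts that an in-memory queue like this one would have to drop or push back on.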
The recurring pressures and concepts
| Pressure / concept | Meaning |
|---|---|
| Freshness gap | The delay between an event arriving and a model being able to retrieve it. The dominant pressure of the whole module. |
| Backpressure | What a system does when producers outrun consumers: buffer, slow the source, or drop. The honest answer separates real platforms from demos. |
| The replay log | A durable, replayable ingestion log (Kafka/Kinesis/Pulsar), ordered within each partition or shard, that decouples bursty producers from slower processors and lets you reprocess history. |
| Modality cost asymmetry | Each modality costs a different amount and takes a different time to become usable: audio needs ASR, images need captioning/embedding, text is nearly free. |
| Derived artifact | Anything computed from raw input — a transcript, a caption, an embedding. The raw stays cheap and cold; the derived artifact is what gets queried. |
| Exactly-once vs at-least-once | The guarantee a stream processor offers on each record. Cheaper guarantees mean duplicates the index must tolerate. |
| Incremental indexing | Updating a vector/metadata index as events arrive instead of rebuilding it nightly. The mechanism that actually closes the freshness gap at the serving layer. |
| Lambda vs Kappa | Two ways to reconcile fast-but-approximate and slow-but-correct paths. The choice that decides how much duplicated logic you maintain. |
| Lineage across modalities | Tracing a retrieved chunk back through its embedding, its transcript, and its raw audio — so you can audit, delete, or explain it. |
These nine names recur in every chapter. When a later file says "the replay log absorbs the burst" or "this is modality cost asymmetry biting again," it is pulling one of these forward under new pressure.
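Two rows of the table meet in practice: under at-least-once delivery the same record can arrive twice, so an incrementally updated index has to tolerate duplicates. A common way to do that is to derive the index key deterministically from the source event, so a redelivery overwrites rather than appends. The sketch below assumes a hypothetical `InMemoryIndex` standing in for a real vector or metadata store, and made-up event IDs.

```python
import hashlib


def artifact_key(source_event_id: str, kind: str) -> str:
    """Deterministic key: the same input always maps to the same index entry."""
    digest = hashlib.sha256(f"{source_event_id}:{kind}".encode("utf-8"))
    return digest.hexdigest()[:16]


class InMemoryIndex:
    """Stand-in for an index that supports upsert by key (hypothetical)."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def upsert(self, key: str, row: dict) -> None:
        # Writing the same key twice replaces the row instead of adding one.
        self._rows[key] = row

    def __len__(self) -> int:
        return len(self._rows)


# At-least-once delivery: the transcript for call-1 is delivered twice.
deliveries = [
    ("call-1", "transcript", "payment failed three times"),
    ("call-1", "transcript", "payment failed three times"),
    ("shot-7", "caption", "error dialog on checkout page"),
]

index = InMemoryIndex()
for event_id, kind, text in deliveries:
    index.upsert(artifact_key(event_id, kind), {"kind": kind, "text": text})

print(len(index))  # 2
```

The duplicate still costs a second model call upstream; the deterministic key only guarantees that the index does not end up with two copies.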
Top resources
- Kafka 4.3 documentation — https://kafka.apache.org/documentation/ (KRaft-only, Share Groups, KIP-848 rebalance)
- Apache Flink — Stateful Stream Processing — https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/
- Apache Iceberg spec — https://iceberg.apache.org/spec/
- Designing Data-Intensive Applications (Kleppmann), ch. 11 Stream Processing — the canonical text on logs, ordering, and exactly-once
- Amazon Kinesis Data Streams developer guide — https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
- Voyage / Cohere multimodal embedding docs — current cross-modal embedding model behavior
- Milvus / Qdrant streaming-ingest docs — real-time upsert and index-freshness mechanics
What's coming
- 01-batch-vs-streaming-pressure.md — why the nightly warehouse produces stale, confidently-wrong copilot answers, and what "fresh enough" must mean in numbers.
- 02-ingestion-and-backpressure.md — the replay log: partitions, ordering, and what happens when consumers fall behind a Monday-morning call surge.
- 03-multimodal-unstructured-storage.md — object store plus lakehouse: where raw audio/video/images live, where derived artifacts live, and how tiering keeps the bill sane.
- 04-streaming-transforms-and-embeddings.md — running ASR, vision, and embedding models in the stream; windowing; exactly-once vs at-least-once when a model call is the expensive step.
- 05-multimodal-indexing-and-retrieval.md — vector + metadata (+ optional graph) indexes, per-modality embeddings, cross-modal retrieval, and incremental updates that keep the index fresh.
- 06-freshness-vs-cost-lambda-kappa.md — lambda vs kappa, real-time vs offline paths, and the cost of always-on processing for data nobody queries.
- 07-governance-lineage-and-quality.md — schema evolution, lineage, streaming data quality, and PII that hides differently in audio, images, and text.
- 08-boundary-tradeoff-review.md — open problems, contested practices, and where streaming is plain over-engineering.
Memory map
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| Freshness gap | none (module anchor) | latency / freshness | the number every chapter is optimizing | user → API → pipeline |
| Replay log | freshness gap | backpressure / durability | the source of truth for replay (06), the lineage root (07) | producer → broker → consumer |
| Multimodal storage | replay log | cost / bandwidth | tiering vs query latency (06), deletion for PII (07) | pipeline → object store → lakehouse |
| Streaming transforms | replay log + storage | compute / ambiguity | exactly-once cost (06), quality on streams (07) | consumer → model → sink |
| Multimodal indexing | transforms + embeddings | latency / freshness | incremental update vs rebuild (06) | sink → vector DB → retriever |
| Lambda vs Kappa | indexing + transforms | cost / operator attention | the over-engineering boundary (08) | whole pipeline |
| Lineage & governance | all of the above | safety / data quality | the audit you wish you had built earlier | every layer |
Three traversal paths use this map.
- Prerequisite path: read 00 through 08 in order; each chapter assumes the last.
- Failure path: when the copilot cites stale or wrong data, walk back and ask whether the freshness gap is in ingestion (02), the transform (04), or the index update (05).
- Synthesis path: pick two rows from different pressure families and ask how they compose. For example, storage tiering (03) plus PII deletion (07) means cold audio in Glacier still has to be findable and erasable when a user invokes their right to be forgotten.
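The synthesis example above only works if lineage is recorded as data. A minimal sketch, assuming lineage is stored as child-to-parent edges: walking up answers "where did this chunk come from," and walking down from the raw audio answers "what must be erased." All artifact IDs and both helper names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Child -> parent edges: each derived artifact points at what it was built from.
PARENT_OF: Dict[str, str] = {
    "chunk-42": "embedding-42",
    "embedding-42": "transcript-9",
    "summary-3": "transcript-9",
    "transcript-9": "raw-audio-9",
}


def lineage(artifact_id: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk from a retrieved artifact back to its raw source."""
    path = [artifact_id]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def erasure_set(raw_id: str, parent_of: Dict[str, str]) -> Set[str]:
    """Everything that must be deleted when the raw input is erased."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    to_delete: Set[str] = set()
    stack = [raw_id]
    while stack:
        node = stack.pop()
        to_delete.add(node)
        stack.extend(children[node])
    return to_delete


print(lineage("chunk-42", PARENT_OF))
# ['chunk-42', 'embedding-42', 'transcript-9', 'raw-audio-9']

print(sorted(erasure_set("raw-audio-9", PARENT_OF)))
# ['chunk-42', 'embedding-42', 'raw-audio-9', 'summary-3', 'transcript-9']
```

The same edges serve both directions, so recording lineage at write time is cheaper than reconstructing it when a deletion request arrives.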
Bridge. Before we build the pipe, we have to feel why the warehouse you already own is the wrong tool. The next file drops you into a copilot answering from 18-hour-old data, traces exactly which decision made it stale, and turns "fresh enough" from a slogan into a latency budget you can defend. → 01-batch-vs-streaming-pressure.md