03. Where raw bytes live and where queryable artifacts live: storage for a multimodal stream
~22 min read. Your log holds an event that says s3://raw/audio/88213/call-1432.wav. It does not hold the 6-minute, 11 MB call. Where does that file go, what does it cost to keep a year of them, and how does the copilot ever find the transcript you derive from it?
Built on the replay log and modality cost asymmetry named in 00-first-principles.md, and the log-is-a-transit-layer point from 02-ingestion-and-backpressure.md. Chapter 02 ended with a log that retains events for hours, not forever, and holds pointers, not payloads. This file builds the home for the bytes and introduces the derived artifact as a first-class citizen.
What chapter 02 settled, and what it left homeless
Chapter 02 gave us a durable, ordered log that absorbs the Monday flood and lets ASR, image, text, and compliance consumers fan out from one stream. That log made one quiet promise it cannot keep: that raw audio "always lands in object storage immediately, never dropped." We leaned on that promise as the safety net under load-shedding — drop processing, never data. But we never built the net. The log retains events for 24 hours and holds a few hundred bytes per event: a type, a customer id, a timestamp, and a path. The 11 MB call recording, the 4 MB screenshot, the eventual transcript and the 1024-dimension embedding all have to live somewhere the log is not.
That somewhere has to satisfy three pressures at once. It must be cheap enough to keep a year of raw audio without the bill dominating the platform. It must be durable enough that the replay path from chapter 02 still finds the original bytes when you reprocess a Tuesday embedding bug. And the derived artifacts — transcripts, captions, embeddings — must be query-shaped, because those are what the copilot retrieves, not the raw WAV. Raw and derived have opposite access patterns, and putting them in the same store is the first mistake.
What this file solves
A streaming platform that lands raw audio, images, and derived transcripts and embeddings in one store ends up either paying S3-Standard rates to archive cold call recordings nobody plays, or trying to run similarity search over a blob store that cannot do it. This file shows how to split raw bytes (cheap, immutable, tiered object storage) from derived artifacts (query-shaped tables and indexes), how a lakehouse table format like Iceberg or Delta gives the derived metadata transactions and schema evolution on top of the same object store, and how lifecycle tiering keeps a year of raw multimodal data from becoming the line item that kills the project.
1) Why not just keep everything in the database: the need for object storage
The instinct after building a log is to write everything into the warehouse or a database the consumers already use. For the chat text, fine — it is small and structured. For the call recording, this breaks immediately on physics. A relational or analytical database is built to index and scan rows; it charges you for storage at premium rates because that storage sits on fast disks behind a query engine. An 11 MB blob gets none of that engine's value — you never run WHERE audio LIKE ... — and you pay the premium anyway. At 12,000 calls/day averaging 11 MB, that is ~132 GB/day, ~48 TB/year of raw audio alone, before images and video.
So the real need is not "a place to put files." It is a store with three properties the database cannot give cheaply: flat cost per byte regardless of access engine, eleven-nines durability, and immutability (write once, never update; a recording does not change). That store is object storage: S3, GCS, Azure Blob. You address an object by key, you never edit it in place, and you pay roughly $0.023/GB-month for hot access, dropping to ~$0.00099/GB-month in Deep Archive. The database stays for the small, queryable derived data; the bytes go to the object store.
WRONG: everything in one store RIGHT: split by access pattern
┌──────────────────────────┐ ┌─────────────────────┐ raw, immutable,
│ warehouse / vector DB │ │ OBJECT STORE │ write-once
│ chat rows ✓ │ │ audio.wav 11MB×N │ $0.001–0.023/GB
│ 11MB audio ✗ (premium) │ │ screenshot.png │ tiered by age
│ 4MB images ✗ │ ──split──▶ │ video.mp4 │
│ embeddings ✓ │ └─────────┬───────────┘
└──────────────────────────┘ │ derived from raw
pays query-engine rates ┌────────▼───────────┐ small, queryable
to store bytes nobody scans │ LAKEHOUSE + VECTOR │ transcripts, captions,
│ transcripts table │ embeddings, metadata
│ embeddings index │
└────────────────────┘
Why this rule exists. Storage cost scales with bytes; query value scales with structure. Raw multimodal payloads are huge and structureless to a query engine — you only ever fetch them by key or scan them with a model. Derived artifacts are small and richly queryable. Mixing them forces one of two losses: you pay query-engine storage rates for inert bytes, or you cripple query on the structured data by burying it under blobs. Split by access pattern and each store does the one thing it is priced for.
2) The core picture: raw lake, derived lakehouse, hot index
Memorize this layout. It is the spine of every section after this one.
THE LOG (chapter 02)
e{type, customer, s3_path, ts}
│
┌──────────┴───────────┐
▼ (raw bytes, on ingest) ▼ (derived, after transform — chapter 04)
┌───────────────┐ ┌──────────────────────────────────────────────┐
│ RAW ZONE │ │ DERIVED ZONE │
│ object store │ │ │
│ │ │ ┌────────────────┐ ┌────────────────────┐ │
│ audio/ ▒▒▒▒▒ │ asr→ │ │ LAKEHOUSE tables│ │ VECTOR INDEX │ │
│ images/ ▒▒▒ │ embed→ │ │ (Iceberg/Delta) │ │ (Milvus/Qdrant) │ │
│ video/ ▒▒▒▒▒▒ │ │ │ transcripts │ │ text vectors │ │
│ │ │ │ captions │ │ image vectors │ │
│ TIERED BY AGE: │ │ │ metadata │ │ hot, in RAM-ish │ │
│ hot→IA→archive │ │ │ on object store │ │ (chapter 05) │ │
└───────────────┘ │ └────────────────┘ └────────────────────┘ │
cheap, immutable, │ ACID, schema evolution, low-latency ANN │
replay source │ time-travel, batch+stream cross-modal query │
└──────────────────────────────────────────────┘
cold ◀───────────────── access frequency ─────────────────▶ hot
Three zones, three jobs. The raw zone is a flat, immutable, tiered pile of bytes — the replay source of truth and the thing PII deletion eventually has to reach (chapter 07). The lakehouse holds derived structured artifacts (transcripts, captions, per-row metadata) as ACID tables sitting on the same object store but with a transaction layer that gives schema evolution and time-travel. The vector index holds the embeddings, hot, tuned for low-latency similarity search (chapter 05). Cost-per-byte rises left to right; query value rises with it. The art is putting each artifact in the cheapest zone that can still serve its access pattern.
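The placement rule in the paragraph above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the `Artifact` fields, the `choose_zone` name, and the 64 KB blob cutoff are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    RAW_OBJECT_STORE = "raw zone (tiered object storage)"
    LAKEHOUSE = "lakehouse table"
    VECTOR_INDEX = "vector index"


@dataclass(frozen=True)
class Artifact:
    name: str
    size_bytes: int
    is_vector: bool = False      # needs similarity search
    is_structured: bool = False  # rows/columns a query engine can filter


# Hypothetical cutoff: anything larger is treated as an inert blob.
BLOB_THRESHOLD_BYTES = 64 * 1024


def choose_zone(a: Artifact) -> Zone:
    """Pick the cheapest zone that still serves the artifact's access pattern."""
    if a.is_vector:
        return Zone.VECTOR_INDEX
    if a.is_structured and a.size_bytes <= BLOB_THRESHOLD_BYTES:
        return Zone.LAKEHOUSE
    # Large or unstructured payloads are only fetched by key or fed to a model.
    return Zone.RAW_OBJECT_STORE


call = Artifact("call-1432.wav", 11_000_000)
transcript = Artifact("transcript row", 2_000, is_structured=True)
embedding = Artifact("1024-d embedding", 4_096, is_vector=True)
for item in (call, transcript, embedding):
    print(item.name, "->", choose_zone(item).value)
```

Under these assumptions a 200-byte chat message marked as structured routes straight to the lakehouse, which matches the "no separate raw zone" case for small text.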
3) The running example: one call through the storage layers
Customer 88213's 14:32 call, traced byte by byte.
14:32:05 voice gateway writes raw bytes:
PUT s3://raw/audio/88213/2026-06-03/call-1432.wav (11 MB, immutable)
produces log event {type:audio, customer:88213, s3:".../call-1432.wav"}
14:34:20 ASR consumer reads the event, fetches the WAV by key, transcribes:
INSERT into lakehouse table `transcripts`
(customer=88213, call_id=..., text="I've tried this payment three times...",
raw_s3=".../call-1432.wav", model="whisper-lg-v3", lang="en", ts=14:32:05)
14:34:21 embedder reads the transcript text, produces a 1024-d vector:
UPSERT into vector index (id=transcript:88213:..., vec=[...], modality=text,
raw_s3=".../call-1432.wav")
Notice that three artifacts now point back to one raw object: the transcript row and the embedding carry raw_s3=.../call-1432.wav, and the log event carries the same path in its s3 field. That back-pointer is the lineage root the governance chapter (07) depends on: every derived thing knows the raw byte it came from, so you can re-derive after a model upgrade, or delete the whole chain when the customer invokes erasure. The raw WAV will be played by a human maybe twice in its life (a dispute, a QA audit); the transcript and embedding are queried constantly. That access gap is exactly why they live in different zones with different cost curves.
The screenshot follows the same path: raw PNG to s3://raw/images/, a derived caption and a multimodal embedding into the derived zone, both back-pointing to the PNG. The chat message is the easy case — small enough that its "raw" form is its derived form, so it lands directly in the lakehouse and gets embedded; there is no separate raw zone for a 200-byte message.
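The back-pointer structure in the trace can be sketched with plain records. This is an illustrative model only: the dataclass names, the `derived_from` helper, and the example ids (`call-1432`, `vec:88213:call-1432`) are hypothetical and stand in for whatever the real tables use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


def raw_key(modality: str, customer: int, date: str, name: str) -> str:
    """Build a raw-zone key in the layout the trace uses."""
    return f"s3://raw/{modality}/{customer}/{date}/{name}"


@dataclass(frozen=True)
class TranscriptRow:
    customer: int
    call_id: str          # hypothetical id format
    text: str
    raw_s3: str           # back-pointer to the raw object
    model: str
    lang: str


@dataclass(frozen=True)
class VectorRecord:
    id: str               # hypothetical id format
    vec: Tuple[float, ...]
    modality: str
    raw_s3: str           # same back-pointer


def derived_from(raw_s3: str,
                 transcripts: Iterable[TranscriptRow],
                 vectors: Iterable[VectorRecord]) -> List[str]:
    """Return ids of every derived artifact that points back to one raw object."""
    ids = [f"transcript:{t.customer}:{t.call_id}" for t in transcripts if t.raw_s3 == raw_s3]
    ids += [v.id for v in vectors if v.raw_s3 == raw_s3]
    return ids


wav = raw_key("audio", 88213, "2026-06-03", "call-1432.wav")
row = TranscriptRow(88213, "call-1432", "I've tried this payment three times...",
                    wav, "whisper-lg-v3", "en")
vec = VectorRecord("vec:88213:call-1432", (0.0,) * 1024, "text", wav)
print(derived_from(wav, [row], [vec]))
```

Given one raw key, the lookup returns the whole derived chain, which is the operation both re-derivation and erasure start from.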
4) Rule: store raw once, immutable, and tiered; derive often, query-shaped, and rebuildable
The chapter's invariant: raw payloads are written once and never edited, kept in the cheapest tier their access frequency allows; everything queryable is a derived artifact that points back to its raw source and can be rebuilt by replay. Two consequences fall out of this. First, you never destroy raw on a processing decision — chapter 02's "drop processing, never data" depends on raw being the durable floor. Second, derived artifacts are disposable: if the embedding model changes, you do not panic about lost vectors, you replay the raw through the new model. Raw is precious and cheap; derived is valuable and rebuildable.
Teacher voice. Treat raw as the negatives and derived as the prints. You keep the negatives in a cool, cheap drawer and almost never touch them; you make as many prints as you like and throw them away when the process improves. A platform that cannot re-derive — that treats embeddings as irreplaceable because it deleted or never kept the raw — has lost its negatives. Every "we can't fix the old embeddings" incident traces to a missing raw zone or a missing back-pointer.
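A minimal sketch of the negatives-and-prints idea, assuming an in-memory stand-in for the object store. `WriteOnceStore` and `rederive` are hypothetical names, and `model_fn` stands in for an ASR or embedding model call.

```python
class WriteOnceStore:
    """In-memory stand-in for an immutable raw zone: keys can be written once."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        if key in self._objects:
            # Refuse in-place edits: raw is write-once.
            raise FileExistsError(f"raw object already exists: {key}")
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self):
        return list(self._objects)


def rederive(store: WriteOnceStore, model_name: str, model_fn):
    """Rebuild every derived row by replaying raw bytes through model_fn."""
    return {
        key: {"raw_s3": key, "model": model_name, "value": model_fn(store.get(key))}
        for key in store.keys()
    }


store = WriteOnceStore()
store.put("s3://raw/audio/88213/2026-06-03/call-1432.wav", b"fake-wav-bytes")

prints_v1 = rederive(store, "model-v1", len)                  # first "print"
prints_v2 = rederive(store, "model-v2", lambda b: b.upper())  # re-derive after an upgrade
```

The derived dictionaries can be thrown away and rebuilt at will; the store itself never changes, which is what makes the replay reproducible.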
5) The lakehouse: why a table format and not just folders of files
The raw zone is happy as plain keyed objects. The derived zone is not, and here is where a naive design breaks. The first instinct is to write transcripts as Parquet files directly into s3://derived/transcripts/ and point a query engine at the folder. It works for a day. Then the problems arrive, all from the same root: a folder of files has no transaction boundary and no schema authority.
Attempt A: bare Parquet files in a folder
A streaming writer appends a new Parquet file every minute. Helps: dead simple, cheap, any engine reads Parquet.
It breaks on four concrete cases, each from the running example:
- Partial reads. The copilot's backing query runs `SELECT text FROM transcripts WHERE customer=88213` while the streaming writer is mid-flush. It reads a half-written file and either errors or returns a torn row. No atomic commit.
- Schema drift. Month two, you add a `sentiment` column to transcripts. Old files don't have it; new files do. Queries break or silently null-fill, and there is no record of when the schema changed.
- Small-file explosion. A per-minute streaming append creates 1,440 tiny files/day per partition. Query planning slows to a crawl reading thousands of file footers: the streaming small-file problem.
- No deletes. Customer 88213 invokes erasure. You must delete their transcripts. In a bare folder you rewrite every file that mentions them and hope no query reads mid-rewrite.
Attempt B: a lakehouse table format (Iceberg / Delta / Hudi)
A table format is a metadata layer over those same Parquet files on the same object store. It adds an atomic manifest: a commit either makes a new snapshot visible or it doesn't, so the copilot never reads a torn write. It tracks schema as versioned metadata, so adding sentiment is a logged, backward-compatible evolution. It supports compaction (merge small files into big ones) and row-level deletes (for erasure). And it gives time-travel: query the table as of last Tuesday to audit what the copilot would have retrieved before the embedding fix.
BARE FOLDER (Attempt A) TABLE FORMAT (Attempt B)
s3://derived/transcripts/ s3://derived/transcripts/
part-0001.parquet ← torn? data/part-*.parquet (same files)
part-0002.parquet metadata/
... 1,440 tiny files/day snapshot-N.json ← atomic commit pointer
schema history, manifests, delete files
no atomic commit atomic snapshots, schema evolution,
no schema authority compaction, row-level deletes, time-travel
no deletes, no time-travel same object store underneath
So the real problem with bare folders is not the file format; Parquet is fine. It is the absence of a transaction and schema layer over the files. The table format adds exactly that layer and nothing the object store already provides — it is metadata, not a new storage system.
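To make "a transaction layer that is only metadata" concrete, here is a toy in-memory model of a snapshot log. It is a sketch of the idea, not Iceberg's or Delta's actual metadata layout; `ToyTable` and its methods are hypothetical.

```python
class ToyTable:
    """Toy snapshot log over immutable data files (illustrative only)."""

    def __init__(self):
        self._files = {}        # file name -> rows; never modified after write
        self._snapshots = [()]  # snapshot id -> tuple of visible file names

    def write_file(self, name, rows):
        """Stage a data file. Readers cannot see it until a commit lists it."""
        self._files[name] = list(rows)

    def commit(self, added=(), removed=()):
        """Publish a new snapshot in one step and return its id."""
        current = self._snapshots[-1]
        visible = tuple(f for f in current if f not in removed) + tuple(added)
        self._snapshots.append(visible)
        return len(self._snapshots) - 1

    def scan(self, snapshot_id=None):
        """Read all rows visible in a snapshot (latest by default)."""
        sid = len(self._snapshots) - 1 if snapshot_id is None else snapshot_id
        return [row for f in self._snapshots[sid] for row in self._files[f]]


table = ToyTable()
table.write_file("part-0001", [{"customer": 88213, "text": "first call"}])
staged_view = table.scan()                       # writer mid-flush: nothing visible yet
s1 = table.commit(added=["part-0001"])

table.write_file("part-0002", [{"customer": 88213, "text": "second call"}])
s2 = table.commit(added=["part-0002"])

# Compaction: rewrite two small files as one, then swap them in a single commit.
table.write_file("compacted-0001", table.scan())
s3 = table.commit(added=["compacted-0001"], removed=["part-0001", "part-0002"])
```

A reader only ever follows the latest committed snapshot, so it cannot observe a half-written file; `table.scan(s1)` is the time-travel read, and the compaction commit changes the file list without changing the rows.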
Mini-FAQ. "Iceberg, Delta, or Hudi: which one?" For our derived transcripts and captions, the dominant operations are streaming appends, occasional schema additions, broad multi-engine reads (Flink writes, the copilot's query engine reads, a nightly Spark job rebuilds embeddings), and GDPR row-deletes. Iceberg has become the 2026 default for exactly this profile: vendor-neutral, partition evolution, and the broadest engine support (Spark, Flink, Trino, Snowflake, BigQuery). Hudi wins when the workload is high-frequency upserts/CDC: its merge-on-read tables append changed records to row-based delta logs and fold them into the columnar base files later, which minimizes write amplification. That profile is more about mutating dimension tables than appending transcripts. Delta is the natural pick inside a Databricks/Spark Structured Streaming shop. None of them is the raw zone; raw stays as plain objects.
6) The property that changes the design: tiering by access frequency
The single largest storage lever is which tier each byte lives in over its lifetime. Raw audio is hot for about a day (the live and immediately-following interactions), warm for a few weeks (dispute window, QA), then cold for years (compliance, training-set rebuilds). Paying S3-Standard for all of it is the most common storage-cost blowup on these platforms.
The tiers and their 2026 S3 economics:
| Tier | Storage $/GB-mo | Retrieval | Good for, on our platform |
|---|---|---|---|
| S3 Standard | ~$0.023 | free, instant | raw of the last ~24–48 h; all derived tables/indexes |
| Standard-IA | ~$0.0125 | per-GB fee, instant | raw audio/images 2–30 days old (dispute window) |
| Glacier Instant Retrieval | ~$0.004 | per-GB fee, ms | raw 30–180 days, occasional QA/audit pulls |
| Glacier Deep Archive | ~$0.00099 | hours to restore, high per-GB | raw older than 180 days; compliance only |
| Intelligent-Tiering | $0.023→$0.00099 auto | no retrieval fee | raw with unpredictable access; +$0.0025/1k objects/mo monitoring |
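The 23× spread between Standard and Deep Archive is the whole game. A lifecycle policy that moves raw audio Standard → IA at 2 days → Glacier IR at 30 days → Deep Archive at 180 days turns a flat $0.023/GB bill into a blended rate an order of magnitude lower for the bulk of the data. Treat those day boundaries as illustrative: S3 lifecycle rules will not transition an object to Standard-IA until it has spent 30 days in Standard, and the colder classes bill minimum storage durations (30 days for Standard-IA, 90 for Glacier Instant Retrieval, 180 for Deep Archive), so check the schedule against current provider rules. The catch, and this is the boundary that bites people, is the **retrieval cost asymmetry**: Deep Archive stores for ~$1/TB-month, but retrieving it in volume costs many times the monthly storage rate per GB and takes hours. Tier raw you will rarely read; never tier data you query, like the derived embeddings.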
Teacher voice. The retrieval bill is the silent killer, not the storage bill. Glacier looks like "$1/TB to store" and reads like a steal — until a legal hold forces you to restore 40 TB of call audio and the retrieval + request charges dwarf a year of storage. Tier by read probability, and put anything with an unpredictable read pattern in Intelligent-Tiering, which charges no retrieval fee and auto-demotes idle objects. The decision is not "how old," it is "how likely will I read this, and can I tolerate hours of latency when I do."
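The age-based schedule and its blended rate can be sketched as follows. The rates are the illustrative ones from the table above, the day boundaries are the chapter's example schedule, and the calculation assumes uniform daily volume and ignores retrieval, request, and minimum-duration charges. `tier_for_age` and `blended_rate` are hypothetical helper names.

```python
# (upper age bound in days, storage class, $/GB-month); verify current pricing.
TIERS = [
    (2, "STANDARD", 0.023),
    (30, "STANDARD_IA", 0.0125),
    (180, "GLACIER_IR", 0.004),
    (float("inf"), "DEEP_ARCHIVE", 0.00099),
]


def tier_for_age(age_days: int):
    """Return (storage class, rate) for an object of the given age."""
    for upper, name, rate in TIERS:
        if age_days < upper:
            return name, rate


def blended_rate(days_retained: int) -> float:
    """Average $/GB-month when one equal slice of data exists at each age."""
    return sum(tier_for_age(d)[1] for d in range(days_retained)) / days_retained


print(tier_for_age(1), tier_for_age(45), tier_for_age(400))
print(round(blended_rate(365), 5))  # about 0.00323 with these inputs
```

With a full year retained, only 2 of 365 daily slices sit in Standard, which is why the blended figure lands far closer to the archive rates than to $0.023.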
7) Cost worked through: a year of the running platform
Concrete numbers for the running platform, raw zone only, with and without tiering. Order-of-magnitude; verify current pricing.
Daily raw volume:
audio: 12,000 calls × 11 MB ≈ 132 GB/day
images: 9,000 shots × 0.8 MB ≈ 7 GB/day
video: (assume 5% of calls have screen-share, 30 MB) ≈ 18 GB/day
total raw ≈ 157 GB/day ≈ 4.7 TB/month ≈ 57 TB/year
| Strategy | Effective $/GB-mo (blended) | ~Annual raw storage cost | Notes |
|---|---|---|---|
| All S3 Standard | $0.023 | ~$15,700/yr (57 TB resident) | simplest, most expensive; cold audio at hot rates |
| Lifecycle: Std→IA→GIR→Deep Archive | ~$0.004 blended | ~$2,700/yr | bulk ages into archive; retrieval fees on the rare read |
| Intelligent-Tiering | ~$0.005 blended + monitoring | ~$3,400/yr | no retrieval fee; ~$0.0025/1k objects/mo monitoring adds up at object count |
(Figures assume steady state, with a full year of raw, ~57 TB, resident for all twelve months. In the first year resident TB grows as data accumulates, so the average is about half that and the first-year bill is roughly half these figures. Request costs are ignored; they matter more for small objects, and the 9k images/day generate a lot of PUT requests.) The lesson is not the exact dollar figure; it is the roughly 5–6× swing from a one-page lifecycle policy. Derived storage is a rounding error by comparison: a year of 1024-d float32 embeddings for ~7.6M artifacts is roughly 31 GB. The embeddings cost cents; the raw audio costs thousands. Modality cost asymmetry shows up in storage too: audio dominates the bytes and the bill.
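The arithmetic behind the table can be reproduced in a few lines. This sketch uses decimal units (1 GB = 1000 MB) and assumes the full ~57 TB is resident for all twelve months, which is what the table's dollar figures correspond to; `annual_cost` is a hypothetical helper.

```python
GB_PER_TB = 1000

# Daily raw volume in GB, from the traffic figures above.
daily_gb = {
    "audio": 12_000 * 11 / 1000,         # 12,000 calls at 11 MB
    "images": 9_000 * 0.8 / 1000,        # 9,000 screenshots at 0.8 MB
    "video": 12_000 * 0.05 * 30 / 1000,  # 5% of calls carry a 30 MB screen-share
}
total_daily_gb = sum(daily_gb.values())
yearly_tb = total_daily_gb * 365 / GB_PER_TB


def annual_cost(resident_tb: float, rate_per_gb_month: float) -> float:
    """Storage-only cost of keeping resident_tb for twelve months."""
    return resident_tb * GB_PER_TB * rate_per_gb_month * 12


all_standard = annual_cost(57, 0.023)
lifecycle = annual_cost(57, 0.004)
intelligent = annual_cost(57, 0.005)

# One year of 1024-d float32 embeddings for ~7.6M artifacts.
embeddings_gb = 7_600_000 * 1024 * 4 / 1e9

print(f"{total_daily_gb:.0f} GB/day, {yearly_tb:.0f} TB/year")
print(f"standard ${all_standard:,.0f}  lifecycle ${lifecycle:,.0f}  "
      f"intelligent ${intelligent:,.0f}  swing {all_standard / lifecycle:.2f}x")
print(f"embeddings {embeddings_gb:.1f} GB")
```

Swapping in half the resident volume gives a first-year estimate; the ratio between strategies stays the same because it depends only on the blended rates.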
8) Operational signals: watching storage health
- Healthy: raw-zone bytes grow linearly with traffic; >90% of bytes sit in IA or colder within a week; derived tables compacted (file count stable, not growing per-minute); retrieval requests rare and small.
- First metric to degrade: small-file count per partition in the derived lakehouse. A streaming writer appending per micro-batch grows file count linearly; query planning latency climbs before storage cost does. Compaction lag is the leading indicator of a slow copilot query.
- Misleading metric people watch: total bytes stored. It rises smoothly and looks like the cost story, but the bill is driven as much by which tier and by retrieval + request charges as by raw byte count. A flat byte-growth chart can hide a tripling bill from a bad lifecycle policy or a retrieval storm.
- First graph an expert opens: cost broken down by storage class and by request type (storage vs retrieval vs PUT/GET requests), overlaid with file count in the derived tables. They look for raw bytes stuck in Standard (missing lifecycle), a retrieval spike (someone scanning archive), or exploding small files (missing compaction).
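The small-file signal and its fix can be sketched as a compaction planner that groups small files into merge batches. This is a simplified greedy grouping under assumed thresholds (128 MB target, 32 MB "small" cutoff), not how any particular table format schedules compaction; `plan_compaction` is a hypothetical name.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_mb=32):
    """Group small files into merge batches of roughly target_mb each.

    Returns a list of batches; each batch is a list of file indexes to merge.
    Files at or above small_mb are left alone.
    """
    batches, current, current_size = [], [], 0.0
    for i, size in enumerate(file_sizes_mb):
        if size >= small_mb:
            continue  # already large enough
        if current and current_size + size > target_mb:
            batches.append(current)  # batch is full, start a new one
            current, current_size = [], 0.0
        current.append(i)
        current_size += size
    if len(current) > 1:  # a single leftover file is not worth rewriting
        batches.append(current)
    return batches


# One day of per-minute appends: 1,440 files of about 0.5 MB each.
day_of_files = [0.5] * 1440
plan = plan_compaction(day_of_files)
print(len(day_of_files), "files ->", len(plan), "files after compaction")
```

With these inputs the 1,440 per-minute files collapse into 6, which is the file-count curve the healthy-state bullet describes as stable rather than growing per minute.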
9) Boundary: where this split helps, and where it is overhead
- Strong fit: large, immutable, rarely-read raw payloads (audio, video, images) alongside small, query-hot derived artifacts — exactly the multimodal copilot. The split saves an order of magnitude on storage and keeps query fast.
- Pathological: applying the full raw-zone-plus-lakehouse-plus-vector-index machinery to a text-only, low-volume feed. If the only modality is 200-byte chat messages, there is no "raw" worth separating; a single table plus an embedding column is simpler, and the three-zone architecture is ceremony.
- Scale/workload limit that breaks intuition: at small scale the lakehouse table format's metadata and compaction overhead can cost more attention than bare Parquet would, and the storage savings are noise. The split earns its keep only when raw bytes are large and numerous enough that tiering moves real money and small-file counts threaten query latency — i.e., real multimodal volume, not a prototype.
10) Wrong model to drop: "store it all in the vector database"
The seductive idea, especially after reading about multimodal vector search, is to push raw images and audio straight into the vector database "since it handles multimodal." It feels efficient — one system, one query. The correct model: the vector database is the hot derived index, not the raw store. It is priced and tuned to keep vectors and an ANN graph in fast memory for low-latency similarity search; loading 11 MB raw payloads into it blows its memory budget, wrecks its cost model, and gives you none of the tiering that makes raw cheap. Raw bytes go to object storage; embeddings of those bytes go to the vector DB; the vector DB row carries a raw_s3 pointer back. One system per access pattern.
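A back-of-envelope comparison shows the size of the mismatch. The sketch counts vector bytes only (it ignores ANN graph and metadata overhead, which a real index adds) and uses the chapter's 11 MB call and 1024-d float32 embedding; the helper names are hypothetical.

```python
def vector_bytes(dims: int = 1024, bytes_per_float: int = 4) -> int:
    """Size of one float32 embedding, excluding index overhead."""
    return dims * bytes_per_float


def footprint_gb(n_items: int, per_item_bytes: int) -> float:
    """Total size in GB (decimal) for n_items of per_item_bytes each."""
    return n_items * per_item_bytes / 1e9


RAW_CALL_BYTES = 11_000_000
calls_per_year = 12_000 * 365

ratio = RAW_CALL_BYTES / vector_bytes()
vectors_gb = footprint_gb(calls_per_year, vector_bytes())
raw_gb = footprint_gb(calls_per_year, RAW_CALL_BYTES)

print(f"one raw call is about {ratio:,.0f}x its embedding")
print(f"year of call vectors: {vectors_gb:.1f} GB; year of raw calls: {raw_gb:,.0f} GB")
```

Roughly 18 GB of vectors fits in the memory budget of a single index node; roughly 48,000 GB of raw audio does not, whatever the vendor's multimodal marketing says.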
11) Other storage failure shapes
- Orphaned raw: raw bytes written but the producing log event lost before any derived artifact points to them; the WAV exists but nothing references it, so it is invisible and un-deletable by lineage.
- Dangling pointer: a derived row's `raw_s3` points to an object that lifecycle archived to Deep Archive; "fetch the original audio" now takes hours, surprising a live audit.
- Small-file storm: per-micro-batch streaming appends to the lakehouse without compaction; query planning degrades as footers multiply.
- Schema drift without a format: bare-Parquet folders where new columns silently null-fill old files; queries return wrong aggregates with no error.
- Tier-and-retrieve thrash: data aged to archive then repeatedly pulled back for reprocessing; retrieval fees exceed what hot storage would have cost.
- Mutable-raw temptation: someone "fixes" a raw recording in place; replay now reproduces the edited version, breaking the lineage assumption that raw is immutable.
- Cross-region egress surprise: raw in one region, compute in another; per-GB egress on 157 GB/day dwarfs storage.
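The first two failure shapes in the list above are detectable with set arithmetic over raw keys and derived back-pointers. A minimal sketch, assuming you can list raw keys, collect `raw_s3` values from derived rows, and know which keys sit in an archive tier; `audit_lineage` is a hypothetical name, and the `missing_raw` check (a pointer to an object that no longer exists) is an extra case added here for completeness.

```python
def audit_lineage(raw_keys, derived_pointers, archived_keys=frozenset()):
    """Compare raw-zone keys against the raw_s3 values on derived rows.

    raw_keys:         keys that exist in the raw zone
    derived_pointers: raw_s3 values carried by derived rows (duplicates allowed)
    archived_keys:    raw keys currently in an archive tier (hours to fetch)
    """
    raw, pointed = set(raw_keys), set(derived_pointers)
    return {
        "orphaned_raw": sorted(raw - pointed),                   # nothing references it
        "missing_raw": sorted(pointed - raw),                    # pointer to nothing
        "archived_targets": sorted(pointed & set(archived_keys)),  # slow to fetch
    }


report = audit_lineage(
    raw_keys={"audio/a.wav", "audio/b.wav", "audio/c.wav"},
    derived_pointers=["audio/a.wav", "audio/a.wav", "audio/d.wav"],
    archived_keys={"audio/a.wav"},
)
print(report)
```

Run periodically, a report like this turns "invisible and un-deletable by lineage" into a list someone can act on before an audit or erasure request finds it.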
12) Pattern transfer
- Cold vs hot tiering = caching at the storage layer — Standard/IA/Glacier is the same locality pressure as L1/L2/RAM/disk: keep what you read often close and expensive, push what you read rarely far and cheap. The retrieval-cost asymmetry is the storage analogue of a cache miss penalty.
- Immutability + log = log-structured storage — raw-write-once plus the chapter-02 replay log is one idea: append-only data is replayable and reasoning-friendly; the same invariant that makes the log work makes the raw zone work.
- Derived-is-rebuildable = materialized view — transcripts and embeddings are materializations over raw; like any materialized view, they can be dropped and recomputed from the source, which is why an embedding bug (chapter 04) is recoverable.
- Schema evolution (chapter 07) — the table format's versioned schema is the mechanism chapter 07 leans on for safe schema evolution across a live stream; introduced here, stressed there.
13) Design test
- Is raw written once, immutable, and never edited in place — so replay reproduces the original bytes?
- Does every derived artifact carry a back-pointer to its raw source, so you can re-derive or delete the whole chain?
- Are derived, query-hot artifacts in a table format / vector index, and raw inert bytes in plain tiered object storage — not mixed?
- Is there a lifecycle policy moving raw to cheaper tiers by read probability, and have you priced the retrieval cost of the rare read?
- Is the lakehouse compacting streaming small files, so query planning latency stays flat as data grows?
Where this appears in production
Raw + lakehouse storage layers:
- Netflix: origin and one of the largest Apache Iceberg users; petabytes of event and media-derived data as Iceberg tables on S3 for multi-engine access.
- Apple: runs Iceberg at very large scale for streaming and batch analytics on shared object storage.
- Databricks customers: Delta Lake as the derived-table format for Spark Structured Streaming pipelines, with ACID commits over object storage.
- Uber: created Apache Hudi specifically for high-frequency upsert/CDC ingestion into the lake, minimizing write amplification.
- Snowflake / BigQuery external tables: query Iceberg tables in place on object storage without copying, the multi-engine read pattern.
- Spotify: large lakehouse on object storage feeding both batch ML and near-real-time features.

Object storage and tiering:
- Amazon S3 + Intelligent-Tiering: auto-demotes idle objects across tiers with no retrieval fee; the default for unpredictable raw read patterns.
- S3 Glacier Deep Archive: ~$1/TB-month cold raw audio and compliance archives where hours-to-restore is acceptable.
- Google Cloud Storage / Azure Blob tiers: equivalent hot/cool/archive class spreads for the same raw-tiering pattern.
- Pinterest: stores raw images in object storage and derived visual embeddings separately for visual search.
- Dropbox: raw file bytes in object storage, derived search/preview artifacts in queryable stores, the same raw/derived split.
- Zoom / call-center platforms: raw call recordings to object storage, transcripts and embeddings derived for search and copilots.
- Datadog / observability vendors: raw telemetry to cheap object tiers, derived indexes hot, tiering by query recency.
- Tabular / Onehouse: managed Iceberg/Hudi services that automate compaction and tiering for streaming lakehouse ingestion.
- Slack: message and file storage split: small messages query-shaped, uploaded files in object storage with lifecycle.
- Twitter/X media: raw media in object storage, derived thumbnails and embeddings in hot query layers.
Pause and recall
- Why does an 11 MB call recording not belong in the vector database or the warehouse?
- State the raw-vs-derived invariant in one sentence. Which is precious-and-cheap, and which is valuable-and-rebuildable?
- Name two concrete failures of writing transcripts as bare Parquet files in a folder. What single missing layer causes both?
- What does a lakehouse table format add on top of object storage, and what does it not replace?
- What is the 23× number, and what one-page artifact captures most of that saving?
- Why is the retrieval bill more dangerous than the storage bill for archived raw?
- Why are a year of embeddings cheap while a year of raw audio is expensive — what is this an instance of?
- What is the `raw_s3` back-pointer for, in two later chapters?
Interview Q&A
Q1. Why split raw bytes from derived artifacts into different stores instead of one system?
A. They have opposite access patterns. Raw payloads are huge, immutable, and read by key very rarely; derived artifacts are small, query-hot, and read constantly by the copilot. One store optimized for both does neither well: you either pay query-engine storage rates for inert blobs or cripple query latency by burying structured data under raw. Split by access pattern: object storage (tiered) for raw, lakehouse + vector index for derived.
Common wrong answer to avoid: "Put everything in the vector DB since it's multimodal." The vector DB is the hot derived index; loading raw payloads wrecks its memory and cost model.
Q2. A teammate writes transcripts as Parquet files directly to an S3 folder and points Trino at it. What breaks?
A. No transaction layer means the copilot can read a half-written file (torn read); no schema authority means a later column addition silently null-fills old files; per-micro-batch appends create a small-file storm that slows query planning; and there is no clean row-level delete for GDPR erasure. A table format (Iceberg/Delta) adds atomic snapshots, versioned schema, compaction, and deletes over the same Parquet files.
Common wrong answer to avoid: "Parquet handles schema, so it's fine." Parquet stores a schema per file; it has no cross-file transaction or schema-evolution authority.
Q3. How do you keep a year of raw call audio from dominating the bill?
A. Lifecycle tiering by read probability: Standard for the last day or two, Standard-IA for the dispute window, Glacier Instant Retrieval for months, Deep Archive past ~180 days, or Intelligent-Tiering when reads are unpredictable, since it has no retrieval fee. The Standard-to-Deep-Archive spread is ~23×, so a one-page policy cuts the raw bill 5–6×. Never tier the derived embeddings; they're queried constantly.
Common wrong answer to avoid: "Glacier Deep Archive for everything, it's $1/TB." Bulk retrieval from Deep Archive costs orders of magnitude more and takes hours; a legal hold restore can dwarf a year of storage.
Q4. Why are derived artifacts safe to lose but raw is not?
A. Derived artifacts (transcripts, embeddings) are materializations of raw and can be rebuilt by replaying raw through the current models — so an embedding-model upgrade is a re-derive, not a data loss. Raw is the only irreplaceable copy; the original bytes cannot be reconstructed. That asymmetry is why raw is written once and immutably and why every derived row carries a raw_s3 back-pointer.
Common wrong answer to avoid: "Back up the embeddings carefully, they're expensive to compute." Compute cost isn't the point; embeddings are reproducible from raw, raw is not — protect raw.
Q5. Iceberg, Delta, or Hudi for the derived transcripts table: how do you choose?
A. The workload is streaming appends, occasional schema additions, broad multi-engine reads, and GDPR deletes. Iceberg fits that best in 2026: vendor-neutral, partition evolution, broadest engine support. Hudi wins for high-frequency upsert/CDC (merge-on-read, low write amplification), which is more about mutating tables than appending transcripts. Delta is the natural choice in a Databricks/Spark shop.
Common wrong answer to avoid: "Whichever, they're interchangeable." Their write models differ: Hudi is built for upserts, Iceberg for broad-engine analytics, Delta for the Spark ecosystem.
Q6. (Cumulative) The copilot's backing query got slow this week, but storage cost is flat and consumer lag is normal. Where do you look?
A. Not chapter 02 (lag is fine) and not raw storage (cost flat). Look at the derived lakehouse small-file count: a streaming writer appending per micro-batch without compaction multiplies file footers, so query planning slows even though bytes and lag look healthy. Run/tune compaction. The leading signal is file count per partition, not bytes or lag.
Common wrong answer to avoid: "Scale the query engine." More compute masks a small-file problem briefly; the fix is compaction in the table format.
Design/debug exercise (10 min)
Step 1 — Modeled example. Storage plan for the audio modality:
Modality: audio (calls)
Raw: PUT s3://raw/audio/<customer>/<date>/<call>.wav immutable, write-once
Lifecycle: Standard (0–2d) → Std-IA (2–30d) → Glacier IR (30–180d) → Deep Archive (>180d)
Derived: transcript → Iceberg table `transcripts` (raw_s3 back-pointer, model, lang, ts)
embedding → vector index (id, vec, modality=text, raw_s3)
Rebuild: embedding-model upgrade → replay raw through new model, re-derive (raw untouched)
Delete: erasure → row-delete transcript + vector by customer; tombstone raw object
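The Delete line of the plan above can be sketched as a function that follows back-pointers. This is a simplified in-memory illustration: it assumes both transcript and vector records carry a `customer` field (the trace only shows it on the transcript), and `erase_customer` and the tombstone set are hypothetical stand-ins for row-level deletes and an object-deletion queue.

```python
def erase_customer(customer, transcripts, vectors, raw_tombstones):
    """Delete one customer's derived rows and tombstone the raw objects they cite.

    transcripts / vectors: lists of dicts with 'customer' and 'raw_s3' keys.
    raw_tombstones: set of raw keys scheduled for deletion (updated in place).
    Returns the surviving (transcripts, vectors).
    """
    doomed = [r for r in transcripts + vectors if r["customer"] == customer]
    # Follow the back-pointers so the raw bytes are removed with the chain.
    raw_tombstones.update(r["raw_s3"] for r in doomed)
    keep_t = [r for r in transcripts if r["customer"] != customer]
    keep_v = [r for r in vectors if r["customer"] != customer]
    return keep_t, keep_v


transcripts = [
    {"customer": 88213, "raw_s3": "s3://raw/audio/88213/2026-06-03/call-1432.wav"},
    {"customer": 10001, "raw_s3": "s3://raw/audio/10001/2026-06-03/call-0900.wav"},
]
vectors = [
    {"customer": 88213, "raw_s3": "s3://raw/audio/88213/2026-06-03/call-1432.wav"},
]
tombstones = set()
transcripts, vectors = erase_customer(88213, transcripts, vectors, tombstones)
```

The point of the sketch is the order of operations: collect the raw keys from the derived rows before deleting them, because once the rows are gone the back-pointers are gone too.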
Step 2 — Your turn. Write the equivalent storage plan for screenshots (9k/day, 0.8 MB). Decide: is there a separate raw zone, what lifecycle, what derived artifacts (caption? embedding? both?), and what does the back-pointer point to? Then estimate the annual raw storage cost under a lifecycle policy vs all-Standard.
Step 3 — Reproduce from memory. Redraw the three-zone diagram from section 2 (raw zone → lakehouse → vector index, cold→hot), label what lives in each and its cost order, and write one sentence connecting the raw zone's immutability to chapter 02's replay log and to chapter 07's deletion requirement.
Operational memory
This chapter explained where the actual bytes go once the log holds only pointers, and why putting raw payloads and queryable artifacts in one store is the first cost-or-latency mistake. The important idea is the raw-vs-derived split: raw payloads are written once, immutable, and tiered cheaply by read probability, while derived artifacts are small, query-shaped, rebuildable, and each points back to its raw source — not "one big multimodal store."
You learned to land raw audio/images/video in plain tiered object storage, put derived transcripts and captions in a lakehouse table format (Iceberg by default in 2026) for atomic commits, schema evolution, compaction, and deletes, and keep embeddings hot in a vector index, with a raw_s3 back-pointer threading them together. That solves chapter 02's homeless-bytes problem and keeps the raw storage bill roughly 5–6× lower via lifecycle tiering, because raw audio dominates the bytes and aging it into archive moves real money.
Carry this diagnostic forward: when the copilot query slows, check derived small-file count and compaction before scaling compute; when the bill jumps, check storage class and retrieval charges before raw byte count. Raw is the negatives, derived are the prints — never lose the negatives, freely reprint.
Remember:
- Split by access pattern: tiered object storage for raw, lakehouse + vector index for derived.
- Raw is write-once, immutable, the replay source of truth; derived is rebuildable and carries a `raw_s3` back-pointer.
- A table format adds atomic commits, schema evolution, compaction, and deletes over Parquet on the same object store: metadata, not a new system.
- Tier raw by read probability; the Standard→Deep-Archive spread is ~23×, but retrieval fees are the silent killer — Intelligent-Tiering has none.
- Audio dominates the byte count and the bill — modality cost asymmetry shows up in storage as well as in freshness.
Bridge. We now have a durable home for raw bytes and a query-shaped home for derived artifacts. But the derived zone is empty until something fills it — and filling it is the expensive, slow, error-prone step the whole platform is built around. Transcribing audio, captioning images, embedding everything: these are model calls running inside the stream, and a model call is nothing like a SQL transform. It is slow, costly, occasionally wrong, and it forces the question chapter 02 deferred — when a record might be processed twice on retry, do we pay for the model call twice, and does the index get a duplicate? The next file runs ASR, vision, and embedding models in the stream and confronts exactly-once versus at-least-once when the unit of work is an expensive model call. → 04-streaming-transforms-and-embeddings.md