07. Tracing and erasing what flows through the pipe — lineage, quality, and PII across modalities

~25 min read. A user emails: "Delete everything you have on me." You open a ticket and freeze. Her name is a row in a transcripts table, a vector in the index, an embedding in a 30-day-old segment, and — the part that stops you cold — a number she read aloud on a call, now living inside an audio-derived transcript, and her face in a screenshot's image embedding. There is no DELETE FROM customer that reaches all of those. You built five fresh, cheap, cross-modal layers and never built the one query that finds everything about one person. This file is that query, and the lineage and quality machinery that makes it possible.

Built on the derived artifact, lineage across modalities, modality cost asymmetry, and the replay log named in 00-first-principles.md. Chapters 02–06 moved data fast and cheap through every layer; this file makes that flow traceable, validatable, and erasable — and confronts PII that hides differently in audio, images, and text.


What the whole pipe settled, and the question it never answered

By chapter 06 the platform is complete as a delivery system: the log absorbs bursts (02), storage tiers cheaply (03), models transform in-stream without duplicates (04), the index serves fresh cross-modal retrieval (05), and freshness spends only where it's worth it (06). Every chapter optimized how data moves. None of them answered how you account for what moved.

Three accounting questions were deferred at every step. Lineage: when the copilot grounds an answer on a retrieved chunk, can you trace that chunk back through its embedding, its transcript, and the raw audio it came from — to audit, debug, or explain it? Quality: the schema of incoming events will change under you (a new field, a renamed type), and a malformed or low-confidence transcript will flow downstream silently — can you catch it on the stream before it corrupts the index? PII and deletion: the data carries names, card numbers, and faces, and a user can demand erasure — can you find and delete every trace of one person across text, audio-derived text, and image embeddings?

These are not features you bolt on. They are properties the whole pipe either has or lacks, and they are far cheaper to design in than to retrofit after a regulator or an incident forces the issue.


What this file solves

A streaming multimodal pipeline that moves data fast still has to trace each retrieved artifact back to its raw source, validate streaming data as its schema evolves, and find-and-erase one person's data across modalities where PII hides in different forms — and none of this works if it's retrofitted. This file shows how to carry lineage metadata on every derived artifact so a chunk traces back to raw, how to enforce a schema contract and quality checks on the stream so bad data is quarantined not propagated, and how to handle PII and right-to-erasure across text, audio transcripts, and image embeddings where a card number spoken aloud and a face in a screenshot need different detection and the same deletion guarantee.


1) Why the back-pointer alone isn't lineage — the need to trace the full chain

Chapter 05 had every retrieved chunk carry a raw_s3 back-pointer — the result links to the original audio or image bytes. That feels like lineage. It is one link, not the chain.

When the copilot grounds an answer on txn:88213:90414 and a supervisor asks "why did it say that?", the back-pointer gets you to the raw call.wav. But it does not tell you: which ASR model version produced the transcript, whether that transcript was PII-masked, which embedding model and dimension turned it into a vector, which pipeline run did the work, or whether the transcript was later corrected. So when the answer is wrong, you cannot tell if it was a bad transcription, a stale embedding model (chapter 04's version skew), a masking step that dropped the relevant words, or a retrieval miss. One link gets you to bytes; it doesn't get you to what happened to those bytes.

So the real need is not "link to the source." It is trace the full transformation chain — raw → transcript → masked transcript → embedding → indexed vector → retrieved chunk — with the model version and pipeline run at each hop, so any wrong answer can be attributed to the exact step that caused it, and any artifact can be re-derived or deleted with its whole dependency tree.

BACK-POINTER ONLY              FULL LINEAGE CHAIN
 chunk ─▶ raw_s3                chunk ─▶ vector(model=embed-4, dim=1024, run=r88)
 (one link)                            ─▶ masked_transcript(pii_run=p12)
 "where are the bytes?"                ─▶ transcript(asr=whisper-v3, run=r88)
                                       ─▶ raw_s3 call.wav (immutable, ch.03)
 can't attribute a wrong answer        each hop: model version + pipeline run
 to a step                             → attribute, re-derive, or delete the whole tree

Why this rule exists. A derived artifact is the output of a chain of fallible, versioned transformations. The back-pointer captures the first input; lineage captures every transformation between input and output. Without the chain, a wrong or deletable artifact is a dead end — you can see what it is but not how it got that way or what else depends on it. Lineage is the chain that turns "this chunk exists" into "this chunk came from these inputs via these versioned steps," which is what auditing, debugging, re-deriving, and erasing all require.
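
A minimal sketch of that chain as a data structure, assuming each artifact keeps a pointer to the input it was derived from. The `Hop` class, its field names, and the `ingest` and `pii-mask` step labels are illustrative and not taken from any lineage library; the versions and run ids mirror the diagram above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hop:
    """One node in the transformation chain (field names are illustrative)."""
    artifact: str                   # e.g. "transcript", "vector"
    produced_by: str                # model or step version, e.g. "whisper-v3"
    run_id: str                     # pipeline run that did the work
    parent: Optional["Hop"] = None  # the input this hop was derived from


def walk_chain(hop: Hop) -> list[str]:
    """Walk from a retrieved artifact back to raw, one line per hop."""
    lines = []
    current: Optional[Hop] = hop
    while current is not None:
        lines.append(f"{current.artifact} <- {current.produced_by} ({current.run_id})")
        current = current.parent
    return lines


# The chain for the audio artifact, built from raw toward the indexed vector.
raw = Hop("raw_s3 call.wav", "ingest", "r88")
transcript = Hop("transcript", "whisper-v3", "r88", parent=raw)
masked = Hop("masked_transcript", "pii-mask", "p12", parent=transcript)
vector = Hop("vector", "embed-4", "r88", parent=masked)

for line in walk_chain(vector):
    print(line)
```

With only a back-pointer, `vector.parent` would jump straight to `raw` and the two middle hops (the ones most likely to explain a wrong answer) would be invisible.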


2) The core picture: the lineage graph and the quality gate on the stream

   THE STREAM                  QUALITY GATE              TRANSFORM (ch.04)            LINEAGE-TAGGED SINK
   raw events                  schema + checks                                        every artifact carries
                                                                                       its full provenance

  event ─▶ ┌─ schema contract ─┐    valid    ┌─ ASR / embed / mask ─┐    upsert   ┌──────────────────────┐
           │ shape? required?   │──────────▶ │  + tag each output    │──────────▶ │ vector + metadata:    │
           │ confidence ≥ τ?    │            │    with run + version │            │  raw_s3, asr_ver,     │
           └────────┬───────────┘            └───────────────────────┘            │  embed_ver, pii_run,  │
                    │ invalid / low-conf                                           │  customer_id, run_id  │
                    ▼                                                              └──────────┬───────────┘
              DEAD-LETTER / QUARANTINE                                                        │
              (don't propagate bad data)                                       LINEAGE GRAPH (OpenLineage / catalog)
                                                                               raw → transcript → masked → vector
                                                                               answers: trace, audit, DELETE-by-person

Two mechanisms in one picture. The quality gate sits at the front: an event that violates the schema contract or whose model confidence is below threshold is routed to a dead-letter/quarantine path instead of propagating into the index — bad data is stopped, not stored. The lineage tagging sits at the sink: every derived artifact is written with its full provenance (raw pointer, every model version, the PII-masking run, the customer id, the pipeline run id), and those tags feed a lineage graph (OpenLineage, a catalog) that can answer the three accounting questions — trace this chunk, audit this run, delete everything for this person.
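
The gate half of the picture can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the required fields, the `0.80` threshold, and the `Gate` class are hypothetical, and a real pipeline would route to a dead-letter topic rather than a Python list.

```python
from dataclasses import dataclass, field

# Assumed contract: field name -> expected type. Not taken from the source.
REQUIRED_FIELDS = {"event_id": str, "customer_id": str, "modality": str}
MIN_CONFIDENCE = 0.80  # assumed value for the threshold τ


@dataclass
class Gate:
    valid: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)

    def check(self, event: dict) -> list[str]:
        """Return the reasons an event fails the contract (empty list = passes)."""
        reasons = []
        for name, expected in REQUIRED_FIELDS.items():
            if event.get(name) is None:
                reasons.append(f"missing required field: {name}")
            elif not isinstance(event[name], expected):
                reasons.append(f"wrong type for {name}")
        confidence = event.get("confidence")
        if isinstance(confidence, (int, float)) and confidence < MIN_CONFIDENCE:
            reasons.append(f"confidence {confidence} below {MIN_CONFIDENCE}")
        return reasons

    def route(self, event: dict) -> bool:
        """Send the event onward if valid, otherwise quarantine it with reasons."""
        reasons = self.check(event)
        if reasons:
            self.dead_letter.append({"event": event, "reasons": reasons})
            return False
        self.valid.append(event)
        return True


gate = Gate()
gate.route({"event_id": "90414", "customer_id": "88213", "modality": "audio", "confidence": 0.93})
gate.route({"event_id": "90415", "customer_id": None, "modality": "audio", "confidence": 0.95})
gate.route({"event_id": "90416", "customer_id": "88213", "modality": "audio", "confidence": 0.41})
print(len(gate.valid), "valid,", len(gate.dead_letter), "quarantined")
```

Keeping the reasons alongside the quarantined event is a deliberate choice: the quarantine path is only useful if someone can tell a schema break from a confidence drop without replaying the event.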


3) The running example: tracing and erasing 88213 across modalities

Recall 88213's three artifacts. With lineage tags, each carries its chain:

id=chat:88213:90412   raw_s3=.../chat       embed_ver=embed-4  pii_run=p12  customer=88213  run=r88
id=img:88213:90413    raw_s3=.../shot.png   embed_ver=mm-3.5   pii_run=p12  customer=88213  run=r88
id=txn:88213:90414    raw_s3=.../call.wav   asr_ver=whisper-v3 embed_ver=embed-4 pii_run=p12 customer=88213 run=r88

Trace (debug a wrong answer). The copilot mis-stated the card status. Walk the chain for txn:88213:90414: ASR whisper-v3 transcribed "card declined," pii_run=p12 masked the card number, embed-4 vectorized it. Pull the masked transcript: it reads "card ending [REDACTED] declined." The masking removed the last four digits the copilot needed to confirm the card. Not a retrieval bug, not an embedding bug: a masking step that dropped a field the answer depended on. Lineage attributed the failure to one hop.

Erase (right-to-erasure). The user demands deletion. The lineage graph answers "what is everything for customer=88213": three vectors in the index (delete by metadata filter — chapter 05's filter is now a deletion key), three derived rows in the lakehouse tables, and three raw objects in the object store (chapter 03). One person, three modalities, one query over customer_id, executed across index + lakehouse + object store. Crucially, the audio trace is not just the raw .wav — it is the spoken card number now sitting inside the transcript text, which only pii_run=p12 knew to mask. Without the lineage tag recording that masking happened, you would delete the wav and the row and miss that the number also lived in a derived embedding.

The asymmetry the example exposes: PII hides in a different place per modality. In the chat, PII is in the text directly. In the audio, PII is in the transcript (spoken aloud), invisible in the raw waveform to a text scanner. In the screenshot, PII is in the pixels (a face, a card on screen), reachable only by OCR + vision, not by any text scan. Modality cost asymmetry returns as detection asymmetry: each modality needs a different detector, and erasure must hit all the derived forms, not just the raw.


4) Rule: every artifact carries its provenance, and every transform validates before it propagates

The chapter's invariant: a derived artifact is never written without its full lineage (source pointer, every model version, masking run, owner id, pipeline run), and a stream record never propagates downstream until it passes the schema contract and quality threshold — bad data is quarantined, not indexed. Provenance makes every artifact traceable, re-derivable, and erasable; the quality gate makes the index trustworthy by keeping malformed and low-confidence data out.

These two are linked. Provenance is what lets you delete (find everything for a person) and re-derive (replay through a corrected step). The quality gate is what stops the index from filling with garbage that no amount of lineage can fix after the fact. You enforce both on the stream, at write time, because retrofitting lineage onto artifacts that were written without it is impossible — you cannot reconstruct which model version produced a vector after the fact.

WRITE-TIME GUARANTEE                       WHAT IT BUYS LATER
 tag artifact with full provenance          trace a wrong answer to a hop (debug)
                                            find everything for a person (erase)
                                            replay through a fixed step (re-derive)
 validate before propagate (schema + conf)  index stays trustworthy
                                            bad data quarantined, not indexed
 enforce ON THE STREAM, at write            retrofit is impossible — version is lost
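
One way to make the write-time guarantee hard to bypass is to have the sink refuse any artifact whose provenance is incomplete. The sketch below assumes an in-memory dict as the sink; `write_artifact`, `MissingProvenance`, and the per-modality tag table are hypothetical names, while the tag names themselves follow the running example.

```python
REQUIRED_TAGS = ("raw_s3", "embed_ver", "pii_run", "customer_id", "run_id")
# Assumption: audio artifacts must also record which ASR version produced them.
EXTRA_TAGS_BY_MODALITY = {"audio": ("asr_ver",)}


class MissingProvenance(ValueError):
    """Raised instead of writing an artifact that could not be traced or erased."""


def write_artifact(sink: dict, artifact_id: str, modality: str,
                   vector: list, tags: dict) -> None:
    """Write the vector only if every required provenance tag is present."""
    required = REQUIRED_TAGS + EXTRA_TAGS_BY_MODALITY.get(modality, ())
    missing = [name for name in required if not tags.get(name)]
    if missing:
        raise MissingProvenance(f"{artifact_id}: missing tags {missing}")
    sink[artifact_id] = {"vector": vector, "modality": modality, **tags}


sink: dict = {}
full_tags = {
    "raw_s3": ".../call.wav", "asr_ver": "whisper-v3", "embed_ver": "embed-4",
    "pii_run": "p12", "customer_id": "88213", "run_id": "r88",
}
write_artifact(sink, "txn:88213:90414", "audio", [0.1, 0.2], full_tags)

# Same artifact shape but without an owner id: this would be an un-erasable orphan.
orphan_tags = {k: v for k, v in full_tags.items() if k != "customer_id"}
try:
    write_artifact(sink, "txn:88213:90415", "audio", [0.3, 0.4], orphan_tags)
except MissingProvenance as err:
    print("rejected:", err)
```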

Teacher voice. The two questions a regulator and an on-call engineer ask are the same question pointed in opposite directions: the engineer asks "where did this artifact come from?" and the regulator asks "where did this person's data go?" Both are answered by the same lineage graph, traversed forward or backward. If you only built the back-pointer, you can answer neither fully — you know the raw source but not the chain of transformations, the model versions, or the other artifacts derived from the same person. Build the chain at write time, because the model version that produced a vector cannot be recovered once the vector is written without it.


5) Schema evolution: schema-on-read tolerance vs schema-on-write contract

The incoming events will change shape — a new field, a renamed type, a producer that ships a malformed payload. How the pipe absorbs that change without breaking downstream is the schema-evolution decision.

Attempt A — schema-on-read, tolerate anything

Accept whatever arrives; let each consumer parse what it needs and ignore the rest.

Helps: producers move fast; no central gatekeeper; a new field flows through without breaking old consumers.

Hurts: a malformed payload (missing required field, wrong type) flows all the way into the transform before failing, or worse, succeeds and writes a garbage artifact. The break surfaces deep in the pipe — a poison record (chapter 04) stalls a partition, or a null where a customer id should be writes an un-erasable orphan. You learn about the schema change from a downstream incident, not at the edge.

Attempt B — schema-on-write contract with compatible evolution

Enforce a schema at ingest (a registry: Confluent Schema Registry, or the table format's schema). Producers must register changes; the registry enforces the compatibility policy you configure. Adding an optional field (one with a default) is compatible and allowed; removing a required field or changing a type incompatibly is breaking under a forward or full policy and is rejected at the edge.

Helps: breaking changes are caught at the producer, not in a downstream incident; consumers get a guaranteed shape; Iceberg/Delta evolve the table schema by metadata (add/rename/reorder columns) without rewriting data, so derived tables absorb compatible changes cheaply.

Hurts: a registry is a coordination point producers must respect; an over-strict contract slows legitimate evolution; you must define compatibility policy (backward, forward, full).

So the real choice is where the schema break surfaces — at the edge (write contract) or deep in the pipe (read tolerance). For a pipeline whose output (the index) must stay trustworthy and whose artifacts must stay erasable, surfacing breaks at the edge is worth the coordination cost: a malformed event that reaches the index is far more expensive than one rejected at ingest. Use schema-on-write with compatible evolution; reserve schema-on-read tolerance for genuinely exploratory raw zones.
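
To make "caught at the edge" concrete, here is a toy compatibility check that a producer's proposed schema would have to pass before it is registered. It is a simplified policy written for illustration, not the algorithm any real registry uses; `Field`, `Schema`, and `breaking_changes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    required: bool = True


Schema = dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List the changes this toy policy treats as breaking (empty = compatible)."""
    problems = []
    for name, old_field in old.items():
        if name not in new:
            if old_field.required:
                problems.append(f"removed required field: {name}")
        elif new[name].type != old_field.type:
            problems.append(
                f"type changed for {name}: {old_field.type} -> {new[name].type}"
            )
    for name, new_field in new.items():
        if name not in old and new_field.required:
            problems.append(f"added required field: {name}")
    return problems


current = {"event_id": Field("string"), "customer_id": Field("string")}

# Compatible: a new optional field flows through.
proposed_ok = {**current, "channel": Field("string", required=False)}
print(breaking_changes(current, proposed_ok))

# Breaking: customer_id dropped and event_id retyped; rejected before any event ships.
proposed_bad = {"event_id": Field("long")}
print(breaking_changes(current, proposed_bad))
```

Note what the second case would have caused under schema-on-read: events without a `customer_id` would reach the sink and become exactly the un-erasable orphans described above.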

Mini-FAQ. "Iceberg supports schema evolution — doesn't that solve it?" It solves the storage half: Iceberg/Delta evolve the table schema (add, drop, rename, reorder columns) by metadata without rewriting files, so a compatible change is cheap and old snapshots stay readable. It does not validate the incoming stream — a malformed event still needs a registry/contract at ingest to be rejected before it's written. Table-format schema evolution + an ingest schema contract are complementary: one keeps stored tables evolvable, the other keeps bad events out.


6) The property that changes the design: PII detection is per-modality, deletion is uniform

The dimension that reshapes governance is that PII detection differs by modality but the deletion guarantee must not. Each modality hides PII in a different form and needs a different detector; but a deletion request must reach every form with the same certainty.

MODALITY    WHERE PII HIDES              DETECTOR NEEDED                DELETION REACHES
 text/chat   in the text directly         NER / pattern match            row + vector by customer_id
 audio       in the TRANSCRIPT (spoken)   ASR → then NER on transcript   raw wav + transcript + embedding
 image       in PIXELS (face, card)       OCR + vision / NER on OCR      raw png + image embedding + caption

Detection asymmetry: a text scanner finds the card number in the chat instantly, is blind to the same number spoken in the audio until ASR transcribes it, and is blind to a card photographed in a screenshot until OCR reads the pixels. Redaction must happen after the modality is converted to a detectable form — mask the transcript after ASR, redact the image after OCR — and the masking must be recorded in lineage (pii_run) so deletion knows it happened.
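
The "convert first, then detect" order can be sketched as a dispatch over modality. Everything here is illustrative: the regex is a crude stand-in for a real NER or pattern detector, the converter functions are placeholders for real ASR and OCR calls (they return canned text so the example runs), and it assumes the ASR output renders spoken digits as digits.

```python
import re
from typing import Callable

# Crude card-number pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask_text(text: str) -> tuple[str, int]:
    """Replace card-like numbers; return the masked text and the number of hits."""
    return CARD_PATTERN.subn("[REDACTED]", text)


def detect_and_mask(modality: str, payload: bytes,
                    converters: dict[str, Callable[[bytes], str]],
                    pii_run: str) -> dict:
    """Convert the payload to text with the modality's converter, then mask."""
    if modality not in converters:
        raise ValueError(f"no detector path for modality: {modality}")
    text = converters[modality](payload)
    masked, hits = mask_text(text)
    # Record the masking run so lineage knows detection happened on this artifact.
    return {"masked_text": masked, "pii_hits": hits, "pii_run": pii_run}


# Placeholder converters: real ones would call an ASR model and an OCR engine.
converters = {
    "chat": lambda payload: payload.decode("utf-8"),
    "audio": lambda payload: "my card number is 4111 1111 1111 1111",
    "image": lambda payload: "VISA 4111-1111-1111-1111",
}

chat_result = detect_and_mask("chat", b"card 4111111111111111 was declined", converters, "p12")
audio_result = detect_and_mask("audio", b"<wav bytes>", converters, "p12")
image_result = detect_and_mask("image", b"<png bytes>", converters, "p12")
print(chat_result["masked_text"])
print(audio_result["masked_text"])
print(image_result["masked_text"])
```

The same `mask_text` runs on all three, but only after a modality-specific conversion; run it on the raw wav or png bytes and it finds nothing.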

Deletion uniformity: regardless of modality, "delete everything for customer=88213" must hit the raw object, every derived artifact, and every index vector — found by the customer_id that lineage stamped on each. Iceberg/Delta row-level deletes (delete files / deletion vectors) erase the lakehouse rows by predicate; the vector index deletes by metadata filter (chapter 05); the object store deletes the raw. The lineage graph is what guarantees the deletion query reaches all three across all modalities.

The pressure evolution: per-modality detection relieves the risk of un-masked PII (each form gets its right detector) but creates detection cost and latency — every audio must be transcribed and every image OCR'd before PII can even be found, and the audio's detection inherits ASR's minutes-long lag (chapter 01). The detection step absorbs the cost; the deletion guarantee stays uniform.

Teacher voice. The deletion failure that gets companies fined is not forgetting to delete the obvious copy — it's forgetting the derived copy in a modality you didn't think to scan. A card number the customer typed is easy. The same number she read aloud lives in a transcript and an embedding; her face lives in an image vector. If your deletion only hits raw storage and the text table, you've left PII in the audio-derived and image-derived artifacts. The rule: delete by the owner id that lineage stamped on every artifact regardless of modality, and verify the index and lakehouse and object store all came back empty for that id.
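
A sketch of that rule, with each store modeled as a plain dict keyed by artifact id. In a real system the three deletes would be a metadata-filtered index delete, a row-level lakehouse delete, and object-store deletes; `erase_customer`, the record layout, and customer `77001` are hypothetical.

```python
def erase_customer(customer_id: str, stores: dict[str, dict]) -> dict:
    """Delete every record stamped with customer_id, then re-query each store."""
    report = {}
    for name, store in stores.items():
        doomed = [key for key, rec in store.items()
                  if rec.get("customer_id") == customer_id]
        for key in doomed:
            del store[key]
        # Verification pass: a separate query that must come back empty.
        remaining = [key for key, rec in store.items()
                     if rec.get("customer_id") == customer_id]
        report[name] = {"deleted": len(doomed), "remaining": len(remaining)}
    return report


stores = {
    "index": {
        "chat:88213:90412": {"customer_id": "88213"},
        "img:88213:90413": {"customer_id": "88213"},
        "txn:88213:90414": {"customer_id": "88213"},
        "chat:77001:10001": {"customer_id": "77001"},  # another customer, must survive
    },
    "lakehouse": {
        "row-chat": {"customer_id": "88213"},
        "row-caption": {"customer_id": "88213"},
        "row-transcript": {"customer_id": "88213"},
    },
    "object_store": {
        ".../chat": {"customer_id": "88213"},
        ".../shot.png": {"customer_id": "88213"},
        ".../call.wav": {"customer_id": "88213"},
    },
}

report = erase_customer("88213", stores)
erased_everywhere = all(r["remaining"] == 0 for r in report.values())
print(report, erased_everywhere)
```

The sketch only finds what was stamped: a record written without `customer_id` would be skipped silently, which is why the tag has to be enforced at write time.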


7) Cost and behavior table: governance choices under this workload

Order-of-magnitude for the running platform. Verify against your stack and jurisdiction.

Choice | What it buys | What it costs | When to use
No lineage, back-pointer only | link to raw bytes | can't attribute wrong answers, can't fully erase | never for regulated/PII data
Lineage tags on every artifact (+ OpenLineage/catalog) | trace, audit, erase-by-person, re-derive | metadata write per artifact, a lineage store | default: provenance is cheap, retrofit is impossible
Schema-on-read tolerance | producer speed | breaks surface deep, garbage in index | exploratory raw zones only
Schema-on-write contract + compatible evolution | breaks caught at edge, index trustworthy | a registry, compatibility policy | default for the indexed/served path
Per-modality PII detect + redact, recorded in lineage | catch PII in text, audio, image | a detector per modality, ASR/OCR latency before detection | mandatory for PII across modalities
Delete raw + text only | feels done | leaves PII in audio/image-derived artifacts | the fine-generating trap

The honest defaults: lineage tags on every artifact (the write is cheap; the retrofit is impossible), a schema-on-write contract for the served path, per-modality detection that redacts after conversion and records the masking run, and deletion that reaches every modality's derived forms by the lineage-stamped owner id. The last row is the trap that looks done and isn't.

Concrete: PII detection adds cost per item. A text NER pass is cheap (cents per thousand), but audio requires ASR first (~$0.046/call from chapter 04) and images require OCR + vision before detection can even run. Over 12k calls + uploads/day, that detection cost is real but small against the fine for missing a derived copy. The lineage metadata is a few hundred bytes per artifact: ~7.6M artifacts/year × ~300 B ≈ 2.3 GB/year, negligible against the auditability it buys.


8) Operational signals: watching governance and quality

  • Healthy: dead-letter/quarantine rate low and stable (a small steady trickle of genuinely malformed events); schema-registry rejections near zero except when a producer ships a real breaking change; PII-detector coverage at 100% (every artifact has a pii_run tag); deletion requests complete across all three stores within SLA, with a verification pass returning empty.
  • First metric to degrade: quarantine/dead-letter rate. A producer ships a schema change that's subtly incompatible, or a model's confidence drops, and the quarantine rate climbs — the leading indicator that bad data would be entering the index if the gate weren't catching it. Rising quarantine means a schema or model problem upstream, before any downstream answer degrades.
  • Misleading metric people watch: index size or throughput. They look healthy while a deletion silently failed to reach the audio-derived embedding, or while artifacts are being written without a pii_run tag — the data is flowing, the counts are fine, and PII is quietly un-erasable. Throughput counts artifacts, not whether each is governed.
  • First graph an expert opens: lineage-tag coverage and deletion-verification results over time, overlaid with quarantine rate. They look for artifacts missing provenance tags (un-erasable orphans being created), deletions that didn't return empty across all three stores (incomplete erasure), and quarantine spikes (schema/quality regression upstream).
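
Two of these signals are simple ratios, sketched below under the assumption that you can sample recently written artifacts and count gate outcomes per window. The function names and the sample records are hypothetical.

```python
REQUIRED_TAGS = ("raw_s3", "embed_ver", "pii_run", "customer_id", "run_id")


def tag_coverage(artifacts: list[dict]) -> float:
    """Fraction of artifacts carrying every required provenance tag."""
    if not artifacts:
        return 1.0
    fully_tagged = sum(
        1 for artifact in artifacts
        if all(artifact.get(tag) for tag in REQUIRED_TAGS)
    )
    return fully_tagged / len(artifacts)


def quarantine_rate(valid_count: int, quarantined_count: int) -> float:
    """Share of gate decisions in a window that ended in quarantine."""
    total = valid_count + quarantined_count
    return quarantined_count / total if total else 0.0


tagged = {"raw_s3": ".../call.wav", "embed_ver": "embed-4", "pii_run": "p12",
          "customer_id": "88213", "run_id": "r88"}
sample = [
    dict(tagged),
    dict(tagged),
    {**tagged, "pii_run": None},  # written without a masking run: a coverage gap
]
coverage = tag_coverage(sample)
rate = quarantine_rate(valid_count=990, quarantined_count=10)
print(f"coverage={coverage:.2%} quarantine={rate:.2%}")
```

Coverage below 100% means orphans are being created right now, regardless of how healthy throughput looks.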

9) Boundary: where this governance design fits, and where it's overkill

  • Strong fit: a multimodal pipeline carrying PII that must be auditable and erasable — exactly this platform under GDPR/CCPA. Lineage on every artifact, a schema contract on the served path, and per-modality detection earn their keep because the cost of an un-erasable derived copy is a fine and a breach.
  • Pathological: full lineage tagging, per-column quality checks, and per-modality PII detection on a throwaway internal stream with no PII and no audit requirement. The metadata write, the registry coordination, and the detection latency are pure overhead when nothing downstream needs to trace, validate, or erase.
  • Scale/workload limit that breaks intuition: the intuition "we'll add governance later when we need it" breaks because lineage cannot be retrofitted — once a vector is written without its model version and owner id, that provenance is gone, and reconstructing which of 7.6M artifacts belongs to one person without a stamped customer_id is intractable. Governance is the one part of the pipe that is dramatically cheaper to build in at write time than to add after an incident forces it.

10) Wrong model to drop: "deleting the raw data deletes the person"

The seductive idea is that erasing the raw audio, image, and chat satisfies a deletion request — the source is gone. It feels complete. The correct model: PII propagates into derived artifacts that outlive the raw, and deletion must reach every derived form, per modality. A spoken card number lives in the transcript and its embedding; a face lives in the image embedding and a caption; deleting the raw .wav leaves the number in a vector the copilot can still retrieve. Deletion is a graph traversal over the lineage tree by owner id, hitting raw + every derived row + every index vector across text, audio, and image — verified empty in all three stores. Deleting only the raw is the trap that looks done, passes a casual check, and leaves retrievable PII behind.
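
The graph-traversal framing can be sketched as a breadth-first walk from the raw objects a person owns down through everything derived from them. The edge map, node names, and `erasure_set` are hypothetical; a real lineage store would supply the edges.

```python
from collections import deque


def erasure_set(derived_from: dict[str, list[str]],
                owner_of_raw: dict[str, str],
                customer_id: str) -> set[str]:
    """Collect every raw object owned by customer_id plus all its descendants."""
    roots = [node for node, owner in owner_of_raw.items() if owner == customer_id]
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in derived_from.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# parent -> artifacts derived from it
derived_from = {
    "call.wav": ["transcript"],
    "transcript": ["masked_transcript"],
    "masked_transcript": ["txn_vector"],
    "shot.png": ["image_vector", "caption"],
    "chat": ["chat_vector"],
    "other.wav": ["other_transcript"],
}
owner_of_raw = {"call.wav": "88213", "shot.png": "88213", "chat": "88213",
                "other.wav": "77001"}

to_delete = erasure_set(derived_from, owner_of_raw, "88213")
print(sorted(to_delete))
```

Deleting only the roots of this walk is the "delete the raw" model; the descendants are where the retrievable PII survives.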


11) Other governance and quality failure shapes

  • Un-erasable orphan — an artifact written without a customer_id/lineage tag; deletion-by-person can't find it; PII stranded.
  • Derived-copy miss — deletion hits raw + text but not the audio-derived transcript embedding or the image embedding; PII survives in a modality you forgot.
  • Silent quality propagation — no quality gate; a low-confidence transcript or null-field event flows into the index and the copilot grounds on garbage.
  • Schema break deep in pipe — schema-on-read tolerance; a malformed payload poisons a partition (chapter 04) or writes a typed-wrong row, surfacing as a downstream incident.
  • Lineage drift on re-derive — a model upgrade (chapter 04 skew) re-derives artifacts but doesn't update lineage tags; the chain now lies about which version produced what.
  • Masking-after-index — PII detected and masked in the transcript but the embedding was already computed from the unmasked text and indexed; the vector still encodes the PII.
  • Tombstone-not-purged — index marks a deleted vector as a tombstone but compaction lag (chapter 05) means it's still searchable; "deleted" data returns.
  • Cross-customer leak in results — retrieval without the customer filter (chapter 05) surfaces another person's PII; a governance failure at the read layer.

12) Pattern transfer

  • Lineage = the freshness gap's audit twin — chapters 01–06 traced how fast data moves; lineage traces where it moved and what touched it. Same pipe, the other axis: latency vs provenance, both properties of the same flow.
  • Schema contract = idempotent write's cousin (chapter 04) — a write contract at ingest enforces a shape the way an idempotent key enforces uniqueness; both are write-time guarantees that keep the downstream store trustworthy instead of cleaning up after the fact.
  • Delete-by-customer-id = the metadata filter inverted (chapter 05) — the same customer_id filter that scopes retrieval for correctness is the key that erases a person's data; the filter that keeps reads correct is the filter that makes deletion complete.
  • Per-modality detection = modality cost asymmetry again (chapter 04) — PII detection inherits the same per-modality cost and latency as the transforms: text is cheap and instant, audio needs ASR first, images need OCR; detection asymmetry mirrors transform asymmetry.

13) Design test

  1. Does every derived artifact carry full provenance (raw pointer, all model versions, masking run, owner id, pipeline run), written at the sink — not reconstructed later?
  2. Does a stream record pass a schema contract and quality threshold before it propagates, with malformed/low-confidence data quarantined to a dead-letter path?
  3. Is PII detected per modality (text directly, audio after ASR, image after OCR) and the masking recorded in lineage so deletion knows it happened?
  4. Does "delete everything for this person" reach raw + every derived row + every index vector across all modalities, verified empty in all three stores?
  5. Is the schema break designed to surface at the edge (write contract) rather than deep in the pipe, and can compatible evolution proceed without rewriting data (Iceberg/Delta)?

Where this appears in production

Lineage, schema, and quality on streams:

  • OpenLineage: open standard for end-to-end lineage; captures dataset- and column-level metadata and quality metrics (row count, null count, distinct count) from Spark/Iceberg scans and commits.
  • Apache Iceberg / Delta / Hudi: snapshot history, schema-evolution tracking, and catalog audit logs provide lineage; row-level deletes / deletion vectors erase rows by predicate without full rewrite.
  • Confluent Schema Registry: enforces backward/forward/full compatibility on stream events at the edge, rejecting breaking changes before they propagate.
  • Debezium OpenLineage SMT: emits standardized lineage from CDC streams so source-to-sink flow is traceable.
  • Apache Atlas: metadata and lineage catalog tying datasets, transforms, and classifications together.
  • Great Expectations / Soda / Monte Carlo: data-quality checks and observability on pipelines, quarantining or alerting on bad data.
  • dbt tests + exposures: schema and quality assertions plus lineage from source to consumed model.

PII detection and erasure across modalities:

  • AssemblyAI / Deepgram PII redaction: detect and redact PII in speech-to-text transcripts before storage (the audio-after-ASR path).
  • AWS Macie / Comprehend / Microsoft Presidio: NER-based PII detection and redaction for text and structured data.
  • Google DLP / Cloud Healthcare de-identification: detect and redact PII/PHI across text and, with OCR, documents and images.
  • Snowflake Cortex PII redaction: detect and redact PII inline in the warehouse.
  • Gladia / Hamming: voice-agent transcript PII redaction for compliance-by-design.
  • OCR + NER pipelines (image PII): detect PII in screenshots/recordings by reading pixels then applying entity recognition (the image-after-OCR path).
  • GDPR / CCPA right-to-erasure programs: enterprise deletion workflows that must reach every derived copy by subject id across stores.
  • Healthcare / fintech call platforms: HIPAA/PCI-driven transcript and recording redaction with audit lineage on every artifact.


Pause and recall

  1. Why is the raw_s3 back-pointer one link and not lineage, and what does the full chain add?
  2. State the chapter's invariant. Why must provenance be written at the sink and not reconstructed later?
  3. Where does a schema break surface under schema-on-read vs schema-on-write, and which does the served path want?
  4. Does Iceberg's schema evolution remove the need for an ingest schema contract? Why or why not?
  5. Why is PII detection per-modality but deletion uniform — where does PII hide in audio vs image?
  6. Why does deleting only the raw data leave PII behind, and what is the correct deletion shape?
  7. Which metric is the leading indicator that bad data would be entering the index, and which comforting metric hides incomplete erasure?
  8. Why can't governance be retrofitted, and what specifically is lost if a vector is written without its provenance?

Interview Q&A

Q1. A user invokes right-to-erasure. Walk through deleting everything you have on them across modalities. A. Query the lineage graph for everything tagged customer_id=them: raw objects (chat, wav, png) in the object store, derived rows (transcript, caption) in the lakehouse, and vectors in the index — across text, audio-derived, and image. Delete raw from the object store, row-level-delete the lakehouse rows (Iceberg deletion vectors), and delete index vectors by the metadata filter. The critical part is the derived copies: the spoken card number in the transcript embedding and the face in the image embedding, which the raw deletion misses. Verify all three stores return empty for that id. Common wrong answer to avoid: "Delete the raw files." That leaves the number in the transcript and its embedding, and the face in the image vector — retrievable PII the copilot can still surface.

Q2. The copilot gave a wrong answer about a card. How does lineage help you debug it? A. Walk the chunk's lineage chain backward: indexed vector → embedding model + version → masked transcript + masking run → raw ASR transcript + ASR version → raw wav. Inspect each hop. If the masked transcript reads "card ending [REDACTED]," the PII-masking step dropped the digits the answer needed — a masking bug, not a retrieval or embedding bug. Lineage attributes the failure to the exact hop instead of guessing across the whole pipe. Common wrong answer to avoid: "Re-run the query / tune retrieval." Without the chain you can't tell a masking drop from an embedding-version skew from a retrieval miss; lineage is what localizes it.

Q3. Schema-on-read or schema-on-write for the events feeding the index — defend a choice. A. Schema-on-write with a registry and compatible-evolution rules for the served path. A malformed event that reaches the index is far more expensive than one rejected at ingest — it poisons a partition or writes an un-erasable orphan, surfacing as a downstream incident. The contract catches breaking changes at the producer; compatible changes (add optional field) flow through, and Iceberg/Delta evolve the table schema by metadata without rewriting data. Reserve schema-on-read tolerance for exploratory raw zones. Common wrong answer to avoid: "Schema-on-read is more flexible." Flexibility at ingest pushes the break deep into the pipe where it's an incident, not an edge rejection — wrong trade for a served, erasable index.

Q4. Why is PII detection harder for audio and images than for text? A. PII hides in a different form per modality. In text it's in the words — a NER pass finds it. In audio it's spoken, so a text scanner is blind until ASR transcribes it (inheriting ASR's minutes of lag). In images it's in pixels — a face or a card on screen — reachable only by OCR + vision. So each modality needs its own detector, redaction happens after conversion to a detectable form, and the masking must be recorded in lineage so deletion knows it happened. Detection is per-modality; the deletion guarantee stays uniform. Common wrong answer to avoid: "Run the same PII scanner on everything." A text scanner can't see a spoken number or a photographed card; you need ASR-then-NER and OCR-then-NER, not one scanner.

Q5. Where do you enforce lineage and quality, and why not add governance later? A. At write time, on the stream: tag every artifact with full provenance and validate before propagating. Governance cannot be retrofitted: once a vector is written without its model version and owner id, that provenance is gone, and finding one person's data among millions of untagged artifacts is intractable. Quality must gate at ingest so bad data never enters the index. Lineage is the one part of the pipe that is dramatically cheaper to build in than to bolt on after a regulator or incident forces it. Common wrong answer to avoid: "Add lineage when compliance asks." The model version and owner id that make an artifact traceable and erasable are known only at write time; if they aren't recorded then, you can't reconstruct them, so 'later' means an intractable backfill.
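A write-time gate that refuses untagged or low-quality artifacts might look like the sketch below. It assumes the audio tag set from the exercise's modeled example; the `write_or_quarantine` function, the in-memory `sink` and `dead_letter` lists, and the `0.8` threshold (standing in for τ) are hypothetical.

```python
# Tag names follow the modeled audio example; other modalities would differ.
REQUIRED_TAGS = ("raw_s3", "asr_ver", "embed_ver", "pii_run", "customer_id", "run_id")


def write_or_quarantine(artifact, sink, dead_letter, min_confidence=0.8):
    """Append to the sink only if provenance is complete and quality passes."""
    tags = artifact.get("tags", {})
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    if missing:
        dead_letter.append((artifact, "missing tags: " + ", ".join(missing)))
        return False
    if artifact.get("asr_confidence", 0.0) < min_confidence or not artifact.get("transcript"):
        dead_letter.append((artifact, "failed quality gate"))
        return False
    sink.append(artifact)
    return True


full_tags = {
    "raw_s3": ".../call.wav", "asr_ver": "whisper-v3", "embed_ver": "embed-4",
    "pii_run": "p12", "customer_id": 88213, "run_id": "r88",
}
good = {"tags": full_tags, "asr_confidence": 0.93, "transcript": "hello"}
untagged = {
    "tags": {k: v for k, v in full_tags.items() if k != "customer_id"},
    "asr_confidence": 0.93,
    "transcript": "hello",
}
noisy = {"tags": full_tags, "asr_confidence": 0.2, "transcript": "h?llo"}

sink, dead_letter = [], []
results = [write_or_quarantine(a, sink, dead_letter) for a in (good, untagged, noisy)]
```

The untagged artifact is the important case: rejecting it here is cheap, while finding it after it has been indexed without an owner id is the intractable backfill the answer warns about.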

Q6. (Cumulative) You deleted a customer but the copilot still retrieves their complaint. Is this a chapter-05 indexing issue, a chapter-07 governance issue, or both? A. It can be either, and lineage tells you which. If the vector was never tagged with customer_id, deletion-by-person couldn't find it: a chapter-07 missing-lineage failure. If it was tagged and deleted but still returns, the index marked it with a tombstone and compaction lag (chapter 05) keeps it searchable: an indexing purge-lag issue. Check whether the artifact has a provenance tag (governance) and whether the delete was compacted out (indexing); the fix differs. Common wrong answer to avoid: "Just delete it again." If it's an untagged orphan, re-deleting can't find it either; and if it's tombstone lag, the issue is compaction, not the delete. Diagnose with lineage first.
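The two-way check in Q6 reduces to a short decision function. This is a sketch with hypothetical inputs (a tag dict and a `tombstoned` flag read from the index); the third branch, a tagged vector with no tombstone, is an added assumption covering the case where the delete never reached the index.

```python
def diagnose(tags: dict, tombstoned: bool) -> str:
    """Classify why a deleted customer's vector is still retrievable."""
    if not tags.get("customer_id"):
        # Delete-by-person filters on customer_id, so it never matched this vector.
        return "governance: untagged orphan"
    if tombstoned:
        # The delete landed, but the vector is searchable until compaction runs.
        return "indexing: tombstoned, awaiting compaction"
    # Assumption: tagged and not tombstoned means the delete was never applied.
    return "delete not applied to index"


orphan = diagnose({"embed_ver": "embed-4"}, tombstoned=False)
lagging = diagnose({"customer_id": 88213}, tombstoned=True)
missed = diagnose({"customer_id": 88213}, tombstoned=False)
```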


Design/debug exercise (10 min)

Step 1 — Modeled example. Provenance and deletion plan for the audio modality:

Artifact:    txn:88213:90414 (transcript embedding)
Lineage tags: raw_s3=.../call.wav, asr_ver=whisper-v3, embed_ver=embed-4,
              pii_run=p12, customer_id=88213, run_id=r88
PII detect:  ASR transcript → NER → mask card/SSN → record pii_run=p12
Quality gate: reject if ASR confidence < τ or transcript empty → dead-letter
Delete-by-person(88213):
   object store: delete call.wav
   lakehouse:    row-level delete transcript row (Iceberg deletion vector)
   index:        delete vector by filter customer_id=88213
   verify:       all three return empty for 88213
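The delete-then-verify shape of this plan can be sketched against in-memory stand-ins. This is not how a real object store, lakehouse, or vector index is called: each store is modeled as a list of tagged dicts, `delete_by_person` is a hypothetical helper, and customer `77001` is invented to show that other people's data is untouched.

```python
def delete_by_person(customer_id, stores):
    """Remove every artifact tagged with customer_id, then verify per store.

    `stores` maps a store name to a list of tagged artifacts (dicts).
    Returns the number of matching artifacts left in each store.
    """
    for artifacts in stores.values():
        # Filter in place so callers holding the list see the deletion.
        artifacts[:] = [a for a in artifacts if a.get("customer_id") != customer_id]
    return {
        name: sum(1 for a in artifacts if a.get("customer_id") == customer_id)
        for name, artifacts in stores.items()
    }


stores = {
    "object_store": [{"key": ".../call.wav", "customer_id": 88213}],
    "lakehouse": [
        {"row": "transcript", "customer_id": 88213},
        {"row": "transcript", "customer_id": 77001},
    ],
    "index": [{"id": "txn:88213:90414", "customer_id": 88213}],
}

remaining = delete_by_person(88213, stores)
```

Note that an artifact written without a customer_id tag is invisible to both the delete and the verification, so a clean verify result is only as trustworthy as the tagging at write time.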

Step 2 — Your turn. Write the provenance + deletion plan for the image modality (a screenshot with a face and a card visible on screen). Decide: which detector finds the PII (and what must run before it), what lineage tags the image embedding carries, where PII could survive if you only deleted the raw png, and how you verify erasure across all three stores. (Hint: the caption is a derived text copy too.)

Step 3 — Reproduce from memory. Redraw the section-2 diagram (stream → quality gate → transform with lineage tagging → lineage-tagged sink + lineage graph), label where bad data is quarantined and where deletion-by-person traverses, and write one sentence connecting the customer_id deletion key to chapter 05's metadata filter and one connecting per-modality detection to chapter 04's modality cost asymmetry.


Operational memory

This chapter explained why a pipeline that moves data fast still fails the two questions a regulator and an on-call engineer ask — "where did this person's data go?" and "where did this artifact come from?" — unless it carries lineage on every artifact and validates quality on the stream. The important idea is that a derived artifact is the output of a chain of versioned, fallible transforms, and only the full provenance chain (not a single back-pointer) lets you trace, audit, re-derive, or erase it; and that PII hides in a different form per modality, so detection is per-modality while the deletion guarantee must be uniform.

You learned to tag every artifact at the sink with its source pointer, model versions, masking run, owner id, and pipeline run; to gate the stream with a schema contract and confidence threshold so malformed or low-confidence data is quarantined, not indexed; to detect PII after each modality is converted to a detectable form (text directly, audio after ASR, image after OCR + vision) and record the masking; and to delete by the lineage-stamped owner id across raw objects, derived rows, and index vectors in all modalities, verified empty. That makes the fast pipe also a traceable, trustworthy, erasable one.

Carry this diagnostic forward: when a deletion leaves data behind, suspect a derived copy in a modality you didn't scan or an untagged orphan; when the copilot grounds on garbage, check the quality gate; when an answer is wrong, walk the lineage chain to the exact hop. Governance cannot be retrofitted: a model version and owner id that were not recorded at write time cannot be reconstructed, so build the chain at write time, before a regulator or incident forces an intractable backfill.

Remember:

  • A back-pointer is one link; lineage is the full chain (raw → transcript → masked → embedding → vector) with model version and run at each hop.
  • Tag provenance at write and validate before propagating; governance cannot be retrofitted once the version and owner id have gone unrecorded.
  • PII hides per modality (text in words, audio in speech, image in pixels); detection is per-modality, deletion is uniform.
  • Deleting raw isn't deletion — reach every derived copy across all modalities by the lineage-stamped owner id, verified empty in all stores.
  • Surface schema breaks at the edge with a write contract; Iceberg/Delta evolve table schema by metadata but don't validate the incoming stream.

Bridge. We can now trace, validate, and erase everything that flows through the pipe — the platform is fast, cheap, fresh, cross-modal, and governed. But every mechanism we built carries a contested edge: people argue about exactly-once vs idempotency, about whether kappa really killed lambda, about how fresh is actually fresh enough, and about when this entire streaming apparatus is just over-engineering a problem a nightly batch job would have solved. So the question shifts from "how do we build each layer?" to "where do these layers hit their honest limits, where does the industry disagree, and where is streaming the wrong tool entirely?" The final file reviews the boundaries, the open problems, and the tradeoffs no clean architecture resolves. → 08-boundary-tradeoff-review.md