07. Lineage, Catalog, and Governance
⏱️ Estimated time: 24 min | Level: advanced
ELI5 callback: In the car factory, the loading dock sets arrival rhythm, the conveyor belt sets work rhythm, the showroom exposes finished output, the reject bin protects trust, and the manifest explains every move. This file teaches how the factory proves origin, ownership, and policy.
Metadata answers incident questions fast
See. Governance sounds heavy until an incident arrives.
Then everyone asks the same questions immediately.
Where did this column come from?
Who changed the rule?
Which dashboards now lie?
Lineage and catalog tools answer those questions faster.
They turn tribal knowledge into searchable metadata.
So what to do?
Capture technical lineage automatically where possible.
Add business descriptions deliberately where automation stops.
Catalogs should expose owner, freshness, schema, and sensitivity.
Lineage should expose upstream and downstream impact.
Column-level lineage matters for high-value metrics and PII.
Asset ownership matters for response speed.
Without ownership, alerts become theatre.
Simple, no?
Governance is really operational clarity plus policy enforcement.
Good metadata reduces panic during change.
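To make the catalog fields above concrete, here is a minimal sketch of one catalog record. The class `CatalogEntry`, its field names, and the sample asset names are all hypothetical, chosen only for illustration; a real catalog tool will have its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; field names are illustrative only."""
    name: str
    owner: str                    # team accountable for the asset
    description: str              # business meaning, written by a human
    freshness_sla: timedelta      # maximum acceptable age of the data
    last_updated: datetime
    schema: dict[str, str]        # column name -> type
    sensitive_columns: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)
    downstream: list[str] = field(default_factory=list)

    def is_stale(self, now: datetime) -> bool:
        """True when the data is older than its freshness SLA."""
        return now - self.last_updated > self.freshness_sla


# Sample values are assumptions, not taken from any real estate.
orders = CatalogEntry(
    name="analytics.orders_daily",
    owner="team-commerce-data",
    description="One row per order per day.",
    freshness_sla=timedelta(hours=6),
    last_updated=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    schema={"order_id": "string", "customer_email": "string", "gmv": "decimal"},
    sensitive_columns={"customer_email"},
    upstream=["raw.orders"],
    downstream=["dash.revenue_overview"],
)
```

During an incident, a record like this answers "who owns it", "is it fresh", and "what does it feed" in one lookup.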
Automatic capture still needs human meaning
Lineage can come from query parsing, orchestrator metadata, and manual tags.
Query parsing works well for SQL-heavy estates.
It becomes weaker for opaque code, such as jobs that build queries at runtime, and for external side effects.
Orchestrator events reveal run timing and dependencies.
Catalog crawlers discover tables, files, and schemas.
Manual annotations add business meaning and caveats.
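As a rough illustration of table-level lineage from query parsing, the sketch below pulls one target and its sources out of a simple SQL statement. It uses regular expressions purely to keep the example short; that is an assumption of this sketch, and production lineage tools rely on a full SQL parser. The function name `extract_lineage` and the table names are hypothetical.

```python
import re
from typing import Optional

# Naive patterns for illustration only; they ignore CTEs, subqueries,
# quoted identifiers, and anything a real SQL parser would handle.
_TARGET = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)", re.I
)
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def extract_lineage(sql: str) -> tuple[Optional[str], set[str]]:
    """Return (target table, source tables) for one simple statement."""
    target_match = _TARGET.search(sql)
    target = target_match.group(1) if target_match else None
    sources = set(_SOURCE.findall(sql))
    sources.discard(target)  # a table is not its own upstream
    return target, sources


sql = """
INSERT INTO analytics.orders_daily
SELECT o.order_id, c.region, o.amount
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
"""
target, sources = extract_lineage(sql)
```

The same limits the text describes show up here: a job that assembles its SQL string at runtime, or writes to an external API, leaves nothing for this kind of parser to read.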
Now watch.
Metadata only helps if users trust it.
Stale catalogs become digital graveyards.
┌──────────┐   parse   ┌───────────┐   publish   ┌──────────┐
│ SQL/jobs │──────────▶│ Lineage DB│────────────▶│ Catalog  │
└──────────┘           └───────────┘             └────┬─────┘
      │                                               │
      └──────── owners / tags / glossary ─────────────┘
        impact analysis before change
        policy and search in one place
Set freshness expectations for metadata too.
Auto-sync descriptions when schemas evolve, or flag drift.
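One way to flag drift is to compare the schema the catalog recorded with the schema the table has now. The sketch below assumes both are available as plain column-to-type dictionaries; the function name `schema_drift` and the sample columns are hypothetical.

```python
def schema_drift(
    cataloged: dict[str, str], actual: dict[str, str]
) -> dict[str, list[str]]:
    """Compare column -> type mappings and report the differences."""
    shared = cataloged.keys() & actual.keys()
    return {
        # Columns the table has but the catalog never described.
        "missing_from_catalog": sorted(actual.keys() - cataloged.keys()),
        # Columns the catalog still describes but the table dropped.
        "removed_from_table": sorted(cataloged.keys() - actual.keys()),
        # Columns present in both whose type no longer matches.
        "type_changed": sorted(c for c in shared if cataloged[c] != actual[c]),
    }


cataloged = {"order_id": "string", "amount": "decimal", "region": "string"}
actual = {"order_id": "string", "amount": "float", "channel": "string"}
drift = schema_drift(cataloged, actual)
```

Any non-empty list is a reason to either sync the description automatically or mark the entry as drifted so readers know not to trust it blindly.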
Tag PII at column level where policy demands it.
Tag certification status for trusted datasets.
Make search forgiving with aliases and domains.
Humans search for GMV before gross merchandise value.
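Forgiving search can start as a simple alias index over the glossary. The sketch below is one possible shape, not a description of any particular catalog product; `build_alias_index`, `resolve`, and the second glossary entry are hypothetical.

```python
from typing import Optional


def build_alias_index(glossary: dict[str, list[str]]) -> dict[str, str]:
    """Map each canonical term and alias (lowercased) to the canonical term."""
    index: dict[str, str] = {}
    for canonical, aliases in glossary.items():
        for term in [canonical, *aliases]:
            index[term.strip().lower()] = canonical
    return index


def resolve(query: str, index: dict[str, str]) -> Optional[str]:
    """Return the canonical glossary term for a query, if one is known."""
    return index.get(query.strip().lower())


# Illustrative glossary; the second entry is an assumed example.
glossary = {
    "gross merchandise value": ["GMV"],
    "monthly active users": ["MAU"],
}
index = build_alias_index(glossary)
```

With this in place, `resolve("gmv", index)` and `resolve("Gross Merchandise Value", index)` land on the same glossary entry, which is the behaviour users expect from search.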
Design the catalog for discoverability, not only compliance.
Compliance still benefits when discovery improves.
Useful governance gets adopted.
Ownership, access, and retention need precision
Ownership should exist for source, pipeline, and served asset.
One team rarely owns everything forever.
Make transfer rules clear during reorganizations.
Data domains help distribute responsibility sensibly.
Access control should match least privilege.
Analysts may need masked views, not raw PII.
See.
Policy tags can drive row and column access automatically.
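A minimal sketch of tag-driven column masking follows. It assumes a tiny in-memory policy table; the names `TAG_ALLOWED_ROLES`, `COLUMN_TAGS`, `apply_policy`, and the roles are hypothetical. In practice this enforcement usually lives in the warehouse or access layer, and a truncated hash stands in here for whatever masking method the policy actually requires.

```python
import hashlib

# Hypothetical policy: which roles may read columns carrying each tag.
TAG_ALLOWED_ROLES = {"pii": {"privacy_officer"}}

# Hypothetical column-level tags, as a catalog might store them.
COLUMN_TAGS = {"customer_email": {"pii"}}


def mask(value: str) -> str:
    """Replace a value with a short, stable SHA-256 digest (illustrative)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(row: dict[str, str], role: str) -> dict[str, str]:
    """Return the row as the given role is allowed to see it."""
    visible = {}
    for column, value in row.items():
        tags = COLUMN_TAGS.get(column, set())
        # Untagged columns pass through; a tag with no policy denies everyone.
        allowed = all(role in TAG_ALLOWED_ROLES.get(tag, set()) for tag in tags)
        visible[column] = value if allowed else mask(value)
    return visible


row = {"order_id": "o-1", "customer_email": "a@example.com"}
analyst_view = apply_policy(row, "analyst")
officer_view = apply_policy(row, "privacy_officer")
```

The design choice worth noting is that access follows the tag, not the table: tagging a new column as `pii` masks it for analysts without anyone editing a view by hand.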
Approval workflows should be auditable.
Retention and deletion requests need lineage to be credible: a deletion is complete only when every downstream copy is found.
Otherwise legal responses become guesswork.
Impact analysis should happen before changing shared metrics.
A renamed column can break far more than one dashboard.
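Impact analysis is, at its core, a graph walk over lineage edges. The sketch below assumes lineage is available as a simple mapping from each asset to its direct consumers; `downstream_impact` and the asset names are hypothetical.

```python
from collections import deque


def downstream_impact(edges: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk; returns every asset reachable from `start`."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Illustrative lineage: asset -> direct downstream consumers.
edges = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["analytics.orders_daily", "ml.order_features"],
    "analytics.orders_daily": ["dash.revenue_overview", "dash.finance_weekly"],
}
impacted = downstream_impact(edges, "raw.orders")
```

In this toy graph, a change to `raw.orders` reaches five assets, two of them dashboards, which is the kind of list worth reading before the change ships rather than after.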
So what to do?
Review top critical assets monthly for owner and policy accuracy.
Review unused sensitive tables for removal.
Remove stale data products from the catalog visibly.
Clean catalogs increase trust.
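The monthly review can be partly automated by flagging assets whose owner is no longer a real team or whose last review is overdue. This is a sketch under assumptions: the record shape, the 30-day threshold, and the name `needs_review` are all hypothetical.

```python
from datetime import date, timedelta


def needs_review(
    assets: list[dict],
    active_teams: set[str],
    today: date,
    max_age: timedelta = timedelta(days=30),
) -> list[tuple[str, list[str]]]:
    """Return (asset name, reasons) for every asset that needs attention."""
    flagged = []
    for asset in assets:
        reasons = []
        if asset["owner"] not in active_teams:
            reasons.append("owner is not an active team")
        if today - asset["last_reviewed"] > max_age:
            reasons.append("review overdue")
        if reasons:
            flagged.append((asset["name"], reasons))
    return flagged


# Illustrative records; names and dates are assumptions.
assets = [
    {"name": "analytics.orders_daily", "owner": "team-commerce-data",
     "last_reviewed": date(2024, 2, 20)},
    {"name": "analytics.legacy_refunds", "owner": "team-disbanded",
     "last_reviewed": date(2023, 11, 1)},
]
flagged = needs_review(assets, {"team-commerce-data"}, today=date(2024, 3, 1))
```

A report like this does not replace the human review; it only makes sure the review starts with the entries most likely to be wrong.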
Treat metadata as a product
Start with critical tables, dashboards, and regulated data.
Do not wait for perfect whole-estate coverage.
Integrate lineage with orchestrators and transformation tools early.
Pull owners from real teams, not placeholders.
Add glossary terms for core business concepts.
Keep definitions close to data products.
Make certification explicit and revocable.
Expose downstream consumers before schema changes deploy.
Think again using the factory analogy.
The loading dock shows who sent each part, the conveyor belt records every station touched, the showroom labels what customers can trust, the reject bin flags restricted or broken items, and the manifest ties the whole story together.
Simple, no?
Governance succeeds when engineers and analysts both use it daily.
If only auditors open the catalog, you built the wrong thing.
If lineage misses critical joins, your impact analysis is weak.
If ownership is outdated, alerts still bounce uselessly.
Treat metadata as a product.
Product thinking improves compliance and discovery together.
That is the practical path.
Where this lives in the wild
- Data catalogs are common where many teams need self-service discovery.
- Column-level lineage matters most for regulated fields and core metrics.
- Ownership metadata speeds incident response in shared platforms.
- PII tagging powers masked views, audits, and deletion workflows.
Pause and recall
- Why does governance suddenly matter during incidents?
- What is the limit of automatic lineage parsing?
- Why should metadata itself have freshness expectations?
- What happens when ownership is stale?
Interview Q&A
Q: What should a useful catalog show first?
A: Owner, description, freshness, schema, and sensitivity.
Common wrong answer to avoid: Every low-level storage detail before anything else.

Q: Why is column-level lineage valuable?
A: It narrows impact analysis for critical metrics and PII.
Common wrong answer to avoid: Table-level lineage is always sufficient.

Q: How do you keep catalogs from dying?
A: Automate technical capture and review business metadata regularly.
Common wrong answer to avoid: Launch the tool once and trust users to update it.

Q: What makes governance adopted?
A: It improves daily discovery and debugging, not only audits.
Common wrong answer to avoid: Compliance teams alone can drive adoption.
Apply now (5 min)
- Choose one important dashboard and trace its upstream tables.
- Write the owner, freshness SLA, and one sensitive column.
- Note which lineage is automatic and which needs manual annotation.
- Add one glossary term users often search incorrectly.
- Decide one schema change review rule before deployment.
Bridge. Lineage tracked. But ML needs features served fast. → 08