00. Data Platform and Pipeline Design: The Five-Year-Old Version

Data platforms are factories. Raw materials come in, get processed, and finished products come out.

Imagine a car factory. Raw steel, rubber, glass arrive at the loading dock. That is data ingestion — bringing raw data from dozens of sources into one place.

The raw materials move along a conveyor belt. Each station does one thing: stamp the metal, weld the frame, paint the body. That is a data pipeline — a series of transformations that clean, enrich, and reshape raw data into something useful.
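The conveyor belt can be sketched in a few lines of Python. This is a toy illustration, not how Spark or dbt work internally: the station names (`clean`, `enrich`, `reshape`), the record fields, and the 2.0-per-km fare rate are all hypothetical, chosen only to show one record passing through a fixed sequence of single-purpose steps.

```python
from functools import reduce
from typing import Callable

Record = dict
Station = Callable[[Record], Record]


def clean(record: Record) -> Record:
    # Station 1: strip stray whitespace from the (hypothetical) "city" field.
    return {**record, "city": record["city"].strip()}


def enrich(record: Record) -> Record:
    # Station 2: add a derived field. The 2.0-per-km rate is made up.
    return {**record, "fare": record["distance_km"] * 2.0}


def reshape(record: Record) -> Record:
    # Station 3: keep only the columns the serving layer needs.
    return {"city": record["city"], "fare": record["fare"]}


def run_pipeline(record: Record, stations: list[Station]) -> Record:
    # Pass the record through each station in order, like a conveyor belt.
    return reduce(lambda rec, station: station(rec), stations, record)


raw = {"city": "  Lagos ", "distance_km": 3.5, "debug": "x"}
finished = run_pipeline(raw, [clean, enrich, reshape])
print(finished)  # {'city': 'Lagos', 'fare': 7.0}
```

Each station takes a record and returns a new one without mutating its input, which is what makes stations easy to reorder, test in isolation, and retry.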

At the end of the conveyor belt, finished cars park in the showroom — ready for customers to browse. That is the serving layer — data warehouses, dashboards, and APIs that let analysts and applications query processed data.

Sometimes a part is defective. A sensor catches it on the belt and diverts it to the reject bin. That is data quality: validation, schema enforcement, and dead-letter handling that catch bad records before they contaminate downstream tables and dashboards.
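A minimal sketch of the sensor and the reject bin, assuming a made-up schema with an integer `user_id` and a non-negative `amount`. The function names are hypothetical. Real platforms usually delegate this to a validation tool and a dedicated dead-letter queue or quarantine table, but the shape of the logic is the same: check each record, and route failures aside along with the reason they failed.

```python
def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is fine."""
    errors = []
    if not isinstance(record.get("user_id"), int):
        errors.append("user_id must be an int")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    return errors


def split_good_and_bad(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route valid records onward and invalid ones to the reject bin."""
    good, dead_letter = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original record and the reasons, so it can be
            # inspected and replayed later instead of silently dropped.
            dead_letter.append({"record": record, "errors": errors})
        else:
            good.append(record)
    return good, dead_letter


batch = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": "two", "amount": 4.0},
    {"user_id": 3, "amount": -1},
]
good, dead_letter = split_good_and_bad(batch)
print(len(good), len(dead_letter))  # 1 2
```

Storing the failure reason next to the rejected record is the design choice that matters here: a reject bin you cannot diagnose is only a slower way of losing data.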

The factory keeps a manifest for every car: which parts went in, which station processed it, when it was completed. If a recall happens, you trace the manifest back to the exact batch of defective steel. That is data lineage and cataloging — knowing where data came from, how it was transformed, and where it went.
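The recall scenario can be sketched as a walk over a small dependency graph. The dataset names below (`raw_orders`, `clean_orders`, and so on) are invented for illustration, and a real catalog would store far more metadata per node. This only shows the core idea: if each dataset records its direct inputs, you can trace a bad output back to its sources, or find everything a bad source has touched.

```python
# Hypothetical manifest: each dataset maps to the datasets it was built from.
lineage = {
    "daily_revenue": ["clean_orders", "exchange_rates"],
    "clean_orders": ["raw_orders"],
    "exchange_rates": [],
    "raw_orders": [],
}


def upstream(dataset: str, graph: dict[str, list[str]]) -> set[str]:
    """Every dataset that directly or indirectly feeds `dataset`."""
    seen: set[str] = set()
    stack = list(graph.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen


def downstream(dataset: str, graph: dict[str, list[str]]) -> set[str]:
    """Every dataset that directly or indirectly consumes `dataset`."""
    return {name for name in graph if dataset in upstream(name, graph)}


# Where did this report's numbers come from?
print(sorted(upstream("daily_revenue", lineage)))
# ['clean_orders', 'exchange_rates', 'raw_orders']

# The raw orders were bad: what needs to be recalled?
print(sorted(downstream("raw_orders", lineage)))
# ['clean_orders', 'daily_revenue']
```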

Modern AI systems are especially hungry. They don't just want finished cars — they want every intermediate part, labeled with timestamps and quality grades. Feature stores, training datasets, and evaluation sets all come from this factory.

Why is this hard? Scale. Netflix has reported processing more than 1 trillion events per day, and Uber has reportedly generated on the order of a petabyte of new data daily. At this scale, the conveyor belt must be distributed across hundreds of machines, tolerate failures, and still deliver results within minutes or seconds.

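There are two fundamental modes. Batch processing: run the factory once a day and process yesterday's data. Real-time streaming: the conveyor belt never stops, and each piece of data is processed within milliseconds to seconds of arriving. Most modern platforms need both kinds of result. The "lambda architecture" runs a batch layer and a streaming layer side by side and merges their outputs; the newer "kappa architecture" keeps a single streaming pipeline and handles batch-style reprocessing by replaying the event log through it.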
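The difference between the two modes shows up even in a toy example. The sketch below computes the same per-key total twice over a hypothetical list of `(key, value)` events: once after all the data has arrived, and once incrementally, emitting an updated answer per event. It uses plain Python rather than Spark or Flink, so it illustrates the shape of the computation, not the distributed machinery.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical events: (key, value) pairs arriving in this order.
events = [("rider_a", 5), ("rider_b", 3), ("rider_a", 2)]


def batch_totals(all_events: list[tuple[str, int]]) -> dict[str, int]:
    # Batch: wait until every event is available, then compute once.
    totals: dict[str, int] = defaultdict(int)
    for key, value in all_events:
        totals[key] += value
    return dict(totals)


def streaming_totals(
    event_stream: Iterable[tuple[str, int]],
) -> Iterator[tuple[str, int]]:
    # Streaming: keep running state and emit an update per event.
    totals: dict[str, int] = defaultdict(int)
    for key, value in event_stream:
        totals[key] += value
        yield key, totals[key]


batch_result = batch_totals(events)
stream_updates = list(streaming_totals(iter(events)))

print(batch_result)    # {'rider_a': 7, 'rider_b': 3}
print(stream_updates)  # [('rider_a', 5), ('rider_b', 3), ('rider_a', 7)]
```

Both versions agree on the final totals. The streaming version trades one complete answer at the end for a sequence of early, partial answers, and it has to carry state (`totals`) for as long as the stream runs.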

The tooling landscape is vast. Airflow orchestrates batch workflows. Kafka and Flink power streaming. Spark handles distributed computation. dbt transforms data in warehouses. Snowflake, BigQuery, and Databricks serve as the showroom. Great Expectations validates quality. Each tool does one thing well — your job is wiring them into a cohesive factory.

Data governance runs through everything. Who can see what? Which columns contain PII? When was this table last updated? Is this data trustworthy? The manifest answers these questions — and regulations (GDPR, HIPAA, CCPA) demand it.

The placeholders you will see called back

Placeholder	Meaning
loading dock	ingestion — Kafka, Debezium, Fivetran, APIs pulling raw data in
conveyor belt	pipeline — Spark, Flink, dbt, Airflow; the sequence of transformations
showroom	serving layer — data warehouse, lakehouse, dashboards, feature store
reject bin	data quality — validation, schema checks, quarantine for bad records
manifest	lineage and catalog — metadata, provenance, column-level tracking

Top resources

Fundamentals of Data Engineering by Joe Reis and Matt Housley: the modern data engineering textbook
The Data Warehouse Toolkit by Ralph Kimball and Margy Ross: dimensional modeling; still foundational
dbt Documentation: the transformation layer that data teams love
Spark: The Definitive Guide by Bill Chambers and Matei Zaharia: distributed processing patterns
Data Mesh by Zhamak Dehghani: decentralized data ownership at scale

What's coming

01-batch-vs-stream.md — when to process data in chunks vs. real-time flows
02-ingestion-patterns.md — CDC, event streams, API pulls, and file drops
03-transformation-dbt-spark.md — cleaning, joining, and reshaping data at scale
04-orchestration-airflow-dagster.md — scheduling, DAGs, retries, and dependency management
05-warehouse-lakehouse.md — Snowflake, BigQuery, Databricks, and the lakehouse pattern
06-data-quality-testing.md — Great Expectations, dbt tests, and schema enforcement
07-lineage-catalog-governance.md — tracking data origins, ownership, and access controls
08-feature-stores.md — bridging data pipelines and ML model training/serving
09-honest-admission.md — what we don't fully understand about data platforms

Bridge. The factory floor has two modes: batch shifts and real-time assembly lines. Let's understand when to use each. → 01-batch-vs-stream.md