00. Data Platform and Pipeline Design — The Five-Year-Old Version
Data platforms are factories. Raw materials come in, get processed, and finished products come out.
Imagine a car factory. Raw steel, rubber, glass arrive at the loading dock. That is data ingestion — bringing raw data from dozens of sources into one place.
The raw materials move along a conveyor belt. Each station does one thing: stamp the metal, weld the frame, paint the body. That is a data pipeline — a series of transformations that clean, enrich, and reshape raw data into something useful.
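The conveyor belt idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular engine works: the stage names (`clean`, `enrich`, `reshape`), the record fields, and `run_pipeline` are all hypothetical, chosen only to show stations applied in a fixed order.

```python
from functools import reduce

def clean(records):
    # Station 1: drop records with no user id and normalise the country code.
    for r in records:
        if r.get("user_id"):
            yield {**r, "country": (r.get("country") or "").strip().upper()}

def enrich(records):
    # Station 2: add a derived field computed from an existing one.
    for r in records:
        yield {**r, "amount_usd": round(r["amount_cents"] / 100, 2)}

def reshape(records):
    # Station 3: keep only the columns the serving layer needs.
    for r in records:
        yield {
            "user_id": r["user_id"],
            "country": r["country"],
            "amount_usd": r["amount_usd"],
        }

def run_pipeline(records, stages):
    # Feed the output of each stage into the next one, in order.
    return list(reduce(lambda data, stage: stage(data), stages, records))

raw = [
    {"user_id": "u1", "country": " us ", "amount_cents": 1250},
    {"user_id": None, "country": "de", "amount_cents": 300},
    {"user_id": "u2", "country": "fr", "amount_cents": 999},
]
finished = run_pipeline(raw, [clean, enrich, reshape])
```

Each stage is a generator, so records flow through one at a time rather than being copied between stations. Distributed engines apply the same shape across many machines.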
At the end of the conveyor belt, finished cars park in the showroom — ready for customers to browse. That is the serving layer — data warehouses, dashboards, and APIs that let analysts and applications query processed data.
Sometimes a part is defective. A sensor catches it on the belt and diverts it to the reject bin. That is data quality — validation, schema enforcement, and dead-letter handling that catch bad records before they contaminate downstream systems.
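A reject bin can be as simple as a second list. The sketch below assumes a made-up two-field schema (`REQUIRED_FIELDS`) and hypothetical helper names. Real platforms usually lean on a validation library or the warehouse's own constraints, but the split between good records and quarantined ones looks much the same.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"order_id": str, "amount_cents": int}

def validate(record):
    """Return a list of human-readable problems; empty means the record is fine."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    # Only run value checks once the shape of the record is known to be valid.
    if not errors and record["amount_cents"] < 0:
        errors.append("amount_cents must be non-negative")
    return errors

def split_good_and_bad(records):
    """Route valid records onward and bad ones to a dead-letter list."""
    good, dead_letter = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original record and the reasons, so it can be fixed and replayed.
            dead_letter.append({"record": record, "errors": errors})
        else:
            good.append(record)
    return good, dead_letter

batch = [
    {"order_id": "a1", "amount_cents": 500},
    {"order_id": "a2"},
    {"order_id": 3, "amount_cents": -5},
]
good, dead_letter = split_good_and_bad(batch)
```

Storing the failure reasons next to the rejected record is the design choice that matters here: it lets someone repair and replay the record later instead of silently losing it.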
The factory keeps a manifest for every car: which parts went in, which station processed it, when it was completed. If a recall happens, you trace the manifest back to the exact batch of defective steel. That is data lineage and cataloging — knowing where data came from, how it was transformed, and where it went.
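The recall scenario is a graph walk. In this sketch the `LINEAGE` mapping and its dataset names are invented for illustration; a real catalog would collect these edges automatically from query logs or pipeline metadata.

```python
# Hypothetical lineage records: each dataset maps to the datasets it was built from.
LINEAGE = {
    "raw.orders": [],
    "raw.customers": [],
    "staging.orders_clean": ["raw.orders"],
    "marts.revenue_by_customer": ["staging.orders_clean", "raw.customers"],
    "dashboards.weekly_revenue": ["marts.revenue_by_customer"],
}

def upstream(dataset, lineage):
    """Every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(lineage.get(parent, []))
    return seen

def downstream(dataset, lineage):
    """Every dataset affected if `dataset` turns out to be bad (the recall case)."""
    return {name for name in lineage if dataset in upstream(name, lineage)}

sources = upstream("dashboards.weekly_revenue", LINEAGE)
affected = downstream("raw.orders", LINEAGE)
```

`upstream` answers "where did this come from?" and `downstream` answers "what do we have to recall?". Both are the same traversal run in opposite directions.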
Modern AI systems are especially hungry. They don't just want finished cars — they want every intermediate part, labeled with timestamps and quality grades. Feature stores, training datasets, and evaluation sets all come from this factory.
Why is this hard? Scale. Netflix reportedly processes over 1 trillion events per day. Uber reportedly generates about 1 PB of new data daily. At this scale, the conveyor belt must be distributed across hundreds of machines, tolerate failures, and still deliver results within minutes or seconds.
There are two fundamental modes. Batch processing: run the factory once a day and process yesterday's data. Real-time streaming: the conveyor belt never stops, and data flows continuously, event by event. Most modern platforms need both. The "lambda architecture" runs a batch layer and a streaming layer side by side; the newer "kappa architecture" keeps a single streaming pipeline and handles reprocessing by replaying the event log.
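The difference between the two modes shows up even in a toy aggregation. The sketch below uses invented event data and hypothetical names (`batch_totals`, `StreamingTotals`): the batch function sees the whole day at once, while the streaming class updates its answer one event at a time.

```python
from collections import defaultdict

# Hypothetical events: (user, amount) pairs in arrival order.
events = [("u1", 5), ("u2", 3), ("u1", 7)]

def batch_totals(events):
    """Batch: wait until all events are collected, then compute totals in one pass."""
    totals = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)

class StreamingTotals:
    """Streaming: keep running state and update it as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, amount):
        self.totals[user] += amount
        # An up-to-date answer is available immediately after every event.
        return self.totals[user]

nightly = batch_totals(events)

stream = StreamingTotals()
running = [stream.on_event(user, amount) for user, amount in events]
```

Both paths end with the same totals. What differs is when the answer becomes available and how much state must be kept alive between events.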
The tooling landscape is vast. Airflow orchestrates batch workflows. Kafka and Flink power streaming. Spark handles distributed computation. dbt transforms data in warehouses. Snowflake, BigQuery, and Databricks serve as the showroom. Great Expectations validates quality. Each tool does one thing well — your job is wiring them into a cohesive factory.
Data governance runs through everything. Who can see what? Which columns contain PII? When was this table last updated? Is this data trustworthy? The manifest answers these questions — and regulations (GDPR, HIPAA, CCPA) demand it.
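"Which columns contain PII?" becomes enforceable once the answer lives in metadata rather than in someone's head. The catalog structure, table name, and `visible_row` helper below are assumptions made for illustration; commercial warehouses expose the same idea through column tags and masking policies.

```python
# Hypothetical catalog entry: ownership plus per-column PII flags.
CATALOG = {
    "marts.customers": {
        "owner": "growth-team",
        "columns": {
            "customer_id": {"pii": False},
            "email": {"pii": True},
            "signup_date": {"pii": False},
        },
    }
}

def visible_row(table, row, can_see_pii):
    """Mask PII columns unless the caller is cleared to see them."""
    columns = CATALOG[table]["columns"]
    return {
        name: ("***" if columns[name]["pii"] and not can_see_pii else value)
        for name, value in row.items()
    }

row = {"customer_id": 1, "email": "a@example.com", "signup_date": "2024-01-01"}
analyst_view = visible_row("marts.customers", row, can_see_pii=False)
auditor_view = visible_row("marts.customers", row, can_see_pii=True)
```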
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| loading dock | ingestion — Kafka, Debezium, Fivetran, APIs pulling raw data in |
| conveyor belt | pipeline — Spark, Flink, dbt, Airflow; the sequence of transformations |
| showroom | serving layer — data warehouse, lakehouse, dashboards, feature store |
| reject bin | data quality — validation, schema checks, quarantine for bad records |
| manifest | lineage and catalog — metadata, provenance, column-level tracking |
Top resources
- Fundamentals of Data Engineering by Joe Reis and Matt Housley — the modern data engineering textbook
- The Data Warehouse Toolkit by Ralph Kimball and Margy Ross — dimensional modeling; still foundational
- dbt Documentation — the transformation layer that data teams love
- Spark: The Definitive Guide by Bill Chambers and Matei Zaharia — distributed processing patterns
- Data Mesh by Zhamak Dehghani — decentralized data ownership at scale
What's coming
- 01-batch-vs-stream.md — when to process data in chunks vs. real-time flows
- 02-ingestion-patterns.md — CDC, event streams, API pulls, and file drops
- 03-transformation-dbt-spark.md — cleaning, joining, and reshaping data at scale
- 04-orchestration-airflow-dagster.md — scheduling, DAGs, retries, and dependency management
- 05-warehouse-lakehouse.md — Snowflake, BigQuery, Databricks, and the lakehouse pattern
- 06-data-quality-testing.md — Great Expectations, dbt tests, and schema enforcement
- 07-lineage-catalog-governance.md — tracking data origins, ownership, and access controls
- 08-feature-stores.md — bridging data pipelines and ML model training/serving
- 09-honest-admission.md — what we don't fully understand about data platforms
Bridge. The factory floor has two modes: batch shifts and real-time assembly lines. Let's understand when to use each. → 01-batch-vs-stream.md