00. Data Platform and Pipeline Design — The Five-Year-Old Version

Data platforms are factories. Raw materials come in, get processed, and finished products come out.


Imagine a car factory. Raw steel, rubber, and glass arrive at the loading dock. That is data ingestion: bringing raw data from dozens of sources into one place.

The raw materials move along a conveyor belt. Each station does one thing: stamp the metal, weld the frame, paint the body. That is a data pipeline — a series of transformations that clean, enrich, and reshape raw data into something useful.
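To make the conveyor belt concrete, here is a minimal sketch in plain Python. The station functions (`clean`, `enrich`, `reshape`), the `run_pipeline` helper, and the record fields are all hypothetical names made up for illustration; a real pipeline would typically run on something like Spark, Flink, or dbt rather than a loop.

```python
from functools import reduce

# Each "station" takes one record and returns a new record.
# All names and fields here are illustrative assumptions.

def clean(record):
    # Strip stray whitespace and normalize capitalization of the name.
    return {**record, "name": record["name"].strip().title()}

def enrich(record):
    # Add a derived field computed from existing ones.
    return {**record, "total": record["quantity"] * record["unit_price"]}

def reshape(record):
    # Keep only the fields the downstream consumer needs.
    return {"customer": record["name"], "total": record["total"]}

def run_pipeline(records, stations):
    # Pass every record through the stations in order, one at a time.
    for record in records:
        yield reduce(lambda rec, station: station(rec), stations, record)

raw = [{"name": "  ada lovelace ", "quantity": 3, "unit_price": 2.5}]
finished = list(run_pipeline(raw, [clean, enrich, reshape]))
print(finished)  # [{'customer': 'Ada Lovelace', 'total': 7.5}]
```

Each station does one job and knows nothing about the others, so you can reorder, replace, or test stations individually.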

At the end of the conveyor belt, finished cars park in the showroom — ready for customers to browse. That is the serving layer — data warehouses, dashboards, and APIs that let analysts and applications query processed data.

Sometimes a part is defective. A sensor catches it on the belt and diverts it to the reject bin. That is data quality: validation, schema enforcement, and dead-letter handling that catch bad records before they contaminate downstream tables.
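As a rough sketch of the sensor and the reject bin, the snippet below validates each record and routes failures to a separate list together with the reason. The rules, the field names (`order_id`, `amount`), and the `inspect` helper are assumptions for illustration; production systems would more likely rely on a tool such as Great Expectations or a dead-letter queue.

```python
def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not isinstance(record.get("order_id"), int):
        problems.append("order_id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    return problems

def inspect(records):
    # Good records continue down the belt; bad ones go to the reject bin
    # along with the reasons, so someone can investigate later.
    accepted, reject_bin = [], []
    for record in records:
        problems = validate(record)
        if problems:
            reject_bin.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, reject_bin

records = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": "2", "amount": 5.0},   # wrong type for order_id
    {"order_id": 3, "amount": -4},      # negative amount
]
accepted, reject_bin = inspect(records)
print(len(accepted), len(reject_bin))  # 1 2
```

Keeping the rejected record next to its list of problems is the design choice that matters here: a reject bin without reasons is hard to debug.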

The factory keeps a manifest for every car: which parts went in, which station processed it, when it was completed. If a recall happens, you trace the manifest back to the exact batch of defective steel. That is data lineage and cataloging — knowing where data came from, how it was transformed, and where it went.
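The recall scenario can be sketched as a small graph walk. Below, each hypothetical `ManifestEntry` records which inputs produced which output, and `trace_upstream` walks backward from a dataset to everything it depends on. The dataset and step names are invented for illustration and do not follow any particular catalog's format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ManifestEntry:
    output: str      # dataset that was produced
    inputs: tuple    # datasets that went into it
    step: str        # which "station" did the work

# Illustrative manifest: names are assumptions, not a real schema.
MANIFEST = [
    ManifestEntry("clean_orders", ("raw_orders",), "dedupe_and_cast"),
    ManifestEntry("daily_revenue", ("clean_orders", "fx_rates"), "join_and_sum"),
]

def trace_upstream(dataset, manifest):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    produced_by = {entry.output: entry for entry in manifest}
    upstream, stack = set(), [dataset]
    while stack:
        entry = produced_by.get(stack.pop())
        if entry is None:
            continue  # a raw source: nothing upstream is recorded for it
        for parent in entry.inputs:
            if parent not in upstream:
                upstream.add(parent)
                stack.append(parent)
    return upstream

print(sorted(trace_upstream("daily_revenue", MANIFEST)))
# ['clean_orders', 'fx_rates', 'raw_orders']
```

If `raw_orders` turns out to be the defective steel, the same manifest read in the other direction tells you which finished tables need to be recalled.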

Modern AI systems are especially hungry. They don't just want finished cars — they want every intermediate part, labeled with timestamps and quality grades. Feature stores, training datasets, and evaluation sets all come from this factory.

Why is this hard? Scale. Netflix processes over 1 trillion events per day. Uber generates 1 PB of new data daily. At this scale, the conveyor belt must be distributed across hundreds of machines, fault-tolerant, and still deliver results within minutes or seconds.

There are two fundamental modes. Batch processing: run the factory once a day and process yesterday's data. Real-time streaming: the conveyor belt never stops, and data flows continuously, event by event. Most modern platforms need both. The "lambda architecture" runs a batch layer and a streaming layer side by side and merges their results; the newer "kappa architecture" keeps a single streaming pipeline and handles reprocessing by replaying the event log.
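The difference between the two modes shows up even in a toy example. The sketch below computes the same per-action counts twice: once over the full set of events (batch) and once by updating running state as each event arrives (streaming). The event shape and the `StreamingCounts` class are illustrative assumptions, not any framework's API.

```python
from collections import Counter

events = [
    {"user": "a", "action": "click"},
    {"user": "b", "action": "view"},
    {"user": "a", "action": "click"},
]

def batch_counts(all_events):
    # Batch: all of yesterday's events are available at once.
    return Counter(event["action"] for event in all_events)

class StreamingCounts:
    # Streaming: keep running state and update it one event at a time.
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["action"]] += 1
        return self.counts  # an up-to-date answer after every event

stream = StreamingCounts()
for event in events:
    stream.on_event(event)

print(batch_counts(events) == stream.counts)  # True
```

Both approaches reach the same answer here. The batch version is simpler but only answers after all the data has arrived; the streaming version answers continuously but has to manage state, which is where most of the real-world difficulty lives.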

The tooling landscape is vast. Airflow orchestrates batch workflows. Kafka and Flink power streaming. Spark handles distributed computation. dbt transforms data in warehouses. Snowflake, BigQuery, and Databricks serve as the showroom. Great Expectations validates quality. Each tool does one thing well — your job is wiring them into a cohesive factory.

Data governance runs through everything. Who can see what? Which columns contain PII? When was this table last updated? Is this data trustworthy? The manifest answers these questions — and regulations (GDPR, HIPAA, CCPA) demand it.
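One way to picture "who can see what" is a catalog that tags columns and a check that masks anything a role is not cleared for. This is only a sketch: the tag names, roles, and the `visible_row` helper are hypothetical, and real platforms enforce this in the warehouse or catalog layer rather than in application code.

```python
# Hypothetical catalog: which sensitivity tags each column carries.
COLUMN_TAGS = {
    "email": {"pii"},
    "country": set(),
    "order_total": set(),
}

# Hypothetical policy: which tags each role is cleared to read.
ROLE_CLEARANCE = {
    "analyst": set(),
    "privacy_officer": {"pii"},
}

def visible_row(row, role):
    """Mask every column whose tags are not all covered by the role's clearance."""
    allowed = ROLE_CLEARANCE[role]
    masked = {}
    for column, value in row.items():
        # Columns missing from the catalog are treated as unclassified
        # and masked for everyone (default deny).
        tags = COLUMN_TAGS.get(column, {"unclassified"})
        masked[column] = value if tags <= allowed else "***"
    return masked

row = {"email": "ada@example.com", "country": "UK", "order_total": 42}
print(visible_row(row, "analyst"))
# {'email': '***', 'country': 'UK', 'order_total': 42}
```

The default-deny branch is a deliberate choice in this sketch: a column nobody has classified yet stays hidden until someone tags it.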


The placeholders you will see called back

| Placeholder | Meaning |
| --- | --- |
| loading dock | ingestion: Kafka, Debezium, Fivetran, APIs pulling raw data in |
| conveyor belt | pipeline: Spark, Flink, dbt, Airflow; the sequence of transformations |
| showroom | serving layer: data warehouse, lakehouse, dashboards, feature store |
| reject bin | data quality: validation, schema checks, quarantine for bad records |
| manifest | lineage and catalog: metadata, provenance, column-level tracking |

What's coming

  1. 01-batch-vs-stream.md — when to process data in chunks vs. real-time flows
  2. 02-ingestion-patterns.md — CDC, event streams, API pulls, and file drops
  3. 03-transformation-dbt-spark.md — cleaning, joining, and reshaping data at scale
  4. 04-orchestration-airflow-dagster.md — scheduling, DAGs, retries, and dependency management
  5. 05-warehouse-lakehouse.md — Snowflake, BigQuery, Databricks, and the lakehouse pattern
  6. 06-data-quality-testing.md — Great Expectations, dbt tests, and schema enforcement
  7. 07-lineage-catalog-governance.md — tracking data origins, ownership, and access controls
  8. 08-feature-stores.md — bridging data pipelines and ML model training/serving
  9. 09-honest-admission.md — what we don't fully understand about data platforms

Bridge. The factory floor has two modes: batch shifts and real-time assembly lines. Let's understand when to use each. → 01-batch-vs-stream.md