04. Orchestration with Airflow and Dagster¶

⏱️ Estimated time: 23 min | Level: intermediate

ELI5 callback: In the car factory, the loading dock sets arrival rhythm, the conveyor belt sets work rhythm, the showroom exposes finished output, the reject bin protects trust, and the manifest explains every move. This file teaches who starts, stops, and retries each station.

Separate orchestration from computation¶

See. Orchestration is not the same as computation.

Airflow or Dagster mostly tells work when to run.

Spark, dbt, SQL, or scripts do the actual work.

A DAG encodes dependencies, schedules, and failure behavior.
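To make the dependency part concrete, here is a minimal sketch using only Python's standard library. The task names are hypothetical, and a real orchestrator layers schedules and failure behavior on top of this ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "build_daily_revenue": {"extract_orders"},
    "publish_dashboard": {"build_daily_revenue"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The orchestrator's job is essentially this ordering plus timing and retries; the work inside each task stays elsewhere.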

Good orchestration makes pipelines boring.

Bad orchestration creates invisible waiting and noisy pages.

So what to do?

Start by defining task boundaries around retry-safe work.

One task should have one obvious success condition.

Keep tasks idempotent whenever possible.

Otherwise backfills become terrifying.
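A minimal sketch of an idempotent task, assuming output is partitioned by run date. The `write_partition` function and the in-memory `store` are hypothetical stand-ins for a real table or object-store prefix. The point is that a rerun overwrites the partition instead of appending to it.

```python
from datetime import date

# Hypothetical in-memory "table": partition key -> rows.
store: dict[str, list[dict]] = {}


def write_partition(store: dict[str, list[dict]], partition: date, rows: list[dict]) -> int:
    """Replace the whole partition so a rerun leaves the same final state."""
    key = partition.isoformat()
    store[key] = list(rows)  # overwrite, never append
    return len(store[key])


rows = [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 15}]
first = write_partition(store, date(2024, 1, 5), rows)
second = write_partition(store, date(2024, 1, 5), rows)  # simulated rerun
```

Had the task appended instead, the rerun would have doubled the rows, and every backfill would need manual cleanup first.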

Scheduling should reflect data readiness, not human preference.

Cron is simple, but event-driven triggers fit better when upstream arrival times vary.

Sensors help, but a polling sensor can hold a worker slot while it waits, so too many waste workers.

Dagster adds stronger asset semantics and local development ergonomics.

Airflow offers wide adoption and many operators.

Simple, no?

Choose the orchestrator your team can operate calmly.

Retries, backfills, and sensors need clear rules¶

Retries are useful only when failures are transient.

Retrying a bad SQL statement five times is comedy.

So classify failures: transient, data, code, or dependency.
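One way to act on that classification, sketched in plain Python. The exception classes and `run_with_retries` are hypothetical names, not part of Airflow or Dagster. Only the transient class is retried; everything else fails on the first attempt.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


class CodeError(Exception):
    """Hypothetical marker for bugs such as bad SQL; a retry cannot fix these."""


def run_with_retries(task, max_retries: int = 3, delay_s: float = 0.0):
    """Retry only transient failures; let every other exception fail fast."""
    attempt = 0
    while True:
        try:
            return task()
        except TransientError:
            attempt += 1
            if attempt > max_retries:
                raise
            time.sleep(delay_s)


calls = {"flaky": 0, "bad_sql": 0}


def flaky_task():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("connection reset")
    return "ok"


def bad_sql_task():
    calls["bad_sql"] += 1
    raise CodeError("column does not exist")


result = run_with_retries(flaky_task)  # succeeds on the third call
try:
    run_with_retries(bad_sql_task)
    bad_sql_failed = False
except CodeError:
    bad_sql_failed = True  # raised after a single attempt
```

The bad SQL task runs exactly once. That is the difference between a retry policy and a habit.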

Backfills replay past periods to rebuild missing or corrected outputs.

They need isolation from daily fresh runs.

Concurrency limits stop backfills from starving production.

Catchup settings decide whether the scheduler creates runs for every missed interval or only the latest one.
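A minimal sketch of a backfill plan, assuming daily partitions with half-open intervals. The helper names and the concurrency limit of 3 are illustrative assumptions. Airflow exposes related DAG-level settings such as `catchup` and `max_active_runs`; this sketch does not use them, it only shows the idea.

```python
from datetime import date, timedelta


def daily_intervals(start: date, end: date) -> list[tuple[date, date]]:
    """Half-open daily intervals [day, day + 1) covering start <= day < end."""
    intervals = []
    day = start
    while day < end:
        intervals.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return intervals


def batches(items: list, max_concurrent: int) -> list[list]:
    """Group intervals so that at most max_concurrent run at the same time."""
    return [items[i:i + max_concurrent] for i in range(0, len(items), max_concurrent)]


backfill = daily_intervals(date(2024, 1, 1), date(2024, 1, 8))  # seven days
plan = batches(backfill, max_concurrent=3)
```

Each interval states its own start and end, so a replayed run never has to guess "today". The batch size is the cap that keeps the backfill from starving fresh runs.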

Now watch.

The safest DAG makes time boundaries explicit.

┌──────────┐  sensor   ┌──────────┐
│ Upstream │──────────▶│  Task A  │
└──────────┘           └────┬─────┘
                            │ retry
                   ┌────────▼─────┐
                   │ Task B / dbt │──▶ publish
                   └──────────────┘

Sensors should wait for durable signals, not vague hope.

A file existence check is clearer than polling a dashboard.
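A minimal sketch of such a check, assuming upstream writes a marker file once its output is complete. The `_SUCCESS` file name and the `wait_for_marker` helper are assumptions for illustration. The `timeout_s` and `poke_interval_s` parameters only mirror the kind of settings orchestrator sensors usually offer.

```python
import tempfile
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float, poke_interval_s: float = 1.0) -> bool:
    """Poll until the marker file exists; return False once timeout_s has passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval_s)


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "_SUCCESS"
    missing = wait_for_marker(marker, timeout_s=0.05, poke_interval_s=0.01)
    marker.touch()  # upstream publishes its durable "done" signal
    ready = wait_for_marker(marker, timeout_s=0.05, poke_interval_s=0.01)
```

The timeout matters as much as the check. A sensor that waits forever turns an upstream delay into a silent one.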

SLAs and alerts should point to owners, not generic channels.

Logging must include run_id, partition, and upstream references.
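A minimal sketch of a log line that carries that context, using only the standard library. The `log_event` helper, the field names, and the sample values are assumptions, not a format required by any orchestrator.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")


def log_event(event: str, run_id: str, partition: str, upstream: list[str], **fields) -> str:
    """Emit one JSON line that always carries run_id, partition, and upstream."""
    record = {
        "event": event,
        "run_id": run_id,
        "partition": partition,
        "upstream": upstream,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_event(
    "task_failed",
    run_id="manual-2024-01-05",  # hypothetical run identifier
    partition="2024-01-05",
    upstream=["raw.orders"],  # hypothetical upstream table
    task="transform_orders",
    error="timeout",
)
```

Because the three context fields are required arguments, a task cannot log without them.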

Manual reruns need written playbooks.

Without playbooks, operators invent risky shortcuts.

Good orchestration reduces surprise during replay.

Great orchestration makes dependency graphs legible to newcomers.

That saves real incident hours.

Monitor freshness, not just green boxes¶

A green DAG can still hide stale data.

So monitor freshness and row counts, not only task success.
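A minimal sketch of such a check, assuming you can query the newest event timestamp and the row count of the output. The function name, the 26-hour lag budget, and the 1000-row floor are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def check_output(
    latest_event_time: datetime,
    now: datetime,
    max_lag: timedelta,
    row_count: int,
    min_rows: int,
) -> list[str]:
    """Return a list of problems; an empty list means the output looks healthy."""
    problems = []
    lag = now - latest_event_time
    if lag > max_lag:
        problems.append(f"stale: lag {lag} exceeds {max_lag}")
    if row_count < min_rows:
        problems.append(f"low volume: {row_count} rows, expected at least {min_rows}")
    return problems


now = datetime(2024, 1, 6, 6, 0, tzinfo=timezone.utc)

# The task succeeded, but the newest row is about 30 hours old and volume is low.
problems = check_output(
    latest_event_time=datetime(2024, 1, 4, 23, 59, tzinfo=timezone.utc),
    now=now,
    max_lag=timedelta(hours=26),
    row_count=120,
    min_rows=1000,
)

healthy = check_output(
    latest_event_time=datetime(2024, 1, 5, 23, 59, tzinfo=timezone.utc),
    now=now,
    max_lag=timedelta(hours=26),
    row_count=5000,
    min_rows=1000,
)
```

Run a check like this as its own step after publish, so a green box means fresh data and not only a clean exit code.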

One upstream slowdown can cause a downstream domino effect.

Priority pools and queues help separate critical from optional work.

Cross-team pipelines need clear contracts for arrival times.

Otherwise every delay becomes a blame game.

See.

Dynamic task generation helps scale, but can hide complexity.

Too many tiny tasks overload the scheduler itself.

Too few giant tasks hide where the failure sits.

Balance visibility with overhead.

Store artifacts and metadata for each run.

Operators need evidence, not memory.

Dagster assets make dependencies between tables more explicit.

Airflow Datasets (introduced in 2.4, renamed Assets in Airflow 3) move in that direction too.

So what to do?

Prefer observable runs over clever dependency tricks.

If you cannot explain a rerun, you do not control it.

Operational clarity is the real product¶

Separate orchestration code from business transformation code.

Version schedules and runbooks with the pipeline definitions.

Keep secrets out of DAG source when possible.

Define retry counts by failure type, not habit.

Provide a backfill path before first production release.

Provide a manual override path before first production release.

Document data availability windows for consumers.

Document who owns upstream dependencies.

Think again using the factory analogy.

The loading dock decides when raw parts arrive, the conveyor belt must move tasks in order, the showroom depends on on-time delivery, the reject bin should isolate bad runs, and the manifest must expose run history.

Simple, no?

Orchestration is mostly operational clarity packaged as code.

The best scheduler is the one nobody discusses on calm weeks.

The worst scheduler makes every rerun a special project.

Keep the graph simple enough to inspect quickly.

Keep alerts rich enough to act immediately.

Keep ownership obvious across teams.

That is production maturity.

Where this lives in the wild¶

Airflow still runs a huge share of batch analytics pipelines.
Dagster is growing in asset-centric and Python-heavy data teams.
Backfills are routine in finance, growth, and compliance workloads.
Sensors and event triggers matter where upstream readiness is variable.

Pause and recall¶

Why must orchestration stay separate from actual compute logic?
When does a retry help versus hurt?
Why can a green DAG still hide bad data outcomes?
What makes backfills dangerous without isolation?

Interview Q&A¶

Q: What belongs in an orchestrator? A: Dependencies, schedules, retries, and run metadata. Common wrong answer to avoid: All business logic should live in the DAG.

Q: Why are idempotent tasks so important? A: They make reruns and backfills predictable. Common wrong answer to avoid: Because they look cleaner in code review.

Q: When should you use sensors carefully? A: When waiting is needed, but worker capacity is limited. Common wrong answer to avoid: Use sensors for every upstream condition.

Q: How do you monitor orchestration well? A: Track freshness, row counts, lag, and task outcomes together. Common wrong answer to avoid: Green boxes are enough.

Apply now (5 min)¶

Sketch a three-step DAG for ingest, transform, and publish.
Add one retry policy and one no-retry rule.
Write the backfill strategy for the last seven days.
Define the signal that proves upstream data is ready.
Add one alert with owner, partition, and run context.

Bridge. Pipeline orchestrated. But where do results land? → 05