03. Transformations with dbt and Spark¶
⏱️ Estimated time: 24 min | Level: intermediate
ELI5 callback: In the car factory, the loading dock sets arrival rhythm, the conveyor belt sets work rhythm, the showroom exposes finished output, the reject bin protects trust, and the manifest explains every move. This file teaches who cleans and reshapes the parts.
Start with business meaning, not syntax¶
See. Transformation is where raw records become business meaning.
ETL transforms before loading into the analytical store.
ELT loads first, then transforms inside the warehouse or lakehouse.
ELT usually wins when storage is cheap and SQL talent is abundant.
ETL still matters when privacy, cost, or preprocessing demands it.
So what to do?
Keep raw data untouched whenever policy allows.
That gives you replay and audit flexibility.
Then build clean layers with clear contracts.
dbt shines for SQL-first teams and warehouse-native models.
Spark shines when data volume or logic outgrows simple SQL.
Python and Scala bring power, but also more surface area.
SQL stays easier for analysts to review.
Code stays better for custom parsing and heavy algorithms.
Simple, no?
Pick the smallest engine that meets scale and logic needs.
Transformation should reveal business entities, not tool preference.
Naming, testing, and freshness matter as much as syntax.
Model layers and job boundaries clearly¶
dbt organizes logic as models with dependencies and tests.
That creates readable DAGs inside the transformation layer.
Incremental models save money when only recent partitions change.
Snapshots preserve slowly changing records for audits and dimensions.
Spark jobs excel at wide scans and complex joins.
Partitioning and shuffle control decide whether jobs finish or crawl.
Now watch.
A bad join key can burn huge money quickly.
A good data model cuts compute more than clever tuning.
┌──────────┐ ┌────────────┐ ┌───────────┐ │ Raw data │──▶│ Staging │──▶│ Mart SQL │ └──────────┘ └────────────┘ └───────────┘ │ └──────────────▶┌────────────┐ │ Spark job │──▶ Curated files └────────────┘
Use staging models to standardize types and field names.
Use intermediate models to express reusable business logic.
Use marts to serve final team-facing datasets.
Keep one responsibility per model where possible.
Giant models become impossible to debug.
Giant Spark jobs become impossible to trust.
Refactor logic when lineage becomes muddy.
The goal is explainable transformation, not cleverness.
Future you must read this at midnight.
Correctness beats clever performance tricks¶
Null handling breaks dashboards more often than cluster size.
Timezone handling breaks trust even faster.
Late arriving records can force restatement or merge logic.
Deduplication rules must be explicit and testable.
Surrogate keys help joins, but natural keys explain business reality.
Use both when needed.
See.
Push simple filters earlier in the plan.
Avoid exploding nested data unless consumers truly need it.
Cache only when repeated access justifies memory use.
Prefer idempotent writes for reruns.
Prefer append plus merge patterns when mutation volume is manageable.
Watch skew on hot keys.
One giant customer can ruin a distributed job.
Monitor runtime, scan size, row counts, and failures together.
So what to do?
Tie every transform to an owner and an SLA.
Unowned models become silent liabilities.
Choose tools by shape of work¶
Use dbt when most logic is relational and warehouse-resident.
Use Spark when files are huge or logic is compute-heavy.
Use both when the estate genuinely needs both.
But draw the boundary clearly.
Do not transform the same concept in three places.
That creates semantic drift.
Centralize canonical definitions for revenue, user, and order first.
Publish model contracts before scaling contributors.
Think again using the factory analogy.
The loading dock lands raw material, the conveyor belt reshapes it step by step, the showroom should only see stable models, the reject bin catches invalid rows, and the manifest documents every dependency.
Simple, no?
Transformation work is successful when downstream teams stop asking basic questions.
If every chart needs explanation, your models are weak.
If reruns terrify the team, your jobs are fragile.
If only one engineer understands the graph, the design failed.
Favor readability over wizardry.
Favor repeatability over heroics.
That is how data logic survives scale.
Where this lives in the wild¶
- dbt is common in warehouse-first analytics teams.
- Spark remains common for large file processing and heavy joins.
- Lakehouse platforms often mix SQL models with Spark notebooks.
- Mature teams standardize semantic definitions before scaling marts.
Pause and recall¶
- When does ELT beat ETL clearly?
- Why are model layers better than one giant transform?
- What makes skew dangerous in Spark?
- Why does ownership matter for data models?
Interview Q&A¶
Q: When would you pick dbt over Spark? A: When transformation is mostly relational and warehouse-native. Common wrong answer to avoid: dbt is only for small datasets.
Q: Why keep raw data untouched? A: It preserves replay, audit, and alternative modeling options. Common wrong answer to avoid: Because analysts enjoy messy tables.
Q: What usually breaks trust first in transformations? A: Null, timezone, and duplicate handling mistakes. Common wrong answer to avoid: Only slow runtime matters.
Q: How do you avoid semantic drift? A: Publish canonical definitions and keep one owner per concept. Common wrong answer to avoid: Let each team define revenue separately.
Apply now (5 min)¶
Pick one raw table and define staging, intermediate, and mart outputs. Decide whether dbt, Spark, or both should own each step. Write one key test for duplicates or nulls. Mark one field that needs timezone or deduplication care. Explain how you would rerun the flow safely.
Bridge. Transforms defined. But who schedules and monitors them? → 04