Building an ELT pipeline with dbt and Airflow: from zero to a production dashboard
A concrete roadmap from scattered data across multiple systems to a data warehouse you can trust — with dbt for transforms, Airflow for orchestration, and data contracts so teams don't break each other's dashboards.
Most businesses don't lack data — they lack a reliable path from raw data to a number that decision-makers can look at and trust. This article describes the roadmap we recommend when building the Data Analytics layer in a Pilot Build: from data scattered across CRM, sales systems, and Excel files, to a structured warehouse with tests and self-updating dashboards.
Why ELT, not ETL
Traditional ETL (Extract-Transform-Load) transforms data before loading it into the warehouse — meaning the transform logic lives in an intermediate system that's often hard to test and hard to change. ELT (Extract-Load-Transform) loads raw data into the warehouse first, then transforms it with SQL directly inside the warehouse. With modern warehouses (BigQuery, Snowflake, ClickHouse, or Postgres for small-to-medium scale), compute is cheap enough to make this practical — and the benefit is that all transform logic becomes code that can be versioned, tested, and reviewed like regular software.
Three layers in the warehouse — staging, intermediate, marts
dbt (data build tool) organizes transforms into SQL models, typically split into three layers:
- Staging: 1-1 with the raw source — only renames columns, casts data types, and filters out junk records. No business logic here
- Intermediate: where tables from multiple sources are joined, deduplicated, or partially aggregated — for example, merging orders from a POS system and an online system into a single unified orders table
- Marts: final tables, designed specifically for a particular group of business questions (revenue by store, monthly churn...) — these are the tables that BI tools connect to
Practical rule: if a number appears on 2 different dashboards, it must be computed in exactly one place — in a mart model — not recomputed with a separate formula in each dashboard.
Airflow's role — not to run SQL, but to orchestrate
A common misconception is that Airflow "runs the transforms". In reality, Airflow only orchestrates: it calls dbt run commands in the right order, at the right time, after source data is ready — and handles what happens when a step fails (retry, alert, backfill). The transform logic itself still lives in dbt. Keeping these two roles clearly separated helps you:
- Swap orchestrators (Airflow → Dagster, or vice versa) without rewriting transforms
- Run and test transforms locally without Airflow running
- Debug more easily — "the transform is wrong" and "the job didn't run" are two different kinds of failures, at two different layers
Data contracts — agreements between teams
When multiple teams contribute to one warehouse, the easiest thing that can go wrong is one team changing a source schema (renaming a column, changing a data type) and breaking another team's dashboard — usually discovered days later, when someone asks why the numbers look wrong. A data contract is a simple convention: every important source table has an explicitly declared schema (dbt sources.yml), along with automated tests (not_null, unique, relationships) that run every time the pipeline executes. If a source changes and breaks the contract, the pipeline fails right at the staging step — before bad data reaches any dashboard.
A practical rollout roadmap — 4 phases
Phase 1 — Audit and data contracts (1-2 weeks)
List out all existing data sources, assess their reliability, and define the expected schema for each one. This is the cheapest phase but often produces the biggest findings — many "facts" about the data that the operations team believes turn out to be wrong once checked directly.
Phase 2 — Staging layer + minimal pipeline (2-3 weeks)
Bring the 2-3 most important sources into the warehouse, build staging models with basic tests, and a simple Airflow DAG that runs daily. The goal is to have one working end-to-end loop — even if it only serves a single business question — before expanding further.
Phase 3 — Marts + first dashboard (2-3 weeks)
Build 1-2 mart models that answer specific business questions (e.g., revenue and cancellation rate by day/region), connected to a dashboard (Superset, Metabase, or Looker Studio). This is the point where the value becomes tangible to end users — and usually where the next expansion request comes from.
Phase 4 — Expand + self-service (ongoing)
Add new data sources as needed, train the business team to build their own dashboards on existing marts (no complex SQL required), and set up alerting when dbt tests fail — this is the Optimization Loop applied to data infrastructure.
A note on tooling
dbt and Airflow are popular choices with a solid ecosystem, but they're not the only option — at smaller scale, a simple cron job calling a Python script + dbt core, or Dagster instead of Airflow, may be a more sensible choice with less operational overhead. Tooling should be chosen once data volume and the number of pipelines to manage are clear — not before.
Conclusion
A good ELT pipeline isn't judged by which tools it uses, but by whether: when data is wrong, the pipeline flags it right at the source, and when a new business question comes up, you don't need to rebuild from scratch. This is the standard we use when designing the Data Analytics layer for every Pilot Build — starting small, but with a structure solid enough to scale without a rebuild, as in the ETL pipeline merging telematics data for 600 trucks that we built.