What is dbt?
dbt (data build tool) is an open-source command-line tool that enables analysts to transform data in their warehouse using SQL. It's the "T" in ELT.
The Problem dbt Solves
Before dbt, transforming data in a warehouse meant:
- Writing SQL queries in scripts or notebooks
- Manually running them in order (dependencies are fragile)
- No version control or testing
- No documentation or lineage graphs
- Hard to collaborate across teams
Example — Pre-dbt workflow at Swiggy:
# Run in order manually (fragile):
psql -f 01_clean_orders.sql
psql -f 02_join_customers.sql
psql -f 03_calculate_metrics.sql

If 02_join_customers.sql depends on 01_clean_orders.sql, you must remember to run them in order. If someone changes 01, they might break 02.
The dbt Solution
dbt turns SQL files into a directed acyclic graph (DAG) of dependencies. Each .sql file is a "model" (a table or view). dbt figures out the order and runs them automatically.
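To make the ref()-to-DAG idea concrete, here is a minimal Python sketch of how dependencies could be discovered by scanning model SQL for `{{ ref('...') }}` calls. dbt's real parser is Jinja-based and far more sophisticated; `find_refs` is a hypothetical helper for illustration only.

```python
import re

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def find_refs(sql: str) -> list[str]:
    """Return the model names referenced via {{ ref('...') }} in a SQL string."""
    return REF_PATTERN.findall(sql)

fact_orders_sql = """
SELECT order_id, customer_id, SUM(amount) AS total_amount
FROM {{ ref('stg_orders') }}
GROUP BY order_id, customer_id
"""

print(find_refs(fact_orders_sql))  # ['stg_orders']
```

Once every model's refs are known, the models form a graph, and the graph determines the run order.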
Example — dbt workflow:
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}
WHERE amount > 0
-- models/marts/fact_orders.sql
SELECT
order_id,
customer_id,
SUM(amount) AS total_amount
FROM {{ ref('stg_orders') }}
GROUP BY order_id, customer_id

Run dbt run and dbt:
- Runs stg_orders.sql first (no dependencies)
- Runs fact_orders.sql second (depends on stg_orders)
- Creates tables/views in your warehouse
Key Insight: dbt is not an ETL tool. It doesn't extract or load data. It only transforms data that's already in your warehouse (BigQuery, Snowflake, Redshift, etc.). It's the "T" in ELT.
dbt Models — Turning SQL into Tables
A model is a single .sql file that defines a transformation. When you run dbt run, dbt executes the SQL and materializes it as a table or view in your warehouse.
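As a rough illustration of what "materializes" means, the sketch below wraps a model's SELECT in the DDL dbt would emit. The `materialize` function is a hypothetical stand-in, not dbt's actual internals.

```python
# Illustrative only: dbt's materialization macros are more involved
# (transactions, schema handling, adapter-specific SQL).
def materialize(name: str, select_sql: str, materialized: str = "view") -> str:
    """Wrap a model's SELECT in CREATE VIEW/TABLE DDL."""
    if materialized == "view":
        return f"CREATE OR REPLACE VIEW {name} AS (\n{select_sql}\n)"
    if materialized == "table":
        return f"CREATE OR REPLACE TABLE {name} AS (\n{select_sql}\n)"
    raise ValueError(f"unknown materialization: {materialized}")

ddl = materialize("stg_orders", "SELECT * FROM raw.orders WHERE amount > 0")
print(ddl.splitlines()[0])  # CREATE OR REPLACE VIEW stg_orders AS (
```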
Example — Staging Model (Cleaning Raw Data)
-- models/staging/stg_orders.sql
{{
config(
materialized='view'
)
}}
SELECT
order_id,
customer_id,
product_id,
amount,
order_date,
-- Clean data
CASE
WHEN amount < 0 THEN 0
ELSE amount
END AS clean_amount,
-- Add derived fields
DATE_TRUNC(order_date, MONTH) AS order_month
FROM {{ source('raw', 'orders') }}
WHERE customer_id IS NOT NULL

What happens when you run dbt run:
- dbt generates CREATE VIEW stg_orders AS (SELECT ...)
- Executes it in your warehouse (BigQuery, Snowflake, etc.)
- Creates a view called stg_orders
Example — Mart Model (Business Logic)
-- models/marts/fact_orders_monthly.sql
{{
config(
materialized='table'
)
}}
SELECT
order_month,
customer_id,
COUNT(*) AS order_count,
SUM(clean_amount) AS total_revenue,
AVG(clean_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
GROUP BY order_month, customer_id

What happens:
- dbt knows this depends on stg_orders (via {{ ref('stg_orders') }})
- dbt runs stg_orders first (if needed)
- dbt generates CREATE TABLE fact_orders_monthly AS (SELECT ...)
- Creates a table in your warehouse
Materialization Strategies
| Strategy | What dbt Creates | When to Use | Example |
|----------|------------------|-------------|---------|
| view | A SQL view | Lightweight staging, no data duplication | stg_orders |
| table | A physical table | Final marts for dashboards (fast queries) | fact_orders_monthly |
| incremental | Appends new rows only | Large fact tables (millions of rows) | fact_events |
| ephemeral | CTE (no table/view) | Intermediate logic, not queried directly | clean_dates |
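The incremental strategy can be pictured with a toy Python sketch: only rows past the current high-water mark are appended, instead of rebuilding the whole table. Real dbt incremental models express this in SQL with `is_incremental()`; the function and data here are illustrative.

```python
# Toy model of incremental loading: append only rows newer than the
# maximum timestamp already present in the target table.
def incremental_load(existing: list[dict], source: list[dict]) -> list[dict]:
    high_water = max((r["event_ts"] for r in existing), default="")
    new_rows = [r for r in source if r["event_ts"] > high_water]
    return existing + new_rows

existing = [{"id": 1, "event_ts": "2024-01-01"}]
source = [
    {"id": 1, "event_ts": "2024-01-01"},  # already loaded -> skipped
    {"id": 2, "event_ts": "2024-01-02"},  # new -> appended
]
print(len(incremental_load(existing, source)))  # 2
```

For a table with millions of rows, processing only the new slice is the difference between minutes and hours per run.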
Writing SQL in dbt is like writing functions in Python. Each model is a function that returns a dataset. {{ ref('stg_orders') }} is like calling another function. dbt figures out the order to run them, just like Python resolves function dependencies.
Sources, Tests, and Documentation
Sources — Connecting to Raw Data
Sources define raw tables in your warehouse that dbt reads from (but doesn't create).
# models/sources.yml
version: 2
sources:
- name: raw
database: flipkart_raw
schema: production
tables:
- name: orders
- name: customers
- name: products

Usage in models:
-- Reference source tables
SELECT * FROM {{ source('raw', 'orders') }}

Benefit: If the source table moves (e.g., from flipkart_raw.production.orders to flipkart_raw.staging.orders), you update sources.yml once, not every model.
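The indirection can be sketched as a simple lookup: the physical location lives in one mapping (playing the role of sources.yml), so a move is a one-line change. The names mirror the example above, but the code itself is illustrative, not dbt's implementation.

```python
# One central mapping from logical (source, table) pairs to physical
# warehouse locations -- the role sources.yml plays in a dbt project.
SOURCES = {
    ("raw", "orders"): "flipkart_raw.production.orders",
    ("raw", "customers"): "flipkart_raw.production.customers",
}

def source(source_name: str, table: str) -> str:
    """Resolve a logical source reference to its physical table name."""
    return SOURCES[(source_name, table)]

print(source("raw", "orders"))  # flipkart_raw.production.orders
```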
Tests — Data Quality Checks
dbt has built-in tests to validate your data:
# models/schema.yml
version: 2
models:
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: amount
tests:
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000

Run tests:
dbt test

Output:
✓ unique_stg_orders_order_id .................... [PASS]
✓ not_null_stg_orders_order_id .................. [PASS]
✓ not_null_stg_orders_customer_id ............... [PASS]
✗ relationships_stg_orders_customer_id .......... [FAIL 47 rows]
dbt found 47 orders with customer_id values that don't exist in dim_customers — a data quality issue!
Custom Tests
-- tests/assert_no_negative_amounts.sql
SELECT *
FROM {{ ref('stg_orders') }}
WHERE amount < 0

If this query returns rows, the test fails.
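That pass/fail rule is easy to demonstrate with an in-memory SQLite table standing in for the warehouse. This is a sketch of the semantics only, not how dbt actually executes tests.

```python
import sqlite3

# A tiny stand-in for the warehouse: one stg_orders table with a bad row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                 [(1, 100.0), (2, -5.0), (3, 250.0)])

# The custom test: any returned row is a failure.
failing_rows = conn.execute(
    "SELECT * FROM stg_orders WHERE amount < 0"
).fetchall()

status = "FAIL" if failing_rows else "PASS"
print(status, len(failing_rows))  # FAIL 1
```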
Documentation
Add descriptions to your models:
# models/schema.yml
models:
- name: fact_orders_monthly
description: "Monthly aggregated orders by customer. Used for cohort analysis and LTV calculations."
columns:
- name: order_month
description: "First day of the month (YYYY-MM-01)"
- name: total_revenue
description: "Sum of order amounts (clean_amount) in rupees"

Generate docs:
dbt docs generate
dbt docs serve  # Opens a web UI with lineage graphs

dbt Docs automatically generates a lineage graph showing how all your models connect — from raw sources to final dashboards. This is invaluable for understanding data dependencies.
The dbt Workflow
1. Project Structure
my_dbt_project/
├── dbt_project.yml # Project config
├── models/
│ ├── staging/ # Clean raw data
│ │ ├── stg_orders.sql
│ │ ├── stg_customers.sql
│ ├── marts/ # Business logic
│ │ ├── fact_orders.sql
│ │ ├── dim_customers.sql
│ ├── sources.yml # Source definitions
│ ├── schema.yml # Tests & docs
├── tests/ # Custom tests
├── macros/ # Reusable SQL functions
2. Development Loop
# 1. Write a new model
vim models/staging/stg_orders.sql
# 2. Run the model
dbt run --models stg_orders
# 3. Test it
dbt test --models stg_orders
# 4. View compiled SQL
cat target/compiled/my_project/models/staging/stg_orders.sql
# 5. Run entire project
dbt run
dbt test
# 6. Generate docs
dbt docs generate
dbt docs serve

3. Production Deployment
In production, dbt runs on a schedule (via Airflow, GitHub Actions, or dbt Cloud):
# Airflow DAG
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG('dbt_daily', schedule_interval='0 2 * * *', start_date=datetime(2024, 1, 1))
dbt_run = BashOperator(
task_id='dbt_run',
bash_command='cd /opt/dbt && dbt run',
dag=dag
)
dbt_test = BashOperator(
task_id='dbt_test',
bash_command='cd /opt/dbt && dbt test',
dag=dag
)
dbt_run >> dbt_test

Every night at 2 AM, Airflow runs dbt run (to transform data) then dbt test (to validate it).
Razorpay uses dbt to transform payment data in BigQuery. Every hour, a dbt job runs:
- Loads raw payment events from APIs (via Fivetran)
- dbt cleans events (stg_payments)
- dbt aggregates daily metrics (fact_payments_daily)
- dbt tests data quality (no duplicates, amounts ≥ 0)
- Dashboards query the final tables
Why dbt is the Standard for Analytics Engineering
1. Version Control
All transformations are .sql files in Git. You get:
- Peer review (pull requests)
- Rollback (revert a commit)
- History (who changed what, when)
Before dbt: Analysts run ad-hoc queries in BigQuery console. No one knows what logic is in production.
With dbt: Every transformation is versioned, reviewed, and deployed via CI/CD.
2. Dependency Management
dbt automatically determines the order to run models based on {{ ref() }} and {{ source() }}.
Example:
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}
-- models/staging/stg_customers.sql
SELECT * FROM {{ source('raw', 'customers') }}
-- models/marts/fact_orders.sql
SELECT
o.*,
c.customer_tier
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c USING (customer_id)

Dependency graph:
raw.orders → stg_orders ↘
→ fact_orders
raw.customers → stg_customers ↗
Run dbt run and it automatically runs in this order.
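The run-order computation is a topological sort of that graph. Python's standard library can reproduce it for the example above (a sketch of the idea; dbt builds this graph from ref() and source() calls automatically):

```python
from graphlib import TopologicalSorter

# Each model maps to the models it ref()s (its upstream dependencies).
deps = {
    "stg_orders": set(),        # reads only from source()
    "stg_customers": set(),     # reads only from source()
    "fact_orders": {"stg_orders", "stg_customers"},
}

# static_order() yields dependencies before dependents.
order = list(TopologicalSorter(deps).static_order())
print(order[-1])  # fact_orders runs last
```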
3. Data Quality Testing
Built-in tests catch issues before they reach dashboards:
- unique: No duplicate order IDs
- not_null: Every order has a customer ID
- relationships: Every customer ID exists in dim_customers
- accepted_values: Order status is one of [pending, completed, cancelled]
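These generic tests compile down to plain SQL queries that return failing rows. A hedged sketch of what that compilation could look like (dbt's real test macros are Jinja-based and more general):

```python
# Illustrative generators for the SQL behind two built-in tests.
# A test passes when its query returns zero rows.
def unique_test_sql(model: str, column: str) -> str:
    return (f"SELECT {column}, COUNT(*) AS n FROM {model} "
            f"GROUP BY {column} HAVING COUNT(*) > 1")

def not_null_test_sql(model: str, column: str) -> str:
    return f"SELECT * FROM {model} WHERE {column} IS NULL"

print(unique_test_sql("stg_orders", "order_id"))
```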
Example — Catching bad data:
$ dbt test
✗ unique_stg_orders_order_id .... [FAIL 12 duplicate order_ids]

The analyst investigates and finds a bug in the source extraction. Without dbt, these duplicates would corrupt the dashboard.
4. Self-Documenting
dbt generates:
- Column descriptions (from schema.yml)
- Lineage graphs (showing how models connect)
- Query definitions (the SQL for each model)
Example lineage graph:
sources.raw.orders
↓
stg_orders
↓
fact_orders → [Dashboard: Daily Revenue]
↓
fact_orders_monthly → [Dashboard: Cohort Analysis]
Analysts can click on fact_orders and see:
- What it depends on (upstream)
- What depends on it (downstream)
- The SQL that creates it
- Test results
5. Modularity
Break complex logic into reusable models:
Before dbt (one giant query):
-- 500-line query with nested CTEs
WITH clean_orders AS (...),
clean_customers AS (...),
joined AS (...),
aggregated AS (...)
SELECT * FROM aggregated;

With dbt (modular models):
-- models/staging/stg_orders.sql (50 lines)
-- models/staging/stg_customers.sql (40 lines)
-- models/marts/fact_orders.sql (30 lines)

Each model is small, testable, and reusable.
Analytics Engineering: dbt created a new role — the Analytics Engineer. They're analysts who write SQL, but they use software engineering practices (Git, tests, CI/CD) to build production-grade data pipelines.