What is dbt (data build tool)? Complete Guide for Analysts

dbt turns SQL SELECT statements into a production-grade data pipeline. Write a query in a .sql file, run dbt run, and it becomes a table in your warehouse, with tests, docs, and version control built in.


What is dbt?

dbt (data build tool) is an open-source command-line tool that enables analysts to transform data in their warehouse using SQL. It's the "T" in ELT.

The Problem dbt Solves

Before dbt, transforming data in a warehouse meant:

  • Writing SQL queries in scripts or notebooks
  • Manually running them in order (dependencies are fragile)
  • No version control or testing
  • No documentation or lineage graphs
  • Hard to collaborate across teams

Example — Pre-dbt workflow at Swiggy:

$ terminalBash
# Run in order manually (fragile):
psql -f 01_clean_orders.sql
psql -f 02_join_customers.sql
psql -f 03_calculate_metrics.sql

If 02_join_customers.sql depends on 01_clean_orders.sql, you must remember to run them in order. If someone changes 01, they might break 02.

The dbt Solution

dbt turns SQL files into a directed acyclic graph (DAG) of dependencies. Each .sql file is a "model" (a table or view). dbt figures out the order and runs them automatically.

Example — dbt workflow:

query.sqlSQL
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}
WHERE amount > 0

-- models/marts/fact_orders.sql
SELECT
  order_id,
  customer_id,
  SUM(amount) AS total_amount
FROM {{ ref('stg_orders') }}
GROUP BY order_id, customer_id

Run dbt run and dbt:

  1. Runs stg_orders.sql first (no dependencies)
  2. Runs fact_orders.sql second (depends on stg_orders)
  3. Creates tables/views in your warehouse
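Conceptually, dbt builds that run order by scanning each model's SQL for {{ ref('...') }} calls and topologically sorting the results. A minimal Python sketch of the idea (not dbt's actual implementation; the model SQL strings are invented to mirror the example above):

```python
import re

# Hypothetical model SQL, keyed by model name (mirrors the example above)
models = {
    "stg_orders": "SELECT * FROM {{ source('raw', 'orders') }} WHERE amount > 0",
    "fact_orders": "SELECT order_id, SUM(amount) FROM {{ ref('stg_orders') }} GROUP BY order_id",
}

def deps(sql):
    """Extract the model names referenced via {{ ref('...') }}."""
    return re.findall(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", sql)

def run_order(models):
    """Depth-first topological sort: a model runs only after its refs."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps(models[name]):
            visit(dep)
        order.append(name)
    for name in models:
        visit(name)
    return order

print(run_order(models))  # ['stg_orders', 'fact_orders']
```

Because fact_orders refs stg_orders, stg_orders always comes first, no matter how the files are listed.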
Info

Key Insight: dbt is not an ETL tool. It doesn't extract or load data. It only transforms data that's already in your warehouse (BigQuery, Snowflake, Redshift, etc.). It's the "T" in ELT.


dbt Models — Turning SQL into Tables

A model is a single .sql file that defines a transformation. When you run dbt run, dbt executes the SQL and materializes it as a table or view in your warehouse.

Example — Staging Model (Cleaning Raw Data)

query.sqlSQL
-- models/staging/stg_orders.sql
{{
  config(
    materialized='view'
  )
}}

SELECT
  order_id,
  customer_id,
  product_id,
  amount,
  order_date,
  -- Clean data
  CASE
    WHEN amount < 0 THEN 0
    ELSE amount
  END AS clean_amount,
  -- Add derived fields
  DATE_TRUNC(order_date, MONTH) AS order_month
FROM {{ source('raw', 'orders') }}
WHERE customer_id IS NOT NULL

What happens when you run dbt run:

  1. dbt generates CREATE VIEW stg_orders AS (SELECT ...)
  2. Executes it in your warehouse (BigQuery, Snowflake, etc.)
  3. Creates a view called stg_orders

Example — Mart Model (Business Logic)

query.sqlSQL
-- models/marts/fact_orders_monthly.sql
{{
  config(
    materialized='table'
  )
}}

SELECT
  order_month,
  customer_id,
  COUNT(*) AS order_count,
  SUM(clean_amount) AS total_revenue,
  AVG(clean_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
GROUP BY order_month, customer_id

What happens:

  1. dbt knows this depends on stg_orders (via {{ ref('stg_orders') }})
  2. dbt runs stg_orders first (if needed)
  3. dbt generates CREATE TABLE fact_orders_monthly AS (SELECT ...)
  4. Creates a table in your warehouse

Materialization Strategies

| Strategy | What dbt Creates | When to Use | Example |
|----------|------------------|-------------|---------|
| view | A SQL view | Lightweight staging, no data duplication | stg_orders |
| table | A physical table | Final marts for dashboards (fast queries) | fact_orders_monthly |
| incremental | Appends new rows only | Large fact tables (millions of rows) | fact_events |
| ephemeral | CTE (no table/view) | Intermediate logic, not queried directly | clean_dates |
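For the first two strategies, the choice mostly changes the DDL dbt wraps around your SELECT. A rough Python sketch of that wrapping (illustrative only; dbt's real materializations also handle schemas, transactions, and incremental merge logic):

```python
def materialize(name, select_sql, materialized="view"):
    """Wrap a model's SELECT in the DDL implied by its materialization.
    Simplified sketch, not dbt's actual code."""
    if materialized == "view":
        return f"CREATE OR REPLACE VIEW {name} AS ({select_sql})"
    if materialized == "table":
        return f"CREATE OR REPLACE TABLE {name} AS ({select_sql})"
    raise ValueError(f"unsupported materialization: {materialized}")

print(materialize("fact_orders_monthly", "SELECT ...", "table"))
# CREATE OR REPLACE TABLE fact_orders_monthly AS (SELECT ...)
```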

Think of it this way...

Writing SQL in dbt is like writing functions in Python. Each model is a function that returns a dataset. {{ ref('stg_orders') }} is like calling another function. dbt figures out the order to run them, just like Python resolves function dependencies.
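To make the analogy concrete, here is that dependency structure as plain Python functions (hypothetical sample data, purely to illustrate the mental model):

```python
def stg_orders():
    """Staging 'model': filter raw rows (hypothetical sample data)."""
    raw = [{"order_id": 1, "amount": 500}, {"order_id": 2, "amount": -50}]
    return [r for r in raw if r["amount"] > 0]

def fact_orders():
    """Mart 'model': calling stg_orders() plays the role of {{ ref('stg_orders') }}."""
    return sum(r["amount"] for r in stg_orders())

print(fact_orders())  # 500
```

Just as Python resolves which function to call first, dbt resolves which model to build first.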


Sources, Tests, and Documentation

Sources — Connecting to Raw Data

Sources define raw tables in your warehouse that dbt reads from (but doesn't create).

config.ymlYAML
# models/sources.yml
version: 2

sources:
  - name: raw
    database: flipkart_raw
    schema: production
    tables:
      - name: orders
      - name: customers
      - name: products

Usage in models:

query.sqlSQL
-- Reference source tables
SELECT * FROM {{ source('raw', 'orders') }}

Benefit: If the source table moves (e.g., from flipkart_raw.production.orders to flipkart_raw.staging.orders), you update sources.yml once, not every model.

Tests — Data Quality Checks

dbt has built-in tests to validate your data:

config.ymlYAML
# models/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000

Run tests:

$ terminalBash
dbt test

Output:

✓ unique_stg_orders_order_id .................... [PASS]
✓ not_null_stg_orders_order_id .................. [PASS]
✓ not_null_stg_orders_customer_id ............... [PASS]
✗ relationships_stg_orders_customer_id .......... [FAIL 47 rows]

dbt found 47 orders with customer_id values that don't exist in dim_customers — a data quality issue!

Custom Tests

query.sqlSQL
-- tests/assert_no_negative_amounts.sql
SELECT *
FROM {{ ref('stg_orders') }}
WHERE amount < 0

If this query returns rows, the test fails.
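That pass/fail contract is easy to emulate: a custom test is just a query whose result set must be empty. A small Python sketch of the same rule (sample rows are made up):

```python
def assert_no_negative_amounts(rows):
    """Return the 'failing' rows, mirroring the custom test query above."""
    return [r for r in rows if r["amount"] < 0]

rows = [{"order_id": 1, "amount": 500}, {"order_id": 2, "amount": -50}]
failures = assert_no_negative_amounts(rows)
print("PASS" if not failures else f"FAIL {len(failures)} rows")  # FAIL 1 rows
```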

Documentation

Add descriptions to your models:

config.ymlYAML
# models/schema.yml
models:
  - name: fact_orders_monthly
    description: "Monthly aggregated orders by customer. Used for cohort analysis and LTV calculations."
    columns:
      - name: order_month
        description: "First day of the month (YYYY-MM-01)"
      - name: total_revenue
        description: "Sum of order amounts (clean_amount) in rupees"

Generate docs:

$ terminalBash
dbt docs generate
dbt docs serve  # Opens a web UI with lineage graphs
Info

dbt Docs automatically generates a lineage graph showing how all your models connect — from raw sources to final dashboards. This is invaluable for understanding data dependencies.


The dbt Workflow

1. Project Structure

my_dbt_project/
├── dbt_project.yml            # Project config
├── models/
│   ├── staging/               # Clean raw data
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   ├── marts/                 # Business logic
│   │   ├── fact_orders.sql
│   │   ├── dim_customers.sql
│   ├── sources.yml            # Source definitions
│   ├── schema.yml             # Tests & docs
├── tests/                     # Custom tests
├── macros/                    # Reusable SQL functions

2. Development Loop

$ terminalBash
# 1. Write a new model
vim models/staging/stg_orders.sql

# 2. Run the model
dbt run --select stg_orders

# 3. Test it
dbt test --select stg_orders

# 4. View compiled SQL
cat target/compiled/my_project/models/staging/stg_orders.sql

# 5. Run entire project
dbt run
dbt test

# 6. Generate docs
dbt docs generate
dbt docs serve

3. Production Deployment

In production, dbt runs on a schedule (via Airflow, GitHub Actions, or dbt Cloud):

dag.pyPython
# Airflow DAG (filename is illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'dbt_daily',
    schedule_interval='0 2 * * *',
    start_date=datetime(2024, 1, 1),  # required for Airflow to schedule the DAG
)

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='cd /opt/dbt && dbt run',
    dag=dag
)

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='cd /opt/dbt && dbt test',
    dag=dag
)

dbt_run >> dbt_test

Every night at 2 AM, Airflow runs dbt run (to transform data) then dbt test (to validate it).

Think of it this way...

Razorpay uses dbt to transform payment data in BigQuery. Every hour, the pipeline runs:

  1. Loads raw payment events from APIs (via Fivetran)
  2. dbt cleans events (stg_payments)
  3. dbt aggregates daily metrics (fact_payments_daily)
  4. dbt tests data quality (no duplicates, amounts ≥ 0)
  5. Dashboards query the final tables

Why dbt is the Standard for Analytics Engineering

1. Version Control

All transformations are .sql files in Git. You get:

  • Peer review (pull requests)
  • Rollback (revert a commit)
  • History (who changed what, when)

Before dbt: Analysts run ad-hoc queries in BigQuery console. No one knows what logic is in production.

With dbt: Every transformation is versioned, reviewed, and deployed via CI/CD.

2. Dependency Management

dbt automatically determines the order to run models based on {{ ref() }} and {{ source() }}.

Example:

query.sqlSQL
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}

-- models/staging/stg_customers.sql
SELECT * FROM {{ source('raw', 'customers') }}

-- models/marts/fact_orders.sql
SELECT
  o.*,
  c.customer_tier
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c USING (customer_id)

Dependency graph:

raw.orders    → stg_orders    ↘
                               → fact_orders
raw.customers → stg_customers ↗

Run dbt run and it automatically runs in this order.

3. Data Quality Testing

Built-in tests catch issues before they reach dashboards:

  • unique: No duplicate order IDs
  • not_null: Every order has a customer ID
  • relationships: Every customer ID exists in dim_customers
  • accepted_values: Order status is one of [pending, completed, cancelled]
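Under the hood, each generic test also compiles to a query that returns violating rows. A simplified Python version of the first two checks (sample data invented for illustration):

```python
from collections import Counter

def unique_violations(rows, column):
    """Rows whose value in `column` appears more than once."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] > 1]

def not_null_violations(rows, column):
    """Rows where `column` is NULL (None)."""
    return [r for r in rows if r[column] is None]

rows = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 1, "customer_id": None},  # duplicate order_id AND null customer
]
print(len(unique_violations(rows, "order_id")),
      len(not_null_violations(rows, "customer_id")))  # 2 1
```

A test passes when its violation list is empty, exactly like the [PASS]/[FAIL] output shown below.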

Example — Catching bad data:

$ terminalBash
$ dbt test

✗ unique_stg_orders_order_id .... [FAIL 12 duplicate order_ids]

The analyst investigates and finds a bug in the source extraction. Without dbt, these duplicates would corrupt the dashboard.

4. Self-Documenting

dbt generates:

  • Column descriptions (from schema.yml)
  • Lineage graphs (showing how models connect)
  • Query definitions (the SQL for each model)

Example lineage graph:

sources.raw.orders
    ↓
stg_orders
    ↓
fact_orders → [Dashboard: Daily Revenue]
    ↓
fact_orders_monthly → [Dashboard: Cohort Analysis]

Analysts can click on fact_orders and see:

  • What it depends on (upstream)
  • What depends on it (downstream)
  • The SQL that creates it
  • Test results

5. Modularity

Break complex logic into reusable models:

Before dbt (one giant query):

query.sqlSQL
-- 500-line query with nested CTEs
WITH clean_orders AS (...),
     clean_customers AS (...),
     joined AS (...),
     aggregated AS (...)
SELECT * FROM aggregated;

With dbt (modular models):

query.sqlSQL
-- models/staging/stg_orders.sql (50 lines)
-- models/staging/stg_customers.sql (40 lines)
-- models/marts/fact_orders.sql (30 lines)

Each model is small, testable, and reusable.

Info

Analytics Engineering: dbt created a new role — the Analytics Engineer. They're analysts who write SQL, but they use software engineering practices (Git, tests, CI/CD) to build production-grade data pipelines.
