What is dbt (data build tool)? Complete Guide for Analysts

dbt turns SQL SELECT statements into a production-grade data pipeline. Write a query in a .sql file, run dbt run, and it becomes a table in your warehouse, with tests, docs, and version control built in.


What is dbt?

dbt (data build tool) is an open-source command-line tool that enables analysts to transform data in their warehouse using SQL. It's the "T" in ELT.

The Problem dbt Solves

Before dbt, transforming data in a warehouse meant:

  • Writing SQL queries in scripts or notebooks
  • Manually running them in order (dependencies are fragile)
  • No version control or testing
  • No documentation or lineage graphs
  • Hard to collaborate across teams

Example — Pre-dbt workflow at Swiggy:

$ terminalBash
# Run in order manually (fragile):
psql -f 01_clean_orders.sql
psql -f 02_join_customers.sql
psql -f 03_calculate_metrics.sql

If 02_join_customers.sql depends on 01_clean_orders.sql, you must remember to run them in order. If someone changes 01, they might break 02.

The dbt Solution

dbt turns SQL files into a directed acyclic graph (DAG) of dependencies. Each .sql file is a "model" (a table or view). dbt figures out the order and runs them automatically.

Example — dbt workflow:

query.sqlSQL
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}
WHERE amount > 0

-- models/marts/fact_orders.sql
SELECT
  order_id,
  customer_id,
  SUM(amount) AS total_amount
FROM {{ ref('stg_orders') }}
GROUP BY order_id, customer_id

Run dbt run and dbt:

  1. Runs stg_orders.sql first (no dependencies)
  2. Runs fact_orders.sql second (depends on stg_orders)
  3. Creates tables/views in your warehouse
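Conceptually, dbt builds that run order by scanning each model's SQL for {{ ref('...') }} calls and topologically sorting the results. A minimal Python sketch of the idea (not dbt's actual implementation; the model SQL strings are invented to mirror the example above):

```python
import re

# Hypothetical model SQL, keyed by model name (mirrors the example above)
models = {
    "stg_orders": "SELECT * FROM {{ source('raw', 'orders') }} WHERE amount > 0",
    "fact_orders": "SELECT order_id, SUM(amount) FROM {{ ref('stg_orders') }} GROUP BY order_id",
}

def deps(sql):
    """Extract the model names referenced via {{ ref('...') }}."""
    return re.findall(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", sql)

def run_order(models):
    """Depth-first topological sort: a model runs only after its refs."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps(models[name]):
            visit(dep)
        order.append(name)
    for name in models:
        visit(name)
    return order

print(run_order(models))  # ['stg_orders', 'fact_orders']
```

Because fact_orders refs stg_orders, stg_orders always comes first, no matter how the files are listed.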
Info

Key Insight: dbt is not an ETL tool. It doesn't extract or load data. It only transforms data that's already in your warehouse (BigQuery, Snowflake, Redshift, etc.). It's the "T" in ELT.


dbt Models — Turning SQL into Tables

A model is a single .sql file that defines a transformation. When you run dbt run, dbt executes the SQL and materializes it as a table or view in your warehouse.

Example — Staging Model (Cleaning Raw Data)

query.sqlSQL
-- models/staging/stg_orders.sql
{{
  config(
    materialized='view'
  )
}}

SELECT
  order_id,
  customer_id,
  product_id,
  amount,
  order_date,
  -- Clean data
  CASE
    WHEN amount < 0 THEN 0
    ELSE amount
  END AS clean_amount,
  -- Add derived fields
  DATE_TRUNC(order_date, MONTH) AS order_month
FROM {{ source('raw', 'orders') }}
WHERE customer_id IS NOT NULL

What happens when you run dbt run:

  1. dbt generates CREATE VIEW stg_orders AS (SELECT ...)
  2. Executes it in your warehouse (BigQuery, Snowflake, etc.)
  3. Creates a view called stg_orders

Example — Mart Model (Business Logic)

query.sqlSQL
-- models/marts/fact_orders_monthly.sql
{{
  config(
    materialized='table'
  )
}}

SELECT
  order_month,
  customer_id,
  COUNT(*) AS order_count,
  SUM(clean_amount) AS total_revenue,
  AVG(clean_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
GROUP BY order_month, customer_id

What happens:

  1. dbt knows this depends on stg_orders (via {{ ref('stg_orders') }})
  2. dbt runs stg_orders first (if needed)
  3. dbt generates CREATE TABLE fact_orders_monthly AS (SELECT ...)
  4. Creates a table in your warehouse

Materialization Strategies

| Strategy | What dbt Creates | When to Use | Example |
|----------|------------------|-------------|---------|
| view | A SQL view | Lightweight staging, no data duplication | stg_orders |
| table | A physical table | Final marts for dashboards (fast queries) | fact_orders_monthly |
| incremental | Appends new rows only | Large fact tables (millions of rows) | fact_events |
| ephemeral | CTE (no table/view) | Intermediate logic, not queried directly | clean_dates |
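For the first two strategies, the choice mostly changes the DDL dbt wraps around your SELECT. A rough Python sketch of that wrapping (illustrative only; dbt's real materializations also handle schemas, transactions, and incremental merge logic):

```python
def materialize(name, select_sql, materialized="view"):
    """Wrap a model's SELECT in the DDL implied by its materialization.
    Simplified sketch, not dbt's actual code."""
    if materialized == "view":
        return f"CREATE OR REPLACE VIEW {name} AS ({select_sql})"
    if materialized == "table":
        return f"CREATE OR REPLACE TABLE {name} AS ({select_sql})"
    raise ValueError(f"unsupported materialization: {materialized}")

print(materialize("fact_orders_monthly", "SELECT ...", "table"))
# CREATE OR REPLACE TABLE fact_orders_monthly AS (SELECT ...)
```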

Think of it this way...

Writing SQL in dbt is like writing functions in Python. Each model is a function that returns a dataset. {{ ref('stg_orders') }} is like calling another function. dbt figures out the order to run them, just like Python resolves function dependencies.
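To make the analogy concrete, here is that dependency structure as plain Python functions (hypothetical sample data, purely to illustrate the mental model):

```python
def stg_orders():
    """Staging 'model': filter raw rows (hypothetical sample data)."""
    raw = [{"order_id": 1, "amount": 500}, {"order_id": 2, "amount": -50}]
    return [r for r in raw if r["amount"] > 0]

def fact_orders():
    """Mart 'model': calling stg_orders() plays the role of {{ ref('stg_orders') }}."""
    return sum(r["amount"] for r in stg_orders())

print(fact_orders())  # 500
```

Just as Python resolves which function to call first, dbt resolves which model to build first.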


Sources, Tests, and Documentation

Sources — Connecting to Raw Data

Sources define raw tables in your warehouse that dbt reads from (but doesn't create).

config.ymlYAML
# models/sources.yml
version: 2

sources:
  - name: raw
    database: flipkart_raw
    schema: production
    tables:
      - name: orders
      - name: customers
      - name: products

Usage in models:

query.sqlSQL
-- Reference source tables
SELECT * FROM {{ source('raw', 'orders') }}

Benefit: If the source table moves (e.g., from flipkart_raw.production.orders to flipkart_raw.staging.orders), you update sources.yml once, not every model.

Tests — Data Quality Checks

dbt has built-in tests to validate your data:

config.ymlYAML
# models/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000

Run tests:

$ terminalBash
dbt test

Output:

✓ unique_stg_orders_order_id .................... [PASS]
✓ not_null_stg_orders_order_id .................. [PASS]
✓ not_null_stg_orders_customer_id ............... [PASS]
✗ relationships_stg_orders_customer_id .......... [FAIL 47 rows]

dbt found 47 orders with customer_id values that don't exist in dim_customers — a data quality issue!

Custom Tests

query.sqlSQL
-- tests/assert_no_negative_amounts.sql
SELECT *
FROM {{ ref('stg_orders') }}
WHERE amount < 0

If this query returns rows, the test fails.
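That pass/fail contract is easy to emulate: a custom test is just a query whose result set must be empty. A small Python sketch of the same rule (sample rows are made up):

```python
def assert_no_negative_amounts(rows):
    """Return the 'failing' rows, mirroring the custom test query above."""
    return [r for r in rows if r["amount"] < 0]

rows = [{"order_id": 1, "amount": 500}, {"order_id": 2, "amount": -50}]
failures = assert_no_negative_amounts(rows)
print("PASS" if not failures else f"FAIL {len(failures)} rows")  # FAIL 1 rows
```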

Documentation

Add descriptions to your models:

config.ymlYAML
# models/schema.yml
models:
  - name: fact_orders_monthly
    description: "Monthly aggregated orders by customer. Used for cohort analysis and LTV calculations."
    columns:
      - name: order_month
        description: "First day of the month (YYYY-MM-01)"
      - name: total_revenue
        description: "Sum of order amounts (clean_amount) in rupees"

Generate docs:

$ terminalBash
dbt docs generate
dbt docs serve  # Opens a web UI with lineage graphs
Info

dbt Docs automatically generates a lineage graph showing how all your models connect — from raw sources to final dashboards. This is invaluable for understanding data dependencies.


The dbt Workflow

1. Project Structure

my_dbt_project/
├── dbt_project.yml            # Project config
├── models/
│   ├── staging/               # Clean raw data
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   ├── marts/                 # Business logic
│   │   ├── fact_orders.sql
│   │   ├── dim_customers.sql
│   ├── sources.yml            # Source definitions
│   ├── schema.yml             # Tests & docs
├── tests/                     # Custom tests
├── macros/                    # Reusable SQL functions

2. Development Loop

$ terminalBash
# 1. Write a new model
vim models/staging/stg_orders.sql

# 2. Run the model
dbt run --select stg_orders

# 3. Test it
dbt test --select stg_orders

# 4. View compiled SQL
cat target/compiled/my_project/models/staging/stg_orders.sql

# 5. Run entire project
dbt run
dbt test

# 6. Generate docs
dbt docs generate
dbt docs serve

3. Production Deployment

In production, dbt runs on a schedule (via Airflow, GitHub Actions, or dbt Cloud):

dag.pyPython
# Airflow DAG (filename is illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'dbt_daily',
    schedule_interval='0 2 * * *',
    start_date=datetime(2024, 1, 1),  # required for Airflow to schedule the DAG
)

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='cd /opt/dbt && dbt run',
    dag=dag
)

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='cd /opt/dbt && dbt test',
    dag=dag
)

dbt_run >> dbt_test

Every night at 2 AM, Airflow runs dbt run (to transform data) then dbt test (to validate it).

Think of it this way...

Razorpay uses dbt to transform payment data in BigQuery. Every hour, the pipeline runs:

  1. Loads raw payment events from APIs (via Fivetran)
  2. dbt cleans events (stg_payments)
  3. dbt aggregates daily metrics (fact_payments_daily)
  4. dbt tests data quality (no duplicates, amounts ≥ 0)
  5. Dashboards query the final tables

Why dbt is the Standard for Analytics Engineering

1. Version Control

All transformations are .sql files in Git. You get:

  • Peer review (pull requests)
  • Rollback (revert a commit)
  • History (who changed what, when)

Before dbt: Analysts run ad-hoc queries in BigQuery console. No one knows what logic is in production.

With dbt: Every transformation is versioned, reviewed, and deployed via CI/CD.

2. Dependency Management

dbt automatically determines the order to run models based on {{ ref() }} and {{ source() }}.

Example:

query.sqlSQL
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('raw', 'orders') }}

-- models/staging/stg_customers.sql
SELECT * FROM {{ source('raw', 'customers') }}

-- models/marts/fact_orders.sql
SELECT
  o.*,
  c.customer_tier
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c USING (customer_id)

Dependency graph:

raw.orders    → stg_orders    ↘
                               → fact_orders
raw.customers → stg_customers ↗

Run dbt run and it automatically runs in this order.

3. Data Quality Testing

Built-in tests catch issues before they reach dashboards:

  • unique: No duplicate order IDs
  • not_null: Every order has a customer ID
  • relationships: Every customer ID exists in dim_customers
  • accepted_values: Order status is one of [pending, completed, cancelled]
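Under the hood, each generic test also compiles to a query that returns violating rows. A simplified Python version of the first two checks (sample data invented for illustration):

```python
from collections import Counter

def unique_violations(rows, column):
    """Rows whose value in `column` appears more than once."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] > 1]

def not_null_violations(rows, column):
    """Rows where `column` is NULL (None)."""
    return [r for r in rows if r[column] is None]

rows = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 1, "customer_id": None},  # duplicate order_id AND null customer
]
print(len(unique_violations(rows, "order_id")),
      len(not_null_violations(rows, "customer_id")))  # 2 1
```

A test passes when its violation list is empty, exactly like the [PASS]/[FAIL] output shown below.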

Example — Catching bad data:

$ terminalBash
$ dbt test

✗ unique_stg_orders_order_id .... [FAIL 12 duplicate order_ids]

The analyst investigates and finds a bug in the source extraction. Without dbt, these duplicates would corrupt the dashboard.

4. Self-Documenting

dbt generates:

  • Column descriptions (from schema.yml)
  • Lineage graphs (showing how models connect)
  • Query definitions (the SQL for each model)

Example lineage graph:

sources.raw.orders
    ↓
stg_orders
    ↓
fact_orders → [Dashboard: Daily Revenue]
    ↓
fact_orders_monthly → [Dashboard: Cohort Analysis]

Analysts can click on fact_orders and see:

  • What it depends on (upstream)
  • What depends on it (downstream)
  • The SQL that creates it
  • Test results

5. Modularity

Break complex logic into reusable models:

Before dbt (one giant query):

query.sqlSQL
-- 500-line query with nested CTEs
WITH clean_orders AS (...),
     clean_customers AS (...),
     joined AS (...),
     aggregated AS (...)
SELECT * FROM aggregated;

With dbt (modular models):

query.sqlSQL
-- models/staging/stg_orders.sql (50 lines)
-- models/staging/stg_customers.sql (40 lines)
-- models/marts/fact_orders.sql (30 lines)

Each model is small, testable, and reusable.

Info

Analytics Engineering: dbt created a new role — the Analytics Engineer. They're analysts who write SQL, but they use software engineering practices (Git, tests, CI/CD) to build production-grade data pipelines.
