
ETL vs ELT — When to Transform Before or After Loading

ETL transforms data before loading it into the warehouse. ELT loads raw data first, then transforms it inside the warehouse. The difference? Speed, flexibility, and where compute happens.


ETL — Extract, Transform, Load

ETL is the traditional approach: data is transformed before it's loaded into the warehouse.

The Flow

Source DB → Extract → Transform → Load → Warehouse
  1. Extract: Pull raw data from sources (MySQL, APIs, logs)
  2. Transform: Clean, join, aggregate, filter — outside the warehouse
  3. Load: Insert the transformed data into the warehouse

Example — Flipkart's Daily Order Pipeline (ETL)

Source: MySQL database with 10 million orders yesterday

Extract:

-- Pull yesterday's raw orders
SELECT * FROM orders WHERE DATE(order_date) = '2026-03-22';

Transform (in Python/Spark before loading):

import pandas as pd

# Load extracted data
orders = pd.read_csv('raw_orders.csv')

# Clean data
orders = orders.dropna(subset=['customer_id', 'amount'])
orders = orders[orders['amount'] > 0]  # Remove invalid orders

# Join with customer dimension
customers = pd.read_csv('customers.csv')
orders = orders.merge(customers, on='customer_id', how='left')

# Calculate derived columns
orders['order_month'] = pd.to_datetime(orders['order_date']).dt.to_period('M')
# Flag each customer's earliest order; method='first' breaks ties deterministically
orders['is_first_order'] = orders.groupby('customer_id')['order_date'].rank(method='first') == 1

# Save transformed data
orders.to_csv('transformed_orders.csv', index=False)

Load:

-- Load into BigQuery
LOAD DATA INTO warehouse.fact_orders
FROM FILES (
  format = 'CSV',
  uris = ['gs://bucket/transformed_orders.csv']
);
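
In practice, the three ETL stages above are wired together into one orchestrated job. A minimal sketch in plain Python (the function names, file path, and inline sample data are illustrative, not Flipkart's actual pipeline):

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for pulling yesterday's orders from MySQL
    return pd.DataFrame({
        'order_id': [1, 2, 3],
        'customer_id': [10, None, 12],
        'amount': [250.0, 100.0, -5.0],
    })

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    # Same cleaning rules as above: drop missing keys, drop non-positive amounts
    orders = orders.dropna(subset=['customer_id', 'amount'])
    return orders[orders['amount'] > 0]

def load(orders: pd.DataFrame) -> None:
    # Stand-in for the warehouse load step (e.g., the CSV fed to LOAD DATA)
    orders.to_csv('transformed_orders.csv', index=False)

clean = transform(extract())
load(clean)
print(len(clean))  # prints 1 -- only order_id 1 survives cleaning
```

The key property to notice: the warehouse only ever sees the output of `load()`; if the cleaning rules change, the whole chain runs again.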

When ETL Makes Sense

Use ETL when:

  • Your warehouse is expensive or limited in compute (legacy on-prem systems)
  • You need to enforce data quality before loading (strict governance)
  • The warehouse can't handle the source data format (e.g., nested JSON)
  • You have complex transformations that are faster in Spark/Python than SQL

Example — Swiggy's Restaurant Data:

  • Raw source: Nested JSON from restaurant APIs
  • Transform in Spark: Flatten JSON, deduplicate, geocode addresses
  • Load: Clean tabular data into Redshift
Info

ETL Downside: If you realize you need a new column (e.g., customer_lifetime_orders), you must re-extract, re-transform, and re-load the entire dataset — which can take hours or days.

⚡

ELT — Extract, Load, Transform

ELT is the modern approach: data is loaded raw, then transformed inside the warehouse.

The Flow

Source DB → Extract → Load (raw) → Transform (in warehouse) → Analytics
  1. Extract: Pull raw data from sources
  2. Load: Insert raw data into the warehouse immediately (no transformation)
  3. Transform: Use SQL (in BigQuery, Snowflake, Redshift) to clean, join, and aggregate

Example — Flipkart's Daily Order Pipeline (ELT)

Extract & Load:

-- Load raw data directly into BigQuery
CREATE OR REPLACE TABLE warehouse.raw_orders AS
SELECT * FROM EXTERNAL_QUERY(
  'projects/flipkart/connections/mysql',
  'SELECT * FROM orders WHERE DATE(order_date) = "2026-03-22"'
);

Transform (inside BigQuery using SQL/dbt):

-- Create clean fact table via transformation
CREATE OR REPLACE TABLE warehouse.fact_orders AS
SELECT
  o.order_id,
  o.customer_id,
  o.amount,
  o.order_date,
  c.city,
  c.customer_tier,
  DATE_TRUNC(o.order_date, MONTH) AS order_month,
  -- Mark first orders
  ROW_NUMBER() OVER (
    PARTITION BY o.customer_id ORDER BY o.order_date
  ) = 1 AS is_first_order
FROM warehouse.raw_orders o
LEFT JOIN warehouse.dim_customers c USING (customer_id)
WHERE o.amount > 0  -- Filter invalid orders
  AND o.customer_id IS NOT NULL;

Benefit: If you need to add customer_lifetime_orders later, just re-run the transformation query — no need to re-extract from the source.
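
Because the raw table survives in the warehouse, adding a derived column is just a new query over it. A minimal sketch using SQLite as a stand-in warehouse (the table, columns, and sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Raw orders are already loaded -- the 'L' in ELT happened once
conn.executescript("""
CREATE TABLE raw_orders (order_id INT, customer_id INT, amount REAL);
INSERT INTO raw_orders VALUES (1, 10, 250), (2, 10, 400), (3, 11, 90);
""")

# First version of the transform
conn.execute("""
CREATE TABLE fact_orders AS
SELECT order_id, customer_id, amount FROM raw_orders WHERE amount > 0
""")

# Later: we need customer_lifetime_orders. No re-extract from the source --
# just drop and re-run the transform against the preserved raw data.
conn.execute("DROP TABLE fact_orders")
conn.execute("""
CREATE TABLE fact_orders AS
SELECT order_id, customer_id, amount,
       COUNT(*) OVER (PARTITION BY customer_id) AS customer_lifetime_orders
FROM raw_orders WHERE amount > 0
""")

rows = conn.execute(
    "SELECT order_id, customer_lifetime_orders FROM fact_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 2), (2, 2), (3, 1)] -- customer 10 has 2 lifetime orders
```

The second transform never touches MySQL; it only re-reads raw_orders, which is exactly the iteration speed ELT buys you.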

When ELT Makes Sense

Use ELT when:

  • Your warehouse is modern and scalable (BigQuery, Snowflake, Redshift)
  • You want flexibility to iterate on transformations without re-extracting
  • Source data is already in a warehouse-compatible format
  • You want analysts to own transformations (via SQL/dbt)

Example — Zomato's Event Logs:

  • Raw source: Pageview logs from Google Analytics BigQuery export
  • Load: Stream raw events directly into BigQuery
  • Transform: Analysts write SQL to sessionize, attribute conversions
Think of it this way...

ETL is like cooking a meal, then storing the finished dish. If you want to add salt, you have to re-cook the entire meal. ELT is like storing raw ingredients, then cooking on demand — if you want to add salt, just add it to the recipe.

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

⚖️

ETL vs ELT — Side-by-Side Comparison

| Aspect | ETL | ELT |
|--------|-----|-----|
| Transformation timing | Before loading | After loading |
| Where compute happens | External (Spark, Airflow, Python) | Inside warehouse (SQL) |
| Data loaded | Cleaned, transformed | Raw, unprocessed |
| Warehouse storage | Only final tables | Raw + transformed tables |
| Flexibility | Low (re-extract to change logic) | High (re-run SQL to iterate) |
| Speed to load | Slower (transform first) | Faster (load immediately) |
| Best for | Legacy systems, complex non-SQL transforms | Modern cloud warehouses, SQL-based teams |
| Cost | Expensive external compute | Cheap warehouse compute (BigQuery, Snowflake) |
| Governance | Strict (validate before loading) | Flexible (validate after loading) |
| Tools | Talend, Informatica, Apache Spark | dbt, Dataform, Matillion |

Example — Flipkart Black Friday Sale

ETL Approach:

  1. Extract 500 GB of orders from MySQL
  2. Transform in Spark cluster (4 hours)
  3. Load transformed data into Redshift (30 minutes)
  4. Total time to query: 4.5 hours
  5. If you realize you need a new column: Re-run entire pipeline (4.5 hours again)

ELT Approach:

  1. Stream raw orders to BigQuery (30 minutes)
  2. Transform in BigQuery SQL (10 minutes)
  3. Total time to query: 40 minutes
  4. If you need a new column: Re-run SQL transformation (10 minutes)
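
The step timings above reduce to simple arithmetic, which makes the gap easy to see:

```python
# Wall-clock comparison from the Black Friday example (all times in minutes)
etl_initial = 4 * 60 + 30   # Spark transform (4 h) + Redshift load (30 min)
elt_initial = 30 + 10       # raw stream (30 min) + in-warehouse SQL (10 min)

# Cost of adding one new column later
etl_iteration = etl_initial  # re-run the whole pipeline
elt_iteration = 10           # re-run only the SQL transform

print(etl_initial, elt_initial)      # prints 270 40
print(etl_iteration, elt_iteration)  # prints 270 10
```
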
Info

Key Insight: ELT is faster to iterate because raw data is already in the warehouse. ETL is better when transformation logic is too complex for SQL or the warehouse can't handle the compute load.

🚀

The Modern Data Stack — Why ELT Won

Modern cloud warehouses (BigQuery, Snowflake, Redshift) made ELT the dominant pattern because:

1. Cheap, Scalable Compute

Old world (2010): On-prem data warehouses (Teradata, Oracle) had expensive, fixed compute. Running transformations inside the warehouse was costly, so companies used ETL to offload compute to Spark clusters.

Modern world (2026): BigQuery charges $5 per TB scanned. Transforming 100 GB of data costs $0.50. Warehouses auto-scale, so there's no reason to transform outside the warehouse.
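
That cost claim works out directly from the quoted rate (using the article's round 1 TB = 1000 GB; note that published on-demand prices vary by vendor and change over time):

```python
# On-demand scan pricing from the example: $5 per TB scanned
price_per_tb = 5.00
scanned_gb = 100

cost = (scanned_gb / 1000) * price_per_tb  # TB scanned * price per TB
print(round(cost, 2))  # prints 0.5 -- fifty cents to transform 100 GB
```
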

2. SQL is Fast Enough

Modern warehouses are columnar and massively parallel. SQL queries that took hours in 2010 now take seconds.

Example — Swiggy's Daily Aggregations:

-- This query runs in 8 seconds on 50M rows in BigQuery
SELECT
  city,
  DATE_TRUNC(order_date, MONTH) AS month,
  COUNT(*) AS orders,
  AVG(delivery_time_minutes) AS avg_delivery_time,
  SUM(amount) AS revenue
FROM warehouse.raw_orders
WHERE order_date >= '2024-01-01'
GROUP BY city, month
ORDER BY month, revenue DESC;

No need for Spark — BigQuery handles this in seconds.

3. Analysts Can Own Transformations

With ELT, analysts write SQL transformations using tools like dbt. No need to wait for data engineers to update Spark jobs.

Example — dbt transformation model:

-- models/fact_orders_daily.sql
{{ config(materialized='table') }}

SELECT
  order_date,
  customer_id,
  SUM(amount) AS total_amount,
  COUNT(*) AS order_count
FROM {{ source('raw', 'orders') }}
WHERE amount > 0
GROUP BY order_date, customer_id

Run dbt run and the table is created/updated in the warehouse.

4. Flexibility to Experiment

Raw data is preserved. Analysts can reprocess history without re-extracting.

Example: Flipkart realizes they want to classify orders as "high-value" (≥ ₹5000). With ELT, they just update the SQL transformation and re-run it on raw data. With ETL, they'd need to re-extract millions of orders from the source.
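
The reclassification itself is one line of transformation logic once the raw data is on hand. A pandas sketch (the column names, sample amounts, and the ₹5000 threshold follow the example above; this is not Flipkart's actual code):

```python
import pandas as pd

# Raw orders preserved in the warehouse (amounts in rupees)
raw_orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'amount': [7500, 4999, 5000],
})

# New business rule: high-value means amount >= 5000 (boundary is inclusive)
transformed = raw_orders.assign(is_high_value=raw_orders['amount'] >= 5000)
print(transformed['is_high_value'].tolist())  # prints [True, False, True]
```

Changing the threshold later means editing one expression and re-running — the raw_orders input never has to be fetched again.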

Think of it this way...

Zepto (10-minute grocery delivery) uses ELT with BigQuery. Every order is loaded raw within seconds. Analysts query raw data to experiment with new metrics (e.g., "orders per active dark store per hour"). When they finalize the logic, they create a dbt model to materialize it as a table.

🤔

When to Use ETL vs ELT

Use ETL When:

  1. Warehouse compute is expensive/limited

    • Legacy on-prem systems (Teradata, Oracle)
    • Small data teams with budget constraints
  2. Complex transformations require specialized tools

    • Machine learning feature engineering (use PySpark)
    • Image/video processing (use Python libraries)
    • Advanced statistical models (use R/Python)
  3. Strict data governance before loading

    • PII must be masked before entering the warehouse
    • Regulatory requirements (HIPAA, GDPR) mandate pre-load validation
  4. Source data format is incompatible with warehouse

    • Deeply nested JSON that's hard to query in SQL
    • Binary formats (Avro, Protobuf) that need parsing

Example — PhonePe's Transaction Logs:

  • Raw data: 1 TB of JSON logs per day
  • ETL: Use Spark to flatten JSON, mask PII, filter fraud, then load to Redshift
  • Why ETL: JSON is deeply nested, PII must be removed before loading

Use ELT When:

  1. Modern cloud warehouse (BigQuery, Snowflake, Redshift)

    • Cheap, scalable compute
    • SQL is fast enough for 95% of transformations
  2. Analysts need to iterate quickly

    • Business logic changes frequently
    • Experimentation is key (A/B test analysis, cohort definitions)
  3. Source data is already structured

    • Relational databases, SaaS APIs (Stripe, Salesforce)
    • CSVs, Parquet files
  4. You want a "single source of truth" in the warehouse

    • All raw data preserved for audits and reprocessing
    • Transformations are versioned and reproducible (via dbt)

Example — Razorpay's Payment Analytics:

  • Raw data: Payment events from PostgreSQL
  • ELT: Load raw events to BigQuery, transform with dbt
  • Why ELT: Data is structured, analysts iterate on cohort definitions, BigQuery handles the compute
Info

Modern Best Practice: Start with ELT for 90% of pipelines. Use ETL only when you have a specific reason (complex ML, PII masking, non-SQL transformations).

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}