
ETL vs ELT — When to Transform Before or After Loading

ETL transforms data before loading it into the warehouse. ELT loads raw data first, then transforms it inside the warehouse. The difference? Speed, flexibility, and where compute happens.


ETL — Extract, Transform, Load

ETL is the traditional approach: data is transformed before it's loaded into the warehouse.

The Flow

Source DB → Extract → Transform → Load → Warehouse
  1. Extract: Pull raw data from sources (MySQL, APIs, logs)
  2. Transform: Clean, join, aggregate, filter — outside the warehouse
  3. Load: Insert the transformed data into the warehouse

Example — Flipkart's Daily Order Pipeline (ETL)

Source: MySQL database with 10 million orders yesterday

Extract:

-- Pull yesterday's raw orders
SELECT * FROM orders WHERE DATE(order_date) = '2026-03-22';

Transform (in Python/Spark before loading):

import pandas as pd

# Load extracted data
orders = pd.read_csv('raw_orders.csv')

# Clean data
orders = orders.dropna(subset=['customer_id', 'amount'])
orders = orders[orders['amount'] > 0]  # Remove invalid orders

# Join with customer dimension
customers = pd.read_csv('customers.csv')
orders = orders.merge(customers, on='customer_id', how='left')

# Calculate derived columns
orders['order_month'] = pd.to_datetime(orders['order_date']).dt.to_period('M')
# Flag each customer's earliest order; method='first' breaks ties deterministically
orders['is_first_order'] = orders.groupby('customer_id')['order_date'].rank(method='first') == 1

# Save transformed data
orders.to_csv('transformed_orders.csv', index=False)

Load:

-- Load into BigQuery
LOAD DATA INTO warehouse.fact_orders
FROM FILES (
  format = 'CSV',
  uris = ['gs://bucket/transformed_orders.csv']
);
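
In practice, the three ETL stages above are wired together into one orchestrated job. A minimal sketch in plain Python (the function names, file path, and inline sample data are illustrative, not Flipkart's actual pipeline):

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for pulling yesterday's orders from MySQL
    return pd.DataFrame({
        'order_id': [1, 2, 3],
        'customer_id': [10, None, 12],
        'amount': [250.0, 100.0, -5.0],
    })

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    # Same cleaning rules as above: drop missing keys, drop non-positive amounts
    orders = orders.dropna(subset=['customer_id', 'amount'])
    return orders[orders['amount'] > 0]

def load(orders: pd.DataFrame) -> None:
    # Stand-in for the warehouse load step (e.g., the CSV fed to LOAD DATA)
    orders.to_csv('transformed_orders.csv', index=False)

clean = transform(extract())
load(clean)
print(len(clean))  # prints 1 -- only order_id 1 survives cleaning
```

The key property to notice: the warehouse only ever sees the output of `load()`; if the cleaning rules change, the whole chain runs again.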

When ETL Makes Sense

Use ETL when:

  • Your warehouse is expensive or limited in compute (legacy on-prem systems)
  • You need to enforce data quality before loading (strict governance)
  • The warehouse can't handle the source data format (e.g., nested JSON)
  • You have complex transformations that are faster in Spark/Python than SQL

Example — Swiggy's Restaurant Data:

  • Raw source: Nested JSON from restaurant APIs
  • Transform in Spark: Flatten JSON, deduplicate, geocode addresses
  • Load: Clean tabular data into Redshift
Info

ETL Downside: If you realize you need a new column (e.g., customer_lifetime_orders), you must re-extract, re-transform, and re-load the entire dataset — which can take hours or days.

⚡

ELT — Extract, Load, Transform

ELT is the modern approach: data is loaded raw, then transformed inside the warehouse.

The Flow

Source DB → Extract → Load (raw) → Transform (in warehouse) → Analytics
  1. Extract: Pull raw data from sources
  2. Load: Insert raw data into the warehouse immediately (no transformation)
  3. Transform: Use SQL (in BigQuery, Snowflake, Redshift) to clean, join, and aggregate

Example — Flipkart's Daily Order Pipeline (ELT)

Extract & Load:

-- Load raw data directly into BigQuery
CREATE OR REPLACE TABLE warehouse.raw_orders AS
SELECT * FROM EXTERNAL_QUERY(
  'projects/flipkart/connections/mysql',
  'SELECT * FROM orders WHERE DATE(order_date) = "2026-03-22"'
);

Transform (inside BigQuery using SQL/dbt):

-- Create clean fact table via transformation
CREATE OR REPLACE TABLE warehouse.fact_orders AS
SELECT
  o.order_id,
  o.customer_id,
  o.amount,
  o.order_date,
  c.city,
  c.customer_tier,
  DATE_TRUNC(o.order_date, MONTH) AS order_month,
  -- Mark first orders
  ROW_NUMBER() OVER (
    PARTITION BY o.customer_id ORDER BY o.order_date
  ) = 1 AS is_first_order
FROM warehouse.raw_orders o
LEFT JOIN warehouse.dim_customers c USING (customer_id)
WHERE o.amount > 0  -- Filter invalid orders
  AND o.customer_id IS NOT NULL;

Benefit: If you need to add customer_lifetime_orders later, just re-run the transformation query — no need to re-extract from the source.
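
Because the raw table survives in the warehouse, adding a derived column is just a new query over it. A minimal sketch using SQLite as a stand-in warehouse (the table, columns, and sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Raw orders are already loaded -- the 'L' in ELT happened once
conn.executescript("""
CREATE TABLE raw_orders (order_id INT, customer_id INT, amount REAL);
INSERT INTO raw_orders VALUES (1, 10, 250), (2, 10, 400), (3, 11, 90);
""")

# First version of the transform
conn.execute("""
CREATE TABLE fact_orders AS
SELECT order_id, customer_id, amount FROM raw_orders WHERE amount > 0
""")

# Later: we need customer_lifetime_orders. No re-extract from the source --
# just drop and re-run the transform against the preserved raw data.
conn.execute("DROP TABLE fact_orders")
conn.execute("""
CREATE TABLE fact_orders AS
SELECT order_id, customer_id, amount,
       COUNT(*) OVER (PARTITION BY customer_id) AS customer_lifetime_orders
FROM raw_orders WHERE amount > 0
""")

rows = conn.execute(
    "SELECT order_id, customer_lifetime_orders FROM fact_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 2), (2, 2), (3, 1)] -- customer 10 has 2 lifetime orders
```

The second transform never touches MySQL; it only re-reads raw_orders, which is exactly the iteration speed ELT buys you.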

When ELT Makes Sense

Use ELT when:

  • Your warehouse is modern and scalable (BigQuery, Snowflake, Redshift)
  • You want flexibility to iterate on transformations without re-extracting
  • Source data is already in a warehouse-compatible format
  • You want analysts to own transformations (via SQL/dbt)

Example — Zomato's Event Logs:

  • Raw source: Pageview logs from Google Analytics BigQuery export
  • Load: Stream raw events directly into BigQuery
  • Transform: Analysts write SQL to sessionize, attribute conversions
Think of it this way...

ETL is like cooking a meal, then storing the finished dish. If you want to add salt, you have to re-cook the entire meal. ELT is like storing raw ingredients, then cooking on demand — if you want to add salt, just add it to the recipe.

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

⚖️

ETL vs ELT — Side-by-Side Comparison

| Aspect | ETL | ELT |
|--------|-----|-----|
| Transformation timing | Before loading | After loading |
| Where compute happens | External (Spark, Airflow, Python) | Inside warehouse (SQL) |
| Data loaded | Cleaned, transformed | Raw, unprocessed |
| Warehouse storage | Only final tables | Raw + transformed tables |
| Flexibility | Low (re-extract to change logic) | High (re-run SQL to iterate) |
| Speed to load | Slower (transform first) | Faster (load immediately) |
| Best for | Legacy systems, complex non-SQL transforms | Modern cloud warehouses, SQL-based teams |
| Cost | Expensive external compute | Cheap warehouse compute (BigQuery, Snowflake) |
| Governance | Strict (validate before loading) | Flexible (validate after loading) |
| Tools | Talend, Informatica, Apache Spark | dbt, Dataform, Matillion |

Example — Flipkart Black Friday Sale

ETL Approach:

  1. Extract 500 GB of orders from MySQL
  2. Transform in Spark cluster (4 hours)
  3. Load transformed data into Redshift (30 minutes)
  4. Total time to query: 4.5 hours
  5. If you realize you need a new column: Re-run entire pipeline (4.5 hours again)

ELT Approach:

  1. Stream raw orders to BigQuery (30 minutes)
  2. Transform in BigQuery SQL (10 minutes)
  3. Total time to query: 40 minutes
  4. If you need a new column: Re-run SQL transformation (10 minutes)
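
The step timings above reduce to simple arithmetic, which makes the gap easy to see:

```python
# Wall-clock comparison from the Black Friday example (all times in minutes)
etl_initial = 4 * 60 + 30   # Spark transform (4 h) + Redshift load (30 min)
elt_initial = 30 + 10       # raw stream (30 min) + in-warehouse SQL (10 min)

# Cost of adding one new column later
etl_iteration = etl_initial  # re-run the whole pipeline
elt_iteration = 10           # re-run only the SQL transform

print(etl_initial, elt_initial)      # prints 270 40
print(etl_iteration, elt_iteration)  # prints 270 10
```
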
Info

Key Insight: ELT is faster to iterate because raw data is already in the warehouse. ETL is better when transformation logic is too complex for SQL or the warehouse can't handle the compute load.

🚀

The Modern Data Stack — Why ELT Won

Modern cloud warehouses (BigQuery, Snowflake, Redshift) made ELT the dominant pattern because:

1. Cheap, Scalable Compute

Old world (2010): On-prem data warehouses (Teradata, Oracle) had expensive, fixed compute. Running transformations inside the warehouse was costly, so companies used ETL to offload compute to Spark clusters.

Modern world (2026): BigQuery charges $5 per TB scanned. Transforming 100 GB of data costs $0.50. Warehouses auto-scale, so there's no reason to transform outside the warehouse.
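
That cost claim works out directly from the quoted rate (using the article's round 1 TB = 1000 GB; note that published on-demand prices vary by vendor and change over time):

```python
# On-demand scan pricing from the example: $5 per TB scanned
price_per_tb = 5.00
scanned_gb = 100

cost = (scanned_gb / 1000) * price_per_tb  # TB scanned * price per TB
print(round(cost, 2))  # prints 0.5 -- fifty cents to transform 100 GB
```
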

2. SQL is Fast Enough

Modern warehouses are columnar and massively parallel. SQL queries that took hours in 2010 now take seconds.

Example — Swiggy's Daily Aggregations:

-- This query runs in 8 seconds on 50M rows in BigQuery
SELECT
  city,
  DATE_TRUNC(order_date, MONTH) AS month,
  COUNT(*) AS orders,
  AVG(delivery_time_minutes) AS avg_delivery_time,
  SUM(amount) AS revenue
FROM warehouse.raw_orders
WHERE order_date >= '2024-01-01'
GROUP BY city, month
ORDER BY month, revenue DESC;

No need for Spark — BigQuery handles this in seconds.

3. Analysts Can Own Transformations

With ELT, analysts write SQL transformations using tools like dbt. No need to wait for data engineers to update Spark jobs.

Example — dbt transformation model:

-- models/fact_orders_daily.sql
{{ config(materialized='table') }}

SELECT
  order_date,
  customer_id,
  SUM(amount) AS total_amount,
  COUNT(*) AS order_count
FROM {{ source('raw', 'orders') }}
WHERE amount > 0
GROUP BY order_date, customer_id

Run dbt run and the table is created/updated in the warehouse.

4. Flexibility to Experiment

Raw data is preserved. Analysts can reprocess history without re-extracting.

Example: Flipkart realizes they want to classify orders as "high-value" (≥ ₹5000). With ELT, they just update the SQL transformation and re-run it on raw data. With ETL, they'd need to re-extract millions of orders from the source.
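
The reclassification itself is one line of transformation logic once the raw data is on hand. A pandas sketch (the column names, sample amounts, and the ₹5000 threshold follow the example above; this is not Flipkart's actual code):

```python
import pandas as pd

# Raw orders preserved in the warehouse (amounts in rupees)
raw_orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'amount': [7500, 4999, 5000],
})

# New business rule: high-value means amount >= 5000 (boundary is inclusive)
transformed = raw_orders.assign(is_high_value=raw_orders['amount'] >= 5000)
print(transformed['is_high_value'].tolist())  # prints [True, False, True]
```

Changing the threshold later means editing one expression and re-running — the raw_orders input never has to be fetched again.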

Think of it this way...

Zepto (10-minute grocery delivery) uses ELT with BigQuery. Every order is loaded raw within seconds. Analysts query raw data to experiment with new metrics (e.g., "orders per active dark store per hour"). When they finalize the logic, they create a dbt model to materialize it as a table.

🤔

When to Use ETL vs ELT

Use ETL When:

  1. Warehouse compute is expensive/limited

    • Legacy on-prem systems (Teradata, Oracle)
    • Small data teams with budget constraints
  2. Complex transformations require specialized tools

    • Machine learning feature engineering (use PySpark)
    • Image/video processing (use Python libraries)
    • Advanced statistical models (use R/Python)
  3. Strict data governance before loading

    • PII must be masked before entering the warehouse
    • Regulatory requirements (HIPAA, GDPR) mandate pre-load validation
  4. Source data format is incompatible with warehouse

    • Deeply nested JSON that's hard to query in SQL
    • Binary formats (Avro, Protobuf) that need parsing

Example — PhonePe's Transaction Logs:

  • Raw data: 1 TB of JSON logs per day
  • ETL: Use Spark to flatten JSON, mask PII, filter fraud, then load to Redshift
  • Why ETL: JSON is deeply nested, PII must be removed before loading

Use ELT When:

  1. Modern cloud warehouse (BigQuery, Snowflake, Redshift)

    • Cheap, scalable compute
    • SQL is fast enough for 95% of transformations
  2. Analysts need to iterate quickly

    • Business logic changes frequently
    • Experimentation is key (A/B test analysis, cohort definitions)
  3. Source data is already structured

    • Relational databases, SaaS APIs (Stripe, Salesforce)
    • CSVs, Parquet files
  4. You want a "single source of truth" in the warehouse

    • All raw data preserved for audits and reprocessing
    • Transformations are versioned and reproducible (via dbt)

Example — Razorpay's Payment Analytics:

  • Raw data: Payment events from PostgreSQL
  • ELT: Load raw events to BigQuery, transform with dbt
  • Why ELT: Data is structured, analysts iterate on cohort definitions, BigQuery handles the compute
Info

Modern Best Practice: Start with ELT for 90% of pipelines. Use ETL only when you have a specific reason (complex ML, PII masking, non-SQL transformations).

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}