Data Lake vs Data Warehouse — When to Use Each

A data lake stores raw files in any format. A data warehouse stores structured tables optimized for SQL. The lake is flexible but messy. The warehouse is clean but rigid.

📚 Beginner
⏱️ 9 min
✅ 7 quizzes
🏞️

Data Lake vs Data Warehouse — Definitions

Data Warehouse

A data warehouse is a centralized repository of structured, cleaned data optimized for SQL analytics.

Characteristics:

  • Structured data (tables with fixed schemas)
  • Cleaned and transformed (via ETL/ELT pipelines)
  • Optimized for queries (indexed, partitioned, columnar storage)
  • Used by analysts (for dashboards, reports, BI tools)

Examples: BigQuery, Snowflake, Redshift, Azure Synapse

Analogy: A data warehouse is like a library with organized bookshelves, catalogs, and librarians. You can find any book quickly.

Data Lake

A data lake is a storage system for raw, unprocessed data in any format (structured, semi-structured, unstructured).

Characteristics:

  • Any format (CSV, JSON, Parquet, images, videos, logs)
  • Raw, unprocessed (no schema enforcement, no cleaning)
  • Cheap storage (store everything, decide later what to use)
  • Used by data engineers/scientists (for ETL, ML, exploration)

Examples: AWS S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS)

Analogy: A data lake is like a storage unit where you dump all your stuff in boxes. It's cheap, but you need to dig through boxes to find anything useful.

Info

Key Difference: A warehouse stores answers (clean, queryable tables). A lake stores raw materials (unprocessed files you must transform before querying).

⚖️

Data Lake vs Warehouse — Side-by-Side

| Aspect | Data Lake | Data Warehouse |
|--------|-----------|----------------|
| Data format | Any (CSV, JSON, Parquet, images, videos) | Structured tables only |
| Schema | Schema-on-read (no schema until queried) | Schema-on-write (enforced when data is loaded) |
| Data quality | Raw, unprocessed (may have errors) | Cleaned, validated, transformed |
| Storage cost | Very cheap ($0.023/GB/month on S3) | Moderate ($0.02-0.04/GB/month) |
| Query speed | Slow (must read files, parse formats) | Fast (optimized for SQL, columnar) |
| Users | Data engineers, data scientists | Analysts, business users |
| Query language | Spark, Python, or SQL (via tools) | SQL |
| Use cases | Raw data storage, ML feature engineering, archival | Dashboards, reports, business analytics |
| Flexibility | High (store anything, decide later) | Low (must define schema upfront) |
| Governance | Hard (no schema, no validation) | Easy (schema enforced, data quality tests) |

Schema-on-Read vs Schema-on-Write

Data Lake (Schema-on-Read):

  • Data is stored as-is (no schema enforced)
  • Schema is applied when you read/query the data
  • Flexible (can reinterpret the same data in multiple ways)
  • Example: JSON logs stored in S3 — you parse fields when querying

Data Warehouse (Schema-on-Write):

  • Schema is enforced when data is loaded
  • Data must match the table schema (column types, constraints)
  • Rigid (if schema changes, you must alter the table)
  • Example: Orders table in BigQuery — you must specify columns and types upfront

Think of it this way...

A data lake is like a garage where you dump stuff in boxes. Later, when you need something, you open boxes and organize on-the-fly. A data warehouse is like a filing cabinet with labeled drawers — everything has a place, and you can find it instantly.
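The two approaches can be sketched in plain Python, without a real query engine. This is an illustrative toy, not a lake or warehouse API: the `raw_events` data and the field names are made up.

```python
import json
from datetime import datetime

# Schema-on-read: store raw JSON strings as-is; interpret fields at query time.
raw_events = [
    '{"user": "u1", "event": "click", "ts": "2026-03-22T10:00:00"}',
    '{"user": "u2", "event": "view"}',  # missing ts -- the lake accepts it anyway
]

def read_with_schema(lines):
    """Apply a schema while reading: parse, coerce types, tolerate gaps."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec.get("user"),
            "event": rec.get("event"),
            "ts": datetime.fromisoformat(rec["ts"]) if "ts" in rec else None,
        }

# Schema-on-write: validate before storing; bad rows are rejected up front.
def write_with_schema(table, rec):
    if not isinstance(rec.get("user"), str) or "ts" not in rec:
        raise ValueError(f"schema violation: {rec}")
    table.append(rec)

parsed = list(read_with_schema(raw_events))
print(parsed[1]["ts"])  # None -- the gap only surfaces when you read

table = []
write_with_schema(table, {"user": "u1", "event": "click", "ts": "2026-03-22T10:00:00"})
try:
    write_with_schema(table, {"user": "u2", "event": "view"})  # rejected at load time
except ValueError as e:
    print("rejected:", e)
```

Notice where the bad row fails: schema-on-read discovers the missing field at query time, while schema-on-write refuses it before it ever lands in the table.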


๐Ÿ—๏ธ

How Data Lakes and Warehouses Work Together

Modern data architectures use both data lakes and warehouses:

The Modern Data Stack Pattern

Raw Data Sources (APIs, databases, logs)
        ↓
Data Lake (S3, ADLS, GCS) — store raw, unprocessed data
        ↓
ETL/ELT Pipeline (Spark, dbt, Fivetran) — clean, transform, aggregate
        ↓
Data Warehouse (BigQuery, Snowflake, Redshift) — structured, queryable tables
        ↓
BI Tools / Dashboards (Tableau, Power BI, Looker)

Example — Swiggy's Data Architecture

Step 1: Ingest to Lake

  • Raw delivery GPS logs (JSON, 5 GB/hour) → stream to S3
  • Restaurant menu updates (JSON) → load to S3
  • Order transaction logs (CSV) → load to S3

Why lake first?

  • Cheap storage (don't know yet what data is useful)
  • Preserve raw data (can reprocess if needed)
  • Flexible (no schema decisions yet)

Step 2: Transform with Spark

  • Read JSON logs from S3
  • Parse, clean, aggregate (calculate delivery times by zone)
  • Write transformed Parquet files back to S3

Step 3: Load to Warehouse

  • Load cleaned Parquet from S3 to BigQuery
  • Create tables: fact_deliveries, dim_restaurants, dim_zones
  • Analysts query BigQuery (fast SQL queries)

Step 4: Dashboards

  • Looker Studio connects to BigQuery
  • Real-time dashboards: "Average delivery time by city"
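The transform in Step 2 can be sketched without a Spark cluster. This stand-in uses plain Python over a handful of made-up log lines; field names like `zone` and `delivery_minutes` are illustrative, not Swiggy's actual schema.

```python
import json
from collections import defaultdict

# Raw JSON log lines as they might land in the lake (illustrative fields).
raw_logs = [
    '{"order_id": 1, "zone": "HSR", "delivery_minutes": 28}',
    '{"order_id": 2, "zone": "HSR", "delivery_minutes": 32}',
    '{"order_id": 3, "zone": "Indiranagar", "delivery_minutes": 41}',
    '{"order_id": 4, "zone": "HSR", "delivery_minutes": "bad"}',  # dirty row
]

def transform(lines):
    """Parse, clean, and aggregate: average delivery time by zone."""
    totals = defaultdict(lambda: [0, 0])  # zone -> [sum_minutes, count]
    for line in lines:
        rec = json.loads(line)
        minutes = rec.get("delivery_minutes")
        if not isinstance(minutes, (int, float)):
            continue  # cleaning step: drop rows that fail validation
        acc = totals[rec["zone"]]
        acc[0] += minutes
        acc[1] += 1
    return {zone: total / count for zone, (total, count) in totals.items()}

avg_by_zone = transform(raw_logs)
print(avg_by_zone)  # {'HSR': 30.0, 'Indiranagar': 41.0}
```

In the real pipeline this output would be written back to S3 as Parquet (Step 2) and loaded into a warehouse table (Step 3); the shape of the work (parse, drop bad rows, aggregate) is the same.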

Info

Best Practice: Use the lake for raw data and ETL. Use the warehouse for analytics. This separation keeps the warehouse clean and fast while preserving raw data in the lake.

🤔

When to Use Data Lake vs Data Warehouse

Use a Data Lake When:

  1. Storing unstructured data

    • Images, videos, PDFs, audio files
    • Example: Netflix stores movie files in S3
  2. Raw data you're not ready to analyze yet

    • Logs, events, sensor data
    • Example: Flipkart stores raw clickstream logs (may analyze later for ML)
  3. Data science / machine learning

    • Training datasets (images for computer vision, text for NLP)
    • Example: Ola stores ride GPS traces in S3 for route optimization models
  4. Archival / compliance

    • Store historical data cheaply (access rarely)
    • Example: Banking transactions older than 7 years → archive to S3 Glacier ($0.004/GB/month)
  5. Exploratory analysis / experimentation

    • Data engineers experimenting with new data sources
    • Example: Parse Twitter API data to see if it's useful for sentiment analysis
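The archival case above comes down to arithmetic. Using the per-GB prices quoted in this section ($0.023/GB/month for S3 Standard, $0.004/GB/month for Glacier), here is a rough comparison for a hypothetical 10 TB archive; the archive size is made up for illustration.

```python
# Rough monthly storage cost, using the per-GB prices quoted in this section.
S3_STANDARD = 0.023  # $/GB/month
S3_GLACIER = 0.004   # $/GB/month

archive_gb = 10 * 1024  # hypothetical 10 TB archive

standard_monthly = archive_gb * S3_STANDARD
glacier_monthly = archive_gb * S3_GLACIER

print(f"S3 Standard: ${standard_monthly:,.2f}/month")  # $235.52/month
print(f"S3 Glacier:  ${glacier_monthly:,.2f}/month")   # $40.96/month
print(f"7-year savings: ${(standard_monthly - glacier_monthly) * 84:,.0f}")
```

For rarely accessed compliance data, the cheaper tier wins; the trade-off is slow, sometimes paid retrieval, which is exactly why Glacier suits data you almost never read.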

Use a Data Warehouse When:

  1. Structured data for analytics

    • Orders, customers, products, transactions
    • Example: Zomato stores cleaned order data in BigQuery
  2. Business intelligence / dashboards

    • Analysts querying for reports
    • Example: "Monthly revenue by product category" dashboard
  3. Ad-hoc SQL queries

    • Business users exploring data without code
    • Example: Marketing team queries BigQuery to analyze campaign ROI
  4. Aggregated metrics

    • Daily/monthly summaries for fast dashboards
    • Example: Swiggy pre-aggregates delivery metrics by hour/city in Redshift
  5. Data governance / quality

    • Schema enforcement, validation, tests
    • Example: Razorpay's warehouse has strict schema + dbt tests to ensure payment data accuracy

Hybrid Use Cases

Use both:

  • Raw data in lake (cheap, flexible) → transform with Spark → load to warehouse (fast queries)
  • Example: Flipkart stores 10 TB of logs daily in S3, processes with Spark, loads 100 GB of summaries to BigQuery

Think of it this way...

Data Lake (S3):

  • Raw GPS traces from delivery partners (5 GB/hour)
  • Photos of delivered orders (for verification)
  • Raw logs from mobile app

Data Warehouse (BigQuery):

  • Clean order data (order_id, customer_id, delivery_time)
  • Aggregated metrics (orders per dark store per hour)
  • Customer cohorts (retention, lifetime value)

Why both? Lake stores raw data for ML (route optimization models). Warehouse stores clean data for dashboards (ops team monitoring delivery times).

🔍

How to Query Data Lakes

Data lakes store files, not tables. To query them, you need a query engine:

1. AWS Athena (Query S3 with SQL)

Athena is a serverless query service that reads files in S3 and applies SQL.

query.sql
-- Define a table schema over S3 files
CREATE EXTERNAL TABLE clickstream (
  user_id STRING,
  event_type STRING,
  page_url STRING,
  event_time TIMESTAMP  -- 'timestamp' is a reserved word in Athena, so avoid it as a name
)
STORED AS PARQUET
LOCATION 's3://flipkart-data-lake/clickstream/';

-- Query the data
SELECT event_type, COUNT(*) AS event_count
FROM clickstream
WHERE DATE(event_time) = DATE '2026-03-22'
GROUP BY event_type;

Performance: Slower than a dedicated warehouse (no indexes; files are scanned on every query, though partitioning by date limits how much each query reads).

Cost: $5 per TB scanned (comparable to BigQuery's on-demand pricing).
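Because you pay per TB scanned, query cost depends on how much data each query reads, which is why partitioning the lake by date matters so much. A rough sketch at the $5/TB rate quoted above; the table size is made up for illustration.

```python
# Rough Athena-style query cost at $5 per TB scanned.
PRICE_PER_TB = 5.0

table_tb = 2.0            # hypothetical full clickstream table size
one_day_tb = table_tb / 365  # one date partition, if data is partitioned by day

full_scan_cost = table_tb * PRICE_PER_TB
partitioned_cost = one_day_tb * PRICE_PER_TB

print(f"Full table scan:  ${full_scan_cost:.2f} per query")  # $10.00 per query
print(f"Date-partitioned: ${partitioned_cost:.4f} per query")
```

The same query over one day's partition costs a few cents instead of dollars, because the engine skips every file outside the requested date.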

2. Apache Spark (Programmatic Queries)

code.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-lake").getOrCreate()

# Read Parquet files from S3
df = spark.read.parquet("s3://flipkart-data-lake/clickstream/2026-03-22/")

# Query with Spark SQL
df.createOrReplaceTempView("clickstream")
result = spark.sql("""
  SELECT event_type, COUNT(*) AS event_count
  FROM clickstream
  GROUP BY event_type
""")

result.show()

3. Databricks / Delta Lake (Enhanced Lake with ACID)

Delta Lake adds warehouse-like features to data lakes:

  • ACID transactions (no partial writes)
  • Schema enforcement
  • Time travel (query historical versions)

Example — Delta Lake on S3:

code.py
# Write data to Delta Lake
df.write.format("delta").save("s3://flipkart-lake/orders-delta/")

# Query Delta Lake (fast, ACID, schema enforced)
orders = spark.read.format("delta").load("s3://flipkart-lake/orders-delta/")
orders.filter("order_date = '2026-03-22'").show()

Benefit: Combines lake flexibility (store raw files) with warehouse performance (fast queries, schema enforcement).

Info

Data Lakehouse: A hybrid architecture combining lake (cheap, flexible storage) with warehouse (fast queries, schema enforcement). Examples: Databricks Delta Lake, Apache Iceberg, Hudi.

🏠

The Lakehouse — Best of Both Worlds

Data Lakehouse is a new architecture that combines the flexibility of data lakes with the performance of data warehouses.

Problem with Traditional Lakes

  • No schema enforcement → data quality issues
  • Slow queries (read/parse files every time)
  • No ACID transactions (partial writes, inconsistency)

Lakehouse Features

  1. ACID transactions (via Delta Lake, Iceberg, Hudi)
  2. Schema enforcement (validate data on write)
  3. Time travel (query historical versions)
  4. Fast queries (metadata indexing, data skipping)
  5. Cheap storage (still just files on S3/ADLS/GCS)

Example — Databricks Lakehouse

code.py
# Create a Delta table (lakehouse) on S3
df.write.format("delta") \
  .mode("overwrite") \
  .saveAsTable("orders")

# Query with SQL (fast, like a warehouse)
spark.sql("""
  SELECT city, SUM(amount) AS revenue
  FROM orders
  WHERE order_date = '2026-03-22'
  GROUP BY city
""").show()

# Time travel: read an earlier snapshot (version 1) of the table
spark.read.format("delta") \
  .option("versionAsOf", 1) \
  .table("orders") \
  .show()

Benefit: Analysts query with SQL (like a warehouse), but data is stored cheaply in S3 (like a lake).

When to Use a Lakehouse

  • You want warehouse performance (fast SQL queries)
  • But you need lake flexibility (store any format, change schemas easily)
  • And you want lake pricing (cheap S3 storage)

Example companies using lakehouses:

  • Netflix: Stores petabytes of user data in S3, queries with Iceberg
  • Uber: Uses Delta Lake on S3 for ride data
  • Booking.com: Migrated from Hive (slow lake) to Iceberg (fast lakehouse)

Info

Future Trend: Lakehouses are increasingly replacing separate lakes + warehouses. Many companies are consolidating onto lakehouse formats (Delta Lake, Iceberg) instead of maintaining both a lake and a warehouse.
