
Apache Spark Intro — Big Data Processing for Analysts

Apache Spark lets you process terabytes of data across hundreds of machines using Python or SQL. It's what powers data pipelines at Flipkart, Uber, and Netflix when SQL warehouses aren't enough.

📚 Advanced · ⏱️ 11 min · 7 quizzes

What is Apache Spark?

Apache Spark is an open-source distributed computing engine for processing massive datasets across a cluster of machines. Think of it as "SQL + Python, but for data too big to fit on one machine."

Key Concepts

Distributed Computing: Instead of processing data on one machine, Spark splits the work across 10, 100, or 1000 machines in parallel.

In-Memory Processing: Spark keeps data in RAM between operations, making it 10-100x faster than older systems (like Hadoop MapReduce) that wrote to disk after every step.

Unified Engine: Spark handles:

  • Batch processing (daily ETL jobs)
  • Streaming (real-time event processing)
  • Machine learning (training models on huge datasets)
  • SQL queries (similar to BigQuery, but on your own cluster)

Spark vs SQL Warehouses (BigQuery, Snowflake)

| Aspect | Spark | BigQuery/Snowflake |
|--------|-------|--------------------|
| Processing model | Distributed (you manage the cluster) | Serverless (cloud manages it) |
| Language | Python, Scala, SQL | SQL (some Python support) |
| Best for | Complex transformations, ML, unstructured data | SQL analytics on structured data |
| Speed | Fast (in-memory) | Very fast (optimized for SQL) |
| Cost | Pay for cluster hours | Pay per query/storage |
| Complexity | High (tune cluster, partitions) | Low (just write SQL) |

Think of it this way...

BigQuery is like hiring a taxi — you tell it where to go (write SQL), and Google handles everything. Spark is like renting a fleet of buses — you control the route, speed, and crew, but you manage the logistics.

🤔

When Do Analysts Need Spark?

Most analysts never need Spark. SQL warehouses (BigQuery, Snowflake) handle 90% of analytics workloads. But Spark is necessary when:

1. Complex Transformations SQL Can't Handle

Example — Feature engineering for ML:

```python
# PySpark: rolling 7-day average order value per customer
from pyspark.sql import Window
from pyspark.sql.functions import avg

window_spec = (
    Window.partitionBy("customer_id")
    .orderBy("order_date")
    .rowsBetween(-6, 0)
)

df = df.withColumn("rolling_7day_avg", avg("amount").over(window_spec))
```

This is possible in SQL (window functions), but Spark makes complex multi-step transformations easier in Python.

2. Processing Unstructured Data

Example — Parsing 1 TB of JSON logs:

```python
# PySpark: read nested JSON, flatten, filter
from pyspark.sql.functions import col

df = spark.read.json("s3://logs/2026-03/*.json.gz")

df_clean = df.select(
    col("event_id"),
    col("user.user_id").alias("user_id"),
    col("event_properties.page_url").alias("page_url"),
    col("timestamp"),
).filter(col("user_id").isNotNull())

df_clean.write.parquet("s3://clean-logs/2026-03/")
```

BigQuery can query JSON, but for complex flattening and transformations, Spark is more flexible.

3. Data Too Large for Your Warehouse

Example — Flipkart's clickstream data:

  • 10 TB of raw logs per day
  • Too expensive to load into BigQuery ($200/day in storage + query costs)
  • Use Spark to pre-aggregate, then load summaries to BigQuery

```python
# PySpark: aggregate 10 TB into a ~100 GB summary
from pyspark.sql.functions import count, countDistinct, sum

df = spark.read.parquet("s3://clickstream/2026-03-22/")

summary = df.groupBy("user_id", "session_id", "date").agg(
    count("*").alias("pageviews"),
    countDistinct("product_id").alias("products_viewed"),
    sum("time_on_page").alias("total_time_seconds"),
)

# Load the ~100 GB summary into BigQuery (cheap)
summary.write.format("bigquery").save("flipkart.analytics.daily_sessions")
```

4. Custom Machine Learning Pipelines

Example — Training a recommendation model:

```python
from pyspark.ml.recommendation import ALS

# Train on 1 billion user-product interactions
als = ALS(userCol="user_id", itemCol="product_id", ratingCol="rating")
model = als.fit(interactions_df)

# Generate top-10 recommendations for every user
recommendations = model.recommendForAllUsers(10)
```

BigQuery ML exists, but Spark MLlib offers more flexibility for custom models.

Info

Rule of thumb: If your data fits in a SQL warehouse and your transformations are expressible in SQL, use the warehouse. Use Spark only when you have a specific reason (complex logic, unstructured data, extreme scale).


📊

Spark DataFrames — The Core Abstraction

A Spark DataFrame is like a Pandas DataFrame, but distributed across a cluster. Each machine holds a chunk of rows.

Creating a DataFrame

```python
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("swiggy-analysis").getOrCreate()

# Read from CSV
df = spark.read.csv("s3://swiggy-data/orders.csv", header=True, inferSchema=True)

# Show first 5 rows
df.show(5)
```

Output:

```
+--------+-----------+----------+------+
|order_id|customer_id|order_date|amount|
+--------+-----------+----------+------+
|  100001|     C45123|2026-03-22|   450|
|  100002|     C67890|2026-03-22|   890|
|  100003|     C45123|2026-03-23|   320|
+--------+-----------+----------+------+
```

DataFrame Operations (Similar to SQL)

```python
from pyspark.sql.functions import col, sum, avg, count

# Filter
df_mumbai = df.filter(col("city") == "Mumbai")

# Select columns
df_slim = df.select("order_id", "customer_id", "amount")

# Group by and aggregate
revenue_by_city = df.groupBy("city").agg(
    count("*").alias("order_count"),
    sum("amount").alias("total_revenue"),
    avg("amount").alias("avg_order_value"),
)

revenue_by_city.show()
```

Output:

```
+---------+-----------+-------------+---------------+
|     city|order_count|total_revenue|avg_order_value|
+---------+-----------+-------------+---------------+
|   Mumbai|    1245000|  622500000.0|          500.0|
|Bangalore|    1098000|  549000000.0|          500.2|
|    Delhi|     987000|  493500000.0|          500.1|
+---------+-----------+-------------+---------------+
```

Transformations vs Actions

Transformations (lazy — don't execute immediately):

  • filter(), select(), groupBy(), join()
  • Spark builds an execution plan but doesn't run it

Actions (eager — trigger execution):

  • show(), count(), collect(), write()
  • Spark executes the entire plan and returns results

```python
from pyspark.sql.functions import col

# These don't run yet (lazy)
df_filtered = df.filter(col("amount") > 500)
df_grouped = df_filtered.groupBy("city").count()

# This triggers execution (action)
df_grouped.show()  # Spark runs the entire pipeline now
```

Why lazy evaluation? Spark optimizes the entire pipeline before execution. If you filter → aggregate → filter again, Spark might push filters down to read less data.

Real-World Example

Ola processes 10 TB of ride data daily using Spark on AWS EMR. They:

  1. Read raw GPS logs from S3
  2. Calculate trip distances, durations, and surge pricing
  3. Join with driver/customer data
  4. Write aggregated metrics to Redshift (for dashboards)

The entire pipeline runs in about 2 hours across 100 machines.

🔧

PySpark Example — Real ETL Pipeline

Scenario: Flipkart wants to identify high-value customers (total spend ≥ ₹50,000 in last 6 months).

Data: 500 million orders (50 GB parquet files on S3)

Step 1: Read Data

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, avg, current_date, date_sub

spark = SparkSession.builder.appName("high-value-customers").getOrCreate()

# Read orders (partitioned by order_date)
orders = spark.read.parquet("s3://flipkart-data/orders/")

# Read customers
customers = spark.read.parquet("s3://flipkart-data/customers/")
```

Step 2: Filter Last 6 Months

```python
from pyspark.sql.functions import date_sub

# Approximate "6 months ago" as 180 days before today
six_months_ago = date_sub(current_date(), 180)

orders_recent = orders.filter(col("order_date") >= six_months_ago)
```

Step 3: Aggregate by Customer

```python
from pyspark.sql.functions import avg  # needed for avg() below

customer_spend = orders_recent.groupBy("customer_id").agg(
    sum("amount").alias("total_spend"),
    count("*").alias("order_count"),
    avg("amount").alias("avg_order_value"),
)
```

Step 4: Filter High-Value Customers

```python
high_value = customer_spend.filter(col("total_spend") >= 50000)
```

Step 5: Join with Customer Dimensions

```python
high_value_enriched = high_value.join(
    customers.select("customer_id", "name", "email", "city", "tier"),
    on="customer_id",
    how="left",
)
```

Step 6: Write Results

```python
# Write to S3 as Parquet
high_value_enriched.write.mode("overwrite").parquet("s3://flipkart-analytics/high-value-customers/")

# Or write directly to BigQuery (requires the spark-bigquery connector)
high_value_enriched.write.format("bigquery") \
    .option("table", "flipkart.analytics.high_value_customers") \
    .save()
```

Step 7: Run on Cluster

```bash
# Submit to a Spark cluster (AWS EMR, Databricks, etc.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 50 \
  --executor-memory 16G \
  --executor-cores 4 \
  high_value_customers.py
```

Execution time: 15 minutes on 50 machines (vs 10+ hours on a single machine)

Info

Key Insight: Spark's power is parallelism. This job spreads 50 GB across 50 machines, each handling roughly 1 GB. A single machine would have to process all 50 GB sequentially, which is why the cluster run is dozens of times faster (15 minutes vs 10+ hours); the theoretical speedup is capped by the number of machines, minus coordination overhead.

🆚

Spark vs Alternatives — When to Use What

| Tool | Best For | Example Use Case | Cost |
|------|----------|------------------|------|
| BigQuery | SQL analytics on structured data | Daily revenue dashboards, cohort analysis | $5/TB queried |
| Snowflake | SQL analytics, data sharing | Enterprise data warehouse, BI tool integration | $2-4/compute hour |
| Apache Spark | Complex ETL, ML, unstructured data | Feature engineering, log parsing, training models | $0.10-0.50/hour per node |
| Pandas | Small datasets (<10 GB) on one machine | Exploratory analysis, Jupyter notebooks | Free (but limited scale) |
| dbt | SQL transformations in warehouse | Cleaning data, building marts | Free (uses warehouse compute) |

Decision Tree: Which Tool to Use?

```
Is your data structured (tables with columns)?
├─ YES
│  └─ Can you express your logic in SQL?
│     ├─ YES → Use BigQuery/Snowflake (+ dbt for transformations)
│     └─ NO (complex Python logic, ML) → Use Spark, or Pandas if data fits in memory
└─ NO (unstructured: JSON logs, images, text)
   └─ Use Spark (for large data) or Python (for small data)
```

When Companies Use Spark

Flipkart:

  • ETL: Process 10 TB of clickstream logs daily (Spark on AWS EMR)
  • Analytics: Query cleaned data in BigQuery (dashboards, reports)
  • ML: Train recommendation models (Spark MLlib)

Swiggy:

  • ETL: Aggregate real-time delivery GPS data (Spark Streaming)
  • Analytics: Load aggregates into Redshift for BI tools
  • ML: Predict delivery times (Spark ML)

PhonePe:

  • ETL: Parse transaction logs, detect fraud (Spark on Databricks)
  • Analytics: Store transaction summaries in BigQuery
  • ML: Fraud detection models (Spark ML)

Info

Modern Best Practice: Use SQL warehouses (BigQuery, Snowflake) for 90% of analytics. Use Spark only for pre-processing (ETL) or ML. Load Spark's output into the warehouse for final analysis.
