What is Apache Spark?
Apache Spark is an open-source distributed computing engine for processing massive datasets across a cluster of machines. Think of it as "SQL + Python, but for data too big to fit on one machine."
Key Concepts
Distributed Computing: Instead of processing data on one machine, Spark splits the work across 10, 100, or 1000 machines in parallel.
In-Memory Processing: Spark keeps data in RAM between operations, making it 10-100x faster than older systems (like Hadoop MapReduce) that wrote to disk after every step.
Unified Engine: Spark handles:
- Batch processing (daily ETL jobs)
- Streaming (real-time event processing)
- Machine learning (training models on huge datasets)
- SQL queries (similar to BigQuery, but on your own cluster)
Spark vs SQL Warehouses (BigQuery, Snowflake)
| Aspect | Spark | BigQuery/Snowflake |
|--------|-------|-------------------|
| Processing model | Distributed (you manage cluster) | Serverless (cloud manages) |
| Language | Python, Scala, SQL | SQL (some Python support) |
| Best for | Complex transformations, ML, unstructured data | SQL analytics on structured data |
| Speed | Fast (in-memory) | Very fast (optimized for SQL) |
| Cost | Pay for cluster hours | Pay per query/storage |
| Complexity | High (tune cluster, partitions) | Low (just write SQL) |
BigQuery is like hiring a taxi — you tell it where to go (write SQL), and Google handles everything. Spark is like renting a fleet of buses — you control the route, speed, and crew, but you manage the logistics.
When Do Analysts Need Spark?
Most analysts never need Spark. SQL warehouses (BigQuery, Snowflake) handle 90% of analytics workloads. But Spark is necessary when:
1. Complex Transformations SQL Can't Handle
Example — Feature engineering for ML:
# PySpark: Rolling 7-day average order value per customer
from pyspark.sql import Window
from pyspark.sql.functions import avg, col
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(-6, 0)
df = df.withColumn("rolling_7day_avg", avg("amount").over(window_spec))
This is possible in SQL (window functions), but Spark makes complex multi-step transformations easier in Python.
2. Processing Unstructured Data
Example — Parsing 1 TB of JSON logs:
# PySpark: Read nested JSON, flatten, filter
df = spark.read.json("s3://logs/2026-03/*.json.gz")
df_clean = df.select(
col("event_id"),
col("user.user_id").alias("user_id"),
col("event_properties.page_url").alias("page_url"),
col("timestamp")
).filter(col("user_id").isNotNull())
df_clean.write.parquet("s3://clean-logs/2026-03/")
BigQuery can query JSON, but for complex flattening and transformations, Spark is more flexible.
3. Data Too Large for Your Warehouse
Example — Flipkart's clickstream data:
- 10 TB of raw logs per day
- Too expensive to load into BigQuery ($200/day in storage + query costs)
- Use Spark to pre-aggregate, then load summaries to BigQuery
# PySpark: Aggregate 10 TB into 100 GB summary
from pyspark.sql.functions import count, countDistinct, sum
df = spark.read.parquet("s3://clickstream/2026-03-22/")
summary = df.groupBy("user_id", "session_id", "date").agg(
count("*").alias("pageviews"),
countDistinct("product_id").alias("products_viewed"),
sum("time_on_page").alias("total_time_seconds")
)
# Load 100 GB summary to BigQuery (cheap)
summary.write.format("bigquery").save("flipkart.analytics.daily_sessions")
4. Custom Machine Learning Pipelines
Example — Training a recommendation model:
from pyspark.ml.recommendation import ALS
# Train on 1 billion user-product interactions
als = ALS(userCol="user_id", itemCol="product_id", ratingCol="rating")
model = als.fit(interactions_df)
# Generate recommendations for all users
recommendations = model.recommendForAllUsers(10)
BigQuery ML exists, but Spark MLlib offers more flexibility for custom models.
Rule of thumb: If your data fits in a SQL warehouse and your transformations are expressible in SQL, use the warehouse. Use Spark only when you have a specific reason (complex logic, unstructured data, extreme scale).
Spark DataFrames — The Core Abstraction
A Spark DataFrame is like a Pandas DataFrame, but distributed across a cluster. Each machine holds a chunk of rows.
Creating a DataFrame
from pyspark.sql import SparkSession
# Initialize Spark
spark = SparkSession.builder.appName("swiggy-analysis").getOrCreate()
# Read from CSV
df = spark.read.csv("s3://swiggy-data/orders.csv", header=True, inferSchema=True)
# Show first 5 rows
df.show(5)
Output:
+----------+-------------+---------------+------+
| order_id|customer_id |order_date |amount|
+----------+-------------+---------------+------+
| 100001| C45123|2026-03-22 | 450 |
| 100002| C67890|2026-03-22 | 890 |
| 100003| C45123|2026-03-23 | 320 |
+----------+-------------+---------------+------+
DataFrame Operations (Similar to SQL)
from pyspark.sql.functions import col, sum, avg, count
# Filter
df_mumbai = df.filter(col("city") == "Mumbai")
# Select columns
df_slim = df.select("order_id", "customer_id", "amount")
# Group by and aggregate
revenue_by_city = df.groupBy("city").agg(
count("*").alias("order_count"),
sum("amount").alias("total_revenue"),
avg("amount").alias("avg_order_value")
)
revenue_by_city.show()
Output:
+----------+-------------+---------------+-----------------+
|city |order_count |total_revenue |avg_order_value |
+----------+-------------+---------------+-----------------+
|Mumbai | 1245000| 622500000.0| 500.0 |
|Bangalore | 1098000| 549000000.0| 500.2 |
|Delhi | 987000| 493500000.0| 500.1 |
+----------+-------------+---------------+-----------------+
Transformations vs Actions
Transformations (lazy — don't execute immediately):
filter(), select(), groupBy(), join() - Spark builds an execution plan but doesn't run it
Actions (eager — trigger execution):
show(), count(), collect(), write() - Spark executes the entire plan and returns results
# This doesn't run yet (lazy)
df_filtered = df.filter(col("amount") > 500)
df_grouped = df_filtered.groupBy("city").count()
# This triggers execution (action)
df_grouped.show()  # Spark runs the entire pipeline now
Why lazy evaluation? Spark optimizes the entire pipeline before execution. If you filter → aggregate → filter again, Spark might push filters down to read less data.
Ola processes 10 TB of ride data daily using Spark on AWS EMR. They:
- Read raw GPS logs from S3
- Calculate trip distances, durations, surge pricing
- Join with driver/customer data
- Write aggregated metrics to Redshift (for dashboards)
- Entire pipeline runs in 2 hours across 100 machines
PySpark Example — Real ETL Pipeline
Scenario: Flipkart wants to identify high-value customers (total spend ≥ ₹50,000 in last 6 months).
Data: 500 million orders (50 GB parquet files on S3)
Step 1: Read Data
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, avg, current_date
spark = SparkSession.builder.appName("high-value-customers").getOrCreate()
# Read orders (partitioned by order_date)
orders = spark.read.parquet("s3://flipkart-data/orders/")
# Read customers
customers = spark.read.parquet("s3://flipkart-data/customers/")
Step 2: Filter Last 6 Months
# Calculate date 6 months ago
from pyspark.sql.functions import date_sub
six_months_ago = date_sub(current_date(), 180)  # add_months(current_date(), -6) is a calendar-accurate alternative
orders_recent = orders.filter(col("order_date") >= six_months_ago)
Step 3: Aggregate by Customer
customer_spend = orders_recent.groupBy("customer_id").agg(
sum("amount").alias("total_spend"),
count("*").alias("order_count"),
avg("amount").alias("avg_order_value")
)
Step 4: Filter High-Value Customers
high_value = customer_spend.filter(col("total_spend") >= 50000)
Step 5: Join with Customer Dimensions
high_value_enriched = high_value.join(
customers.select("customer_id", "name", "email", "city", "tier"),
on="customer_id",
how="left"
)
Step 6: Write Results
# Write to S3 as Parquet
high_value_enriched.write.mode("overwrite").parquet("s3://flipkart-analytics/high-value-customers/")
# Or write directly to BigQuery
high_value_enriched.write.format("bigquery") \
.option("table", "flipkart.analytics.high_value_customers") \
.save()Step 7: Run on Cluster
# Submit to Spark cluster (AWS EMR, Databricks, etc.)
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 50 \
--executor-memory 16G \
--executor-cores 4 \
high_value_customers.py
Execution time: 15 minutes on 50 machines (vs 10+ hours on a single machine)
Key Insight: Spark's power is parallelism. This job processes 50 GB across 50 machines, each handling about 1 GB. Without Spark, one machine would process all 50 GB sequentially: up to 50x slower in the ideal case, though shuffle and coordination overhead make real-world speedups smaller.
Spark vs Alternatives — When to Use What
| Tool | Best For | Example Use Case | Cost |
|------|----------|------------------|------|
| BigQuery | SQL analytics on structured data | Daily revenue dashboards, cohort analysis | $5/TB queried |
| Snowflake | SQL analytics, data sharing | Enterprise data warehouse, BI tool integration | $2-4/compute hour |
| Apache Spark | Complex ETL, ML, unstructured data | Feature engineering, log parsing, training models | $0.10-0.50/hour per node |
| Pandas | Small datasets (<10 GB) on one machine | Exploratory analysis, Jupyter notebooks | Free (but limited scale) |
| dbt | SQL transformations in warehouse | Cleaning data, building marts | Free (uses warehouse compute) |
Decision Tree: Which Tool to Use?
Is your data structured (tables with columns)?
├─ YES
│ └─ Can you express your logic in SQL?
│ ├─ YES → Use BigQuery/Snowflake (+ dbt for transformations)
│ └─ NO (complex Python logic, ML) → Use Spark or Pandas (if data fits in memory)
└─ NO (unstructured: JSON logs, images, text)
└─ Use Spark (for large data) or Python (for small data)
When Companies Use Spark
Flipkart:
- ETL: Process 10 TB of clickstream logs daily (Spark on AWS EMR)
- Analytics: Query cleaned data in BigQuery (dashboards, reports)
- ML: Train recommendation models (Spark MLlib)
Swiggy:
- ETL: Aggregate real-time delivery GPS data (Spark Streaming)
- Analytics: Load aggregates into Redshift for BI tools
- ML: Predict delivery times (Spark ML)
PhonePe:
- ETL: Parse transaction logs, detect fraud (Spark on Databricks)
- Analytics: Store transaction summaries in BigQuery
- ML: Fraud detection models (Spark ML)
Modern Best Practice: Use SQL warehouses (BigQuery, Snowflake) for 90% of analytics. Use Spark only for pre-processing (ETL) or ML. Load Spark's output into the warehouse for final analysis.