Data Manipulation with Pandas: GroupBy and Merge

Master Pandas groupby and merge with real e-commerce data examples

📚 Intermediate · ⏱️ 20 min · 2 quizzes

Introduction

Pandas groupby and merge are the building blocks of data analysis in Python. They're how you answer questions like:

  • Flipkart: "What's the average order value per category?"
  • Swiggy: "How many unique customers ordered from each restaurant?"
  • Zomato: "Join customer data with order data to analyze spending patterns"

If you know SQL's GROUP BY and JOIN, you already understand 80% of what groupby() and merge() do. If not, don't worry—we'll build from the ground up.

GroupBy: The Split-Apply-Combine Pattern

groupby() follows a simple pattern:

  1. Split data into groups based on column values
  2. Apply a function to each group (sum, mean, count, etc.)
  3. Combine results into a new DataFrame
Think of it this way...

Think of groupby like sorting laundry. You split clothes into groups (whites, colors, delicates). Then you apply the same operation to each group (wash). Finally, you combine the clean clothes back together. groupby() does this with data instead of clothes.
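The laundry analogy maps directly onto code. Here is a minimal sketch (toy data, illustrative column names) that makes the split step visible before letting pandas apply and combine:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Phone', 'Laptop', 'Phone', 'Laptop'],
    'sales': [100, 300, 150, 250]
})

# Step 1: Split — iterate over the groups pandas creates
grouped = df.groupby('category')
for name, group in grouped:
    print(name, group['sales'].tolist())
# Laptop [300, 250]
# Phone [100, 150]

# Steps 2 & 3: Apply sum() to each group, then combine into one Series
print(grouped['sales'].sum())
```

Iterating over the groups like this is handy for debugging, but for actual analysis you almost always let pandas do the apply and combine steps for you, as above.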

Basic Example:

```python
import pandas as pd

# Flipkart sales data
sales = pd.DataFrame({
    'product': ['iPhone', 'MacBook', 'iPhone', 'iPad', 'MacBook', 'iPad'],
    'category': ['Phone', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Tablet'],
    'sales': [50000, 80000, 48000, 35000, 75000, 32000]
})

# Calculate total sales per category
category_sales = sales.groupby('category')['sales'].sum()
print(category_sales)
```

Output:

```
category
Laptop    155000
Phone      98000
Tablet     67000
Name: sales, dtype: int64
```


Common GroupBy Aggregations

Single Aggregation:

```python
# Total sales per category
df.groupby('category')['sales'].sum()

# Average price per brand
df.groupby('brand')['price'].mean()

# Count orders per customer
df.groupby('customer_id')['order_id'].count()

# Count unique products per category
df.groupby('category')['product_id'].nunique()
```

Multiple Aggregations with agg():

```python
# Multiple metrics for sales
df.groupby('category')['sales'].agg(['sum', 'mean', 'count', 'min', 'max'])

# Different functions for different columns
df.groupby('category').agg({
    'sales': 'sum',
    'quantity': 'mean',
    'customer_id': 'nunique'
})
```

Example - Swiggy Restaurant Analysis:

```python
# Restaurant performance by city
restaurant_stats = df.groupby(['city', 'restaurant']).agg({
    'order_id': 'count',            # Total orders
    'order_value': ['mean', 'sum'], # Avg and total revenue
    'delivery_time': 'mean',        # Avg delivery time
    'customer_id': 'nunique'        # Unique customers
}).reset_index()

print(restaurant_stats.head())
```

Output (column names shown flattened for readability; because of the list in the agg() dict, pandas actually returns a MultiIndex on the columns here):

```
        city restaurant  order_id  order_value_mean  order_value_sum  delivery_time_mean  customer_id_nunique
0  Bangalore   Truffles      1500               350           525000                  25                  890
1  Bangalore     Empire      2000               280           560000                  30                 1200
2      Delhi    Karim's      1800               420           756000                  35                 1100
```
Tip

Use reset_index() after a groupby to convert the grouping keys from the index back into regular columns. This makes the result easier to work with for further analysis or visualization.
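To see what reset_index() actually changes, here is a small sketch on toy data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A'], 'sales': [10, 20, 30]})

# After groupby, the grouping key becomes the index
by_cat = df.groupby('category')['sales'].sum()
print(by_cat.index.name)       # category

# reset_index() turns it back into an ordinary column
flat = by_cat.reset_index()
print(flat.columns.tolist())   # ['category', 'sales']
```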

GroupBy with Multiple Columns

Group by multiple columns to create hierarchical groups.

Example - Sales by City and Category:

```python
# Group by city, then category within each city
city_category_sales = df.groupby(['city', 'category'])['sales'].sum()
print(city_category_sales)
```

Output:

```
city       category
Bangalore  Laptop    250000
           Phone     180000
           Tablet     95000
Delhi      Laptop    320000
           Phone     220000
           Tablet    110000
Name: sales, dtype: int64
```

Unstack for Better Visualization:

```python
# Convert to pivot table format
pivot = city_category_sales.unstack(fill_value=0)
print(pivot)
```

Output:

```
category   Laptop   Phone  Tablet
city
Bangalore  250000  180000   95000
Delhi      320000  220000  110000
```

Filtering Groups with filter()

Keep only groups that meet a condition.

Example - Zomato High-Performing Restaurants:

```python
# Keep only restaurants with >100 orders
high_volume = df.groupby('restaurant').filter(lambda x: len(x) > 100)

# Or using a more complex condition
profitable = df.groupby('restaurant').filter(
    lambda x: x['revenue'].sum() > 50000 and x['rating'].mean() >= 4.0
)
```

Merge: Joining DataFrames

merge() combines DataFrames based on common columns, just like SQL JOINs.

Types of Joins:

  • inner: Keep only matching rows (intersection)
  • left: Keep all left rows, match from right
  • right: Keep all right rows, match from left
  • outer: Keep all rows from both (union)
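The four join types are easiest to compare side by side. A small sketch with two toy DataFrames whose user_ids only partially overlap:

```python
import pandas as pd

left = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'user_id': [2, 3, 4], 'score': [90, 85, 70]})

for how in ['inner', 'left', 'right', 'outer']:
    merged = left.merge(right, on='user_id', how=how)
    print(how, len(merged))
# inner 2   (user_ids 2 and 3 — present in both)
# left 3    (1, 2, 3 — score is NaN for user 1)
# right 3   (2, 3, 4 — name is NaN for user 4)
# outer 4   (1, 2, 3, 4 — NaN wherever a side is missing)
```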

Basic Syntax:

```python
# Merge on single column
result = df1.merge(df2, on='user_id', how='inner')

# Merge on multiple columns
result = df1.merge(df2, on=['user_id', 'product_id'], how='left')

# Merge on different column names
result = df1.merge(df2, left_on='customer_id', right_on='user_id', how='inner')
```

Example: Flipkart Order + Customer Analysis

Scenario: You have separate DataFrames for orders and customers. Merge them to analyze spending by customer segment.

```python
# Orders data
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 101, 103, 102],
    'order_value': [5000, 3000, 7000, 2000, 4500],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
})

# Customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'customer_name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'city': ['Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'segment': ['Premium', 'Regular', 'Premium', 'Regular']
})

# Merge to add customer details to orders
merged = orders.merge(customers, on='customer_id', how='left')
print(merged)
```

Output:

```
   order_id  customer_id  order_value  order_date customer_name       city  segment
0         1          101         5000  2024-01-15         Rahul     Mumbai  Premium
1         2          102         3000  2024-01-16         Priya  Bangalore  Regular
2         3          101         7000  2024-01-17         Rahul     Mumbai  Premium
3         4          103         2000  2024-01-18          Amit      Delhi  Premium
4         5          102         4500  2024-01-19         Priya  Bangalore  Regular
```

Now analyze by segment:

```python
# Total spending by customer segment
segment_spending = merged.groupby('segment')['order_value'].agg(['sum', 'mean', 'count'])
print(segment_spending)
```

Output:

```
           sum         mean  count
segment
Premium  14000  4666.666667      3
Regular   7500  3750.000000      2
```
Warning

Join Type Matters!

  • how='left' keeps all orders even if customer data is missing (NaN for customer columns)
  • how='inner' only keeps orders with matching customer records
  • For most analytics, use how='left' to avoid accidentally dropping data
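A minimal sketch of that difference, using a hypothetical customer id 999 that is missing from the customer table:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [101, 999]})
customers = pd.DataFrame({'customer_id': [101], 'city': ['Mumbai']})

left = orders.merge(customers, on='customer_id', how='left')
inner = orders.merge(customers, on='customer_id', how='inner')

print(len(left), len(inner))        # 2 1
print(left['city'].isna().sum())    # 1 — order 2 survives the left join, with NaN city
```

With how='inner', order 2 silently disappears from the result; with how='left' it stays and the missing customer shows up as a NaN you can detect and investigate.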

Indicator for Join Diagnostics

Use indicator=True to see which rows came from left, right, or both.

```python
merged = orders.merge(customers, on='customer_id', how='outer', indicator=True)
print(merged['_merge'].value_counts())
```

Output:

```
_merge
both          5    # matched in both DataFrames
right_only    1    # only in customers (customer 104 had no orders)
left_only     0    # only in orders
Name: count, dtype: int64
```

Real-World Pattern: GroupBy + Merge

Scenario: Calculate each customer's total spending and merge it back with customer details.

```python
# Step 1: Aggregate order data
customer_spending = orders.groupby('customer_id').agg({
    'order_value': 'sum',
    'order_id': 'count'
}).rename(columns={'order_value': 'total_spent', 'order_id': 'order_count'}).reset_index()

print(customer_spending)
```

Output:

```
   customer_id  total_spent  order_count
0          101        12000            2
1          102         7500            2
2          103         2000            1
```
```python
# Step 2: Merge with customer data
customer_profile = customers.merge(customer_spending, on='customer_id', how='left')
customer_profile['total_spent'] = customer_profile['total_spent'].fillna(0)
customer_profile['order_count'] = customer_profile['order_count'].fillna(0)

print(customer_profile)
```

Output:

```
   customer_id customer_name       city  segment  total_spent  order_count
0          101         Rahul     Mumbai  Premium      12000.0          2.0
1          102         Priya  Bangalore  Regular       7500.0          2.0
2          103          Amit      Delhi  Premium       2000.0          1.0
3          104         Sneha     Mumbai  Regular          0.0          0.0
```

(The counts print as floats because the NaN introduced for customer 104 makes these columns float64; use .astype(int) after fillna() if you want integers back.)

Performance Tips

1. Use Categorical Data Types for Grouping Columns:

```python
# Convert to category for faster groupby
df['category'] = df['category'].astype('category')
df['city'] = df['city'].astype('category')

# Now groupby is faster
df.groupby(['city', 'category'])['sales'].sum()
```

2. Avoid Chaining Multiple groupby() Calls:

```python
# ❌ Slow: Multiple groupby operations
total_sales = df.groupby('category')['sales'].sum()
avg_price = df.groupby('category')['price'].mean()

# ✅ Fast: Single groupby with agg()
stats = df.groupby('category').agg({'sales': 'sum', 'price': 'mean'})
```

3. Use merge() Instead of Multiple Lookups:

```python
# ❌ Slow: Loop and lookup
for idx, row in df.iterrows():
    customer_city = customers[customers['customer_id'] == row['customer_id']]['city'].values[0]

# ✅ Fast: Single merge operation
df = df.merge(customers[['customer_id', 'city']], on='customer_id', how='left')
```

Common Patterns

1. Top N per Group:

```python
# Top 3 products by sales in each category
df.sort_values('sales', ascending=False).groupby('category').head(3)
```

2. Rank within Groups:

```python
# Rank products by sales within each category
df['category_rank'] = df.groupby('category')['sales'].rank(ascending=False, method='dense')
```

3. Cumulative Sum per Group:

```python
# Running total of sales per customer
df['cumulative_sales'] = df.groupby('customer_id')['sales'].cumsum()
```

4. Percentage of Total per Group:

```python
# Each order as % of customer's total spending
df['pct_of_customer_total'] = df['order_value'] / df.groupby('customer_id')['order_value'].transform('sum') * 100
```
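Pattern 4 works because transform() returns a result aligned to the original rows (one value per row), unlike agg(), which returns one value per group. A runnable sketch with toy order values:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 101, 102],
    'order_value': [5000, 7000, 3000]
})

# transform('sum') broadcasts each customer's total back onto every row
totals = df.groupby('customer_id')['order_value'].transform('sum')
df['pct_of_customer_total'] = df['order_value'] / totals * 100

print(df)
#    customer_id  order_value  pct_of_customer_total
# 0          101         5000              41.666667
# 1          101         7000              58.333333
# 2          102         3000             100.000000
```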
