Topic 29

Pandas groupby, merge, pivot — Advanced Data Operations

SQL has GROUP BY and JOIN. Pandas has them too — with superpowers. Here's how to aggregate, combine, and reshape data in ways that make complex analysis feel simple.

📚 Intermediate · ⏱️ 12 min · 10 quizzes
📊

groupby — Aggregating Data Like SQL GROUP BY

Pandas groupby() is like SQL's GROUP BY — it splits data into groups based on column values, applies aggregation functions, and combines results.

Basic Grouping and Aggregation

code.py (Python)
import pandas as pd

# Swiggy orders dataset
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'restaurant': ['Dominos', 'KFC', 'Dominos', 'McDonalds', 'KFC', 'Subway'],
    'amount': [450, 650, 380, 720, 590, 420],
    'delivery_time': [35, 45, 30, 50, 40, 32]
})

# Total revenue by city (like SQL: SELECT city, SUM(amount) FROM orders GROUP BY city)
revenue_by_city = orders.groupby('city')['amount'].sum()
print(revenue_by_city)
# city
# Bangalore    720
# Delhi       1240
# Mumbai      1250

Multiple Aggregations

code.py (Python)
# Multiple metrics per group
city_stats = orders.groupby('city')['amount'].agg(['sum', 'mean', 'count'])
print(city_stats)

Output:

            sum        mean  count
city                              
Bangalore   720  720.000000      1
Delhi      1240  620.000000      2
Mumbai     1250  416.666667      3

Aggregating Multiple Columns

code.py (Python)
# Different functions for different columns
city_stats = orders.groupby('city').agg({
    'amount': ['sum', 'mean'],
    'delivery_time': ['mean', 'max'],
    'order_id': 'count'
})
print(city_stats)
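One wrinkle worth knowing: when you pass lists of functions like this, the result's columns become a two-level MultiIndex with labels such as ('amount', 'sum'). A common cleanup, sketched here with the same orders frame (column subset trimmed for brevity), is to flatten them into single strings:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420],
    'delivery_time': [35, 45, 30, 50, 40, 32]
})

city_stats = orders.groupby('city').agg({
    'amount': ['sum', 'mean'],
    'delivery_time': ['mean', 'max'],
    'order_id': 'count'
})

# Columns are tuples like ('amount', 'sum'); join the two levels with '_'
city_stats.columns = ['_'.join(col) for col in city_stats.columns]
print(city_stats.columns.tolist())
# ['amount_sum', 'amount_mean', 'delivery_time_mean', 'delivery_time_max', 'order_id_count']
```

Flat column names make the result much easier to reference downstream (e.g. city_stats['amount_sum']).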

Named Aggregations (Cleaner Column Names)

code.py (Python)
# More readable output with custom column names
city_stats = orders.groupby('city').agg(
    total_revenue=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    avg_delivery=('delivery_time', 'mean'),
    order_count=('order_id', 'count')
).reset_index()

print(city_stats)

Output:

        city  total_revenue  avg_order_value  avg_delivery  order_count
0  Bangalore            720       720.000000     50.000000            1
1      Delhi           1240       620.000000     42.500000            2
2     Mumbai           1250       416.666667     32.333333            3

Grouping by Multiple Columns

code.py (Python)
# Revenue by city AND restaurant
city_restaurant = orders.groupby(['city', 'restaurant'])['amount'].sum()
print(city_restaurant)
# city       restaurant
# Bangalore  McDonalds     720
# Delhi      KFC          1240
# Mumbai     Dominos       830
#            Subway        420
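Grouping by two keys returns a Series with a MultiIndex. Calling .unstack() pivots the inner level (restaurant) into columns, a preview of the pivot tables later in this topic. A sketch with the same orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'restaurant': ['Dominos', 'KFC', 'Dominos', 'McDonalds', 'KFC', 'Subway'],
    'amount': [450, 650, 380, 720, 590, 420]
})

city_restaurant = orders.groupby(['city', 'restaurant'])['amount'].sum()

# Pivot the inner index level (restaurant) into columns;
# city-restaurant pairs with no orders become NaN
table = city_restaurant.unstack()
print(table)
```
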
Think of it this way...

Zomato analyzes restaurant performance by city to decide where to expand. They groupby city to find total orders, average order value, and delivery times. Mumbai might have high volume but low AOV, while Bangalore has lower volume but 2x AOV.

🚀

Advanced groupby Techniques

Beyond basic aggregations, groupby unlocks powerful patterns for complex analysis.

Custom Aggregation Functions

code.py (Python)
# Define custom function
def price_range(series):
    return series.max() - series.min()

city_stats = orders.groupby('city')['amount'].agg([
    'mean',
    ('range', price_range),  # Custom function
    ('std', 'std')  # Standard deviation
])

Filter Groups After Aggregation

code.py (Python)
# Only cities with more than 2 orders
active_cities = orders.groupby('city').filter(lambda x: len(x) > 2)
print(active_cities)
# Returns full rows, not aggregated — only for Mumbai (3 orders)

Transform (Keep Original Shape)

transform() returns a result with the same shape as the original DataFrame — useful for adding group statistics to individual rows:

code.py (Python)
# Add city average to each row
orders['city_avg_amount'] = orders.groupby('city')['amount'].transform('mean')
print(orders[['order_id', 'city', 'amount', 'city_avg_amount']])

Output:

   order_id       city  amount  city_avg_amount
0      1001     Mumbai     450       416.666667
1      1002      Delhi     650       620.000000
2      1003     Mumbai     380       416.666667
3      1004  Bangalore     720       720.000000
4      1005      Delhi     590       620.000000
5      1006     Mumbai     420       416.666667
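Because transform() aligns with the original rows, row-versus-group comparisons become one-liners. A sketch: keep only the orders that beat their own city's average.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420]
})

# transform('mean') broadcasts each city's mean back onto its rows,
# so the comparison below is row-aligned
city_avg = orders.groupby('city')['amount'].transform('mean')
above_avg = orders[orders['amount'] > city_avg]
print(above_avg[['order_id', 'city', 'amount']])
# Orders 1001 and 1006 beat Mumbai's mean (~416.67); 1002 beats Delhi's (620)
```
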

Apply (Most Flexible)

apply() runs any function on each group:

code.py (Python)
# Get top 2 orders per city
top_2_per_city = orders.groupby('city').apply(lambda x: x.nlargest(2, 'amount')).reset_index(drop=True)
print(top_2_per_city)

Cumulative Aggregations

code.py (Python)
# Running total within each city
orders = orders.sort_values(['city', 'order_id'])
orders['cumulative_revenue'] = orders.groupby('city')['amount'].cumsum()
print(orders[['order_id', 'city', 'amount', 'cumulative_revenue']])
Info

Performance Tip: For simple aggregations, use built-in functions ('sum', 'mean') instead of lambdas. Built-in functions are Cython-optimized and 10-100x faster on large datasets.
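If you want to see the gap yourself, here is a rough timing sketch (the synthetic frame is made up for illustration, and absolute numbers vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame, large enough for the difference to show
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'key': rng.integers(0, 1000, 200_000),
    'val': rng.random(200_000),
})

# Cython-optimized built-in aggregation
fast = timeit.timeit(lambda: df.groupby('key')['val'].agg('sum'), number=3)

# Python-level lambda: each group is summed element by element in pure Python
slow = timeit.timeit(lambda: df.groupby('key')['val'].agg(lambda s: sum(s)), number=3)

print(f"builtin 'sum': {fast:.3f}s  |  lambda: {slow:.3f}s")
```
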


🔗

merge and join — Combining DataFrames

Pandas merge() combines DataFrames based on common columns — like SQL JOINs. It's essential for combining customer, order, and product data into a single table.

Inner Join (Only Matching Rows)

code.py (Python)
# Customers DataFrame
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']
})

# Orders DataFrame
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# Inner join: only customers who have orders
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)

Output:

   customer_id   name    city  order_id  amount
0          101  Priya  Mumbai      1001    2500
1          101  Priya  Mumbai      1003    1800
2          102  Rahul   Delhi      1002    3200

Left Join (All Rows from Left Table)

code.py (Python)
# Left join: all customers, with orders if they exist
result = pd.merge(customers, orders, on='customer_id', how='left')
print(result)

Output:

   customer_id    name       city  order_id  amount
0          101   Priya     Mumbai    1001.0  2500.0
1          101   Priya     Mumbai    1003.0  1800.0
2          102   Rahul      Delhi    1002.0  3200.0
3          103  Anjali  Bangalore       NaN     NaN
4          104   Vikas    Chennai       NaN     NaN

Right Join and Outer Join

code.py (Python)
# Right join: all orders, with customer info if available
result = pd.merge(customers, orders, on='customer_id', how='right')

# Outer join: all customers and all orders (union)
result = pd.merge(customers, orders, on='customer_id', how='outer')
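With outer joins it is easy to lose track of which side a row came from. merge() accepts an indicator=True flag that adds a _merge column showing exactly that. A sketch with frames shaped like the customers/orders above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
result = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(result[['customer_id', 'name', '_merge']])
# Anjali and Vikas show up as 'left_only': customers with no orders
```
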

Merging on Different Column Names

code.py (Python)
# If join columns have different names
customers = pd.DataFrame({
    'cust_id': [101, 102, 103],
    'name': ['Priya', 'Rahul', 'Anjali']
})

orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [101, 102],
    'amount': [2500, 3200]
})

result = pd.merge(customers, orders, left_on='cust_id', right_on='customer_id', how='inner')
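One quirk of left_on/right_on: both key columns survive in the result, so a common cleanup is dropping the redundant one. A sketch with the same frames:

```python
import pandas as pd

customers = pd.DataFrame({
    'cust_id': [101, 102, 103],
    'name': ['Priya', 'Rahul', 'Anjali']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [101, 102],
    'amount': [2500, 3200]
})

result = pd.merge(customers, orders,
                  left_on='cust_id', right_on='customer_id', how='inner')
print(result.columns.tolist())
# ['cust_id', 'name', 'order_id', 'customer_id', 'amount']

# After an inner join the two key columns hold identical values; keep one
result = result.drop(columns='customer_id')
```
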

Merging on Multiple Columns

code.py (Python)
# Join on city AND date
df1 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'orders': [120, 95, 130]
})

df2 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'revenue': [45000, 38000, 52000]
})

result = pd.merge(df1, df2, on=['city', 'date'], how='inner')
print(result)

Using .join() Method

.join() is a shortcut when joining on index:

code.py (Python)
# Reusing the customers/orders frames from the inner-join example
# (both have a 'customer_id' column); set it as the index, then join
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

result = customers_indexed.join(orders_indexed, how='left')
Think of it this way...

SQL:    SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
Pandas: pd.merge(customers, orders, on='customer_id', how='inner')

The logic is identical — only the syntax changes.

🔄

Pivot Tables — Reshaping Data

Pivot tables transform row-based data into a cross-tabulated format — perfect for comparing groups across multiple dimensions.

Basic Pivot Table

code.py (Python)
# Sample data: orders by city and day
data = {
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21', '2026-03-22', '2026-03-22'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'orders': [120, 95, 130, 102, 115, 98],
    'revenue': [45000, 38000, 52000, 41000, 48000, 39000]
}
df = pd.DataFrame(data)

# Pivot: rows=city, columns=date, values=revenue
pivot = df.pivot(index='city', columns='date', values='revenue')
print(pivot)

Output:

date    2026-03-20  2026-03-21  2026-03-22
city                                      
Delhi        38000       41000       39000
Mumbai       45000       52000       48000

Pivot Table with Aggregation

pivot() requires unique index-column combinations. For aggregation, use pivot_table():

code.py (Python)
# If multiple entries per city-date, aggregate with sum
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')

# Multiple aggregations
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc=['sum', 'mean'])
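To see why pivot_table() exists, a quick sketch with a duplicated city-date pair: pivot() raises an error, while pivot_table() quietly aggregates (exact error wording varies across pandas versions):

```python
import pandas as pd

# Two Mumbai rows for the same date: a duplicate index-column pair
dup = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'date': ['2026-03-20', '2026-03-20', '2026-03-20'],
    'revenue': [45000, 5000, 38000]
})

try:
    dup.pivot(index='city', columns='date', values='revenue')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table resolves the clash by aggregating
pt = dup.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
print(pt)
# Mumbai's cell holds 45000 + 5000 = 50000
```
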

Multi-Level Pivots

code.py (Python)
# Rows=city, Columns=date, Values=orders AND revenue
pivot = df.pivot_table(index='city', columns='date', values=['orders', 'revenue'])
print(pivot)

Filling Missing Values

code.py (Python)
# Fill NaN with 0 in pivot table
pivot = df.pivot_table(index='city', columns='date', values='revenue', fill_value=0)

Adding Totals (Margins)

code.py (Python)
# Add row and column totals
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum', margins=True)
# Adds 'All' row and column with totals

Unpivoting with melt()

melt() is the reverse of pivot — converts wide format to long format:

code.py (Python)
# Wide format
wide = pd.DataFrame({
    'city': ['Mumbai', 'Delhi'],
    'Jan': [45000, 38000],
    'Feb': [52000, 41000],
    'Mar': [48000, 39000]
})

# Convert to long format
long = wide.melt(id_vars='city', var_name='month', value_name='revenue')
print(long)

Output:

     city month  revenue
0  Mumbai   Jan    45000
1   Delhi   Jan    38000
2  Mumbai   Feb    52000
3   Delhi   Feb    41000
4  Mumbai   Mar    48000
5   Delhi   Mar    39000
Info

When to Use Pivot vs Groupby: Use groupby when you want a flat aggregation (one result per group). Use pivot when you want to reshape data into a cross-tab format (compare groups across two dimensions).
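The distinction in code, using a small frame like the ones above (a sketch): groupby returns one flat value per city, while pivot_table returns a city-by-date grid.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'revenue': [45000, 38000, 52000, 41000]
})

# groupby: one aggregated value per city (a flat Series)
flat = df.groupby('city')['revenue'].sum()
print(flat.shape)   # (2,)

# pivot_table: cities as rows, dates as columns (a 2-D grid)
grid = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
print(grid.shape)   # (2, 2)
```
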
