Topic 29

Pandas groupby, merge, pivot — Advanced Data Operations

SQL has GROUP BY and JOIN. Pandas has them too — with superpowers. Here's how to aggregate, combine, and reshape data in ways that make complex analysis feel simple.

📚 Intermediate · ⏱️ 12 min · 10 quizzes
📊

groupby — Aggregating Data Like SQL GROUP BY

Pandas groupby() is like SQL's GROUP BY — it splits data into groups based on column values, applies aggregation functions, and combines results.

Basic Grouping and Aggregation

code.py (Python)
import pandas as pd

# Swiggy orders dataset
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'restaurant': ['Dominos', 'KFC', 'Dominos', 'McDonalds', 'KFC', 'Subway'],
    'amount': [450, 650, 380, 720, 590, 420],
    'delivery_time': [35, 45, 30, 50, 40, 32]
})

# Total revenue by city (like SQL: SELECT city, SUM(amount) FROM orders GROUP BY city)
revenue_by_city = orders.groupby('city')['amount'].sum()
print(revenue_by_city)
# city
# Bangalore    720
# Delhi       1240
# Mumbai      1250

Multiple Aggregations

code.py (Python)
# Multiple metrics per group
city_stats = orders.groupby('city')['amount'].agg(['sum', 'mean', 'count'])
print(city_stats)

Output:

            sum        mean  count
city                              
Bangalore   720  720.000000      1
Delhi      1240  620.000000      2
Mumbai     1250  416.666667      3

Aggregating Multiple Columns

code.py (Python)
# Different functions for different columns
city_stats = orders.groupby('city').agg({
    'amount': ['sum', 'mean'],
    'delivery_time': ['mean', 'max'],
    'order_id': 'count'
})
print(city_stats)
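One wrinkle worth knowing: when you pass lists of functions like this, the result's columns become a two-level MultiIndex with labels such as ('amount', 'sum'). A common cleanup, sketched here with the same orders frame (column subset trimmed for brevity), is to flatten them into single strings:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420],
    'delivery_time': [35, 45, 30, 50, 40, 32]
})

city_stats = orders.groupby('city').agg({
    'amount': ['sum', 'mean'],
    'delivery_time': ['mean', 'max'],
    'order_id': 'count'
})

# Columns are tuples like ('amount', 'sum'); join the two levels with '_'
city_stats.columns = ['_'.join(col) for col in city_stats.columns]
print(city_stats.columns.tolist())
# ['amount_sum', 'amount_mean', 'delivery_time_mean', 'delivery_time_max', 'order_id_count']
```

Flat column names make the result much easier to reference downstream (e.g. city_stats['amount_sum']).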

Named Aggregations (Cleaner Column Names)

code.py (Python)
# More readable output with custom column names
city_stats = orders.groupby('city').agg(
    total_revenue=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    avg_delivery=('delivery_time', 'mean'),
    order_count=('order_id', 'count')
).reset_index()

print(city_stats)

Output:

        city  total_revenue  avg_order_value  avg_delivery  order_count
0  Bangalore            720       720.000000     50.000000            1
1      Delhi           1240       620.000000     42.500000            2
2     Mumbai           1250       416.666667     32.333333            3

Grouping by Multiple Columns

code.py (Python)
# Revenue by city AND restaurant
city_restaurant = orders.groupby(['city', 'restaurant'])['amount'].sum()
print(city_restaurant)
# city       restaurant
# Bangalore  McDonalds     720
# Delhi      KFC          1240
# Mumbai     Dominos       830
#            Subway        420
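Grouping by two keys returns a Series with a MultiIndex. Calling .unstack() pivots the inner level (restaurant) into columns, a preview of the pivot tables later in this topic. A sketch with the same orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'restaurant': ['Dominos', 'KFC', 'Dominos', 'McDonalds', 'KFC', 'Subway'],
    'amount': [450, 650, 380, 720, 590, 420]
})

city_restaurant = orders.groupby(['city', 'restaurant'])['amount'].sum()

# Pivot the inner index level (restaurant) into columns;
# city-restaurant pairs with no orders become NaN
table = city_restaurant.unstack()
print(table)
```
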
Think of it this way...

Zomato analyzes restaurant performance by city to decide where to expand. They groupby city to find total orders, average order value, and delivery times. Mumbai might have high volume but low AOV, while Bangalore has lower volume but 2x AOV.

🚀

Advanced groupby Techniques

Beyond basic aggregations, groupby unlocks powerful patterns for complex analysis.

Custom Aggregation Functions

code.py (Python)
# Define custom function
def price_range(series):
    return series.max() - series.min()

city_stats = orders.groupby('city')['amount'].agg([
    'mean',
    ('range', price_range),  # Custom function
    ('std', 'std')  # Standard deviation
])

Filter Groups After Aggregation

code.py (Python)
# Only cities with more than 2 orders
active_cities = orders.groupby('city').filter(lambda x: len(x) > 2)
print(active_cities)
# Returns full rows, not aggregated — only for Mumbai (3 orders)

Transform (Keep Original Shape)

transform() returns a result with the same shape as the original DataFrame — useful for adding group statistics to individual rows:

code.py (Python)
# Add city average to each row
orders['city_avg_amount'] = orders.groupby('city')['amount'].transform('mean')
print(orders[['order_id', 'city', 'amount', 'city_avg_amount']])

Output:

   order_id       city  amount  city_avg_amount
0      1001     Mumbai     450       416.666667
1      1002      Delhi     650       620.000000
2      1003     Mumbai     380       416.666667
3      1004  Bangalore     720       720.000000
4      1005      Delhi     590       620.000000
5      1006     Mumbai     420       416.666667
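Because transform() aligns with the original rows, row-versus-group comparisons become one-liners. A sketch: keep only the orders that beat their own city's average.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420]
})

# transform('mean') broadcasts each city's mean back onto its rows,
# so the comparison below is row-aligned
city_avg = orders.groupby('city')['amount'].transform('mean')
above_avg = orders[orders['amount'] > city_avg]
print(above_avg[['order_id', 'city', 'amount']])
# Orders 1001 and 1006 beat Mumbai's mean (~416.67); 1002 beats Delhi's (620)
```
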

Apply (Most Flexible)

apply() runs any function on each group:

code.py (Python)
# Get top 2 orders per city
top_2_per_city = orders.groupby('city').apply(lambda x: x.nlargest(2, 'amount')).reset_index(drop=True)
print(top_2_per_city)

Cumulative Aggregations

code.py (Python)
# Running total within each city
orders = orders.sort_values(['city', 'order_id'])
orders['cumulative_revenue'] = orders.groupby('city')['amount'].cumsum()
print(orders[['order_id', 'city', 'amount', 'cumulative_revenue']])
Info

Performance Tip: For simple aggregations, use built-in functions ('sum', 'mean') instead of lambdas. Built-in functions are Cython-optimized and 10-100x faster on large datasets.
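If you want to see the gap yourself, here is a rough timing sketch (the synthetic frame is made up for illustration, and absolute numbers vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame, large enough for the difference to show
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'key': rng.integers(0, 1000, 200_000),
    'val': rng.random(200_000),
})

# Cython-optimized built-in aggregation
fast = timeit.timeit(lambda: df.groupby('key')['val'].agg('sum'), number=3)

# Python-level lambda: each group is summed element by element in pure Python
slow = timeit.timeit(lambda: df.groupby('key')['val'].agg(lambda s: sum(s)), number=3)

print(f"builtin 'sum': {fast:.3f}s  |  lambda: {slow:.3f}s")
```
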


🔗

merge and join — Combining DataFrames

Pandas merge() combines DataFrames based on common columns — like SQL JOINs. It's essential for combining customer, order, and product data into a single table.

Inner Join (Only Matching Rows)

code.py (Python)
# Customers DataFrame
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']
})

# Orders DataFrame
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# Inner join: only customers who have orders
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)

Output:

   customer_id   name    city  order_id  amount
0          101  Priya  Mumbai      1001    2500
1          101  Priya  Mumbai      1003    1800
2          102  Rahul   Delhi      1002    3200

Left Join (All Rows from Left Table)

code.py (Python)
# Left join: all customers, with orders if they exist
result = pd.merge(customers, orders, on='customer_id', how='left')
print(result)

Output:

   customer_id    name       city  order_id  amount
0          101   Priya     Mumbai    1001.0  2500.0
1          101   Priya     Mumbai    1003.0  1800.0
2          102   Rahul      Delhi    1002.0  3200.0
3          103  Anjali  Bangalore       NaN     NaN
4          104   Vikas    Chennai       NaN     NaN

Right Join and Outer Join

code.py (Python)
# Right join: all orders, with customer info if available
result = pd.merge(customers, orders, on='customer_id', how='right')

# Outer join: all customers and all orders (union)
result = pd.merge(customers, orders, on='customer_id', how='outer')
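With outer joins it is easy to lose track of which side a row came from. merge() accepts an indicator=True flag that adds a _merge column showing exactly that. A sketch with frames shaped like the customers/orders above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
result = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(result[['customer_id', 'name', '_merge']])
# Anjali and Vikas show up as 'left_only': customers with no orders
```
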

Merging on Different Column Names

code.py (Python)
# If join columns have different names
customers = pd.DataFrame({
    'cust_id': [101, 102, 103],
    'name': ['Priya', 'Rahul', 'Anjali']
})

orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [101, 102],
    'amount': [2500, 3200]
})

result = pd.merge(customers, orders, left_on='cust_id', right_on='customer_id', how='inner')
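One quirk of left_on/right_on: both key columns survive in the result, so a common cleanup is dropping the redundant one. A sketch with the same frames:

```python
import pandas as pd

customers = pd.DataFrame({
    'cust_id': [101, 102, 103],
    'name': ['Priya', 'Rahul', 'Anjali']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [101, 102],
    'amount': [2500, 3200]
})

result = pd.merge(customers, orders,
                  left_on='cust_id', right_on='customer_id', how='inner')
print(result.columns.tolist())
# ['cust_id', 'name', 'order_id', 'customer_id', 'amount']

# After an inner join the two key columns hold identical values; keep one
result = result.drop(columns='customer_id')
```
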

Merging on Multiple Columns

code.py (Python)
# Join on city AND date
df1 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'orders': [120, 95, 130]
})

df2 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'revenue': [45000, 38000, 52000]
})

result = pd.merge(df1, df2, on=['city', 'date'], how='inner')
print(result)

Using .join() Method

.join() is a shortcut when joining on index:

code.py (Python)
# Reusing the customers/orders frames from the inner-join example
# (both have a 'customer_id' column); set it as the index, then join
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

result = customers_indexed.join(orders_indexed, how='left')
Think of it this way...

SQL:    SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
Pandas: pd.merge(customers, orders, on='customer_id', how='inner')

The logic is identical — only the syntax changes.

🔄

Pivot Tables — Reshaping Data

Pivot tables transform row-based data into a cross-tabulated format — perfect for comparing groups across multiple dimensions.

Basic Pivot Table

code.py (Python)
# Sample data: orders by city and day
data = {
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21', '2026-03-22', '2026-03-22'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'orders': [120, 95, 130, 102, 115, 98],
    'revenue': [45000, 38000, 52000, 41000, 48000, 39000]
}
df = pd.DataFrame(data)

# Pivot: rows=city, columns=date, values=revenue
pivot = df.pivot(index='city', columns='date', values='revenue')
print(pivot)

Output:

date    2026-03-20  2026-03-21  2026-03-22
city                                      
Delhi        38000       41000       39000
Mumbai       45000       52000       48000

Pivot Table with Aggregation

pivot() requires unique index-column combinations. For aggregation, use pivot_table():

code.py (Python)
# If multiple entries per city-date, aggregate with sum
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')

# Multiple aggregations
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc=['sum', 'mean'])
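To see why pivot_table() exists, a quick sketch with a duplicated city-date pair: pivot() raises an error, while pivot_table() quietly aggregates (exact error wording varies across pandas versions):

```python
import pandas as pd

# Two Mumbai rows for the same date: a duplicate index-column pair
dup = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'date': ['2026-03-20', '2026-03-20', '2026-03-20'],
    'revenue': [45000, 5000, 38000]
})

try:
    dup.pivot(index='city', columns='date', values='revenue')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table resolves the clash by aggregating
pt = dup.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
print(pt)
# Mumbai's cell holds 45000 + 5000 = 50000
```
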

Multi-Level Pivots

code.py (Python)
# Rows=city, Columns=date, Values=orders AND revenue
pivot = df.pivot_table(index='city', columns='date', values=['orders', 'revenue'])
print(pivot)

Filling Missing Values

code.py (Python)
# Fill NaN with 0 in pivot table
pivot = df.pivot_table(index='city', columns='date', values='revenue', fill_value=0)

Adding Totals (Margins)

code.py (Python)
# Add row and column totals
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum', margins=True)
# Adds 'All' row and column with totals

Unpivoting with melt()

melt() is the reverse of pivot — converts wide format to long format:

code.py (Python)
# Wide format
wide = pd.DataFrame({
    'city': ['Mumbai', 'Delhi'],
    'Jan': [45000, 38000],
    'Feb': [52000, 41000],
    'Mar': [48000, 39000]
})

# Convert to long format
long = wide.melt(id_vars='city', var_name='month', value_name='revenue')
print(long)

Output:

     city month  revenue
0  Mumbai   Jan    45000
1   Delhi   Jan    38000
2  Mumbai   Feb    52000
3   Delhi   Feb    41000
4  Mumbai   Mar    48000
5   Delhi   Mar    39000
Info

When to Use Pivot vs Groupby: Use groupby when you want a flat aggregation (one result per group). Use pivot when you want to reshape data into a cross-tab format (compare groups across two dimensions).
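The distinction in code, using a small frame like the ones above (a sketch): groupby returns one flat value per city, while pivot_table returns a city-by-date grid.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'revenue': [45000, 38000, 52000, 41000]
})

# groupby: one aggregated value per city (a flat Series)
flat = df.groupby('city')['revenue'].sum()
print(flat.shape)   # (2,)

# pivot_table: cities as rows, dates as columns (a 2-D grid)
grid = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
print(grid.shape)   # (2, 2)
```
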
