Pandas II: Joins and Groups

What You'll Learn

Concatenating DataFrames
Merging and Joining (SQL-style joins)
GroupBy operations (Split-Apply-Combine)
Aggregation methods
Pivot Tables

Combining DataFrames

Concatenation (Stacking):

code.py

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Stack vertically (add rows)
result = pd.concat([df1, df2], ignore_index=True)

# Stack horizontally (add columns)
result = pd.concat([df1, df2], axis=1)

Merging (Joining): Similar to SQL JOINs.

code.py

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 2], 'amount': [50, 100]})

# Inner Join (only matching keys)
merged = pd.merge(users, orders, left_on='id', right_on='user_id', how='inner')

# Left Join (keep all left keys)
merged = pd.merge(users, orders, left_on='id', right_on='user_id', how='left')

# Other types: 'right', 'outer'

GroupBy: Split-Apply-Combine

This is one of the most powerful features in Pandas for data analysis.

code.py

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Sales': [100, 200, 150, 300, 400],
    'Region': ['East', 'West', 'East', 'West', 'East']
})

# 1. Split: Group by Category
grouped = df.groupby('Category')

# 2. Apply & Combine: Calculate mean sales per category
print(grouped['Sales'].mean())

Common Aggregations:

code.py

# Multiple aggregations
stats = df.groupby('Category')['Sales'].agg(['mean', 'sum', 'count', 'min', 'max'])

# Different aggregations per column
df.groupby('Category').agg({
    'Sales': 'sum',
    'Region': 'count'
})

Grouping by multiple columns:

code.py

# Sales by Category AND Region
summary = df.groupby(['Category', 'Region'])['Sales'].sum()

Pivot Tables

Create spreadsheet-style pivot tables.

code.py

pivot = df.pivot_table(
    values='Sales',
    index='Category',
    columns='Region',
    aggfunc='sum',
    fill_value=0
)
print(pivot)

Practice Exercise

code.py

import pandas as pd

# Sales Data
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product_id': [1, 2, 1, 3],
    'quantity': [5, 2, 3, 1]
})

# Products Data
products = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Widget', 'Gadget', 'Doodad'],
    'price': [10, 20, 15]
})

# 1. Merge sales with products
full_data = pd.merge(sales, products, left_on='product_id', right_on='id')

# 2. Calculate revenue
full_data['revenue'] = full_data['quantity'] * full_data['price']

# 3. Total revenue by product
product_revenue = full_data.groupby('name')['revenue'].sum()

print(product_revenue)

Next Steps

You have mastered the basics of data manipulation! Next, we'll dive into Exploratory Data Analysis (EDA) workflows.

What You'll Learn

Combining DataFrames

GroupBy: Split-Apply-Combine

Pivot Tables

Practice Exercise

Next Steps

Practice & Experiment