15 min read
Pandas II: Joins and Groups
Combining datasets with merges and joins, and analyzing data with GroupBy
What You'll Learn
- Concatenating DataFrames
- Merging and Joining (SQL-style joins)
- GroupBy operations (Split-Apply-Combine)
- Aggregation methods
- Pivot Tables
Combining DataFrames
Concatenation (Stacking):
code.py
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Stack vertically (add rows)
result = pd.concat([df1, df2], ignore_index=True)
# Stack horizontally (add columns)
result = pd.concat([df1, df2], axis=1)Merging (Joining): Similar to SQL JOINs.
code.py
users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 2], 'amount': [50, 100]})
# Inner Join (only matching keys)
merged = pd.merge(users, orders, left_on='id', right_on='user_id', how='inner')
# Left Join (keep all left keys)
merged = pd.merge(users, orders, left_on='id', right_on='user_id', how='left')
# Other types: 'right', 'outer'GroupBy: Split-Apply-Combine
This is one of the most powerful features in Pandas for data analysis.
code.py
df = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'C'],
'Sales': [100, 200, 150, 300, 400],
'Region': ['East', 'West', 'East', 'West', 'East']
})
# 1. Split: Group by Category
grouped = df.groupby('Category')
# 2. Apply & Combine: Calculate mean sales per category
print(grouped['Sales'].mean())Common Aggregations:
code.py
# Multiple aggregations
stats = df.groupby('Category')['Sales'].agg(['mean', 'sum', 'count', 'min', 'max'])
# Different aggregations per column
df.groupby('Category').agg({
'Sales': 'sum',
'Region': 'count'
})Grouping by multiple columns:
code.py
# Sales by Category AND Region
summary = df.groupby(['Category', 'Region'])['Sales'].sum()Pivot Tables
Create spreadsheet-style pivot tables.
code.py
pivot = df.pivot_table(
values='Sales',
index='Category',
columns='Region',
aggfunc='sum',
fill_value=0
)
print(pivot)Practice Exercise
code.py
import pandas as pd
# Sales Data
sales = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'product_id': [1, 2, 1, 3],
'quantity': [5, 2, 3, 1]
})
# Products Data
products = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Widget', 'Gadget', 'Doodad'],
'price': [10, 20, 15]
})
# 1. Merge sales with products
full_data = pd.merge(sales, products, left_on='product_id', right_on='id')
# 2. Calculate revenue
full_data['revenue'] = full_data['quantity'] * full_data['price']
# 3. Total revenue by product
product_revenue = full_data.groupby('name')['revenue'].sum()
print(product_revenue)Next Steps
You have mastered the basics of data manipulation! Next, we'll dive into Exploratory Data Analysis (EDA) workflows.
Practice & Experiment
Test your understanding by running Python code directly in your browser. Try the examples from the article above!