Merging DataFrames
Learn to combine DataFrames based on matching values in columns
Merging DataFrames
What is Merging?
Merging is like matching people by their ID cards.
Imagine you have:
- A list of students with their IDs and names
- A list of test scores with IDs and marks
Merging connects them using the ID.
Basic Merge Example
import pandas as pd
# Student names
students = pd.DataFrame({
'student_id': [1, 2, 3],
'name': ['John', 'Sarah', 'Mike']
})
# Test scores
scores = pd.DataFrame({
'student_id': [1, 2, 3],
'marks': [85, 92, 78]
})
# Merge them using student_id
result = pd.merge(students, scores, on='student_id')
print(result)Output:
student_id name marks
0 1 John 85
1 2 Sarah 92
2 3 Mike 78
The data is connected by matching student_id.
When Column Names Are Different
students = pd.DataFrame({
'id': [1, 2, 3],
'name': ['John', 'Sarah', 'Mike']
})
scores = pd.DataFrame({
'student_id': [1, 2, 3],
'marks': [85, 92, 78]
})
# Left table has 'id', right table has 'student_id'
result = pd.merge(students, scores, left_on='id', right_on='student_id')
print(result)Output:
id name student_id marks
0 1 John 1 85
1 2 Sarah 2 92
2 3 Mike 3 78
Types of Merges
Inner Merge (Default)
Only keeps rows where both tables have matching values.
students = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['John', 'Sarah', 'Mike', 'Emma']
})
scores = pd.DataFrame({
'id': [1, 2, 5],
'marks': [85, 92, 88]
})
# Inner merge - only matching IDs
result = pd.merge(students, scores, on='id', how='inner')
print(result)Output:
id name marks
0 1 John 85
1 2 Sarah 92
Only IDs 1 and 2 are in both tables.
Left Merge
Keep all rows from the left table.
result = pd.merge(students, scores, on='id', how='left')
print(result)Output:
id name marks
0 1 John 85.0
1 2 Sarah 92.0
2 3 Mike NaN
3 4 Emma NaN
All students kept. Missing scores become NaN.
Right Merge
Keep all rows from the right table.
result = pd.merge(students, scores, on='id', how='right')
print(result)Output:
id name marks
0 1 John 85
1 2 Sarah 92
2 5 NaN 88
All scores kept. ID 5 has no student name.
Outer Merge
Keep all rows from both tables.
result = pd.merge(students, scores, on='id', how='outer')
print(result)Output:
id name marks
0 1 John 85.0
1 2 Sarah 92.0
2 3 Mike NaN
3 4 Emma NaN
4 5 NaN 88.0
Everyone is included. Missing values become NaN.
Visual Summary
Inner: Only matches (A and B both have)
Left: All from left + matches from right
Right: All from right + matches from left
Outer: Everything from both
Merge on Multiple Columns
sales = pd.DataFrame({
'store': ['NYC', 'NYC', 'LA', 'LA'],
'product': ['Apple', 'Banana', 'Apple', 'Banana'],
'sales': [100, 150, 120, 180]
})
prices = pd.DataFrame({
'store': ['NYC', 'NYC', 'LA', 'LA'],
'product': ['Apple', 'Banana', 'Apple', 'Banana'],
'price': [1.0, 0.5, 1.2, 0.6]
})
# Merge on both store AND product
result = pd.merge(sales, prices, on=['store', 'product'])
print(result)Output:
store product sales price
0 NYC Apple 100 1.0
1 NYC Banana 150 0.5
2 LA Apple 120 1.2
3 LA Banana 180 0.6
Practice Example
import pandas as pd
# Customers
customers = pd.DataFrame({
'customer_id': [101, 102, 103],
'name': ['Alice', 'Bob', 'Charlie']
})
# Orders
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4],
'customer_id': [101, 101, 102, 104],
'amount': [50, 75, 100, 25]
})
# Get all orders with customer names
result = pd.merge(orders, customers, on='customer_id', how='left')
print(result)Output:
order_id customer_id amount name
0 1 101 50 Alice
1 2 101 75 Alice
2 3 102 100 Bob
3 4 104 25 NaN
Key Points
pd.merge(df1, df2, on='column')joins on matching column valueshow='inner'keeps only matches (default)how='left'keeps all from left tablehow='right'keeps all from right tablehow='outer'keeps everything- Use
left_onandright_onwhen column names differ
Common Mistakes
Mistake 1: Forgetting the 'on' parameter
# This might not work as expected
pd.merge(df1, df2)
# Better - be clear about which column
pd.merge(df1, df2, on='id')Mistake 2: Using wrong merge type
# If you want all customers even without orders
# Don't use inner (default), use left
pd.merge(customers, orders, on='customer_id', how='left')What's Next?
You learned merge. Next, you'll learn join - another way to combine DataFrames using their index.