#1 Data Analytics Program in India
₹2,499₹1,499Enroll Now
6 min read min read

Merging DataFrames

Learn to combine DataFrames based on matching values in columns

Merging DataFrames

What is Merging?

Merging is like matching people by their ID cards.

Imagine you have:

  • A list of students with their IDs and names
  • A list of test scores with IDs and marks

Merging connects them using the ID.

Basic Merge Example

code.py
import pandas as pd

# Student names
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['John', 'Sarah', 'Mike']
})

# Test scores
scores = pd.DataFrame({
    'student_id': [1, 2, 3],
    'marks': [85, 92, 78]
})

# Merge them using student_id
result = pd.merge(students, scores, on='student_id')
print(result)

Output:

student_id name marks 0 1 John 85 1 2 Sarah 92 2 3 Mike 78

The data is connected by matching student_id.

When Column Names Are Different

code.py
students = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['John', 'Sarah', 'Mike']
})

scores = pd.DataFrame({
    'student_id': [1, 2, 3],
    'marks': [85, 92, 78]
})

# Left table has 'id', right table has 'student_id'
result = pd.merge(students, scores, left_on='id', right_on='student_id')
print(result)

Output:

id name student_id marks 0 1 John 1 85 1 2 Sarah 2 92 2 3 Mike 3 78

Types of Merges

Inner Merge (Default)

Only keeps rows where both tables have matching values.

code.py
students = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['John', 'Sarah', 'Mike', 'Emma']
})

scores = pd.DataFrame({
    'id': [1, 2, 5],
    'marks': [85, 92, 88]
})

# Inner merge - only matching IDs
result = pd.merge(students, scores, on='id', how='inner')
print(result)

Output:

id name marks 0 1 John 85 1 2 Sarah 92

Only IDs 1 and 2 are in both tables.

Left Merge

Keep all rows from the left table.

code.py
result = pd.merge(students, scores, on='id', how='left')
print(result)

Output:

id name marks 0 1 John 85.0 1 2 Sarah 92.0 2 3 Mike NaN 3 4 Emma NaN

All students kept. Missing scores become NaN.

Right Merge

Keep all rows from the right table.

code.py
result = pd.merge(students, scores, on='id', how='right')
print(result)

Output:

id name marks 0 1 John 85 1 2 Sarah 92 2 5 NaN 88

All scores kept. ID 5 has no student name.

Outer Merge

Keep all rows from both tables.

code.py
result = pd.merge(students, scores, on='id', how='outer')
print(result)

Output:

id name marks 0 1 John 85.0 1 2 Sarah 92.0 2 3 Mike NaN 3 4 Emma NaN 4 5 NaN 88.0

Everyone is included. Missing values become NaN.

Visual Summary

Inner: Only matches (A and B both have) Left: All from left + matches from right Right: All from right + matches from left Outer: Everything from both

Merge on Multiple Columns

code.py
sales = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA', 'LA'],
    'product': ['Apple', 'Banana', 'Apple', 'Banana'],
    'sales': [100, 150, 120, 180]
})

prices = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA', 'LA'],
    'product': ['Apple', 'Banana', 'Apple', 'Banana'],
    'price': [1.0, 0.5, 1.2, 0.6]
})

# Merge on both store AND product
result = pd.merge(sales, prices, on=['store', 'product'])
print(result)

Output:

store product sales price 0 NYC Apple 100 1.0 1 NYC Banana 150 0.5 2 LA Apple 120 1.2 3 LA Banana 180 0.6

Practice Example

code.py
import pandas as pd

# Customers
customers = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Orders
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 101, 102, 104],
    'amount': [50, 75, 100, 25]
})

# Get all orders with customer names
result = pd.merge(orders, customers, on='customer_id', how='left')
print(result)

Output:

order_id customer_id amount name 0 1 101 50 Alice 1 2 101 75 Alice 2 3 102 100 Bob 3 4 104 25 NaN

Key Points

  • pd.merge(df1, df2, on='column') joins on matching column values
  • how='inner' keeps only matches (default)
  • how='left' keeps all from left table
  • how='right' keeps all from right table
  • how='outer' keeps everything
  • Use left_on and right_on when column names differ

Common Mistakes

Mistake 1: Forgetting the 'on' parameter

code.py
# This might not work as expected
pd.merge(df1, df2)

# Better - be clear about which column
pd.merge(df1, df2, on='id')

Mistake 2: Using wrong merge type

code.py
# If you want all customers even without orders
# Don't use inner (default), use left
pd.merge(customers, orders, on='customer_id', how='left')

What's Next?

You learned merge. Next, you'll learn join - another way to combine DataFrames using their index.

SkillsetMaster - AI, Web Development & Data Analytics Courses