Removing Duplicates

Learn to find and remove duplicate rows in a pandas DataFrame.

What are Duplicates?

Duplicates are rows that appear more than once in a table. They usually happen when:

  • The same data is entered twice
  • Merging tables creates extra copies
  • A data import goes wrong and loads rows more than once
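
To make the merge case concrete, here is a minimal sketch. The `customers` and `cities` tables, their column names, and their values are made up for illustration. It assumes one of the tables already contains a repeated key, which is one common way a merge ends up producing copies.

```python
import pandas as pd

# Hypothetical tables, made up for illustration
customers = pd.DataFrame({
    'CustomerID': [1, 2],
    'Name': ['John', 'Sarah']
})
# CustomerID 1 was entered twice in this table
cities = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'City': ['Pune', 'Pune', 'Delhi']
})

# Each matching row on the right produces one row in the result
merged = customers.merge(cities, on='CustomerID', how='left')
print(merged)

print(len(customers))             # 2 rows before the merge
print(len(merged))                # 3 rows after the merge
print(merged.duplicated().sum())  # 1 fully duplicated row
```

Comparing the row count before and after a merge is a quick way to notice that copies were created.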

Find Duplicates

code.py
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

print(df)

# Check which rows are duplicates
print(df.duplicated())

Output:

    Name  Age
0   John   25
1  Sarah   30
2   John   25   <- duplicate of row 0
3   Mike   28
4  Sarah   30   <- duplicate of row 1

0    False
1    False
2     True   <- duplicate!
3    False
4     True   <- duplicate!

Count Duplicates

code.py
# How many duplicates?
print(df.duplicated().sum())  # Output: 2
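
The total tells you how many extra rows there are, but not which rows repeat or how often. One possible way to get that breakdown is to group by every column and count the group sizes. This is a sketch using the lesson's Name/Age data; the `Count` column name is an assumption chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

# Count how many times each Name/Age combination appears
counts = df.groupby(['Name', 'Age']).size().reset_index(name='Count')

# Keep only the combinations that appear more than once
repeated = counts[counts['Count'] > 1]
print(repeated)
```

With this data, John/25 and Sarah/30 each show a count of 2.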

See the Duplicate Rows

code.py
# Show only the duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)
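
By default the filter above shows only the later copies (rows 2 and 4), not the rows they copy. If you want to inspect every copy side by side, `duplicated()` accepts `keep=False`, which marks all occurrences. A small sketch with the same data; the `all_copies` name is just for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

# keep=False marks every copy as a duplicate, including the first one
all_copies = df[df.duplicated(keep=False)]

# Sort so that matching rows sit next to each other
print(all_copies.sort_values(['Name', 'Age']))
```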

Remove Duplicates

code.py
# Remove duplicates (keeps first occurrence)
clean_df = df.drop_duplicates()
print(clean_df)

Output:

    Name  Age
0   John   25
1  Sarah   30
3   Mike   28
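
Notice that the row labels in this output jump from 1 to 3, because the removed rows take their labels with them. If you would rather have a clean 0, 1, 2 numbering, one option is to reset the index afterwards. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

clean_df = df.drop_duplicates()
print(clean_df.index.tolist())   # [0, 1, 3]

# reset_index(drop=True) renumbers the rows and discards the old labels
clean_df = clean_df.reset_index(drop=True)
print(clean_df.index.tolist())   # [0, 1, 2]
```

Whether to renumber is a judgment call: keeping the old labels makes it easier to trace rows back to the original table.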

Keep Last Instead of First

code.py
# Keep last occurrence instead of first
clean_df = df.drop_duplicates(keep='last')
print(clean_df)
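
To see what the `keep` setting changes, it can help to compare which row labels survive under each option. The sketch below uses the same Name/Age data and also tries `keep=False`, which drops every row that has a copy. The variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

first = df.drop_duplicates()                  # keep='first' is the default
last = df.drop_duplicates(keep='last')        # keep the final copy instead
unique_only = df.drop_duplicates(keep=False)  # drop every row that has a copy

print(first.index.tolist())        # [0, 1, 3]
print(last.index.tolist())         # [2, 3, 4]
print(unique_only.index.tolist())  # [3]
```

The rows that remain have the same values under 'first' and 'last' here; only the labels and order differ. The choice matters more when you deduplicate on a subset of columns.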

Check Duplicates in Specific Columns

Sometimes only some columns should be unique:

code.py
df = pd.DataFrame({
    'Email': ['a@mail.com', 'b@mail.com', 'a@mail.com'],
    'Name': ['John', 'Sarah', 'Johnny']
})

# Check duplicates only in Email column
df['Is_Dup'] = df.duplicated(subset=['Email'])
print(df)

Output:

        Email    Name  Is_Dup
0  a@mail.com    John   False
1  b@mail.com   Sarah   False
2  a@mail.com  Johnny    True   <- same email

Remove Duplicates by Specific Columns

code.py
# Remove if same email (even if name different)
clean_df = df.drop_duplicates(subset=['Email'])
print(clean_df)

Output:

        Email   Name  Is_Dup
0  a@mail.com   John   False
1  b@mail.com  Sarah   False
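
When two rows share an email but differ elsewhere, you may want to control which one survives. One possible approach is to sort first and then keep the last row per email. The sketch below assumes a hypothetical `Updated` column with made-up dates; it is not part of the lesson's data.

```python
import pandas as pd

df = pd.DataFrame({
    'Email': ['a@mail.com', 'b@mail.com', 'a@mail.com'],
    'Name': ['John', 'Sarah', 'Johnny'],
    # Hypothetical column with made-up dates, not part of the lesson's data
    'Updated': pd.to_datetime(['2024-01-10', '2024-01-15', '2024-03-01'])
})

# Sort oldest to newest, then keep the last (newest) row per email
latest = (
    df.sort_values('Updated')
      .drop_duplicates(subset=['Email'], keep='last')
)
print(latest)
```

Here the newer 'Johnny' row is kept for a@mail.com, whereas the default keep='first' would have kept 'John'.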

Count Unique Values

code.py
# How many unique names?
print(df['Name'].nunique())  # Output: 3

# See unique values
print(df['Name'].unique())

# Count each value
print(df['Name'].value_counts())
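
`value_counts()` can also point you to the values that repeat. As a sketch, filtering the counts to those above 1 lists only the repeated emails. The variable names here are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Email': ['a@mail.com', 'b@mail.com', 'a@mail.com'],
    'Name': ['John', 'Sarah', 'Johnny']
})

# Count how often each email appears
email_counts = df['Email'].value_counts()

# Keep only the emails that occur more than once
repeated_emails = email_counts[email_counts > 1]
print(repeated_emails)
```

With this data only a@mail.com is listed, with a count of 2.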

Key Points

  • duplicated() marks each row that repeats an earlier row as True
  • drop_duplicates() returns a new DataFrame without those rows
  • keep='first' (default) keeps the first occurrence
  • keep='last' keeps the last occurrence
  • subset=['col'] compares only the listed columns
  • nunique() counts the unique values in a column

Common Mistake

code.py
# This doesn't change the original df!
df.drop_duplicates()

# You need to reassign or use inplace
df = df.drop_duplicates()
# OR
df.drop_duplicates(inplace=True)
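
A quick way to convince yourself of this is to check the row count before and after. The sketch below reuses the Name/Age data from earlier; `before_len` and `after_len` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

# The result is thrown away, so df still has every row
df.drop_duplicates()
before_len = len(df)
print(before_len)   # 5

# Reassigning stores the cleaned table back in df
df = df.drop_duplicates()
after_len = len(df)
print(after_len)    # 3
```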

What's Next?

Learn to validate your data: check whether values make sense and flag errors.