Outlier Detection

Learn to find unusual values in your data

What are Outliers?

Outliers are unusual values far from the rest:

  • Most salaries are 50,000-80,000, but one is 500,000
  • Most ages are 20-60, but one is 150 (error!)

Why Care About Outliers?

  • Could be errors (typos, bad data)
  • Could be real but rare (CEO salary)
  • Can mess up averages and analysis
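
To put the last point in numbers, here is a small sketch using the same salary figures that appear later in this lesson. It compares the mean and the median with and without the 500,000 value (the variable names are just for illustration).

```python
import pandas as pd

salaries = pd.Series([50000, 55000, 60000, 52000, 500000, 58000])
without_top = salaries[salaries != 500000]

# The mean is pulled far above every typical salary by one value
print("Mean with outlier:   ", round(salaries.mean(), 2))
print("Mean without outlier:", round(without_top.mean(), 2))

# The median barely moves
print("Median with outlier:   ", salaries.median())
print("Median without outlier:", without_top.median())
```

The mean more than doubles because of a single row, while the median changes only slightly. That is why summaries based on the mean need an outlier check first.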

Method 1: Look at Min/Max

code.py
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

print(df['Salary'].describe())

Look at min and max. Does 500,000 make sense?
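
describe() shows only the single smallest and largest value. As a small addition, assuming you also want to see the next few extremes, nlargest() and nsmallest() list the top and bottom values together with their row index:

```python
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

# Three highest and three lowest salaries, with their row index
top3 = df['Salary'].nlargest(3)
bottom3 = df['Salary'].nsmallest(3)

print(top3)
print(bottom3)
```

A big gap between the first and second value in either list is a hint that the first one deserves a closer look.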

Method 2: Standard Deviation

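Values more than 2 to 3 standard deviations from the mean are suspicious. Note that an outlier inflates both the mean and the standard deviation, so it partly hides itself.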

code.py
mean = df['Salary'].mean()
std = df['Salary'].std()

# Find outliers (more than 2 std from mean)
lower = mean - 2 * std
upper = mean + 2 * std

outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print("Outliers:")
print(outliers)
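
One caveat is worth checking on a dataset this small. In a sample of n values, no single value can be more than (n - 1) / sqrt(n) sample standard deviations from the mean, which is about 2.04 for n = 6. The sketch below (the names z and max_possible are only for illustration) shows that a cutoff of 2 flags the 500,000 salary, but a cutoff of 3 can never flag anything here.

```python
import math
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

n = len(df)
mean = df['Salary'].mean()
std = df['Salary'].std()  # pandas uses the sample std (ddof=1)

# Distance of each value from the mean, in standard deviations
z = (df['Salary'] - mean).abs() / std

# Largest distance any single value can reach in a sample of size n
max_possible = (n - 1) / math.sqrt(n)

print(f"Largest distance in data: {z.max():.2f}")
print(f"Largest possible for n={n}: {max_possible:.2f}")
print("Flagged at 2 std:", (z > 2).sum())
print("Flagged at 3 std:", (z > 3).sum())
```

So with only a handful of rows, pick the lower end of the 2 to 3 range or use a method that does not depend on the mean.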

Method 3: IQR (Most Common Method)

IQR = Interquartile Range = the 75th percentile (Q3) minus the 25th percentile (Q1). Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers.

code.py
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Outlier boundaries
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print(f"Normal range: {lower} to {upper}")

# Find outliers
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print("Outliers:")
print(outliers)
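
One reason the IQR rule is popular is that the quartiles hardly react to the extreme value itself. The sketch below uses a hypothetical helper, iqr_bounds, and a made-up second dataset in which the outlier is ten times larger. The IQR bounds come out identical, while a mean + 2 * std cutoff moves a lot.

```python
import pandas as pd

def iqr_bounds(series):
    """Return (lower, upper) using the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

original = pd.Series([50000, 55000, 60000, 52000, 500000, 58000])
# Made-up variant: same data, but the extreme value is 10x larger
extreme = pd.Series([50000, 55000, 60000, 52000, 5000000, 58000])

print("IQR bounds (original):", iqr_bounds(original))
print("IQR bounds (extreme): ", iqr_bounds(extreme))

# For comparison, the mean + 2 * std cutoff depends on the outlier's size
print("2-std upper (original):", round(original.mean() + 2 * original.std()))
print("2-std upper (extreme): ", round(extreme.mean() + 2 * extreme.std()))
```

For the salary data the IQR range works out to 42,625 to 69,625, whichever of the two extreme values is present.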

Method 4: Z-Score

A Z-score tells you how many standard deviations a value is from the mean:

code.py
from scipy import stats

df['Z_Score'] = stats.zscore(df['Salary'])

# Rows with Z > 2 or Z < -2 are flagged as possible outliers
outliers = df[df['Z_Score'].abs() > 2]
print(outliers)
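
A detail to be aware of: scipy's zscore uses the population standard deviation (ddof=0) by default, while pandas' .std() uses the sample standard deviation (ddof=1). On small datasets the two give noticeably different scores. A minimal sketch in plain pandas, with the column names Z_population and Z_sample chosen only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

centered = df['Salary'] - df['Salary'].mean()

# ddof=0 matches the default of scipy.stats.zscore
df['Z_population'] = centered / df['Salary'].std(ddof=0)
# ddof=1 matches the mean/std calculation in Method 2
df['Z_sample'] = centered / df['Salary'].std(ddof=1)

print(df.round(2))
```

Both versions flag the same row at a cutoff of 2 here, but the scores are not interchangeable, so state which one you used when you report a threshold.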

What to Do with Outliers?

Option 1: Remove Them

code.py
# Remove outliers: keep only rows inside the IQR bounds from Method 3
clean_df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]

Option 2: Cap Them

code.py
# Cap at the IQR bounds from Method 3 (this overwrites the Salary column)
df['Salary'] = df['Salary'].clip(lower=lower, upper=upper)
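
To compare the two options side by side without overwriting the original column, here is a sketch that recomputes the IQR bounds and stores the capped values in a new column (Salary_Capped is a name made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Option 1: removing drops the whole row
removed = df[df['Salary'].between(lower, upper)]

# Option 2: capping keeps the row but replaces the value with the bound
df['Salary_Capped'] = df['Salary'].clip(lower=lower, upper=upper)

print("Rows before:", len(df), "| rows after removing:", len(removed))
print(df)
```

Removing loses one row; capping keeps all six rows but changes the 500,000 value to the upper bound.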

Option 3: Keep Them

If they're real and meaningful, keep them but note their effect.
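
One way to keep outliers while noting their effect is to flag them in a separate column and report statistics both ways. A sketch, assuming the IQR rule and a hypothetical column name Is_Outlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# True for rows outside the 1.5 * IQR range; nothing is deleted
df['Is_Outlier'] = ~df['Salary'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

mean_all = df['Salary'].mean()
mean_typical = df.loc[~df['Is_Outlier'], 'Salary'].mean()

print(f"Mean (all rows): {mean_all:.0f}")
print(f"Mean (outliers excluded): {mean_typical:.0f}")
```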

Quick Outlier Check Function

code.py
def find_outliers_iqr(series):
    """Print a summary and return the values outside the 1.5 * IQR range."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = series[(series < lower) | (series > upper)]
    print(f"Found {len(outliers)} outliers")
    print(f"Range: {lower:.2f} to {upper:.2f}")
    return outliers

find_outliers_iqr(df['Salary'])
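
The same check can be run across every numeric column. The sketch below uses a made-up DataFrame with an Age column (echoing the age example at the top of this lesson) and a hypothetical helper, count_outliers_iqr, that returns a count instead of printing:

```python
import pandas as pd

def count_outliers_iqr(series):
    """Count values outside the 1.5 * IQR range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())

# Made-up data for illustration only
people = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000],
    'Age': [25, 32, 41, 150, 38, 29],
})

# select_dtypes(include='number') skips text columns such as Name
counts = {
    col: count_outliers_iqr(people[col])
    for col in people.select_dtypes(include='number').columns
}
print(counts)
```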

Key Points

  • Outliers = unusual values
  • Check with: min/max, std, IQR, Z-score
  • The IQR method is the most common
  • Decide: remove, cap, or keep
  • Always investigate - is it an error or real?

Common Mistake

Don't automatically delete outliers! A CEO salary of 500K is probably real. A person aged 200 is an error.
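
One way to act on this is to separate values that are impossible from values that are merely rare. The sketch below uses made-up data and an assumed plausibility limit (MAX_PLAUSIBLE_AGE = 120); the right limit depends on your own data.

```python
import pandas as pd

# Made-up data for illustration only
people = pd.DataFrame({
    'Age': [25, 32, 41, 200, 38, 29],
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000],
})

MAX_PLAUSIBLE_AGE = 120  # assumption: no valid age above this

# Impossible values are data errors: fix or drop them
errors = people[(people['Age'] < 0) | (people['Age'] > MAX_PLAUSIBLE_AGE)]

# The remaining rows may still contain rare but real values to review
to_review = people.drop(errors.index)

print("Likely errors:")
print(errors)
print("Kept for review:")
print(to_review)
```

The row with age 200 is set aside as an error, while the 500,000 salary stays in the data until someone confirms whether it is real.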

What's Next?

Learn about distribution analysis - understanding how data is spread.
