5 min read
Outlier Detection
Learn to find unusual values in your data
What are Outliers?
Outliers are unusual values far from the rest:
- Most salaries are 50,000-80,000, but one is 500,000
- Most ages are 20-60, but one is 150 (error!)
Why Care About Outliers?
- Could be errors (typos, bad data)
- Could be real but rare (CEO salary)
- Can mess up averages and analysis
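To make the last point concrete, here is a minimal sketch that uses the same six salaries as the examples in this lesson. It compares the mean and median with and without the 500,000 value; the variable names are illustrative only.

```python
import pandas as pd

salaries = pd.Series([50000, 55000, 60000, 52000, 500000, 58000])
without_outlier = salaries[salaries != 500000]

# Compare the two summaries side by side
summary = pd.DataFrame({
    "mean": [salaries.mean(), without_outlier.mean()],
    "median": [salaries.median(), without_outlier.median()],
}, index=["with 500,000", "without 500,000"])
print(summary.round(2))
```

Removing the one value drops the mean from about 129,167 to 55,000, while the median only moves from 56,500 to 55,000.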
Method 1: Look at Min/Max
code.py
import pandas as pd
df = pd.DataFrame({
'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})
print(df['Salary'].describe())
Look at min and max. Does 500,000 make sense?
Method 2: Standard Deviation
Values more than 2 to 3 standard deviations from the mean are suspicious.
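How much the choice between 2 and 3 matters depends on the sample. A rough sketch, assuming the six salaries used in this lesson and a hypothetical helper named `flag_by_std`:

```python
import pandas as pd

def flag_by_std(series, k):
    """Return values more than k sample standard deviations from the mean."""
    mean = series.mean()
    std = series.std()  # pandas default: sample std (ddof=1)
    return series[(series - mean).abs() > k * std]

salaries = pd.Series([50000, 55000, 60000, 52000, 500000, 58000])

for k in (2, 3):
    flagged = flag_by_std(salaries, k)
    print(f"k={k}: {len(flagged)} flagged -> {flagged.tolist()}")
```

With only six values, 500,000 sits about 2.04 standard deviations from the mean, so k=2 flags it and k=3 flags nothing. The extreme value inflates both the mean and the standard deviation, which is why a stricter cutoff can miss it in a small sample.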
code.py
mean = df['Salary'].mean()
std = df['Salary'].std()
# Find outliers (more than 2 std from mean)
lower = mean - 2 * std
upper = mean + 2 * std
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print("Outliers:")
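print(outliers)
Method 3: IQR (Best Method)
IQR (interquartile range) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.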
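One reason the IQR method is often preferred is that quartiles barely move when an extreme value is present, whereas the standard deviation does. A small sketch, assuming the lesson's six salaries, compares both measures of spread with and without the 500,000 value (the helper `iqr` and the name `typical` are illustrative):

```python
import pandas as pd

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

salaries = pd.Series([50000, 55000, 60000, 52000, 500000, 58000])
typical = salaries[salaries < 100000]  # drop the one extreme value

# Spread without the extreme value -> spread with it
print(f"std: {typical.std():.0f} -> {salaries.std():.0f}")
print(f"IQR: {iqr(typical):.0f} -> {iqr(salaries):.0f}")
```

The standard deviation jumps from roughly 4,123 to roughly 181,708, while the IQR only moves from 6,000 to 6,750.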
code.py
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Outlier boundaries
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(f"Normal range: {lower} to {upper}")
# Find outliers
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print("Outliers:")
print(outliers)
Method 4: Z-Score
A Z-score is how many standard deviations a value lies from the mean:
code.py
from scipy import stats
df['Z_Score'] = stats.zscore(df['Salary'])
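# Values with Z > 2 or Z < -2 are flagged as outliers
# Note: stats.zscore uses the population std (ddof=0) by default,
# while pandas .std() uses the sample std (ddof=1)
outliers = df[df['Z_Score'].abs() > 2]
print(outliers)
What to Do with Outliers?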
Option 1: Remove Them
code.py
# Remove outliers (lower and upper are the IQR boundaries from Method 3)
clean_df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
Option 2: Cap Them
code.py
# Cap at boundaries
df['Salary'] = df['Salary'].clip(lower=lower, upper=upper)
Option 3: Keep Them
If they're real and meaningful, keep them but note their effect.
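One way to keep outliers while noting their effect is to mark them in a flag column and report statistics both ways. A minimal sketch, assuming the IQR rule from Method 3; the column name `Is_Outlier` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Salary': [50000, 55000, 60000, 52000, 500000, 58000]
})

# IQR boundaries, computed the same way as in Method 3
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep every row, but mark the ones outside the boundaries
df['Is_Outlier'] = ~df['Salary'].between(lower, upper)

print(df)
print("Mean (all rows):        ", df['Salary'].mean())
print("Mean (unflagged rows):  ", df.loc[~df['Is_Outlier'], 'Salary'].mean())
```

No rows are dropped, so the flagged value stays available for investigation, and any report can state results with and without it.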
Quick Outlier Check Function
code.py
def find_outliers_iqr(series):
    """Print a short summary and return the values outside the 1.5 * IQR boundaries."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = series[(series < lower) | (series > upper)]
    print(f"Found {len(outliers)} outliers")
    print(f"Range: {lower:.2f} to {upper:.2f}")
    return outliers

find_outliers_iqr(df['Salary'])
Key Points
- Outliers = unusual values
- Check with: min/max, std, IQR, Z-score
- The IQR method is the most common
- Decide: remove, cap, or keep
- Always investigate: is it an error or a real value?
Common Mistake
Don't automatically delete outliers! A CEO salary of 500K can be real. A person aged 200 is an error.
What's Next?
Learn about distribution analysis - understanding how data is spread.