10 min read
Data Cleaning: Missing Data
Detecting, dropping, and filling missing values in your datasets
What You'll Learn
- Detecting missing values
- Visualizing missing data patterns
- Dropping missing data
- Imputation techniques (filling values)
- Advanced interpolation methods
Detecting Missing Values
Checking for nulls:
code.py
import pandas as pd
import numpy as np
# Create dataframe with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Check for missing values (returns boolean mask)
print(df.isna())
# or
print(df.isnull())
# Count missing values per column
print(df.isna().sum())
# Total missing values
print(df.isna().sum().sum())
# Percentage of missing values
print(df.isna().mean() * 100)
Dropping Missing Data
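Before dropping anything, it can help to see the counts and percentages from the detection step side by side. The sketch below combines them into one table; `missing_summary` is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

def missing_summary(frame):
    """Return one row per column with the count and percentage of missing values."""
    mask = frame.isna()
    summary = pd.DataFrame({
        'n_missing': mask.sum(),
        'pct_missing': mask.mean() * 100,
    })
    # Columns with the most missing values come first
    return summary.sort_values('n_missing', ascending=False)

summary = missing_summary(df)
print(summary)
```

A column that is mostly empty is usually a candidate for dropping, while one with only a few gaps is a candidate for filling.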
Removing rows or columns:
code.py
# Drop rows with ANY missing values
df_clean = df.dropna()
# Drop columns with ANY missing values
df_clean = df.dropna(axis=1)
# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')
# Drop rows based on specific columns
df_clean = df.dropna(subset=['A', 'B'])
# Keep rows with at least N non-missing values
df_clean = df.dropna(thresh=2)
Filling Missing Values (Imputation)
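Before choosing between dropping and imputing, it is worth checking how much data each `dropna` variant would discard. The following is a minimal sketch on the same sample frame used earlier; the labels in the `variants` dictionary are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# Apply each dropna variant to the same frame
variants = {
    'any missing (rows)': df.dropna(),
    'any missing (columns)': df.dropna(axis=1),
    'all missing (rows)': df.dropna(how='all'),
    "subset=['A']": df.dropna(subset=['A']),
    'thresh=2': df.dropna(thresh=2),
}

# Compare what survives each variant
shapes = {name: result.shape for name, result in variants.items()}
for name, (n_rows, n_cols) in shapes.items():
    print(f"{name:<24} -> {n_rows} rows x {n_cols} columns")
```

On this frame the default `dropna()` keeps only half of the rows, which is the kind of loss that makes imputation worth considering.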
Constant value imputation:
code.py
# Fill with 0
df_filled = df.fillna(0)
# Fill with string
df_filled = df.fillna('Unknown')
# Fill specific columns with different values
df_filled = df.fillna({
'A': 0,
'B': df['B'].mean()
})
Statistical imputation:
code.py
# Fill with mean
df['A'] = df['A'].fillna(df['A'].mean())
# Fill with median (robust to outliers)
df['A'] = df['A'].fillna(df['A'].median())
# Fill with mode (for categorical data)
# Assumes df has a 'Category' column; the sample df above does not
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])
Forward and Backward Fill:
code.py
# Forward fill (propagate last valid observation)
# Useful for time series data
# Note: fillna(method='ffill') is deprecated in recent pandas; use ffill()/bfill()
df_ffill = df.ffill()
# Backward fill (use next valid observation)
df_bfill = df.bfill()
Advanced Techniques
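Interpolation estimates a missing value from its neighbours rather than copying one of them. As a rough sketch, the series below (hypothetical readings on irregularly spaced dates) compares forward fill with linear and time-based interpolation. Linear interpolation treats the rows as equally spaced, while `method='time'` weights by the actual gap between timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one day between the first two dates, three days to the last
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

filled = pd.DataFrame({
    'ffill': s.ffill(),                        # copies the previous value
    'linear': s.interpolate(method='linear'),  # ignores the index spacing
    'time': s.interpolate(method='time'),      # uses the datetime index
})
print(filled)
```

The missing reading on 2024-01-02 becomes 10.0 with forward fill, 25.0 with linear interpolation, and 17.5 with time-based interpolation, because that date lies a quarter of the way between the two known points.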
Interpolation:
code.py
# Linear interpolation
df['A'] = df['A'].interpolate(method='linear')
# Time-based interpolation (requires datetime index)
df['A'] = df['A'].interpolate(method='time')
Best Practices
- Understand WHY data is missing: Is it missing at random, or does the chance of a value being missing depend on the value itself or on other columns?
- Don't just drop: Dropping data reduces your sample size and can introduce bias.
- Check distribution: Ensure imputation doesn't drastically change the distribution of your data.
- Flag imputed values: Sometimes it's useful to create a new column indicating which values were imputed.
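The last two points can be sketched together: record which values were missing before filling, then compare summary statistics before and after. The column name `price_was_imputed` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, np.nan, 150, np.nan, 200]})

# Flag missing rows BEFORE filling; afterwards this information is lost
df['price_was_imputed'] = df['price'].isna()

before = df['price'].describe()  # describe() skips NaN values
df['price'] = df['price'].fillna(df['price'].median())
after = df['price'].describe()

print(df)
comparison = pd.DataFrame({'before': before, 'after': after})
print(comparison.loc[['count', 'mean', 'std']])
```

Here the mean stays the same but the standard deviation shrinks, since every imputed value sits at the centre of the distribution. The flag column lets a later analysis exclude or separately model the imputed rows.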
Practice Exercise
code.py
import pandas as pd
import numpy as np
# 1. Create dataset
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'price': [100, np.nan, 150, np.nan, 200],
'sales': [10, 20, np.nan, 40, 50]
})
# 2. Identify missing
print("Missing values:\n", df.isna().sum())
# 3. Fill price with mean
df['price'] = df['price'].fillna(df['price'].mean())
# 4. Drop rows with missing sales
df = df.dropna(subset=['sales'])
print("\nCleaned Data:\n", df)
Next Steps
Now that your data is clean, let's learn about NumPy for numerical computing!