Data Cleaning: Missing Data

What You'll Learn

Detecting missing values
Visualizing missing data patterns
Dropping missing data
Imputation techniques (filling values)
Advanced interpolation methods

Detecting Missing Values

Checking for nulls:

import pandas as pd
import numpy as np

# Create dataframe with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Check for missing values (returns boolean mask)
print(df.isna())
# or
print(df.isnull())

# Count missing values per column
print(df.isna().sum())

# Total missing values
print(df.isna().sum().sum())

# Percentage of missing values
print(df.isna().mean() * 100)
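
The column totals above do not show which rows are affected or which columns tend to be missing together. The following is a minimal sketch of a text-based look at those patterns, reusing the same example frame; the variable names `missing_per_row`, `rows_with_missing`, and `pattern_counts` are illustrative and not part of the original lesson.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

# How often each combination of missing columns occurs
# (True means the column is missing in that pattern)
pattern_counts = df.isna().value_counts()

print(missing_per_row)
print(rows_with_missing)
print(pattern_counts)
```

If one pattern dominates, for example two columns that are always missing together, that is usually a hint that the gaps share a cause.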

Dropping Missing Data

Removing rows or columns:

# Drop rows with ANY missing values
df_clean = df.dropna()

# Drop columns with ANY missing values
df_clean = df.dropna(axis=1)

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows based on specific columns
df_clean = df.dropna(subset=['A', 'B'])

# Keep rows with at least N non-missing values
df_clean = df.dropna(thresh=2)
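
To see how much data each option discards, it can help to compare the resulting shapes side by side. This is a small sketch on the same example frame; the `results` and `shapes` dictionaries are illustrative helpers, not pandas features.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Apply each dropna variant to the same frame
results = {
    "dropna()": df.dropna(),
    "dropna(axis=1)": df.dropna(axis=1),
    "dropna(how='all')": df.dropna(how='all'),
    "dropna(subset=['A', 'B'])": df.dropna(subset=['A', 'B']),
    "dropna(thresh=2)": df.dropna(thresh=2),
}

# Record (rows, columns) remaining after each call
shapes = {call: result.shape for call, result in results.items()}

print("original:", df.shape)
for call, shape in shapes.items():
    print(f"{call:<28} -> {shape}")
```

On this frame the default `dropna()` keeps only two of four rows, while `thresh=2` keeps three, which shows how the stricter options shrink a small dataset quickly.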

Filling Missing Values (Imputation)

Constant value imputation:

# Fill with 0
df_filled = df.fillna(0)

# Fill with a string (best suited to text columns;
# numeric columns are typically converted to object dtype)
df_filled = df.fillna('Unknown')

# Fill specific columns with different values
df_filled = df.fillna({
    'A': 0,
    'B': df['B'].mean()
})

Statistical imputation:

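# Fill with mean
df['A'] = df['A'].fillna(df['A'].mean())

# Fill with median (robust to outliers)
# Alternative to the mean fill above: once 'A' is filled, this line has nothing left to fill
df['A'] = df['A'].fillna(df['A'].median())

# Fill with mode (for categorical data)
# The example df has no 'Category' column, so one is added here for illustration
df['Category'] = ['x', 'y', None, 'x']
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])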
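
The note that the median is robust to outliers can be checked with a tiny example. This sketch uses a made-up series with one extreme value; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# One extreme value (100) and one missing value
s = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_value = s.mean()      # pulled upward by the outlier
median_value = s.median()  # stays near the typical values

filled_with_mean = s.fillna(mean_value)
filled_with_median = s.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

Here the mean is 26.5 while the median is 2.5, so a mean fill would insert a value that resembles none of the typical observations.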

Forward and Backward Fill:

# Forward fill (propagate last valid observation)
# Useful for time series data
# Note: fillna(method='ffill') is deprecated in pandas 2.1+; use ffill()/bfill()
df_ffill = df.ffill()

# Backward fill (use next valid observation)
df_bfill = df.bfill()

Advanced Techniques

Interpolation:

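# Linear interpolation (treats rows as equally spaced)
df['A'] = df['A'].interpolate(method='linear')

# Time-based interpolation (requires a DatetimeIndex; the default integer index raises an error)
# Illustrative daily dates are assigned here so the call can run
df.index = pd.date_range('2024-01-01', periods=len(df), freq='D')
df['A'] = df['A'].interpolate(method='time')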
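
The two methods only differ when observations are unevenly spaced. This is a minimal sketch with made-up dates: the missing reading falls one day after the first value and two days before the last.

```python
import numpy as np
import pandas as pd

# Irregularly spaced dates: gaps of 1 day and then 2 days
s = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
)

# Linear: ignores the index and places the value halfway between neighbours
linear = s.interpolate(method='linear')

# Time: weights by elapsed time, so the value sits one third of the way along
time_based = s.interpolate(method='time')

print(linear)
print(time_based)
```

The linear method gives 2.5 for the missing day, while the time-based method gives 2.0 because only one of the three elapsed days has passed.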

Best Practices

Understand WHY data is missing: Is it random or systematic?
Don't just drop: Dropping data reduces your sample size and can introduce bias.
Check distribution: Ensure imputation doesn't drastically change the distribution of your data.
Flag imputed values: Sometimes it's useful to create a new column indicating which values were imputed.
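
The last two points can be combined in a few lines. This sketch records which prices were imputed before filling them, then compares summary statistics before and after; the `price_was_missing` column name is an illustrative choice, not a pandas convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, np.nan, 150, np.nan, 200]})

# Flag the missing rows BEFORE filling, otherwise the information is lost
df['price_was_missing'] = df['price'].isna()

# Summary of the observed values only (NaN is ignored by describe)
before = df['price'].describe()

# Mean imputation
df['price'] = df['price'].fillna(df['price'].mean())

# Summary after imputation
after = df['price'].describe()

print(df)
print("std before:", before['std'])
print("std after: ", after['std'])
```

The mean stays at 150, but the standard deviation drops from 50 to about 35: mean imputation leaves the centre unchanged while making the data look less spread out than it is.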

Practice Exercise

import pandas as pd
import numpy as np

# 1. Create dataset
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [100, np.nan, 150, np.nan, 200],
    'sales': [10, 20, np.nan, 40, 50]
})

# 2. Identify missing
print("Missing values:\n", df.isna().sum())

# 3. Fill price with mean
df['price'] = df['price'].fillna(df['price'].mean())

# 4. Drop rows with missing sales
df = df.dropna(subset=['sales'])

print("\nCleaned Data:\n", df)

Next Steps

Now that your data is clean, let's learn about NumPy for numerical computing!
