
Data Cleaning: Missing Data

Detecting, dropping, and filling missing values in your datasets

What You'll Learn

  • Detecting missing values
  • Visualizing missing data patterns
  • Dropping missing data
  • Imputation techniques (filling values)
  • Advanced interpolation methods

Detecting Missing Values

Checking for nulls:

code.py
import pandas as pd
import numpy as np

# Create dataframe with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Check for missing values (returns boolean mask)
print(df.isna())
# isnull() is an alias of isna() and returns the same result
print(df.isnull())

# Count missing values per column
print(df.isna().sum())

# Total missing values
print(df.isna().sum().sum())

# Percentage of missing values
print(df.isna().mean() * 100)
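The column-level counts above do not show which rows are affected. As a minimal sketch using the same example dataframe (the variable names are illustrative), the boolean mask can also be reduced along rows to see where the gaps fall:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row)

# Names of the columns that contain at least one missing value
cols_with_missing = df.columns[df.isna().any()].tolist()
print(cols_with_missing)
```

Rows that are missing several values at once are often worth inspecting by hand before deciding whether to drop or fill them.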

Dropping Missing Data

Removing rows or columns:

code.py
# Drop rows with ANY missing values
df_clean = df.dropna()

# Drop columns with ANY missing values
df_clean = df.dropna(axis=1)

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows based on specific columns
df_clean = df.dropna(subset=['A', 'B'])

# Keep rows with at least N non-missing values
df_clean = df.dropna(thresh=2)
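To see how much data each option discards, it can help to compare the results side by side. The sketch below reuses the example dataframe from the detection section; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Rows with any missing value are removed: only rows 0 and 3 survive
rows_any = df.dropna()

# Columns with any missing value are removed: only column C survives
cols_any = df.dropna(axis=1)

# Rows need at least 2 non-missing values: row 2 (only C present) is removed
rows_thresh = df.dropna(thresh=2)

print("Original shape:   ", df.shape)
print("dropna():         ", rows_any.shape)
print("dropna(axis=1):   ", cols_any.shape)
print("dropna(thresh=2): ", rows_thresh.shape)
```

Checking the shape before and after is a quick way to notice when a drop removes more of the dataset than intended.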

Filling Missing Values (Imputation)

Constant value imputation:

code.py
# Fill with 0
df_filled = df.fillna(0)

# Fill with string (best suited to text columns; numeric columns
# would end up holding a mix of numbers and strings)
df_filled = df.fillna('Unknown')

# Fill specific columns with different values
df_filled = df.fillna({
    'A': 0,
    'B': df['B'].mean()
})

Statistical imputation:

code.py
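# Fill with mean
df['A'] = df['A'].fillna(df['A'].mean())

# Fill with median (robust to outliers)
# This is an alternative to the mean fill above: apply one or the other
df['A'] = df['A'].fillna(df['A'].median())

# Fill with mode (for categorical data)
# Hypothetical 'Category' column, added here only so the example runs
df['Category'] = ['x', 'y', None, 'x']
# mode() returns a Series (there can be ties), so take the first value
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])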
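The difference between mean and median imputation shows up most clearly when a column contains an outlier. The following is a small sketch with made-up numbers (the series `s` is hypothetical, not taken from the dataset above):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one extreme value (500) and one gap
s = pd.Series([10, 12, 11, np.nan, 13, 500])

# The mean is pulled upward by the outlier
mean_fill = s.fillna(s.mean())

# The median stays close to the typical values
median_fill = s.fillna(s.median())

print("Filled with mean:  ", mean_fill[3])
print("Filled with median:", median_fill[3])
```

Here the mean-based fill lands far from every ordinary observation, while the median-based fill stays within the typical range.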

Forward and Backward Fill:

code.py
# Forward fill (propagate last valid observation)
# Useful for time series data
# Note: fillna(method='ffill') is deprecated in recent pandas versions;
# use the dedicated ffill() and bfill() methods instead
df_ffill = df.ffill()

# Backward fill (use next valid observation)
df_bfill = df.bfill()
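Forward fill will carry a value across a gap of any length, which may not be appropriate for long gaps. As a sketch on a hypothetical daily series (the name `readings` and its values are made up for illustration), the `limit` parameter caps how many consecutive missing values get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a three-day gap
readings = pd.Series(
    [20.0, np.nan, np.nan, np.nan, 24.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Unlimited forward fill: the value 20.0 is carried across the whole gap
filled_all = readings.ffill()

# Limited forward fill: only the first missing value after 20.0 is filled
filled_limited = readings.ffill(limit=1)

print(filled_all)
print(filled_limited)
```

With `limit=1`, the remaining two gaps stay missing and can be handled by another method or flagged for review.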

Advanced Techniques

Interpolation:

code.py
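# Linear interpolation (treats rows as equally spaced)
df['A'] = df['A'].interpolate(method='linear')

# Time-based interpolation (requires a DatetimeIndex; it fails on the
# default integer index of the example dataframe above)
df['A'] = df['A'].interpolate(method='time')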
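The two methods only give different results when observations are unevenly spaced in time. A minimal sketch, assuming a hypothetical series with an irregular datetime index:

```python
import numpy as np
import pandas as pd

# Hypothetical series: the gap on Jan 2 sits 1 day after the first
# observation and 3 days before the next one
s = pd.Series(
    [0.0, np.nan, 10.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
)

# Linear: ignores the index and places the value halfway between neighbours
linear = s.interpolate(method='linear')

# Time: weights by elapsed time (1 of 4 days has passed)
by_time = s.interpolate(method='time')

print("linear:", linear.iloc[1])
print("time:  ", by_time.iloc[1])
```

For regularly spaced data the two methods agree, so `method='time'` mainly matters for irregular timestamps.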

Best Practices

  1. Understand WHY data is missing: Is it random or systematic?
  2. Don't just drop: Dropping data reduces your sample size and can introduce bias.
  3. Check distribution: Ensure imputation doesn't drastically change the distribution of your data.
  4. Flag imputed values: Sometimes it's useful to create a new column indicating which values were imputed.
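Points 3 and 4 can be sketched together. The example below uses a hypothetical `price` column and an illustrative flag column name (`price_was_missing`); it records which values were imputed and compares summary statistics before and after a mean fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, np.nan, 150, np.nan, 200]})

# Flag the rows that are about to be imputed (do this BEFORE filling)
df['price_was_missing'] = df['price'].isna()

# Summary statistics before imputation (NaN values are skipped)
mean_before = df['price'].mean()
std_before = df['price'].std()

# Mean imputation
df['price'] = df['price'].fillna(mean_before)

# Summary statistics after imputation
mean_after = df['price'].mean()
std_after = df['price'].std()

print(df)
print("mean before/after:", mean_before, mean_after)
print("std before/after: ", std_before, std_after)
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the centre of the distribution. The flag column keeps that information available for later analysis.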

Practice Exercise

code.py
import pandas as pd
import numpy as np

# 1. Create dataset
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [100, np.nan, 150, np.nan, 200],
    'sales': [10, 20, np.nan, 40, 50]
})

# 2. Identify missing
print("Missing values:\n", df.isna().sum())

# 3. Fill price with mean
df['price'] = df['price'].fillna(df['price'].mean())

# 4. Drop rows with missing sales
df = df.dropna(subset=['sales'])

print("\nCleaned Data:\n", df)

Next Steps

Now that your data is clean, let's learn about NumPy for numerical computing!
