✅ Data Cleaning Checklist
Phase 1: Initial Assessment
- [ ] Load data and inspect first/last rows
- [ ] Check dataset size (rows × columns)
- [ ] Review column names and data types
- [ ] Get summary statistics
- [ ] Check for completely empty columns/rows
Phase 2: Missing Values
- [ ] Identify columns with missing values
- [ ] Calculate % missing per column
- [ ] Decide strategy: drop, impute, or flag
- [ ] Document missing value decisions
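The "flag" strategy above can be sketched as follows; the column name `income` and the values are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [52000, np.nan, 61000, np.nan]})

# Keep a boolean indicator BEFORE imputing, so later analysis
# can still tell which values were originally missing
df['income_was_missing'] = df['income'].isna()
df['income'] = df['income'].fillna(df['income'].median())
```

The flag column costs almost nothing and preserves information that imputation would otherwise destroy.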
Phase 3: Duplicates
- [ ] Check for exact duplicate rows
- [ ] Check for duplicate IDs (primary keys)
- [ ] Identify partial duplicates (same name, different ID)
- [ ] Remove or flag duplicates with justification
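A sketch of the exact-vs-partial duplicate distinction above, using hypothetical `id`, `name`, and `email` columns:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Asha Rao', 'Asha Rao', 'Vikram Shah'],
    'email': ['asha@x.com', 'asha@x.com', 'vikram@x.com'],
})

# Exact duplicates: every column identical (none here, ids differ)
exact = df.duplicated()

# Partial duplicates: same name/email but different id
# keep=False marks ALL members of each duplicate group
partial = df.duplicated(subset=['name', 'email'], keep=False)
print(df[partial])
```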
Phase 4: Data Types
- [ ] Convert strings to dates where applicable
- [ ] Cast numeric columns (currently strings)
- [ ] Standardize categorical values (case, spelling)
- [ ] Create derived columns (year, month from date)
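For the numeric-cast item above, a minimal sketch (column names are illustrative): `pd.to_numeric` with `errors='coerce'` tolerates dirty values, while `astype` is stricter and suits columns already known to be clean.

```python
import pandas as pd

df = pd.DataFrame({'price': ['199', '249', 'N/A'],
                   'qty': ['3', '1', '2']})

# errors='coerce' turns unparseable strings into NaN instead of raising
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# astype raises on any bad value, which is what you want for clean columns
df['qty'] = df['qty'].astype(int)
```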
Phase 5: Outliers
- [ ] Identify outliers using IQR or z-score
- [ ] Validate if outliers are errors or real
- [ ] Decide: cap, remove, or keep with justification
- [ ] Document outlier handling
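The checklist mentions z-scores but the pipeline below only shows IQR capping, so here is a minimal z-score sketch (data is illustrative):

```python
import pandas as pd

s = pd.Series([50, 52, 49, 51, 48, 53, 50, 500])

# z-score: distance from the mean in standard deviations
z = (s - s.mean()) / s.std()
outliers = s[z.abs() > 2]
```

Note that with very small samples a single extreme value inflates the standard deviation, so z-scores can understate how unusual it is; IQR-based bounds are more robust there.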
Phase 6: Validation
- [ ] Check value ranges (age 0-120, price > 0)
- [ ] Validate formats (email, phone, ZIP)
- [ ] Cross-field validation (end_date > start_date)
- [ ] Check referential integrity (foreign keys exist)
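A sketch of the format and cross-field checks above; the columns and the loose email pattern are illustrative (deliberately not full RFC 5322):

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'not-an-email'],
    'start_date': pd.to_datetime(['2026-01-01', '2026-05-01']),
    'end_date': pd.to_datetime(['2026-02-01', '2026-04-01']),
})

# Format check: something@something.tld
valid_email = df['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

# Cross-field check: end must come after start
valid_dates = df['end_date'] > df['start_date']

bad_rows = df[~(valid_email & valid_dates)]
```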
Phase 7: Standardization
- [ ] Consistent naming (lowercase, underscores)
- [ ] Uniform categories (Mumbai vs Bombay → Mumbai)
- [ ] Date format consistency
- [ ] Unit consistency (cm vs inches → cm)
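The unit-consistency item can be sketched like this, assuming a hypothetical `unit` column records each row's original unit:

```python
import pandas as pd

df = pd.DataFrame({'height': [170.0, 68.0, 165.0],
                   'unit': ['cm', 'in', 'cm']})

# Convert inches to cm in place, then mark every row as cm
df.loc[df['unit'] == 'in', 'height'] *= 2.54
df['unit'] = 'cm'
```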
Phase 8: Documentation
- [ ] Document all cleaning steps
- [ ] Record rows removed and why
- [ ] Note assumptions made
- [ ] Save cleaned dataset with clear name
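One lightweight way to satisfy the documentation items above is a machine-readable cleaning log saved next to the data; every field and number here is illustrative:

```python
import json

# A minimal cleaning log; all values are examples
log = {
    'source_file': 'raw_data.csv',
    'rows_before': 10432,
    'rows_after': 10118,
    'steps': [
        'dropped columns with >50% missing',
        'imputed numeric nulls with median',
        'removed 314 duplicate rows',
    ],
    'assumptions': ['negative prices treated as data-entry errors'],
}

with open('cleaning_log.json', 'w') as f:
    json.dump(log, f, indent=2)
```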
🐍 Python Implementation
Complete Cleaning Pipeline:
```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('raw_data.csv')
orig_len = len(df)
print(f"Original shape: {df.shape}")

# 1. Initial inspection
print(df.head())
df.info()  # info() prints directly and returns None, so no print() wrapper
print(df.describe())

# 2. Missing values
print("\nMissing values:")
print(df.isnull().sum())
print("\n% Missing:")
print((df.isnull().sum() / len(df)) * 100)

# Drop columns with >50% missing
threshold = 0.5
df = df.loc[:, df.isnull().mean() < threshold]

# Impute numeric with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Impute categorical with mode
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# 3. Duplicates
print(f"\nDuplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# 4. Data types
# Convert date columns
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# 5. Outliers (cap at 99th percentile)
def cap_outliers(df, column, percentile=0.99):
    upper_limit = df[column].quantile(percentile)
    df[column] = df[column].clip(upper=upper_limit)
    return df

df = cap_outliers(df, 'price', 0.99)

# 6. Validation
# Remove invalid ages
df = df[(df['age'] >= 0) & (df['age'] <= 120)]
# Remove negative prices
df = df[df['price'] > 0]

# 7. Standardization
# Lowercase column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Standardize categorical values
df['city'] = df['city'].str.title()  # Mumbai, not MUMBAI
df['city'] = df['city'].replace({'Bombay': 'Mumbai', 'Bengaluru': 'Bangalore'})

print(f"\nFinal shape: {df.shape}")
print(f"Rows removed: {orig_len - len(df)}")

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
```
🗄️ SQL Data Cleaning
Common Cleaning Queries:
```sql
-- 1. Find duplicates
SELECT customer_id, email, COUNT(*)
FROM customers
GROUP BY customer_id, email
HAVING COUNT(*) > 1;

-- Remove duplicates (keep first occurrence)
DELETE FROM customers
WHERE id NOT IN (
    SELECT MIN(id)
    FROM customers
    GROUP BY customer_id, email
);

-- 2. Handle missing values
-- Find rows with null in critical columns
SELECT *
FROM orders
WHERE customer_id IS NULL OR order_date IS NULL;

-- Replace nulls with a default
UPDATE products
SET category = 'Uncategorized'
WHERE category IS NULL;

-- 3. Fix data types (PostgreSQL syntax)
-- Convert string to date
ALTER TABLE orders
ALTER COLUMN order_date TYPE DATE
USING order_date::DATE;

-- 4. Outlier detection (IQR method)
WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY price) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price) AS q3
    FROM products
)
SELECT *
FROM products, stats
WHERE price < (q1 - 1.5 * (q3 - q1))  -- Lower outliers
   OR price > (q3 + 1.5 * (q3 - q1)); -- Upper outliers

-- 5. Standardize text
UPDATE customers
SET city = INITCAP(city),  -- Title case
    email = LOWER(email);  -- Lowercase emails

-- Replace variations
UPDATE customers
SET city = REPLACE(city, 'Bengaluru', 'Bangalore');
```

⚠️ Common Data Quality Issues
Issue 1: Inconsistent Categories
Problem: Mumbai, MUMBAI, Bombay, mumbai (4 variations)
Solution:
```python
# Standardize case first
df['city'] = df['city'].str.title()  # Mumbai
df['city'] = df['city'].replace({'Bombay': 'Mumbai', 'Bengaluru': 'Bangalore'})

# Verify
print(df['city'].value_counts())
```

Issue 2: Date Formats
Problem: 2026-03-15, 03/15/2026, 15-Mar-2026 (multiple formats)
Solution:
```python
# pandas auto-detects most common formats
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Or specify the format explicitly, e.g. for 15-Mar-2026
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%Y')
```

Issue 3: Outliers
Problem: Salaries: [₹5L, ₹6L, ₹5.5L, ₹500L] (₹500L is either a data-entry error or a genuine extreme value; investigate before deciding)
Solution:
```python
# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# OR cap outliers (keep the row, adjust the value)
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)
```

Issue 4: Mixed Data Types
Problem: Age column has: 25, "30", "unknown", ""
Solution:
```python
# Convert to numeric, coercing errors ("unknown", "") to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Then handle the resulting nulls
df['age'] = df['age'].fillna(df['age'].median())
```