Data Validation

What is Data Validation?

Data validation checks if your data makes sense:

Is age between 0 and 120?
Are prices positive?
Are emails in correct format?

Bad data leads to wrong results: one negative salary is enough to drag down an average.

Check Value Ranges

code.py

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Error'],
    'Age': [25, 30, -5],
    'Salary': [50000, 60000, -1000]
})

# Find invalid ages (negative or too high)
invalid_age = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Invalid ages:")
print(invalid_age)

Output:

    Name  Age  Salary
2  Error   -5   -1000

Check for Negative Numbers

code.py

# Find negative salaries
negative_salary = df[df['Salary'] < 0]
print("Negative salaries:")
print(negative_salary)

Validate Text Values

code.py

df = pd.DataFrame({
    'Status': ['Active', 'Inactive', 'active', 'Unknown', 'Active']
})

# Define valid values
valid_status = ['Active', 'Inactive']

# Find invalid values
invalid = df[~df['Status'].isin(valid_status)]
print("Invalid status values:")
print(invalid)
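
The introduction also lists email format as a typical check. One possible way to sketch it is a regular expression applied with .str.fullmatch(). The sample data, the emails variable, the Email_OK column and the pattern itself are assumptions made for illustration; the pattern only tests the rough shape "something@something.something" and is not a complete email validator.

```python
import pandas as pd

# Hypothetical sample data for illustration
emails = pd.DataFrame({
    'Email': ['a@mail.com', 'not-an-email', 'b@mail', None]
})

# Assumed, deliberately simple pattern: text, "@", text, ".", text (no spaces)
pattern = r'[^@\s]+@[^@\s]+\.[^@\s]+'

# fullmatch requires the whole string to match;
# na=False counts missing values as invalid
emails['Email_OK'] = emails['Email'].str.fullmatch(pattern, na=False)

# Keep only the rows that failed the format check
invalid_emails = emails[~emails['Email_OK']]
print("Invalid emails:")
print(invalid_emails)
```

Here 'not-an-email' (no @), 'b@mail' (no dot after the @) and the missing value are all flagged.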

Validate with Conditions

code.py

df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    'End_Date': pd.to_datetime(['2024-01-15', '2024-01-15', '2024-03-15'])
})

# End date should not be before start date
invalid = df[df['End_Date'] < df['Start_Date']]
print("End before start:")
print(invalid)

Create Validation Flag

code.py

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, -5, 150]
})

# Add flag for valid/invalid
df['Is_Valid'] = (df['Age'] >= 0) & (df['Age'] <= 120)
print(df)

Output:

    Name  Age  Is_Valid
0   John   25      True
1  Sarah   -5     False
2   Mike  150     False

Count Invalid Records

code.py

# How many invalid?
invalid_count = (~df['Is_Valid']).sum()
print(f"Invalid records: {invalid_count}")

# Percentage invalid
pct_invalid = (~df['Is_Valid']).mean() * 100
print(f"Percent invalid: {pct_invalid:.1f}%")
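
When there is more than one rule, it can help to know which rule each row breaks. A minimal sketch, assuming one boolean column per rule; the failures DataFrame and the rule names age_negative and age_over_120 are hypothetical labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, -5, 150]
})

# One boolean column per rule; True means the row FAILS that rule
failures = pd.DataFrame({
    'age_negative': df['Age'] < 0,
    'age_over_120': df['Age'] > 120,
})

# A row is valid only if it fails none of the rules
df['Is_Valid'] = ~failures.any(axis=1)

# Number of failing rows per rule
print(failures.sum())
print(df)
```

The per-rule counts show where the problems come from, while Is_Valid stays the same single flag as before.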

Fix Invalid Values

code.py

import numpy as np

df = pd.DataFrame({
    'Age': [25, -5, 150, 30]
})

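# The two options are alternatives, so each result goes in its own column
invalid_mask = (df['Age'] < 0) | (df['Age'] > 120)

# Option 1: Set invalid to NaN (mask() returns a float column with NaN)
df['Age_NaN'] = df['Age'].mask(invalid_mask)

# Option 2: Clip to valid range (-5 becomes 0, 150 becomes 120)
df['Age_Clipped'] = df['Age'].clip(lower=0, upper=120)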

Check for Required Fields

code.py

df = pd.DataFrame({
    'Email': ['a@mail.com', None, 'b@mail.com'],
    'Name': ['John', 'Sarah', None]
})

# Find rows missing required fields
missing_email = df[df['Email'].isna()]
missing_name = df[df['Name'].isna()]

print("Missing email:", len(missing_email))
print("Missing name:", len(missing_name))
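
If several columns are required, the same check can be run on all of them at once. A short sketch, assuming the required columns are kept in a list; the names required, missing_counts and incomplete are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Email': ['a@mail.com', None, 'b@mail.com'],
    'Name': ['John', 'Sarah', None]
})

# Assumed list of required columns
required = ['Email', 'Name']

# Missing values per required column
missing_counts = df[required].isna().sum()
print(missing_counts)

# Rows missing at least one required field
incomplete = df[df[required].isna().any(axis=1)]
print(incomplete)
```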

Summary Statistics for Quick Check

code.py

df = pd.DataFrame({
    'Age': [25, 30, 28, -5, 150],
    'Salary': [50000, 60000, 55000, -1000, 70000]
})

# Quick check - look for suspicious min/max
print(df.describe())

Look at min and max to spot problems.
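
Reading min and max by eye works for a few columns. For more columns, the comparison could be automated roughly as follows. The expected ranges are assumptions chosen for this example (None means no upper limit); they do not come from pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 28, -5, 150],
    'Salary': [50000, 60000, 55000, -1000, 70000]
})

# Assumed acceptable (low, high) range per column; adjust for your data
expected = {'Age': (0, 120), 'Salary': (0, None)}

summary = df.describe()
problems = []
for col, (low, high) in expected.items():
    # describe() stores the smallest and largest value in the 'min' and 'max' rows
    col_min = summary.loc['min', col]
    col_max = summary.loc['max', col]
    if low is not None and col_min < low:
        problems.append(f"{col}: min {col_min} is below {low}")
    if high is not None and col_max > high:
        problems.append(f"{col}: max {col_max} is above {high}")

for problem in problems:
    print(problem)
```

With this data, three problems are reported: the Age minimum, the Age maximum and the Salary minimum.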

Key Points

Always check data before analysis
Look for: negatives, out of range, wrong format
.isin() checks whether each value is in a list of valid values
.clip() forces values into a range
.describe() shows a quick summary
Create validation flags to track issues

Validation Checklist

Check	Code Example
Negative numbers	df[df['col'] < 0]
Out of range	df[(df['col'] < min) | (df['col'] > max)]
Invalid category	df[~df['col'].isin(valid_list)]
Missing required	df[df['col'].isna()]
Future dates	df[df['date'] > today]
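
The "Future dates" row uses a today value that the checklist does not define. One possible way to make that check runnable, with a hypothetical Order_Date column:

```python
import pandas as pd

# Today's date at midnight
today = pd.Timestamp.today().normalize()

# Hypothetical data: the second date is built relative to today,
# so it is always in the future
df = pd.DataFrame({
    'Order_Date': [pd.Timestamp('2024-01-15'), today + pd.Timedelta(days=30)]
})

# Find dates later than today
future = df[df['Order_Date'] > today]
print("Future dates:")
print(future)
```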

What's Next?

Congratulations! You've completed the Data Cleaning module. Next, learn Exploratory Data Analysis (EDA).