5 min read
Data Validation
Learn to check if your data is correct and makes sense
What is Data Validation?
Data validation checks if your data makes sense:
- Is age between 0 and 120?
- Are prices positive?
- Are emails in correct format?
Bad data leads to wrong results!
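The third question, email format, can be sketched with a regular expression. The pattern below is a deliberately loose assumption (text, then "@", then text containing a dot) and not a complete email validator. The sample addresses are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'Email': ['john@mail.com', 'sarah.mail.com', 'mike@mail', None]
})

# Loose pattern (an assumption): text, "@", text, ".", text, with no spaces
pattern = r'[^@\s]+@[^@\s]+\.[^@\s]+'

# fullmatch requires the whole string to fit the pattern;
# na=False treats missing emails as invalid
is_valid_email = df['Email'].str.fullmatch(pattern, na=False)

invalid_email = df[~is_valid_email]
print("Invalid emails:")
print(invalid_email)
```

Only the first address passes here: the second has no "@", the third has no dot after the "@", and the fourth is missing.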
Check Value Ranges
code.py
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Error'],
'Age': [25, 30, -5],
'Salary': [50000, 60000, -1000]
})
# Find invalid ages (negative or too high)
invalid_age = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Invalid ages:")
print(invalid_age)
Output:
Invalid ages:
    Name  Age  Salary
2  Error   -5   -1000
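The same range check can also be written with `.between()`, which may read more clearly than two comparisons joined with `|`. This is a minimal sketch using the same sample ages. Note that a missing age would also be flagged, because `.between()` returns False for NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Error'],
    'Age': [25, 30, -5],
})

# between() is inclusive on both ends by default, so 0 and 120 count as valid
in_range = df['Age'].between(0, 120)

# Negate the mask to keep only the out-of-range rows
invalid_age = df[~in_range]
print(invalid_age)
```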
Check for Negative Numbers
code.py
# Find negative salaries
negative_salary = df[df['Salary'] < 0]
print("Negative salaries:")
print(negative_salary)
Validate Text Values
code.py
df = pd.DataFrame({
'Status': ['Active', 'Inactive', 'active', 'Unknown', 'Active']
})
# Define valid values
valid_status = ['Active', 'Inactive']
# Find invalid values
invalid = df[~df['Status'].isin(valid_status)]
print("Invalid status values:")
print(invalid)
Validate with Conditions
code.py
df = pd.DataFrame({
'Start_Date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
'End_Date': pd.to_datetime(['2024-01-15', '2024-01-15', '2024-03-15'])
})
# End date should not come before start date
invalid = df[df['End_Date'] < df['Start_Date']]
print("End before start:")
print(invalid)
Create Validation Flag
code.py
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike'],
'Age': [25, -5, 150]
})
# Add flag for valid/invalid
df['Is_Valid'] = (df['Age'] >= 0) & (df['Age'] <= 120)
print(df)
Output:
    Name  Age  Is_Valid
0   John   25      True
1  Sarah   -5     False
2   Mike  150     False
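When there is more than one rule, one approach is to keep a boolean column per rule and combine them into a single flag. The sketch below assumes a hypothetical `Salary` column and two illustrative rule names. It also records which rules each row failed.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, -5, 150],
    'Salary': [50000, 60000, -1000],  # hypothetical extra column
})

# One boolean column per rule (True means the rule passed)
checks = pd.DataFrame({
    'age_in_range': df['Age'].between(0, 120),
    'salary_not_negative': df['Salary'] >= 0,
})

# A row is valid only if every rule passed
df['Is_Valid'] = checks.all(axis=1)

# List the names of the failed rules for each row
df['Failed_Checks'] = checks.apply(
    lambda row: ', '.join(name for name, passed in row.items() if not passed),
    axis=1,
)
print(df)
```

Keeping the per-rule results makes it easier to see why a row was flagged, not only that it was.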
Count Invalid Records
code.py
# How many invalid?
invalid_count = (~df['Is_Valid']).sum()
print(f"Invalid records: {invalid_count}")
# Percentage invalid
pct_invalid = (~df['Is_Valid']).mean() * 100
print(f"Percent invalid: {pct_invalid:.1f}%")
Fix Invalid Values
code.py
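import numpy as np
df = pd.DataFrame({
'Age': [25, -5, 150, 30]
})
# Options 1 and 2 are alternatives, so each result goes in its own column
in_range = df['Age'].between(0, 120)
# Option 1: Replace invalid values with NaN (the column becomes float)
df['Age_NaN'] = df['Age'].where(in_range, np.nan)
# Option 2: Clip to valid range (-5 becomes 0, 150 becomes 120)
df['Age_Clipped'] = df['Age'].clip(lower=0, upper=120)
Check for Required Fields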
code.py
df = pd.DataFrame({
'Email': ['a@mail.com', None, 'b@mail.com'],
'Name': ['John', 'Sarah', None]
})
# Find rows missing required fields
missing_email = df[df['Email'].isna()]
missing_name = df[df['Name'].isna()]
print("Missing email:", len(missing_email))
print("Missing name:", len(missing_name))
Summary Statistics for Quick Check
code.py
df = pd.DataFrame({
'Age': [25, 30, 28, -5, 150],
'Salary': [50000, 60000, 55000, -1000, 70000]
})
# Quick check - look for suspicious min/max
print(df.describe())
Look at min and max to spot problems.
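Reading min and max by eye works for a few columns. For more columns, the comparison can be automated. The sketch below assumes a hypothetical `limits` dictionary of expected ranges; the salary ceiling in particular is an arbitrary illustrative value.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 28, -5, 150],
    'Salary': [50000, 60000, 55000, -1000, 70000],
})

# Expected (min, max) per column; these limits are assumptions for illustration
limits = {'Age': (0, 120), 'Salary': (0, 1_000_000)}

# Rows 'min' and 'max', one column per field
summary = df.agg(['min', 'max'])

problems = []
for col, (low, high) in limits.items():
    col_min = summary.loc['min', col]
    col_max = summary.loc['max', col]
    if col_min < low:
        problems.append(f"{col}: min {col_min} is below {low}")
    if col_max > high:
        problems.append(f"{col}: max {col_max} is above {high}")

for problem in problems:
    print(problem)
```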
Key Points
- Always check data before analysis
- Look for negatives, out-of-range values, and wrong formats
- .isin() checks whether each value is in a list of valid values
- .clip() forces values into a range
- .describe() shows a quick summary
- Create validation flags to track issues
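One caveat with `.isin()` is that it compares values exactly, so 'active' does not match 'Active'. If differences in capitalization or stray spaces should not count as errors (an assumption that depends on your data), the values can be cleaned before the check, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Status': ['Active', 'Inactive', 'active', 'Unknown', ' Active ']
})
valid_status = ['Active', 'Inactive']

# Trim whitespace and standardize capitalization before comparing
cleaned = df['Status'].str.strip().str.capitalize()

# Only values that are still not in the valid list after cleaning
still_invalid = df[~cleaned.isin(valid_status)]
print(still_invalid)
```

After cleaning, only 'Unknown' remains invalid.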
Validation Checklist
| Check | Code Example |
|---|---|
| Negative numbers | df[df['col'] < 0] |
| Out of range | df[(df['col'] < min) \| (df['col'] > max)] |
| Invalid category | df[~df['col'].isin(valid_list)] |
| Missing required | df[df['col'].isna()] |
| Future dates | df[df['date'] > today] |
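The future dates row uses a `today` variable that the checklist does not define. One way to fill it in, shown here as a sketch with made-up sample dates, is `pd.Timestamp.today()` normalized to midnight:

```python
import pandas as pd

# Hypothetical sample data; the column name 'date' follows the checklist
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-01', '2100-01-01', '2023-11-30'])
})

# Current date with the time set to 00:00:00
today = pd.Timestamp.today().normalize()

# Rows dated after today
future = df[df['date'] > today]
print("Future dates:")
print(future)
```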
What's Next?
Congratulations! You've completed the Data Cleaning module. Next, learn Exploratory Data Analysis (EDA).