Data Validation

Learn to check if your data is correct and makes sense

What is Data Validation?

Data validation checks if your data makes sense:

  • Is age between 0 and 120?
  • Are prices positive?
  • Are emails in correct format?

Bad data leads to wrong results!
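
The range and sign checks are covered in the sections below, but the email question is not. Here is a minimal sketch of a format check, assuming a hypothetical `Email` column and a deliberately simple pattern (one `@`, no spaces, a dot in the domain). Real email rules are looser than this, so treat it as a rough filter, not a full validator:

```python
import pandas as pd

# Hypothetical sample data for illustration
emails = pd.DataFrame({
    'Email': ['john@mail.com', 'sarah.mail.com', 'mike@mail', None]
})

# Simple pattern (an assumption, not a full email standard):
# text without spaces or @, then @, then a domain containing a dot
pattern = r'[^@\s]+@[^@\s]+\.[^@\s]+'

# na=False makes missing values count as "not a valid email"
emails['Email_OK'] = emails['Email'].str.fullmatch(pattern, na=False)

# Show the rows that fail the format check
print(emails[~emails['Email_OK']])
```

Missing emails fail this check too. If a missing email is acceptable in your data, handle it separately with `.isna()`.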

Check Value Ranges

code.py
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Error'],
    'Age': [25, 30, -5],
    'Salary': [50000, 60000, -1000]
})

# Find invalid ages (negative or too high)
invalid_age = df[(df['Age'] < 0) | (df['Age'] > 120)]
print("Invalid ages:")
print(invalid_age)

Output:

    Name  Age  Salary
2  Error   -5   -1000

Check for Negative Numbers

code.py
# Find negative salaries
negative_salary = df[df['Salary'] < 0]
print("Negative salaries:")
print(negative_salary)
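
When several numeric columns must all be non-negative, the same comparison can run on all of them at once. A sketch using the same sample data (the name `people` is only for this example):

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Error'],
    'Age': [25, 30, -5],
    'Salary': [50000, 60000, -1000]
})

# Pick only the numeric columns, then count negatives in each
numeric = people.select_dtypes(include='number')
negatives_per_column = (numeric < 0).sum()
print(negatives_per_column)

# Rows with a negative value in any numeric column
rows_with_negative = people[(numeric < 0).any(axis=1)]
print(rows_with_negative)
```

This only makes sense if every numeric column should be non-negative. Columns such as temperature or profit can be legitimately negative and would need to be left out.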

Validate Text Values

code.py
df = pd.DataFrame({
    'Status': ['Active', 'Inactive', 'active', 'Unknown', 'Active']
})

# Define valid values
valid_status = ['Active', 'Inactive']

# Find invalid values
invalid = df[~df['Status'].isin(valid_status)]
print("Invalid status values:")
print(invalid)
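
Note that 'active' is flagged above only because `.isin()` is case-sensitive. If that difference is just inconsistent typing, one option is to normalize the text before checking. A sketch with the same values (whether 'active' should count as 'Active' is an assumption about your data):

```python
import pandas as pd

statuses = pd.DataFrame({
    'Status': ['Active', 'Inactive', 'active', 'Unknown', 'Active']
})
valid_status = ['Active', 'Inactive']

# Strip extra spaces and capitalize the first letter before checking
cleaned = statuses['Status'].str.strip().str.capitalize()

# Only values that are still not in the valid list remain
still_invalid = statuses[~cleaned.isin(valid_status)]
print(still_invalid)
```

After normalizing, only 'Unknown' is left as a genuinely invalid value.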

Validate with Conditions

code.py
df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    'End_Date': pd.to_datetime(['2024-01-15', '2024-01-15', '2024-03-15'])
})

# End date should not be before start date
invalid = df[df['End_Date'] < df['Start_Date']]
print("End before start:")
print(invalid)

Create Validation Flag

code.py
df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, -5, 150]
})

# Add flag for valid/invalid
df['Is_Valid'] = (df['Age'] >= 0) & (df['Age'] <= 120)
print(df)

Output:

    Name  Age  Is_Valid
0   John   25      True
1  Sarah   -5     False
2   Mike  150     False
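
The same flag can also be written with `.between()`, which includes both end points by default. A sketch with the same data, plus a hypothetical `Reason` column that records why a row failed:

```python
import pandas as pd

checked = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, -5, 150]
})

# between(0, 120) is True when 0 <= Age <= 120
checked['Is_Valid'] = checked['Age'].between(0, 120)

# Hypothetical extra column: a short reason for each invalid row
checked['Reason'] = ''
checked.loc[checked['Age'] < 0, 'Reason'] = 'negative age'
checked.loc[checked['Age'] > 120, 'Reason'] = 'age above 120'
print(checked)
```

Keeping a reason next to the flag makes it easier to decide later how each kind of problem should be fixed.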

Count Invalid Records

code.py
# How many invalid?
invalid_count = (~df['Is_Valid']).sum()
print(f"Invalid records: {invalid_count}")

# Percentage invalid
pct_invalid = (~df['Is_Valid']).mean() * 100
print(f"Percent invalid: {pct_invalid:.1f}%")
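
With more than one rule, it helps to count failures per rule. A sketch, assuming two example rules on hypothetical data (the rule names are only labels chosen for this example):

```python
import pandas as pd

records = pd.DataFrame({
    'Age': [25, -5, 150, 30],
    'Salary': [50000, 60000, -1000, 55000]
})

# Each rule is a True/False column where True means "this row fails"
failures = pd.DataFrame({
    'age_out_of_range': ~records['Age'].between(0, 120),
    'salary_negative': records['Salary'] < 0
})

# Number of failing rows per rule
print(failures.sum())

# Rows that fail at least one rule
any_failure = failures.any(axis=1)
print(f"Rows with any problem: {any_failure.sum()} of {len(records)}")
```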

Fix Invalid Values

code.py
import numpy as np

df = pd.DataFrame({
    'Age': [25, -5, 150, 30]
})

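# Rows where Age is outside the valid range
out_of_range = (df['Age'] < 0) | (df['Age'] > 120)

# Option 1: Set invalid to NaN (stored in a new column so both options can be compared)
df['Age_NaN'] = df['Age'].mask(out_of_range, np.nan)

# Option 2: Clip to valid range (-5 becomes 0, 150 becomes 120)
df['Age_Clipped'] = df['Age'].clip(lower=0, upper=120)
print(df)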

Check for Required Fields

code.py
df = pd.DataFrame({
    'Email': ['a@mail.com', None, 'b@mail.com'],
    'Name': ['John', 'Sarah', None]
})

# Find rows missing required fields
missing_email = df[df['Email'].isna()]
missing_name = df[df['Name'].isna()]

print("Missing email:", len(missing_email))
print("Missing name:", len(missing_name))
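
To find rows that are missing any required field in one step, and to catch blank strings that `.isna()` does not treat as missing, something like the following can work. A sketch, assuming that empty or whitespace-only text should count as missing:

```python
import numpy as np
import pandas as pd

contacts = pd.DataFrame({
    'Email': ['a@mail.com', None, 'b@mail.com', '  '],
    'Name': ['John', 'Sarah', None, 'Mike']
})
required = ['Email', 'Name']

# Treat empty or whitespace-only text as missing (an assumption)
cleaned = contacts[required].replace(r'^\s*$', np.nan, regex=True)

# True for rows where at least one required field is missing
incomplete = cleaned.isna().any(axis=1)
print(contacts[incomplete])
```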

Summary Statistics for Quick Check

code.py
df = pd.DataFrame({
    'Age': [25, 30, 28, -5, 150],
    'Salary': [50000, 60000, 55000, -1000, 70000]
})

# Quick check - look for suspicious min/max
print(df.describe())

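Look at the min and max rows to spot problems. Here, a minimum Age of -5, a maximum Age of 150, and a minimum Salary of -1000 are all impossible values.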
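
The same min/max comparison can be automated. A sketch, assuming a hypothetical `limits` dictionary that holds the allowed range for each column (the salary upper limit is an arbitrary example value):

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 28, -5, 150],
    'Salary': [50000, 60000, 55000, -1000, 70000]
})

# Hypothetical allowed ranges: column -> (lowest, highest)
limits = {'Age': (0, 120), 'Salary': (0, 1_000_000)}

problems = []
for column, (low, high) in limits.items():
    col_min, col_max = data[column].min(), data[column].max()
    if col_min < low:
        problems.append(f"{column}: min {col_min} is below {low}")
    if col_max > high:
        problems.append(f"{column}: max {col_max} is above {high}")

# Print one line per suspicious column limit
for problem in problems:
    print(problem)
```

This only tells you that a column contains a bad value. To find the actual rows, use the range filters shown earlier.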

Key Points

  • Always validate your data before analysis
  • Look for negative values, out-of-range values, and wrong formats
  • .isin() checks whether each value is in a list of valid values
  • .clip() forces values into a given range
  • describe() gives a quick summary, including min and max
  • Create validation flags to track issues

Validation Checklist

Check              Code Example
Negative numbers   df[df['col'] < 0]
Out of range       df[(df['col'] < min) | (df['col'] > max)]
Invalid category   df[~df['col'].isin(valid_list)]
Missing required   df[df['col'].isna()]
Future dates       df[df['date'] > today]
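
The future dates check uses a `today` value that is not defined elsewhere in this lesson. A sketch of one way to build it, assuming a hypothetical `Order_Date` column:

```python
import pandas as pd

orders = pd.DataFrame({
    'Order_Date': pd.to_datetime(['2020-05-01', '2023-11-20', '2100-01-01'])
})

# Today's date at midnight, as a pandas Timestamp
today = pd.Timestamp.today().normalize()

# Dates later than today are suspicious for events that already happened
future = orders[orders['Order_Date'] > today]
print(future)
```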

What's Next?

Congratulations! You've completed the Data Cleaning module. Next, learn Exploratory Data Analysis (EDA).