12 min read
Data Cleaning: Time and Text
Working with dates, strings, and categorical data
What You'll Learn
- Working with datetime data
- String manipulation techniques
- Handling categorical data
- Text cleaning methods
- Data type conversions
Working with Dates and Times
Converting to datetime:
code.py
import pandas as pd
# Convert string to datetime
df['date'] = pd.to_datetime(df['date'])
# Specify format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Handle errors
df['date'] = pd.to_datetime(df['date'], errors='coerce')Extracting date components:
code.py
# Extract year, month, day
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
# Extract time components
df['hour'] = df['datetime'].dt.hour
df['minute'] = df['datetime'].dt.minuteDate arithmetic:
code.py
from datetime import timedelta
# Add days
df['future_date'] = df['date'] + timedelta(days=30)
# Date difference
df['days_diff'] = (df['end_date'] - df['start_date']).dt.days
# Filter by date range
mask = (df['date'] >= '2024-01-01') & (df['date'] <= '2024-12-31')
df_filtered = df[mask]String Operations
Basic string methods:
code.py
# Convert case
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.upper()
df['name'] = df['name'].str.title()
# Strip whitespace
df['text'] = df['text'].str.strip()
df['text'] = df['text'].str.lstrip()
df['text'] = df['text'].str.rstrip()
# Replace
df['text'] = df['text'].str.replace('old', 'new')
# Length
df['name_length'] = df['name'].str.len()String contains and search:
code.py
# Check if contains
mask = df['email'].str.contains('@gmail.com')
gmail_users = df[mask]
# Case-insensitive
mask = df['text'].str.contains('python', case=False)
# Regex patterns
mask = df['email'].str.contains(r'\w+@\w+\.com')
# Starts with / ends with
df[df['name'].str.startswith('John')]
df[df['email'].str.endswith('.com')]String splitting:
code.py
# Split string
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)
# Split and get first element
df['first_name'] = df['full_name'].str.split(' ').str[0]
# Split by multiple delimiters
df['parts'] = df['text'].str.split('[,;]', regex=True)Categorical Data
Creating categories:
code.py
# Convert to category
df['category'] = df['category'].astype('category')
# Create from scratch
df['size'] = pd.Categorical(
df['size'],
categories=['Small', 'Medium', 'Large'],
ordered=True
)Category operations:
code.py
# Get categories
categories = df['category'].cat.categories
# Add categories
df['category'] = df['category'].cat.add_categories(['New Category'])
# Remove categories
df['category'] = df['category'].cat.remove_categories(['Old Category'])
# Rename categories
df['category'] = df['category'].cat.rename_categories({
'A': 'Category A',
'B': 'Category B'
})Binning continuous data:
code.py
# Create bins
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 50, 100],
labels=['Child', 'Young Adult', 'Adult', 'Senior']
)
# Equal-width bins
df['score_bin'] = pd.cut(df['score'], bins=5)
# Equal-frequency bins
df['quantile'] = pd.qcut(df['value'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])Text Cleaning
Removing special characters:
code.py
import re
# Remove punctuation
df['clean_text'] = df['text'].str.replace('[^\w\s]', '', regex=True)
# Remove numbers
df['clean_text'] = df['text'].str.replace('\d+', '', regex=True)
# Remove extra spaces
df['clean_text'] = df['text'].str.replace('\s+', ' ', regex=True).str.strip()Email and URL extraction:
code.py
# Extract email
df['email'] = df['text'].str.extract(r'([\w\.-]+@[\w\.-]+\.\w+)')
# Extract URLs
df['url'] = df['text'].str.extract(r'(https?://\S+)')Data Type Conversions
Converting types:
code.py
# To numeric
df['value'] = pd.to_numeric(df['value'], errors='coerce')
# To string
df['id'] = df['id'].astype(str)
# To integer (requires no NaN)
df['count'] = df['count'].astype(int)
# To float
df['price'] = df['price'].astype(float)Handling currency:
code.py
# Remove currency symbols and convert
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')
df['price'] = pd.to_numeric(df['price'])Practice Exercise
code.py
import pandas as pd
# Sample data
data = {
'date': ['2024-01-15', '2024-02-20', '2024-03-10'],
'name': [' JOHN DOE ', 'jane smith', 'Bob WILSON'],
'email': ['john@gmail.com', 'jane@yahoo.com', 'bob@outlook.com'],
'price': ['$1,234.56', '$567.89', '$8,901.23']
}
df = pd.DataFrame(data)
# Clean data
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['name'] = df['name'].str.strip().str.title()
df['domain'] = df['email'].str.split('@').str[1]
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')
df['price'] = pd.to_numeric(df['price'])
print(df)Next Steps
Learn how to handle missing data!
Practice & Experiment
Test your understanding by running Python code directly in your browser. Try the examples from the article above!