#1 Data Analytics Program in India
₹2,499₹1,499Enroll Now
12 min read

Data Cleaning: Time and Text

Working with dates, strings, and categorical data

What You'll Learn

  • Working with datetime data
  • String manipulation techniques
  • Handling categorical data
  • Text cleaning methods
  • Data type conversions

Working with Dates and Times

Converting to datetime:

code.py
import pandas as pd

# Convert string to datetime
df['date'] = pd.to_datetime(df['date'])

# Specify format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Handle errors
df['date'] = pd.to_datetime(df['date'], errors='coerce')

Extracting date components:

code.py
# Extract year, month, day
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter

# Extract time components
df['hour'] = df['datetime'].dt.hour
df['minute'] = df['datetime'].dt.minute

Date arithmetic:

code.py
from datetime import timedelta

# Add days
df['future_date'] = df['date'] + timedelta(days=30)

# Date difference
df['days_diff'] = (df['end_date'] - df['start_date']).dt.days

# Filter by date range
mask = (df['date'] >= '2024-01-01') & (df['date'] <= '2024-12-31')
df_filtered = df[mask]

String Operations

Basic string methods:

code.py
# Convert case
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.upper()
df['name'] = df['name'].str.title()

# Strip whitespace
df['text'] = df['text'].str.strip()
df['text'] = df['text'].str.lstrip()
df['text'] = df['text'].str.rstrip()

# Replace
df['text'] = df['text'].str.replace('old', 'new')

# Length
df['name_length'] = df['name'].str.len()

String contains and search:

code.py
# Check if contains
mask = df['email'].str.contains('@gmail.com')
gmail_users = df[mask]

# Case-insensitive
mask = df['text'].str.contains('python', case=False)

# Regex patterns
mask = df['email'].str.contains(r'\w+@\w+\.com')

# Starts with / ends with
df[df['name'].str.startswith('John')]
df[df['email'].str.endswith('.com')]

String splitting:

code.py
# Split string
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Split and get first element
df['first_name'] = df['full_name'].str.split(' ').str[0]

# Split by multiple delimiters
df['parts'] = df['text'].str.split('[,;]', regex=True)

Categorical Data

Creating categories:

code.py
# Convert to category
df['category'] = df['category'].astype('category')

# Create from scratch
df['size'] = pd.Categorical(
    df['size'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True
)

Category operations:

code.py
# Get categories
categories = df['category'].cat.categories

# Add categories
df['category'] = df['category'].cat.add_categories(['New Category'])

# Remove categories
df['category'] = df['category'].cat.remove_categories(['Old Category'])

# Rename categories
df['category'] = df['category'].cat.rename_categories({
    'A': 'Category A',
    'B': 'Category B'
})

Binning continuous data:

code.py
# Create bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Senior']
)

# Equal-width bins
df['score_bin'] = pd.cut(df['score'], bins=5)

# Equal-frequency bins
df['quantile'] = pd.qcut(df['value'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

Text Cleaning

Removing special characters:

code.py
import re

# Remove punctuation
df['clean_text'] = df['text'].str.replace('[^\w\s]', '', regex=True)

# Remove numbers
df['clean_text'] = df['text'].str.replace('\d+', '', regex=True)

# Remove extra spaces
df['clean_text'] = df['text'].str.replace('\s+', ' ', regex=True).str.strip()

Email and URL extraction:

code.py
# Extract email
df['email'] = df['text'].str.extract(r'([\w\.-]+@[\w\.-]+\.\w+)')

# Extract URLs
df['url'] = df['text'].str.extract(r'(https?://\S+)')

Data Type Conversions

Converting types:

code.py
# To numeric
df['value'] = pd.to_numeric(df['value'], errors='coerce')

# To string
df['id'] = df['id'].astype(str)

# To integer (requires no NaN)
df['count'] = df['count'].astype(int)

# To float
df['price'] = df['price'].astype(float)

Handling currency:

code.py
# Remove currency symbols and convert
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')
df['price'] = pd.to_numeric(df['price'])

Practice Exercise

code.py
import pandas as pd

# Sample data
data = {
    'date': ['2024-01-15', '2024-02-20', '2024-03-10'],
    'name': ['  JOHN DOE  ', 'jane smith', 'Bob WILSON'],
    'email': ['john@gmail.com', 'jane@yahoo.com', 'bob@outlook.com'],
    'price': ['$1,234.56', '$567.89', '$8,901.23']
}

df = pd.DataFrame(data)

# Clean data
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['name'] = df['name'].str.strip().str.title()
df['domain'] = df['email'].str.split('@').str[1]
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')
df['price'] = pd.to_numeric(df['price'])

print(df)

Next Steps

Learn how to handle missing data!

Practice & Experiment

Test your understanding by running Python code directly in your browser. Try the examples from the article above!

SkillsetMaster - AI, Web Development & Data Analytics Courses