4 min read
Univariate Analysis - Categorical
Learn to analyze text/category columns
What is Categorical Data?
Categorical data places each row into a group or label instead of measuring it with a number:
- City: NYC, LA, Chicago
- Gender: Male, Female
- Status: Active, Inactive
- Rating: Good, Average, Bad
Count Each Category
code.py
import pandas as pd
df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})
# How many in each city?
print(df['City'].value_counts())
Output:
NYC 4
LA 2
Chicago 1
Show Percentages
code.py
# Percentage in each city
print((df['City'].value_counts(normalize=True) * 100).round(2))
Output:
NYC 57.14
LA 28.57
Chicago 14.29
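Counts and percentages are often easier to read side by side. The snippet below is a minimal sketch of one way to do that, reusing the same city data; the column names `count` and `percent` and the variable `summary` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

counts = df['City'].value_counts()
percent = (df['City'].value_counts(normalize=True) * 100).round(2)

# Both Series are indexed by city name, so they line up row by row
summary = pd.DataFrame({'count': counts, 'percent': percent})
print(summary)
```

Each row of `summary` is one city, with its count and its share of all rows.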
Count Unique Values
code.py
# How many different cities?
print("Unique cities:", df['City'].nunique())
# What are they?
print("Cities:", df['City'].unique())
Find Most Common
code.py
# Most common value
print("Most common:", df['City'].mode()[0])
# Top 3 most common
print(df['City'].value_counts().head(3))
Find Rare Values
code.py
# Values that appear only once
counts = df['City'].value_counts()
rare = counts[counts == 1]
print("Rare values:", rare)
Check for Problems
code.py
df = pd.DataFrame({
    'Status': ['Active', 'active', 'ACTIVE', 'Inactive', None]
})
# See all values
print(df['Status'].value_counts(dropna=False))
Output:
Active 1
active 1
ACTIVE 1
Inactive 1
NaN 1
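Problem: "Active", "active" and "ACTIVE" are the same value written in different cases, but pandas counts them as three separate categories!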
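With only five rows the problem is easy to spot by eye. On a bigger column it may help to let pandas find it. Here is a rough sketch of one possible check: it groups the original spellings by their lowercase form and keeps the groups with more than one spelling. The variable names `spellings` and `inconsistent` are made up for this example.

```python
import pandas as pd

status = pd.Series(['Active', 'active', 'ACTIVE', 'Inactive', None], name='Status')

# Group the original spellings by their lowercase form
# (groupby skips missing keys by default, so the None entry is left out)
spellings = status.groupby(status.str.lower()).unique()

# Keep only the groups that contain more than one distinct spelling
inconsistent = spellings[spellings.map(len) > 1]
print(inconsistent)
```

Only `active` is flagged here, because `Inactive` appears in a single spelling.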
Fix Case Issues
code.py
# Make all lowercase
df['Status'] = df['Status'].str.lower()
print(df['Status'].value_counts())
Output:
active 3
inactive 1
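Case is one way a label can split into several categories; stray spaces around the text can do the same thing. The sketch below is an assumption-heavy illustration: the helper `normalize_labels` and the sample values with extra spaces are invented for this example and are not part of the lesson's data.

```python
import pandas as pd

def normalize_labels(series):
    """Hypothetical helper: trim surrounding spaces, then lowercase."""
    return series.str.strip().str.lower()

# Illustrative data: two of the values carry a stray space
raw = pd.Series(['Active', ' active', 'ACTIVE ', 'Inactive', None], name='Status')
clean = normalize_labels(raw)

# dropna=False keeps the missing entry visible after cleaning
print(clean.value_counts(dropna=False))
```

The missing value is not changed by the cleaning step, so it still needs to be handled separately.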
Quick Analysis Template
code.py
def analyze_categorical(series):
    print(f"Column: {series.name}")
    # count() skips missing values, so this is the number of filled-in rows
    print(f"Total (non-missing): {series.count()}")
    print(f"Missing: {series.isna().sum()}")
    print(f"Unique: {series.nunique()}")
    print(f"Most common: {series.mode()[0]}")
    print("\nValue counts:")
    print(series.value_counts())

analyze_categorical(df['Status'])
Key Points
- value_counts() counts each category
- normalize=True returns proportions; multiply by 100 for percentages
- nunique() counts unique values
- mode() finds the most common value
- Check for case issues (Active vs active)
- Check for missing values with value_counts(dropna=False)
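The points above can also be collected into a single result instead of printed one by one. This is a small sketch, not a standard pandas function: `summarize_categorical` and its dictionary keys are names invented here, and the sample data is the already-lowercased Status column.

```python
import pandas as pd

def summarize_categorical(series):
    """Hypothetical helper: gather the key checks into one dict."""
    modes = series.mode()  # empty when every value is missing
    return {
        'non_missing': int(series.count()),
        'missing': int(series.isna().sum()),
        'unique': int(series.nunique()),
        'most_common': modes.iloc[0] if not modes.empty else None,
        'percent': (series.value_counts(normalize=True) * 100).round(2).to_dict(),
    }

status = pd.Series(['active', 'active', 'active', 'inactive', None], name='Status')
result = summarize_categorical(status)
print(result)
```

Returning a dict makes it easy to run the same checks on many columns and compare the results.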
What's Next?
Learn bivariate analysis - comparing two columns together.