Univariate Analysis - Categorical

What is Categorical Data?

Categorical data sorts values into groups or labels instead of measuring them:

City: NYC, LA, Chicago
Gender: Male, Female
Status: Active, Inactive
Rating: Good, Average, Bad

Count Each Category

import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

# How many in each city?
print(df['City'].value_counts())

Output:

NYC        4
LA         2
Chicago    1

Show Percentages

# Percentage in each city
print((df['City'].value_counts(normalize=True) * 100).round(2))

Output:

NYC        57.14
LA         28.57
Chicago    14.29
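
It can be handy to see counts and percentages side by side. The sketch below is one possible way to do that; the names `counts`, `percent`, and `summary` are illustrative and not part of the original lesson.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

# Raw counts and rounded percentages for the same column
counts = df['City'].value_counts()
percent = (df['City'].value_counts(normalize=True) * 100).round(2)

# One row per city, one column per measure (column names are our own choice)
summary = pd.DataFrame({'count': counts, 'percent': percent})
print(summary)
```

Both Series share the city labels as their index, so pandas lines the rows up by city.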

Count Unique Values

# How many different cities?
print("Unique cities:", df['City'].nunique())

# What are they?
print("Cities:", df['City'].unique())
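
One detail worth knowing: by default nunique() ignores missing values, while unique() includes them. A small sketch, using a made-up `cities` Series with one missing entry:

```python
import pandas as pd

# Hypothetical column with one missing value
cities = pd.Series(['NYC', 'LA', None, 'NYC'], name='City')

# Missing values are skipped by default
print("Unique (ignoring missing):", cities.nunique())

# Count the missing value as its own category
print("Unique (including missing):", cities.nunique(dropna=False))

# unique() returns the missing value alongside the real ones
print("Values:", cities.unique())
```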

Find Most Common

# Most common value
print("Most common:", df['City'].mode()[0])

# Top 3 most common
print(df['City'].value_counts().head(3))
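
Note that mode() returns a Series, because two or more values can tie for most common. Taking `[0]` keeps only the first of them. A minimal sketch with a made-up `tied` Series:

```python
import pandas as pd

# Hypothetical data where NYC and LA both appear twice
tied = pd.Series(['NYC', 'LA', 'NYC', 'LA', 'Chicago'], name='City')

# mode() returns every tied value, sorted
modes = tied.mode()
print("All modes:", modes.tolist())

# [0] silently picks just the first one
print("First mode:", modes[0])
```

If ties matter for your analysis, check the length of the result before taking the first element.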

Find Rare Values

# Values that appear only once
counts = df['City'].value_counts()
rare = counts[counts == 1]
print("Rare values:", rare.index.tolist())
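
On larger datasets, "appears only once" is often too strict. A share-based cutoff is one alternative. The sketch below assumes a 20% threshold, chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

# Assumed cutoff: anything under 20% of rows counts as rare
threshold = 0.20

# Share of rows per city, as fractions between 0 and 1
shares = df['City'].value_counts(normalize=True)
rare_by_share = shares[shares < threshold]
print("Rare values:", rare_by_share.index.tolist())
```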

Check for Problems

df = pd.DataFrame({
    'Status': ['Active', 'active', 'ACTIVE', 'Inactive', None]
})

# See all values
print(df['Status'].value_counts(dropna=False))

Output:

Active      1
active      1
ACTIVE      1
Inactive    1
NaN         1

Problem: the same value is written in three different cases, so pandas counts it as three separate categories.

Fix Case Issues

# Make all lowercase
df['Status'] = df['Status'].str.lower()
print(df['Status'].value_counts())

Output:

active      3
inactive    1
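
Stray spaces cause the same kind of split as case differences. As a sketch, assuming the column might also contain leading or trailing whitespace (the `status` data below is made up), strip and lowercase in one step:

```python
import pandas as pd

# Hypothetical column with mixed case and stray spaces
status = pd.Series(['Active', ' active', 'ACTIVE ', 'Inactive', None],
                   name='Status')

# Remove surrounding whitespace, then lowercase; missing values stay missing
cleaned = status.str.strip().str.lower()
print(cleaned.value_counts(dropna=False))
```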

Quick Analysis Template

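def analyze_categorical(series):
    """Print a quick summary of a categorical column."""
    modes = series.mode()  # empty when every value is missing
    print(f"Column: {series.name}")
    print(f"Rows: {len(series)}")
    print(f"Non-missing: {series.count()}")
    print(f"Missing: {series.isna().sum()}")
    print(f"Unique: {series.nunique()}")
    print(f"Most common: {modes.iloc[0] if not modes.empty else 'n/a'}")
    print("\nValue counts:")
    print(series.value_counts())

analyze_categorical(df['Status'])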

Key Points

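value_counts() counts each category
normalize=True returns proportions; multiply by 100 for percentages
nunique() counts unique values (missing values are ignored by default)
mode() finds the most common value, or values if there is a tie
Check for case issues (Active vs active)
Check for missing values with value_counts(dropna=False)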

What's Next?

Learn bivariate analysis - comparing two columns together.