
Univariate Analysis - Categorical

Learn to analyze text/category columns

What is Categorical Data?

Categorical data holds groups or labels rather than measured numbers:

  • City: NYC, LA, Chicago
  • Gender: Male, Female
  • Status: Active, Inactive
  • Rating: Good, Average, Bad

Count Each Category

code.py
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

# How many in each city?
print(df['City'].value_counts())

Output:

NYC        4
LA         2
Chicago    1

Show Percentages

code.py
# Percentage in each city (normalize=True gives fractions, so multiply by 100)
print((df['City'].value_counts(normalize=True) * 100).round(2))

Output:

NYC        57.14
LA         28.57
Chicago    14.29
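
Counts and percentages are often easier to read side by side. The following is a minimal sketch of one way to combine them, using the same City data as above; the column names `count` and `percent` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

counts = df['City'].value_counts()
percent = df['City'].value_counts(normalize=True) * 100

# Both Series are indexed by city name, so pandas lines them up row by row
summary = pd.DataFrame({'count': counts, 'percent': percent.round(2)})
print(summary)
```

Each row of `summary` is one city, with its count and its share of all rows.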

Count Unique Values

code.py
# How many different cities?
print("Unique cities:", df['City'].nunique())

# What are they?
print("Cities:", df['City'].unique())
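
One detail worth knowing: nunique() skips missing values by default, while unique() keeps them. A small sketch with a made-up Series that contains one missing value:

```python
import pandas as pd

# Illustrative data with one missing entry
cities = pd.Series(['NYC', 'LA', None, 'NYC'])

n_default = cities.nunique()               # missing value is not counted
n_with_na = cities.nunique(dropna=False)   # missing value counts as one entry
values = cities.unique()                   # array that includes the missing value

print("nunique():", n_default)
print("nunique(dropna=False):", n_with_na)
print("unique():", values)
```

If the two numbers differ, the column contains missing values.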

Find Most Common

code.py
# Most common value (if several values tie, mode() returns all of them; [0] takes the first)
print("Most common:", df['City'].mode()[0])

# Top 3 most common
print(df['City'].value_counts().head(3))
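
When two categories tie for first place, mode() returns both, so taking only the first element hides the tie. A short sketch with made-up data where NYC and LA each appear twice:

```python
import pandas as pd

# Illustrative data: NYC and LA are tied for most common
cities = pd.Series(['NYC', 'LA', 'NYC', 'LA', 'Chicago'])

modes = cities.mode()            # one entry per tied value
print("All modes:", modes.tolist())

# The same check using value_counts(): keep every category at the top count
counts = cities.value_counts()
tied = counts[counts == counts.max()]
print(tied)
```

Checking the length of `modes` is a quick way to see whether "the most common value" is actually unique.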

Find Rare Values

code.py
# Values that appear only once
counts = df['City'].value_counts()
rare = counts[counts == 1]
print("Rare values:", rare)
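
On larger datasets, "appears only once" is usually too strict a rule. A sketch of an alternative that flags categories below a share of the rows; the 20% cutoff is an arbitrary value picked for this example, not a recommended standard.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', 'LA', 'NYC']
})

# Fraction of rows in each category (values between 0 and 1)
share = df['City'].value_counts(normalize=True)

threshold = 0.20  # assumed cutoff for this example
rare_cities = share[share < threshold].index.tolist()
print("Rare cities:", rare_cities)
```

Here only Chicago falls under the cutoff, at about 14% of the rows.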

Check for Problems

code.py
df = pd.DataFrame({
    'Status': ['Active', 'active', 'ACTIVE', 'Inactive', None]
})

# See all values
print(df['Status'].value_counts(dropna=False))

Output:

Active      1
active      1
ACTIVE      1
Inactive    1
NaN         1

Problem: "Active" is written in three different cases, so pandas counts it as three separate categories. There is also one missing value.
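
Spotting case problems by eye gets hard when a column has many categories. One possible way to detect them automatically is to group the raw spellings by their lowercase form and count how many distinct spellings each group has. This is a sketch, using the same Status values as above:

```python
import pandas as pd

status = pd.Series(['Active', 'active', 'ACTIVE', 'Inactive', None], name='Status')

# Ignore missing values, then count distinct raw spellings per lowercase form
clean = status.dropna()
spellings = clean.groupby(clean.str.lower()).nunique()

# More than one spelling means the category is written inconsistently
inconsistent = spellings[spellings > 1]
print(inconsistent)
```

Any category listed in `inconsistent` has more than one spelling in the raw data.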

Fix Case Issues

code.py
# Make all lowercase
df['Status'] = df['Status'].str.lower()
print(df['Status'].value_counts())

Output:

active      3
inactive    1
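
Stray spaces cause the same kind of split as mixed case. A sketch that trims whitespace and lowercases in one step, using made-up values with leading and trailing spaces; note that the missing value stays missing.

```python
import pandas as pd

# Illustrative data with extra spaces and mixed case
status = pd.Series([' Active', 'active ', 'ACTIVE', 'Inactive', None])

# Strip surrounding whitespace first, then lowercase
cleaned = status.str.strip().str.lower()
print(cleaned.value_counts(dropna=False))
```

After cleaning, the three spellings collapse into a single `active` category.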

Quick Analysis Template

code.py
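def analyze_categorical(series):
    """Print a quick summary of one categorical column."""
    print(f"Column: {series.name}")
    print(f"Non-missing: {series.count()}")  # count() excludes NaN
    print(f"Missing: {series.isna().sum()}")
    print(f"Unique: {series.nunique()}")  # also excludes NaN
    modes = series.mode()  # empty if every value is missing
    print(f"Most common: {modes.iloc[0] if not modes.empty else 'n/a'}")
    print("\nValue counts:")
    print(series.value_counts(dropna=False))  # include NaN in the table

analyze_categorical(df['Status'])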
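
To check several columns at once, the same summary can be collected into a table instead of printed. This is a rough sketch: the function name `summarize_categorical` and the sample data are made up for illustration.

```python
import pandas as pd

def summarize_categorical(frame, columns):
    """Return one summary row per categorical column (hypothetical helper)."""
    rows = []
    for col in columns:
        s = frame[col]
        modes = s.mode()  # empty if the column has no non-missing values
        rows.append({
            'column': col,
            'non_missing': int(s.count()),
            'missing': int(s.isna().sum()),
            'unique': int(s.nunique()),
            'most_common': modes.iloc[0] if not modes.empty else None,
        })
    return pd.DataFrame(rows).set_index('column')

# Illustrative data
data = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', None],
    'Status': ['active', 'active', 'inactive', 'active'],
})

summary = summarize_categorical(data, ['City', 'Status'])
print(summary)
```

Returning a DataFrame makes it easy to sort the columns by missing values or by number of categories.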

Key Points

  • value_counts() counts each category
  • normalize=True returns fractions; multiply by 100 for percentages
  • nunique() counts unique values (missing values excluded)
  • mode() finds the most common value
  • Check for case issues (Active vs active)
  • Check for missing values with value_counts(dropna=False)

What's Next?

Learn bivariate analysis - comparing two columns together.