Descriptive Statistics

What are Descriptive Statistics?

Numbers that describe your data in simple terms:

Average salary is $50,000
Ages range from 20 to 65
Most people live in NYC

The Big Three: Mean, Median, Mode

Mean (Average)

Add all values, divide by count.

code.py

import pandas as pd

df = pd.DataFrame({'Salary': [40000, 50000, 60000, 50000, 100000]})

print(df['Salary'].mean())  # 60000.0

Problem: Mean is affected by extreme values. One person earning $100,000 pulls the average up to $60,000, even though three of the five people earn $50,000 or less.
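To see the pull of the extreme value, here is a small sketch that computes the mean by hand and then again without the top salary. The variable names are made up for illustration; the numbers are the same five salaries used above.

```python
import pandas as pd

# Same salaries as above; 100000 is the extreme value.
salaries = pd.Series([40000, 50000, 60000, 50000, 100000])

# Mean by hand: sum of the values divided by how many there are.
manual_mean = salaries.sum() / salaries.count()

# Mean after dropping the single largest salary.
mean_without_top = salaries.drop(salaries.idxmax()).mean()

print(manual_mean)       # 60000.0
print(mean_without_top)  # 50000.0
```

Removing one value out of five moves the mean by $10,000.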

Median (Middle Value)

The middle number when the values are sorted. With an even number of values, it is the average of the two middle ones.

code.py

print(df['Salary'].median())  # 50000.0

Better for: Data with extreme values (salaries, house prices).
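A quick way to check this is to make the extreme value even more extreme and compare both statistics. This is only a sketch; the 1,000,000 salary is a hypothetical value chosen for the demonstration.

```python
import pandas as pd

salaries = pd.Series([40000, 50000, 60000, 50000, 100000])

# Swap the top salary for a far more extreme one (hypothetical value).
extreme = salaries.replace(100000, 1000000)

# The median stays put because the middle of the sorted list is unchanged.
print(salaries.median(), extreme.median())  # 50000.0 50000.0

# The mean jumps because every value feeds into the sum.
print(salaries.mean(), extreme.mean())      # 60000.0 240000.0
```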

Mode (Most Common)

The value that appears most often.

code.py

# mode() returns a Series because several values can tie; [0] takes the first
print(df['Salary'].mode()[0])  # 50000

Best for: Categories (most popular product, common city).
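The same method works on text columns. The sketch below uses made-up city lists to show the most common category and what happens when two values tie.

```python
import pandas as pd

cities = pd.Series(['NYC', 'LA', 'NYC', 'Chicago', 'NYC'])

# Most common city; [0] picks the first (here the only) mode.
top_city = cities.mode()[0]
print(top_city)  # NYC

# When two values tie, mode() returns both of them, sorted.
tied = pd.Series(['NYC', 'LA', 'NYC', 'LA', 'Chicago'])
tied_modes = tied.mode().tolist()
print(tied_modes)  # ['LA', 'NYC']
```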

Spread: How Different are Values?

Range

Difference between max and min.

code.py

range_val = df['Salary'].max() - df['Salary'].min()
print(range_val)  # 60000

Standard Deviation

How spread out the values are from the mean.

code.py

print(df['Salary'].std())  # about 23452.08

Low std = values are close together
High std = values are spread out
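For readers who want to see where the number comes from, here is a sketch of the calculation done step by step on the same five salaries. Note that pandas divides by n - 1 by default (the sample standard deviation); the variable names are illustrative.

```python
import pandas as pd

salaries = pd.Series([40000, 50000, 60000, 50000, 100000])

# Squared distance of each value from the mean.
squared_diffs = (salaries - salaries.mean()) ** 2

# Sample standard deviation: divide by n - 1, then take the square root.
# This matches the pandas default (ddof=1).
sample_std = (squared_diffs.sum() / (len(salaries) - 1)) ** 0.5

# Dividing by n instead gives the population standard deviation (ddof=0).
population_std = salaries.std(ddof=0)

print(round(sample_std, 2))      # 23452.08
print(round(population_std, 2))  # 20976.18
```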

Quick Summary with describe()

code.py

df = pd.DataFrame({
    'Age': [25, 30, 28, 35, 22, 45, 33],
    'Salary': [50000, 60000, 55000, 70000, 45000, 80000, 65000]
})

print(df.describe())

Output:

             Age        Salary
count   7.000000      7.000000
mean   31.142857  60714.285714
std     7.559289  12051.476890
min    22.000000  45000.000000
25%    26.500000  52500.000000
50%    30.000000  60000.000000
75%    34.000000  67500.000000
max    45.000000  80000.000000

What Each Stat Means

Stat	Meaning
count	How many values
mean	Average
std	Spread (standard deviation)
min	Smallest value
25%	Lower quarter (25th percentile)
50%	Middle (median)
75%	Upper quarter (75th percentile)
max	Largest value
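The result of describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch using the same Age and Salary data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 28, 35, 22, 45, 33],
    'Salary': [50000, 60000, 55000, 70000, 45000, 80000, 65000]
})

summary = df.describe()

# Rows are the statistic names, columns are the original column names.
median_salary = summary.loc['50%', 'Salary']
max_age = summary.loc['max', 'Age']

print(median_salary)  # 60000.0
print(max_age)        # 45.0
```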

Percentiles Explained

25% of people earn below 52,500 (25th percentile)
50% of people earn below 60,000 (median)
75% of people earn below 67,500 (75th percentile)
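Percentiles can also be requested directly with quantile(). The sketch below reuses the seven salaries from the describe() example. With so few values, the share of people below a percentile is only approximately the stated percentage.

```python
import pandas as pd

salary = pd.Series([50000, 60000, 55000, 70000, 45000, 80000, 65000])

# quantile() takes fractions between 0 and 1 rather than percentages.
quartiles = salary.quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())  # [52500.0, 60000.0, 67500.0]

# Share of salaries strictly below the 25th percentile: 2 of 7 values.
share_below = (salary < salary.quantile(0.25)).mean()
print(round(share_below, 2))  # 0.29
```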

For Text/Categories

code.py

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC']
})

# Count each value
print(df['City'].value_counts())

Output:

NYC        3
LA         1
Chicago    1
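If proportions are more useful than raw counts, value_counts() accepts normalize=True. A short sketch on the same city data, with nunique() added to count the distinct categories:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'NYC']
})

# Fraction of rows per city instead of a raw count.
shares = df['City'].value_counts(normalize=True)
print(shares['NYC'])  # 0.6

# Number of distinct cities.
distinct_cities = df['City'].nunique()
print(distinct_cities)  # 3
```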

Key Points

Mean = average (affected by extremes)
Median = middle value (better for skewed data)
Mode = most common value
Std = how spread out values are
describe() gives all stats at once
value_counts() for categories

What's Next?

Learn to analyze one column at a time (univariate analysis).