Descriptive Statistics Review

What are Descriptive Statistics?

Numbers that describe your data:

What's the average?
How spread out is it?
What's typical?

Measures of Center

Mean (Average)

code.py

import numpy as np
import pandas as pd

data = [10, 20, 30, 40, 50]

# NumPy
mean = np.mean(data)
print(mean)  # 30.0

# Pandas
df = pd.DataFrame({'values': data})
print(df['values'].mean())  # 30.0

Median (Middle Value)

code.py

data = [10, 20, 30, 100, 200]

print(np.median(data))  # 30.0

The median is more robust when you have outliers: here the mean is 72.0, pulled up by 100 and 200, while the median stays at 30.0.
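To make the effect of an outlier concrete, the sketch below changes a single value and measures how far the mean and the median move. The `with_outlier` list is a made-up example (imagine a data-entry error), not data from this review.

```python
import numpy as np

base = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 5000]  # hypothetical data-entry error in the last value

# How much does each measure of center move when one value changes?
mean_shift = np.mean(with_outlier) - np.mean(base)
median_shift = np.median(with_outlier) - np.median(base)

print(mean_shift)    # 990.0
print(median_shift)  # 0.0
```

One bad value drags the mean by 990, while the median does not move at all.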

Mode (Most Common)

code.py

from scipy import stats

data = [1, 2, 2, 3, 3, 3, 4]
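# stats.mode returns a ModeResult(mode, count); read .mode to get the value
print(stats.mode(data).mode)  # 3 (appears most; older SciPy versions print [3])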
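A dataset can have more than one mode. As a small sketch, assuming pandas is available, `Series.mode()` returns every tied value in sorted order rather than picking one. The `scores` series is invented for illustration.

```python
import pandas as pd

scores = pd.Series([1, 1, 2, 2, 3])  # hypothetical data where 1 and 2 tie
modes = scores.mode()                # a Series holding every most-common value

print(modes.tolist())  # [1, 2]
```

If every value is unique, `mode()` returns all of them, so check how many values come back before reporting a single mode.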

Measures of Spread

Range

code.py

data = [10, 20, 30, 40, 50]

range_val = max(data) - min(data)
print(range_val)  # 40

Variance

The average squared distance of the values from the mean:

code.py

print(np.var(data))  # 200.0 (population variance: NumPy divides by n by default)
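The number above can be reproduced by hand, which also shows the difference between population variance (divide by n, NumPy's default) and sample variance (divide by n - 1, pandas' default). This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

data = [10, 20, 30, 40, 50]
values = np.array(data)

# Squared distance of each value from the mean: [400, 100, 0, 100, 400]
squared_dev = (values - values.mean()) ** 2

population_var = squared_dev.sum() / len(values)    # divide by n
sample_var = squared_dev.sum() / (len(values) - 1)  # divide by n - 1

print(population_var, np.var(data))      # 200.0 200.0
print(sample_var, np.var(data, ddof=1))  # 250.0 250.0
```

The `ddof` argument ("delta degrees of freedom") controls the divisor. This explains why NumPy and pandas can report different values for the same data.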

Standard Deviation

Square root of the variance, so it is in the same units as the data:

code.py

print(np.std(data))  # 14.14

Rule of thumb: for roughly bell-shaped (normal) data, about 68% of values fall within 1 standard deviation of the mean and about 95% within 2.
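The rule can be checked empirically. The sketch below draws simulated normal samples (the seed, mean of 50, and scale of 10 are arbitrary choices, not values from this review) and counts how many land within 1 and 2 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=50, scale=10, size=100_000)  # simulated, not real data

mean, std = samples.mean(), samples.std()

# Fraction of samples whose distance from the mean is at most k standard deviations
within_1 = np.mean(np.abs(samples - mean) <= 1 * std)
within_2 = np.mean(np.abs(samples - mean) <= 2 * std)

print(f"Within 1 std: {within_1:.1%}")
print(f"Within 2 std: {within_2:.1%}")
```

For normal data the results should land close to 68% and 95%. Heavily skewed data can deviate noticeably, so treat the rule as a guide rather than a guarantee.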

Quartiles and IQR

code.py

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Q1 = np.percentile(data, 25)  # 3.25
Q2 = np.percentile(data, 50)  # 5.5 (median)
Q3 = np.percentile(data, 75)  # 7.75

IQR = Q3 - Q1  # 4.5
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")

IQR = Interquartile Range: the width of the middle 50% of the data (Q3 - Q1)
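Q1 comes out as 3.25 rather than a value from the list because `np.percentile` interpolates linearly between neighbouring values by default. The sketch below reproduces that arithmetic by hand; the helper variable names are illustrative.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sorted_data = sorted(data)

# The 25th percentile sits at a fractional index into the sorted data
position = 0.25 * (len(sorted_data) - 1)  # 2.25
lower = int(position)                     # index 2 -> value 3
fraction = position - lower               # 0.25 of the way to the next value

q1_manual = sorted_data[lower] + fraction * (sorted_data[lower + 1] - sorted_data[lower])

print(q1_manual)                # 3.25
print(np.percentile(data, 25))  # 3.25
```

Other tools may use a different interpolation rule, so quartiles for small datasets can differ slightly between libraries.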

Pandas describe()

Get all stats at once:

code.py

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

print(df.describe())

Output:

Stat	Age	Salary
count	5.00	5.00
mean	35.00	70000.00
std	7.91	15811.39
min	25.00	50000.00
25%	30.00	60000.00
50%	35.00	70000.00
75%	40.00	80000.00
max	45.00	90000.00
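Because `describe()` returns a regular DataFrame, individual statistics can be pulled out by row and column label. A short sketch using the same data; `summary`, `mean_age`, and `salary_iqr` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

summary = df.describe()

# Rows are statistics, columns are the original column names
mean_age = summary.loc['mean', 'Age']
salary_iqr = summary.loc['75%', 'Salary'] - summary.loc['25%', 'Salary']

print(mean_age)    # 35.0
print(salary_iqr)  # 20000.0
```

Note that the `std` row uses the sample standard deviation (divides by n - 1), so it will not match `np.std` with its default settings.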

Skewness

Is the data's tail longer on the left or on the right?

code.py

from scipy.stats import skew

data = [1, 2, 2, 3, 3, 3, 10]  # Has outlier

print(skew(data))  # Positive = right skew

Positive skew: Tail goes right (outliers are high)
Negative skew: Tail goes left (outliers are low)
Zero: Symmetric
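Under the hood, skewness compares the third central moment with the variance. The sketch below computes it by hand and compares it with `scipy.stats.skew` under its default settings (no bias correction); `m2` and `m3` are illustrative names.

```python
import numpy as np
from scipy.stats import skew

data = np.array([1, 2, 2, 3, 3, 3, 10])

deviations = data - data.mean()
m2 = np.mean(deviations ** 2)  # second central moment (population variance)
m3 = np.mean(deviations ** 3)  # third central moment; cubing keeps the sign

manual_skew = m3 / m2 ** 1.5

print(np.isclose(manual_skew, skew(data)))  # True
print(skew([1, 2, 3, 4, 5]))                # 0.0
```

Cubing preserves the sign of each deviation, so one large positive outlier (the 10) outweighs many small negative deviations and the result is positive.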

Kurtosis

How heavy are the tails of the data?

code.py

from scipy.stats import kurtosis

# Reuses data from the skewness example; SciPy reports excess kurtosis (normal = 0)
print(kurtosis(data))

High kurtosis: Heavy tails, more extreme outliers (often with a sharp peak)
Low kurtosis: Light tails, fewer extreme outliers (often with a flat peak)
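Two conventions for kurtosis are in common use, and mixing them up is an easy mistake. As a sketch, `scipy.stats.kurtosis` returns excess (Fisher) kurtosis by default, where a normal distribution scores 0; passing `fisher=False` gives the Pearson version, where a normal distribution scores 3.

```python
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 3, 3, 10]

excess = kurtosis(data)                 # Fisher definition (default): normal = 0
pearson = kurtosis(data, fisher=False)  # Pearson definition: normal = 3

print(excess, pearson)  # the two values differ by exactly 3
```

When comparing kurtosis values across tools, check which definition each one reports.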

Complete Example

code.py

import pandas as pd

# Sample data
df = pd.DataFrame({
    'Sales': [100, 150, 120, 180, 200, 90, 160, 140, 170, 130]
})

# All descriptive stats
print("=== Descriptive Statistics ===")
print(f"Mean: {df['Sales'].mean():.2f}")
print(f"Median: {df['Sales'].median():.2f}")
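# Every value here is unique, so mode() returns all of them; [0] is just the smallest
print(f"Mode: {df['Sales'].mode()[0]}")
print(f"Std Dev: {df['Sales'].std():.2f}")  # Sample std (pandas divides by n - 1)
print(f"Variance: {df['Sales'].var():.2f}")  # Sample variance (pandas divides by n - 1)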
print(f"Min: {df['Sales'].min()}")
print(f"Max: {df['Sales'].max()}")
print(f"Range: {df['Sales'].max() - df['Sales'].min()}")
print(f"Q1: {df['Sales'].quantile(0.25):.2f}")
print(f"Q3: {df['Sales'].quantile(0.75):.2f}")
print(f"IQR: {df['Sales'].quantile(0.75) - df['Sales'].quantile(0.25):.2f}")
print(f"Skewness: {df['Sales'].skew():.2f}")

When to Use What?

Statistic	Use When
Mean	Data is symmetric, no outliers
Median	Data has outliers or is skewed
Std Dev	Measuring spread around the mean (symmetric data)
IQR	Measuring spread around the median (outliers or skewed data)
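The pairing in the table can be seen in a small experiment: one outlier inflates the standard deviation but leaves the IQR untouched. The `iqr` helper and both lists are illustrative, not part of any library.

```python
import numpy as np

def iqr(values):
    """Interquartile range: Q3 - Q1 (hypothetical helper for this sketch)."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]  # made-up outlier in the last value

print(np.std(clean), iqr(clean))                # std is about 14.14, IQR is 20.0
print(np.std(with_outlier), iqr(with_outlier))  # std is about 190.26, IQR is 20.0
```

The standard deviation grows more than tenfold, while the IQR stays at 20.0 because the middle 50% of the data did not change.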

Key Points

Mean = average (affected by outliers)
Median = middle value (robust to outliers)
Std Dev = typical distance from mean
IQR = spread of middle 50%
Use describe() for quick summary
Check skewness for data shape

What's Next?

Learn the basics of probability.