Descriptive Statistics Review
Review key statistical concepts for data analysis
What are Descriptive Statistics?
Numbers that describe your data:
- What's the average?
- How spread out is it?
- What's typical?
Measures of Center
Mean (Average)
code.py
import numpy as np
import pandas as pd
data = [10, 20, 30, 40, 50]
# NumPy
mean = np.mean(data)
print(mean) # 30.0
# Pandas
df = pd.DataFrame({'values': data})
print(df['values'].mean()) # 30.0
Median (Middle Value)
code.py
data = [10, 20, 30, 100, 200]
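print(np.median(data)) # 30.0
The median is a better measure of center when you have outliers: here the mean is 72.0, pulled up by 100 and 200, while the median stays at 30.0.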
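To see how differently the two measures react, here is a small sketch that adds one extreme value to an otherwise tidy list. The numbers are made up for illustration (think salaries in thousands).

```python
import numpy as np

# Hypothetical values; the last one in with_outlier is a planted outlier
typical = [40, 45, 50, 55, 60]
with_outlier = typical + [1000]

mean_before, median_before = np.mean(typical), np.median(typical)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after:.1f}, median={median_after}")
```

One extreme value moves the mean from 50 to about 208, while the median only shifts from 50 to 52.5.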
Mode (Most Common)
code.py
from scipy import stats
data = [1, 2, 2, 3, 3, 3, 4]
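# stats.mode returns a ModeResult (mode, count); .mode holds the value itself
print(stats.mode(data).mode) # 3 in recent SciPy (older versions print [3])
Measures of Spread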
Range
code.py
data = [10, 20, 30, 40, 50]
range_val = max(data) - min(data)
print(range_val) # 40
Variance
The average squared distance of the values from the mean:
code.py
print(np.var(data)) # 200.0 (population variance: NumPy divides by n by default)
Standard Deviation
Square root of variance (same units as data):
code.py
print(np.std(data)) # 14.14 (rounded)
Rule: For roughly bell-shaped (normal) data, about 95% of values fall within 2 standard deviations of the mean.
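It can help to work out variance and standard deviation by hand once. The sketch below follows the definitions step by step on the same five values and compares the result with NumPy. The variable names are just for illustration.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

mean = data.mean()                    # center of the data
squared_dev = (data - mean) ** 2      # squared distance of each value from the mean
variance = squared_dev.mean()         # population variance (divide by n)
std = np.sqrt(variance)               # back to the same units as the data

# Share of values that lie within 2 standard deviations of the mean
within_2sd = np.mean(np.abs(data - mean) <= 2 * std)

print(variance, np.var(data))                  # both use the divide-by-n definition
print(round(std, 2), round(np.std(data), 2))
print(within_2sd)
```

For this tiny dataset every value sits within 2 standard deviations of the mean; with real, larger datasets expect a share close to, but usually below, 100%.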
Quartiles and IQR
code.py
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q1 = np.percentile(data, 25) # 3.25
Q2 = np.percentile(data, 50) # 5.5 (median)
Q3 = np.percentile(data, 75) # 7.75
IQR = Q3 - Q1 # 4.5
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
IQR = Interquartile Range (the spread of the middle 50% of the data)
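A common use of the IQR is flagging possible outliers. The sketch below uses the widely used 1.5 × IQR convention, which is a rule of thumb rather than a strict definition. The data is hypothetical, with 100 planted as an obvious outlier.

```python
import numpy as np

# Hypothetical data: 100 is a planted outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

Flagged values are worth a second look; whether to drop them depends on what the data represents.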
Pandas describe()
Get all stats at once:
code.py
df = pd.DataFrame({
'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 70000, 80000, 90000]
})
print(df.describe())
Output (rounded to two decimals):
| Stat | Age | Salary |
|---|---|---|
| count | 5.00 | 5.00 |
| mean | 35.00 | 70000.00 |
| std | 7.91 | 15811.39 |
| min | 25.00 | 50000.00 |
| 25% | 30.00 | 60000.00 |
| 50% | 35.00 | 70000.00 |
| 75% | 40.00 | 80000.00 |
| max | 45.00 | 90000.00 |
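One thing to watch: the std row above is the sample standard deviation (divide by n - 1), which is the pandas default, while np.std divides by n unless told otherwise. A short sketch of the difference, using the Age column:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45])

sample_std = ages.std()                     # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(ages.to_numpy())    # NumPy default: ddof=0 (divide by n)
numpy_sample_std = np.std(ages.to_numpy(), ddof=1)  # ask NumPy for the sample version

print(round(sample_std, 2))         # matches the describe() table
print(round(population_std, 2))     # smaller, because it divides by n
print(np.isclose(sample_std, numpy_sample_std))
```

Pass ddof explicitly when you compare numbers from NumPy and pandas, so both use the same definition.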
Skewness
Is the data tilted left or right?
code.py
from scipy.stats import skew
data = [1, 2, 2, 3, 3, 3, 10] # Has outlier
print(skew(data)) # Positive = right skew
- Positive skew: Tail goes right (outliers are high)
- Negative skew: Tail goes left (outliers are low)
- Zero: Symmetric
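To make the sign of the skewness less mysterious, here is a sketch that computes it by hand as the third moment divided by the second moment to the power 1.5. This is the formula scipy.stats.skew uses with its default settings. The names dev, m2 and m3 are only for illustration.

```python
import numpy as np
from scipy.stats import skew

data = np.array([1, 2, 2, 3, 3, 3, 10])  # same data as above, 10 is the outlier

dev = data - data.mean()       # distance of each value from the mean
m2 = np.mean(dev ** 2)         # second moment (population variance)
m3 = np.mean(dev ** 3)         # third moment: large positive deviations dominate
manual_skew = m3 / m2 ** 1.5

print(manual_skew > 0)                        # right skew
print(np.isclose(manual_skew, skew(data)))    # agrees with SciPy's default
print(data.mean() > np.median(data))          # the mean is pulled toward the tail
```

Note that pandas' .skew() applies a small-sample correction, so it can report a somewhat different number than SciPy for the same data.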
Kurtosis
How peaked is the data, and how heavy are its tails?
code.py
from scipy.stats import kurtosis
print(kurtosis(data)) # uses data from the skewness example; excess kurtosis (normal = 0)
- High kurtosis: Sharp peak, heavy tails
- Low kurtosis: Flat peak, light tails
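By default scipy.stats.kurtosis reports excess kurtosis (Fisher's definition), where a normal distribution scores 0. Passing fisher=False gives the Pearson version, where a normal distribution scores 3. A quick sketch of the relationship:

```python
import numpy as np
from scipy.stats import kurtosis

data = [1, 2, 2, 3, 3, 3, 10]

excess = kurtosis(data)                  # default: fisher=True, normal = 0
pearson = kurtosis(data, fisher=False)   # Pearson definition, normal = 3

print(np.isclose(pearson - excess, 3.0))  # the two differ by exactly 3
```

When comparing kurtosis values across tools, check which definition each one uses.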
Complete Example
code.py
import pandas as pd
import numpy as np
from scipy import stats
# Sample data
df = pd.DataFrame({
'Sales': [100, 150, 120, 180, 200, 90, 160, 140, 170, 130]
})
# All descriptive stats
print("=== Descriptive Statistics ===")
print(f"Mean: {df['Sales'].mean():.2f}")
print(f"Median: {df['Sales'].median():.2f}")
# Every Sales value is unique, so mode() returns all of them (sorted); [0] is just the smallest
print(f"Mode: {df['Sales'].mode()[0]}")
print(f"Std Dev: {df['Sales'].std():.2f}")
print(f"Variance: {df['Sales'].var():.2f}")
print(f"Min: {df['Sales'].min()}")
print(f"Max: {df['Sales'].max()}")
print(f"Range: {df['Sales'].max() - df['Sales'].min()}")
print(f"Q1: {df['Sales'].quantile(0.25):.2f}")
print(f"Q3: {df['Sales'].quantile(0.75):.2f}")
print(f"IQR: {df['Sales'].quantile(0.75) - df['Sales'].quantile(0.25):.2f}")
print(f"Skewness: {df['Sales'].skew():.2f}")
When to Use What?
| Statistic | Use When |
|---|---|
| Mean | Data is symmetric, no outliers |
| Median | Data has outliers or is skewed |
| Std Dev | Measuring spread around the mean (symmetric data) |
| IQR | Measuring spread around the median (skewed data or outliers) |
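The table can be turned into a small helper. The sketch below is one possible way to do it, not a standard recipe: the function name summarize and the skewness cutoff of 1.0 are assumptions chosen for illustration.

```python
import pandas as pd

def summarize(values, skew_threshold=1.0):
    """Pick mean/std or median/IQR depending on how skewed the data is.

    Hypothetical helper: the threshold is a rule of thumb, not a fixed standard.
    """
    s = pd.Series(values)
    if abs(s.skew()) > skew_threshold:
        # Skewed data or outliers: use the robust pair
        iqr = s.quantile(0.75) - s.quantile(0.25)
        return {"center": s.median(), "spread": iqr, "method": "median/IQR"}
    # Roughly symmetric data: use the classic pair
    return {"center": s.mean(), "spread": s.std(), "method": "mean/std"}

symmetric = summarize([10, 20, 30, 40, 50])
skewed = summarize([1, 2, 2, 3, 3, 3, 10])

print(symmetric["method"], symmetric["center"])
print(skewed["method"], skewed["center"])
```

The symmetric list is summarized with mean and standard deviation, while the list with the outlier falls back to median and IQR.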
Key Points
- Mean = average (affected by outliers)
- Median = middle value (robust to outliers)
- Std Dev = typical distance from mean
- IQR = spread of middle 50%
- Use describe() for quick summary
- Check skewness for data shape
What's Next?
Learn the basics of probability.