Distribution Analysis
Learn to understand how your data is spread
What is Distribution?
A distribution describes how the values in a dataset are spread out:
- Are most values in the middle?
- Are they spread evenly?
- Is there a long tail on one side?
Normal Distribution (Bell Curve)
The best-known pattern, and a useful reference point:
- Most values in the middle
- Fewer values at extremes
- Symmetric left and right
Example: Human heights
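As a rough illustration of the bell-curve pattern, the sketch below simulates heights from a normal distribution and checks how many values fall near the middle. The mean of 170 cm, the standard deviation of 10 cm, the seed, and the sample size are assumptions chosen for the example, not measured data.

```python
import numpy as np
import pandas as pd

# Simulated (not real) heights in cm; the parameters are assumptions.
rng = np.random.default_rng(seed=42)
heights = pd.Series(rng.normal(loc=170, scale=10, size=10_000), name="Height")

mean = heights.mean()
std = heights.std()

# Share of values within one and two standard deviations of the mean.
within_1 = heights.between(mean - std, mean + std).mean()
within_2 = heights.between(mean - 2 * std, mean + 2 * std).mean()

print(f"Within 1 std: {within_1:.1%}")    # roughly 68% for normal data
print(f"Within 2 std: {within_2:.1%}")    # roughly 95% for normal data
print(f"Skewness: {heights.skew():.2f}")  # close to 0 for symmetric data
```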
Check Distribution Shape
code.py
import pandas as pd
df = pd.DataFrame({
'Age': [22, 25, 27, 28, 30, 31, 32, 35, 40, 65]
})
# Basic stats
print("Mean:", df['Age'].mean())
print("Median:", df['Age'].median())
print("Skewness:", df['Age'].skew())
Skewness Explained
code.py
skew = df['Age'].skew()  # positive for this data: the tail is on the right
| Skewness | Shape | Example |
|---|---|---|
| ~0 | Symmetric | Test scores |
| > 0 | Right tail (most values are low) | Income |
| < 0 | Left tail (most values are high) | Age at retirement |
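To see the sign of the skewness in practice, here is a small sketch with three made-up series, one for each row of the table. The values are invented for illustration only.

```python
import pandas as pd

# Made-up values chosen so that each series has a clear shape.
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_tail = pd.Series([1, 1, 2, 2, 3, 4, 20])      # one large value on the right
left_tail = pd.Series([1, 17, 18, 19, 19, 20, 20])  # one small value on the left

for label, s in [("symmetric", symmetric),
                 ("right tail", right_tail),
                 ("left tail", left_tail)]:
    print(f"{label:>10}: skew = {s.skew():+.2f}")
```

The left-tail series is a mirror image of the right-tail one, so the two skewness values have the same size and opposite signs.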
Compare Mean and Median
code.py
mean = df['Age'].mean()
median = df['Age'].median()
# The mean is pulled toward the tail, so its position relative to the median hints at the skew
if mean > median:
    print("Right-skewed (has high outliers)")
elif mean < median:
    print("Left-skewed (has low outliers)")
else:
    print("Symmetric")
Percentiles
code.py
# Where do values fall?
print("10th percentile:", df['Age'].quantile(0.10))
print("25th percentile:", df['Age'].quantile(0.25))
print("50th percentile:", df['Age'].quantile(0.50)) # Median
print("75th percentile:", df['Age'].quantile(0.75))
print("90th percentile:", df['Age'].quantile(0.90))
The 90th percentile is the value that about 90% of the data falls below.
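The statement above can be checked directly. This sketch reuses the Age values from earlier, computes the 90th percentile, and measures the share of values below it. Note that `quantile` interpolates linearly between data points by default, so the result need not be a value that appears in the data.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")

p90 = ages.quantile(0.90)          # interpolated between the two largest values
share_below = (ages < p90).mean()  # fraction of values below the percentile

print("90th percentile:", p90)
print("Share of values below it:", share_below)
```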
Value Counts (Histogram Data)
code.py
# Count values in ranges
print(df['Age'].value_counts(bins=5).sort_index())
Output:
(21.957, 30.6] 5
(30.6, 39.2] 3
(39.2, 47.8] 1
(47.8, 56.4] 0
(56.4, 65.0] 1
Half of the people (5 of 10) fall in the lowest bin, roughly ages 22 to 30, and a single value sits far out in the top bin.
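Automatic bins produce awkward edges such as 21.957. If round edges are easier to read, one option is `pd.cut` with explicit bin edges. The edges below (20 to 70 in steps of 10) are an assumption picked for this example.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")

# Hypothetical bin edges chosen for readability; intervals are right-closed,
# so 30 lands in (20, 30] and 40 lands in (30, 40].
edges = [20, 30, 40, 50, 60, 70]
binned = pd.cut(ages, bins=edges)

counts = binned.value_counts().sort_index()
print(counts)
```

Because the edges differ from the automatic ones, the counts per bin differ slightly as well: 40 now shares a bin with the values in the thirties.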
Kurtosis (Peakedness)
code.py
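print("Kurtosis:", df['Age'].kurtosis())
- High kurtosis (above 0): Sharp peak, heavy tails
- Low kurtosis (below 0): Flat top, light tails
- ~0: Similar to a normal bell curve (pandas reports excess kurtosis, so a normal distribution scores 0)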
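A quick way to build intuition is to compare a flat series with one that has a single extreme value. The sketch below uses the Age data, with and without the value 65, next to an evenly spaced series made up for the comparison.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")
ages_no_outlier = ages[ages < 65]  # drop the single extreme value
flat = pd.Series(range(1, 11))     # evenly spaced values, no tails

print("Ages:", round(ages.kurtosis(), 2))             # positive: heavy tail from 65
print("Ages without 65:", round(ages_no_outlier.kurtosis(), 2))
print("Flat series:", round(flat.kurtosis(), 2))      # negative: lighter tails than normal
```

With only ten values the estimates are noisy, so treat them as a direction rather than a precise measurement.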
Quick Distribution Summary
code.py
def analyze_distribution(series):
    print(f"Column: {series.name}")
    print(f"Mean: {series.mean():.2f}")
    print(f"Median: {series.median():.2f}")
    print(f"Std: {series.std():.2f}")
    print(f"Skewness: {series.skew():.2f}")
    print(f"Min: {series.min()}")
    print(f"Max: {series.max()}")
    # Shape interpretation
    if abs(series.skew()) < 0.5:
        print("Shape: Approximately symmetric")
    elif series.skew() > 0:
        print("Shape: Right-skewed (tail to right)")
    else:
        print("Shape: Left-skewed (tail to left)")

analyze_distribution(df['Age'])
Common Distributions
| Distribution | Shape | Examples |
|---|---|---|
| Normal | Bell curve | Height, test scores |
| Right-skewed | Tail right | Income, house prices |
| Uniform | Flat | Dice rolls |
| Bimodal | Two peaks | Mixed groups |
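To connect the table to numbers, the sketch below draws a synthetic sample for each shape with NumPy and prints its skewness and excess kurtosis. All parameters, the seed, and the sample size are assumptions for illustration; an exponential sample stands in for right-skewed data such as income.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

samples = pd.DataFrame({
    "normal": rng.normal(loc=0, scale=1, size=n),
    "right_skewed": rng.exponential(scale=1, size=n),
    "uniform": rng.uniform(low=0, high=1, size=n),
    # Bimodal: half the values around 0, half around 6.
    "bimodal": np.concatenate([
        rng.normal(loc=0, scale=1, size=n // 2),
        rng.normal(loc=6, scale=1, size=n // 2),
    ]),
})

summary = pd.DataFrame({
    "skew": samples.skew(),
    "kurtosis": samples.kurtosis(),
})
print(summary.round(2))
```

Skewness near 0 does not by itself imply a bell curve: the uniform and bimodal samples are also roughly symmetric, which is why kurtosis or binned counts are worth checking too.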
Key Points
- Distribution = how values are spread
- Skewness tells you the direction of the tail
- Comparing mean and median hints at the direction of skew
- Percentiles show where values fall
- Most real data is NOT perfectly normal
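As a small recap of the mean-versus-median point, the sketch below reuses the Age values and shows how the single value 65 moves the mean more than the median.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")
without_65 = ages[ages != 65]

# The mean reacts to the extreme value; the median barely moves.
print("Without 65 -> mean:", without_65.mean(), "median:", without_65.median())
print("With 65    -> mean:", ages.mean(), "median:", ages.median())
```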
What's Next?
Learn to create summary reports that combine all your analysis.