Distribution Analysis

What is Distribution?

A distribution describes how the values in a dataset are spread out:

Are most values in the middle?
Are they spread evenly?
Is there a long tail on one side?

Normal Distribution (Bell Curve)

The best-known pattern:

Most values in the middle
Fewer values at extremes
Symmetric left and right

Example: Human heights
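
One way to see these three properties is to simulate bell-shaped data and check it. The sketch below assumes NumPy is installed and uses made-up parameters (mean 170, standard deviation 10, as if heights in cm); the seed and sample size are arbitrary choices, not real measurements.

```python
import numpy as np
import pandas as pd

# Simulated, not real, data: 10,000 draws from a normal distribution
rng = np.random.default_rng(seed=0)
heights = pd.Series(rng.normal(loc=170, scale=10, size=10_000), name="Height")

# Symmetric data: mean and median nearly equal, skewness near 0
print("Mean:", round(heights.mean(), 2))
print("Median:", round(heights.median(), 2))
print("Skewness:", round(heights.skew(), 2))

# Most values sit in the middle: share within one standard deviation of the mean
within_one_std = ((heights - heights.mean()).abs() <= heights.std()).mean()
print("Share within 1 std:", round(within_one_std, 2))
```

For normal data, roughly two thirds of the values should fall within one standard deviation of the mean.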

Check Distribution Shape

import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 31, 32, 35, 40, 65]
})

# Basic stats
print("Mean:", df['Age'].mean())
print("Median:", df['Age'].median())
print("Skewness:", df['Age'].skew())

Skewness Explained

skew = df['Age'].skew()

Skewness	Shape	Example
~0	Symmetric	Test scores
> 0	Right tail (most values low, a few very high)	Income
< 0	Left tail (most values high, a few very low)	Age at retirement
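
To make the number less of a black box, here is a minimal sketch that recomputes skewness by hand. It assumes pandas uses the adjusted Fisher-Pearson sample formula (which is my understanding of `Series.skew()`); the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")

n = len(ages)
# Standardize each value using the sample standard deviation (ddof=1, the pandas default)
z = (ages - ages.mean()) / ages.std()

# Adjusted Fisher-Pearson sample skewness: cubing keeps the sign,
# so values far above the mean push the result positive
manual_skew = n / ((n - 1) * (n - 2)) * (z ** 3).sum()

print("Manual skewness:", round(manual_skew, 4))
print("pandas skewness:", round(ages.skew(), 4))
```

The result is positive here because the single value of 65 lies far above the mean, and its cubed distance outweighs all the values below the mean.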

Compare Mean and Median

mean = df['Age'].mean()
median = df['Age'].median()

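# Rule of thumb only: extreme values pull the mean toward the tail
if mean > median:
    print("Likely right-skewed (mean pulled up by high values)")
elif mean < median:
    print("Likely left-skewed (mean pulled down by low values)")
else:
    print("Roughly symmetric")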
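
The comparison works because the mean reacts to extreme values much more than the median does. A small sketch using the same ages, with and without the 65-year-old (the `without_outlier` name is just for illustration):

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65])
without_outlier = ages[ages < 65]  # drop the single 65-year-old

print("With 65    -> mean:", ages.mean(), "median:", ages.median())
print("Without 65 -> mean:", without_outlier.mean(), "median:", without_outlier.median())
```

Removing one value moves the mean from 33.5 to 30.0 but the median only from 30.5 to 30.0.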

Percentiles

# Where do values fall?
print("10th percentile:", df['Age'].quantile(0.10))
print("25th percentile:", df['Age'].quantile(0.25))
print("50th percentile:", df['Age'].quantile(0.50))  # Median
print("75th percentile:", df['Age'].quantile(0.75))
print("90th percentile:", df['Age'].quantile(0.90))

90th percentile = about 90% of values are at or below this value
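
A quick way to check what a percentile means is to count how many values fall at or below it. This is a sketch on the same ten ages; with so few rows the match is only approximate, because pandas interpolates between neighboring values by default.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")

shares = {}
for q in [0.10, 0.25, 0.50, 0.75, 0.90]:
    cutoff = ages.quantile(q)            # linear interpolation by default
    shares[q] = (ages <= cutoff).mean()  # fraction of values at or below the cutoff
    print(f"{q:.0%} quantile = {cutoff:.2f}, share at or below = {shares[q]:.0%}")
```

On larger datasets the share at or below each cutoff gets much closer to the percentile itself.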

Value Counts (Histogram Data)

# Count values in ranges
print(df['Age'].value_counts(bins=5).sort_index())

Output:

(21.957, 30.6]    5
(30.6, 39.2]      3
(39.2, 47.8]      1
(47.8, 56.4]      0
(56.4, 65.0]      1

Half of the people (5 of 10) are 22-30 years old, and the single 65-year-old sits alone in the last bin.
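
Automatic bins often have awkward edges such as 30.6. If round numbers are easier to read, `pd.cut` accepts explicit edges. The edges below are a hypothetical choice for this data, not a recommendation.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 28, 30, 31, 32, 35, 40, 65], name="Age")

# Hypothetical round-number edges; each bin includes its right edge, e.g. (20, 30]
edges = [20, 30, 40, 50, 60, 70]
counts = pd.cut(ages, bins=edges).value_counts().sort_index()
print(counts)
```

Note that changing the edges changes the counts (40 now falls in the second bin), so the picture a histogram gives depends partly on how the bins are drawn.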

Kurtosis (Tail Weight)

print("Kurtosis:", df['Age'].kurtosis())

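High kurtosis (> 0): Heavy tails, more extreme values than a bell curve
Low kurtosis (< 0): Light tails, fewer extreme values
~0: Normal bell curve (pandas reports excess kurtosis, so normal is 0, not 3)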
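
To get a feel for these values, the sketch below compares three simulated samples. It assumes NumPy is available; the distributions, seed, and sample size are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 100_000

# Simulated samples with light, normal, and heavy tails
samples = {
    "uniform (light tails)": rng.uniform(-1, 1, size=n),
    "normal (reference)": rng.normal(0, 1, size=n),
    "laplace (heavy tails)": rng.laplace(0, 1, size=n),
}

# Series.kurtosis() returns excess kurtosis, so the normal sample lands near 0
kurt = {name: pd.Series(values).kurtosis() for name, values in samples.items()}
for name, value in kurt.items():
    print(f"{name}: {value:.2f}")
```

The uniform sample should come out clearly negative, the normal sample close to 0, and the Laplace sample clearly positive.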

Quick Distribution Summary

def analyze_distribution(series):
    """Print basic distribution statistics for a numeric pandas Series."""
    skew = series.skew()  # compute once, reuse below

    print(f"Column: {series.name}")
    print(f"Mean: {series.mean():.2f}")
    print(f"Median: {series.median():.2f}")
    print(f"Std: {series.std():.2f}")
    print(f"Skewness: {skew:.2f}")
    print(f"Min: {series.min()}")
    print(f"Max: {series.max()}")

    # Shape interpretation (0.5 is a common rule-of-thumb cutoff, not a strict rule)
    if abs(skew) < 0.5:
        print("Shape: Approximately symmetric")
    elif skew > 0:
        print("Shape: Right-skewed (tail to right)")
    else:
        print("Shape: Left-skewed (tail to left)")

analyze_distribution(df['Age'])
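
When a table has several numeric columns, the same statistics can be collected into one table instead of printed one column at a time. This is a sketch using `DataFrame.agg`; the `Income` column and its values are made up for illustration.

```python
import pandas as pd

# 'Income' is a made-up column added only for illustration
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 31, 32, 35, 40, 65],
    'Income': [28, 31, 33, 35, 38, 40, 41, 47, 60, 150],
})

# One row per column, one column per statistic
summary = df.agg(['mean', 'median', 'std', 'skew', 'min', 'max']).T
print(summary.round(2))
```

A table like this makes it easy to scan for columns whose mean and median differ widely or whose skewness is far from 0.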

Common Distributions

Distribution	Shape	Examples
Normal	Bell curve	Height, test scores
Right-skewed	Tail right	Income, house prices
Uniform	Flat	Rolls of a single die
Bimodal	Two peaks	Mixed groups
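
Bimodal data is a case where summary numbers can mislead, so it helps to look at binned counts as well. The sketch below mixes two simulated groups; the centers (155 and 185), spread, seed, and bin edges are all made-up values for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Two made-up groups with different centers, mixed into one column
group_a = rng.normal(loc=155, scale=5, size=5_000)
group_b = rng.normal(loc=185, scale=5, size=5_000)
mixed = pd.Series(np.concatenate([group_a, group_b]), name="Mixed")

# Skewness is near 0 because the mix is roughly symmetric
print("Skewness:", round(mixed.skew(), 2))

# Binned counts reveal the two peaks with a dip between them
edges = [130, 140, 150, 160, 170, 180, 190, 200, 210]
counts = pd.cut(mixed, bins=edges).value_counts().sort_index()
print(counts)
```

A skewness near 0 alone would suggest a simple symmetric shape here; only the counts show that there are two separate peaks.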

Key Points

Distribution = how values spread
Skewness tells direction of tail
Mean vs Median hints at skewness (a rule of thumb, not a guarantee)
Percentiles show where values fall
Most real data is NOT perfectly normal

What's Next?

Learn to create summary reports that combine all your analysis.