Statistical Functions

Basic Statistics

Mean (Average)

code.py

import numpy as np

scores = np.array([85, 90, 78, 92, 88])
average = np.mean(scores)
print("Average:", average)

Output: 86.6

What mean tells you: The central value of the data, computed as the sum of all values divided by the count.
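The "sum divided by count" definition can be checked by hand. A minimal sketch, reusing the scores array from above (the name manual_mean is only illustrative):

```python
import numpy as np

scores = np.array([85, 90, 78, 92, 88])

# Mean by definition: total of all values divided by how many there are
manual_mean = scores.sum() / scores.size

print("Manual mean:", manual_mean)
print("Matches np.mean:", np.isclose(manual_mean, np.mean(scores)))
```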

Median (Middle Value)

code.py

import numpy as np

salaries = np.array([45000, 50000, 55000, 60000, 150000])
median = np.median(salaries)
print("Median:", median)

Output: 55000.0

Why median matters: Extreme values barely affect it. The 150000 salary pulls the mean up to 72000, but the median stays at 55000.
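To see the effect of the extreme value directly, the sketch below compares mean and median with and without the 150000 salary. The variable without_outlier is an illustrative name, not part of the original example.

```python
import numpy as np

salaries = np.array([45000, 50000, 55000, 60000, 150000])
without_outlier = salaries[:-1]  # drop the 150000 salary

# The outlier moves the mean much further than the median
print("Mean with outlier:", np.mean(salaries))
print("Median with outlier:", np.median(salaries))
print("Mean without outlier:", np.mean(without_outlier))
print("Median without outlier:", np.median(without_outlier))
```

Removing one value shifts the mean by 19500 but the median by only 2500.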

Mode (Most Common)

NumPy doesn't have a built-in mode function, but SciPy provides one, and you can also compute the mode yourself with np.unique.

code.py

import numpy as np
from scipy import stats

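# Recent SciPy versions only accept numeric arrays in stats.mode,
# so the grades are stored as numbers rather than letters
grades = np.array([90, 80, 90, 70, 90, 80])
mode = stats.mode(grades).mode
print("Most common grade:", mode)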

Or count manually:

code.py

import numpy as np

numbers = np.array([1, 2, 2, 3, 2, 4])
values, counts = np.unique(numbers, return_counts=True)
mode_index = np.argmax(counts)
mode = values[mode_index]
print("Mode:", mode)
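The manual counting approach also works for non-numeric labels such as letter grades. A small sketch, assuming an array of grade strings (the names labels and most_common are illustrative):

```python
import numpy as np

grades = np.array(["A", "B", "A", "C", "A", "B"])

# np.unique returns the sorted distinct labels and how often each occurs
labels, counts = np.unique(grades, return_counts=True)
most_common = labels[np.argmax(counts)]

print("Most common grade:", most_common)
```

If several labels tie for the highest count, np.argmax picks the first one in sorted order.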

Spread and Variability

Standard Deviation

Standard deviation measures how spread out the numbers are around the mean.

code.py

import numpy as np

data1 = np.array([10, 10, 10, 10])
data2 = np.array([5, 10, 15, 20])

# Round so the printed values are easy to read
print("Data 1 std:", round(np.std(data1), 2))
print("Data 2 std:", round(np.std(data2), 2))

Output:

Data 1 std: 0.0
Data 2 std: 5.59

What this means:

0 = no variation (all same)
Higher number = more spread out
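For intuition, the standard deviation can be rebuilt step by step: take each value's distance from the mean, square it, average the squares, and take the square root. This is a sketch of the default (population) formula that np.std uses; the intermediate names are illustrative.

```python
import numpy as np

data2 = np.array([5, 10, 15, 20])

deviations = data2 - data2.mean()               # distance of each value from the mean
manual_std = np.sqrt(np.mean(deviations ** 2))  # root of the mean squared deviation

print("Manual std:", round(manual_std, 2))
print("Matches np.std:", np.isclose(manual_std, np.std(data2)))
```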

Variance

Variance is the square of the standard deviation.

code.py

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
variance = np.var(data)
std_dev = np.std(data)

print("Variance:", variance)
print("Std Dev:", std_dev)
print("Check:", std_dev ** 2)

Relationship: variance = std_dev²
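One detail worth knowing: np.var and np.std divide by n by default (population statistics). When the data is a sample from a larger population, the ddof=1 argument switches to dividing by n - 1. A brief sketch using the same data:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_var = np.var(data)      # default ddof=0: divide by n
sample_var = np.var(data, ddof=1)  # ddof=1: divide by n - 1

print("Population variance:", population_var)
print("Sample variance:", round(sample_var, 3))
```

Which version is appropriate depends on whether the array holds the whole population or only a sample of it.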

Minimum and Maximum

code.py

import numpy as np

temps = np.array([72, 68, 75, 70, 73])

print("Min:", np.min(temps))
print("Max:", np.max(temps))
print("Range:", np.ptp(temps))

Output:

Min: 68
Max: 75
Range: 7

What ptp means: Peak to peak (max - min).
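The "max - min" definition can be verified directly, and np.argmin / np.argmax report where the extremes occur. A minimal sketch (manual_range is an illustrative name):

```python
import numpy as np

temps = np.array([72, 68, 75, 70, 73])

manual_range = temps.max() - temps.min()
print("Manual range matches np.ptp:", manual_range == np.ptp(temps))

# Positions (indices) of the lowest and highest readings
print("Index of min:", np.argmin(temps))
print("Index of max:", np.argmax(temps))
```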

Percentiles and Quantiles

Percentiles

code.py

import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])

p25 = np.percentile(scores, 25)
p50 = np.percentile(scores, 50)
p75 = np.percentile(scores, 75)

print("25th percentile:", p25)
print("50th percentile:", p50)
print("75th percentile:", p75)

What percentiles mean:

25th: About 25 percent of the data falls at or below this value
50th: Same as the median
75th: About 75 percent of the data falls at or below this value
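Quantiles express the same idea on a 0 to 1 scale instead of 0 to 100. The sketch below uses np.quantile and shows that it agrees with np.percentile; both accept a list of cut points in a single call.

```python
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])

# np.quantile takes fractions in [0, 1]; np.percentile takes values in [0, 100]
quantiles = np.quantile(scores, [0.25, 0.5, 0.75])
percentiles = np.percentile(scores, [25, 50, 75])

print("Quantiles:", quantiles)
print("Same as percentiles:", np.allclose(quantiles, percentiles))
```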

Quartiles

code.py

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)

print("Q1 (25 percent):", q1)
print("Q2 (50 percent):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)

What IQR is: The interquartile range (Q3 - Q1), which measures the spread of the middle 50 percent of the data.
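One common use of the IQR is flagging outliers. The sketch below applies the conventional "1.5 times IQR" rule of thumb, which is a statistical convention rather than anything built into NumPy. The data values and variable names are made up for illustration.

```python
import numpy as np

data = np.array([10, 12, 13, 14, 15, 16, 18, 60])

q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Rule of thumb: values more than 1.5 * IQR outside the quartiles are suspicious
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("IQR:", iqr)
print("Outliers:", outliers)
```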

Correlation

Correlation measures the strength and direction of the linear relationship between two datasets.

code.py

import numpy as np

study_hours = np.array([2, 3, 4, 5, 6])
test_scores = np.array([60, 70, 80, 85, 95])

correlation = np.corrcoef(study_hours, test_scores)[0, 1]
print("Correlation:", round(correlation, 2))

Correlation values:

1.0 = perfect positive (both increase together)
0.0 = no relationship
-1.0 = perfect negative (one increases, other decreases)
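The two extremes are easy to reproduce with data that lies exactly on a straight line. In this sketch, rising and falling are hypothetical series built from x so the expected correlations are known in advance.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
rising = 2 * x + 1     # increases exactly in step with x
falling = -3 * x + 20  # decreases exactly in step with x

corr_rising = np.corrcoef(x, rising)[0, 1]
corr_falling = np.corrcoef(x, falling)[0, 1]

print("Rising:", round(corr_rising, 2))
print("Falling:", round(corr_falling, 2))
```

Real data rarely sits on a perfect line, so values usually land somewhere between these extremes.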

Cumulative Statistics

Cumulative Sum

code.py

import numpy as np

daily_sales = np.array([100, 150, 200, 180, 220])
cumulative = np.cumsum(daily_sales)
print("Daily sales:", daily_sales)
print("Cumulative:", cumulative)

Output:

Daily sales: [100 150 200 180 220]
Cumulative: [100 250 450 630 850]

Use case: Track running totals over time.

Cumulative Product

code.py

import numpy as np

growth_rates = np.array([1.05, 1.03, 1.04])
cumulative = np.cumprod(growth_rates)
print("Growth rates:", growth_rates)
print("Cumulative growth:", cumulative)

Use case: Compound growth calculations.
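As an illustration of compound growth, the cumulative product can be multiplied by a starting amount to get the balance after each period. The starting value of 1000 is a made-up figure for this sketch.

```python
import numpy as np

initial_amount = 1000  # hypothetical starting balance
growth_rates = np.array([1.05, 1.03, 1.04])

# Each entry is the balance after applying all growth rates up to that period
balances = initial_amount * np.cumprod(growth_rates)

print("Balance after each period:", balances)
print("Total growth factor:", np.prod(growth_rates))
```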

Statistics on Specific Axis

For 2D arrays, calculate along rows or columns.

code.py

import numpy as np

scores = np.array([[85, 90, 88], [78, 82, 85], [92, 95, 90]])

print("Overall average:", scores.mean())
print("Average per student (rows):", scores.mean(axis=1))
print("Average per assignment (columns):", scores.mean(axis=0))

Output (values rounded to two decimals):

Overall average: 87.22
Average per student (rows): [87.67 81.67 92.33]
Average per assignment (columns): [85. 89. 87.67]

axis parameter:

axis=0: Down columns (per column)
axis=1: Across rows (per row)
No axis: Entire array
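The axis argument works the same way for most other statistical functions, not only mean. A short sketch with max and min on the same scores array (best_per_student and lowest_per_assignment are illustrative names):

```python
import numpy as np

scores = np.array([[85, 90, 88], [78, 82, 85], [92, 95, 90]])

best_per_student = scores.max(axis=1)       # one value per row
lowest_per_assignment = scores.min(axis=0)  # one value per column

print("Best score per student:", best_per_student)
print("Lowest score per assignment:", lowest_per_assignment)
```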

Practice Example

The scenario: Analyze monthly sales data for insights.

code.py

import numpy as np

monthly_sales = np.array([15000, 18000, 22000, 19000, 25000, 21000, 23000, 26000, 24000, 28000, 27000, 30000])

print("Sales Analysis")
print("=" * 50)
print()

print("Basic Statistics:")
print("Mean:", round(monthly_sales.mean(), 2))
print("Median:", np.median(monthly_sales))  # arrays have no .median() method
print("Min:", monthly_sales.min())
print("Max:", monthly_sales.max())
print("Range:", np.ptp(monthly_sales))  # the ndarray.ptp() method was removed in NumPy 2.0
print("Std Dev:", round(monthly_sales.std(), 2))
print()

print("Quartile Analysis:")
q1 = np.percentile(monthly_sales, 25)
q2 = np.percentile(monthly_sales, 50)
q3 = np.percentile(monthly_sales, 75)
print("Q1 (25 percent):", q1)
print("Q2 (Median):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)
print()

cumulative = np.cumsum(monthly_sales)
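# Running totals at the end of each calendar quarter (not the quartiles above)
print("Cumulative Sales:")
print("Through Quarter 1 (Mar):", cumulative[2])
print("Through Quarter 2 (Jun):", cumulative[5])
print("Through Quarter 3 (Sep):", cumulative[8])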
print("Year Total:", cumulative[-1])
print()

above_average = monthly_sales[monthly_sales > monthly_sales.mean()]
print("Above Average Months:", len(above_average))
print("Values:", above_average)
print()

growth = np.diff(monthly_sales)
print("Month-over-Month Growth:")
print("Average growth:", round(growth.mean(), 2))
print("Max growth:", growth.max())
print("Min growth:", growth.min())

What this analyzes:

Central tendency (mean, median)
Spread (range, std dev)
Distribution (quartiles, IQR)
Cumulative totals
Performance relative to average
Growth trends

Weighted Average

code.py

import numpy as np

grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])

weighted_avg = np.average(grades, weights=weights)
print("Weighted average:", weighted_avg)

Use case: Some values matter more than others.
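Under the hood, a weighted average multiplies each value by its weight, adds the results, and divides by the total weight. A minimal sketch of that calculation next to np.average (manual_weighted is an illustrative name):

```python
import numpy as np

grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])

# Sum of value * weight, divided by the sum of the weights
manual_weighted = np.sum(grades * weights) / np.sum(weights)

print("Manual weighted average:", round(manual_weighted, 2))
print("Matches np.average:", np.isclose(manual_weighted, np.average(grades, weights=weights)))
```

Because of the division by the total weight, the weights do not have to sum to 1.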

Histogram Data

code.py

import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, bins = np.histogram(data, bins=5)

print("Counts:", counts)
print("Bin edges:", bins)

What this gives: Frequency distribution of data.
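Instead of a bin count, np.histogram also accepts explicit bin edges, which can make the result easier to interpret. In this sketch the edges are chosen by hand so that each integer gets its own bin.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Each bin covers [left, right), except the last, which also includes its right edge
edges = [1, 2, 3, 4, 5, 6]
counts, bins = np.histogram(data, bins=edges)

for left, right, count in zip(bins[:-1], bins[1:], counts):
    print(f"{left} to {right}: {count}")
```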

Covariance

code.py

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov_matrix = np.cov(x, y)
print("Covariance:", cov_matrix[0, 1])

What covariance shows: How two variables change together.
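Covariance and correlation are closely related: dividing the covariance by the product of the two standard deviations gives the correlation. The sketch below checks this. Note that np.cov uses the sample formula (ddof=1) by default, so the standard deviations here use ddof=1 as well to match.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (ddof=1 by default)

# Normalizing by both standard deviations removes the units and gives correlation
manual_corr = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print("Covariance:", cov_xy)
print("Correlation from covariance:", round(manual_corr, 2))
print("Matches np.corrcoef:", np.isclose(manual_corr, np.corrcoef(x, y)[0, 1]))
```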

Key Points to Remember

np.mean() for average, np.median() for middle value, np.std() for spread, np.var() for variance. Median is only available as the function np.median(), not as an array method.

Percentiles show data distribution. 50th percentile equals median.

Use axis parameter: axis=0 for columns, axis=1 for rows, none for entire array.

Correlation measures relationship strength (-1 to 1). Covariance shows how variables change together.

Cumulative functions (cumsum, cumprod) are useful for tracking running totals and compound growth.

Common Mistakes

Mistake 1: Confusing mean and median

code.py

[1, 2, 100]  # Mean ≈ 34.33, Median = 2

Use median for data with outliers.

Mistake 2: Wrong axis

code.py

matrix.mean(axis=0)  # Per column
matrix.mean(axis=1)  # Per row
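A quick way to catch a wrong axis is to test on a non-square array and check the shape of the result. In this sketch, matrix is a made-up 2 by 3 example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])  # 2 rows, 3 columns

col_means = matrix.mean(axis=0)  # collapses the rows: one value per column
row_means = matrix.mean(axis=1)  # collapses the columns: one value per row

print("Per column:", col_means, "shape:", col_means.shape)
print("Per row:", row_means, "shape:", row_means.shape)
```

If the result has 3 entries you averaged per column, and if it has 2 you averaged per row.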

Mistake 3: Percentile vs percent

code.py

np.percentile(data, 25)  # 25th percentile
data * 0.25  # each value scaled to 25 percent of itself (different!)

Mistake 4: Standard deviation vs variance

code.py

std = np.std(data)
var = np.var(data)
# var = std ** 2

What's Next?

You now know statistical analysis with NumPy. Next, you'll learn array reshaping and manipulation - transforming arrays into different shapes and structures.