Statistical Functions
Learn to perform statistical analysis with NumPy
Basic Statistics
Mean (Average)
import numpy as np
scores = np.array([85, 90, 78, 92, 88])
average = np.mean(scores)
print("Average:", average)
Output:
Average: 86.6
What mean tells you: Central value. Sum divided by count.
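As a quick check of the "sum divided by count" definition, the sketch below computes the mean by hand and compares it with np.mean. The variable names are illustrative only.

```python
import numpy as np

scores = np.array([85, 90, 78, 92, 88])

# Mean by definition: total of all values divided by how many there are
manual_mean = scores.sum() / scores.size

print("Manual mean:", manual_mean)
print("np.mean:", np.mean(scores))
```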
Median (Middle Value)
import numpy as np
salaries = np.array([45000, 50000, 55000, 60000, 150000])
median = np.median(salaries)
print("Median:", median)
Output:
Median: 55000.0
Why median matters: Not affected by extreme values (150000 doesn't skew it).
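To see the effect of the outlier, here is a minimal sketch that computes both measures on the same salary data. The variable names are illustrative.

```python
import numpy as np

salaries = np.array([45000, 50000, 55000, 60000, 150000])

mean_salary = np.mean(salaries)      # pulled upward by the 150000 value
median_salary = np.median(salaries)  # middle value of the sorted data

print("Mean:", mean_salary)
print("Median:", median_salary)
```

The mean lands well above four of the five salaries, while the median stays at the middle value.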
Mode (Most Common)
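NumPy doesn't have a built-in mode function, but SciPy does. Recent SciPy versions accept only numeric arrays in stats.mode, so this example uses numeric grades.
import numpy as np
from scipy import stats
grades = np.array([90, 80, 90, 70, 90, 80])
mode = stats.mode(grades).mode  # most frequent value
print("Most common grade:", mode)
Or count manually (this approach also works for strings):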
import numpy as np
numbers = np.array([1, 2, 2, 3, 2, 4])
values, counts = np.unique(numbers, return_counts=True)
mode_index = np.argmax(counts)
mode = values[mode_index]
print("Mode:", mode)
Spread and Variability
Standard Deviation
Measures how spread out numbers are.
import numpy as np
data1 = np.array([10, 10, 10, 10])
data2 = np.array([5, 10, 15, 20])
print("Data 1 std:", np.std(data1))
print("Data 2 std:", np.std(data2))
Output:
Data 1 std: 0.0
Data 2 std: 5.59
What this means:
- 0 = no variation (all same)
- Higher number = more spread out
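One detail worth knowing: np.std computes the population standard deviation by default (it divides by n). If the data is a sample from a larger population, passing ddof=1 divides by n - 1 instead. A short sketch of the difference, reusing the data2 values from above:

```python
import numpy as np

data2 = np.array([5, 10, 15, 20])

population_std = np.std(data2)       # default ddof=0, divides by n
sample_std = np.std(data2, ddof=1)   # divides by n - 1

print("Population std:", round(population_std, 2))
print("Sample std:", round(sample_std, 2))
```

The sample version is always slightly larger, and the gap shrinks as the array grows.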
Variance
Square of standard deviation.
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
variance = np.var(data)
std_dev = np.std(data)
print("Variance:", variance)
print("Std Dev:", std_dev)
print("Check:", std_dev ** 2)
Relationship: variance = std_dev²
Minimum and Maximum
import numpy as np
temps = np.array([72, 68, 75, 70, 73])
print("Min:", np.min(temps))
print("Max:", np.max(temps))
print("Range:", np.ptp(temps))
Output:
Min: 68
Max: 75
Range: 7
What ptp means: Peak to peak (max - min).
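A small sketch confirming that np.ptp matches max minus min. It also uses np.argmin and np.argmax to find where the extremes occur, which is an addition for illustration.

```python
import numpy as np

temps = np.array([72, 68, 75, 70, 73])

manual_range = np.max(temps) - np.min(temps)
coldest_index = np.argmin(temps)  # position of the minimum
hottest_index = np.argmax(temps)  # position of the maximum

print("Manual range:", manual_range)
print("Matches np.ptp:", manual_range == np.ptp(temps))
print("Coldest at index", coldest_index, "hottest at index", hottest_index)
```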
Percentiles and Quantiles
Percentiles
import numpy as np
scores = np.array([65, 70, 75, 80, 85, 90, 95])
p25 = np.percentile(scores, 25)
p50 = np.percentile(scores, 50)
p75 = np.percentile(scores, 75)
print("25th percentile:", p25)
print("50th percentile:", p50)
print("75th percentile:", p75)
What percentiles mean:
- 25th: 25 percent of data is below this
- 50th: Same as median
- 75th: 75 percent of data is below this
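np.percentile also accepts a list of percentiles, and np.quantile does the same job on a 0 to 1 scale. A brief sketch using the scores array from above:

```python
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])

# Several percentiles in one call (0-100 scale)
p25, p50, p75 = np.percentile(scores, [25, 50, 75])

# Same values via quantiles (0-1 scale)
q25, q50, q75 = np.quantile(scores, [0.25, 0.5, 0.75])

print("Percentiles:", p25, p50, p75)
print("Median check:", p50 == np.median(scores))
```

With the default method, NumPy interpolates linearly between neighboring values, which is why the result can be a number that does not appear in the data.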
Quartiles
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)
print("Q1 (25 percent):", q1)
print("Q2 (50 percent):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)
What IQR is: Interquartile Range. Spread of middle 50 percent.
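One common use of the IQR is flagging outliers with the 1.5 x IQR rule. This is a widely used convention rather than something built into NumPy, and the extra value 50 below is added purely for illustration.

```python
import numpy as np

# Hypothetical data: the values 1..10 plus one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("IQR:", iqr)
print("Outliers:", outliers)
```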
Correlation
Measures the strength of the linear relationship between two datasets.
import numpy as np
study_hours = np.array([2, 3, 4, 5, 6])
test_scores = np.array([60, 70, 80, 85, 95])
correlation = np.corrcoef(study_hours, test_scores)[0, 1]
print("Correlation:", round(correlation, 2))
Correlation values:
- 1.0 = perfect positive (both increase together)
- 0.0 = no relationship
- -1.0 = perfect negative (one increases, other decreases)
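The two extremes can be reproduced with made-up data that lies exactly on a straight line. The arrays below are hypothetical and chosen only to illustrate the endpoints of the scale.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
rising = 2 * x + 1     # increases exactly in step with x
falling = -3 * x + 20  # decreases exactly in step with x

# np.corrcoef returns a 2x2 matrix; [0, 1] is the x-versus-y entry
r_pos = np.corrcoef(x, rising)[0, 1]
r_neg = np.corrcoef(x, falling)[0, 1]

print("Perfect positive:", round(r_pos, 2))
print("Perfect negative:", round(r_neg, 2))
```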
Cumulative Statistics
Cumulative Sum
import numpy as np
daily_sales = np.array([100, 150, 200, 180, 220])
cumulative = np.cumsum(daily_sales)
print("Daily sales:", daily_sales)
print("Cumulative:", cumulative)
Output:
Daily sales: [100 150 200 180 220]
Cumulative: [100 250 450 630 850]
Use case: Track running totals over time.
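A running total can also be turned into a running average by dividing by the number of days so far. A minimal sketch, with the days array introduced for illustration:

```python
import numpy as np

daily_sales = np.array([100, 150, 200, 180, 220])
cumulative = np.cumsum(daily_sales)

# Day counts 1, 2, 3, ... to divide each running total by
days = np.arange(1, daily_sales.size + 1)
running_average = cumulative / days

print("Running average:", running_average)
```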
Cumulative Product
import numpy as np
growth_rates = np.array([1.05, 1.03, 1.04])
cumulative = np.cumprod(growth_rates)
print("Growth rates:", growth_rates)
print("Cumulative growth:", cumulative)
Use case: Compound growth calculations.
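As a sketch of the compound growth use case, multiplying a starting amount by the cumulative product gives the balance after each period. The starting balance of 1000 is a hypothetical value.

```python
import numpy as np

growth_rates = np.array([1.05, 1.03, 1.04])
starting_balance = 1000  # hypothetical starting amount

# Balance at the end of each period
balances = starting_balance * np.cumprod(growth_rates)

print("Balances:", np.round(balances, 2))
```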
Statistics on Specific Axis
For 2D arrays, calculate along rows or columns.
import numpy as np
scores = np.array([[85, 90, 88], [78, 82, 85], [92, 95, 90]])
print("Overall average:", scores.mean())
print("Average per student (rows):", scores.mean(axis=1))
print("Average per assignment (columns):", scores.mean(axis=0))
Output (rounded to two decimals):
Overall average: 87.22
Average per student (rows): [87.67 81.67 92.33]
Average per assignment (columns): [85. 89. 87.67]
axis parameter:
- axis=0: Down columns (per column)
- axis=1: Across rows (per row)
- No axis: Entire array
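The same axis argument works with other reductions such as max and std. The sketch below also uses keepdims=True, which keeps the reduced axis so the result lines up with the original array. The variable names are illustrative.

```python
import numpy as np

scores = np.array([[85, 90, 88], [78, 82, 85], [92, 95, 90]])

best_per_student = scores.max(axis=1)    # one value per row
std_per_assignment = scores.std(axis=0)  # one value per column

# keepdims=True gives shape (3, 1) so it broadcasts against each row
row_means = scores.mean(axis=1, keepdims=True)
centered = scores - row_means  # each score relative to that student's average

print("Best per student:", best_per_student)
print("Row means shape:", row_means.shape)
```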
Practice Example
The scenario: Analyze monthly sales data for insights.
import numpy as np
monthly_sales = np.array([15000, 18000, 22000, 19000, 25000, 21000, 23000, 26000, 24000, 28000, 27000, 30000])
print("Sales Analysis")
print("=" * 50)
print()
print("Basic Statistics:")
print("Mean:", round(monthly_sales.mean(), 2))
print("Median:", np.median(monthly_sales))  # arrays have no .median() method
print("Min:", monthly_sales.min())
print("Max:", monthly_sales.max())
print("Range:", np.ptp(monthly_sales))  # the .ptp() method was removed in NumPy 2.0
print("Std Dev:", round(monthly_sales.std(), 2))
print()
print("Quartile Analysis:")
q1 = np.percentile(monthly_sales, 25)
q2 = np.percentile(monthly_sales, 50)
q3 = np.percentile(monthly_sales, 75)
print("Q1 (25 percent):", q1)
print("Q2 (Median):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)
print()
cumulative = np.cumsum(monthly_sales)
print("Cumulative Sales:")
print("End of quarter 1:", cumulative[2])
print("End of quarter 2:", cumulative[5])
print("End of quarter 3:", cumulative[8])
print("Year Total:", cumulative[-1])
print()
above_average = monthly_sales[monthly_sales > monthly_sales.mean()]
print("Above Average Months:", len(above_average))
print("Values:", above_average)
print()
growth = np.diff(monthly_sales)
print("Month-over-Month Growth:")
print("Average growth:", round(growth.mean(), 2))
print("Max growth:", growth.max())
print("Min growth:", growth.min())
What this analyzes:
- Central tendency (mean, median)
- Spread (range, std dev)
- Distribution (quartiles, IQR)
- Cumulative totals
- Performance relative to average
- Growth trends
Weighted Average
import numpy as np
grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])
weighted_avg = np.average(grades, weights=weights)
print("Weighted average:", weighted_avg)
Use case: Some values matter more than others.
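Under the hood, a weighted average is the sum of each value times its weight, divided by the sum of the weights. A short sketch that checks this against np.average:

```python
import numpy as np

grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])

# Weighted sum divided by total weight
manual_avg = np.sum(grades * weights) / np.sum(weights)

print("Manual:", round(manual_avg, 2))
print("Matches np.average:", np.isclose(manual_avg, np.average(grades, weights=weights)))
```

Because of the division by the total weight, the weights do not have to add up to 1.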
Histogram Data
import numpy as np
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, bins = np.histogram(data, bins=5)
print("Counts:", counts)
print("Bin edges:", bins)
What this gives: Frequency distribution of data.
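To read the result, note that np.histogram returns one more bin edge than counts: bin i covers the interval from bins[i] to bins[i + 1]. A sketch that pairs each count with its interval:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, bins = np.histogram(data, bins=5)

# Pair each count with the left and right edge of its bin
for left, right, count in zip(bins[:-1], bins[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")

print("Edges:", len(bins), "Counts:", len(counts))
```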
Covariance
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
cov_matrix = np.cov(x, y)
print("Covariance:", cov_matrix[0, 1])
What covariance shows: How two variables change together.
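Covariance depends on the units of the data, while correlation is covariance rescaled by both standard deviations. The sketch below shows that link. Note that np.cov divides by n - 1 by default, so the standard deviations use ddof=1 to match.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (n - 1 in the denominator)
std_x = np.std(x, ddof=1)    # sample standard deviations to match np.cov
std_y = np.std(y, ddof=1)

corr_from_cov = cov_xy / (std_x * std_y)

print("Covariance:", cov_xy)
print("Correlation from covariance:", round(corr_from_cov, 2))
```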
Key Points to Remember
np.mean() for average, np.median() for middle value, np.std() for spread, np.var() for variance. Arrays also offer .mean(), .std(), and .var() as methods, but not .median().
Percentiles show data distribution. 50th percentile equals median.
Use axis parameter: axis=0 for columns, axis=1 for rows, none for entire array.
Correlation measures relationship strength (-1 to 1). Covariance shows how variables change together.
Cumulative functions (cumsum, cumprod) are useful for tracking running totals and compound growth.
Common Mistakes
Mistake 1: Confusing mean and median
[1, 2, 100] # Mean ≈ 34.33, Median = 2
Use median for data with outliers.
Mistake 2: Wrong axis
matrix.mean(axis=0) # Per column
matrix.mean(axis=1) # Per row
Mistake 3: Percentile vs percent
np.percentile(data, 25) # 25th percentile
data * 0.25 # Every value scaled to 25 percent of itself (different!)
Mistake 4: Standard deviation vs variance
std = np.std(data)
var = np.var(data)
# var = std ** 2
What's Next?
You now know statistical analysis with NumPy. Next, you'll learn array reshaping and manipulation - transforming arrays into different shapes and structures.