Statistical Functions

Learn to perform statistical analysis with NumPy

Basic Statistics

Mean (Average)

code.pyPython
import numpy as np

scores = np.array([85, 90, 78, 92, 88])
average = np.mean(scores)
print("Average:", average)

Output: 86.6

What mean tells you: The central value of the data, computed as the sum of all values divided by the number of values.
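
As a quick check of the "sum divided by count" description, the sketch below recomputes the mean by hand and compares it with np.mean. The name manual_mean is only illustrative.

```python
import numpy as np

scores = np.array([85, 90, 78, 92, 88])

# Sum of all values divided by how many values there are
manual_mean = scores.sum() / scores.size

print("Manual mean:", manual_mean)
print("Matches np.mean:", np.isclose(manual_mean, np.mean(scores)))
```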

Median (Middle Value)

code.pyPython
import numpy as np

salaries = np.array([45000, 50000, 55000, 60000, 150000])
median = np.median(salaries)
print("Median:", median)

Output: 55000.0

Why median matters: Extreme values barely affect it. The single 150000 salary does not pull the median upward the way it pulls the mean.
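
To see the effect of the outlier, this small sketch computes both the mean and the median of the same salaries. The names mean_salary and median_salary are illustrative.

```python
import numpy as np

salaries = np.array([45000, 50000, 55000, 60000, 150000])

mean_salary = np.mean(salaries)      # pulled upward by the 150000 outlier
median_salary = np.median(salaries)  # stays at the middle value

print("Mean:", mean_salary)
print("Median:", median_salary)
```

The mean comes out at 72000, higher than four of the five salaries, while the median remains 55000.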

Mode (Most Common)

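NumPy has no built-in mode function, but SciPy provides one, and you can also compute the mode with NumPy alone.

code.pyPython
import numpy as np
from scipy import stats

# Recent SciPy versions accept only numeric arrays in stats.mode
scores = np.array([90, 85, 90, 70, 90, 85])
result = stats.mode(scores)
print("Most common score:", result.mode)
print("Occurrences:", result.count)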

Or count manually:

code.pyPython
import numpy as np

numbers = np.array([1, 2, 2, 3, 2, 4])
values, counts = np.unique(numbers, return_counts=True)
mode_index = np.argmax(counts)
mode = values[mode_index]
print("Mode:", mode)
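
The manual counting approach is not limited to numbers. As a sketch, np.unique can count text labels in the same way. The grades array here is made-up sample data.

```python
import numpy as np

grades = np.array(["A", "B", "A", "C", "A", "B"])

# np.unique sorts the distinct labels and counts each one
labels, counts = np.unique(grades, return_counts=True)
most_common = labels[np.argmax(counts)]

print("Most common grade:", most_common)
print("Count:", counts.max())
```

If two labels tie for the highest count, np.argmax returns the first one in sorted order.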

Spread and Variability

Standard Deviation

Measures how spread out numbers are.

code.pyPython
import numpy as np

data1 = np.array([10, 10, 10, 10])
data2 = np.array([5, 10, 15, 20])

print("Data 1 std:", np.std(data1))
print("Data 2 std:", np.std(data2))

Output:

Data 1 std: 0.0
Data 2 std: 5.59

What this means:

  • 0 = no variation (all same)
  • Higher number = more spread out
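
By default, np.std returns the population standard deviation: the square root of the average squared distance from the mean. The following sketch spells that out step by step. The names deviations and manual_std are illustrative.

```python
import numpy as np

data2 = np.array([5, 10, 15, 20])

# Distance of each value from the mean
deviations = data2 - data2.mean()

# Average the squared distances, then take the square root
manual_std = np.sqrt(np.mean(deviations ** 2))

print("Manual std:", round(manual_std, 2))
print("Matches np.std:", np.isclose(manual_std, np.std(data2)))
```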

Variance

Variance is the square of the standard deviation: the average squared distance of each value from the mean.

code.pyPython
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
variance = np.var(data)
std_dev = np.std(data)

print("Variance:", variance)
print("Std Dev:", std_dev)
print("Check:", std_dev ** 2)

Relationship: variance = std_dev²
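
One detail worth knowing: np.var and np.std divide by N by default (population statistics). Passing ddof=1 divides by N - 1 instead, which is the usual choice when the data is a sample. A minimal sketch on the same data:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_var = np.var(data)        # divides by N (ddof=0, the default)
sample_var = np.var(data, ddof=1)    # divides by N - 1

print("Population variance:", population_var)
print("Sample variance:", round(sample_var, 3))
```

The relationship variance = std_dev² holds in both cases, as long as the same ddof is used for both functions.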

Minimum and Maximum

code.pyPython
import numpy as np

temps = np.array([72, 68, 75, 70, 73])

print("Min:", np.min(temps))
print("Max:", np.max(temps))
print("Range:", np.ptp(temps))

Output:

Min: 68
Max: 75
Range: 7

What ptp means: Peak to peak (max - min).

Percentiles and Quantiles

Percentiles

code.pyPython
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])

p25 = np.percentile(scores, 25)
p50 = np.percentile(scores, 50)
p75 = np.percentile(scores, 75)

print("25th percentile:", p25)
print("50th percentile:", p50)
print("75th percentile:", p75)

What percentiles mean:

  • 25th: 25 percent of data is below this
  • 50th: Same as median
  • 75th: 75 percent of data is below this
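
As a compact variant, np.percentile accepts a list of percentiles, and np.quantile expresses the same positions as fractions between 0 and 1. With the default settings, NumPy interpolates linearly between neighbouring values when a percentile falls between two data points.

```python
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])

# One call can return several percentiles at once
p25, p50, p75 = np.percentile(scores, [25, 50, 75])

# np.quantile takes the same positions as fractions between 0 and 1
q25 = np.quantile(scores, 0.25)

print(p25, p50, p75)
print("Same as quantile:", p25 == q25)
```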

Quartiles

code.pyPython
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)

print("Q1 (25 percent):", q1)
print("Q2 (50 percent):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)

What IQR is: The interquartile range (Q3 - Q1), which measures the spread of the middle 50 percent of the data.
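
One common use of the IQR is flagging possible outliers with the 1.5 × IQR rule of thumb. The sketch below is one way to apply it. The data is hypothetical, and the 1.5 multiplier is a convention rather than something NumPy defines.

```python
import numpy as np

# Hypothetical data with one unusually large value
data = np.array([10, 12, 13, 14, 15, 16, 18, 60])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr)
print("Outliers:", outliers)
```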

Correlation

Correlation measures the strength and direction of the linear relationship between two datasets.

code.pyPython
import numpy as np

study_hours = np.array([2, 3, 4, 5, 6])
test_scores = np.array([60, 70, 80, 85, 95])

correlation = np.corrcoef(study_hours, test_scores)[0, 1]
print("Correlation:", round(correlation, 2))

Correlation values:

  • 1.0 = perfect positive (both increase together)
  • 0.0 = no linear relationship
  • -1.0 = perfect negative (one increases, other decreases)
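
To show where the number comes from, this sketch rebuilds the correlation coefficient as the covariance of the two arrays divided by the product of their standard deviations, then compares it with np.corrcoef. The names cov_xy and manual_r are illustrative.

```python
import numpy as np

study_hours = np.array([2, 3, 4, 5, 6])
test_scores = np.array([60, 70, 80, 85, 95])

# Covariance scaled by both standard deviations (ddof=1 to match np.cov)
cov_xy = np.cov(study_hours, test_scores)[0, 1]
manual_r = cov_xy / (np.std(study_hours, ddof=1) * np.std(test_scores, ddof=1))

builtin_r = np.corrcoef(study_hours, test_scores)[0, 1]
print("Manual correlation:", manual_r)
print("Matches np.corrcoef:", np.isclose(manual_r, builtin_r))
```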

Cumulative Statistics

Cumulative Sum

code.pyPython
import numpy as np

daily_sales = np.array([100, 150, 200, 180, 220])
cumulative = np.cumsum(daily_sales)
print("Daily sales:", daily_sales)
print("Cumulative:", cumulative)

Output:

Daily sales: [100 150 200 180 220]
Cumulative: [100 250 450 630 850]

Use case: Track running totals over time.
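
A running total also gives a running average almost for free: divide each cumulative value by the number of days so far. A minimal sketch, with illustrative variable names:

```python
import numpy as np

daily_sales = np.array([100, 150, 200, 180, 220])

running_total = np.cumsum(daily_sales)
days_elapsed = np.arange(1, daily_sales.size + 1)

# Running average = total so far / number of days so far
running_avg = running_total / days_elapsed

print("Running average:", running_avg)
```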

Cumulative Product

code.pyPython
import numpy as np

growth_rates = np.array([1.05, 1.03, 1.04])
cumulative = np.cumprod(growth_rates)
print("Growth rates:", growth_rates)
print("Cumulative growth:", cumulative)

Use case: Compound growth calculations.
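
As an illustration of compound growth, the sketch below applies the same growth rates to a starting amount. The initial_amount of 1000 is a made-up value.

```python
import numpy as np

initial_amount = 1000.0  # hypothetical starting value
growth_rates = np.array([1.05, 1.03, 1.04])

# Value at the end of each period: start * product of all rates so far
values = initial_amount * np.cumprod(growth_rates)

print("Value per period:", np.round(values, 2))
print("Total growth factor:", round(np.prod(growth_rates), 5))
```

The last element of the cumulative product equals the product of all the rates, so the final value is the starting amount times the total growth factor.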

Statistics on Specific Axis

For 2D arrays, calculate along rows or columns.

code.pyPython
import numpy as np

scores = np.array([[85, 90, 88], [78, 82, 85], [92, 95, 90]])

print("Overall average:", scores.mean())
print("Average per student (rows):", scores.mean(axis=1))
print("Average per assignment (columns):", scores.mean(axis=0))

Output (rounded to two decimals):

Overall average: 87.22
Average per student (rows): [87.67 81.67 92.33]
Average per assignment (columns): [85. 89. 87.67]

axis parameter:

  • axis=0: Down columns (per column)
  • axis=1: Across rows (per row)
  • No axis: Entire array
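
The axis behaviour is easier to see on a non-square array, where the result shapes differ. A small sketch using a hypothetical 2 × 3 array:

```python
import numpy as np

# 2 rows, 3 columns
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

col_sums = arr.sum(axis=0)  # collapses the rows: one result per column
row_sums = arr.sum(axis=1)  # collapses the columns: one result per row

print("Per column:", col_sums, "shape", col_sums.shape)
print("Per row:", row_sums, "shape", row_sums.shape)
```

The axis you pass is the one that disappears from the result: axis=0 leaves 3 values (one per column), axis=1 leaves 2 values (one per row).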

Practice Example

The scenario: Analyze monthly sales data for insights.

code.pyPython
import numpy as np

monthly_sales = np.array([15000, 18000, 22000, 19000, 25000, 21000, 23000, 26000, 24000, 28000, 27000, 30000])

print("Sales Analysis")
print("=" * 50)
print()

print("Basic Statistics:")
print("Mean:", round(monthly_sales.mean(), 2))
print("Median:", np.median(monthly_sales))  # arrays have no .median() method
print("Min:", monthly_sales.min())
print("Max:", monthly_sales.max())
print("Range:", np.ptp(monthly_sales))  # the .ptp() method was removed in NumPy 2.0
print("Std Dev:", round(monthly_sales.std(), 2))
print()

print("Quartile Analysis:")
q1 = np.percentile(monthly_sales, 25)
q2 = np.percentile(monthly_sales, 50)
q3 = np.percentile(monthly_sales, 75)
print("Q1 (25 percent):", q1)
print("Q2 (Median):", q2)
print("Q3 (75 percent):", q3)
print("IQR:", q3 - q1)
print()

cumulative = np.cumsum(monthly_sales)
print("Cumulative Sales by Quarter:")
print("End of Q1:", cumulative[2])
print("End of Q2:", cumulative[5])
print("End of Q3:", cumulative[8])
print("Year Total:", cumulative[-1])
print()

above_average = monthly_sales[monthly_sales > monthly_sales.mean()]
print("Above Average Months:", len(above_average))
print("Values:", above_average)
print()

growth = np.diff(monthly_sales)
print("Month-over-Month Growth:")
print("Average growth:", round(growth.mean(), 2))
print("Max growth:", growth.max())
print("Min growth:", growth.min())

What this analyzes:

  1. Central tendency (mean, median)
  2. Spread (range, std dev)
  3. Distribution (quartiles, IQR)
  4. Cumulative totals
  5. Performance relative to average
  6. Growth trends
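
The growth figures above are absolute differences. As a possible extension, the sketch below expresses each month-over-month change as a percentage of the earlier month. The name pct_growth is illustrative.

```python
import numpy as np

monthly_sales = np.array([15000, 18000, 22000, 19000, 25000, 21000,
                          23000, 26000, 24000, 28000, 27000, 30000])

# Change from each month to the next, relative to the earlier month
pct_growth = np.diff(monthly_sales) / monthly_sales[:-1] * 100

print("Percent growth:", np.round(pct_growth, 1))
print("Best month-over-month gain:", round(pct_growth.max(), 1), "percent")
```

np.diff returns one fewer element than the input, which is why the divisor drops the last month with monthly_sales[:-1].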

Weighted Average

code.pyPython
import numpy as np

grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])

weighted_avg = np.average(grades, weights=weights)
print("Weighted average:", weighted_avg)

Use case: Some values matter more than others.
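
For intuition, the weighted average can be reproduced by hand: multiply each value by its weight, add the results, and divide by the total weight. A minimal sketch, with manual_weighted as an illustrative name:

```python
import numpy as np

grades = np.array([85, 90, 88])
weights = np.array([0.3, 0.3, 0.4])

# Multiply each grade by its weight, then divide by the total weight
manual_weighted = np.sum(grades * weights) / np.sum(weights)

print("Manual weighted average:", round(manual_weighted, 2))
print("Plain mean for comparison:", round(np.mean(grades), 2))
```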

Histogram Data

code.pyPython
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, bins = np.histogram(data, bins=5)

print("Counts:", counts)
print("Bin edges:", bins)

What this gives: Frequency distribution of data.
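
To read the result, pair each count with the two bin edges around it. The sketch below prints one line per bin for the same data.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, bins = np.histogram(data, bins=5)

# There is always one more bin edge than there are bins
for left, right, count in zip(bins[:-1], bins[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.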

Covariance

code.pyPython
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov_matrix = np.cov(x, y)
print("Covariance:", cov_matrix[0, 1])

What covariance shows: How two variables change together.
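
As a sketch of what np.cov computes by default, the covariance can be written as the sum of the products of each pair's deviations from their means, divided by N - 1 (the sample version). The name manual_cov is illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Sample covariance: sum of products of deviations, divided by N - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)

print("Manual covariance:", manual_cov)
print("np.cov result:", np.cov(x, y)[0, 1])
```

A positive value means the two variables tend to rise together. Unlike correlation, the size of the number depends on the units of the data.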

Key Points to Remember

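Use np.mean() for the average, np.median() for the middle value, np.std() for spread, and np.var() for variance. Arrays also have .mean(), .std(), and .var() methods, but no .median() method.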

Percentiles show data distribution. 50th percentile equals median.

Use axis parameter: axis=0 for columns, axis=1 for rows, none for entire array.

Correlation measures relationship strength (-1 to 1). Covariance shows how variables change together.

Cumulative functions (cumsum, cumprod) are useful for tracking running totals and compound growth.

Common Mistakes

Mistake 1: Confusing mean and median

code.pyPython
[1, 2, 100]  # Mean ≈ 34.33, Median = 2

Use median for data with outliers.

Mistake 2: Wrong axis

code.pyPython
matrix.mean(axis=0)  # Per column
matrix.mean(axis=1)  # Per row

Mistake 3: Percentile vs percent

code.pyPython
np.percentile(data, 25)  # 25th percentile
data * 0.25  # scales every value to 25 percent of itself (different!)

Mistake 4: Standard deviation vs variance

code.pyPython
std = np.std(data)
var = np.var(data)
# var = std ** 2

What's Next?

You now know statistical analysis with NumPy. Next, you'll learn array reshaping and manipulation - transforming arrays into different shapes and structures.