
NumPy for Analysts — Fast Numerical Computing in Python

Pandas is built on NumPy. Understanding NumPy arrays makes you a faster, smarter analyst — and unlocks the full power of Python's numerical ecosystem.

📚Beginner
⏱️10 min
10 quizzes
🔢

What is NumPy and Why Analysts Need It

NumPy (Numerical Python) is the foundation of Python's scientific computing stack. It provides fast, memory-efficient arrays and mathematical functions — the engine behind Pandas, scikit-learn, and most data science libraries.

Why NumPy Matters for Analysts

Speed: NumPy operations are often 10-100x faster than equivalent Python loops over lists because the work runs in optimized, compiled C code. Processing millions of numbers? NumPy typically does it in milliseconds.
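
A rough way to check the speed claim on your own machine is to time both approaches with the standard-library `timeit` module. This is a minimal sketch: the array size and repeat count are arbitrary choices, and the ratio you see will depend on your hardware and NumPy version.

```python
import timeit

import numpy as np

n = 1_000_000  # arbitrary size chosen for illustration
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)

# Pure-Python loop, written as a list comprehension
list_time = timeit.timeit(lambda: [v * 2 for v in values_list], number=5)

# Vectorized NumPy multiplication
numpy_time = timeit.timeit(lambda: values_arr * 2, number=5)

print(f"List comprehension: {list_time:.3f} s")
print(f"NumPy vectorized:   {numpy_time:.3f} s")

# Both approaches produce the same numbers
same_result = np.array_equal(np.array([v * 2 for v in values_list]), values_arr * 2)
print(same_result)  # True
```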

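Memory Efficiency: NumPy arrays use less memory than Python lists. A list of 1 million integers can take several times more memory than the equivalent NumPy array, because the list stores a pointer to a separate Python object for every value, while the array stores raw fixed-size values in one contiguous block.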
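
The memory difference can be estimated directly. The sketch below assumes 64-bit integers (`dtype=np.int64`) and uses `sys.getsizeof`, which only approximates what a list really costs; the ratio changes with the dtype and with the values stored.

```python
import sys

import numpy as np

n = 1_000_000
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)  # dtype fixed explicitly: 8 bytes per element

# A list stores pointers; each int is a separate Python object with its own overhead
list_bytes = sys.getsizeof(values_list) + sum(sys.getsizeof(v) for v in values_list)

# An array stores the raw values in one contiguous buffer
array_bytes = values_arr.nbytes

print(f"List:  ~{list_bytes / 1e6:.1f} MB")
print(f"Array: {array_bytes / 1e6:.1f} MB")  # 8.0 MB
print(f"Ratio: ~{list_bytes / array_bytes:.1f}x")
```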

Vectorization: Apply operations to entire arrays without writing loops. Instead of iterating through 1 million values in Python, you write one expression and NumPy runs the loop for you in compiled code.

code.pyPython
import numpy as np

# Python list approach (slow)
amounts = [2500, 3200, 1800, 4100, 2900]
with_gst = []
for amount in amounts:
    with_gst.append(amount * 1.18)

# NumPy array approach (fast, clean)
amounts = np.array([2500, 3200, 1800, 4100, 2900])
with_gst = amounts * 1.18  # Vectorized operation — all at once
print(with_gst)  # [2950. 3776. 2124. 4838. 3422.]

When to Use NumPy vs Pandas

Use NumPy when:

  • You need pure numerical operations (math, statistics, linear algebra)
  • You're working with multi-dimensional arrays (matrices, images, tensors)
  • Performance is critical and you don't need labeled rows/columns

Use Pandas when:

  • You're working with tabular data (rows and columns with labels)
  • You need to merge, group, or pivot data
  • You want to handle missing data elegantly

In Practice: Most analysts use both — NumPy powers Pandas under the hood, and Pandas makes NumPy easier to use for tabular data.

Think of it this way...

If Pandas is Excel with programming, NumPy is a high-performance calculator. Pandas gives you tables and labels; NumPy gives you raw speed and mathematical power.

📦

NumPy Arrays — The Core Data Structure

A NumPy array is a grid of values, all of the same type. Unlike Python lists, arrays are fixed-size and homogeneous (all elements must be the same data type).
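
To see what "homogeneous" means in practice, here is a small sketch of how NumPy forces a single data type onto every element. The values are made up for illustration.

```python
import numpy as np

# Mixing ints and floats: everything is upcast to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64
print(mixed)        # [1.  2.  3.5]

# Assigning a float into an integer array truncates the fractional part
ints = np.array([1, 2, 3])
ints[0] = 9.7
print(ints)  # [9 2 3]
```

The second case is worth remembering: the array keeps its integer type, so the decimal part is silently dropped.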

Creating Arrays

code.pyPython
import numpy as np

# From a Python list
arr = np.array([1, 2, 3, 4, 5])
print(arr)  # [1 2 3 4 5]

# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

# Array of zeros
zeros = np.zeros(5)  # [0. 0. 0. 0. 0.]
zeros_matrix = np.zeros((3, 4))  # 3 rows, 4 columns

# Array of ones
ones = np.ones(5)  # [1. 1. 1. 1. 1.]

# Array with a range of values
range_arr = np.arange(0, 10, 2)  # [0 2 4 6 8] (start, stop, step)

# Array with evenly spaced values
linspace = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ] (start, stop, count)

# Random arrays
random_arr = np.random.rand(5)  # 5 random values between 0 and 1
random_int = np.random.randint(1, 100, size=10)  # 10 random integers between 1 and 99
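
The random arrays above change on every run. When an analysis needs to be reproducible, one option is NumPy's `Generator` API with a fixed seed. This is a sketch; the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed value is an arbitrary choice

random_arr = rng.random(5)                  # 5 floats in [0, 1)
random_int = rng.integers(1, 100, size=10)  # 10 integers from 1 to 99 (upper bound excluded)

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(random_arr, rng_again.random(5)))  # True
```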

Array Attributes

code.pyPython
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape)   # (2, 3) — 2 rows, 3 columns
print(arr.ndim)    # 2 — number of dimensions
print(arr.size)    # 6 — total number of elements
print(arr.dtype)   # int64 — data type of elements (int32 on some platforms, e.g. Windows with NumPy 1.x)

Array Indexing and Slicing

code.pyPython
arr = np.array([10, 20, 30, 40, 50])

# Indexing (like Python lists)
print(arr[0])   # 10 (first element)
print(arr[-1])  # 50 (last element)

# Slicing
print(arr[1:4])  # [20 30 40] (index 1 to 3)
print(arr[:3])   # [10 20 30] (first 3)
print(arr[2:])   # [30 40 50] (from index 2 onward)

# 2D array indexing
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[0, 0])    # 1 (row 0, column 0)
print(matrix[1, 2])    # 6 (row 1, column 2)
print(matrix[:, 1])    # [2 5 8] (all rows, column 1)
print(matrix[1, :])    # [4 5 6] (row 1, all columns)
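
One difference from Python lists is easy to miss: a basic slice of an array is a view onto the same memory, not a copy. The sketch below shows the effect and how `.copy()` avoids it.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

view = arr[1:4]   # basic slice: shares memory with arr
view[0] = 99
print(arr)        # [10 99 30 40 50]

independent = arr[1:4].copy()  # explicit copy: changes stay local
independent[0] = -1
print(arr)        # [10 99 30 40 50]
```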

Boolean Indexing (Filtering)

code.pyPython
amounts = np.array([2500, 3200, 1800, 4100, 2900])

# Filter: amounts greater than 3000
high_value = amounts[amounts > 3000]
print(high_value)  # [3200 4100]

# Multiple conditions
medium = amounts[(amounts > 2000) & (amounts < 4000)]
print(medium)  # [2500 3200 2900]
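
It can help to look at the boolean mask on its own before using it as a filter. The sketch below reuses the same amounts and also shows OR conditions.

```python
import numpy as np

amounts = np.array([2500, 3200, 1800, 4100, 2900])

mask = amounts > 3000
print(mask)         # [False  True False  True False]
print(mask.sum())   # 2 (True counts as 1)
print(mask.mean())  # 0.4 (share of orders above 3000)

# Use | for OR and ~ for NOT; each condition needs its own parentheses
extremes = amounts[(amounts < 2000) | (amounts > 4000)]
print(extremes)     # [1800 4100]
```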


Array Operations and Vectorization

NumPy's superpower is vectorization — applying operations to entire arrays without explicit loops.

Arithmetic Operations

code.pyPython
amounts = np.array([2500, 3200, 1800, 4100, 2900])

# Scalar operations (applied to every element)
with_gst = amounts * 1.18
print(with_gst)  # [2950. 3776. 2124. 4838. 3422.]

discounted = amounts - 500
print(discounted)  # [2000 2700 1300 3600 2400]

# Element-wise array operations
revenue_day1 = np.array([45000, 38000, 52000])
revenue_day2 = np.array([48000, 39000, 55000])
total_revenue = revenue_day1 + revenue_day2
print(total_revenue)  # [ 93000  77000 107000]

growth = (revenue_day2 - revenue_day1) / revenue_day1 * 100
print(growth)  # [ 6.66666667  2.63157895  5.76923077]

Aggregation Functions

code.pyPython
amounts = np.array([2500, 3200, 1800, 4100, 2900])

print(amounts.sum())      # 14500 (total)
print(amounts.mean())     # 2900.0 (average)
# Note: arrays have no .median() method; use the np.median function instead
print(np.median(amounts)) # 2900.0 (middle value)
print(amounts.std())      # 761.58 (population standard deviation)
print(amounts.min())      # 1800 (minimum)
print(amounts.max())      # 4100 (maximum)
print(amounts.argmin())   # 2 (index of minimum)
print(amounts.argmax())   # 3 (index of maximum)

# Percentiles
print(np.percentile(amounts, 25))  # 2500.0 (25th percentile)
print(np.percentile(amounts, 75))  # 3200.0 (75th percentile)

Axis-Wise Operations on 2D Arrays

code.pyPython
# City revenue by day (rows=cities, columns=days)
revenue = np.array([
    [45000, 48000, 52000],  # Mumbai
    [38000, 39000, 41000],  # Delhi
    [35000, 37000, 36000]   # Bangalore
])

# Total revenue per city (sum across columns)
city_totals = revenue.sum(axis=1)
print(city_totals)  # [145000 118000 108000]

# Total revenue per day (sum across rows)
day_totals = revenue.sum(axis=0)
print(day_totals)  # [118000 124000 129000]

# Average revenue per city
city_avg = revenue.mean(axis=1)
print(city_avg)  # [48333.33 39333.33 36000.]

Axis Reminder:

  • axis=0: operate down rows (column-wise aggregation)
  • axis=1: operate across columns (row-wise aggregation)
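
One way to remember the axis rule is that the axis you name is the one that disappears from the result. The sketch below uses a small non-square array (made-up values) so the two cases are easy to tell apart.

```python
import numpy as np

# 2 rows, 3 columns
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = data.sum(axis=0)  # collapses the 2 rows: one value per column
row_sums = data.sum(axis=1)  # collapses the 3 columns: one value per row

print(col_sums, col_sums.shape)  # [5 7 9] (3,)
print(row_sums, row_sums.shape)  # [ 6 15] (2,)
print(data.sum())                # 21 (no axis: sum of everything)
```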

Universal Functions (ufuncs)

NumPy provides fast mathematical functions that work element-wise:

code.pyPython
amounts = np.array([100, 1000, 10000, 100000])

# Logarithm (useful for skewed data)
log_amounts = np.log10(amounts)
print(log_amounts)  # [2. 3. 4. 5.]

# Square root
sqrt_amounts = np.sqrt(amounts)
print(sqrt_amounts)  # [ 10.  31.62  100.  316.23]

# Exponential
exp_vals = np.exp([1, 2, 3])
print(exp_vals)  # [ 2.72  7.39 20.09]

# Rounding
values = np.array([2.3, 4.7, 5.5, 6.2])
print(np.round(values))    # [2. 5. 6. 6.]
print(np.floor(values))    # [2. 4. 5. 6.]
print(np.ceil(values))     # [3. 5. 6. 7.]
📊

Statistical Functions for Analysts

NumPy includes functions for common statistical calculations — essential for exploratory analysis.

Descriptive Statistics

code.pyPython
# Zomato order amounts
amounts = np.array([450, 680, 520, 890, 340, 720, 550, 480, 650, 920])

# Central tendency
mean = np.mean(amounts)      # 620.0 (average)
median = np.median(amounts)  # 600.0 (middle value)

# Spread
std = np.std(amounts)        # 178.66 (population standard deviation)
var = np.var(amounts)        # 31920.0 (population variance)
range_val = np.ptp(amounts)  # 580 (peak-to-peak: max - min)

# Percentiles/Quantiles
q25 = np.percentile(amounts, 25)   # 490.0 (25th percentile)
q75 = np.percentile(amounts, 75)   # 710.0 (75th percentile)
IQR = q75 - q25                     # 220.0 (interquartile range)

print(f"Mean: ₹{mean:.2f}")
print(f"Median: ₹{median:.2f}")
print(f"Std Dev: ₹{std:.2f}")
print(f"IQR: ₹{IQR:.2f}")
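
A detail that often trips up analysts: `np.std` and `np.var` divide by n (the population formula) by default, while Pandas' `Series.std` divides by n - 1 (the sample formula). The sketch below shows how the `ddof` argument switches between the two on the same order amounts.

```python
import numpy as np

amounts = np.array([450, 680, 520, 890, 340, 720, 550, 480, 650, 920])

population_std = np.std(amounts)      # divides by n (NumPy default, ddof=0)
sample_std = np.std(amounts, ddof=1)  # divides by n - 1

print(f"Population std: {population_std:.2f}")  # 178.66
print(f"Sample std:     {sample_std:.2f}")      # 188.33
```

If your NumPy result differs slightly from a Pandas or spreadsheet result, this setting is a likely cause.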

Correlation and Covariance

code.pyPython
# Swiggy: delivery time vs customer rating
delivery_time = np.array([25, 30, 35, 40, 45, 50, 55, 60])
rating = np.array([4.8, 4.7, 4.5, 4.3, 4.0, 3.8, 3.5, 3.2])

# Correlation coefficient (-1 to 1)
correlation = np.corrcoef(delivery_time, rating)[0, 1]
print(f"Correlation: {correlation:.3f}")  # -0.993 (strong negative correlation)

# Covariance
covariance = np.cov(delivery_time, rating)[0, 1]
print(f"Covariance: {covariance:.2f}")
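
The two quantities are linked: the correlation coefficient is the covariance divided by the product of the two standard deviations. As a check, this sketch rebuilds the correlation by hand from the same data and compares it with `np.corrcoef`.

```python
import numpy as np

delivery_time = np.array([25, 30, 35, 40, 45, 50, 55, 60])
rating = np.array([4.8, 4.7, 4.5, 4.3, 4.0, 3.8, 3.5, 3.2])

# Pearson r = cov(x, y) / (std(x) * std(y)), using the same ddof throughout
cov_xy = np.cov(delivery_time, rating)[0, 1]  # np.cov uses ddof=1 by default
r_manual = cov_xy / (np.std(delivery_time, ddof=1) * np.std(rating, ddof=1))

r_numpy = np.corrcoef(delivery_time, rating)[0, 1]
print(np.isclose(r_manual, r_numpy))  # True
```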

Handling NaN Values

code.pyPython
# Data with missing values
amounts = np.array([2500, np.nan, 1800, 4100, np.nan, 2900])

# Regular mean fails
print(np.mean(amounts))  # nan

# NaN-safe functions
print(np.nanmean(amounts))  # 2825.0 (ignores NaN)
print(np.nanmedian(amounts))  # 2700.0
print(np.nansum(amounts))  # 11300.0
print(np.nanstd(amounts))  # 834.79
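
Before choosing a NaN-safe function, it is often useful to know how many values are missing. A minimal sketch using `np.isnan` on the same data:

```python
import numpy as np

amounts = np.array([2500, np.nan, 1800, 4100, np.nan, 2900])

missing = np.isnan(amounts)  # boolean mask, True where the value is NaN
print(missing.sum())         # 2 missing values
print(amounts[~missing])     # [2500. 1800. 4100. 2900.]

# NaN is never equal to anything, including itself,
# so an equality check cannot find missing values
print((amounts == np.nan).sum())  # 0
```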

Random Sampling (for A/B Testing)

code.pyPython
# Randomly assign users to test groups
user_ids = np.arange(1, 10001)  # 10,000 users
np.random.shuffle(user_ids)

control_group = user_ids[:5000]   # First 5000
test_group = user_ids[5000:]      # Last 5000

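# Order amounts to sample from (example values)
amounts = np.array([2500, 3200, 1800, 4100, 2900])

# Random sample with replacement (values can repeat)
sample = np.random.choice(amounts, size=100, replace=True)

# Random sample without replacement (size cannot exceed len(amounts))
sample_unique = np.random.choice(amounts, size=5, replace=False)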
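
For a real experiment you usually want the group assignment to be repeatable and easy to verify. The sketch below is one way to do that with a seeded generator; the seed value is an arbitrary assumption, and `rng.permutation` is used in place of an in-place shuffle so the original IDs stay untouched.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)  # arbitrary seed, fixed so the split can be reproduced

user_ids = np.arange(1, 10001)
shuffled = rng.permutation(user_ids)    # returns a shuffled copy; user_ids is unchanged

control_group = shuffled[:5000]
test_group = shuffled[5000:]

# Sanity checks: no user is in both groups, and nobody was dropped
overlap = np.intersect1d(control_group, test_group)
print(overlap.size)                          # 0
print(control_group.size + test_group.size)  # 10000
```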
Info

For Analysts: Use NumPy for pure numerical calculations (mean, std, correlation). Use Pandas when you need to group by categories, handle missing data with business logic, or work with labeled data.
