
Statistics for Data Analysts — Essential Concepts You Need

Statistics is the foundation of data analysis. You don't need a math PhD — just these essential concepts to make data-driven decisions, spot patterns, and communicate insights with confidence.


Why Data Analysts Need Statistics

Statistics is the science of collecting, analyzing, and interpreting data. For a data analyst, statistics is a superpower — it helps you answer questions, test hypotheses, and make predictions from messy real-world data.

What Statistics Does for Analysts

1. Summarize Data (Descriptive Statistics)

  • Take 10,000 customer orders → Summarize as "Average order value: ₹1,250"
  • Reduce complexity: Turn millions of data points into a few key metrics
  • Communicate clearly: "Sales increased 15%" is better than showing raw numbers

2. Make Inferences (Inferential Statistics)

  • Survey 1,000 customers → Predict what all 10 million customers think
  • A/B test 5,000 users → Decide which website version to show all users
  • Estimate the unknown: You can't measure everyone, so you sample and infer

3. Quantify Uncertainty

  • "We're 95% confident sales will be between ₹50L and ₹60L this month"
  • Statistics tells you HOW SURE you can be, not just WHAT you found
  • Confidence intervals, p-values, significance — all about quantifying uncertainty

4. Test (Support or Reject) Hypotheses

  • "Does adding free shipping increase conversions?"
  • Run A/B test → Use statistics to say "Yes, with 99% confidence" or "No significant difference"
  • Avoid false conclusions: Prevents you from seeing patterns in random noise

Statistics in Daily Analyst Work

Example 1: Swiggy Delivery Times

  • Question: Are delivery times longer on weekends?
  • Data: 100,000 orders (delivery time in minutes)
  • Statistics Used:
    • Mean delivery time (weekday vs weekend)
    • T-test: Is the difference statistically significant? (Not just random variation)
    • Result: "Weekend deliveries are 7 minutes slower (p < 0.001) — significant"

Example 2: Flipkart Pricing Experiment

  • Question: Does showing "Limited Stock" badge increase purchases?
  • Data: 50,000 users (25K control, 25K treatment)
  • Statistics Used:
    • Conversion rate (control: 2.3%, treatment: 2.8%)
    • Z-test: Is the 0.5-percentage-point difference significant or just luck?
    • Result: "21% increase, statistically significant — roll out to all users"

Example 3: Zomato Restaurant Rating

  • Question: Is 4.2★ rating reliably better than 4.0★?
  • Data: Restaurant A (4.2★, 50 reviews), Restaurant B (4.0★, 5,000 reviews)
  • Statistics Used:
    • Standard error (how uncertain is each rating?)
    • Restaurant B's 4.0★ is MORE RELIABLE (more samples = less uncertainty)
    • Result: "Restaurant B safer bet — larger sample size"
Think of it this way...

Statistics is like weather forecasting. Meteorologists can't measure temperature at every square meter, so they sample data, use statistics to model patterns, and give you a prediction with confidence levels ("80% chance of rain"). Data analysts do the same: sample data, find patterns, make predictions with known uncertainty.


Two Types of Statistics: Descriptive vs Inferential

Statistics divides into two branches — one describes what you have, the other predicts what you don't.

Descriptive Statistics (What IS)

Purpose: Summarize and describe data you already have.

Common Techniques:

  • Measures of central tendency: Mean, median, mode (typical value)
  • Measures of spread: Standard deviation, variance, range (how varied data is)
  • Frequency distributions: Histograms, bar charts (how data is distributed)
  • Summary tables: Count, sum, min, max, percentiles

Example — E-commerce Sales Analysis:

Dataset: 10,000 orders from January 2025

Descriptive Statistics:

  • Total sales: ₹1.25 crore
  • Average order value: ₹1,250
  • Median order value: ₹980 (half of orders are above/below this)
  • Standard deviation: ₹540 (typical variation from average)
  • Minimum order: ₹150 (someone bought a phone case)
  • Maximum order: ₹85,000 (someone bought a laptop + accessories)
  • 95th percentile: ₹3,200 (95% of orders are below this amount)

When to Use:

  • Exploring new datasets (EDA — exploratory data analysis)
  • Creating dashboards (KPIs are descriptive stats)
  • Communicating to stakeholders ("Here's what happened last month")

Inferential Statistics (What COULD BE)

Purpose: Make predictions or conclusions about a larger population based on a sample.

Common Techniques:

  • Hypothesis testing: T-tests, chi-square tests (is difference real or random?)
  • Confidence intervals: "True average is between ₹1,200 and ₹1,300 (95% confidence)"
  • Regression analysis: Predict sales based on ad spend
  • A/B testing: Which website version performs better?

Example — Customer Survey:

Population: 10 million Flipkart customers

Sample: Survey 2,000 customers about a new feature

Results:

  • 68% like the new feature (in sample)
  • 95% Confidence Interval: [66%, 70%]
  • Margin of error: ±2% (with 95% confidence)
  • Inference: "Between 66% and 70% of ALL 10M customers likely approve"

Action: If the target was 60% approval, you've exceeded it — launch the feature.

When to Use:

  • A/B tests (sample of users → decision for all users)
  • Market research (survey 1,000 people → predict city-wide behavior)
  • Quality control (test 100 products → estimate defect rate in 1M production run)
  • Forecasting (historical data → predict future trends)

Key Difference: Sample vs Population

| Aspect | Descriptive | Inferential |
|--------|-------------|-------------|
| Data Scope | Entire dataset | Sample of larger population |
| Goal | Summarize what you have | Predict what you don't have |
| Uncertainty | No uncertainty (exact values) | Quantifies uncertainty (confidence intervals) |
| Questions | "What IS the average?" | "What WILL BE the average?" |
| Example | Last month's average revenue | Next month's predicted revenue |

Info

In practice, you use BOTH. Start with descriptive statistics (explore data), then use inferential statistics (make decisions). Descriptive tells you "what happened," inferential tells you "what to do next."



Essential Statistical Concepts for Analysts

Here are 7 concepts you'll use daily as a data analyst.

1. Mean, Median, Mode (Central Tendency)

Where you see this: Every dashboard, every summary table.

  • Mean (average): Sum all values ÷ count
    • Order values: ₹100, ₹150, ₹200, ₹500, ₹10,000 → Mean = ₹2,190 (skewed by outlier)
  • Median (middle value): Sort data, pick middle value
    • Same data → Median = ₹200 (more representative when outliers exist)
  • Mode (most common): Value that appears most frequently
    • Shoe sizes: 7, 7, 8, 8, 8, 9, 10 → Mode = 8 (most sold size)

When to use which: Covered in detail in next topic (Mean, Median, Mode).
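All three measures are one-liners with Python's built-in `statistics` module. A minimal sketch using the numbers from the bullets above:

```python
import statistics

# Order values in ₹, including one large outlier
orders = [100, 150, 200, 500, 10000]

print(statistics.mean(orders))    # 2190 — pulled up by the ₹10,000 outlier
print(statistics.median(orders))  # 200 — robust to the outlier

# Most frequently sold shoe size
shoe_sizes = [7, 7, 8, 8, 8, 9, 10]
print(statistics.mode(shoe_sizes))  # 8
```

Notice how far apart the mean (2190) and median (200) land — a quick diagnostic that the data is skewed.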


2. Standard Deviation & Variance (Spread)

Where you see this: Risk analysis, quality control, anomaly detection.

  • Variance: Average squared difference from mean (how spread out data is)
  • Standard Deviation (SD): Square root of variance (same units as original data)

Example — Delivery Time Consistency:

  • Restaurant A: Average 30 min, SD 5 min (consistent: 25-35 min range)
  • Restaurant B: Average 30 min, SD 15 min (inconsistent: 15-45 min range)
  • Insight: Restaurant A is more reliable despite same average
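The same comparison in code — a small sketch with hypothetical delivery samples chosen so both restaurants share the mean but not the spread:

```python
import statistics

# Hypothetical delivery times in minutes
restaurant_a = [25, 28, 30, 32, 35]   # tight spread
restaurant_b = [15, 22, 30, 38, 45]   # wide spread

print(statistics.mean(restaurant_a), statistics.mean(restaurant_b))  # 30 30
print(round(statistics.stdev(restaurant_a), 1))  # 3.8  — consistent
print(round(statistics.stdev(restaurant_b), 1))  # 12.0 — inconsistent
```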

3. Normal Distribution (Bell Curve)

Where you see this: Everywhere in nature and business.

  • Shape: Symmetric bell curve, most data near mean, fewer at extremes
  • 68-95-99.7 Rule:
    • 68% of data within 1 SD of mean
    • 95% within 2 SD
    • 99.7% within 3 SD
  • Example: Heights, test scores, measurement errors, website load times

Why it matters: Many statistical tests ASSUME normal distribution (T-tests, regression). Always check this assumption.
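You can verify the 68-95-99.7 rule empirically with a quick simulation (the delivery-time parameters here are simulated, not real data):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=32.5, scale=8.2, size=100_000)  # simulated delivery times

mean, sd = data.mean(), data.std()
# Share of observations within k standard deviations of the mean
shares = {k: np.mean(np.abs(data - mean) < k * sd) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"Within {k} SD: {share:.1%}")
# Approximately 68%, 95%, and 99.7%
```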


4. Correlation vs Causation

Where you see this: Every analysis involving relationships between variables.

  • Correlation: Two variables move together (might be coincidence)
    • Example: Ice cream sales and drowning deaths both increase in summer (correlated, not causal)
  • Causation: One variable DIRECTLY CAUSES change in another
    • Example: Adding free shipping CAUSES higher conversion rates (proven via A/B test)

Rule: Correlation ≠ Causation. Need experiments (A/B tests) to prove causation.
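A simulation makes the trap concrete. Below, both series are driven by a third variable (temperature), so they correlate even though neither causes the other — all numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(20, 40, size=365)  # the hidden common cause

# Both quantities rise with temperature, not with each other
ice_cream_sales = 50 + 10 * temperature + rng.normal(0, 30, size=365)
drownings = 1 + 0.2 * temperature + rng.normal(0, 1, size=365)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"Correlation: {r:.2f}")  # clearly positive — yet banning ice cream won't prevent drownings
```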


5. P-Value (Statistical Significance)

Where you see this: A/B tests, hypothesis tests, regression outputs.

  • Definition: The probability of seeing a difference at least this large if there were no real effect (i.e., if chance alone were at work)
  • Interpretation:
    • p < 0.05 (5%): Result is statistically significant (likely real effect)
    • p ≥ 0.05: Not significant (could be random noise)
  • Example: A/B test shows 2.5% vs 2.8% conversion. P-value = 0.03 → Significant difference (not luck).
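The z-test behind such an A/B readout can be done by hand. A sketch using the (hypothetical) Flipkart counts from earlier — 2.3% vs 2.8% conversion on 25K users each:

```python
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 575, 25_000   # control:   575/25000 = 2.3%
conv_b, n_b = 700, 25_000   # treatment: 700/25000 = 2.8%

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided test
print(f"z = {z:.2f}, p = {p_value:.4f}")              # p < 0.05 → significant
```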

6. Confidence Intervals

Where you see this: Survey results, forecasts, any estimate from sample data.

  • Definition: Range where true population value likely falls
  • Example: "Average order value: ₹1,250 (95% CI: ₹1,200 - ₹1,300)"
    • Interpretation: If you repeated this study 100 times, about 95 of the resulting intervals would contain the true average

Why 95%?: Industry standard (balances confidence and precision). Some use 90% or 99% depending on risk tolerance.
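Computing such an interval from sample data takes a few lines with SciPy — a sketch on simulated order values (the gamma distribution below is just a stand-in for skewed real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
order_values = rng.gamma(shape=5, scale=250, size=2_000)  # simulated, mean ≈ ₹1,250

mean = order_values.mean()
se = stats.sem(order_values)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(order_values) - 1, loc=mean, scale=se)
print(f"Mean: ₹{mean:.0f}, 95% CI: [₹{low:.0f}, ₹{high:.0f}]")
```

More data shrinks `se`, and the interval tightens — exactly the sample-size effect in the Zomato ratings example.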


7. Sample Size & Power

Where you see this: Planning A/B tests, surveys, experiments.

  • Sample size: How many observations you need for reliable results
  • Statistical power: Probability of detecting a real effect (if it exists)
  • Trade-off: Larger sample = more confident, but more expensive/time-consuming

Example — A/B Test Planning:

  • Want to detect a lift from 2% to 2.5% conversion (a 25% relative lift)
  • Need ~14,000 users per variant (~28K total) for 80% power
  • If you only have 1,000 users → Underpowered (can't detect small changes reliably)
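The required sample size can be estimated with the standard normal-approximation formula for comparing two proportions. A sketch (5% significance, 80% power; the baselines and lifts are illustrative):

```python
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Detect a lift from 2.0% to 2.5% conversion
print(round(sample_size_per_variant(0.02, 0.025)))  # ≈ 13,800 per variant

# A tiny 2.0% → 2.1% lift is far more expensive to detect: ≈ 315,000 per variant
print(round(sample_size_per_variant(0.02, 0.021)))
```

Halving the detectable lift roughly quadruples the required sample — which is why tiny effects need huge tests.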

Statistics in Analyst Workflow

Here's how you actually USE statistics in a typical project.

Step-by-Step: Analyzing Swiggy Delivery Time Trends

Business Question: Are delivery times increasing over time? Should we investigate operations?


Step 1: Descriptive Statistics (Explore)

```python
# Load data: 100,000 deliveries from Jan-Mar 2025
import pandas as pd

df = pd.read_csv('deliveries.csv')

# Summary statistics
df['delivery_time_min'].describe()
# Output:
# count    100000.0
# mean     32.5
# std      8.2
# min      10.0
# 25%      27.0
# 50%      31.0   ← Median
# 75%      37.0
# max      95.0
```

Insights:

  • Average delivery: 32.5 minutes
  • Median: 31 minutes (close to mean → not heavily skewed)
  • Standard deviation: 8.2 minutes (typical variation)
  • Some outliers: Max 95 minutes (investigate these)

Step 2: Visualize Distribution

```python
import matplotlib.pyplot as plt

df['delivery_time_min'].hist(bins=50)
plt.xlabel('Delivery Time (minutes)')
plt.ylabel('Frequency')
plt.title('Distribution of Delivery Times')
plt.show()
```

Observation: Roughly normal distribution (bell curve) with slight right skew (long tail of slow deliveries).


Step 3: Compare Time Periods (Inferential Statistics)

```python
# Split data: Jan vs Mar
jan = df[df['month'] == 1]['delivery_time_min']
mar = df[df['month'] == 3]['delivery_time_min']

# Descriptive comparison
print(f"Jan mean: {jan.mean():.1f} min")  # 31.2 min
print(f"Mar mean: {mar.mean():.1f} min")  # 33.8 min
# Difference: 2.6 minutes increase
```

Question: Is 2.6 min difference statistically significant or random variation?

Answer: Run T-test (compares means of two groups)

```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(jan, mar)
print(f"P-value: {p_value:.4f}")  # 0.0001
```

Interpretation:

  • P-value = 0.0001 (< 0.05) → Statistically significant
  • Delivery times ARE increasing (not random)
  • Action: Alert operations team, investigate causes (more orders, traffic, driver shortage?)

Step 4: Confidence Interval (Quantify Increase)

```python
import numpy as np
from scipy import stats

mean_diff = mar.mean() - jan.mean()  # 2.6 min
# Standard error of the DIFFERENCE between two independent means
se = np.sqrt(stats.sem(jan)**2 + stats.sem(mar)**2)
ci = stats.t.interval(0.95, df=len(jan) + len(mar) - 2, loc=mean_diff, scale=se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}] minutes")
# Output: [2.2, 3.0] minutes
```

Interpretation: True increase is between 2.2 and 3.0 minutes (with 95% confidence).


Step 5: Communicate to Stakeholders

Bad: "March deliveries are slower."

Good: "Delivery times increased 2.6 minutes (8% slower) from January to March. This increase is statistically significant (p < 0.001) and consistent across all cities. We estimate the true increase is 2.2-3.0 minutes (95% confidence). Recommend investigating operational capacity."

Key: Use statistics to be PRECISE and CONFIDENT, not vague.

Info

Statistics transforms "I think deliveries are slower" into "Deliveries are 2.6 minutes slower (95% CI: 2.2-3.0 min, p < 0.001) — significant and actionable." This is the power of quantifying uncertainty.


Common Statistical Mistakes to Avoid

Even experienced analysts make these errors. Knowing them keeps you from misleading stakeholders.

Mistake 1: Confusing Correlation with Causation

  • Example: "Cities with more data analysts have higher GDP. Therefore, hiring data analysts increases GDP."
  • Problem: Correlation ≠ causation. Maybe rich cities can afford more analysts (reverse causality), or a third factor (tech industry) causes both.
  • Fix: Use experiments (A/B tests) or causal inference methods (regression with controls).


Mistake 2: Using Mean When You Should Use Median

  • Example: "Average salary at our startup: ₹14.5 lakhs." Sounds great!
  • Reality: The CEO earns ₹1 crore and 9 employees earn ₹5 lakhs each. Mean = ₹14.5L (misleading). Median = ₹5L (the typical employee).
  • Fix: Use the median for skewed data (income, order values, website load times).


Mistake 3: Ignoring Sample Size

  • Example: Restaurant A (4.5★, 10 reviews) vs Restaurant B (4.3★, 5,000 reviews). Choose A?
  • Problem: Small samples have high uncertainty. A's rating could be luck.
  • Fix: Consider sample size. B's 4.3★ is more reliable. Check confidence intervals.


Mistake 4: P-Hacking (Cherry-Picking Results)

  • Example: Run 20 A/B tests. One shows p = 0.04 (significant). Declare victory!
  • Problem: With 20 tests, 1 false positive is expected (5% error rate). You found noise, not signal.
  • Fix: Pre-register your hypothesis, use a Bonferroni correction (a stricter p-value threshold for multiple tests), or split data (train/validation sets).
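A quick simulation shows why running many tests manufactures "wins". Here all 20 tests compare identical distributions, so every significant result is a false positive (a sketch, assuming normally distributed metrics):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 20 A/B tests where the null is TRUE: both groups come from the same distribution
p_values = []
for _ in range(20):
    a = rng.normal(loc=0, scale=1, size=1_000)
    b = rng.normal(loc=0, scale=1, size=1_000)  # no real effect anywhere
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

print(sum(p < 0.05 for p in p_values))  # often ≥ 1 "significant" result by pure chance
# Bonferroni fix: require p < 0.05 / 20 = 0.0025 before declaring victory
```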


Mistake 5: Stopping A/B Test Too Early

  • Example: Day 1 of an A/B test shows a 10% lift (p = 0.03). Stop the test and declare a winner.
  • Problem: Early data is noisy and p-values fluctuate. Stopping early inflates false positives (the peeking problem).
  • Fix: Pre-calculate the required sample size and wait until you hit it. Don't peek at p-values mid-test.


Mistake 6: Assuming Normality Without Checking

  • Example: Run a T-test on website load times (heavily right-skewed: most fast, a few very slow).
  • Problem: The T-test assumes normality. Skewed data violates that assumption → invalid results.
  • Fix: Check the distribution (histogram), use non-parametric tests (Mann-Whitney U) for skewed data, or log-transform the data to normalize it.
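In code, both workarounds look like this — a sketch on simulated right-skewed (log-normal) load times:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Log-normal load times: most requests fast, a few very slow (right skew)
site_a = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)
site_b = rng.lognormal(mean=0.2, sigma=0.8, size=1_000)

# Option 1: Mann-Whitney U — makes no normality assumption
_, p_mw = stats.mannwhitneyu(site_a, site_b)
print(f"Mann-Whitney p: {p_mw:.4f}")

# Option 2: log-transform first; the transformed data is normal, so a t-test is valid
_, p_t = stats.ttest_ind(np.log(site_a), np.log(site_b))
print(f"T-test on log data p: {p_t:.4f}")
```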


Mistake 7: Extrapolating Beyond Data Range

  • Example: Revenue grows 20% per month for 6 months. Extrapolate: "We'll make ₹100Cr in 2 years!"
  • Problem: Linear/exponential extrapolation breaks at scale (market saturation, competition).
  • Fix: Build realistic models with constraints, use domain knowledge, and don't blindly extrapolate trends.
