Topic 10 of 12

Statistics for Data Analytics

Statistics turns hunches into evidence. Learn the fundamentals that separate guesswork from data-driven decisions.

📚 Intermediate
⏱️ 18 min
✅ 7 quizzes

Why Statistics for Analysts?

Statistics helps you:

✅ Summarize large datasets
✅ Identify patterns and outliers
✅ Test hypotheses
✅ Quantify uncertainty
✅ Make data-backed predictions

Descriptive Statistics

Measures of Central Tendency

Mean (Average):

Mean = Sum of all values / Count
Example: [10, 20, 30] → Mean = 20

Median (Middle Value):

Sort the data and pick the middle value (for an even count, average the two middle values).
Example: [10, 15, 100] → Median = 15 (better than the mean when outliers exist!)

Mode (Most Frequent):

Example: [1, 2, 2, 3] → Mode = 2
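
As a quick check of the three measures above, here is a minimal sketch using Python's built-in `statistics` module. It reuses the example lists from this section; the variable names are only illustrative.

```python
import statistics

values = [10, 15, 100]  # the list from the median example, with one outlier

mean_value = statistics.mean(values)        # (10 + 15 + 100) / 3, about 41.67
median_value = statistics.median(values)    # middle value after sorting: 15
mode_value = statistics.mode([1, 2, 2, 3])  # most frequent value: 2

print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
```

Notice how the single outlier (100) pulls the mean far above the median.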

Measures of Spread

Range:

Range = Max - Min
Example: [10, 50] → Range = 40

Variance: Average of squared differences from mean

High variance = Data is spread out
Low variance = Data is clustered

Standard Deviation (σ): Square root of variance

For normally distributed data:

±1σ contains ~68% of data
±2σ contains ~95% of data
±3σ contains ~99.7% of data
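
The spread measures can be computed the same way. This is a small sketch on a made-up list (the numbers are hypothetical). It also shows the difference between the population variance (divide by n, the "average of squared differences" described above) and the sample variance (divide by n - 1), which is what pandas' `.std()` and `.var()` use by default.

```python
import statistics

data = [10, 20, 30, 40, 50]  # hypothetical values, mean = 30

data_range = max(data) - min(data)             # 50 - 10 = 40
pop_variance = statistics.pvariance(data)      # 1000 / 5 = 200
sample_variance = statistics.variance(data)    # 1000 / 4 = 250
pop_std = statistics.pstdev(data)              # sqrt(200), about 14.14

print(f"Range: {data_range}")
print(f"Population variance: {pop_variance}")
print(f"Sample variance: {sample_variance}")
print(f"Population std dev: {pop_std:.2f}")
```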

Probability Basics

Probability = Favorable outcomes / Total outcomes (when all outcomes are equally likely)

Example: Probability of rolling a 6 on a fair die = 1/6 ≈ 16.7%

Key Concepts

  • Independent events: Coin flip outcomes don't affect each other
  • Dependent events: Drawing cards without replacement
  • Conditional probability: P(A|B) = Probability of A given B happened
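
To make the difference between independent and dependent events concrete, here is a short sketch using exact fractions. The card scenario (drawing two aces from a standard 52-card deck) is an illustrative assumption, not part of the lesson's dataset.

```python
from fractions import Fraction

# Probability that the first card drawn is an ace
p_first_ace = Fraction(4, 52)

# Dependent events (no replacement): one ace and one card are already gone,
# so P(second ace | first ace) = 3/51
p_second_ace_given_first = Fraction(3, 51)
p_two_aces_without_replacement = p_first_ace * p_second_ace_given_first

# Independent events (card is put back): the second draw is unaffected
p_two_aces_with_replacement = p_first_ace * p_first_ace

print(p_two_aces_without_replacement)  # 1/221
print(p_two_aces_with_replacement)     # 1/169
```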

Correlation vs Causation

Correlation

Measures the strength and direction of the linear relationship between two variables (-1 to +1).

  • +1 = Perfect positive (both increase together)
  • 0 = No linear relationship
  • -1 = Perfect negative (one increases, other decreases)

Example: Ice cream sales correlate with drowning deaths (both increase in summer), but ice cream doesn't cause drowning!
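
Here is a minimal sketch of how the correlation coefficient (Pearson's r) can be computed by hand. The function name `pearson_r` and the sample lists are hypothetical. The last example shows why 0 only means "no linear relationship": y depends entirely on x, yet r is 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / math.sqrt(spread_x * spread_y)

r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # approximately +1
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])        # approximately -1
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2, r = 0

print(r_positive, r_negative, r_curved)
```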

Causation

One variable directly causes change in another.

To establish causation, you need:

  1. Correlation exists
  2. Temporal order (cause before effect)
  3. No confounding variables
  4. Controlled experiment

Normal Distribution (Bell Curve)

Many measurements in nature approximately follow a bell curve.

Properties:

  • Mean = Median = Mode (center)
  • Symmetric
  • 68-95-99.7 rule applies

Real examples:

  • Heights
  • Test scores
  • Measurement errors
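
The 68-95-99.7 rule can be verified directly from the normal distribution's cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library; the helper name `share_within` is just for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within ±{k}σ: {share_within(k):.1%}")
```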

Hypothesis Testing

The Process

  1. State hypothesis

    • H0 (Null): No effect
    • H1 (Alternative): There is an effect
  2. Collect data

  3. Calculate p-value

    • Probability of seeing results at least as extreme as yours if H0 is true
  4. Decide

    • p < 0.05: Reject H0 (statistically significant!)
    • p ≥ 0.05: Fail to reject H0

Example

Question: Did new website design increase sales?

  • H0: New design has no effect
  • H1: New design increases sales

Results: p-value = 0.02

Conclusion: Reject H0 at the 5% significance level (p = 0.02 < 0.05). The data support the claim that the new design increased sales.
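
One way to see what a p-value means is a permutation test: if H0 is true, the "old" and "new" labels are interchangeable, so we shuffle them many times and count how often chance alone produces a difference at least as large as the observed one. The sketch below is one possible approach, not the method behind the p = 0.02 above. The daily sales figures and the function name `permutation_p_value` are hypothetical.

```python
import random

def permutation_p_value(control, treatment, n_permutations=10_000, seed=0):
    """One-sided p-value for H1: mean(treatment) > mean(control)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_control = len(control)
    observed = sum(treatment) / len(treatment) - sum(control) / n_control
    pooled = list(control) + list(treatment)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random, as H0 allows
        shuffled_control = pooled[:n_control]
        shuffled_treatment = pooled[n_control:]
        diff = (sum(shuffled_treatment) / len(shuffled_treatment)
                - sum(shuffled_control) / n_control)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add 1 to both counts so the estimate is never exactly zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical daily sales under each design
old_design_sales = [12, 15, 11, 14, 13, 16, 12, 14]
new_design_sales = [15, 17, 14, 18, 16, 15, 19, 17]

p_value = permutation_p_value(old_design_sales, new_design_sales)
decision = "Reject H0" if p_value < 0.05 else "Fail to reject H0"
print(f"p-value: {p_value:.4f} -> {decision}")
```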

Confidence Intervals

A range of values that likely contains the true value.

Example: "Average customer age is 35 ± 2 years (95% CI)" = We're 95% confident the true average is between 33 and 37.
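
A confidence interval for a mean is usually built as mean ± (critical value × standard error). The sketch below uses the normal approximation (critical value of about 1.96 for 95%), which is an assumption that works best for larger samples; for small samples a t-distribution critical value is more appropriate. The ages and the function name are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 when confidence = 0.95
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

customer_ages = [31, 36, 34, 38, 33, 35, 37, 32, 36, 38]  # hypothetical sample
low, high = mean_confidence_interval(customer_ages)
print(f"95% CI for average age: {low:.1f} to {high:.1f}")
```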

Statistical Significance

p-value < 0.05 = Statistically significant

What it means:

  • If there were no real effect, a result at least this extreme would show up less than 5% of the time
  • NOT the same as "important" or "large effect"

Example:

  • Finding: Website tweak increases clicks by 0.1%
  • p-value: 0.001 (highly significant!)
  • But: 0.1% increase might not be business-relevant
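
The gap between statistical and practical significance comes from sample size: with enough data, even a tiny difference becomes "significant". The sketch below runs a two-proportion z-test on the same small lift at two sample sizes. The click counts are hypothetical, and reading the lift as 0.1 percentage points (10.0% vs 10.1%) is an assumption.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided p-value for a difference between two click rates (z-test)."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    pooled_rate = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(
        pooled_rate * (1 - pooled_rate) * (1 / visitors_a + 1 / visitors_b)
    )
    z = (rate_b - rate_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same lift (10.0% -> 10.1%), very different sample sizes
p_small_sample = two_proportion_p_value(1_000, 10_000, 1_010, 10_000)
p_large_sample = two_proportion_p_value(100_000, 1_000_000, 101_000, 1_000_000)

print(f"10,000 visitors per variant:    p = {p_small_sample:.3f}")
print(f"1,000,000 visitors per variant: p = {p_large_sample:.3f}")
```

The effect size is identical in both runs; only the amount of data changes whether it crosses the 0.05 threshold.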

Common Distributions

| Distribution | Use Case |
|--------------|----------|
| Normal | Heights, test scores |
| Binomial | Success/failure (coin flips) |
| Poisson | Event counts (website visits/hour) |
| Uniform | Random number generators |
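
For the two discrete distributions in the table, probabilities follow from short formulas. This is a minimal sketch with hypothetical helper names and example parameters (10 coin flips, an average of 3 visits per hour).

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(exactly k events in an interval when the average count is `rate`)."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

p_five_heads = binomial_pmf(5, 10, 0.5)  # 252 / 1024, about 0.246
p_no_visits = poisson_pmf(0, 3)          # e^-3, about 0.050

print(f"P(5 heads in 10 flips): {p_five_heads:.3f}")
print(f"P(0 visits in an hour, average 3): {p_no_visits:.3f}")
```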

Real Example: Salary Analysis

Dataset: 1000 employee salaries

Questions to answer:

  1. What's the typical salary?

    • Mean: ₹52,000
    • Median: ₹48,000 (better - not affected by CEO's ₹5M salary!)
  2. How spread out are salaries?

    • Standard deviation: ₹15,000
    • If salaries are roughly normal, about 68% of employees earn ₹37K-₹67K
  3. Are salaries normally distributed?

    • Check a histogram - a roughly bell-shaped curve suggests they are approximately normal
  4. Do engineers earn more than designers?

    • Hypothesis test
    • p-value = 0.03
    • Yes, statistically significant difference!

Python for Statistics

```python
import pandas as pd
from scipy import stats

# Assumes sales.csv has revenue, ad_spend, variant, and conversion_rate columns
df = pd.read_csv('sales.csv')

# Descriptive stats
print(df['revenue'].mean())
print(df['revenue'].median())
print(df['revenue'].std())  # sample standard deviation (divides by n - 1)

# Correlation (Pearson by default)
correlation = df['ad_spend'].corr(df['revenue'])
print(f"Correlation: {correlation:.3f}")

# Hypothesis test (two-sample t-test comparing variants A and B)
group_a = df[df['variant'] == 'A']['conversion_rate']
group_b = df[df['variant'] == 'B']['conversion_rate']
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

Common Mistakes

โŒ Confusing correlation with causation โœ… Remember: Correlation โ‰  Causation

โŒ p-hacking (testing until you get p<0.05) โœ… Define hypothesis before collecting data

โŒ Ignoring sample size โœ… Larger samples = more reliable results

โŒ Assuming significance = importance โœ… Consider practical significance too

Summary

✅ Mean vs Median (use median when outliers exist)
✅ Standard deviation measures spread
✅ Correlation ≠ Causation
✅ p < 0.05 = Statistically significant
✅ Confidence intervals quantify uncertainty
✅ Hypothesis testing weighs evidence against H0 (it never proves a theory outright)

Next: A/B Testing & Experimentation! 🧪