What is Statistics for Data Analytics?

Learn essential statistics for data analysis: mean, median, standard deviation, correlation, hypothesis testing, and statistical significance.

Is Statistics for Data Analytics suitable for beginners?

This topic is designed for Intermediate level learners. It takes approximately 18 min to complete and includes 7 interactive quizzes to test your understanding.

How long does it take to learn Statistics for Data Analytics?

You can complete this topic in about 18 min. The topic is part 10 of 12 in our comprehensive Data Analytics Learning Path.

Statistics for Data Analytics

Why Statistics for Analysts?

Statistics helps you:

✅ Summarize large datasets
✅ Identify patterns and outliers
✅ Test hypotheses
✅ Quantify uncertainty
✅ Make data-backed predictions

Descriptive Statistics

Measures of Central Tendency

Mean (Average):

Mean = Sum of all values / Count
Example: [10, 20, 30] → Mean = 20

Median (Middle Value):

Sort data, pick middle value
Example: [10, 15, 100] → Median = 15
(Better than mean when outliers exist!)

Mode (Most Frequent):

Example: [1, 2, 2, 3] → Mode = 2
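
The three examples above can be checked with Python's built-in `statistics` module. This is a minimal sketch using only the numbers already shown; the variable names are made up for illustration.

```python
import statistics

# The three worked examples from the text above.
mean_value = statistics.mean([10, 20, 30])
median_value = statistics.median([10, 15, 100])
mode_value = statistics.mode([1, 2, 2, 3])

print(mean_value)    # 20
print(median_value)  # 15
print(mode_value)    # 2

# The outlier (100) pulls the mean well above the median of the same data.
mean_with_outlier = statistics.mean([10, 15, 100])
print(round(mean_with_outlier, 1))  # 41.7
```

The last line shows why the median is the safer "typical value" when outliers exist: the mean of [10, 15, 100] is about 41.7, while the median stays at 15.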

Measures of Spread

Range:

Range = Max - Min
Example: [10, 50] → Range = 40

Variance: Average of squared differences from mean

High variance = Data is spread out
Low variance = Data is clustered

Standard Deviation (σ): Square root of variance, expressed in the same units as the data

For normally distributed data, measured from the mean:

±1σ contains ~68% of data
±2σ contains ~95% of data
±3σ contains ~99.7% of data
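
To make the definitions concrete, here is a small sketch that computes range, variance, and standard deviation by hand. The `data` list is a hypothetical sample chosen so the arithmetic is easy to follow.

```python
import statistics

data = [10, 20, 30, 40, 50]  # hypothetical sample, for illustration only

data_range = max(data) - min(data)
mean = statistics.mean(data)

# Population variance: average of squared differences from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5

print(data_range)         # 40
print(variance)           # 200.0
print(round(std_dev, 2))  # 14.14

# statistics.pvariance / pstdev compute these same population quantities.
# statistics.variance / stdev divide by n - 1 instead (sample versions).
sample_variance = statistics.variance(data)
```

Note that the sample version (dividing by n - 1) gives a slightly larger number than the population version. pandas' `.std()` uses the sample version by default.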

Probability Basics

Probability = Favorable outcomes / Total outcomes (when all outcomes are equally likely)

Example: Probability of rolling a 6 on a fair die = 1/6 ≈ 16.7%

Key Concepts

Independent events: Coin flip outcomes don't affect each other
Dependent events: Drawing cards without replacement
Conditional probability: P(A|B) = Probability of A given B happened
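
The three concepts above can be sketched with exact fractions. The card example (drawing two aces from a standard 52-card deck) is an illustration chosen here, not something from a specific dataset.

```python
from fractions import Fraction

# Classical probability: favorable outcomes / total outcomes (fair six-sided die).
p_six = Fraction(1, 6)

# Independent events: two fair coin flips, so the probabilities simply multiply.
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

# Dependent events: drawing two aces from a 52-card deck without replacement.
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # P(A|B): one ace is already gone
p_two_aces = p_first_ace * p_second_ace_given_first

print(round(float(p_six), 3))  # 0.167
print(p_two_heads)             # 1/4
print(p_two_aces)              # 1/221
```

The second draw uses 3/51 rather than 4/52 because the first draw changed the deck. That change is exactly what makes the events dependent.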

Correlation vs Causation

Correlation

Measures the strength and direction of the linear relationship between two variables (-1 to +1).

+1 = Perfect positive (both increase together)
0 = No linear relationship (a non-linear relationship may still exist)
-1 = Perfect negative (one increases, other decreases)

Example: Ice cream sales correlate with drowning deaths (both increase in summer), but ice cream doesn't cause drowning!
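
As a rough sketch of how the correlation coefficient is computed, here is Pearson's r written out by hand. The function name `pearson_r` and all three data lists are hypothetical, invented to show a perfect positive and a perfect negative relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical numbers, for illustration only.
temperature = [20, 25, 30, 35]
ice_cream_sales = [40, 50, 60, 70]
coat_sales = [70, 60, 50, 40]

r_positive = pearson_r(temperature, ice_cream_sales)
r_negative = pearson_r(temperature, coat_sales)
print(r_positive)  # 1.0
print(r_negative)  # -1.0
```

A value near +1 or -1 only says the two columns move together in a straight-line way. It says nothing about which one, if either, is driving the other.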

Causation

One variable directly causes change in another.

To establish causation, you need:

Correlation exists
Temporal order (cause before effect)
No confounding variables
Controlled experiment

Normal Distribution (Bell Curve)

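Many measurements in nature are approximately bell-shaped, but not all data is: incomes and website traffic, for example, are often heavily skewed.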

Properties:

Mean = Median = Mode (center)
Symmetric
68-95-99.7 rule applies

Real examples:

Heights
Test scores
Measurement errors
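
The 68-95-99.7 rule can be verified directly from the normal distribution's cumulative probabilities. This sketch uses `statistics.NormalDist` from the standard library; the helper name `share_within` is made up here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.1%}")
# within ±1σ: 68.3%
# within ±2σ: 95.4%
# within ±3σ: 99.7%
```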

Hypothesis Testing

The Process

State hypothesis
- H0 (Null): No effect
- H1 (Alternative): There is an effect
Collect data
Calculate p-value
- Probability of seeing results at least as extreme as yours if H0 is true
Decide
- p < 0.05: Reject H0 (statistically significant!)
- p ≥ 0.05: Fail to reject H0

Example

Question: Did new website design increase sales?

H0: New design has no effect
H1: New design increases sales

Results: p-value = 0.02

Conclusion: Reject H0 at the 5% significance level (0.02 < 0.05). The data support the claim that the new design increased sales.
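
One way to see where a p-value comes from is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference as large as the observed one. This is only a sketch of one possible test, not the method behind the p = 0.02 above. The sales numbers and the function name `permutation_p_value` are hypothetical.

```python
import random
from statistics import mean

# Hypothetical daily sales counts, invented for illustration only.
old_design = [10, 12, 11, 13, 12, 11, 10, 12]
new_design = [15, 16, 14, 17, 15, 16, 18, 15]

def permutation_p_value(control, treatment, n_permutations=10_000, seed=0):
    """One-sided p-value: how often a random relabelling gives a
    difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to both parts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

p_value = permutation_p_value(old_design, new_design)
print(f"p-value: {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Because H0 says the design makes no difference, the labels "old" and "new" should be interchangeable under H0. If almost no shuffle matches the real difference, H0 looks implausible.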

Confidence Intervals

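A range of plausible values for the true value, estimated from a sample.

Example: "Average customer age is 35 ± 2 years (95% CI)" = We're 95% confident the true average is between 33 and 37. (If we repeated the sampling many times, about 95% of the intervals built this way would contain the true average.)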
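
Here is a minimal sketch of how such an interval can be computed for a mean, using the normal approximation (mean ± z × standard error). The `ages` sample and the function name are hypothetical. For small samples a t-distribution would give a somewhat wider interval, so treat this as an approximation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of customer ages, for illustration only.
ages = [31, 38, 35, 29, 41, 36, 33, 37, 34, 36]

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

lower, upper = mean_confidence_interval(ages)
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```

The margin shrinks with the square root of the sample size, so quadrupling the sample roughly halves the width of the interval.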

Statistical Significance

p-value < 0.05 = Statistically significant

What it means:

If there were truly no effect, a result at least this extreme would show up less than 5% of the time
NOT the same as "important" or "large effect"

Example:

Finding: Website tweak increases clicks by 0.1%
p-value: 0.001 (highly significant!)
But: 0.1% increase might not be business-relevant
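
The gap between "significant" and "important" is easy to reproduce with a very large sample. The sketch below runs a two-proportion z-test on hypothetical click counts, reading the 0.1% above as 0.1 percentage points (10.0% vs 10.1%). The counts and the function name are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, p_value

# Hypothetical: 5 million visitors per variant, 10.0% vs 10.1% click rate.
lift, p_value = two_proportion_z_test(500_000, 5_000_000, 505_000, 5_000_000)
print(f"lift: {lift:.4f}")        # lift: 0.0010
print(f"significant: {p_value < 0.05}")  # significant: True
```

With enough traffic, even a tiny lift produces a very small p-value. Whether a 0.1-point lift is worth shipping is a business question the p-value cannot answer.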

Common Distributions

| Distribution | Use Case |
|--------------|----------|
| Normal | Heights, test scores |
| Binomial | Success/failure (coin flips) |
| Poisson | Event counts (website visits/hour) |
| Uniform | Random number generators |
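
For the two count-based rows of the table, the probabilities can be computed straight from their formulas. This is a small sketch with hypothetical inputs (10 coin flips, an average of 3 visits per hour); the function names are made up here.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(exactly k events when the average count per interval is `rate`)."""
    return exp(-rate) * rate ** k / factorial(k)

# Binomial: chance of exactly 5 heads in 10 fair coin flips.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)  # 0.24609375

# Poisson: chance of zero visits in an hour that averages 3 visits.
p_no_visits = poisson_pmf(0, 3)
print(round(p_no_visits, 4))  # 0.0498
```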

Real Example: Salary Analysis

Dataset: 1000 employee salaries

Questions to answer:

What's typical salary?
- Mean: ₹52,000
- Median: ₹48,000 (better - not affected by CEO's ₹5M salary!)
How spread out are salaries?
- Standard deviation: ₹15,000
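- If salaries are roughly normal, about 68% of employees earn ₹37K-₹67K (mean ± 1σ)
Are salaries normally distributed?
- Check a histogram: a roughly symmetric bell shape suggests yes, a long right tail suggests no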
Do engineers earn more than designers?
- Hypothesis test
- p-value = 0.03
- Yes, statistically significant difference!
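
The first question (mean vs median) is worth trying on a toy dataset. The salaries below are hypothetical and far smaller than the 1000-row dataset described above. They only illustrate how a single very large value moves the mean but barely touches the median.

```python
from statistics import mean, median

# Hypothetical salaries in rupees, for illustration only.
salaries = [42_000, 45_000, 48_000, 50_000, 55_000]
with_ceo = salaries + [5_000_000]

print(mean(salaries), median(salaries))         # 48000 48000
print(round(mean(with_ceo)), median(with_ceo))  # 873333 49000.0
```

Adding one ₹5M salary multiplies the mean by about 18, while the median moves by only ₹1,000.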

Python for Statistics

import pandas as pd
from scipy import stats

# Assumes sales.csv has revenue, ad_spend, variant, and conversion_rate columns
df = pd.read_csv('sales.csv')

# Descriptive stats
print(f"Mean: {df['revenue'].mean()}")
print(f"Median: {df['revenue'].median()}")
print(f"Std dev: {df['revenue'].std()}")  # sample standard deviation (divides by n - 1)

# Correlation (Pearson by default)
correlation = df['ad_spend'].corr(df['revenue'])
print(f"Correlation: {correlation:.3f}")

# Hypothesis test (independent two-sample t-test)
group_a = df.loc[df['variant'] == 'A', 'conversion_rate']
group_b = df.loc[df['variant'] == 'B', 'conversion_rate']
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

Common Mistakes

❌ Confusing correlation with causation ✅ Remember: Correlation ≠ Causation

❌ p-hacking (testing until you get p<0.05) ✅ Define hypothesis before collecting data

❌ Ignoring sample size ✅ Larger samples = more reliable results

❌ Assuming significance = importance ✅ Consider practical significance too
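
To see why p-hacking is a problem, consider how quickly false positives pile up when you keep testing. The sketch below assumes the tests are independent and that there is no real effect in any of them; the function name is made up for illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests gives p < alpha
    when no real effect exists in any of them."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20):
    print(n_tests, f"{chance_of_false_positive(n_tests):.1%}")
# 1 5.0%
# 5 22.6%
# 20 64.2%
```

Run 20 unplanned tests and you are more likely than not to find something "significant" by luck alone, which is why the hypothesis should be fixed before looking at the data.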

Summary

✅ Mean vs Median (use median when outliers exist)
✅ Standard deviation measures spread
✅ Correlation ≠ Causation
✅ p < 0.05 = Statistically significant (by convention)
✅ Confidence intervals quantify uncertainty
✅ Hypothesis testing weighs evidence against the null hypothesis (it never proves a theory)

Next: A/B Testing & Experimentation! 🧪