Module 5

Simpson's Paradox

When trends reverse after grouping data

What You'll Learn

  • What Simpson's Paradox is
  • How it occurs
  • Famous real-world examples
  • How to avoid misleading conclusions
  • When aggregation misleads

Simpson's Paradox

Definition: A trend appears in different groups but reverses when groups are combined

The shock: Overall data shows one pattern, but every subgroup shows the opposite!

Key insight: Aggregating data can be misleading

Classic Example: UC Berkeley Admissions

Overall data (1973):

  • Men: 44% admitted
  • Women: 35% admitted
  • Conclusion: Gender bias against women?

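By department: Most departments showed no significant bias against women, and in several of the largest ones women were admitted at higher rates than men

What happened? Women applied in greater proportion to competitive departments with low admission rates, while men applied more often to departments that admitted most applicants

Reality: No evidence of systematic bias against women (the original analysis found a small bias FOR women)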

How It Happens

Requirements:

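  1. A confounding variable (like department)
  2. Unequal group sizes, so the categories being compared contain a different mix of groups
  3. The confounder is related to both the outcome and the variable being compared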

Mathematical structure:

  • Group A: Treatment better than Control
  • Group B: Treatment better than Control
  • Combined: Control better than Treatment! 😱
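One way to see why this structure is possible is to write each pooled rate as a weighted average of the group rates. The symbols below are introduced here for illustration only: r is a success rate, T and C stand for Treatment and Control, and w is the share of that arm's cases that fall in Group A.

```latex
\begin{align*}
r_T &= w_T \, r_{T,A} + (1 - w_T) \, r_{T,B} \\
r_C &= w_C \, r_{C,A} + (1 - w_C) \, r_{C,B}
\end{align*}
```

Even when r_{T,A} > r_{C,A} and r_{T,B} > r_{C,B}, the pooled rates can satisfy r_T < r_C if the weights w_T and w_C differ enough and the two groups have very different baseline rates.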

Medical Example

Drug trial:

Group A (Young patients):

  • Drug: 90% recovery (90/100)
  • No drug: 80% recovery (800/1000)

Group B (Old patients):

  • Drug: 20% recovery (20/100)
  • No drug: 10% recovery (10/100)

Combined:

  • Drug: 110/200 = 55% recovery
  • No drug: 810/1100 = 73.6% recovery

Paradox: Drug better in BOTH groups, but worse overall!

Reason: Old (sicker) patients make up half of the drug group but only about 9% of the no-drug group
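The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the counts quoted above; the dictionary layout and function names are illustrative choices, not part of the lesson.

```python
# Recovery counts from the drug trial above: (recovered, total).
trial = {
    "young": {"drug": (90, 100), "no_drug": (800, 1000)},
    "old": {"drug": (20, 100), "no_drug": (10, 100)},
}


def rate(recovered, total):
    """Recovery rate as a fraction between 0 and 1."""
    return recovered / total


def pooled(arm):
    """Add up recovered and total counts for one arm across all age groups."""
    recovered = sum(trial[group][arm][0] for group in trial)
    total = sum(trial[group][arm][1] for group in trial)
    return recovered, total


# Within each age group the drug arm has the higher rate.
for group, arms in trial.items():
    print(group, f"drug={rate(*arms['drug']):.0%}", f"no_drug={rate(*arms['no_drug']):.0%}")

# Pooled over age groups the ordering flips.
print("combined", f"drug={rate(*pooled('drug')):.1%}", f"no_drug={rate(*pooled('no_drug')):.1%}")

# Share of old patients in each arm: the imbalance behind the flip.
old_share_drug = trial["old"]["drug"][1] / pooled("drug")[1]
old_share_no_drug = trial["old"]["no_drug"][1] / pooled("no_drug")[1]
print(f"old share: drug={old_share_drug:.0%}, no_drug={old_share_no_drug:.0%}")
```

Note that the pooled rate is computed by summing counts first, not by averaging the two group percentages.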

Baseball Batting Example

Player A vs Player B (1995):

First half:

  • A: .250 (better)
  • B: .200

Second half:

  • A: .400 (better)
  • B: .350

Full season: B has higher average than A!

How? The players had very different numbers of at-bats in each period, so their season averages weight the two halves differently
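The lesson gives only the averages, not the at-bats. The sketch below fills in hypothetical hit and at-bat counts, chosen purely so that the half-season averages match the ones quoted, to show how the full-season ordering can flip.

```python
# Hypothetical (hits, at_bats); only the resulting averages come from the lesson.
season = {
    "A": {"first_half": (100, 400), "second_half": (40, 100)},
    "B": {"first_half": (20, 100), "second_half": (140, 400)},
}


def average(hits, at_bats):
    """Batting average: hits divided by at-bats."""
    return hits / at_bats


def full_season(player):
    """Season average from total hits over total at-bats."""
    hits = sum(h for h, _ in season[player].values())
    at_bats = sum(ab for _, ab in season[player].values())
    return average(hits, at_bats)


for player, halves in season.items():
    parts = ", ".join(f"{half}: {average(*counts):.3f}" for half, counts in halves.items())
    print(player, parts, f"full season: {full_season(player):.3f}")
```

With these assumed counts, A has most at-bats in the weaker first half and B has most in the stronger second half, which is enough to put B ahead over the season.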

Why This Matters

Bad conclusions from:

  • Ignoring important groupings
  • Aggregating without thought
  • Not controlling for confounders

Can lead to:

  • Wrong business decisions
  • Misleading research
  • Incorrect policy

Real-World Cases

Kidney stone treatment:

  • Treatment A better for large and small stones
  • Treatment B better overall
  • (Due to case mix)

COVID-19 mortality:

  • Country A: Lower mortality in every age group
  • Country B: Lower overall mortality
  • (Due to age distribution)

College rankings:

  • School improves in every category
  • Falls in overall ranking
  • (Due to weighting changes)
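For the kidney stone case, the effect of case mix can be made concrete with direct standardization: apply each treatment's per-size success rate to the same pooled mix of stone sizes. The counts below are the figures commonly cited for this study; they do not appear in the lesson and are included here as an assumption, so treat the output as illustrative.

```python
# Commonly cited (successes, patients) per stone size; an assumption, not
# taken from the lesson text.
kidney = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
SIZES = ("small", "large")


def crude_rate(treatment):
    """Overall success rate, ignoring stone size."""
    successes = sum(kidney[treatment][size][0] for size in SIZES)
    patients = sum(kidney[treatment][size][1] for size in SIZES)
    return successes / patients


def standardized_rate(treatment):
    """Success rate if the treatment had been given to the pooled case mix."""
    mix = {size: sum(kidney[t][size][1] for t in kidney) for size in SIZES}
    total = sum(mix.values())
    return sum(
        (mix[size] / total) * (kidney[treatment][size][0] / kidney[treatment][size][1])
        for size in SIZES
    )


for treatment in kidney:
    print(
        treatment,
        f"crude={crude_rate(treatment):.1%}",
        f"standardized={standardized_rate(treatment):.1%}",
    )
```

Under these figures the crude rates favor Treatment B, while the standardized rates, which hold the stone-size mix fixed, favor Treatment A.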

How to Avoid

Step 1: Visualize subgroups. Don't just look at totals.

Step 2: Identify confounders. What varies across groups?

Step 3: Stratify analysis. Report by meaningful groups.

Step 4: Use appropriate statistics. Adjust for confounders in models.

Step 5: Think causally. What's really driving the relationship?
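Steps 1 to 3 can be partly automated by comparing the pooled difference with the difference inside each subgroup. The helper below is a hypothetical sketch (the name `find_reversal` and its input layout are assumptions made for illustration), applied to the drug trial counts from the medical example.

```python
def find_reversal(strata):
    """Check whether pooling flips the direction seen in every stratum.

    `strata` maps a stratum name to a pair of (successes, total) tuples,
    one per group being compared. Returns (pooled_diff, stratum_diffs, flipped).
    """

    def rate_diff(first, second):
        return first[0] / first[1] - second[0] / second[1]

    stratum_diffs = {name: rate_diff(first, second) for name, (first, second) in strata.items()}

    # Pool by summing counts, not by averaging the stratum rates.
    pooled_first = [sum(first[i] for first, _ in strata.values()) for i in (0, 1)]
    pooled_second = [sum(second[i] for _, second in strata.values()) for i in (0, 1)]
    pooled_diff = rate_diff(pooled_first, pooled_second)

    # Flipped: every stratum difference has the opposite sign to the pooled one.
    flipped = all(d * pooled_diff < 0 for d in stratum_diffs.values())
    return pooled_diff, stratum_diffs, flipped


# Drug trial from the medical example: (drug, no drug) per age group.
drug_trial = {
    "young": ((90, 100), (800, 1000)),
    "old": ((20, 100), (10, 100)),
}
pooled_diff, stratum_diffs, flipped = find_reversal(drug_trial)
print(f"pooled difference: {pooled_diff:+.3f}")
print({name: round(d, 3) for name, d in stratum_diffs.items()})
print("direction flips when pooled:", flipped)
```

A check like this only flags the pattern; deciding which grouping variable is the right one to stratify on still requires the causal thinking in Step 5.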

When to Aggregate vs Stratify

Aggregate when:

  • Groups truly comparable
  • No important confounders
  • Large sample needed

Stratify when:

  • Groups differ systematically
  • Confounders present
  • Seeking causal insights

Practice Exercise

Company hiring:

Department 1:

  • Men: 12/12 hired (100%)
  • Women: 10/10 hired (100%)

Department 2:

  • Men: 40/200 hired (20%)
  • Women: 30/100 hired (30%)

Questions:

  1. Who has better hiring rate overall?
  2. Who has better rate in each department?
  3. Is this Simpson's Paradox?
  4. What's the confounder?

Answers:

  1. Men: 52/212 = 24.5%, Women: 40/110 = 36.4% (Women better!)
  2. Dept 1: Tie, Dept 2: Women better
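  3. No. Women do at least as well in each department and also overall, so the direction never reverses
  4. Department. It affects the hiring rate, and men and women applied to the two departments in different proportions, but here the imbalance is too small to flip the overall result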
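The exercise can be checked with a short script that uses exact fractions and compares the direction of the overall gap with the gap in each department. The variable names are illustrative; the counts are the ones given in the exercise.

```python
from fractions import Fraction

# (hired, applicants) by department, copied from the exercise above.
hiring = {
    "Department 1": {"men": (12, 12), "women": (10, 10)},
    "Department 2": {"men": (40, 200), "women": (30, 100)},
}


def overall(group):
    """Pooled hiring rate for one group across both departments."""
    hired = sum(dept[group][0] for dept in hiring.values())
    applicants = sum(dept[group][1] for dept in hiring.values())
    return Fraction(hired, applicants)


def gap(counts):
    """Women's rate minus men's rate for one department."""
    return Fraction(*counts["women"]) - Fraction(*counts["men"])


overall_gap = overall("women") - overall("men")
department_gaps = {name: gap(counts) for name, counts in hiring.items()}

# A Simpson's-style reversal needs the pooled gap to point the opposite way
# from the gap in every department.
reversal = all(g * overall_gap < 0 for g in department_gaps.values())

print(f"overall: men={float(overall('men')):.1%}, women={float(overall('women')):.1%}")
for name, g in department_gaps.items():
    print(f"{name}: women minus men = {float(g):+.1%}")
print("reversal:", reversal)
```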

Statistical Solutions

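Mantel-Haenszel method: Pool the 2x2 tables from each stratum into a single adjusted estimate, such as a common odds ratio

Regression adjustment: Include the confounder as a covariate in the model

Propensity score matching: Match cases with a similar estimated probability of receiving the treatment

Causal inference: Use DAGs (directed acyclic graphs) to decide which variables to control for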
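As an illustration of the first method, the Mantel-Haenszel common odds ratio can be computed by hand from the stratified 2x2 tables. This sketch applies the standard formula to the drug trial counts from the medical example; the function names and the tuple layout are assumptions made for this example, and it omits confidence intervals and significance tests.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a, b = treated recovered / not recovered
    c, d = control recovered / not recovered
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Drug trial from the medical example, one table per age group.
young = (90, 10, 800, 200)
old = (20, 80, 10, 90)

# Crude odds ratio: collapse the strata by summing cell by cell.
crude = odds_ratio(*(sum(cells) for cells in zip(young, old)))
adjusted = mantel_haenszel_or([young, old])

print(f"crude odds ratio:    {crude:.2f}")
print(f"adjusted odds ratio: {adjusted:.2f}")
```

An odds ratio below 1 suggests the drug is harmful and above 1 suggests it helps, so the crude and age-adjusted estimates point in opposite directions for this data.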

Key Takeaways

1. Aggregation hides information. Always check subgroups.

2. Confounders matter. Control for lurking variables.

3. Context is crucial. Understand your data structure.

4. Don't trust averages blindly. Dig deeper into the groups.

5. Think about causality. What's really causing what?

Warning Signs

Watch for:

  • Very different group sizes
  • Natural subgroups (age, location, etc.)
  • Unexpected aggregate results
  • Confounders in the data

Next Steps

Learn about Simple Linear Regression!

Tip: When in doubt, break down aggregated data into meaningful groups!
