Avoiding Peeking
Prevent false positives from early stopping
What You'll Learn
- Why peeking is problematic
- Inflated false positive rates
- Sequential testing methods
- When to stop experiments
- Proper analysis practices
The Peeking Problem
What is peeking? Checking results mid-experiment and stopping early if they look significant
Why is it tempting?
- Want results fast
- See significance, assume it's real
- Business pressure
The problem: Dramatically increases false positive rate!
Example: with α = 0.05, the nominal false positive rate is 5%. With peeking, it can climb to 20-30%!
Why Peeking Inflates Error Rates
Multiple testing problem: Each peek is a test!
Analogy: flip a coin 20 times and the chance of at least one head is far higher than the 50% chance on a single flip; every extra look gives randomness another chance to "hit"
In A/B testing: check 10 times during an experiment and there is roughly a 26% chance of a false positive somewhere!
Simulation: run an A/A test (no real difference), peek 10 times, stop if p<0.05 → a "winner" is declared about 26% of the time
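A minimal sketch of that A/A simulation in Python; the schedule (10 equally spaced peeks, 200 new users per arm between peeks) and the t-test are assumed illustrative choices, and the exact rate you get depends on them:

```python
# Simulate the peeking problem: no true difference, stop at the first p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_peeks, n_per_peek = 2000, 10, 200
false_positives = 0

for _ in range(n_sims):
    # Both arms come from the same distribution: a true A/A test.
    a = rng.normal(size=n_peeks * n_per_peek)
    b = rng.normal(size=n_peeks * n_per_peek)
    for k in range(1, n_peeks + 1):
        n = k * n_per_peek
        _, p = ttest_ind(a[:n], b[:n])
        if p < 0.05:              # "significant" at this peek -> stop early
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
```

The estimated rate lands well above the nominal 5%, which is the whole point: the data never changed, only the number of looks did.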
Random Early Patterns
Small samples fluctuate:
- Day 1: Treatment up 15% (p=0.04) 🎉
- Day 3: Treatment up 8% (p=0.10)
- Day 7: Treatment up 2% (p=0.45)
- Day 14: Treatment down 1% (p=0.68)
If stopped on Day 1: False positive!
Reality: No real effect, just noise
The Cost of Peeking
False positives: Implement changes that don't work
- Waste engineering resources
- Potential harm to metrics
- Loss of credibility
Example: peek at Day 2 and see a 10% improvement, so you launch the feature. After full rollout, the actual impact is -2%!
Better: Wait for planned sample size
Sequential Testing Methods
Alternative to a fixed sample: look at the data early BUT adjust for it
1. Alpha spending: allocate the significance level across peeks
2. Optimizely Stats Engine: uses sequential testing, so you can peek without inflation
3. Bayesian methods: posterior probabilities don't have the same peeking problem
4. Confidence sequences: always-valid confidence intervals
Alpha Spending Functions
Idea: Spend α budget across multiple looks
Example (simplified):
- Look 1: Use α = 0.01
- Look 2: Use α = 0.02
- Look 3: Use α = 0.02
- Total budget: α = 0.05 (in practice, the exact thresholds come from the chosen spending function)
Methods:
- O'Brien-Fleming (conservative early)
- Pocock (equal spending)
Allows early stopping while controlling error rate
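A small sketch of the two spending functions named above, using the standard Lan-DeMets approximations; the three looks at 33%, 67%, and 100% of the data are an assumed schedule:

```python
# Cumulative alpha spent by information fraction t, for two common
# spending functions (Lan-DeMets approximations).
import math
from scipy.stats import norm

def obrien_fleming_spend(t, alpha=0.05):
    # Spends very little alpha early, saving most for the final look.
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    # Spends alpha much more evenly across looks.
    return alpha * math.log(1 + (math.e - 1) * t)

looks = [0.33, 0.67, 1.0]  # assumed: peek at 1/3, 2/3, and all of the data
for spend in (obrien_fleming_spend, pocock_spend):
    cumulative = [spend(t) for t in looks]
    incremental = [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(looks))
    ]
    print(spend.__name__, [round(a, 4) for a in incremental])
```

Note how the O'Brien-Fleming schedule spends almost nothing at the first look, matching the "conservative early" description, while Pocock spreads the budget more evenly.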
When Can You Stop Early?
Scenario 1: Overwhelming evidence. The effect is so large it's undeniable.
Example: MDE = 2%, but you're seeing a 15% improvement. You can likely stop early.
But: use a sequential testing framework!
Scenario 2: Harm detected. A guardrail metric is severely violated → stop for safety!
Scenario 3: No chance of significance. A futility analysis shows the test can't reach significance → stop to save resources.
Sample Ratio Mismatch (SRM)
What it is: Unequal group sizes when expecting 50/50
Red flag for:
- Randomization issues
- Tracking bugs
- Bot traffic
Check: Chi-square test on group sizes
Example: expecting 5000/5000 but observing 5234/4766 → investigate before analyzing!
Don't analyze if SRM present!
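A quick sketch of the chi-square SRM check for the counts in the example above; the p < 0.001 alert threshold is a common convention, not a universal rule:

```python
# Chi-square goodness-of-fit test of observed group sizes against a 50/50 split.
from scipy.stats import chisquare

observed = [5234, 4766]
expected = [5000, 5000]
stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
if p_value < 0.001:
    print("Likely SRM: investigate before analyzing results")
```

For these counts the p-value is far below 0.001, so the mismatch is almost certainly not random noise and the experiment needs investigation before any readout.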
Proper Analysis Workflow
1. Pre-register plan (see the sketch after this list)
- Sample size
- Duration
- Primary metric
- Analysis method
2. Monitor health metrics
- SRM check
- Data quality
- Technical issues
3. Wait for the planned end: don't peek at the primary metric!
4. Analyze once, at the predetermined time
5. Make the decision based on pre-registered criteria
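To make step 1 concrete, here is a hypothetical pre-registration record sketched as a plain Python dict; every field name and value is illustrative, not a standard schema:

```python
# A hypothetical pre-registered experiment plan, written down before launch.
experiment_plan = {
    "name": "checkout_button_color",       # made-up experiment
    "primary_metric": "conversion_rate",
    "mde": 0.02,                            # minimum detectable effect
    "alpha": 0.05,
    "power": 0.80,
    "sample_size_per_arm": 25_000,          # from the power calculation
    "planned_duration_days": 14,
    "analysis_method": "two-sample proportion z-test at day 14",
    "guardrail_metrics": ["latency_p95", "error_rate"],
}
```

Writing this down somewhere shared before launch is what makes "analyze once, at the predetermined time" enforceable.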
What You CAN Monitor
Safe to check:
✓ Sample size accumulation: are we getting traffic?
✓ Technical metrics: errors, crashes, load time
✓ Sample ratio: is the 50/50 split maintained?
✓ Secondary/guardrail metrics: is anything breaking?
✗ Primary metric: wait until the end!
Bayesian Approach
Alternative framework: Posterior probability instead of p-values
Advantages:
- Can peek without inflation
- More intuitive interpretation
- Incorporate prior knowledge
Example: "95% probability that B > A" vs "p = 0.03" (frequentist)
Tools:
- VWO
- Google Optimize (used a Bayesian engine; now discontinued)
Trade-off: Need to specify priors
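A minimal Beta-Binomial sketch of "probability that B beats A" via Monte Carlo draws from each arm's posterior; the conversion counts are made up, and Beta(1, 1) is one possible (flat) prior choice:

```python
# Posterior probability that variant B's conversion rate exceeds A's.
import numpy as np

rng = np.random.default_rng(42)
conv_a, n_a = 480, 5000   # hypothetical control conversions / visitors
conv_b, n_b = 520, 5000   # hypothetical treatment conversions / visitors

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each arm.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
print(f"P(B > A) = {(post_b > post_a).mean():.3f}")
```

The output reads directly as "probability that B is better than A", which is the more intuitive statement quoted above.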
Multiple Comparisons
Testing multiple variants: A vs B vs C vs D
Problem: More comparisons = more false positives
Solution: Bonferroni correction, which divides α by the number of comparisons
Example: with 3 comparisons and α = 0.05, adjust to α = 0.05/3 ≈ 0.017 per test
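A tiny sketch of that adjustment; the three p-values for B, C, and D against the control are hypothetical:

```python
# Bonferroni: compare each p-value against alpha divided by the number of tests.
p_values = {"B vs A": 0.012, "C vs A": 0.030, "D vs A": 0.045}
alpha = 0.05
threshold = alpha / len(p_values)   # 0.05 / 3 ≈ 0.017, as in the example
for name, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict} at adjusted alpha {threshold:.3f}")
```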
Or: Use Bayesian ranking
Segmentation Analysis
After finding overall effect: Can explore segments (gender, device, etc.)
But:
- Don't choose segments based on where effect looks big!
- Adjust for multiple comparisons
- Report as exploratory
Valid: Pre-specified: "We'll check mobile vs desktop"
Invalid: Post-hoc: "Effect only in Android users in California!"
Real-World Pressure
Common scenarios:
CEO: "Can we peek at results?" You: "We can check data quality, but looking at primary metric invalidates the test"
PM: "It's been 3 days, what do you see?" You: "Need 2 more weeks for valid results"
Engineer: "Should we ship this?" You: "Not until experiment completes"
Set expectations upfront!
Statistical Monitoring Dashboard
Create dashboard with:
Green (safe to monitor):
- Sample size progress
- Sample ratio (SRM check)
- Data quality checks
- Guardrail metrics
Red (don't look):
- Primary metric results
- P-values
- Treatment vs control comparison
Alternatives to Early Stopping
If you need speed:
1. Larger MDE: power the test only for bigger effects, which needs less data (but you'll miss smaller improvements)
2. Higher traffic allocation: route more of your users into the experiment (note that an unbalanced 90/10 treatment/control split actually reduces power versus 50/50)
3. Sequential testing: use proper methods that allow peeking
4. Tiered testing: test on a subset first, then scale
Practice Exercise
Scenario: running a 2-week A/B test
- Day 3: Treatment +8%, p=0.04
- Day 7: Treatment +3%, p=0.18
- Day 14: Treatment +5%, p=0.06
Questions:
- Should you stop on Day 3?
- What if you did? What's the risk?
- What should you do instead?
- What if a guardrail metric crashed on Day 5?
Answers:
- No! Need full sample size
- High risk of false positive (20-30%)
- Wait for Day 14, use pre-registered criteria
- Stop immediately for guardrail violation
Key Takeaways
1. Peeking inflates false positives: 5% → 20-30%
2. Set the sample size upfront and stick to the plan
3. If you must peek, use sequential testing methods
4. Monitor technical health, not the primary metric
5. Business pressure is real: educate stakeholders early
Next Steps
Learn about Trend & Seasonality!
Tip: Patience in experimentation prevents costly mistakes!