Avoiding Peeking
Prevent false positives from early stopping
What You'll Learn
- Why peeking is problematic
- Inflated false positive rates
- Sequential testing methods
- When to stop experiments
- Proper analysis practices
The Peeking Problem
What is peeking? Checking results mid-experiment and stopping early if they look significant
Why is it tempting?
- Want results fast
- See significance, assume it's real
- Business pressure
The problem: Dramatically increases false positive rate!
Example: with α = 0.05, the nominal false positive rate is 5%. With peeking, it can climb to 20-30%!
Why Peeking Inflates Error Rates
Multiple testing problem: Each peek is a test!
Analogy: flip a coin 20 times and the chance of at least one head is far higher than the 50% chance on a single flip; every extra look gives randomness another chance to "hit"
In A/B testing: check 10 times during an experiment and there is roughly a 26% chance of a false positive somewhere!
Simulation: run an A/A test (no real difference), peek 10 times, stop if p<0.05 → a "winner" is declared about 26% of the time
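A minimal sketch of that A/A simulation in Python; the schedule (10 equally spaced peeks, 200 new users per arm between peeks) and the t-test are assumed illustrative choices, and the exact rate you get depends on them:

```python
# Simulate the peeking problem: no true difference, stop at the first p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_peeks, n_per_peek = 2000, 10, 200
false_positives = 0

for _ in range(n_sims):
    # Both arms come from the same distribution: a true A/A test.
    a = rng.normal(size=n_peeks * n_per_peek)
    b = rng.normal(size=n_peeks * n_per_peek)
    for k in range(1, n_peeks + 1):
        n = k * n_per_peek
        _, p = ttest_ind(a[:n], b[:n])
        if p < 0.05:              # "significant" at this peek -> stop early
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
```

The estimated rate lands well above the nominal 5%, which is the whole point: the data never changed, only the number of looks did.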
Random Early Patterns
Small samples fluctuate:
- Day 1: Treatment up 15% (p=0.04) 🎉
- Day 3: Treatment up 8% (p=0.10)
- Day 7: Treatment up 2% (p=0.45)
- Day 14: Treatment down 1% (p=0.68)
If stopped on Day 1: False positive!
Reality: No real effect, just noise
The Cost of Peeking
False positives: Implement changes that don't work
- Waste engineering resources
- Potential harm to metrics
- Loss of credibility
Example: peek at Day 2 and see a 10% improvement, so you launch the feature. After full rollout, the actual impact is -2%!
Better: Wait for planned sample size
Sequential Testing Methods
Alternative to a fixed sample: look at the data early BUT adjust for it
1. Alpha spending: allocate the significance level across peeks
2. Optimizely Stats Engine: uses sequential testing, so you can peek without inflation
3. Bayesian methods: posterior probabilities don't have the same peeking problem
4. Confidence sequences: always-valid confidence intervals
Alpha Spending Functions
Idea: Spend α budget across multiple looks
Example (simplified):
- Look 1: Use α = 0.01
- Look 2: Use α = 0.02
- Look 3: Use α = 0.02
- Total budget: α = 0.05 (in practice, the exact thresholds come from the chosen spending function)
Methods:
- O'Brien-Fleming (conservative early)
- Pocock (equal spending)
Allows early stopping while controlling error rate
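A small sketch of the two spending functions named above, using the standard Lan-DeMets approximations; the three looks at 33%, 67%, and 100% of the data are an assumed schedule:

```python
# Cumulative alpha spent by information fraction t, for two common
# spending functions (Lan-DeMets approximations).
import math
from scipy.stats import norm

def obrien_fleming_spend(t, alpha=0.05):
    # Spends very little alpha early, saving most for the final look.
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    # Spends alpha much more evenly across looks.
    return alpha * math.log(1 + (math.e - 1) * t)

looks = [0.33, 0.67, 1.0]  # assumed: peek at 1/3, 2/3, and all of the data
for spend in (obrien_fleming_spend, pocock_spend):
    cumulative = [spend(t) for t in looks]
    incremental = [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(looks))
    ]
    print(spend.__name__, [round(a, 4) for a in incremental])
```

Note how the O'Brien-Fleming schedule spends almost nothing at the first look, matching the "conservative early" description, while Pocock spreads the budget more evenly.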
When Can You Stop Early?
Scenario 1: Overwhelming evidence. The effect is so large it's undeniable.
Example: MDE = 2%, but you're seeing a 15% improvement. You can likely stop early.
But: use a sequential testing framework!
Scenario 2: Harm detected. A guardrail metric is severely violated → stop for safety!
Scenario 3: No chance of significance. A futility analysis shows the test can't reach significance → stop to save resources.
Sample Ratio Mismatch (SRM)
What it is: Unequal group sizes when expecting 50/50
Red flag for:
- Randomization issues
- Tracking bugs
- Bot traffic
Check: Chi-square test on group sizes
Example: expecting 5000/5000 but observing 5234/4766 → investigate before analyzing!
Don't analyze if SRM present!
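A quick sketch of the chi-square SRM check for the counts in the example above; the p < 0.001 alert threshold is a common convention, not a universal rule:

```python
# Chi-square goodness-of-fit test of observed group sizes against a 50/50 split.
from scipy.stats import chisquare

observed = [5234, 4766]
expected = [5000, 5000]
stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
if p_value < 0.001:
    print("Likely SRM: investigate before analyzing results")
```

For these counts the p-value is far below 0.001, so the mismatch is almost certainly not random noise and the experiment needs investigation before any readout.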
Proper Analysis Workflow
1. Pre-register plan (see the sketch after this list)
- Sample size
- Duration
- Primary metric
- Analysis method
2. Monitor health metrics
- SRM check
- Data quality
- Technical issues
3. Wait for the planned end: don't peek at the primary metric!
4. Analyze once, at the predetermined time
5. Make the decision based on pre-registered criteria
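To make step 1 concrete, here is a hypothetical pre-registration record sketched as a plain Python dict; every field name and value is illustrative, not a standard schema:

```python
# A hypothetical pre-registered experiment plan, written down before launch.
experiment_plan = {
    "name": "checkout_button_color",       # made-up experiment
    "primary_metric": "conversion_rate",
    "mde": 0.02,                            # minimum detectable effect
    "alpha": 0.05,
    "power": 0.80,
    "sample_size_per_arm": 25_000,          # from the power calculation
    "planned_duration_days": 14,
    "analysis_method": "two-sample proportion z-test at day 14",
    "guardrail_metrics": ["latency_p95", "error_rate"],
}
```

Writing this down somewhere shared before launch is what makes "analyze once, at the predetermined time" enforceable.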
What You CAN Monitor
Safe to check:
✓ Sample size accumulation: are we getting traffic?
✓ Technical metrics: errors, crashes, load time
✓ Sample ratio: is the 50/50 split maintained?
✓ Secondary/guardrail metrics: is anything breaking?
✗ Primary metric: wait until the end!
Bayesian Approach
Alternative framework: Posterior probability instead of p-values
Advantages:
- Can peek without inflation
- More intuitive interpretation
- Incorporate prior knowledge
Example: "95% probability that B > A" vs "p = 0.03" (frequentist)
Tools:
- VWO
- Google Optimize (used a Bayesian engine; now discontinued)
Trade-off: Need to specify priors
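A minimal Beta-Binomial sketch of "probability that B beats A" via Monte Carlo draws from each arm's posterior; the conversion counts are made up, and Beta(1, 1) is one possible (flat) prior choice:

```python
# Posterior probability that variant B's conversion rate exceeds A's.
import numpy as np

rng = np.random.default_rng(42)
conv_a, n_a = 480, 5000   # hypothetical control conversions / visitors
conv_b, n_b = 520, 5000   # hypothetical treatment conversions / visitors

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each arm.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
print(f"P(B > A) = {(post_b > post_a).mean():.3f}")
```

The output reads directly as "probability that B is better than A", which is the more intuitive statement quoted above.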
Multiple Comparisons
Testing multiple variants: A vs B vs C vs D
Problem: More comparisons = more false positives
Solution: Bonferroni correction, which divides α by the number of comparisons
Example: with 3 comparisons and α = 0.05, adjust to α = 0.05/3 ≈ 0.017 per test
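A tiny sketch of that adjustment; the three p-values for B, C, and D against the control are hypothetical:

```python
# Bonferroni: compare each p-value against alpha divided by the number of tests.
p_values = {"B vs A": 0.012, "C vs A": 0.030, "D vs A": 0.045}
alpha = 0.05
threshold = alpha / len(p_values)   # 0.05 / 3 ≈ 0.017, as in the example
for name, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict} at adjusted alpha {threshold:.3f}")
```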
Or: Use Bayesian ranking
Segmentation Analysis
After finding overall effect: Can explore segments (gender, device, etc.)
But:
- Don't choose segments based on where effect looks big!
- Adjust for multiple comparisons
- Report as exploratory
Valid: Pre-specified: "We'll check mobile vs desktop"
Invalid: Post-hoc: "Effect only in Android users in California!"
Real-World Pressure
Common scenarios:
CEO: "Can we peek at results?" You: "We can check data quality, but looking at primary metric invalidates the test"
PM: "It's been 3 days, what do you see?" You: "Need 2 more weeks for valid results"
Engineer: "Should we ship this?" You: "Not until experiment completes"
Set expectations upfront!
Statistical Monitoring Dashboard
Create dashboard with:
Green (safe to monitor):
- Sample size progress
- Sample ratio (SRM check)
- Data quality checks
- Guardrail metrics
Red (don't look):
- Primary metric results
- P-values
- Treatment vs control comparison
Alternatives to Early Stopping
If you need speed:
1. Larger MDE: power the test only for bigger effects, which needs less data (but you'll miss smaller improvements)
2. Higher traffic allocation: route more of your users into the experiment (note that an unbalanced 90/10 treatment/control split actually reduces power versus 50/50)
3. Sequential testing: use proper methods that allow peeking
4. Tiered testing: test on a subset first, then scale
Practice Exercise
Scenario: running a 2-week A/B test
- Day 3: Treatment +8%, p=0.04
- Day 7: Treatment +3%, p=0.18
- Day 14: Treatment +5%, p=0.06
Questions:
- Should you stop on Day 3?
- What if you did? What's the risk?
- What should you do instead?
- What if a guardrail metric crashed on Day 5?
Answers:
- No! Need full sample size
- High risk of false positive (20-30%)
- Wait for Day 14, use pre-registered criteria
- Stop immediately for guardrail violation
Key Takeaways
1. Peeking inflates false positives: 5% → 20-30%
2. Set the sample size upfront and stick to the plan
3. If you must peek, use sequential testing methods
4. Monitor technical health, not the primary metric
5. Business pressure is real: educate stakeholders early
Next Steps
Learn about Trend & Seasonality!
Tip: Patience in experimentation prevents costly mistakes!