7 A/B Testing Mistakes That Invalidate Your Results
Most teams are running A/B tests wrong. These 7 mistakes silently corrupt your data and lead to decisions that hurt conversion rates instead of helping them.
A/B testing looks simple: show half your visitors version A, the other half version B, pick the winner. But between the concept and the execution lies a long list of ways to get invalid results—and make worse decisions than if you’d never tested at all.
Here are the seven mistakes we see most often, and how to avoid each one.
1. Stopping the Test When You See Significance
This is the most common mistake, and it’s insidious because it feels responsible. You check results daily, you see 95% significance, you stop the test. You’ve been statistically rigorous—haven’t you?
No. If you check results every day and stop at the first sign of significance, your actual false positive rate is much higher than 5%. Research by Optimizely and others has shown that daily "peeking" can push the real false positive rate above 25%: even when there is no true difference between variants, roughly one test in four will hand you a "winner" you happened to stop at the right moment.
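You can watch this happen with a quick simulation. The sketch below runs A/A tests in which both arms share the same true 5% conversion rate, then "peeks" with a standard two-proportion z-test after each daily batch of traffic; the traffic volumes are illustrative assumptions, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(42)
N_TESTS, DAYS, VISITORS_PER_DAY, P = 2000, 30, 500, 0.05

stopped_as_winner = 0
for _ in range(N_TESTS):
    # Cumulative conversions per variant after each daily batch
    conv_a = rng.binomial(VISITORS_PER_DAY, P, DAYS).cumsum()
    conv_b = rng.binomial(VISITORS_PER_DAY, P, DAYS).cumsum()
    n = VISITORS_PER_DAY * np.arange(1, DAYS + 1)

    # Two-proportion z-statistic at each daily "peek"
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_a - conv_b) / n / np.where(se > 0, se, np.inf)

    # Stop at the first peek that crosses |z| > 1.96 (p < 0.05)
    if (np.abs(z) > 1.96).any():
        stopped_as_winner += 1

# Both arms are identical, so anything above ~5% is peeking inflation
print(f"A/A tests declared significant: {stopped_as_winner / N_TESTS:.1%}")
```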
Fix: Calculate your required sample size before the test. Don’t look at significance until you’ve hit that threshold. If you must check interim results, use sequential testing methods designed for it.
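For the sample-size calculation itself, here's a back-of-the-envelope sketch using the standard two-proportion formula; the baseline rate and minimum detectable effect are illustrative assumptions you'd replace with your own numbers.

```python
from scipy.stats import norm

def required_sample_size(p_base, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift of mde_rel."""
    p_var = p_base * (1 + mde_rel)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return int((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2) + 1

# e.g. 3% baseline conversion, aiming to detect a 10% relative lift
print(required_sample_size(0.03, 0.10))  # ≈ 53,000 visitors per variant
```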
2. Running Multiple Variants Without Adjusting for Multiple Comparisons
Testing A vs B vs C vs D vs E sounds efficient. You’re testing four variants in the time it takes to test one. The problem: every comparison you make has a chance of a false positive, and those chances compound.
With four variants tested against a control at 95% significance per comparison, the chance that at least one of the four comparisons is a false positive is about 19% (1 − 0.95^4 ≈ 0.185). Add more variants and it gets worse fast.
Fix: Use the Bonferroni correction (divide your significance threshold by the number of comparisons) or test fewer variants at once. Two to three variants maximum. If you want to test many ideas, run sequential A/B tests, not one massive multi-variant experiment.
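To see how quickly the error compounds, and what the Bonferroni correction does about it, here's a small sketch (it assumes the comparisons are independent, which is a simplification):

```python
alpha = 0.05
for k in (1, 2, 4, 8):
    fwer = 1 - (1 - alpha) ** k          # chance of >= 1 false positive
    print(f"{k} comparisons: family-wise error ≈ {fwer:.1%}, "
          f"Bonferroni threshold = {alpha / k:.4f}")
```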
3. Testing Multiple Changes at Once
You change the headline, the hero image, the CTA button colour, and the form length—then declare the variant a winner. Great, but which change worked? You’ve learned nothing except that something was better.
This is fine if your only goal is lifting conversion now. But if you want to build institutional knowledge about what works for your audience, you need to isolate variables.
Fix: One significant change per test. If a variant wins, you know exactly what to attribute it to. If it loses, you know what to avoid. Over time, you build a tested playbook.
4. Not Accounting for Sample Ratio Mismatch
You set up a 50/50 split. You check the test three days in and notice variant A has 8,430 visitors while variant B has 7,980. That’s a 5% imbalance. Is that a problem?
It depends on your sample size. Tiny deviations are expected noise, but whether a given gap is plausible by chance is a statistical question, not a fixed percentage. At this volume, a 450-visitor gap is wildly unlikely to be random (p < 0.001 by a chi-squared test; see the check below) and points to a technical problem: bot filtering, caching issues, cookie problems, or a bug in the test setup. When traffic isn't split the way you configured it, your results are unreliable.
Fix: Always check your traffic split before analysing results. Don't eyeball a percentage threshold; run a chi-squared goodness-of-fit test on the observed counts against your expected split, and if the mismatch is significant, find and fix the cause before interpreting any results.
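Here's that check applied to the counts from the example above, using a standard chi-squared goodness-of-fit test with SciPy:

```python
from scipy.stats import chisquare

# Observed visitor counts from the example; expected split is 50/50
observed = [8430, 7980]
chi2, p_value = chisquare(observed)  # equal expected counts by default
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}")
# p ≈ 0.0004 -- far too unlikely to be chance, so debug the assignment
# mechanism before trusting any conversion numbers from this test.
```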
5. Ignoring Segment Effects
A variant wins on average, so you roll it out globally. But the win was entirely driven by new visitors—for returning customers, the variant actually hurt conversion by 12%. Now you’ve shipped a change that damaged your best segment.
Aggregate results hide segment effects all the time. A change that’s good for mobile might be bad for desktop. A change that helps paid traffic might alienate organic visitors. Rolling out based only on overall averages can cause net harm.
Fix: After a test concludes, segment the results by key dimensions: new vs. returning, mobile vs. desktop, traffic source, geography. If you see significant divergence, consider shipping the change only to the segments where it wins—or investigating why it’s hurting others before global rollout.
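As a sketch of what that segment pass can look like in Python, the snippet below groups results by segment and compares the two variants' conversion rates with a two-proportion z-test; the file name and column names are illustrative assumptions.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical export: one row per visitor, with columns
# variant ("A"/"B"), segment (e.g. "new"/"returning"), converted (0/1)
df = pd.read_csv("test_results.csv")

for segment, grp in df.groupby("segment"):
    counts = grp.groupby("variant")["converted"].agg(["sum", "count"])
    # Two-proportion z-test between the two variants in this segment
    _, p = proportions_ztest(counts["sum"], counts["count"])
    rates = (counts["sum"] / counts["count"]).round(4).to_dict()
    print(f"{segment}: conversion by variant = {rates}, p = {p:.4f}")
```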
6. Running Tests During Anomalous Periods
You launch a test on Black Friday. Your results come back in a week—massive winner! You roll it out. Conversions return to normal. What happened?
Holiday traffic behaves differently. Visitor intent, urgency, and demographics are all skewed. A variant that works great for deal-seeking holiday shoppers may be neutral or negative for your regular audience.
The same applies to product launches, PR spikes, outage recovery periods, major algorithm changes, and promotional periods.
Fix: Check your traffic patterns before starting a test. Avoid launching during known anomalies. If you have no choice (the test is time-sensitive), note the context and validate the result by running a follow-up test under normal conditions before shipping.
7. Using the Wrong Primary Metric
You test a new checkout flow and measure conversion rate (add-to-cart clicks). The variant wins by 15%. You ship it. Revenue per visitor drops 8%. Why?
Because more people added to cart, but fewer completed the purchase—and those who did convert bought cheaper items. Your proxy metric (add-to-cart) didn't correlate with the business outcome (revenue).
This happens constantly when teams test on micro-conversions: clicks, scroll depth, hover events, time on page. These are leading indicators, not outcomes.
Fix: Set your primary metric as the business outcome you actually care about: purchase completions, revenue per visitor, qualified lead submissions, or plan upgrades. Micro-conversions can be secondary metrics, but they should never determine the winner.
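One way to make revenue per visitor the deciding metric is a bootstrap confidence interval on the difference between variants, since revenue distributions are heavily skewed. The sketch below uses synthetic numbers that mirror the scenario above: variant B "converts" more visitors but earns less per visitor.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(rev_a, rev_b, n_boot=10_000, ci=0.95):
    """Bootstrap CI for mean(B) - mean(A) in revenue per visitor."""
    diffs = np.array([
        rng.choice(rev_b, size=len(rev_b)).mean()
        - rng.choice(rev_a, size=len(rev_a)).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(diffs, [(1 - ci) / 2, (1 + ci) / 2])

# Synthetic per-visitor revenue, zeros for non-buyers (illustrative):
# A converts fewer visitors but with bigger orders; B the reverse.
rev_a = rng.choice([0, 0, 0, 60, 90], size=5000)
rev_b = rng.choice([0, 0, 25, 30, 40], size=5000)

lo, hi = bootstrap_diff_ci(rev_a, rev_b)
print(f"Revenue/visitor difference (B - A), 95% CI: [{lo:.2f}, {hi:.2f}]")
# The interval is clearly negative: B "converts" more but earns less.
```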
A Clean Test Setup Checklist
Before launching any A/B test:
- Sample size calculated and documented
- Test duration set (don’t stop early)
- Only one meaningful change per variant
- Traffic split verified after 24 hours
- Primary metric is a business outcome
- No known anomalies (holidays, promos, launches) in test window
- Segment analysis plan defined before results are seen
Getting these right won’t guarantee wins—but it will guarantee that your results are real. That’s the foundation everything else is built on.