A/B testing replaces opinion with evidence. Done correctly, it tells you whether a change to your product causes a measurable improvement. Done incorrectly, it produces statistically meaningless results that make bad decisions look scientific — and that is far more dangerous than having no data at all.
I have seen teams at federal agencies and tech companies make expensive mistakes because they "ran an A/B test" without understanding the math underneath it. This guide gives you the foundation to run experiments you can actually trust.
Key Takeaways
- Calculate sample size before starting. Stopping early because results look significant inflates false positive rates dramatically.
- p-value is not the probability you are right. A p-value of 0.05 means there is a 5% chance of seeing this large a difference if there is truly no effect.
- Run tests for full business cycles. At least two full weeks to account for day-of-week effects and novelty bias.
- Segment after significance. A 0% overall lift can mask a 5% lift in mobile and -5% in desktop. Always break results down.
Core Statistical Concepts
Every A/B test rests on the same four concepts. Understanding them is not optional — they determine whether your results mean anything.
Null Hypothesis
The change has no effect. Your test is an attempt to gather enough evidence to reject this assumption.
p-value
The probability of seeing a result at least this extreme if the null hypothesis were true. p < 0.05 means that, if the change truly had no effect, a difference this large or larger would appear less than 5% of the time. It is not the probability that your result is correct.
Statistical Significance
A result is significant when p < alpha (usually 0.05). The observed difference would be unlikely if there were no real effect, but it may still be too small to matter in practice.
Statistical Power
The probability of detecting a real effect when one exists. Standard power is 0.80 (80% chance of detecting a true effect). Underpowered tests miss real effects.
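To make the power definition concrete, here is a minimal sketch that estimates the power of a two-sided two-proportion z-test under the normal approximation. The function name `approx_power` and the example rates are illustrative assumptions, and the sketch uses only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference, using each group's own variance
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treatment * (1 - p_treatment) / n_per_group)
    z_effect = abs(p_treatment - p_control) / se
    # Probability the test statistic clears the critical value;
    # the negligible chance of significance in the wrong direction is ignored
    return NormalDist().cdf(z_effect - z_crit)

# Assumed scenario: 5% baseline, 6% true treatment rate
for n in (1000, 4000, 8000, 16000):
    print(f"n={n:>6} per group -> power ~ {approx_power(0.05, 0.06, n):.2f}")
```

Running it across a few sample sizes shows how quickly power falls off when a test is too small for the effect it is trying to detect.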
Sample Size Calculation
The single most common mistake in A/B testing is starting without calculating sample size. If you stop when results look significant, you are running as many tests as it takes to get lucky — which is not science.
```python
from scipy.stats import norm
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Users needed per group to detect an absolute lift of `mde` (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = norm.ppf(power)           # 0.842 for power=0.80
    p2 = baseline_rate + mde
    p_pooled = (baseline_rate + p2) / 2
    # Factor of 2: both groups contribute sampling variance to the difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_pooled * (1 - p_pooled) / mde ** 2
    return int(np.ceil(n))

# 5% baseline, want to detect 1% absolute lift
n = sample_size(0.05, 0.01)
print(f"Need {n} users per group")  # roughly 8,160
```
If the required sample size takes too long to collect, the effect you are trying to detect is probably too small to be worth testing for. Accept lower power or redesign the test.
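One way to judge whether a test is worth running is to turn the question around: given the traffic you can realistically collect, what is the smallest lift you could detect? The sketch below does that under the same normal approximation. The function name `detectable_effect` and the traffic figures are hypothetical, and it uses the baseline variance for both groups, which is a reasonable simplification only for small lifts.

```python
from math import sqrt
from statistics import NormalDist

def detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest absolute lift detectable with n_per_group users per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Baseline variance is used for both arms (assumption: the lift is small)
    return (z_alpha + z_beta) * sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)

# Hypothetical traffic: 2,000 users per group per week
for weeks in (2, 4, 8):
    n = 2000 * weeks
    print(f"{weeks} weeks ({n} per group): MDE ~ {detectable_effect(0.05, n):.4f} absolute")
```

If the detectable effect after a realistic run time is still larger than any lift you could plausibly expect, that is the signal to redesign the test.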
Analyzing Results
Once you have hit your sample size target, analysis is straightforward. A two-proportion z-test (or chi-squared) tells you whether the conversion rate difference is statistically significant.
```python
from scipy.stats import chi2_contingency

control_conv, control_n = 450, 10000
treatment_conv, treatment_n = 520, 10000

# 2x2 table: rows are groups, columns are converted / not converted
contingency = [[control_conv, control_n - control_conv],
               [treatment_conv, treatment_n - treatment_conv]]
chi2, p_value, _, _ = chi2_contingency(contingency)

ctrl_rate = control_conv / control_n
trt_rate = treatment_conv / treatment_n
lift = (trt_rate - ctrl_rate) / ctrl_rate * 100

print(f"Control: {ctrl_rate:.3f}, Treatment: {trt_rate:.3f}")
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")
```
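A significance flag alone does not say how large the lift plausibly is. As a complement, here is a sketch of a Wald confidence interval for the absolute difference in conversion rates, applied to the same counts as the analysis above. The function name `diff_confidence_interval` is an illustrative assumption, and the Wald interval is one simple choice among several; it relies on the normal approximation and works best with large samples.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the absolute difference in rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Unpooled standard error of the difference between two proportions
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(450, 10000, 520, 10000)
print(f"Absolute lift, 95% CI: [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero agrees with the significance test, and its width tells you how much uncertainty remains about the size of the effect.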
Common Mistakes That Invalidate Tests
Peeking & Stopping Early
Stopping the test as soon as results look significant. Each peek is effectively a new test — your true false positive rate compounds well above 5%.
Pre-Commit to Sample Size
Calculate the required sample size before you start. Run to that number regardless of how things look at interim check-ins. Ship only then.
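You can see the cost of peeking with a small simulation of A/A tests, where there is no real effect and every "win" is a false positive. This is a simplified sketch, not a model of any particular platform: it assumes equal-sized daily batches, so each day adds an independent standard normal increment to the running test statistic. The function name `false_positive_rate` and the simulation settings are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(n_looks, n_sims=5000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant at ANY of n_looks interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each equal-sized batch adds an independent N(0, 1) increment under the null
            running_sum += rng.gauss(0, 1)
            z = running_sum / sqrt(look)  # z-statistic on all data collected so far
            if abs(z) > z_crit:
                hits += 1  # the analyst stops here and ships a false positive
                break
    return hits / n_sims

print(f"One look at the end:  {false_positive_rate(1):.3f}")
print(f"Daily looks, 14 days: {false_positive_rate(14):.3f}")
```

With a single look, the rate stays near the nominal 5%. With a look every day and a rule of stopping at the first significant result, it climbs well above that.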
Beyond peeking, the other major mistake is running tests without accounting for novelty effects. New UI elements get extra attention simply because they are new. Wait at least two weeks for behavior to stabilize before reading results as representative of steady-state usage.
A final trap: testing too many variations simultaneously. Each additional variant requires proportionally more traffic and increases your false positive risk. Two variants is the default. Three is the maximum for most product teams.
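The false positive risk from extra variants is easy to quantify. The sketch below computes the chance of at least one false positive across several comparisons, along with the Bonferroni-adjusted threshold that compensates for it. It assumes the comparisons are independent, which is not strictly true when every variant shares one control group, so treat the numbers as a rough guide. The function names are illustrative.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_alpha(n_comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_comparisons

for variants in (1, 2, 3, 5):
    print(f"{variants} treatment variant(s) vs control: "
          f"family-wise error ~ {familywise_error_rate(variants):.3f}, "
          f"Bonferroni alpha = {bonferroni_alpha(variants):.4f}")
```

A stricter per-comparison threshold also raises the sample size each variant needs, which is the traffic cost described above.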
Frequently Asked Questions
What is statistical significance in A/B testing? Statistical significance (p < 0.05) means there is less than a 5% probability of seeing this large a difference if the null hypothesis (no effect) were true. It does not mean you have 95% confidence the treatment is better.
How long should an A/B test run? Run until you reach the pre-calculated sample size, regardless of interim results. At minimum, run for two full business cycles (usually 2 weeks) to account for day-of-week effects and novelty bias.
What is the difference between p-value and confidence interval? A p-value measures how surprising the observed difference would be if there were no effect; comparing it to alpha gives a binary significant/not-significant judgment. A confidence interval gives the range of plausible values for the true effect size. Confidence intervals are more useful for business decisions because they convey effect magnitude, not just statistical significance.
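The duration rule from the FAQ above can be turned into a quick planning calculation: take the days needed to reach the sample size, enforce the two-week floor, and round up to whole weeks so every weekday is equally represented. This is a sketch under assumed inputs; the function name `duration_days` and the traffic figures are hypothetical.

```python
from math import ceil

def duration_days(n_per_group, daily_users, n_groups=2, min_days=14):
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days_for_sample = ceil(n_per_group * n_groups / daily_users)
    days = max(days_for_sample, min_days)  # never shorter than the two-week floor
    return ceil(days / 7) * 7              # whole weeks cover each weekday equally

# Hypothetical inputs: 8,000 users per group, 900 eligible users per day
print(duration_days(8000, 900))  # 21
```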
Most teams run A/B tests wrong: the stopping problem is the whole game.
The statistical mechanics in this guide are correct, but the single failure mode that costs companies the most isn't sample size calculation. It's peeking. Analysts check dashboards daily, see a p-value cross 0.05 on day four of a fourteen-day test, and call it. With frequent enough checks, that practice can inflate false positive rates to 30–40%, not the nominal 5%. Optimizely's own research team documented this systematically a decade ago and it still happens everywhere.
The fix isn't willpower. It's sequential testing methods, such as group sequential designs or always-valid inference, or Bayesian updating with proper stopping rules, often paired with variance reduction techniques like CUPED (Controlled-experiment Using Pre-Experiment Data). Netflix, Booking.com, and Airbnb have all published details of internal A/B testing infrastructure that implements these. Booking.com processes over 1,000 concurrent experiments at any given time and has written publicly about how sequential testing and variance reduction cut their experiment durations in half without inflating false positive rates.
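CUPED itself is a variance reduction technique: it adjusts each user's metric using a covariate measured before the experiment started, which shrinks noise without shifting the mean. Here is a minimal sketch on synthetic data. The function names, the spend metric, and the data-generating numbers are all illustrative assumptions, and in practice the adjustment coefficient is typically estimated on data pooled across both groups.

```python
import random

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values: y - theta * (x - mean(x))."""
    n = len(metric)
    mean_x = sum(covariate) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in covariate) / (n - 1)
    theta = cov_xy / var_x  # regression slope of the metric on the covariate
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

def variance(values):
    """Sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Synthetic data: in-experiment spend correlates with pre-experiment spend
rng = random.Random(42)
pre = [rng.gauss(50, 10) for _ in range(5000)]
post = [0.8 * x + rng.gauss(0, 5) for x in pre]

adjusted = cuped_adjust(post, pre)
print(f"Variance before: {variance(post):.1f}, after CUPED: {variance(adjusted):.1f}")
```

Lower variance means the same effect can be detected with fewer users, which is how variance reduction shortens experiments. The stronger the correlation between the pre-experiment covariate and the metric, the larger the gain.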
For anyone learning experimentation for product or data roles, we'd argue the statistical theory matters less than building the discipline of pre-registration: write down your hypothesis, your primary metric, your sample size, and your stopping date before you touch the data. That habit alone separates analysts who generate actionable insights from those who generate confirmation bias with extra steps.