Key Takeaways
- Calculate sample size before starting: Before running a test, calculate the required sample size based on your baseline conversion rate and minimum detectable effect. Stopping early because results look significant inflates false positive rates dramatically.
- p-value is not the probability you are right: A p-value of 0.05 means that if there were truly no effect, you would see a difference at least this large only 5% of the time. It does NOT mean you have 95% confidence the treatment is better.
- Run tests for full business cycles: Run for at least two full weeks (or two business cycles) to account for day-of-week effects and novelty bias. New UI elements get extra attention simply because they are new.
- Analyze segments after significance: A flat 0% overall lift can mask a +5% lift on mobile and a -5% drop on desktop that cancel out. Always segment post-test results by device, geography, and acquisition channel, and treat segment-level findings as hypotheses to retest, since checking many segments inflates false positive risk.
A/B testing replaces opinion with evidence. Done correctly, it tells you whether a change to your product causes a measurable improvement. Done incorrectly, it produces statistically meaningless results that make bad decisions look scientific.
Core Statistical Concepts
Null hypothesis: The change has no effect. Your test tries to gather evidence to reject this.
p-value: The probability of seeing a result at least this extreme if the null hypothesis were true. p < 0.05 means a difference this large would arise by random chance less than 5% of the time if there were no real effect. A p-value is NOT the probability your result is correct.
Statistical significance: Result is significant when p < alpha (usually 0.05). The difference is unlikely to be random — but may still be too small to matter practically.
Statistical power: The probability of detecting a real effect when one exists. Standard power is 0.80 (80% chance of detecting a true effect). Underpowered tests miss real effects.
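The meaning of alpha can be checked empirically with an A/A-style simulation: when there is no true effect, roughly alpha of tests still come out "significant" (often slightly below 0.05 here, because the chi-squared test applies a continuity correction). This is a sketch; the group size, number of trials, and seed are arbitrary:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
alpha = 0.05
n_tests, n_users, true_rate = 2000, 5000, 0.05  # identical rate in both groups

false_positives = 0
for _ in range(n_tests):
    # Both "variants" draw from the same conversion rate: any significant
    # result is a false positive by construction.
    a = rng.binomial(n_users, true_rate)
    b = rng.binomial(n_users, true_rate)
    table = [[a, n_users - a], [b, n_users - b]]
    _, p, _, _ = chi2_contingency(table)
    if p < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_tests:.3f}")  # close to alpha
```

If this simulated rate came out far from alpha, it would point to a broken test procedure, which is exactly what a real A/A test checks for.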
Sample Size Calculation
```python
from scipy.stats import norm
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = norm.ppf(power)           # 0.842 for power=0.80
    p2 = baseline_rate + mde
    p_pooled = (baseline_rate + p2) / 2
    # Factor of 2: this is the size of EACH of the two groups
    n = 2 * (z_alpha + z_beta)**2 * p_pooled * (1 - p_pooled) / mde**2
    return int(np.ceil(n))

# 5% baseline, want to detect 1% absolute lift
n = sample_size(0.05, 0.01)
print(f"Need {n} users per group")  # ~8,159
```
If the required sample size takes too long to collect, reconsider: either the effect you are trying to detect is too small to be worth testing for, or you need to accept lower statistical power (higher false negative risk).
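The cost of accepting lower power can be made concrete by evaluating the sample-size formula at several power levels. This is a sketch using the standard pooled-variance approximation (with a factor of 2 because there are two groups; conventions vary across calculators), and `per_group_n` is just an illustrative helper name:

```python
from scipy.stats import norm
import numpy as np

def per_group_n(baseline, mde, alpha=0.05, power=0.80):
    # Pooled-variance approximation for a two-proportion test
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    p_bar = baseline + mde / 2  # average of baseline and baseline + mde
    return int(np.ceil(2 * z**2 * p_bar * (1 - p_bar) / mde**2))

for power in (0.70, 0.80, 0.90):
    n = per_group_n(0.05, 0.01, power=power)
    print(f"power={power:.2f}: {n:,} users per group")  # higher power, larger n
```

Dropping from 0.80 to 0.70 power shrinks the required sample meaningfully, but it means a 30% chance of missing a real effect of that size.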
Analyzing Results
```python
from scipy.stats import chi2_contingency

# Two-proportion test via chi-squared on a 2x2 contingency table
control_conv, control_n = 450, 10000
treatment_conv, treatment_n = 520, 10000

contingency = [[control_conv, control_n - control_conv],
               [treatment_conv, treatment_n - treatment_conv]]
chi2, p_value, _, _ = chi2_contingency(contingency)

ctrl_rate = control_conv / control_n
trt_rate = treatment_conv / treatment_n
lift = (trt_rate - ctrl_rate) / ctrl_rate * 100

print(f"Control: {ctrl_rate:.3f}, Treatment: {trt_rate:.3f}")
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")
```
Common Mistakes
- Peeking at results early: Stopping as soon as results look significant. Each peek is effectively a separate test, compounding false positive rate.
- Too many simultaneous variations: Each additional variant requires proportionally more traffic and increases false positive risk.
- Not accounting for novelty effects: New UI gets extra attention because it is new. Wait 2+ weeks for behavior to stabilize.
- Ignoring pre-experiment bias: Run an A/A test (same experience for both groups) to verify your randomization and measurement system is working correctly before running a real experiment.
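The damage from peeking can be simulated directly: run many A/A experiments with no real difference, test at several interim points, and count how often any look crosses p < 0.05. This is a sketch; the peek schedule, sizes, and seed are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments, n_users, rate = 1000, 10000, 0.05
peeks = [2000, 4000, 6000, 8000, 10000]  # look every 2,000 users per group

ever_significant = 0
for _ in range(n_experiments):
    # Both groups share the same true rate, so every "win" is a false positive
    a = rng.random(n_users) < rate
    b = rng.random(n_users) < rate
    ca, cb = np.cumsum(a), np.cumsum(b)
    for n in peeks:
        p1, p2 = ca[n - 1] / n, cb[n - 1] / n
        pooled = (ca[n - 1] + cb[n - 1]) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0 and 2 * norm.sf(abs(p2 - p1) / se) < 0.05:
            ever_significant += 1  # stopped early on a spurious result
            break

print(f"False positive rate with 5 peeks: {ever_significant / n_experiments:.3f}")
```

With five looks, the chance of at least one spurious "significant" result lands well above the nominal 5%, which is why the stopping rule must be fixed before the test starts (or a proper sequential-testing correction applied).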
Frequently Asked Questions
What is statistical significance in A/B testing?
Statistical significance (p < 0.05) means the difference between control and treatment is unlikely to be due to random chance: specifically, there is less than a 5% probability of seeing a difference at least this large if the null hypothesis (no effect) were true.
How do I calculate sample size for an A/B test?
Sample size depends on the baseline conversion rate, the minimum detectable effect (smaller effects require more samples), the significance level (alpha, typically 0.05), and statistical power (typically 0.80). Never start a test without calculating sample size: stopping when results look good dramatically inflates false positive rates.
How long should an A/B test run?
Run until you reach the pre-calculated sample size, regardless of interim results. At minimum, run for two full business cycles (usually 2 weeks) to account for day-of-week effects and novelty bias.
What is the difference between p-value and confidence interval?
A p-value gives a binary significant/not-significant judgment. A confidence interval gives the range of plausible values for the true effect size (e.g., 'the true conversion lift is between 0.5% and 2.3%'). Confidence intervals are more useful for business decisions because they convey effect magnitude, not just statistical significance.
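A confidence interval for the lift is straightforward to compute alongside the p-value. This sketch reuses the example counts from the analysis section above and uses the standard Wald interval for a difference of two independent proportions:

```python
from scipy.stats import norm
import numpy as np

control_conv, control_n = 450, 10000
treatment_conv, treatment_n = 520, 10000

p1, p2 = control_conv / control_n, treatment_conv / treatment_n
diff = p2 - p1
# Wald standard error for the difference of two independent proportions
se = np.sqrt(p1 * (1 - p1) / control_n + p2 * (1 - p2) / treatment_n)
z = norm.ppf(0.975)  # 1.96 for a 95% interval
lo, hi = diff - z * se, diff + z * se

print(f"Lift: {diff:.4f}, 95% CI: ({lo:.4f}, {hi:.4f})")
# -> Lift: 0.0070, 95% CI: (0.0010, 0.0130)
```

Here the interval says the true absolute lift is plausibly anywhere from about 0.1 to 1.3 percentage points, which is far more actionable than "significant at p < 0.05".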