A/B Testing Guide 2026: Statistical Methods That Ship

How to run experiments that produce trustworthy results — sample sizes, p-values, and the mistakes that make data-driven teams look worse than gut-feel ones.

[Figure: Control vs. Treatment conversion across Weeks 1–3, p = 0.023]

Alpha threshold: 0.05
Standard power: 0.80
MOOC completion rate: 3–5%
Minimum test duration: 2 weeks

A/B testing replaces opinion with evidence. Done correctly, it tells you whether a change to your product causes a measurable improvement. Done incorrectly, it produces statistically meaningless results that make bad decisions look scientific — and that is far more dangerous than having no data at all.

I have seen teams at federal agencies and tech companies make expensive mistakes because they "ran an A/B test" without understanding the math underneath it. This guide gives you the foundation to run experiments you can actually trust.

01

Core Statistical Concepts

Every A/B test rests on the same four concepts. Understanding them is not optional — they determine whether your results mean anything.

H₀

Null Hypothesis

The change has no effect. Your test is an attempt to gather enough evidence to reject this assumption.

Your test tries to disprove this
p

p-value

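The probability of seeing a result at least this extreme if the null hypothesis were true. p < 0.05 means that, in a world where the change has no effect, a difference this large would show up less than 5% of the time. It is not the probability that your result is correct, and it is not the probability that the result was caused by chance.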

Lower is stronger evidence
α

Statistical Significance

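A result is significant when p < alpha (usually 0.05). Alpha is the false positive rate you agree to tolerate before the test starts. A significant difference is hard to explain by random variation alone, but it may still be too small to matter practically.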

Significance ≠ importance
1 − β

Statistical Power

The probability of detecting a real effect when one exists, equal to 1 − β, where β is the false negative rate. Standard power is 0.80 (an 80% chance of detecting a true effect of the size you planned for). Underpowered tests miss real effects.

0.80 is the standard target
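To make the meaning of alpha concrete, here is a minimal simulation sketch. It runs many A/A tests (both groups share the same true conversion rate, so the null hypothesis is true by construction) and counts how often a pooled two-proportion z-test flags a difference at p < 0.05. The helper name `two_proportion_p_value`, the traffic numbers, and the random seed are illustrative assumptions, not part of the original guide.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(np.abs(z))

rng = np.random.default_rng(42)
n_per_group, true_rate, n_sims, alpha = 5_000, 0.05, 10_000, 0.05

# A/A test: both groups have the same true rate, so any "win" is a false positive.
conv_a = rng.binomial(n_per_group, true_rate, size=n_sims)
conv_b = rng.binomial(n_per_group, true_rate, size=n_sims)
p_values = two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group)

false_positive_rate = np.mean(p_values < alpha)
print(f"Share of A/A tests flagged significant: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05. That is what alpha controls: how often you will be fooled when there is nothing to find.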
02

Sample Size Calculation

The single most common mistake in A/B testing is starting without calculating sample size. If you stop when results look significant, you are running as many tests as it takes to get lucky — which is not science.

sample_size.py
Python
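from scipy.stats import norm
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Users needed per group for a two-sided two-proportion test.

    mde is the absolute lift to detect (0.01 = one percentage point).
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha=0.05 (two-sided)
    z_beta = norm.ppf(power)           # 0.842 for power=0.80
    p2 = baseline_rate + mde
    p_pooled = (baseline_rate + p2) / 2
    # Factor of 2: the variance of a difference between two groups
    # is the sum of both groups' variances.
    n = 2 * (z_alpha + z_beta) ** 2 * p_pooled * (1 - p_pooled) / mde ** 2
    return int(np.ceil(n))

# 5% baseline, want to detect 1% absolute lift
n = sample_size(0.05, 0.01)
print(f"Need {n} users per group")  # ~8,160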

If the required sample size takes too long to collect, the effect you are trying to detect is probably too small to be worth testing for. Target a larger minimum detectable effect, accept lower power, or redesign the test.
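A quick way to judge whether a sample size "takes too long" is to convert it into calendar days. The sketch below is a hypothetical helper (`days_to_run` is not from the original code). It assumes every daily user enters the experiment, traffic is split evenly across groups, and the two-week minimum recommended in this guide applies.

```python
import math

def days_to_run(n_per_group, daily_users, n_groups=2, min_days=14):
    """Days needed to reach the target sample size, never below min_days."""
    days_for_sample = math.ceil(n_per_group * n_groups / daily_users)
    return max(days_for_sample, min_days)

# Illustrative traffic levels for a test that needs 10,000 users per group
print(days_to_run(n_per_group=10_000, daily_users=800))    # 25
print(days_to_run(n_per_group=10_000, daily_users=5_000))  # 14
```

In the second case the sample fills in four days, but the two-week floor still applies so that day-of-week and novelty effects can wash out.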

03

Analyzing Results

Once you have hit your sample size target, analysis is straightforward. A two-proportion z-test (or chi-squared) tells you whether the conversion rate difference is statistically significant.

ab_analysis.py
Python
from scipy.stats import chi2_contingency

# Conversions and total users in each group
control_conv,  control_n  = 450, 10000
treatment_conv, treatment_n = 520, 10000

# 2x2 table: rows are groups, columns are [converted, did not convert]
contingency = [[control_conv,  control_n  - control_conv],
               [treatment_conv, treatment_n - treatment_conv]]
# chi2_contingency applies Yates' continuity correction to 2x2 tables by default
chi2, p_value, _, _ = chi2_contingency(contingency)

ctrl_rate = control_conv  / control_n
trt_rate  = treatment_conv / treatment_n
lift = (trt_rate - ctrl_rate) / ctrl_rate * 100  # relative lift, in percent

print(f"Control: {ctrl_rate:.3f}, Treatment: {trt_rate:.3f}")
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")
04

Common Mistakes That Invalidate Tests

× Wrong

Peeking & Stopping Early

Stopping the test as soon as results look significant. Each peek is effectively a new test — your true false positive rate compounds well above 5%.

✓ Right

Pre-Commit to Sample Size

Calculate the required sample size before you start. Run to that number regardless of how things look at interim check-ins. Ship only then.
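The cost of peeking is easy to see in a simulation. The sketch below runs A/A tests (no true difference) over 14 days and compares two policies: reading the result once at the end, and stopping the first time any daily check shows p < 0.05. All traffic numbers and the seed are illustrative assumptions, and the test used is a pooled two-proportion z-test.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_days, users_per_day, rate, alpha = 5_000, 14, 500, 0.05, 0.05

# Daily conversions for two identical arms (an A/A test, so H0 is true).
daily_a = rng.binomial(users_per_day, rate, size=(n_sims, n_days))
daily_b = rng.binomial(users_per_day, rate, size=(n_sims, n_days))

# Cumulative totals as a dashboard would show them at the end of each day.
cum_a = daily_a.cumsum(axis=1)
cum_b = daily_b.cumsum(axis=1)
cum_n = users_per_day * np.arange(1, n_days + 1)  # users per arm so far

p_pool = (cum_a + cum_b) / (2 * cum_n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / cum_n)
z = (cum_b - cum_a) / cum_n / se
p_values = 2 * norm.sf(np.abs(z))

fpr_fixed = np.mean(p_values[:, -1] < alpha)            # one look, on day 14
fpr_peeking = np.mean((p_values < alpha).any(axis=1))   # stop at first "win"

print(f"False positive rate, single look: {fpr_fixed:.3f}")
print(f"False positive rate, daily peeks: {fpr_peeking:.3f}")
```

The single-look rate should sit near the nominal 5%, while the daily-peek rate comes out several times higher even though nothing about the product changed.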

Beyond peeking, the other major mistake is running tests without accounting for novelty effects. New UI elements get extra attention simply because they are new. Wait at least two weeks for behavior to stabilize before reading results as representative of steady-state usage.

A final trap: testing too many variations simultaneously. Each additional variant requires proportionally more traffic and increases your false positive risk. Two variants is the default. Three is the maximum for most product teams.
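To see how quickly false positive risk grows with extra variants, the sketch below computes the chance of at least one false positive across k comparisons, alongside the Bonferroni-adjusted per-comparison threshold. Bonferroni is one common correction that I am introducing here as an assumption; the guide does not prescribe a specific method. The family-wise formula treats comparisons as independent, which is only an approximation when all variants share one control group.

```python
def family_wise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_comparisons

for k in (1, 2, 3, 5):
    fwer = family_wise_error(0.05, k)
    adjusted = bonferroni_alpha(0.05, k)
    print(f"{k} comparisons: family-wise error {fwer:.3f}, Bonferroni alpha {adjusted:.4f}")
```

With three treatment variants against one control, the uncorrected risk of a spurious winner is roughly 14%, and the corrected threshold drops to about 0.0167 per comparison. A stricter threshold in turn raises the sample size each variant needs.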

05

Frequently Asked Questions

What is statistical significance in A/B testing? A result is statistically significant at p < 0.05 when there is less than a 5% probability of seeing a difference at least this large if the null hypothesis (no effect) were true. It does not mean there is a 95% probability that the treatment is better.

How long should an A/B test run? Run until you reach the pre-calculated sample size, regardless of interim results. At minimum, run for two full business cycles (usually 2 weeks) to account for day-of-week effects and novelty bias.

What is the difference between p-value and confidence interval? A p-value measures how surprising your data would be if there were no effect; comparing it to alpha yields a binary significant/not-significant call. A confidence interval gives the range of plausible values for the true effect size. Confidence intervals are more useful for business decisions because they convey effect magnitude, not just statistical significance.
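As an illustration, here is a minimal sketch of a confidence interval for the absolute difference between two conversion rates, using the same counts as the analysis example earlier in this guide (450 of 10,000 versus 520 of 10,000). It uses the simple Wald (normal approximation) interval, which is my choice for brevity; other interval methods exist and behave better at small samples or extreme rates. The function name is hypothetical.

```python
import math
from scipy.stats import norm

def diff_confidence_interval(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Wald confidence interval for (treatment rate - control rate)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Unpooled standard error: each group contributes its own variance.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(450, 10_000, 520, 10_000)
print(f"95% CI for absolute lift: {low * 100:.2f} to {high * 100:.2f} percentage points")
```

For these counts the interval runs from roughly 0.1 to 1.3 percentage points. The result is significant because the interval excludes zero, but the true lift could plausibly be anywhere from negligible to substantial, which is exactly the information a bare p-value hides.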

The Verdict
A/B testing done right is one of the most powerful tools in product development. But the math is not optional. Calculate your sample size before you start, run to completion, and segment your results. Anything less is theater, not science.

Our Take

Most teams run A/B tests wrong — the stopping problem is the whole game.

The statistical mechanics in this guide are correct, but the single failure mode that costs companies the most isn't sample size calculation — it's peeking. Analysts check dashboards daily, see a p-value cross 0.05 on day four of a fourteen-day test, and call it. That practice inflates false positive rates to 30–40%, not the nominal 5%. Optimizely's own research team documented this systematically a decade ago and it still happens everywhere.

The fix isn't willpower. It's sequential testing methods (group-sequential designs or always-valid p-values) or Bayesian updating with proper stopping rules, often paired with variance reduction techniques like CUPED (Controlled-experiment Using Pre-Experiment Data). Netflix, Booking.com, and Airbnb have all written publicly about A/B testing infrastructure that implements these. Booking.com runs over 1,000 concurrent experiments at any given time and has written publicly about how sequential testing and variance reduction cut their experiment durations in half without inflating false positive rates.
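For readers who have not met CUPED, here is a minimal sketch of the core adjustment on synthetic data. The idea is to subtract the part of each user's in-experiment metric that is predictable from the same user's pre-experiment behavior, which shrinks variance without shifting the mean. The function name, the synthetic spend numbers, and the seed are all assumptions for illustration. In a real experiment theta is typically estimated on data pooled across control and treatment.

```python
import numpy as np

def cuped_adjust(metric, pre_metric):
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x))."""
    theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

rng = np.random.default_rng(1)
pre = rng.normal(100, 20, size=10_000)              # synthetic pre-experiment spend
post = 0.8 * pre + rng.normal(0, 10, size=10_000)   # correlated in-experiment spend

adjusted = cuped_adjust(post, pre)
variance_ratio = adjusted.var(ddof=1) / post.var(ddof=1)

print(f"Mean before: {post.mean():.2f}, after: {adjusted.mean():.2f}")
print(f"Variance remaining after adjustment: {variance_ratio:.0%}")
```

Lower variance means the same effect can be detected with fewer users, which is where the shorter experiment durations come from. The size of the gain depends entirely on how strongly the pre-experiment covariate correlates with the metric.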

For anyone learning experimentation for product or data roles, we'd argue the statistical theory matters less than building the discipline of pre-registration: write down your hypothesis, your primary metric, your sample size, and your stopping date before you touch the data. That habit alone separates analysts who generate actionable insights from those who generate confirmation bias with extra steps.

Published By

Precision AI Academy

Practitioner-focused AI education · 2-day in-person bootcamp in 5 U.S. cities

Precision AI Academy publishes deep-dives on applied AI engineering for working professionals. Founded by Bo Peng (Kaggle Top 200) who leads the in-person bootcamp in Denver, NYC, Dallas, LA, and Chicago.
