A/B Testing Guide: Statistical Methods for Product Decisions

Key Takeaways

A/B testing replaces opinion with evidence. Done correctly, it tells you whether a change to your product causes a measurable improvement. Done incorrectly, it produces statistically meaningless results that make bad decisions look scientific.

Core Statistical Concepts

Null hypothesis: The change has no effect. Your test tries to gather evidence to reject this.

p-value: The probability of seeing a result at least this extreme if the null hypothesis were true. p < 0.05 means that, if the change truly had no effect, a difference this large would occur less than 5% of the time. A p-value is NOT the probability your result is correct, nor the probability that the null hypothesis is true.

Statistical significance: Result is significant when p < alpha (usually 0.05). The difference is unlikely to be random — but may still be too small to matter practically.

Statistical power: The probability of detecting a real effect when one exists. Standard power is 0.80 (80% chance of detecting a true effect). Underpowered tests miss real effects.
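What alpha means can be checked empirically. The sketch below (illustrative parameters, not from the original) runs many A/A tests, where the null hypothesis is true by construction, and counts how often the chi-squared test declares significance; the rate should land near alpha:

```python
import numpy as np
from scipy.stats import chi2_contingency

# A/A simulation (illustrative parameters): both groups share the same
# true 5% conversion rate, so the null hypothesis is true by construction.
rng = np.random.default_rng(42)
alpha = 0.05
n_users = 5000
n_tests = 2000
false_positives = 0

for _ in range(n_tests):
    a = rng.binomial(n_users, 0.05)  # conversions in group A
    b = rng.binomial(n_users, 0.05)  # conversions in group B
    table = [[a, n_users - a], [b, n_users - b]]
    _, p, _, _ = chi2_contingency(table)
    if p < alpha:
        false_positives += 1

# Should land near alpha (slightly below it, since the default Yates
# continuity correction is conservative)
print(f"False positive rate: {false_positives / n_tests:.3f}")
```

In other words, alpha is the false positive rate you sign up for: even with no real effect, about 1 in 20 fixed-horizon tests will look "significant".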

Sample Size Calculation

from scipy.stats import norm
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha/2)  # 1.96 for alpha=0.05
    z_beta = norm.ppf(power)          # 0.842 for power=0.80
    p2 = baseline_rate + mde
    p_pooled = (baseline_rate + p2) / 2
    # Factor of 2: the comparison involves two independent groups
    n = 2 * (z_alpha + z_beta)**2 * p_pooled * (1 - p_pooled) / mde**2
    return int(np.ceil(n))

# 5% baseline, want to detect 1% absolute lift
n = sample_size(0.05, 0.01)
print(f"Need {n} users per group")  # ~8,159

If the required sample size takes too long to collect, reconsider: either the effect you are trying to detect is too small to be worth testing for, or you need to accept lower statistical power (higher false negative risk).
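To see how quickly sample size grows, a self-contained sketch (using the standard two-group pooled-variance approximation; the baseline and MDE values are illustrative) sweeps the minimum detectable effect. Because n scales with 1/mde², halving the MDE roughly quadruples the required sample:

```python
from scipy.stats import norm
import numpy as np

def sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    # Standard pooled-variance approximation for two proportions
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p2 = baseline_rate + mde
    p_pooled = (baseline_rate + p2) / 2
    # Factor of 2: the comparison involves two independent groups
    n = 2 * (z_alpha + z_beta)**2 * p_pooled * (1 - p_pooled) / mde**2
    return int(np.ceil(n))

# Smaller effects are quadratically more expensive to detect
for mde in (0.02, 0.01, 0.005):
    print(f"MDE {mde:.3f}: {sample_size(0.05, mde):,} users per group")
```

This is why "just detect any improvement" is not a workable test plan: a 0.5% absolute lift on a 5% baseline needs roughly four times the traffic of a 1% lift.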

Analyzing Results

from scipy.stats import chi2_contingency

# Two-proportion z-test via chi-squared
control_conv, control_n = 450, 10000
treatment_conv, treatment_n = 520, 10000

contingency = [[control_conv, control_n - control_conv],
               [treatment_conv, treatment_n - treatment_conv]]
chi2, p_value, _, _ = chi2_contingency(contingency, correction=False)  # correction=False matches the two-proportion z-test

ctrl_rate = control_conv / control_n
trt_rate = treatment_conv / treatment_n
lift = (trt_rate - ctrl_rate) / ctrl_rate * 100

print(f"Control: {ctrl_rate:.3f}, Treatment: {trt_rate:.3f}")
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

Common Mistakes

Peeking: checking results repeatedly and stopping as soon as p < 0.05. This dramatically inflates the false positive rate; fix the sample size up front and run to completion.

Underpowered tests: with too few users, real effects are likely to go undetected.

Confusing statistical with practical significance: at large sample sizes, a lift too small to matter can still produce p < 0.05.

Stopping too early: tests shorter than two full business cycles confound day-of-week effects and novelty bias with the treatment.
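Peeking (checking a running test repeatedly and stopping at the first "significant" result) can be demonstrated with a simulation sketch. Even when both arms are identical, stopping at the first of ten interim looks to cross p < 0.05 fires far more often than 5% of the time; all parameters below are hypothetical:

```python
import numpy as np

# A/A tests with repeated interim looks (hypothetical parameters):
# both arms share a true 5% conversion rate, so every "win" is false.
rng = np.random.default_rng(0)
n_experiments = 500
max_n = 10000
peeks = range(1000, max_n + 1, 1000)  # look every 1,000 users per group
stopped_early = 0

for _ in range(n_experiments):
    a = rng.random(max_n) < 0.05
    b = rng.random(max_n) < 0.05
    for n in peeks:
        p1, p2 = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        # Two-proportion z-test at this interim look
        if se > 0 and abs(p2 - p1) / se > 1.96:
            stopped_early += 1  # declared "significant" and stopped
            break

# Well above the nominal 5% -- peeking inflates false positives
print(f"False positive rate with peeking: {stopped_early / n_experiments:.2f}")
```

The fix is the one this guide already prescribes: pre-calculate the sample size and analyze once, at the end (or use a sequential testing method designed for interim looks).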

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance (p < 0.05) means the difference between control and treatment is unlikely to be due to random chance — specifically, there is less than a 5% probability of seeing this large a difference if the null hypothesis (no effect) were true.

How do I calculate sample size for an A/B test?

Sample size depends on baseline conversion rate, minimum detectable effect (smaller effects require more samples), significance level (alpha, 0.05), and statistical power (0.80). Never start a test without calculating sample size — stopping when results look good dramatically inflates false positive rates.

How long should an A/B test run?

Run until you reach the pre-calculated sample size, regardless of interim results. At minimum, run for two full business cycles (usually 2 weeks) to account for day-of-week effects and novelty bias.

What is the difference between p-value and confidence interval?

A p-value gives a binary significant/not-significant judgment. A confidence interval gives the range of plausible values for the true effect size (e.g., 'the true conversion lift is between 0.5% and 2.3%'). Confidence intervals are more useful for business decisions because they convey effect magnitude, not just statistical significance.
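As an illustration, a confidence interval for the analysis example earlier in this guide (450/10,000 control vs 520/10,000 treatment, the same illustrative data) can be sketched with a Wald (normal-approximation) interval for the difference in proportions:

```python
import numpy as np
from scipy.stats import norm

control_conv, control_n = 450, 10000
treatment_conv, treatment_n = 520, 10000

p1 = control_conv / control_n
p2 = treatment_conv / treatment_n
diff = p2 - p1

# Wald (normal-approximation) standard error of the difference
se = np.sqrt(p1 * (1 - p1) / control_n + p2 * (1 - p2) / treatment_n)
z = norm.ppf(0.975)  # 1.96 for a 95% interval
lo, hi = diff - z * se, diff + z * se

print(f"Absolute lift: {diff:.4f}")
print(f"95% CI: [{lo:.4f}, {hi:.4f}]")
# An interval that excludes 0 is significant at alpha = 0.05
```

Here the interval excludes zero, so the lift is significant at the 5% level, and its lower bound is the worst plausible case to plug into the business decision.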


Note: Information reflects early 2026.


Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies.