Day 3 of 5
⏱ ~60 minutes
Statistics for Data Science — Day 3

Hypothesis Testing — Making Decisions with Data

Understand null hypotheses, p-values, t-tests, and how to run rigorous A/B tests that produce trustworthy conclusions.

Hypothesis Testing: Making Reliable Decisions

A/B tests. Drug trials. Feature experiments. They all use hypothesis testing. The framework is simple: state what you'd expect by chance (the null hypothesis), measure what actually happened, and calculate how surprised you should be.

The Framework

  1. Null hypothesis (H₀): No effect. "The change made no difference."
  2. Alternative hypothesis (H₁): There is an effect. "The change improved conversion."
  3. Significance level (α): How much false-positive risk you accept. Usually 0.05.
  4. p-value: Probability of seeing results this extreme if H₀ is true.
  5. Decision: If p-value < α, reject H₀. Otherwise, fail to reject H₀.
⚠️
p-value ≠ probability the null is true. p = 0.03 means: "If there were truly no effect, we'd see results this extreme 3% of the time." It does not mean there's a 97% chance the effect is real.
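One way to internalize this: simulate many experiments where the null is true by construction and count how often p falls below 0.05. The sketch below (seed and group sizes are illustrative) draws both groups from the same distribution, so every "significant" result is a false positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000

# Both groups come from the SAME distribution: the null is true by construction
p_values = []
for _ in range(n_experiments):
    a = rng.normal(loc=50, scale=15, size=200)
    b = rng.normal(loc=50, scale=15, size=200)
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

false_positive_rate = np.mean(np.array(p_values) < 0.05)
print(f"Fraction of p < 0.05 under a true null: {false_positive_rate:.3f}")
# Hovers near 0.05 — exactly the false-positive rate that α promises
```

Under a true null, p-values are uniformly distributed, so about α of them land below α no matter how many experiments you run.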

Two-Sample t-Test in Python

Python — A/B test analysis
from scipy import stats
import numpy as np

np.random.seed(42)

# Control group: old checkout flow
control = np.random.normal(loc=50, scale=15, size=200)  # avg $50 order

# Treatment group: new checkout flow
treatment = np.random.normal(loc=54, scale=15, size=200)  # avg $54 order

# Two-sample t-test (independent groups, unequal variances)
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

print(f"Control mean:   ${control.mean():.2f}")
print(f"Treatment mean: ${treatment.mean():.2f}")
print(f"Difference:     ${treatment.mean() - control.mean():.2f}")
print(f"t-statistic:    {t_stat:.3f}")
print(f"p-value:        {p_value:.4f}")
print(f"Significant:    {p_value < 0.05}")

# Effect size (Cohen's d) — use the sample std (ddof=1), not NumPy's default population std
pooled_std = np.sqrt((control.std(ddof=1)**2 + treatment.std(ddof=1)**2) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std
print(f"Cohen's d:      {cohens_d:.3f}")  # 0.2 = small, 0.5 = medium, 0.8 = large
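As a sanity check, the Welch t-statistic is simple enough to compute by hand: it is the difference in means divided by the combined standard error. A sketch, regenerating the same simulated groups as above:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
control = np.random.normal(loc=50, scale=15, size=200)
treatment = np.random.normal(loc=54, scale=15, size=200)

# Welch's t: difference in means over the combined standard error,
# using sample variances (ddof=1) scaled by each group's size
se = np.sqrt(control.var(ddof=1) / len(control)
             + treatment.var(ddof=1) / len(treatment))
# scipy subtracts in argument order (first minus second), so match that here
t_manual = (control.mean() - treatment.mean()) / se

t_scipy, _ = stats.ttest_ind(control, treatment, equal_var=False)
print(f"manual t: {t_manual:.3f}, scipy t: {t_scipy:.3f}")
```

Matching the library's output by hand is a good habit: it confirms you understand what the test actually measures, not just how to call it.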

Statistical Power and Sample Size

Python — Required sample size
import math

from statsmodels.stats.power import TTestIndPower

# How many samples do I need to detect a meaningful effect?
analysis = TTestIndPower()
n = analysis.solve_power(
    effect_size=0.3,    # minimum detectable effect (Cohen's d)
    alpha=0.05,         # significance level
    power=0.80          # 80% chance of detecting the effect if it exists
)
print(f"Required sample size per group: {math.ceil(n)}")  # always round up
💡
Calculate sample size before the experiment, not after. Stopping an A/B test early when it looks significant is p-hacking — it inflates your false positive rate substantially.
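The inflation from peeking is easy to demonstrate. This sketch (checkpoint sizes and seed are illustrative) runs experiments where the null is true, but an impatient analyst checks the p-value at several interim points and declares victory at the first "significant" one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 500
checkpoints = [50, 100, 150, 200]  # the analyst "peeks" at these sample sizes

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(50, 15, size=200)  # null is true: same distribution
    b = rng.normal(50, 15, size=200)
    # Stop at the first checkpoint where the interim p-value dips below 0.05
    if any(stats.ttest_ind(a[:n], b[:n], equal_var=False).pvalue < 0.05
           for n in checkpoints):
        false_positives += 1

print(f"False positive rate with peeking: {false_positives / n_experiments:.3f}")
# Well above the nominal 0.05, even though nothing changed between groups
```

Each peek is another chance for noise to cross the threshold, so four looks at the data buy you far more than a 5% false-positive rate.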
Day 3 Exercise
Run a Complete A/B Test Analysis
  1. Simulate control and treatment groups (or use real data)
  2. Run a two-sample t-test and interpret the p-value correctly
  3. Calculate Cohen's d to quantify the effect size
  4. Calculate what sample size you would have needed for 80% power
  5. Write a 4-sentence summary for a product manager: result, confidence, recommendation

Day 3 Summary

  • p-value is the probability of seeing your data if the null is true — not the other way around
  • α = 0.05 means you accept a 5% false positive rate — choose based on business stakes
  • Always calculate sample size before starting — stopping early invalidates the test
  • Cohen's d quantifies practical significance; p-value only says "probably not random"
  • Statistical significance ≠ practical significance — a 0.01% lift can be significant but meaningless