Day 2 of 5
⏱ ~60 minutes
Statistics for Data Science — Day 2

Probability Fundamentals — Thinking in Likelihoods

Build intuition for probability: distributions, conditional probability, Bayes' theorem, and how probability underpins every AI model.

Probability: The Language of Uncertainty

Every machine learning model outputs a probability — whether it shows it to you or not. Classification confidence scores, recommendation rankings, anomaly detection thresholds — they're all probability statements. Understanding probability means understanding what your model is actually saying.

Basic Probability Rules

For two events A and B, the core rules are:

  • Addition rule: P(A or B) = P(A) + P(B) − P(A and B)
  • Multiplication rule (independent events): P(A and B) = P(A) × P(B)
  • Conditional probability: P(A|B) = P(A and B) / P(B)

Python — Basic probability simulation
import numpy as np

np.random.seed(42)
n = 100_000  # simulate many trials

# P(two fair coins both heads)
coins = np.random.choice([0, 1], size=(n, 2))
p_both_heads = np.mean(coins.sum(axis=1) == 2)
print(f"P(HH): {p_both_heads:.4f}")  # ~0.25

# Conditional: P(second head | first head)
first_head = coins[coins[:, 0] == 1]
p_second_given_first = np.mean(first_head[:, 1] == 1)
print(f"P(H2|H1): {p_second_given_first:.4f}")  # ~0.50 (independent)
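The addition rule can be checked the same way. This sketch simulates a fair six-sided die; the choice of events (even roll, roll of at least 4) is illustrative:

```python
import numpy as np

np.random.seed(42)
n = 100_000
rolls = np.random.randint(1, 7, size=n)  # fair six-sided die

A = rolls % 2 == 0  # event A: roll is even
B = rolls >= 4      # event B: roll is at least 4

lhs = np.mean(A | B)                            # P(A or B), directly
rhs = np.mean(A) + np.mean(B) - np.mean(A & B)  # P(A) + P(B) - P(A and B)
print(f"P(A or B): {lhs:.4f}  via addition rule: {rhs:.4f}")  # both ~0.6667
```

The two numbers agree exactly, because inclusion-exclusion holds trial by trial, not just in expectation.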

Bayes' Theorem

Bayes' theorem is how you update a probability estimate when you get new evidence: P(A|B) = P(B|A) × P(A) / P(B)

Python — Medical test example
# Medical test for rare disease
# Disease prevalence: 1% of population
# Test accuracy: 99% true positive rate, 1% false positive rate

p_disease = 0.01          # prior
p_positive_given_disease = 0.99  # sensitivity
p_positive_given_healthy = 0.01  # false positive rate

# P(positive test)
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_healthy * (1 - p_disease))

# Bayes: P(disease | positive test)
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"P(disease | positive test): {p_disease_given_positive:.1%}")
# ~50% — even with a 99% accurate test, only 50% of positives are true
# Because the disease is rare (1%)
ℹ️
This is the base rate fallacy — human intuition ignores the prior (how rare the disease is). Bayes forces you to account for it. This matters enormously for fraud detection, spam filtering, and medical diagnosis models.
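To see how strongly the prior drives the answer, this sketch reruns the same calculation at several base rates (the prevalence values are illustrative, not from the lesson):

```python
sens, fpr = 0.99, 0.01  # same test: 99% sensitivity, 1% false positive rate
priors = [0.001, 0.01, 0.10, 0.50]  # illustrative prevalences

for prior in priors:
    p_pos = sens * prior + fpr * (1 - prior)  # total probability of a positive
    posterior = sens * prior / p_pos          # Bayes' theorem
    print(f"prior {prior:>5.1%} -> P(disease | positive) = {posterior:.1%}")
# Posteriors: 9.0%, 50.0%, 91.7%, 99.0%
```

Same test, wildly different conclusions: at 0.1% prevalence a positive result is still probably a false alarm, while at 50% prevalence it is near-certain.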

Key Distributions

  • Normal: bell curve, symmetric. Good for heights, measurement errors.
  • Binomial: count of successes in N trials. Good for click-through rates, pass/fail.
  • Poisson: count of events per time period. Good for arrivals, bug rates, rare events.

Python — Distribution sampling
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
np.random.seed(42)

axes[0].hist(np.random.normal(0, 1, 1000), bins=30)
axes[0].set_title('Normal(μ=0, σ=1)')

axes[1].hist(np.random.binomial(100, 0.3, 1000), bins=30)
axes[1].set_title('Binomial(n=100, p=0.3)')

axes[2].hist(np.random.poisson(5, 1000), bins=30)
axes[2].set_title('Poisson(λ=5)')

plt.tight_layout()
plt.savefig('distributions.png')

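A quick sanity check on samples like those above: each distribution's sample mean should land near its theoretical mean. A sketch using the same parameters:

```python
import numpy as np

np.random.seed(42)
normal = np.random.normal(0, 1, 100_000)
binom = np.random.binomial(100, 0.3, 100_000)
pois = np.random.poisson(5, 100_000)

# Theoretical means: Normal mu = 0, Binomial n*p = 30, Poisson lambda = 5
print(f"Normal:   sample mean {normal.mean():.3f}  (theory 0)")
print(f"Binomial: sample mean {binom.mean():.3f}  (theory 30)")
print(f"Poisson:  sample mean {pois.mean():.3f}  (theory 5)")
```

With 100,000 samples each, the sample means sit within a few standard errors of theory; this is a cheap way to catch a mis-parameterized simulation.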
Day 2 Exercise
Apply Bayes to a Business Problem
  1. Define a business scenario with a rare positive event (fraud: 0.5%, churn: 5%)
  2. Assume a detection model with 90% sensitivity and 10% false positive rate
  3. Calculate P(true positive | model flags positive) using Bayes' theorem
  4. Simulate the same calculation with 10,000 trials in Python
  5. Explain the result to a non-technical stakeholder in 2 sentences
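As a starting point, here is a sketch of steps 2–4 using the fraud numbers (0.5% prevalence, 90% sensitivity, 10% false positive rate); the variable names are illustrative:

```python
import numpy as np

p_fraud = 0.005  # prior: 0.5% of transactions are fraudulent
sens = 0.90      # P(flag | fraud)
fpr = 0.10       # P(flag | legitimate)

# Exact Bayes calculation
p_flag = sens * p_fraud + fpr * (1 - p_fraud)
posterior = sens * p_fraud / p_flag
print(f"Exact P(fraud | flagged):     {posterior:.1%}")  # ~4.3%

# Simulation with 10,000 trials
np.random.seed(42)
n = 10_000
is_fraud = np.random.rand(n) < p_fraud
flagged = np.where(is_fraud,
                   np.random.rand(n) < sens,   # frauds flagged at sensitivity
                   np.random.rand(n) < fpr)    # legit flagged at FPR
sim = is_fraud[flagged].mean()
print(f"Simulated P(fraud | flagged): {sim:.1%}")
```

Note the punchline for step 5: even a model that catches 90% of fraud produces mostly false alarms when fraud itself is rare.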

Day 2 Summary

  • Conditional probability P(A|B) is the foundation of machine learning predictions
  • Bayes' theorem updates beliefs based on new evidence — always include the prior
  • Base rate fallacy: ignoring how rare an event is leads to wildly overconfident predictions
  • Normal, Binomial, Poisson cover most real-world count and measurement data
  • Every ML model output is a probability statement — statistics tells you what it means