Learn the core descriptive statistics — mean, median, variance, percentiles — and when to use each one without misleading your audience.
Large language models can generate analysis, but they can't replace the judgment to know when the numbers are lying. Statistics is that judgment. It's the difference between a data scientist who ships accurate models and one who ships confident-sounding wrong answers.
This course teaches statistics the way it connects to real data work — not as abstract math, but as thinking tools.
The three ways to describe "the middle" of a dataset (mean, median, and mode) each capture something different:
import numpy as np
import pandas as pd
data = [12, 15, 14, 10, 13, 14, 100] # 100 is an outlier
print(f"Mean: {np.mean(data):.1f}") # 25.4 — pulled by outlier
print(f"Median: {np.median(data):.1f}") # 14.0 — robust
print(f"Mode: {pd.Series(data).mode()[0]}") # 14
# With pandas describe()
s = pd.Series(data)
print(s.describe())
# count 7.000000
# mean 25.428571
# std 32.923...
# min 10.000000
# 25% 12.500000
# 50% 14.000000 <- median
# 75% 14.500000
# max 100.000000
Variance measures the average squared distance from the mean. Standard deviation is its square root, so it is in the same units as the data. The interquartile range (IQR) is the distance between the 25th and 75th percentiles, and it is robust to outliers.
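To make these definitions concrete, the sketch below computes variance and standard deviation by hand and checks the results against NumPy. The variable names (`squared_devs`, `pop_var`, `sample_var`, `sample_std`) are illustrative and not part of the course code. The only library behavior assumed is that `np.var` divides by n by default (`ddof=0`) and by n - 1 when `ddof=1`.

```python
import math

import numpy as np

data = [12, 15, 14, 10, 13, 14, 100]
n = len(data)
mean = sum(data) / n

# Squared distance of each point from the mean
squared_devs = [(x - mean) ** 2 for x in data]
ss = sum(squared_devs)

pop_var = ss / n            # population variance: divide by n
sample_var = ss / (n - 1)   # sample variance: divide by n - 1
sample_std = math.sqrt(sample_var)

# NumPy defaults to ddof=0; pandas .var() and .std() default to ddof=1
assert np.isclose(pop_var, np.var(data))
assert np.isclose(sample_var, np.var(data, ddof=1))

print(f"Population variance: {pop_var:.1f}")
print(f"Sample variance: {sample_var:.1f}")
print(f"Sample std dev: {sample_std:.1f}")
```

Dividing by n - 1 rather than n gives a slightly larger value (about 1084.0 versus 929.1 here), which is why the `ddof` argument matters most for small samples.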
data = np.array([12, 15, 14, 10, 13, 14, 100])
print(f"Variance: {np.var(data, ddof=1):.1f}") # 1084.0 (ddof=1 gives the sample variance)
print(f"Std dev: {np.std(data, ddof=1):.1f}") # 32.9
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25):.1f}") # 2.0
# Rule of thumb: a point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged as an outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr)]
print(f"Outliers: {outliers}") # [100]
.describe() on all numeric columns
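The robustness of the median and IQR can be checked directly. This is a minimal sketch that recomputes each statistic with and without the value 100. The helper `summarize` is a hypothetical name introduced for illustration, not something from the course material.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 14, 100])
trimmed = data[data != 100]  # same data with the outlier removed


def summarize(values):
    """Return mean, median, sample std dev, and IQR for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values, ddof=1)),
        "iqr": float(q3 - q1),
    }


with_outlier = summarize(data)
without_outlier = summarize(trimmed)

# Print each statistic before and after dropping the outlier
for name in with_outlier:
    print(f"{name:>6}: {with_outlier[name]:6.2f} -> {without_outlier[name]:6.2f}")
```

Removing a single point moves the mean from about 25.4 to 13.0 and the standard deviation from about 32.9 to about 1.8, while the median only shifts from 14.0 to 13.5 and the IQR from 2.0 to 1.75. That gap is the practical meaning of "robust."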