Day 1 of 5
⏱ ~60 minutes
Statistics for Data Science — Day 1

Descriptive Statistics — Summarizing Data You Can Trust

Learn the core descriptive statistics — mean, median, variance, percentiles — and when to use each one without misleading your audience.

Why Statistics Still Matters in the Age of AI

Large language models can generate analysis, but they can't replace the judgment to know when the numbers are lying. Statistics is that judgment. It's the difference between a data scientist who ships accurate models and one who ships confident-sounding wrong answers.

This course teaches statistics the way it connects to real data work — not as abstract math, but as thinking tools.

Measures of Central Tendency

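Mean, median, and mode each describe "the middle" of a dataset, but each captures something different. The mean is the arithmetic average, the median is the middle value of the sorted data, and the mode is the most frequent value: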

Python — Descriptive stats with numpy/pandas
import numpy as np
import pandas as pd

data = [12, 15, 14, 10, 13, 14, 100]  # 100 is an outlier

print(f"Mean:   {np.mean(data):.1f}")    # 25.4 — pulled by outlier
print(f"Median: {np.median(data):.1f}")  # 14.0 — robust
print(f"Mode:   {pd.Series(data).mode()[0]}")  # 14

# With pandas describe()
s = pd.Series(data)
print(s.describe())
# count    7.000000
# mean    25.428571
# std     32.923...
# min     10.000000
# 25%     12.500000
# 50%     14.000000  <- median
# 75%     14.500000
# max    100.000000
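
To see how much the single outlier drives each statistic, here is a minimal sketch that recomputes the mean and median with and without the value 100. The names `trimmed`, `mean_shift`, and `median_shift` are illustrative and not part of the lesson's code.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 14, 100])
trimmed = data[data != 100]  # same data with the single outlier dropped

# How far each statistic moves when the outlier is included
mean_shift = np.mean(data) - np.mean(trimmed)
median_shift = np.median(data) - np.median(trimmed)

print(f"Mean:   {np.mean(trimmed):.1f} -> {np.mean(data):.1f}")      # 13.0 -> 25.4
print(f"Median: {np.median(trimmed):.1f} -> {np.median(data):.1f}")  # 13.5 -> 14.0
```

One extreme value nearly doubles the mean, while the median moves by half a unit.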

Measures of Spread

Variance measures the average squared distance from the mean. Standard deviation is its square root, so it is expressed in the same units as the data. The interquartile range (IQR) is the distance between the 25th and 75th percentiles; because it ignores the tails, it is robust to outliers.

Python — Spread metrics
import numpy as np

data = np.array([12, 15, 14, 10, 13, 14, 100])

print(f"Variance: {np.var(data, ddof=1):.1f}")  # 1084.0 (ddof=1 for a sample)
print(f"Std dev:  {np.std(data, ddof=1):.1f}")  # 32.9
print(f"IQR:      {np.percentile(data, 75) - np.percentile(data, 25):.1f}")  # 2.0

# Rule of thumb: a value more than 1.5*IQR below Q1 or above Q3 is an outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr)]
print(f"Outliers: {outliers}")  # [100]
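
The `ddof=1` argument can look like magic, so here is a small sketch that computes variance by hand from the definition (average squared distance from the mean) and checks it against NumPy. The names `squared_dev`, `pop_var`, and `sample_var` are illustrative.

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 14, 100])
n = len(data)

squared_dev = (data - data.mean()) ** 2   # squared distance of each point from the mean

pop_var = squared_dev.sum() / n           # population variance (ddof=0, NumPy's default)
sample_var = squared_dev.sum() / (n - 1)  # sample variance (ddof=1, pandas' default)

# The hand-rolled values agree with NumPy's built-ins
assert np.isclose(pop_var, np.var(data))
assert np.isclose(sample_var, np.var(data, ddof=1))

print(f"Population variance: {pop_var:.1f}")     # 929.1
print(f"Sample variance:     {sample_var:.1f}")  # 1084.0
```

Note that NumPy defaults to `ddof=0` while pandas defaults to `ddof=1`, which is why `np.std(data)` without arguments will not match the `std` row from `.describe()`.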

When to Use Each Statistic

💡 The golden rule: Before reporting any average, look at the distribution. A mean salary of $75K could mean everyone makes $75K, or it could mean 90% make $30K and 10% make $480K.
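
As a quick illustration of the rule, the sketch below builds two hypothetical 100-person companies with the same mean salary. The figures are made up and chosen so that both means come out to exactly $75K; only the median reveals the difference.

```python
import numpy as np

# Hypothetical 100-person companies, salaries in thousands of dollars
uniform = np.full(100, 75)                 # everyone earns 75
skewed = np.array([30] * 90 + [480] * 10)  # 90 people earn 30, 10 people earn 480

for name, salaries in [("uniform", uniform), ("skewed", skewed)]:
    print(f"{name:8s} mean={np.mean(salaries):.1f}  median={np.median(salaries):.1f}")
# uniform  mean=75.0  median=75.0
# skewed   mean=75.0  median=30.0
```

Reporting the mean alone would make these two companies look identical.
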
Day 1 Exercise
Describe a Real Dataset
  1. Find a dataset on Kaggle or load a seaborn sample dataset (sns.load_dataset)
  2. Run .describe() on all numeric columns
  3. Identify columns where the mean and median differ substantially. Why do they differ?
  4. Find outliers using the 1.5 × IQR rule
  5. Write a 3-sentence data summary for a non-technical audience
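
One possible starting point for steps 2 to 4 is sketched below. The helper `summarize_numeric` and the small DataFrame are hypothetical and exist only for illustration; swap in whichever dataset you picked in step 1.

```python
import pandas as pd

def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: mean, median, their gap, and the 1.5 x IQR outlier count."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame: True where a value falls outside the 1.5 x IQR fences
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "mean_minus_median": numeric.mean() - numeric.median(),
        "n_outliers": is_outlier.sum(),
    })

# Made-up data for illustration only
df = pd.DataFrame({
    "score": [12, 15, 14, 10, 13, 14, 100],
    "hours": [1, 2, 3, 4, 5, 6, 7],
    "label": list("abcdefg"),  # non-numeric column, ignored by the helper
})
summary = summarize_numeric(df)
print(summary)
```

A large `mean_minus_median` together with a nonzero `n_outliers` is a hint that the column is skewed and worth plotting before you summarize it.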

Day 1 Summary

  • The mean is sensitive to outliers; the median is robust, so prefer the median for skewed data
  • Standard deviation and IQR both measure spread; IQR is more robust
  • pandas .describe() gives you 8 key statistics in one line
  • Outlier detection: values above Q3 + 1.5×IQR or below Q1 - 1.5×IQR
  • Always look at the distribution before reporting a single-number summary