Day 4 of 5
⏱ ~60 minutes
Statistics for Data Science — Day 4

Regression Analysis — Finding Relationships in Data

Build linear and logistic regression models, interpret coefficients, and understand what makes a regression result meaningful vs spurious.

Regression Analysis

Regression is the workhorse of data analysis. It answers the fundamental question: what is the relationship between this variable and that outcome? Linear regression predicts continuous outcomes. Logistic regression predicts the probability of a binary outcome. Together they cover a large share of real-world prediction problems.

Linear Regression

Python — Linear regression with sklearn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

np.random.seed(42)
n = 500

# Generate synthetic house price data
sqft = np.random.uniform(800, 4000, n)
bedrooms = np.random.randint(1, 6, n)
age = np.random.uniform(0, 50, n)
price = 100 + 0.15 * sqft + 20 * bedrooms - 1.5 * age + np.random.normal(0, 30, n)

X = np.column_stack([sqft, bedrooms, age])
y = price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred):.0f}K")
print(f"Coefficients: sqft={model.coef_[0]:.2f}, beds={model.coef_[1]:.2f}, age={model.coef_[2]:.2f}")

Interpreting Regression Coefficients

The coefficient for sqft (say, 0.15) means: holding all else constant, each additional square foot is associated with a $150 increase in price. This "all else constant" assumption is critical — and it's where most misinterpretations happen.

⚠️
Correlation is not causation. Regression finds association, not cause. Ice cream sales and drowning rates are correlated because both rise in summer; eating more ice cream won't change anyone's drowning risk.

Logistic Regression for Classification

Python — Logistic regression (churn prediction)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Simulate churn data
np.random.seed(42)
n = 1000
tenure = np.random.uniform(1, 60, n)
monthly_charge = np.random.uniform(20, 120, n)
support_calls = np.random.poisson(2, n)
churn_prob = 1 / (1 + np.exp(-(-2 + 0.05 * monthly_charge - 0.03 * tenure + 0.3 * support_calls)))
churn = np.random.binomial(1, churn_prob)

X = np.column_stack([tenure, monthly_charge, support_calls])
X_train, X_test, y_train, y_test = train_test_split(X, churn, test_size=0.2, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")
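Logistic regression coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: e.g. an odds ratio of 1.3 for support_calls means each extra call multiplies the odds of churn by roughly 1.3, holding the other features fixed. A sketch, regenerating the synthetic churn data above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regenerate the synthetic churn data from the example above
np.random.seed(42)
n = 1000
tenure = np.random.uniform(1, 60, n)
monthly_charge = np.random.uniform(20, 120, n)
support_calls = np.random.poisson(2, n)
churn_prob = 1 / (1 + np.exp(-(-2 + 0.05 * monthly_charge - 0.03 * tenure + 0.3 * support_calls)))
churn = np.random.binomial(1, churn_prob)
X = np.column_stack([tenure, monthly_charge, support_calls])

clf = LogisticRegression(max_iter=1000).fit(X, churn)

# exp() converts log-odds coefficients to odds ratios
odds_ratios = np.exp(clf.coef_[0])
for name, ratio in zip(["tenure", "monthly_charge", "support_calls"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio below 1 (here, tenure) means the feature is associated with lower churn odds; above 1 (monthly_charge, support_calls) means higher churn odds. Note that sklearn applies L2 regularization by default, so these estimates are slightly shrunk relative to the true simulation coefficients.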

Day 4 Exercise

Build and Interpret a Regression Model

  1. Find or create a dataset with a continuous outcome (prices, scores)
  2. Build a linear regression model with at least 3 features
  3. Interpret each coefficient in plain English
  4. Check model assumptions: residual plot (should be random noise)
  5. Build a logistic regression on a binary outcome in the same dataset
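
For step 4, a residual check can be sketched like this. The example deliberately fits a straight line to quadratic data so the residual plot shows a clear U-shape, the classic signature of a missing nonlinear term (the plot is saved to a hypothetical file, residuals.png; use plt.show() interactively):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data with a nonlinear term the linear model cannot capture
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 2, 300)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Residuals vs fitted values: random scatter around zero is healthy;
# any curve or funnel shape means the model is misspecified
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.savefig("residuals.png")
```

With an intercept in the model, residuals always average exactly zero on the training data, so "mean residual near zero" is not the check; the check is whether the scatter is patternless.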

Day 4 Summary

  • Linear regression: continuous outcome — interpret coefficients as "per-unit change"
  • Logistic regression: binary outcome — outputs probability, not raw prediction
  • R² measures variance explained; AUC-ROC measures classification quality
  • Always check residuals — patterns mean your model is missing something
  • Regression finds associations, not causes — be precise in how you communicate results