Day 4 of 5
⏱ ~60 minutes
R Programming in 5 Days — Day 4

Statistical Analysis & Modeling

R's core strength is statistical analysis. Today you apply hypothesis tests, build regression models, and use the broom package to work with model outputs in a tidy, consistent way.

Hypothesis Testing

t.test(x, y) tests whether two groups have different means. prop.test() compares proportions, chisq.test() tests independence of categorical variables, and cor.test() tests whether a correlation differs significantly from zero. Each returns the test statistic, p-value, confidence interval, and degrees of freedom. R's printed output is verbose and human-readable but not tidy — that is what broom is for.
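The tests above all follow the same calling pattern. A quick sketch of the three besides t.test(), using small made-up vectors and the built-in mtcars data (the treatment/outcome values are illustrative, not from a real study):

```r
# chisq.test(): independence of two categorical variables
tbl <- table(
  treatment = c("A", "A", "B", "B", "A", "B", "A", "B"),
  outcome   = c("yes", "no", "yes", "yes", "no", "no", "yes", "yes")
)
chisq.test(tbl)  # warns about small expected counts on toy data

# prop.test(): compare two proportions (successes x out of trials n)
prop.test(x = c(45, 30), n = c(100, 100))

# cor.test(): is a correlation significantly different from zero?
cor.test(mtcars$mpg, mtcars$wt)
```

Each returns an object of class "htest", so they all print the same way and all work with broom::tidy().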

Linear Regression with lm()

lm(y ~ x1 + x2, data = df) fits a linear model. summary(model) shows coefficients, standard errors, p-values, and R-squared. Interaction terms: y ~ x1 * x2. Polynomial terms: y ~ poly(x, 2). predict(model, newdata) makes predictions. For diagnostics, plot(model) produces four plots: residuals vs fitted, Q-Q, scale-location, and residuals vs leverage (with Cook's distance). Check the assumptions: linearity, homoscedasticity, and normality of residuals.
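A short sketch of the interaction and polynomial syntax plus the diagnostic plots, using the built-in mtcars data:

```r
# Interaction: main effects for wt and hp plus the wt:hp term
m_inter <- lm(mpg ~ wt * hp, data = mtcars)

# Quadratic in hp via an orthogonal polynomial
m_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)

summary(m_inter)$coefficients  # estimates, std. errors, t, p

# Four diagnostic plots on one device: residuals vs fitted,
# Q-Q, scale-location, residuals vs leverage (Cook's distance)
par(mfrow = c(2, 2))
plot(m_inter)
par(mfrow = c(1, 1))
```

Note that mpg ~ wt * hp expands to wt + hp + wt:hp, so m_inter has four coefficients including the intercept.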

Broom: Tidy Model Outputs

broom converts model output to tidy data frames. tidy(model) returns one row per coefficient with estimate, std.error, statistic, p.value. glance(model) returns one row of model-level statistics: R-squared, AIC, BIC, F-statistic. augment(model, data) adds fitted values and residuals to the original data. This makes it easy to plot model results with ggplot2 and compare multiple models with dplyr.
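Because glance() returns exactly one row per model, comparing candidate models is a matter of row-binding. A sketch with dplyr (the model names here are arbitrary labels):

```r
library(broom)
library(dplyr)

models <- list(
  wt_only   = lm(mpg ~ wt, data = mtcars),
  wt_hp     = lm(mpg ~ wt + hp, data = mtcars),
  wt_hp_cyl = lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
)

# One row of fit statistics per model, best (lowest AIC) first
bind_rows(lapply(models, glance), .id = "model") |>
  select(model, r.squared, adj.r.squared, AIC, BIC) |>
  arrange(AIC)
```

The .id argument turns the list names into a "model" column, so the result is a single tidy table you can sort or plot.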

r
library(broom)
library(ggplot2)  # needed for the residual plot below

# t-test: do groups differ?
group_a <- c(82, 85, 88, 92, 95, 78, 90)
group_b <- c(74, 78, 82, 79, 85, 71, 83)

test_result <- t.test(group_a, group_b)
tidy(test_result)  # tidy data frame output
# one row with columns: estimate (difference in means),
# statistic, p.value, conf.low, conf.high, ...

# Linear regression
model <- lm(mpg ~ wt + hp + factor(cyl),
            data = mtcars)
tidy(model)    # coefficient table
glance(model)  # R2, AIC, F-statistic

# Predict new observations
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5),
                       hp = c(100, 150, 200),
                       cyl = c(4, 6, 8))
predict(model, newdata = new_cars,
        interval = 'confidence')

# Visualize model
augment(model) |>
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(title = 'Residuals vs Fitted')
💡
Always check your regression assumptions with plot(model) before interpreting coefficients. Non-random residual patterns indicate model misspecification. A Q-Q plot shows whether residuals are normally distributed.
📝 Day 4 Exercise
Build a Regression Model
  1. Fit a linear model predicting house prices from size, bedrooms, and location
  2. Use summary(model) and tidy(model) to compare outputs
  3. Check model assumptions with plot(model) — look at the residuals vs fitted plot
  4. Use predict() to estimate the price of a new house with specific attributes
  5. Compare two nested models using anova(model1, model2) to test if adding predictors improves fit
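For step 5, a minimal sketch of the nested-model comparison, with mtcars standing in for the housing data (swap in your own price model):

```r
# Smaller model nested inside the larger one
base   <- lm(mpg ~ wt, data = mtcars)
fuller <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does adding hp significantly improve fit?
anova(base, fuller)  # a small Pr(>F) favors the larger model
```

anova() requires the models to be nested (the smaller model's predictors must be a subset of the larger's) and fit to the same observations.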

Day 4 Summary

  • t.test, chisq.test, cor.test are the core hypothesis testing functions
  • lm(y ~ x1 + x2, data) fits linear regression; summary() shows full results
  • broom::tidy() converts model output to a tidy data frame for ggplot2 integration
  • Regression diagnostics: residuals vs fitted, Q-Q, Cook's distance
  • predict(model, newdata) generates predictions with optional confidence intervals
Challenge

Analyze the 'Boston' housing dataset from the MASS package. Fit multiple regression models, adding predictors one at a time. Compare the models' AIC/BIC using broom::glance(). Identify the most parsimonious model that explains the most variance in median home values.
