Day 4 of 5
⏱ ~60 minutes
R Programming in 5 Days — Day 4

Statistical Analysis & Modeling

R's core strength is statistical analysis. Today you apply hypothesis tests, build regression models, and use the broom package to work with model outputs in a tidy, consistent way.

Hypothesis Testing

t.test(x, y) tests whether two groups have different means. prop.test() compares proportions, chisq.test() tests independence of categorical variables, and cor.test() tests whether a correlation differs significantly from zero. Each returns the test statistic, p-value, confidence interval, and degrees of freedom. R's printed output is verbose and human-readable but not tidy — that is what broom is for.
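The tests above all follow the same calling pattern. A quick sketch of the three besides t.test(), using small made-up vectors and the built-in mtcars data (the treatment/outcome values are illustrative, not from a real study):

```r
# chisq.test(): independence of two categorical variables
tbl <- table(
  treatment = c("A", "A", "B", "B", "A", "B", "A", "B"),
  outcome   = c("yes", "no", "yes", "yes", "no", "no", "yes", "yes")
)
chisq.test(tbl)  # warns about small expected counts on toy data

# prop.test(): compare two proportions (successes x out of trials n)
prop.test(x = c(45, 30), n = c(100, 100))

# cor.test(): is a correlation significantly different from zero?
cor.test(mtcars$mpg, mtcars$wt)
```

Each returns an object of class "htest", so they all print the same way and all work with broom::tidy().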

Linear Regression with lm()

lm(y ~ x1 + x2, data = df) fits a linear model. summary(model) shows coefficients, standard errors, p-values, and R-squared. Interaction terms: y ~ x1 * x2. Polynomial terms: y ~ poly(x, 2). predict(model, newdata) makes predictions. For diagnostics, plot(model) produces four plots: residuals vs fitted, Q-Q, scale-location, and residuals vs leverage (with Cook's distance). Check the assumptions: linearity, homoscedasticity, and normality of residuals.
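A short sketch of the interaction and polynomial syntax plus the diagnostic plots, using the built-in mtcars data:

```r
# Interaction: main effects for wt and hp plus the wt:hp term
m_inter <- lm(mpg ~ wt * hp, data = mtcars)

# Quadratic in hp via an orthogonal polynomial
m_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)

summary(m_inter)$coefficients  # estimates, std. errors, t, p

# Four diagnostic plots on one device: residuals vs fitted,
# Q-Q, scale-location, residuals vs leverage (Cook's distance)
par(mfrow = c(2, 2))
plot(m_inter)
par(mfrow = c(1, 1))
```

Note that mpg ~ wt * hp expands to wt + hp + wt:hp, so m_inter has four coefficients including the intercept.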

Broom: Tidy Model Outputs

broom converts model output to tidy data frames. tidy(model) returns one row per coefficient with estimate, std.error, statistic, p.value. glance(model) returns one row of model-level statistics: R-squared, AIC, BIC, F-statistic. augment(model, data) adds fitted values and residuals to the original data. This makes it easy to plot model results with ggplot2 and compare multiple models with dplyr.
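Because glance() returns exactly one row per model, comparing candidate models is a matter of row-binding. A sketch with dplyr (the model names here are arbitrary labels):

```r
library(broom)
library(dplyr)

models <- list(
  wt_only   = lm(mpg ~ wt, data = mtcars),
  wt_hp     = lm(mpg ~ wt + hp, data = mtcars),
  wt_hp_cyl = lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
)

# One row of fit statistics per model, best (lowest AIC) first
bind_rows(lapply(models, glance), .id = "model") |>
  select(model, r.squared, adj.r.squared, AIC, BIC) |>
  arrange(AIC)
```

The .id argument turns the list names into a "model" column, so the result is a single tidy table you can sort or plot.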

r
library(broom)
library(ggplot2)  # needed for the residual plot below

# t-test: do groups differ?
group_a <- c(82, 85, 88, 92, 95, 78, 90)
group_b <- c(74, 78, 82, 79, 85, 71, 83)

test_result <- t.test(group_a, group_b)
tidy(test_result)  # tidy data frame output
# one row with columns: estimate (difference in means),
# statistic, p.value, conf.low, conf.high, ...

# Linear regression
model <- lm(mpg ~ wt + hp + factor(cyl),
            data = mtcars)
tidy(model)    # coefficient table
glance(model)  # R2, AIC, F-statistic

# Predict new observations
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5),
                       hp = c(100, 150, 200),
                       cyl = c(4, 6, 8))
predict(model, newdata = new_cars,
        interval = 'confidence')

# Visualize model
augment(model) |>
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(title = 'Residuals vs Fitted')
💡
Always check your regression assumptions with plot(model) before interpreting coefficients. Non-random residual patterns indicate model misspecification. A Q-Q plot shows whether residuals are normally distributed.
📝 Day 4 Exercise
Build a Regression Model
  1. Fit a linear model predicting house prices from size, bedrooms, and location
  2. Use summary(model) and tidy(model) to compare outputs
  3. Check model assumptions with plot(model) — look at the residuals vs fitted plot
  4. Use predict() to estimate the price of a new house with specific attributes
  5. Compare two nested models using anova(model1, model2) to test if adding predictors improves fit
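For step 5, a minimal sketch of the nested-model comparison, with mtcars standing in for the housing data (swap in your own price model):

```r
# Smaller model nested inside the larger one
base   <- lm(mpg ~ wt, data = mtcars)
fuller <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does adding hp significantly improve fit?
anova(base, fuller)  # a small Pr(>F) favors the larger model
```

anova() requires the models to be nested (the smaller model's predictors must be a subset of the larger's) and fit to the same observations.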

Day 4 Summary

  • t.test, chisq.test, cor.test are the core hypothesis testing functions
  • lm(y ~ x1 + x2, data) fits linear regression; summary() shows full results
  • broom::tidy() converts model output to a tidy data frame for ggplot2 integration
  • Regression diagnostics: residuals vs fitted, Q-Q, Cook's distance
  • predict(model, newdata) generates predictions with optional confidence intervals
Challenge

Analyze the 'Boston' housing dataset from the MASS package. Fit multiple regression models, adding predictors one at a time. Compare the models' AIC/BIC using broom::glance(). Identify the most parsimonious model that explains the most variance in median home values.
