Day 2: Data Manipulation with dplyr & tidyr

Today's Objective

By the end of today you should be able to transform and summarize data frames with dplyr's core verbs, reshape data between wide and long formats with tidyr, and combine data frames with dplyr's join functions.

dplyr: The Grammar of Data Manipulation

dplyr's five core verbs cover most data manipulation: filter() keeps rows that meet a condition, select() keeps the specified columns, mutate() creates or modifies columns, summarize() reduces rows to summary statistics, and arrange() sorts rows. The pipe chains operations: |> is the native pipe (R 4.1 and later) and %>% is the magrittr pipe that dplyr re-exports. group_by() followed by summarize() performs split-apply-combine, one of the most common patterns in data analysis. These verbs take a data frame as their first argument and return a data frame, which is what makes them easy to chain.
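
The verbs described above can be sketched in a single pipeline. The orders data frame and its column names are invented for illustration, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical order data, invented for illustration
orders <- data.frame(
  customer = c('Ann', 'Ben', 'Cara', 'Dev'),
  price    = c(20, 5, 12, 8),
  quantity = c(2, 10, 1, 4)
)

big_orders <- orders |>
  mutate(total = price * quantity) |>  # add a computed column
  filter(total >= 30) |>               # keep rows meeting a condition
  select(customer, total) |>           # keep only the columns of interest
  arrange(desc(total))                 # sort, largest total first

big_orders
```

Each step receives the data frame produced by the previous one, so the pipeline reads top to bottom in the order the operations happen.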

tidyr: Reshaping Data

Tidy data has three properties: each variable is a column, each observation is a row, and each value is a cell. pivot_longer() reshapes wide data to long format by gathering many columns into a name column and a value column. pivot_wider() does the reverse, spreading long data into wide format. separate() splits one column into several, and unite() combines several columns into one. fill() replaces NA values with the nearest non-NA value above (the default) or below. Tidy data works directly with ggplot2 and dplyr without further reshaping.
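
As a rough illustration of the other reshaping functions named above, the sketch below uses fill(), separate(), unite(), and pivot_wider() on a small readings table. The data and column names are invented for illustration. Recent tidyr releases mark separate() as superseded in favor of separate_wider_delim(), but it still works as shown.

```r
library(tidyr)

# Hypothetical readings; NA marks a site label that was left blank
readings <- data.frame(
  site = c('North', NA, 'South', NA),
  date = c('2024-01-05', '2024-02-10', '2024-01-07', '2024-02-12'),
  temp = c(3.5, 4.1, 9.0, 10.2)
)

tidy_readings <- readings |>
  fill(site) |>                                                # carry the last non-NA site downward
  separate(date, into = c('year', 'month', 'day'), sep = '-')  # one column becomes three

# unite() reverses separate(): rebuild a single date column
restored <- tidy_readings |>
  unite('date', year, month, day, sep = '-')

# pivot_wider(): one row per site, one temperature column per month
by_month <- tidy_readings |>
  pivot_wider(id_cols = site,
              names_from = month,
              values_from = temp,
              names_prefix = 'month_')
```

Setting id_cols = site drops the unused year and day columns, so each site collapses to a single row.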

Joins: Combining Data Frames

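dplyr's join functions mirror SQL joins. left_join(x, y, by = 'key') keeps all rows from x and adds matching columns from y; inner_join() keeps only rows with a match in both; right_join() keeps all rows from y; and full_join() keeps all rows from both x and y. Where there is no match, the missing columns are filled with NA. anti_join(x, y) returns the rows of x that have no match in y, which is useful for finding unmatched records. To join on several keys, pass a character vector: by = c('id', 'date').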
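
The differences between the join types are easiest to see on two small tables. This is a minimal sketch; the customers and orders tables are invented for illustration.

```r
library(dplyr)

# Hypothetical tables, invented for illustration
customers <- data.frame(
  id   = c(1, 2, 3),
  name = c('Ann', 'Ben', 'Cara')
)
orders <- data.frame(
  id     = c(1, 1, 3, 4),
  amount = c(50, 20, 75, 10)
)

left      <- left_join(customers, orders, by = 'id')   # all customers; Ben gets an NA amount
inner     <- inner_join(customers, orders, by = 'id')  # only ids present in both tables
right     <- right_join(customers, orders, by = 'id')  # all orders; id 4 gets an NA name
full      <- full_join(customers, orders, by = 'id')   # every id from either table
no_orders <- anti_join(customers, orders, by = 'id')   # customers with no order at all
```

Ann appears twice in the left and inner results because customer 1 has two orders: a join returns one row per matching pair, not one row per key.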

library(dplyr)
library(tidyr)

# dplyr: summarize sales by region
sales <- data.frame(
  region  = c('East','East','West','West','North'),
  product = c('A','B','A','C','A'),
  revenue = c(100, 200, 150, 300, 80),
  units   = c(5, 8, 6, 12, 3)
)

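sales |>
  group_by(region) |>                       # split: one group per region
  summarize(
    total_revenue = sum(revenue),
    avg_units     = mean(units),
    n_products    = n_distinct(product)     # distinct products, not just a row count
  ) |>
  arrange(desc(total_revenue))              # highest-revenue region first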

# tidyr: pivot wide to long
scores_wide <- data.frame(
  student = c('Alice', 'Bob'),
  math    = c(90, 85),
  english = c(88, 92),
  science = c(95, 78)
)

scores_long <- scores_wide |>
  pivot_longer(cols = -student,          # every column except student
               names_to  = 'subject',
               values_to = 'score')
# student  subject  score
# Alice    math     90
# Alice    english  88
# ...
