The tidyverse — especially dplyr and tidyr — transformed R into the most productive language for data manipulation. Today you master the grammar of data manipulation: select, filter, mutate, summarize, group_by, and reshape operations.
dplyr's five core verbs cover most data manipulation: filter() keeps rows meeting a condition, select() keeps specified columns, mutate() creates new columns, summarize() reduces rows to summary statistics, arrange() sorts rows. The pipe |> (or %>%) chains operations. group_by() + summarize() performs split-apply-combine — the most common data analysis pattern. All dplyr functions take a data frame first and return a data frame.
Tidy data: each variable is a column, each observation is a row, each value is a cell. pivot_longer() transforms wide to long format (many columns to key-value pairs). pivot_wider() transforms long to wide. separate() splits one column into multiple. unite() combines columns. fill() propagates non-NA values forward/backward. Tidy data works directly with ggplot2 and dplyr without reformatting.
dplyr's join functions mirror SQL: left_join(x, y, by='key') keeps all rows from x and matching rows from y; inner_join keeps only matching rows; right_join and full_join handle the other cases. anti_join(x, y) returns rows in x with no match in y — useful for finding unmatched records. Multiple join keys: by = c('id', 'date').
library(dplyr)
library(tidyr)
# dplyr: summarize sales by region
sales <- data.frame(
region = c('East','East','West','West','North'),
product = c('A','B','A','C','A'),
revenue = c(100, 200, 150, 300, 80),
units = c(5, 8, 6, 12, 3)
)
sales |>
group_by(region) |>
summarize(
total_revenue = sum(revenue),
avg_units = mean(units),
n_products = n()
) |>
arrange(desc(total_revenue))
# tidyr: pivot wide to long
scores_wide <- data.frame(
student = c('Alice', 'Bob'),
math = c(90, 85),
english = c(88, 92),
science = c(95, 78)
)
scores_long <- scores_wide |>
pivot_longer(cols = -student,
names_to = 'subject',
values_to = 'score')
# student subject score
# Alice math 90
# Alice english 88
# ...
Using the nycflights13 dataset, build a full analysis: which routes have the worst on-time performance? Which months are worst? Do delays compound (does a late departure predict a late arrival)? Produce a summary table and answer each question with dplyr code.