Day 2 of 5
⏱ ~60 minutes
R Programming in 5 Days — Day 2

Data Manipulation with dplyr & tidyr

The tidyverse — especially dplyr and tidyr — transformed R into the most productive language for data manipulation. Today you master the grammar of data manipulation: select, filter, mutate, summarize, group_by, and reshape operations.

dplyr: The Grammar of Data Manipulation

dplyr's five core verbs cover most data manipulation: filter() keeps rows meeting a condition, select() keeps specified columns, mutate() creates new columns, summarize() reduces rows to summary statistics, arrange() sorts rows. The pipe |> (or %>%) chains operations. group_by() + summarize() performs split-apply-combine — the most common data analysis pattern. All dplyr functions take a data frame first and return a data frame.

tidyr: Reshaping Data

Tidy data: each variable is a column, each observation is a row, each value is a cell. pivot_longer() transforms wide to long format (many columns to key-value pairs). pivot_wider() transforms long to wide. separate() splits one column into multiple. unite() combines columns. fill() propagates non-NA values forward/backward. Tidy data works directly with ggplot2 and dplyr without reformatting.

Joins: Combining Data Frames

dplyr's join functions mirror SQL: left_join(x, y, by='key') keeps all rows from x and matching rows from y; inner_join keeps only matching rows; right_join and full_join handle the other cases. anti_join(x, y) returns rows in x with no match in y — useful for finding unmatched records. Multiple join keys: by = c('id', 'date').

r
library(dplyr)
library(tidyr)

# dplyr: summarize sales by region
sales <- data.frame(
  region  = c('East','East','West','West','North'),
  product = c('A','B','A','C','A'),
  revenue = c(100, 200, 150, 300, 80),
  units   = c(5, 8, 6, 12, 3)
)

sales |>
  group_by(region) |>
  summarize(
    total_revenue = sum(revenue),
    avg_units     = mean(units),
    n_products    = n()
  ) |>
  arrange(desc(total_revenue))

# tidyr: pivot wide to long
scores_wide <- data.frame(
  student = c('Alice', 'Bob'),
  math    = c(90, 85),
  english = c(88, 92),
  science = c(95, 78)
)

scores_long <- scores_wide |>
  pivot_longer(cols = -student,
               names_to  = 'subject',
               values_to = 'score')
# student  subject  score
# Alice    math     90
# Alice    english  88
# ...
💡
Use n() inside summarize() to count rows per group. Use n_distinct(col) to count unique values. Add .groups = 'drop' to summarize() to automatically ungroup, avoiding unexpected behavior in subsequent operations.
📝 Day 2 Exercise
Analyze a Real Dataset
  1. Install tidyverse: install.packages('tidyverse') — this includes dplyr, tidyr, ggplot2
  2. Load the nycflights13 package and the flights dataset: library(nycflights13)
  3. Find the 5 airlines with the highest average departure delay using group_by + summarize
  4. Filter to flights more than 2 hours late and count by destination airport
  5. Use pivot_longer to reshape: make a long-format table of dep_delay and arr_delay by carrier

Day 2 Summary

  • dplyr's five verbs: filter, select, mutate, summarize, arrange
  • group_by() + summarize() implements split-apply-combine
  • pivot_longer/pivot_wider reshape between wide and long format
  • Tidy data format makes ggplot2 and dplyr work together without reformatting
  • left_join/inner_join/anti_join mirror SQL join semantics
Challenge

Using the nycflights13 dataset, build a full analysis: which routes have the worst on-time performance? Which months are worst? Do delays compound (does a late departure predict a late arrival)? Produce a summary table and answer each question with dplyr code.

Finished this lesson?