Key Takeaways
- R's strength: Statistical analysis, data visualization (ggplot2 is unmatched), and reproducible research. Dominant in academia, pharma, and public health.
- Learn the tidyverse: dplyr, ggplot2, and tidyr cover most day-to-day R work, and the packages are designed to work together.
- R vs Python: Use R for statistics and academic research. Use Python for ML production and general programming. The best data scientists know both.
- Reproducibility: Quarto/R Markdown creates reports that mix code, output, and prose — the standard for reproducible analysis in science and industry.
R is the language statisticians built for statisticians. It was designed with one primary purpose: statistical computing and data analysis. The result is a language where every major statistical method has a well-tested, well-documented implementation, where data visualization is first-class, and where reproducible research reports are a built-in workflow.
What R Is and Why It Exists
R is a free, open-source statistical computing language and environment. It first appeared in 1993, written by Ross Ihaka and Robert Gentleman at the University of Auckland as an implementation of S (a statistical language from Bell Labs). It is now maintained by the R Core Team and supported by a large community of statisticians, data scientists, and researchers.
R's design reflects its origin: it was built by statisticians for statistical work. This means excellent facilities for data frames (before pandas, there was R's data.frame), built-in statistical functions (lm() for linear regression, glm() for generalized linear models, t.test(), aov(), etc.), and a graphical system (base R graphics, then ggplot2) designed for analytical plots.
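As a rough illustration of those built-in facilities, the sketch below uses only base R and the bundled mtcars dataset. The variable names (cars, mpg_by_cyl, wt_mpg_cor) are made up for this example.

```r
# Keep three columns of the built-in mtcars data frame (32 cars)
cars <- mtcars[, c("mpg", "wt", "cyl")]

# A data frame stores one typed column per variable
str(cars)

# Group-wise summary without any packages: mean mpg per cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
print(mpg_by_cyl)

# A built-in statistical function applied directly to two columns
wt_mpg_cor <- cor(cars$wt, cars$mpg)
print(wt_mpg_cor)
```

Nothing here needs library(): data frames, aggregation, and correlation all ship with a default R installation.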
R vs Python: Honest Comparison
| Dimension | R | Python |
|---|---|---|
| Statistical analysis | Excellent — built-in | Good (scipy, statsmodels) |
| Machine learning | Good (caret, tidymodels) | Excellent (scikit-learn, PyTorch) |
| Visualization | Excellent — ggplot2 | Good (matplotlib, seaborn, plotly) |
| Production deployment | Limited | Excellent |
| Bioinformatics | Dominant (Bioconductor) | Growing |
| General programming | Awkward | Excellent |
| Academic research | Dominant in many fields | Growing |
The honest answer: Python has won the ML/AI competition. R has maintained dominance in statistical research, clinical trials, economics, and bioinformatics. A data scientist in industry should know Python well and R reasonably well. A researcher in pharma, biostatistics, or economics should know R well and Python reasonably well.
The Tidyverse: Modern R
The tidyverse is a collection of R packages, led by Hadley Wickham, that defines much of modern R programming. Its core packages include dplyr (data manipulation), ggplot2 (visualization), tidyr (reshaping), readr (file I/O), purrr (functional programming), tibble (modern data frames), stringr (strings), and forcats (factors).
library(tidyverse)
# Pipe-based data transformation
mtcars |>
  filter(cyl == 6) |>                      # Keep 6-cylinder cars
  select(mpg, hp, wt) |>                   # Keep these columns
  mutate(power_to_weight = hp / wt) |>     # Create new column
  arrange(desc(power_to_weight))           # Sort descending
The pipe operator (the native |> in base R 4.1 and later, or %>% from magrittr, which the tidyverse popularized) chains operations left to right, making data transformation pipelines readable. The same idea appears in Unix shell pipes and in F#'s pipe-forward operator.
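To make the left-to-right chaining concrete, here is a small sketch of how the native pipe behaves. It assumes R 4.1 or later and uses only base R functions, so it runs without the tidyverse. The names piped, direct, nested, and top_mpg are illustrative.

```r
# The native pipe is pure syntax: the parser rewrites x |> f(y) into f(x, y)
piped  <- quote(x |> f(y))
direct <- quote(f(x, y))
print(identical(piped, direct))

# The same computation written as nested calls...
nested <- head(sort(subset(mtcars, cyl == 6)$mpg, decreasing = TRUE), 3)

# ...and as a pipeline that reads top to bottom
top_mpg <- mtcars |>
  subset(cyl == 6) |>          # keep 6-cylinder cars
  with(mpg) |>                 # pull out the mpg column as a vector
  sort(decreasing = TRUE) |>   # highest mpg first
  head(3)                      # top three values

print(identical(nested, top_mpg))
```

The nested version has to be read inside-out, while the piped version lists the steps in the order they run.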
ggplot2: Publication-Quality Visualization
ggplot2 implements the Grammar of Graphics — a systematic framework for building visualizations by layering components: data, aesthetics (what maps to what), geometric objects (points, lines, bars), scales, facets, and themes.
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "MPG vs Weight by Cylinder Count",
       x = "Weight (1000 lbs)", y = "Miles per Gallon",
       color = "Cylinders") +
  theme_minimal()
ggplot2 is widely regarded by practitioners who have used multiple tools as one of the best plotting libraries in any language. The grammar-of-graphics mental model also helps you think clearly about what you're plotting and why.
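Two of the grammar's components, scales and facets, can be layered on in the same way. The following is a minimal sketch, assuming the ggplot2 package is installed. The axis limits are arbitrary values chosen for this example, and p is just an illustrative name.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                   # data + aesthetics
  geom_point() +                                              # geometric object: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # geometric object: fitted line
  scale_x_continuous(limits = c(1, 6)) +                      # scale: fix the x-axis range
  facet_wrap(~ cyl) +                                         # facet: one panel per cylinder count
  theme_minimal()                                             # theme: non-data styling

print(p)
```

Each `+` adds one component, so removing the facet_wrap() line, for instance, collapses the plot back to a single panel without touching anything else.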
Statistical Modeling in R
# Linear regression
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model) # Coefficients, R-squared, p-values, residuals
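# Logistic regression (am: 0 = automatic, 1 = manual transmission)
model2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(model2) # Coefficients are on the log-odds scale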
# t-test
t.test(mpg ~ am, data = mtcars)
# ANOVA
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)
The formula interface (y ~ x1 + x2) is one of R's best designs — a consistent syntax for specifying models that works across dozens of modeling functions.
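As a small sketch of that consistency, the same formula object can be passed to different modeling functions and manipulated like any other R value. This uses only base R and mtcars. The names f, fit_lm, fit_glm, f_inter, and fit_inter are illustrative.

```r
# One formula object, reused across modeling functions
f <- mpg ~ wt + hp

fit_lm  <- lm(f, data = mtcars)                       # ordinary least squares
fit_glm <- glm(f, data = mtcars, family = gaussian)   # the same model fitted as a GLM

# Both fits estimate the same coefficients (up to numerical tolerance)
print(all.equal(coef(fit_lm), coef(fit_glm)))

# update() edits a formula; wt:hp adds an interaction term
f_inter   <- update(f, . ~ . + wt:hp)
fit_inter <- lm(f_inter, data = mtcars)
print(names(coef(fit_inter)))

# Formulas are ordinary objects that can be inspected
print(all.vars(f))
```

Because the formula is a value rather than special syntax tied to one function, it can be stored, modified, and handed to whichever fitting function fits the question.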
Reproducible Research with Quarto
Quarto (the next-generation R Markdown) enables reproducible research documents: reports where code, outputs, and prose are woven together. When the data changes or the analysis is updated, re-rendering the document regenerates everything automatically.
A Quarto document contains: YAML front matter (title, author, output format), R code chunks that execute and embed their output, and Markdown prose. It renders to HTML, PDF, Word, or presentation formats. This is the standard for academic papers, clinical trial reports, and analytical reports in many industries.
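The three parts can be seen in a tiny document. The sketch below writes a minimal .qmd file from R and renders it only if the Quarto command-line tool is found on the PATH. The file name, title, and prose are hypothetical, and rendering the R chunk also assumes the knitr and rmarkdown packages are installed.

```r
# Build the chunk fence programmatically so it does not clash with this listing
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",                                   # YAML front matter starts
  "title: \"Fuel economy in mtcars\"",     # hypothetical title
  "format: html",                          # output format
  "---",                                   # YAML front matter ends
  "",
  "Heavier cars tend to use more fuel.",   # Markdown prose
  "",
  paste0(fence, "{r}"),                    # R code chunk opens
  "fit <- lm(mpg ~ wt, data = mtcars)",
  "coef(fit)",
  fence                                    # R code chunk closes
)

qmd_path <- file.path(tempdir(), "report.qmd")   # hypothetical file name
writeLines(qmd_lines, qmd_path)

# Render only when the Quarto CLI is available
if (nzchar(Sys.which("quarto"))) {
  system2("quarto", c("render", shQuote(qmd_path)))
}
```

Changing `format: html` to another supported format is typically the only edit needed to retarget the same source.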
Where R Is Used in 2026
- Clinical trials and pharma: The FDA does not require any particular statistical software, and R is increasingly used alongside SAS. Many regulatory submissions include R scripts and outputs.
- Epidemiology and public health: CDC, NIH, and academic public health research runs heavily on R.
- Economics and social science: R is widely used in academic economics, sociology, and political science research, often alongside Stata.
- Bioinformatics: The Bioconductor project provides 2,000+ packages for genomics, proteomics, and single-cell analysis.
- Finance: Quantitative analysis, risk modeling, and portfolio optimization with packages like quantmod, PerformanceAnalytics, and xts.
Frequently Asked Questions
Should I learn R or Python for data science?
Both, ultimately. Start with Python if you are aiming at ML and general programming. Start with R if you work in life sciences, economics, or academia. The best data scientists know both.
What is the tidyverse?
A collection of R packages (including dplyr, ggplot2, tidyr, and purrr) designed around a consistent philosophy and tidy data principles. Learning the tidyverse is largely learning modern R.
What is R used for in 2026?
Statistical analysis, data visualization (ggplot2), bioinformatics (Bioconductor), clinical trials, economics research, epidemiology, and reproducible research documents.