Pandas Masterclass · Day 3 of 5 ~75 minutes

Transformation — groupby, pivot, merge, apply

The four transformation tools you'll use in 90% of real data analysis work.

What You'll Build Today

A sales analysis report: revenue by region and month using groupby, a cross-tab pivot showing product performance by region, a merged customer+orders dataset, and a custom function that categorizes order size.

1. Aggregating with groupby

groupby splits your data into groups, applies a function to each group, and combines the results. It's the pandas equivalent of SQL's GROUP BY.

```python
# Total revenue by region
df.groupby("region")["revenue"].sum()

# Multiple aggregations at once
summary = df.groupby("region")["revenue"].agg(
    total="sum",
    average="mean",
    count="count",
    max_order="max"
).reset_index()

# Group by multiple columns
monthly = df.groupby(["region", "month"])["revenue"].sum().reset_index()

# Aggregate different columns differently
result = df.groupby("region").agg({
    "revenue": ["sum", "mean"],
    "quantity": "sum",
    "order_id": "count"
})
```

Tip: after groupby, the grouping column(s) become the index (a MultiIndex when you group by several columns). Call .reset_index() to turn them back into regular columns, or pass as_index=False to groupby up front.
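To see the split-apply-combine pattern end to end, here's a minimal runnable sketch with made-up sales data (the values, and the `summary` name, are just for this illustration; the column names match the snippets above):

```python
import pandas as pd

# Hypothetical sample data, same column names as above
df = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "month":   ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "revenue": [100.0, 250.0, 80.0, 120.0, 300.0],
})

# Split by region, aggregate revenue, combine into one table
summary = df.groupby("region")["revenue"].agg(
    total="sum",
    average="mean",
).reset_index()

print(summary)  # East total 350.0, West total 500.0
```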

2. Reshaping with pivot_table

pivot_table reshapes your data so the unique values of one column become new columns. It's how you create a cross-tab, like an Excel pivot table.

```python
# Revenue by product (rows) and region (columns)
pivot = df.pivot_table(
    values="revenue",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0          # fill NaN with 0
)

# Add row and column totals
pivot = df.pivot_table(
    values="revenue",
    index="product",
    columns="region",
    aggfunc="sum",
    margins=True,           # adds "All" totals
    margins_name="Total"
)
```
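To make the cross-tab concrete, here's a small runnable sketch on made-up data (the DataFrame and its values are hypothetical; the parameters are the same ones shown above):

```python
import pandas as pd

# Hypothetical sample data, same column names as above
df = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "region":  ["East", "West", "East", "East"],
    "revenue": [10, 20, 5, 15],
})

pivot = df.pivot_table(
    values="revenue",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0,          # product B has no West sales, so that cell becomes 0
    margins=True,
    margins_name="Total",
)
print(pivot.loc["Total", "Total"])  # 50, the grand total across all cells
```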
3. Combining DataFrames with merge and concat

merge — join on a key column

```python
# Inner join — only rows with matching keys in both
merged = orders.merge(customers, on="customer_id")

# Left join — keep all orders, add customer info where available
merged = orders.merge(customers, on="customer_id", how="left")

# Join on columns with different names
merged = orders.merge(
    customers,
    left_on="cust_id",
    right_on="customer_id",
    how="left"
)
```
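The difference between inner and left joins is easiest to see on a tiny example. A sketch with hypothetical orders and customers tables (names and values invented for this illustration):

```python
import pandas as pd

# Hypothetical tables sharing a customer_id key
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 999],   # 999 has no matching customer
    "revenue": [50, 75, 20],
})
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "name": ["Ada", "Grace"],
})

inner = orders.merge(customers, on="customer_id")              # order 3 dropped
left = orders.merge(customers, on="customer_id", how="left")   # order 3 kept, name is NaN

print(len(inner), len(left))  # 2 3
```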

concat — stack DataFrames

```python
# Stack multiple DataFrames vertically (same columns)
all_data = pd.concat([jan, feb, mar], ignore_index=True)

# Combine multiple CSVs from a folder
import glob

files = glob.glob("data/monthly/*.csv")
all_months = pd.concat(
    [pd.read_csv(f) for f in files],
    ignore_index=True
)
```
4. Applying custom functions

apply() runs a Python function on every element of a column (a Series) or, with axis=1, on every row of a DataFrame. Use it when no built-in method does exactly what you need.

```python
# Apply to a single column
def order_size(revenue):
    if revenue >= 10000:
        return "large"
    elif revenue >= 1000:
        return "medium"
    else:
        return "small"

df["order_size"] = df["revenue"].apply(order_size)

# Lambda for simple one-liners
df["revenue_k"] = df["revenue"].apply(lambda x: round(x / 1000, 1))

# Apply across multiple columns (axis=1 = row by row)
def profit_margin(row):
    if row["revenue"] == 0:
        return 0
    return (row["revenue"] - row["cost"]) / row["revenue"]

df["margin"] = df.apply(profit_margin, axis=1)
```

Performance note: apply() is slower than vectorized operations. If you can express the same logic with arithmetic on columns (df["margin"] = (df["revenue"] - df["cost"]) / df["revenue"]), do that instead. More on this in Day 5.
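For categorization like order_size, one vectorized alternative is pd.cut, which bins a numeric column in a single pass. A sketch using the same thresholds as the function above (the sample values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [500, 5_000, 50_000]})  # hypothetical values

# right=False makes bins half-open [low, high), matching the >= thresholds above
df["order_size"] = pd.cut(
    df["revenue"],
    bins=[float("-inf"), 1_000, 10_000, float("inf")],
    labels=["small", "medium", "large"],
    right=False,
)
print(df["order_size"].tolist())  # ['small', 'medium', 'large']
```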

Practice Challenge

Build a sales report from scratch

  • Create a groupby summary: total and average revenue by category
  • Create a pivot table: revenue by category (rows) and month (columns)
  • Merge two DataFrames — orders + products — on a shared ID column
  • Add a derived column using apply() or vectorized math

What You Learned Today

  • groupby with single and multiple columns, using agg() for multiple stats
  • pivot_table for cross-tab analysis with row/column totals
  • merge() for joining DataFrames on keys (inner, left, right)
  • concat() for stacking DataFrames vertically
  • apply() for custom row-level functions

Day 4: Time Series

Tomorrow: parsing dates, resampling, rolling averages, and DatetimeIndex.
