In this lesson you'll build a sales analysis report: revenue by region and month using groupby, a cross-tab pivot showing product performance by region, a merged customer + orders dataset, and a custom function that categorizes order size.
Aggregating with groupby
groupby splits your data into groups, applies a function to each group, and combines the results. It's the pandas equivalent of SQL's GROUP BY.
# Total revenue by region
df.groupby("region")["revenue"].sum()
# Multiple aggregations at once
summary = df.groupby("region")["revenue"].agg(
    total="sum",
    average="mean",
    count="count",
    max_order="max"
).reset_index()
# Group by multiple columns
monthly = df.groupby(["region", "month"])["revenue"].sum().reset_index()
# Aggregate different columns differently
result = df.groupby("region").agg({
    "revenue": ["sum", "mean"],
    "quantity": "sum",
    "order_id": "count"
})
Always reset_index(): after a groupby, the column(s) you grouped by become the index of the result. Call .reset_index() to turn them back into regular columns.
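A quick before-and-after on a tiny hypothetical DataFrame makes the difference concrete:

```python
import pandas as pd

# Tiny hypothetical dataset
df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "revenue": [100, 200, 300],
})

grouped = df.groupby("region")["revenue"].sum()
# "region" is now the index of the result, not a column
print(grouped.index.name)       # region

flat = grouped.reset_index()
# "region" is back as a regular column
print(list(flat.columns))       # ['region', 'revenue']
```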
Reshaping with pivot_table
pivot_table spreads the values of one column across new columns, aggregating where rows collide. It's how you create a cross-tab — like an Excel pivot table.
# Revenue by product (rows) and region (columns)
pivot = df.pivot_table(
    values="revenue",
    index="product",
    columns="region",
    aggfunc="sum",
    fill_value=0  # fill NaN with 0
)
# Add row and column totals
pivot = df.pivot_table(
    values="revenue",
    index="product",
    columns="region",
    aggfunc="sum",
    margins=True,  # adds "All" totals
    margins_name="Total"
)
Combining DataFrames with merge and concat
merge — join on a key column
# Inner join — only rows with matching keys in both
merged = orders.merge(customers, on="customer_id")
# Left join — keep all orders, add customer info where available
merged = orders.merge(customers, on="customer_id", how="left")
# Join on columns with different names
merged = orders.merge(
    customers,
    left_on="cust_id",
    right_on="customer_id",
    how="left"
)
concat — stack DataFrames
# Stack multiple DataFrames vertically (same columns)
all_data = pd.concat([jan, feb, mar], ignore_index=True)
# Combine multiple CSVs from a folder
import glob
files = glob.glob("data/monthly/*.csv")
all_months = pd.concat(
    [pd.read_csv(f) for f in files],
    ignore_index=True
)
Applying custom functions
apply() runs a function on each element of a column (Series.apply) or on each row or column of a DataFrame (DataFrame.apply). Use it when no built-in method does exactly what you need.
# Apply to a single column
def order_size(revenue):
    if revenue >= 10000:
        return "large"
    elif revenue >= 1000:
        return "medium"
    else:
        return "small"
df["order_size"] = df["revenue"].apply(order_size)
# Lambda for simple one-liners
df["revenue_k"] = df["revenue"].apply(lambda x: round(x / 1000, 1))
# Apply across multiple columns (axis=1 = row by row)
def profit_margin(row):
    if row["revenue"] == 0:
        return 0
    return (row["revenue"] - row["cost"]) / row["revenue"]
df["margin"] = df.apply(profit_margin, axis=1)
Performance note: apply() is slower than vectorized operations. If you can express the same logic with arithmetic on columns (df["margin"] = (df["revenue"] - df["cost"]) / df["revenue"]), do that instead. More on this in Day 5.
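As an illustration of the vectorized alternative, the order_size categorization above can also be done without apply(), using pd.cut to bin the values in one pass (a sketch with hypothetical data, same thresholds as the function version):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [500, 5000, 50000]})

# Vectorized binning: left-inclusive bins so revenue >= 1000 is
# "medium" and revenue >= 10000 is "large", matching order_size()
df["order_size"] = pd.cut(
    df["revenue"],
    bins=[-float("inf"), 1000, 10000, float("inf")],
    labels=["small", "medium", "large"],
    right=False,
)
print(df["order_size"].tolist())  # ['small', 'medium', 'large']
```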
Build a sales report from scratch
- Create a groupby summary: total and average revenue by category
- Create a pivot table: revenue by category (rows) and month (columns)
- Merge two DataFrames — orders + products — on a shared ID column
- Add a derived column using apply() or vectorized math
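One way the four steps above could fit together, as a minimal sketch using two small hypothetical DataFrames (orders and products):

```python
import pandas as pd

# Hypothetical sample data for the exercise
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": [10, 10, 20, 20],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100.0, 200.0, 300.0, 400.0],
})
products = pd.DataFrame({
    "product_id": [10, 20],
    "category": ["Widgets", "Gadgets"],
})

# Merge orders + products on the shared ID column
merged = orders.merge(products, on="product_id", how="left")

# groupby summary: total and average revenue by category
summary = merged.groupby("category")["revenue"].agg(
    total="sum",
    average="mean"
).reset_index()

# Pivot table: revenue by category (rows) and month (columns)
pivot = merged.pivot_table(
    values="revenue",
    index="category",
    columns="month",
    aggfunc="sum",
    fill_value=0
)

# Derived column using vectorized math
merged["revenue_k"] = merged["revenue"] / 1000
```

The column and category names here are made up; swap in your own dataset's columns when you work through the exercise.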
What You Learned Today
- groupby with single and multiple columns, using agg() for multiple stats
- pivot_table for cross-tab analysis with row/column totals
- merge() for joining DataFrames on keys (inner, left, right)
- concat() for stacking DataFrames vertically
- apply() for custom row-level functions
Day 4: Time Series
Tomorrow: parsing dates, resampling, rolling averages, and DatetimeIndex.
Start Day 4