Load a real-world dataset (a public sales CSV), perform filtering and grouping analysis on it, build two visualizations with matplotlib, and write out 3 genuine insights as a summary. The workflow data analysts do manually every day — automated.
Installing pandas and matplotlib
```bash
$ pip install pandas matplotlib
```
Then grab a dataset to work with. The Superstore sales dataset is a classic — it has orders, regions, categories, and revenue, which makes it ideal for practicing analysis.
```python
import pandas as pd

# Load a CSV into a DataFrame
df = pd.read_csv("superstore.csv")

# First look at the data
print(df.head())             # first 5 rows
print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names
print(df.dtypes)             # data type of each column
print(df.describe())         # stats: mean, std, min, max

# Check for missing values
print(df.isnull().sum())
```
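Once you've counted the missing values, the usual next step is to drop or fill them before analyzing. A minimal sketch of both options, using a small made-up DataFrame (real column names will vary by dataset):

```python
import pandas as pd

# Hypothetical mini-dataset with one missing Sales value
df = pd.DataFrame({
    "Sales": [100.0, None, 250.0],
    "Region": ["West", "East", "West"],
})

print(df.isnull().sum())  # Sales: 1, Region: 0

# Option 1: drop any row containing a missing value
cleaned = df.dropna()

# Option 2: fill missing numeric values, e.g. with the column median
filled = df.fillna({"Sales": df["Sales"].median()})

print(len(cleaned))               # 2 rows remain
print(filled["Sales"].tolist())   # [100.0, 175.0, 250.0]
```

Which option is right depends on the analysis: dropping rows is safest for small amounts of missing data, while filling keeps the row count stable for grouping.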
Filtering, Grouping, and Aggregating
```python
import pandas as pd

df = pd.read_csv("superstore.csv")

# Filter rows
tech_orders = df[df["Category"] == "Technology"]
big_orders = df[df["Sales"] > 1000]

# Multiple conditions
west_tech = df[(df["Region"] == "West") & (df["Category"] == "Technology")]

# GroupBy: total sales by region
regional_sales = df.groupby("Region")["Sales"].sum().sort_values(ascending=False)
print(regional_sales)

# Multiple aggregations
summary = df.groupby("Category").agg({
    "Sales": ["sum", "mean"],
    "Profit": "sum",
    "Order ID": "count"
})
print(summary)

# Top 10 most profitable products
top_products = df.groupby("Product Name")["Profit"].sum().nlargest(10)
print(top_products)

# Profit margin by segment
df["profit_margin"] = df["Profit"] / df["Sales"]
print(df.groupby("Segment")["profit_margin"].mean())
```
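One note on the dict-style `.agg()` above: it produces MultiIndex column headers, which are awkward to work with downstream. pandas also supports named aggregation, which gives flat, readable column names. A sketch using a small made-up frame standing in for the Superstore columns:

```python
import pandas as pd

# Hypothetical mini-dataset standing in for superstore.csv
df = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Technology"],
    "Sales": [1000.0, 500.0, 3000.0],
    "Profit": [200.0, 50.0, 600.0],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
    total_profit=("Profit", "sum"),
    orders=("Sales", "count"),
)
print(summary)
```

The result has plain columns (`total_sales`, `avg_sales`, ...) instead of a two-level header, so `summary["total_sales"]` works directly.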
Basic Visualization with Matplotlib
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("superstore.csv")

# Bar chart: sales by category
category_sales = df.groupby("Category")["Sales"].sum()
plt.figure(figsize=(8, 5))
category_sales.plot(kind="bar", color="#1e3a5f")
plt.title("Total Sales by Category")
plt.xlabel("Category")
plt.ylabel("Sales ($)")
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig("sales_by_category.png")
plt.show()

# Scatter: sales vs profit
plt.figure(figsize=(8, 5))
plt.scatter(df["Sales"], df["Profit"], alpha=0.3, color="#c4873e")
plt.title("Sales vs Profit")
plt.xlabel("Sales ($)")
plt.ylabel("Profit ($)")
plt.tight_layout()
plt.savefig("sales_vs_profit.png")
plt.show()
```
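The same pattern extends to the top-products result from earlier: a horizontal bar chart reads better when category names are long. This sketch uses made-up profit totals in place of the real groupby result, and matplotlib's non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display (e.g. on a server)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals standing in for df.groupby("Product Name")["Profit"].sum()
top_products = pd.Series(
    {"Copiers": 25000, "Phones": 18000, "Accessories": 9000}
).sort_values()  # ascending so the biggest bar ends up on top

plt.figure(figsize=(8, 5))
top_products.plot(kind="barh", color="#1e3a5f")
plt.title("Top Products by Profit")
plt.xlabel("Profit ($)")
plt.tight_layout()
plt.savefig("top_products.png")
plt.close()
```

`plt.close()` replaces `plt.show()` here because there's no window to show; the chart lives only in the saved PNG.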
Tomorrow: You'll replace the manual "find insights" step with Claude. Instead of staring at charts, you'll feed the numbers to the API and get written analysis back automatically.
Analyze a Real Dataset — Find 3 Insights
- Download the Superstore CSV from Kaggle (free account needed) or use any CSV you have
- Load it with pandas and run `.describe()` to understand the shape
- Write 3 groupby queries that each answer a specific business question
- Build 2 charts using matplotlib and save them as PNG files
- Write your 3 insights as Python comments next to the code that revealed them
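The "insight as a comment" pattern from the last step might look like the sketch below. The numbers here are invented placeholders on a tiny made-up frame; in the exercise, the DataFrame comes from your CSV and the insight comes from your actual output:

```python
import pandas as pd

# Hypothetical mini-frame; in the exercise this comes from superstore.csv
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Sales": [100.0, 200.0, 400.0, 300.0],
    "Profit": [5.0, 40.0, 20.0, 60.0],
})

# Q: Which segment has the weaker profit margins?
df["profit_margin"] = df["Profit"] / df["Sales"]
margins = df.groupby("Segment")["profit_margin"].mean()
print(margins)
# INSIGHT (from these made-up numbers): Consumer margins average 5%,
# well below Corporate's 20%, so discounting there deserves a closer look.
```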
What You Learned Today
- Loading CSVs into pandas DataFrames and inspecting their structure
- Filtering rows with boolean conditions and multiple conditions
- Groupby + aggregate: sum, mean, count, nlargest — the core analysis toolkit
- Building bar charts and scatter plots with matplotlib
- Creating new computed columns from existing ones
Day 5: Build an AI-Powered Data Pipeline
Tomorrow you combine everything — files, APIs, pandas, and Claude — into one end-to-end pipeline that runs itself.
Start Day 5