This lesson builds a data cleaning and analysis script for messy real-world CSV data: missing values, inconsistent formats, duplicates. You'll clean the data, then run a multi-level group analysis that would be tedious to replicate in a pivot table.
Reading messy CSVs the right way
Real data is never clean. Here are the loading options you'll use constantly:
import pandas as pd
# Basic load
df = pd.read_csv("data.csv")
# Handle different encodings (common with exported Excel files)
df = pd.read_csv("data.csv", encoding="utf-8-sig")
# Skip bad rows, parse dates automatically
df = pd.read_csv("data.csv", parse_dates=["sale_date"], on_bad_lines="skip")
# Quick overview of what you loaded
df.info() # column names, dtypes, non-null counts (prints directly; returns None)
print(df.describe()) # stats for numeric columns
print(df.isnull().sum()) # count missing values per column
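To see on_bad_lines="skip" and parse_dates in action without a file on disk, here's a minimal sketch that feeds an in-memory string to read_csv via io.StringIO as a stand-in for a messy export (the column names and values are made up):

```python
import io
import pandas as pd

# A malformed export: the third data row has an extra field
raw = """sale_date,rep,revenue
2024-01-05,Ana,1200
2024-01-06,Ben,900
2024-01-07,Cara,400,EXTRA
2024-01-08,Dee,700
"""

df = pd.read_csv(io.StringIO(raw), parse_dates=["sale_date"], on_bad_lines="skip")

print(len(df))                  # the bad row is silently dropped
print(df["sale_date"].dtype)    # parsed as datetime, not string
```

Skipping is convenient for a first pass, but it hides data loss; once the script works, it's worth counting how many rows survived against the raw line count.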
Handle missing values and bad data
import pandas as pd
df = pd.read_csv("sales.csv")
# Rename messy column names first, so every later step can use the clean names
df = df.rename(columns={"Rev $": "revenue", "Rep Name": "rep"})
# Drop rows where critical columns are null
df = df.dropna(subset=["revenue", "rep"])
# Fill missing values instead of dropping
df["units"] = df["units"].fillna(0)
df["region"] = df["region"].fillna("Unknown")
# Standardize text columns (strip whitespace, title-case)
df["region"] = df["region"].str.strip().str.title()
df["rep"] = df["rep"].str.strip()
# Remove duplicate rows
df = df.drop_duplicates()
# Confirm it's clean
print(f"Rows after cleaning: {len(df)}")
print(f"Missing values: {df.isnull().sum().sum()}")
Rule of thumb: dropna() when missing data means the row is useless. fillna() when you can infer a sensible value. Use your domain knowledge — pandas doesn't know what your data means.
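Here's that rule of thumb as a runnable sketch on a made-up mini-dataset. Filling "units" with the column median is a domain call, not a pandas default; dropping rows with no revenue or rep reflects the judgment that such rows are useless for this analysis:

```python
import pandas as pd

df = pd.DataFrame({
    "rep": ["Ana", "Ben", "Ana", None],
    "revenue": [1200.0, None, 900.0, 400.0],
    "units": [3, None, 5, 2],
})

# Inferable: fill missing units with the column median
df["units"] = df["units"].fillna(df["units"].median())

# Not inferable: a deal with no revenue or no rep is useless here, so drop it
df = df.dropna(subset=["revenue", "rep"])

print(len(df))            # 2 rows survive
print(df["units"].sum())  # 8.0
```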
Select exactly the rows you need
# Single condition
east = df[df["region"] == "East"]
# Multiple conditions (use & for AND, | for OR)
big_east = df[(df["region"] == "East") & (df["revenue"] > 50000)]
# Filter by a list of values
tech_products = df[df["product"].isin(["Software", "SaaS"])]
# Exclude rows
no_hardware = df[df["product"] != "Hardware"]
# Text contains (case-insensitive; na=False treats missing values as non-matches)
enterprise = df[df["rep"].str.contains("Enterprise", case=False, na=False)]
# Top N rows by a column
top5 = df.nlargest(5, "revenue")
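One gotcha worth a sketch: & and | bind more tightly than comparisons like == and >, so each condition needs its own parentheses. A self-contained example on hypothetical data (column names mirror the sales file above):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "East"],
    "product": ["Software", "Hardware", "SaaS", "SaaS"],
    "revenue": [60000, 75000, 40000, 52000],
})

# Correct: parentheses around each condition
big_east = df[(df["region"] == "East") & (df["revenue"] > 50000)]
print(len(big_east))  # 3

# Dropping the parentheses would make Python evaluate the & first
# and raise a TypeError. Method calls like isin() need no extra parens:
tech = df[df["product"].isin(["Software", "SaaS"]) & (df["revenue"] > 45000)]
print(len(tech))  # 2
```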
Aggregate like a pivot table — but scriptable
import pandas as pd
df = pd.read_csv("sales.csv")
# Single group, single metric
region_revenue = df.groupby("region")["revenue"].sum()
# Multiple metrics at once
region_stats = df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_deal=("revenue", "mean"),
    num_deals=("revenue", "count"),
    total_units=("units", "sum"),
).sort_values("total_revenue", ascending=False)
print(region_stats)
# Multi-level group (region + product)
breakdown = df.groupby(["region", "product"])["revenue"].sum().unstack(fill_value=0)
print("\nRevenue by Region and Product:")
print(breakdown)
# Export to CSV
region_stats.to_csv("region_summary.csv")
print("\nExported to region_summary.csv")
The agg() pattern is the one to memorize. It lets you compute multiple metrics in a single pass and give them clean names. This is your pivot table replacement.
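To make the pivot-table analogy concrete: the groupby + unstack breakdown above can also be written with pd.pivot_table, which wraps the same operation. A sketch on made-up data showing the two produce the same table:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["SaaS", "Hardware", "SaaS", "SaaS"],
    "revenue": [100, 200, 300, 400],
})

# groupby + unstack, as in the breakdown above
by_group = df.groupby(["region", "product"])["revenue"].sum().unstack(fill_value=0)

# The equivalent pivot_table call
by_pivot = df.pivot_table(index="region", columns="product",
                          values="revenue", aggfunc="sum", fill_value=0)

print(by_group.equals(by_pivot))
```

Which spelling to use is mostly taste: pivot_table reads closer to the spreadsheet mental model, while groupby + agg scales better once you want multiple named metrics.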
Day 2 Complete
- Loaded CSVs with proper encoding and date handling
- Cleaned missing values with dropna() and fillna()
- Filtered rows with conditions, lists, and text search
- Grouped data and aggregated multiple metrics with agg()
- Exported clean results to CSV
Next: AI-Powered Analysis
Day 3 connects the Claude API to your DataFrames. You'll send data summaries and get executive-ready narrative insights back — automatically.
Go to Day 3