Data Analysis · Day 2 of 5 ~75 minutes

Pandas Fundamentals — Load, Clean, Filter, Group

The four operations that cover 90% of what analysts actually do with data.

What You'll Build Today

A data cleaning and analysis script that handles messy real-world CSV data: missing values, inconsistent formats, duplicates. You'll clean it, then run a multi-level group analysis that would be complex to replicate in a pivot table.

1
Loading Data

Reading messy CSVs the right way

Real data is never clean. Here are the loading options you'll use constantly:

python
import pandas as pd

# Basic load
df = pd.read_csv("data.csv")

# Handle different encodings (common with exported Excel files)
df = pd.read_csv("data.csv", encoding="utf-8-sig")

# Skip bad rows, parse dates automatically
df = pd.read_csv("data.csv", parse_dates=["sale_date"], on_bad_lines="skip")

# Quick overview of what you loaded
df.info()                  # column names, types, null counts (prints directly)
print(df.describe())       # stats for numeric columns
print(df.isnull().sum())   # count missing values per column
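One more loading option worth knowing: forcing a column's dtype. IDs and zip codes often have leading zeros that pandas silently strips when it guesses "integer." A minimal sketch (the CSV content and column names here are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a real CSV file with an ID column that has leading zeros
raw = io.StringIO("order_id,revenue,region\n001,1000,East\n002,2500,West\n")

# Without dtype, pandas would read order_id as an integer and drop the zeros
df = pd.read_csv(raw, dtype={"order_id": str})

print(df["order_id"].iloc[0])  # "001", not 1
```

The `dtype` argument also speeds up loading on large files, since pandas skips type inference for those columns.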
2
Cleaning Data

Handle missing values and bad data

python · clean.py
import pandas as pd

df = pd.read_csv("sales.csv")

# Rename messy column names first, so the rest of the script can use them
df = df.rename(columns={"Rev $": "revenue", "Rep Name": "rep"})

# Drop rows where critical columns are null
df = df.dropna(subset=["revenue", "rep"])

# Fill missing values instead of dropping
df["units"] = df["units"].fillna(0)
df["region"] = df["region"].fillna("Unknown")

# Standardize text columns (strip whitespace, normalize capitalization)
df["region"] = df["region"].str.strip().str.title()
df["rep"] = df["rep"].str.strip()

# Remove duplicate rows
df = df.drop_duplicates()

# Confirm it's clean
print(f"Rows after cleaning: {len(df)}")
print(f"Missing values: {df.isnull().sum().sum()}")

Rule of thumb: dropna() when missing data means the row is useless. fillna() when you can infer a sensible value. Use your domain knowledge — pandas doesn't know what your data means.
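When you can infer a sensible value, the inferred value doesn't have to be a constant. A common middle ground is filling with a group-level statistic. A quick sketch using a toy frame (the column names mirror the lesson's sales data, but the values are invented):

```python
import pandas as pd

# Toy stand-in for sales.csv
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "units": [10, None, 20, None],
})

# Fill missing units with each region's median, not a single global value
df["units"] = df["units"].fillna(
    df.groupby("region")["units"].transform("median")
)

print(df)
```

`transform("median")` returns a Series aligned to the original rows, which is what lets it slot straight into `fillna()`.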

3
Filtering Data

Select exactly the rows you need

python
# Single condition
east = df[df["region"] == "East"]

# Multiple conditions (use & for AND, | for OR)
big_east = df[(df["region"] == "East") & (df["revenue"] > 50000)]

# Filter by a list of values
tech_products = df[df["product"].isin(["Software", "SaaS"])]

# Exclude rows
no_hardware = df[df["product"] != "Hardware"]

# Text contains (case-insensitive search)
enterprise = df[df["rep"].str.contains("Enterprise", case=False, na=False)]

# Top N rows by a column
top5 = df.nlargest(5, "revenue")
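The bracket-and-`&` syntax above is the canonical form, but pandas also offers `.query()`, which some people find easier to read once conditions stack up. A sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [60000, 40000, 30000],
})

# Same AND filter as big_east above, written as a query string
big_east = df.query("region == 'East' and revenue > 50000")

print(big_east)
```

Note that `.query()` uses plain `and`/`or` instead of `&`/`|`, and no parentheses are needed around each condition.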
4
Grouping Data

Aggregate like a pivot table — but scriptable

python · group_analysis.py
import pandas as pd

df = pd.read_csv("sales.csv")

# Single group, single metric
region_revenue = df.groupby("region")["revenue"].sum()

# Multiple metrics at once
region_stats = df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_deal=("revenue", "mean"),
    num_deals=("revenue", "count"),
    total_units=("units", "sum")
).sort_values("total_revenue", ascending=False)

print(region_stats)

# Multi-level group (region + product)
breakdown = df.groupby(["region", "product"])["revenue"].sum().unstack(fill_value=0)
print("\nRevenue by Region and Product:")
print(breakdown)

# Export to CSV
region_stats.to_csv("region_summary.csv")
print("\nExported to region_summary.csv")

The agg() pattern is the one to memorize. It lets you compute multiple metrics in a single pass and give them clean names. This is your pivot table replacement.
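If the `groupby(...).unstack()` chain feels opaque, pandas has a `pivot_table()` method that produces the same table and reads closer to the spreadsheet feature it replaces. A sketch with toy data (values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "product": ["Software", "Hardware", "Software"],
    "revenue": [100, 200, 300],
})

# The lesson's groupby + unstack version...
via_groupby = df.groupby(["region", "product"])["revenue"].sum().unstack(fill_value=0)

# ...produces the same table as pivot_table
via_pivot = df.pivot_table(index="region", columns="product",
                           values="revenue", aggfunc="sum", fill_value=0)

print(via_pivot)
```

Both are fine; `pivot_table` is the gentler on-ramp if you're coming from Excel, while `groupby` generalizes better to the multi-metric `agg()` pattern above.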

Day 2 Complete

  • Loaded CSVs with proper encoding and date handling
  • Cleaned missing values with dropna() and fillna()
  • Filtered rows with conditions, lists, and text search
  • Grouped data and aggregated multiple metrics with agg()
  • Exported clean results to CSV

Next: AI-Powered Analysis

Day 3 connects the Claude API to your DataFrames. You'll send data summaries and get executive-ready narrative insights back — automatically.
