Pandas Masterclass · Day 2 of 5 · ~60 minutes

Cleaning — Missing Values, Duplicates, dtypes

Real data is always messy. Learn to systematically clean it so your analysis is built on solid ground.

What You'll Build Today

A reusable cleaning pipeline function that takes a raw DataFrame and returns a cleaned one: nulls handled, duplicates removed, dtypes corrected, string columns normalized. Drop it into any project.

1
Missing Values

Handling nulls with dropna and fillna

Every real dataset has missing values. The question isn't whether to handle them — it's how.

Detecting nulls

python
# Count nulls per column
df.isnull().sum()

# Percentage of nulls per column
(df.isnull().sum() / len(df) * 100).round(1)

# Which rows have ANY null
df[df.isnull().any(axis=1)]

dropna — remove rows or columns

python
# Drop any row with at least one null
df_clean = df.dropna()

# Drop rows only if a specific column is null
df_clean = df.dropna(subset=["order_id", "revenue"])

# Drop columns that are more than 50% null
# (thresh = minimum number of non-null values a column needs to survive)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)

fillna — replace nulls with a value

python
# Fill with a fixed value
df["notes"] = df["notes"].fillna("N/A")

# Fill numeric nulls with the column median
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Forward fill — use previous row's value (good for time series)
# (fillna(method="ffill") is deprecated in pandas 2.x; use .ffill())
df["price"] = df["price"].ffill()

# Fill multiple columns at once with a dict
df = df.fillna({"region": "Unknown", "discount": 0})

Which to use? Drop rows only when nulls are random and you have enough data. Fill with median/mean for numeric columns when nulls are rare. Fill with a placeholder for categorical columns when you need to keep the row.
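Applied to a toy DataFrame, those three rules might look like this (the column names and data here are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "order_id": [1, 2, None, 4],                # required ID → drop the row
    "revenue": [100.0, np.nan, 250.0, 90.0],    # numeric, rare nulls → median
    "region": ["East", None, "West", "East"],   # categorical → placeholder
})

# Rule 1: drop rows missing a required identifier
df = df.dropna(subset=["order_id"])

# Rule 2: fill rare numeric nulls with the column median
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Rule 3: keep rows with missing categoricals by filling a placeholder
df["region"] = df["region"].fillna("Unknown")

print(df.isnull().sum().sum())  # 0 — no nulls remain
```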

2
Duplicates

Finding and removing duplicates

python
# How many duplicate rows?
print(df.duplicated().sum())

# See the duplicate rows
print(df[df.duplicated(keep=False)])

# Remove duplicates — keep first occurrence
df = df.drop_duplicates()

# Duplicates by specific columns only
# (same order_id = duplicate, even if other cols differ)
df = df.drop_duplicates(subset=["order_id"], keep="first")
3
Data Types

Fixing wrong dtypes

pandas guesses types when reading files. It often guesses wrong. A column that should be a number might load as a string, and dates almost always load as strings.

python
# Check current types
print(df.dtypes)

# Convert a column to numeric (coerce turns bad values to NaN)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Convert to datetime
df["order_date"] = pd.to_datetime(df["order_date"])

# Convert to string / category
df["region"] = df["region"].astype("category")
df["order_id"] = df["order_id"].astype(str)

# Convert multiple columns at once with a dict
df = df.astype({"quantity": int, "price": float})

Category type: If a string column has fewer than ~50 unique values (like "region" or "status"), convert it to category. It uses 5–10x less memory and speeds up groupby operations significantly.
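A quick way to check the savings on your own data is `memory_usage(deep=True)`. The exact numbers vary with string length and row count, so treat the ratio as illustrative:

```python
import pandas as pd

# 100,000 rows but only 4 distinct values — ideal for category
s_obj = pd.Series(["East", "West", "North", "South"] * 25_000)
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True))  # bytes stored as plain Python strings
print(s_cat.memory_usage(deep=True))  # much smaller: small ints + 4 labels
```

Internally, category stores each row as a small integer code plus one lookup table of the unique labels, which is why the savings grow with row count.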

4
Pipeline

Building a reusable cleaning function

Put it all together. Write this once and reuse it across projects:

python (clean.py)
import pandas as pd

def clean_dataframe(df, required_cols=None):
    """
    Clean a raw DataFrame:
    - Drop rows missing required columns
    - Remove full duplicates
    - Strip whitespace from string columns
    - Normalize column names
    """
    original_shape = df.shape
    df = df.copy()  # work on a copy so the caller's DataFrame isn't mutated

    # Normalize column names: lowercase, replace spaces with underscores
    df.columns = (
        df.columns
        .str.lower()
        .str.strip()
        .str.replace(" ", "_")
        .str.replace("[^a-z0-9_]", "", regex=True)
    )

    # Drop rows missing any required column
    if required_cols:
        df = df.dropna(subset=required_cols)

    # Remove full duplicates
    df = df.drop_duplicates()

    # Strip leading/trailing whitespace from all string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda x: x.str.strip())

    print(f"Cleaned: {original_shape} → {df.shape}")
    return df

# Usage
df_raw = pd.read_csv("sales.csv")
df = clean_dataframe(df_raw, required_cols=["order_id", "revenue"])
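Because the function takes a DataFrame and returns one, it also slots into a method chain via `.pipe`. The sketch below redefines a condensed version of the function so the example is self-contained; the data is invented for illustration:

```python
import pandas as pd

def clean_dataframe(df, required_cols=None):
    # Condensed version of the full function above, for demonstration
    df = df.copy()
    df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")
    if required_cols:
        df = df.dropna(subset=required_cols)
    return df.drop_duplicates()

# .pipe passes the DataFrame as the first argument, keeping the chain readable
df = (
    pd.DataFrame({"Order ID": [1, 1, None], "Revenue": [10, 10, 5]})
      .pipe(clean_dataframe, required_cols=["order_id"])
)
print(df.shape)  # (1, 2) — duplicate row and null-ID row removed
```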
Practice Challenge

Clean a real messy dataset

  • Download a messy CSV from Kaggle (search "messy data" or use any real dataset)
  • Run df.isnull().sum() — identify your worst columns for nulls
  • Decide: drop or fill? Apply your strategy and verify nulls are gone
  • Check dtypes. Fix any dates or numbers that loaded as strings
  • Report the before/after shape

What You Learned Today

  • Detect nulls by column and percentage
  • Drop rows with dropna() using subset and threshold
  • Fill nulls with fixed values, median, or forward-fill
  • Detect and remove duplicate rows
  • Fix dtypes with astype(), to_numeric(), and to_datetime()
  • Build a reusable cleaning function you can drop into any project
Day 2 Complete

Day 3: Transformation

Tomorrow: groupby aggregations, pivot tables, merging DataFrames, and applying custom functions.
