A reusable cleaning pipeline function that takes a raw DataFrame and returns a cleaned one: nulls handled, duplicates removed, dtypes corrected, string columns normalized. Drop it into any project.
Handling nulls with dropna and fillna
Every real dataset has missing values. The question isn't whether to handle them — it's how.
Detecting nulls
# Count nulls per column
df.isnull().sum()
# Percentage of nulls per column
(df.isnull().sum() / len(df) * 100).round(1)
# Which rows have ANY null
df[df.isnull().any(axis=1)]
dropna — remove rows or columns
# Drop any row with at least one null
df_clean = df.dropna()
# Drop rows only if a specific column is null
df_clean = df.dropna(subset=["order_id", "revenue"])
# Drop columns that are more than 50% null
# (thresh is the minimum number of NON-null values a column needs to survive)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)
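The thresh argument counts non-null values, not nulls, which is easy to get backwards. A toy example (hypothetical data) makes the behavior concrete:

```python
import numpy as np
import pandas as pd

# Toy frame: "mostly_null" is 75% null, "mostly_full" is 25% null
toy = pd.DataFrame({
    "mostly_null": [1, np.nan, np.nan, np.nan],
    "mostly_full": [1, 2, 3, np.nan],
})

# thresh = minimum number of NON-null values a column must have to be kept
kept = toy.dropna(axis=1, thresh=int(len(toy) * 0.5))
print(list(kept.columns))  # only "mostly_full" survives
```

With 4 rows, the threshold is 2 non-null values: "mostly_null" has only 1 and is dropped, while "mostly_full" has 3 and is kept.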
fillna — replace nulls with a value
# Fill with a fixed value
df["notes"] = df["notes"].fillna("N/A")
# Fill numeric nulls with the column median
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
# Forward fill — use previous row's value (good for time series)
# (fillna(method="ffill") is deprecated in modern pandas; use .ffill())
df["price"] = df["price"].ffill()
# Fill multiple columns at once with a dict
df = df.fillna({"region": "Unknown", "discount": 0})
Which to use? Drop rows only when nulls are random and you have enough data. Fill with median/mean for numeric columns when nulls are rare. Fill with a placeholder for categorical columns when you need to keep the row.
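The rules of thumb above can be sketched as a small helper. The function name, the 5% "rare" cutoff, and the toy data are all illustrative choices, not a standard API:

```python
import numpy as np
import pandas as pd

def fill_by_strategy(df, rare_threshold=0.05):
    """Median-fill numeric columns when nulls are rare;
    placeholder-fill string columns. Illustrative sketch only."""
    df = df.copy()
    for col in df.columns:
        null_frac = df[col].isnull().mean()
        if null_frac == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[col]) and null_frac < rare_threshold:
            df[col] = df[col].fillna(df[col].median())
        elif df[col].dtype == object:
            df[col] = df[col].fillna("Unknown")
    return df

# Hypothetical data: revenue is 1% null (rare), region is categorical
sales = pd.DataFrame({
    "revenue": [100.0, np.nan] + [50.0] * 98,
    "region": ["East", None] + ["West"] * 98,
})
cleaned = fill_by_strategy(sales)
print(cleaned.isnull().sum().sum())  # 0
```

Columns with frequent numeric nulls are deliberately left alone here: when a large share of a column is missing, an automatic median fill can distort the distribution, and that decision deserves a human look.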
Finding and removing duplicates
# How many duplicate rows?
print(df.duplicated().sum())
# See the duplicate rows
print(df[df.duplicated(keep=False)])
# Remove duplicates — keep first occurrence
df = df.drop_duplicates()
# Duplicates by specific columns only
# (same order_id = duplicate, even if other cols differ)
df = df.drop_duplicates(subset=["order_id"], keep="first")
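One sanity check worth doing before deduplicating by a key column: count how many rows share each key, so you know what you are about to drop. The toy data below is hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "revenue":  [10, 10, 20, 30, 31, 32],
})

# Count rows per key to see which ids repeat before deduplicating
counts = orders["order_id"].value_counts()
print(sorted(counts[counts > 1].index))  # ids 1 and 3 repeat

# keep="first" keeps the first row for each order_id and drops the rest
deduped = orders.drop_duplicates(subset=["order_id"], keep="first")
print(len(deduped))  # 3 rows, one per order_id
```

Note that order_id 3 has three rows with different revenues, so which row survives matters: keep="first" and keep="last" give different results, and for conflicting values you may want an aggregation (e.g. groupby) rather than a blind drop.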
Fixing wrong dtypes
pandas guesses types when reading files. It often guesses wrong. A column that should be a number might load as a string, and dates almost always load as strings.
# Check current types
print(df.dtypes)
# Convert a column to numeric (coerce turns bad values to NaN)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# Convert to datetime
df["order_date"] = pd.to_datetime(df["order_date"])
# Convert to string / category
df["region"] = df["region"].astype("category")
df["order_id"] = df["order_id"].astype(str)
# Convert multiple columns at once with a dict
df = df.astype({"quantity": int, "price": float})
Category type: If a string column has fewer than ~50 unique values (like "region" or "status"), convert it to category. It uses 5–10x less memory and speeds up groupby operations significantly.
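The memory claim is easy to verify on your own data. A rough sketch (the exact ratio depends on string length and the number of unique values):

```python
import pandas as pd

# 100,000 rows drawn from just four repeated labels
regions = pd.Series(["North", "South", "East", "West"] * 25_000)

# deep=True counts the actual string storage, not just pointers
as_object = regions.memory_usage(deep=True)
as_category = regions.astype("category").memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```

The category version stores each unique label once plus a small integer code per row, which is why low-cardinality columns shrink so dramatically.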
Building a reusable cleaning function
Put it all together. Write this once and reuse it across projects:
import pandas as pd

def clean_dataframe(df, required_cols=None):
    """
    Clean a raw DataFrame:
    - Normalize column names
    - Drop rows missing required columns
    - Remove full duplicates
    - Strip whitespace from string columns
    """
    original_shape = df.shape
    df = df.copy()  # work on a copy so the caller's DataFrame is untouched

    # Normalize column names: lowercase, strip, replace spaces with underscores
    df.columns = (
        df.columns
        .str.lower()
        .str.strip()
        .str.replace(" ", "_")
        .str.replace("[^a-z0-9_]", "", regex=True)
    )

    # Drop rows missing any required column
    if required_cols:
        df = df.dropna(subset=required_cols)

    # Remove full duplicates
    df = df.drop_duplicates()

    # Strip leading/trailing whitespace from all string columns
    str_cols = df.select_dtypes(include="object").columns
    df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

    print(f"Cleaned: {original_shape} → {df.shape}")
    return df
# Usage
df_raw = pd.read_csv("sales.csv")
df = clean_dataframe(df_raw, required_cols=["order_id", "revenue"])
Clean a real messy dataset
- Download a messy CSV from Kaggle (search "messy data" or use any real dataset)
- Run df.isnull().sum() — identify your worst columns for nulls
- Decide: drop or fill? Apply your strategy and verify nulls are gone
- Check dtypes. Fix any dates or numbers that loaded as strings
- Report the before/after shape
What You Learned Today
- Detect nulls by column and percentage
- Drop rows with dropna() using subset and threshold
- Fill nulls with fixed values, median, or forward-fill
- Detect and remove duplicate rows
- Fix dtypes with astype(), to_numeric(), and to_datetime()
- Build a reusable cleaning function you can drop into any project
Day 3: Transformation
Tomorrow: groupby aggregations, pivot tables, merging DataFrames, and applying custom functions.
Start Day 3