Pandas Masterclass · Day 5 of 5 ~75 minutes

Performance — Vectorization, Chunking, Large Datasets

The skills that separate junior from senior data practitioners. Make your code fast enough to use in production.

What You'll Build Today

A high-performance data pipeline that processes a 5GB CSV in chunks, uses vectorized operations throughout, and finishes in seconds instead of minutes. Includes a benchmarking function that compares apply() vs. vectorized speed.

1
Vectorization

Stop looping. Start vectorizing.

Every time you write a for loop or apply() over a DataFrame column, you're leaving performance on the table. Vectorized operations run on entire columns at once using optimized C code — they're often 10–100x faster.

The wrong way vs. the right way

benchmark.py
import pandas as pd
import numpy as np
import time

df = pd.DataFrame({"revenue": np.random.rand(1_000_000) * 10000})

# SLOW: apply() with a Python function
start = time.time()
df["tax_apply"] = df["revenue"].apply(lambda x: x * 0.08)
print(f"apply(): {time.time() - start:.2f}s")   # ~1.2s

# FAST: vectorized arithmetic
start = time.time()
df["tax_vec"] = df["revenue"] * 0.08
print(f"vectorized: {time.time() - start:.4f}s")  # ~0.004s
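
The lesson's project calls for a benchmarking function that compares apply() vs. vectorized speed. Here's a minimal sketch of one way to write it; the name `compare`, the `repeats` parameter, and the best-of-N timing strategy are my choices, not part of the lesson:

```python
import time

import numpy as np
import pandas as pd


def compare(df, apply_fn, vec_fn, repeats=3):
    """Time apply_fn vs. vec_fn on df; return (apply_s, vec_s, speedup)."""
    def best_of(fn):
        # Best-of-N reduces noise from OS scheduling and caching
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(df)
            times.append(time.perf_counter() - start)
        return min(times)

    apply_s = best_of(apply_fn)
    vec_s = best_of(vec_fn)
    return apply_s, vec_s, apply_s / vec_s


df = pd.DataFrame({"revenue": np.random.rand(200_000) * 10000})
apply_s, vec_s, speedup = compare(
    df,
    lambda d: d["revenue"].apply(lambda x: x * 0.08),
    lambda d: d["revenue"] * 0.08,
)
print(f"apply: {apply_s:.4f}s  vectorized: {vec_s:.4f}s  ({speedup:.0f}x)")
```

`time.perf_counter()` is preferred over `time.time()` for benchmarks because it's a monotonic, high-resolution clock.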

np.where — vectorized if/else

python
import numpy as np

# Slow: apply with a conditional
df["tier"] = df["revenue"].apply(lambda x: "high" if x > 5000 else "low")

# Fast: np.where — same result, 50x faster
df["tier"] = np.where(df["revenue"] > 5000, "high", "low")

# Multiple conditions with np.select
conditions = [
    df["revenue"] >= 10000,
    df["revenue"] >= 5000,
    df["revenue"] >= 1000,
]
choices = ["enterprise", "mid-market", "small"]
df["segment"] = np.select(conditions, choices, default="micro")

Rule of thumb: If you can express the logic as column arithmetic or np.where/np.select, do that. Only use apply() when the logic is genuinely complex and can't be vectorized.
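
One more tool in the same spirit: when the apply() you're replacing is just a dictionary lookup, `Series.map` does the same job without a Python-level lambda per row. The `rates` mapping below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east", "south"]})
rates = {"east": 0.08, "west": 0.095, "south": 0.07}

# Slower: a Python function call per row
df["rate_apply"] = df["region"].apply(lambda r: rates.get(r, 0.0))

# Faster and clearer: map with the same dict (unmatched keys become NaN)
df["rate_map"] = df["region"].map(rates).fillna(0.0)
```

The `.fillna(0.0)` mirrors the `rates.get(r, 0.0)` default, since `map` leaves unmatched keys as NaN.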

2
Memory

Reducing memory usage

memory.py
# Check current memory usage
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

# Downcast numeric types
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"]    = pd.to_numeric(df["price"], downcast="float")

# Convert low-cardinality strings to category
for col in df.select_dtypes("object").columns:
    if df[col].nunique() / len(df) < 0.05:  # < 5% unique
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # often 3-5x smaller
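
To see those savings concretely, here's a self-contained run on synthetic data; the column names and sizes are made up for the demo:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "quantity": rng.integers(0, 100, 100_000).astype("int64"),
    "price": rng.random(100_000).astype("float64"),
    "region": rng.choice(["east", "west", "south"], 100_000),  # low cardinality
})

before = df.memory_usage(deep=True).sum()

df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")  # int64 -> int8
df["price"] = pd.to_numeric(df["price"], downcast="float")          # float64 -> float32
df["region"] = df["region"].astype("category")                      # 3 unique values

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

The category conversion is the big win here: instead of storing a Python string per row, pandas stores a small integer code per row plus one copy of each unique value.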
3
Chunking

Processing files larger than RAM

If a file won't fit in memory, don't try to load it all at once. Process it in chunks.

chunked.py
import pandas as pd

# Process a 5GB file 100,000 rows at a time
CHUNK_SIZE = 100_000
results = []

for chunk in pd.read_csv("huge_file.csv", chunksize=CHUNK_SIZE):
    # Process each chunk — only this chunk is in memory
    chunk = chunk.dropna(subset=["revenue"])
    chunk["revenue"] = pd.to_numeric(chunk["revenue"], errors="coerce")

    # Aggregate this chunk
    chunk_summary = chunk.groupby("region")["revenue"].sum().reset_index()
    results.append(chunk_summary)

# Combine all chunk results
final = pd.concat(results).groupby("region")["revenue"].sum().reset_index()
print(final.sort_values("revenue", ascending=False))

Pattern: Process each chunk → aggregate it → collect the summaries → combine at the end. You never hold the full dataset in memory, but you get the complete result.
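
A variant of the same pattern worth knowing: instead of collecting per-chunk frames in a list and concatenating at the end, keep one running Series and add each chunk's totals into it. Sketched here on an inline CSV (a stand-in for huge_file.csv with the same columns) so it runs anywhere:

```python
import io

import pandas as pd

# Stand-in for huge_file.csv — same region/revenue columns, small enough to inline
csv_data = io.StringIO(
    "region,revenue\n"
    "east,100\nwest,200\neast,50\nsouth,75\nwest,25\n"
)

running = None
for chunk in pd.read_csv(csv_data, chunksize=2):
    totals = chunk.groupby("region")["revenue"].sum()
    # .add with fill_value=0 aligns on region labels that may be
    # missing from either side
    running = totals if running is None else running.add(totals, fill_value=0)

print(running.sort_values(ascending=False))
```

This keeps memory flat no matter how many chunks there are, since only one small summary Series ever lives alongside the current chunk. Note the caveat for both versions: this works because sums are combinable across chunks; a statistic like a median can't be computed this way.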

4
Only needed columns

Load only what you need

python
# Load only the columns you actually need
df = pd.read_csv(
    "huge_file.csv",
    usecols=["order_id", "revenue", "region", "order_date"],
    dtype={"order_id": "int32", "revenue": "float32"},
    parse_dates=["order_date"]
)

# Specify dtypes at load time — prevents pandas from guessing wrong
# and uses less memory
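
Before committing to a usecols/dtype spec for a file you've never opened, it helps to peek at the first few rows with `nrows`. This sketch writes a tiny stand-in file (the `notes` column is invented to show a column being skipped):

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for huge_file.csv
path = os.path.join(tempfile.mkdtemp(), "huge_file.csv")
pd.DataFrame({
    "order_id": [1, 2, 3],
    "revenue": [10.0, 20.0, 30.0],
    "region": ["east", "west", "east"],
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "notes": ["a", "b", "c"],  # a column we don't need
}).to_csv(path, index=False)

# Peek: read a handful of rows to inspect columns and inferred dtypes
peek = pd.read_csv(path, nrows=3)
print(peek.dtypes)

# Then load for real with only the columns and dtypes you want
df = pd.read_csv(
    path,
    usecols=["order_id", "revenue", "region", "order_date"],
    dtype={"order_id": "int32", "revenue": "float32"},
    parse_dates=["order_date"],
)
```

The `nrows` read is nearly free regardless of file size, because pandas stops after the requested rows instead of scanning the whole file.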

What You Learned Today

  • Why vectorized operations are 10–100x faster than apply()
  • np.where and np.select for vectorized conditionals
  • Reduce memory by downcasting types and using category dtype
  • Process files larger than RAM with chunking
  • Load only needed columns with usecols and specify dtypes at load time
Pandas Masterclass Complete

You know pandas. Now build something.

You can load, clean, transform, analyze time series, and process large datasets. That's the full stack. Put it to work on a real project this week.
