A high-performance data pipeline that processes a 5GB CSV in chunks, uses vectorized operations throughout, and finishes in seconds instead of minutes. Includes a benchmarking function that compares apply() vs. vectorized speed.
Stop looping. Start vectorizing.
Every time you write a for loop or apply() over a DataFrame column, you're leaving performance on the table. Vectorized operations run on entire columns at once using optimized C code — they're often 10–100x faster.
The wrong way vs. the right way
import pandas as pd
import numpy as np
import time
df = pd.DataFrame({"revenue": np.random.rand(1_000_000) * 10000})
# SLOW: apply() with a Python function
start = time.time()
df["tax_apply"] = df["revenue"].apply(lambda x: x * 0.08)
print(f"apply(): {time.time() - start:.2f}s") # ~1.2s
# FAST: vectorized arithmetic
start = time.time()
df["tax_vec"] = df["revenue"] * 0.08
print(f"vectorized: {time.time() - start:.4f}s") # ~0.004s
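The inline timings above work, but a small reusable benchmarking function makes the comparison repeatable. Here is one way to sketch it (the helper's name, the `repeats` parameter, and the best-of-N approach are my own choices, not from the original):

```python
import time

import numpy as np
import pandas as pd


def benchmark(label, fn, repeats=3):
    """Time fn() a few times and report the best wall-clock seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best:.4f}s")
    return best


df = pd.DataFrame({"revenue": np.random.rand(1_000_000) * 10000})

slow = benchmark("apply()", lambda: df["revenue"].apply(lambda x: x * 0.08))
fast = benchmark("vectorized", lambda: df["revenue"] * 0.08)
print(f"speedup: roughly {slow / fast:.0f}x")
```

Taking the best of several runs filters out one-off interruptions from the OS, which a single `time.time()` measurement can't do.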
np.where — vectorized if/else
import numpy as np
# Slow: apply with a conditional
df["tier"] = df["revenue"].apply(lambda x: "high" if x > 5000 else "low")
# Fast: np.where — same result, 50x faster
df["tier"] = np.where(df["revenue"] > 5000, "high", "low")
# Multiple conditions with np.select
conditions = [
    df["revenue"] >= 10000,
    df["revenue"] >= 5000,
    df["revenue"] >= 1000,
]
choices = ["enterprise", "mid-market", "small"]
df["segment"] = np.select(conditions, choices, default="micro")
Rule of thumb: If you can express the logic as column arithmetic or np.where/np.select, do that. Only use apply() when the logic is genuinely complex and can't be vectorized.
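As a concrete illustration of that rule (using a hypothetical `name` column, not one from the examples above): string logic usually has a vectorized spelling too, via the `.str` accessor. The speedup is smaller than for numeric arithmetic, because `.str` methods on object columns still loop internally, but the code is clearer and typically still faster than `apply()`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", "carol"] * 100_000})

# apply(): one Python function call per row
upper_apply = df["name"].apply(lambda s: s.upper())

# .str accessor: the vectorized spelling of the same logic
upper_vec = df["name"].str.upper()

assert upper_apply.equals(upper_vec)  # identical results
```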
Reducing memory usage
# Check current memory usage
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# Downcast numeric types (example columns; substitute your own)
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")
# Convert low-cardinality strings to category
for col in df.select_dtypes("object").columns:
    if df[col].nunique() / len(df) < 0.05:  # < 5% unique
        df[col] = df[col].astype("category")
print(df.memory_usage(deep=True).sum() / 1e6, "MB") # often 3-5x smaller
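The steps above can be wrapped into one reusable helper. This is a sketch, not a definitive implementation (the function name, the 5% threshold default, and the sample columns are all assumptions):

```python
import numpy as np
import pandas as pd


def shrink(df, cat_threshold=0.05):
    """Downcast numeric columns and convert low-cardinality strings to category."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        else:
            out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="object").columns:
        if out[col].nunique() / len(out) < cat_threshold:
            out[col] = out[col].astype("category")
    return out


# Hypothetical sales-style frame to demonstrate the savings
df = pd.DataFrame({
    "quantity": np.arange(100_000) % 50,    # int64, but values fit in int8
    "price": np.random.rand(100_000) * 10,  # float64 -> float32
    "region": ["east", "west"] * 50_000,    # 2 unique values -> category
})
before = df.memory_usage(deep=True).sum() / 1e6
after = shrink(df).memory_usage(deep=True).sum() / 1e6
print(f"{before:.1f} MB -> {after:.1f} MB")
```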
Processing files larger than RAM
If a file won't fit in memory, don't try to load it all at once. Process it in chunks.
import pandas as pd
# Process a 5GB file 100,000 rows at a time
CHUNK_SIZE = 100_000
results = []
for chunk in pd.read_csv("huge_file.csv", chunksize=CHUNK_SIZE):
    # Process each chunk — only this chunk is in memory
    chunk["revenue"] = pd.to_numeric(chunk["revenue"], errors="coerce")
    chunk = chunk.dropna(subset=["revenue"])  # drop rows coerce turned into NaN
    # Aggregate this chunk
    chunk_summary = chunk.groupby("region")["revenue"].sum().reset_index()
    results.append(chunk_summary)
# Combine all chunk results
final = pd.concat(results).groupby("region")["revenue"].sum().reset_index()
print(final.sort_values("revenue", ascending=False))
Pattern: Process each chunk → aggregate it → collect the summaries → combine at the end. You never hold the full dataset in memory, but you get the complete result.
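One caveat the pattern glosses over: summing per-chunk sums works because sum combines cleanly across chunks, but you can't average per-chunk means. For a mean, carry sum and count per chunk and divide at the end. A toy sketch, with small in-memory frames standing in for `read_csv` chunks:

```python
import pandas as pd

# Two toy chunks standing in for pd.read_csv(..., chunksize=...) output
chunks = [
    pd.DataFrame({"region": ["east", "west"], "revenue": [100.0, 200.0]}),
    pd.DataFrame({"region": ["east"], "revenue": [300.0]}),
]

parts = []
for chunk in chunks:
    # Keep sum AND count per chunk; both combine safely across chunks
    parts.append(chunk.groupby("region")["revenue"].agg(["sum", "count"]))

combined = pd.concat(parts).groupby(level=0).sum()
combined["mean"] = combined["sum"] / combined["count"]
print(combined)
```

Here "east" appears in both chunks (sum 400.0 over 2 rows), so averaging the two per-chunk means would have weighted them incorrectly.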
Load only what you need
# Load only the columns you actually need
df = pd.read_csv(
    "huge_file.csv",
    usecols=["order_id", "revenue", "region", "order_date"],
    dtype={"order_id": "int32", "revenue": "float32"},
    parse_dates=["order_date"],
)
# Specify dtypes at load time — prevents pandas from guessing wrong
# and uses less memory
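A quick way to see the effect without a real 5GB file (the in-memory CSV and its `notes` column are made up for illustration):

```python
import io

import pandas as pd

# In-memory stand-in for huge_file.csv (columns and values are invented)
csv = io.StringIO(
    "order_id,revenue,region,order_date,notes\n"
    "1,99.5,east,2024-01-05,first order\n"
    "2,12.0,west,2024-01-06,repeat customer\n"
)

df = pd.read_csv(
    csv,
    usecols=["order_id", "revenue", "region", "order_date"],  # "notes" never loaded
    dtype={"order_id": "int32", "revenue": "float32"},
    parse_dates=["order_date"],
)
print(df.dtypes)
```

The `notes` column is skipped at parse time, so it never takes up memory, and the explicit dtypes land exactly as requested instead of defaulting to 64-bit.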
What You Learned Today
- Why vectorized operations are 10–100x faster than apply()
- np.where and np.select for vectorized conditionals
- Reduce memory by downcasting types and using category dtype
- Process files larger than RAM with chunking
- Load only needed columns with usecols and specify dtypes at load time
You know pandas. Now build something.
You can load, clean, transform, analyze time series, and process large datasets. That's the full stack. Put it to work on a real project this week.