A preprocessing pipeline using scikit-learn's Pipeline and ColumnTransformer that handles mixed-type data — ready to drop into any ML project with messy real-world data.
Handle Missing Values
Real-world data almost always has missing values. You have three choices: drop rows, drop columns, or impute (fill in estimated values). The right choice depends on how much data is missing and why.
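Before reaching for an imputer, it helps to see all three options side by side. A minimal sketch on a toy frame (the 50% threshold is just an illustrative cutoff):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [np.nan, np.nan, 9],
    'c': [4, 5, 6]
})

# Option 1: drop rows with any missing value (fine when few rows are affected)
dropped_rows = df.dropna()

# Option 2: drop columns that are mostly missing (here: more than 50% NaN)
kept_cols = df.loc[:, df.isnull().mean() <= 0.5]

# Option 3: impute, e.g. with each column's median
imputed = kept_cols.fillna(kept_cols.median())
print(imputed)
```

Dropping rows throws away the signal in every other column of that row, so imputation is usually the default once more than a few percent of rows are affected.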
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Simulate messy data
df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
    'department': ['Eng', 'Sales', np.nan, 'Eng', 'HR'],
    'churned': [0, 1, 0, 1, 0]
})
# See missing count per column
print(df.isnull().sum())
# Impute numeric: replace NaN with column median
num_imputer = SimpleImputer(strategy='median')
df[['age', 'salary']] = num_imputer.fit_transform(df[['age', 'salary']])
# Impute categorical: replace NaN with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['department']] = cat_imputer.fit_transform(df[['department']])

Encode Categorical Features
ML models need numbers, so categorical columns like 'department' must be encoded. Two main approaches: One-Hot Encoding (for nominal categories with no order) and Ordinal Encoding (for categories with a natural order). Note that scikit-learn's LabelEncoder is intended for target labels, not input features.
from sklearn.preprocessing import OneHotEncoder
# One-Hot Encoding (nominal: no order)
# With drop='first', the alphabetically first category (Eng) is dropped,
# leaving indicator columns department_HR and department_Sales
ohe = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = ohe.fit_transform(df[['department']])
dept_df = pd.DataFrame(dept_encoded, columns=ohe.get_feature_names_out())
df = pd.concat([df.drop('department', axis=1), dept_df], axis=1)
# Ordinal Encoding (ordinal: has order like Low/Med/High)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
# Maps Low=0, Medium=1, High=2

Why drop='first'? With 3 categories, you only need 2 columns — the third is implied when both are 0. This avoids multicollinearity (the dummy variable trap).
Build a sklearn Pipeline
Pipelines chain preprocessing and model training into one object. This prevents data leakage and makes deployment much cleaner — you call .fit() once and .predict() anywhere.
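To see the leakage point in action: because the scaler sits inside the pipeline, cross-validation refits it on each fold's training portion only, so no test-fold statistics ever reach the model. A minimal sketch with synthetic data (dataset and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)  # label depends only on the first feature

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training data only
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Had the scaler been fit on all of X before cross-validation, each training fold would have seen statistics computed from its own test fold.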
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
numeric_features = ['age', 'salary']
categorical_features = ['department']
# Numeric: impute then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical: impute then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine both
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
# Full pipeline: preprocessing + model
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Split the raw feature columns before fitting anything, so no test data
# leaks into preprocessing (assumes df still holds the raw columns)
from sklearn.model_selection import train_test_split
X = df[numeric_features + categorical_features]
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train on raw data — pipeline handles all preprocessing
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.1%}")

Feature Engineering
Feature engineering is creating new columns from existing ones that give the model more predictive signal. This is where domain knowledge beats raw algorithm power.
# Common feature engineering patterns
# (illustrative — assumes columns like years_exp and join_date exist)
# 1. Ratios
df['salary_per_year_experience'] = df['salary'] / (df['years_exp'] + 1)
# 2. Interaction features
df['age_x_salary'] = df['age'] * df['salary']
# 3. Date features
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp.now() - df['join_date']).dt.days
df['join_month'] = df['join_date'].dt.month
# 4. Binning continuous to categories
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 35, 50, 100],
                         labels=['young', 'mid', 'senior', 'veteran'])
# 5. Log transform for skewed distributions
df['log_salary'] = np.log1p(df['salary'])  # log1p handles zeros

What You Learned Today
- Handled missing values using median imputation for numeric and most_frequent for categorical
- Applied One-Hot Encoding to nominal categories while avoiding the dummy variable trap
- Built a full sklearn Pipeline with ColumnTransformer that preprocesses and trains in one step
- Engineered new features using ratios, date extraction, and log transforms
Go Further on Your Own
- Add a custom transformer to your pipeline using sklearn's FunctionTransformer
- Try removing engineered features one at a time and see how accuracy changes
- Build a pipeline for the Titanic dataset from Kaggle (passenger survival prediction)
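As a starting point for the first exercise, here is a minimal sketch of FunctionTransformer wrapping the log transform from above (the two-step layout is just one option):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Wrap np.log1p so it can sit inside a pipeline like any other step
log_step = FunctionTransformer(np.log1p)

pipe = Pipeline(steps=[
    ('log', log_step),
    ('scaler', StandardScaler())
])

X = np.array([[1.0], [10.0], [100.0]])
out = pipe.fit_transform(X)
print(out)  # log-transformed, then standardized to mean 0
```

From here, try swapping in your own function — FunctionTransformer accepts any callable that maps an array to an array.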
Nice work. Keep going.
Day 3 is ready when you are.
Continue to Day 3