ML Fundamentals · Day 2 of 5 · ~40 minutes

Day 2: Data Preprocessing and Feature Engineering

Handle real-world messy data: missing values, categorical encoding, feature scaling, and creating new features that improve model performance.

What You'll Build

A preprocessing pipeline using scikit-learn's Pipeline and ColumnTransformer that handles mixed-type data — ready to drop into any ML project with messy real-world data.

Section 1 · 10 min

Handle Missing Values

Real-world data almost always has missing values. You have three choices: drop the affected rows, drop the affected columns, or impute (fill in estimated values). The right choice depends on how much data is missing and why it is missing.

preprocess.py
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Simulate messy data
df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
    'department': ['Eng', 'Sales', np.nan, 'Eng', 'HR'],
    'churned': [0, 1, 0, 1, 0]
})

# See missing count per column
print(df.isnull().sum())

# Impute numeric: replace NaN with column median
num_imputer = SimpleImputer(strategy='median')
df[['age', 'salary']] = num_imputer.fit_transform(df[['age', 'salary']])

# Impute categorical: replace NaN with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['department']] = cat_imputer.fit_transform(df[['department']])
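The snippet above covers imputation; the other two options from the paragraph can be sketched with `dropna` on a small example (column names here mirror the demo DataFrame):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, 70000],
})

# Option 1: drop any row with a missing value (loses 3 of 5 rows here)
dropped_rows = df.dropna()

# Option 2: drop sparse columns
# thresh = minimum number of non-null values a column needs to survive
dropped_cols = df.dropna(axis=1, thresh=4)

print(len(dropped_rows))                   # 2
print(dropped_cols.columns.tolist())       # ['salary']
```

Dropping is fine when little data is lost; imputation preserves rows at the cost of some distortion.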
Section 2 · 10 min

Encode Categorical Features

ML models need numbers, so categorical columns like 'department' must be encoded. Two main approaches: One-Hot Encoding for nominal categories (no inherent order) and Ordinal Encoding for ordered categories.

preprocess.py (continued)
from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding (nominal: no order)
# 'department' becomes: dept_Eng, dept_HR, dept_Sales
ohe = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = ohe.fit_transform(df[['department']])
dept_df = pd.DataFrame(dept_encoded, columns=ohe.get_feature_names_out())
df = pd.concat([df.drop('department', axis=1), dept_df], axis=1)

# Ordinal Encoding (ordinal: has an order like Low/Medium/High)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
# Maps Low=0, Medium=1, High=2 — use only when the order is meaningful

Why drop='first'? With 3 categories, you only need 2 columns — the third is implied. This avoids multicollinearity (the dummy variable trap).
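To make the ordinal case concrete, here is a minimal runnable sketch on a hypothetical 'priority' column (the column name and values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ticket data with a naturally ordered category
tickets = pd.DataFrame({'priority': ['Low', 'High', 'Medium', 'Low']})

# Passing categories explicitly fixes the order: Low=0, Medium=1, High=2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
tickets['priority_encoded'] = oe.fit_transform(tickets[['priority']])[:, 0]

print(tickets['priority_encoded'].tolist())  # [0.0, 2.0, 1.0, 0.0]
```

If you had one-hot encoded this column instead, the model would have to rediscover the Low < Medium < High ordering from data.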

Section 3 · 10 min

Build a sklearn Pipeline

Pipelines chain preprocessing and model training into one object. This prevents data leakage and makes deployment much cleaner — you call .fit() once and .predict() anywhere.

pipeline.py
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

numeric_features = ['age', 'salary']
categorical_features = ['department']

# Numeric: impute then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessing + model
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train on raw data — pipeline handles all preprocessing
# (assumes X_train/X_test, y_train/y_test come from an earlier train_test_split)
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.1%}")
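The leakage point is easiest to see with cross-validation: each fold re-fits the imputer and scaler on that fold's training split only, so no validation statistics leak into training. A minimal self-contained sketch on synthetic data (the column names mirror the example above; the data itself is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.integers(22, 60, n).astype(float),
    'salary': rng.normal(60000, 10000, n),
    'department': rng.choice(['Eng', 'Sales', 'HR'], n),
})
X.loc[X.sample(frac=0.1, random_state=0).index, 'age'] = np.nan  # inject missing values
y = (X['salary'] > 60000).astype(int)

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([('num', numeric, ['age', 'salary']),
                                  ('cat', categorical, ['department'])])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('classifier', RandomForestClassifier(n_estimators=50, random_state=42))])

# Each of the 5 folds fits its own imputer/scaler — no leakage across folds
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.1%}")
```

If you imputed and scaled the whole DataFrame before splitting, the validation folds would influence the training statistics, inflating your scores.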
Section 4 · 10 min

Feature Engineering

Feature engineering is creating new columns from existing ones that give the model more predictive signal. This is where domain knowledge beats raw algorithm power.

features.py
# Common feature engineering patterns
# (illustrative: assumes columns like 'years_exp' and 'join_date' exist)

# 1. Ratios
df['salary_per_year_experience'] = df['salary'] / (df['years_exp'] + 1)

# 2. Interaction features
df['age_x_salary'] = df['age'] * df['salary']

# 3. Date features
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp.now() - df['join_date']).dt.days
df['join_month'] = df['join_date'].dt.month

# 4. Binning continuous to categories
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['young', 'mid', 'senior', 'veteran']
)

# 5. Log transform for skewed distributions
df['log_salary'] = np.log1p(df['salary'])  # log1p handles zeros

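Two of the patterns above (binning and log transform) can be run end to end on a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up data just to demonstrate the transforms
people = pd.DataFrame({'age': [22, 30, 41, 67],
                       'salary': [0, 50000, 90000, 120000]})

# Binning: each age falls into exactly one labeled interval
people['age_group'] = pd.cut(people['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['young', 'mid', 'senior', 'veteran'])

# Log transform: log1p(x) = log(1 + x), so a zero salary stays finite
people['log_salary'] = np.log1p(people['salary'])

print(people['age_group'].tolist())   # ['young', 'mid', 'senior', 'veteran']
print(people['log_salary'].iloc[0])   # 0.0
```

A plain `np.log` would produce `-inf` for the zero salary, which is why `log1p` is the usual choice for skewed, non-negative features.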
What You Learned Today

  • Handled missing values using median imputation for numeric and most_frequent for categorical
  • Applied One-Hot Encoding to nominal categories while avoiding the dummy variable trap
  • Built a full sklearn Pipeline with ColumnTransformer that preprocesses and trains in one step
  • Engineered new features using ratios, date extraction, and log transforms
Your Challenge

Go Further on Your Own

  • Add a custom transformer to your pipeline using sklearn's FunctionTransformer
  • Try removing engineered features one at a time and see how accuracy changes
  • Build a pipeline for the Titanic dataset from Kaggle (passenger survival prediction)
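As a starting point for the first challenge item, here is a minimal sketch of wrapping a NumPy function as a pipeline step with FunctionTransformer (the choice of log transform is illustrative, not the only option):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# FunctionTransformer turns any stateless function into a pipeline step
log_step = FunctionTransformer(np.log1p)

pipe = Pipeline([('log', log_step), ('scale', StandardScaler())])

X = np.array([[0.0], [9.0], [99.0]])
transformed = pipe.fit_transform(X)  # log1p, then standardize
print(transformed.ravel())
```

From here, try slotting a step like this into the numeric branch of your ColumnTransformer.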
Day 2 Complete

Nice work. Keep going.

Day 3 is ready when you are.

Continue to Day 3

Want live instruction and hands-on projects? Join the AI bootcamp — 3 days, 5 cities.
