Feature Engineering Guide: Transform Raw Data into ML Features

Key Takeaways

Feature engineering is the process of transforming raw data into the input features that ML algorithms use. It is arguably the highest-leverage skill in applied ML. This guide covers the core techniques with Python code.

Encoding Categoricals

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# One-hot encoding for nominal categories (no order)
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded = enc.fit_transform(df[['category', 'region']])

# Target encoding for high-cardinality categories
# (replace each category with its mean target value;
#  compute the means on training data only to avoid target leakage)
target_means = df.groupby('city')['churn'].mean()
df['city_churn_rate'] = df['city'].map(target_means)

# Ordinal encoding for truly ordered categories only
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_map)
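The target-encoding snippet above computes the means over the whole DataFrame, which leaks the target into the features. A minimal leakage-safe sketch, using a toy DataFrame with hypothetical `city` and `churn` columns: fit the encoding on training rows only, and fall back to the global rate for categories unseen in training.

```python
import pandas as pd

# Toy data standing in for a real dataset (hypothetical columns)
train = pd.DataFrame({'city': ['NYC', 'NYC', 'LA', 'LA', 'SF'],
                      'churn': [1, 0, 1, 1, 0]})
test = pd.DataFrame({'city': ['NYC', 'SF', 'Austin']})

# Fit the encoding on training data only
global_rate = train['churn'].mean()
city_rates = train.groupby('city')['churn'].mean()

# Map onto test; unseen categories fall back to the global rate
test['city_churn_rate'] = test['city'].map(city_rates).fillna(global_rate)
print(test['city_churn_rate'].tolist())  # [0.5, 0.0, 0.6]
```

For stricter hygiene, compute the means per cross-validation fold so no row is encoded with statistics that include its own target.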

Date Features

df['signup_date'] = pd.to_datetime(df['signup_date'])

# Extract temporal components
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
df['signup_quarter'] = df['signup_date'].dt.quarter
df['is_weekend'] = (df['signup_date'].dt.dayofweek >= 5).astype(int)

# Duration features (prefer a fixed snapshot date over now() for reproducibility)
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['days_since_last_login'] = (pd.Timestamp.now() - df['last_login']).dt.days

Scaling and Transformations

from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

# StandardScaler: mean=0, std=1. Use for linear models, neural nets
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled = scaler.transform(X_test)  # transform with same scaler

# RobustScaler: uses median/IQR. Better when outliers are present
robust = RobustScaler()
X_robust = robust.fit_transform(X_train)

# Log transform for right-skewed distributions (revenue, prices)
df['log_revenue'] = np.log1p(df['revenue'])  # log(x+1) handles zeros

# Binning: continuous to categorical
df['age_group'] = pd.cut(df['age'], bins=[0,25,35,50,100],
                         labels=['young','adult','middle','senior'])

Interaction Features

# Domain-knowledge interaction and ratio features (+1 avoids division by zero)
df['revenue_per_visit'] = df['total_revenue'] / (df['num_visits'] + 1)
df['avg_order_size'] = df['total_revenue'] / (df['num_orders'] + 1)
df['engagement_score'] = df['logins'] * df['features_used']

# Polynomial features for linear models
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[numeric_features])

For a churn model, days_since_last_login / account_age captures engagement decay. For fraud detection, transaction_amount / average_account_transaction flags unusual activity. These ratios encode domain knowledge that most algorithms struggle to recover from the raw features alone.
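The two ratios named above can be sketched directly. This is a toy illustration with hypothetical column names (`account_age`, `avg_account_transaction`); real pipelines would compute the account average from historical transactions.

```python
import pandas as pd

# Hypothetical churn/fraud columns for three accounts
df = pd.DataFrame({'days_since_last_login': [2, 30, 90],
                   'account_age': [100, 100, 100],
                   'transaction_amount': [500.0, 50.0, 49.0],
                   'avg_account_transaction': [50.0, 50.0, 50.0]})

# Engagement decay: login recency relative to how long the account has existed
df['login_recency_ratio'] = df['days_since_last_login'] / (df['account_age'] + 1)

# Fraud signal: how far a transaction deviates from the account's norm
df['amount_vs_typical'] = df['transaction_amount'] / (df['avg_account_transaction'] + 1)
```

The first account transacted ten times its usual amount; the third has gone most of its lifetime without logging in. Both signals are invisible in any single raw column.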

Frequently Asked Questions

What is feature engineering?

Feature engineering transforms raw data columns into the input variables that maximize ML model performance. It includes encoding categoricals, extracting date components, scaling numerics, creating ratio and interaction features, and aggregating related records. Good feature engineering often improves model accuracy more than switching to a more complex algorithm.

When should I use one-hot vs label encoding?

Use one-hot encoding for nominal categories (no inherent order: color, city, product type). Use label/ordinal encoding only for truly ordinal categories (small < medium < large, low < medium < high). Applying label encoding to nominal categories creates false numerical relationships that can mislead algorithms.

Does feature scaling matter for all algorithms?

Tree-based algorithms (Random Forest, XGBoost, LightGBM) are invariant to feature scaling. Algorithms using distance or gradient optimization (Logistic Regression, SVM, KNN, neural networks) are sensitive to scale. Always scale for these algorithms, fitting the scaler on training data only to prevent leakage.
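One way to guarantee the scaler is fit on training data only is to wrap it in a scikit-learn Pipeline, so cross-validation refits it inside each fold automatically. A minimal sketch on synthetic data with wildly different feature scales:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 0.01]  # three very different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

# The pipeline scales inside each CV fold -- no held-out statistics leak in
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Calling `scaler.fit_transform` on the full dataset before splitting is the classic leakage mistake the pipeline structurally prevents.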

How do I identify which features are most important?

For tree-based models: use feature_importances_ (RandomForest, XGBoost) or permutation_importance. For any model: use SHAP values from the shap library, which provides consistent feature importance across all model types and explains individual predictions.
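A quick sketch of permutation importance on synthetic data, where only the first feature carries signal: shuffling an informative column should tank the score, while shuffling noise columns should barely move it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importance = drop in score when each column is independently shuffled
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```

In real use, compute permutation importance on a held-out set rather than the training data, so the scores reflect generalization rather than memorization.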

Great features are the foundation of great models. Get the skills.

Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.

Reserve Your Seat

Note: Information reflects early 2026.


Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies.