Key Takeaways
- Features beat algorithms: Better features improve model performance more than switching to a more complex algorithm. A simple logistic regression with great features often outperforms a neural network with poor features.
- Encoding requires care: Never use label encoding for nominal categories — it implies an order that does not exist. Use one-hot encoding for low-cardinality categories and target encoding for high-cardinality ones.
- Scale after splitting: Always fit scalers on training data only, then apply to test data. Fitting on the full dataset before splitting is data leakage.
- Domain knowledge drives the best features: Ratio features, interaction features, and domain-specific aggregations (days since last purchase, average order size) often have more predictive power than any individual raw feature.
Feature engineering is the process of transforming raw data into the input features that ML algorithms use. It is arguably the highest-leverage skill in applied ML. This guide covers the core techniques with Python code.
Encoding Categoricals
```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# One-hot encoding for nominal categories (no order)
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded = enc.fit_transform(df[['category', 'region']])

# Target encoding for high-cardinality categories
# (replace category with its mean target value)
target_means = df.groupby('city')['churn'].mean()
df['city_churn_rate'] = df['city'].map(target_means)

# Ordinal encoding for truly ordered categories only
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['size_encoded'] = df['size'].map(size_map)
```
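One caveat on target encoding: computing category means on the full dataset leaks target information into the features. A minimal sketch of the leakage-safe version, using hypothetical `city`/`churn` columns and a global-mean fallback for cities unseen in training:

```python
import pandas as pd

# Hypothetical train/test frames; in practice these come from your split
train = pd.DataFrame({'city': ['A', 'A', 'B', 'B'], 'churn': [1, 0, 1, 1]})
test = pd.DataFrame({'city': ['A', 'B', 'C']})

# Compute encoding statistics on the TRAINING split only
target_means = train.groupby('city')['churn'].mean()   # A: 0.5, B: 1.0
global_mean = train['churn'].mean()                    # fallback for unseen cities

train['city_churn_rate'] = train['city'].map(target_means)
test['city_churn_rate'] = test['city'].map(target_means).fillna(global_mean)
```

City 'C' never appears in training, so it falls back to the global churn rate rather than getting a missing value.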
Date Features
```python
df['signup_date'] = pd.to_datetime(df['signup_date'])

# Extract temporal components
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
df['signup_quarter'] = df['signup_date'].dt.quarter
df['is_weekend'] = (df['signup_date'].dt.dayofweek >= 5).astype(int)

# Duration features
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['days_since_last_login'] = (pd.Timestamp.now() - df['last_login']).dt.days
```
Scaling and Transformations
```python
from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

# StandardScaler: mean=0, std=1. Use for linear models, neural nets
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)    # fit on train only
X_test_scaled = scaler.transform(X_test)    # transform with same scaler

# RobustScaler: uses median/IQR. Better when outliers are present
robust = RobustScaler()

# Log transform for right-skewed distributions (revenue, prices)
df['log_revenue'] = np.log1p(df['revenue'])  # log(x+1) handles zeros

# Binning: continuous to categorical
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 25, 35, 50, 100],
                         labels=['young', 'adult', 'middle', 'senior'])
```
Interaction Features
```python
# Domain-knowledge interaction features
df['revenue_per_visit'] = df['total_revenue'] / (df['num_visits'] + 1)
df['avg_order_size'] = df['total_revenue'] / (df['num_orders'] + 1)
df['engagement_score'] = df['logins'] * df['features_used']

# Polynomial features for linear models
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[numeric_features])
```
For a churn model, days_since_last_login / account_age captures engagement decay. For fraud, transaction_amount / average_account_transaction captures unusual activity. These ratios encode domain knowledge that the algorithm cannot learn from raw features alone.
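The engagement-decay ratio above can be sketched in a few lines. The column names (`days_since_last_login`, `account_age_days`) are illustrative assumptions, not a fixed schema:

```python
import pandas as pd

# Hypothetical churn frame: same tenure, very different recency
df = pd.DataFrame({
    'days_since_last_login': [2, 30, 90],
    'account_age_days': [100, 100, 100],
})

# Near 0 = recently active; near 1 = dormant for most of the account's life
df['engagement_decay'] = df['days_since_last_login'] / df['account_age_days']
```

A raw `days_since_last_login` of 30 means something very different for a 40-day-old account than for a 4-year-old one; the ratio normalizes that away.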
Frequently Asked Questions
What is feature engineering?
Feature engineering transforms raw data columns into the input variables that maximize ML model performance. It includes encoding categoricals, extracting date components, scaling numerics, creating ratio and interaction features, and aggregating related records. Good feature engineering often improves model accuracy more than switching to a more complex algorithm.
When should I use one-hot vs label encoding?
Use one-hot encoding for nominal categories (no inherent order: color, city, product type). Use label/ordinal encoding only for truly ordinal categories (small < medium < large, low < medium < high). Applying label encoding to nominal categories creates false numerical relationships that can mislead algorithms.
Does feature scaling matter for all algorithms?
Tree-based algorithms (Random Forest, XGBoost, LightGBM) are invariant to feature scaling. Algorithms using distance or gradient optimization (Logistic Regression, SVM, KNN, neural networks) are sensitive to scale. Always scale for these algorithms, fitting the scaler on training data only to prevent leakage.
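A convenient way to guarantee the fit-on-train-only rule is to put the scaler inside a `Pipeline`, so it is refit on the training fold of every `fit` or cross-validation call automatically. A sketch on synthetic data with two features on wildly different scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): feature 1 is ~1000x the scale of feature 0
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits StandardScaler on the training data only, then
# transforms the test data with those same statistics — no leakage
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```

Because the scaler travels with the model, you cannot accidentally fit it on the full dataset before splitting.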
How do I identify which features are most important?
For tree-based models: use feature_importances_ (RandomForest, XGBoost) or permutation_importance. For any model: use SHAP values from the shap library, which provides consistent feature importance across all model types and explains individual predictions.
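Both approaches can be sketched on synthetic data where only the first feature carries signal; either method should rank it first:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (assumed): feature 0 determines the label, feature 1 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances come free with the trained model;
# permutation importance measures the score drop when a column is shuffled
imp = model.feature_importances_
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
```

Permutation importance is generally the more trustworthy of the two, since impurity-based importances can be inflated for high-cardinality features.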
Great features are the foundation of great models. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.
Reserve Your Seat
Note: Information reflects early 2026.