Key Takeaways
- The consistent API: Every scikit-learn estimator uses fit(X_train, y_train) to train and predict(X_test) to make predictions. Learn this pattern once, apply it to every algorithm.
- Pipelines prevent data leakage: Always use Pipeline to chain preprocessing and modeling. This ensures preprocessing is fit only on training data, preventing test set information from leaking into the model.
- Cross-validation for reliable estimates: A single train/test split can be misleading. Use 5-fold cross_val_score for more reliable performance estimates before reporting results.
- Start with Random Forest: RandomForestClassifier and RandomForestRegressor are excellent default starting algorithms. They handle mixed feature types, provide feature importance, and rarely overfit badly.
Scikit-learn is the most important machine learning library in the Python ecosystem. It provides a clean, consistent API for dozens of algorithms — plus tools for preprocessing, model selection, and evaluation.
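The uniform estimator API is worth seeing concretely. A minimal sketch, using synthetic data from `make_classification`: the same `fit` / `predict` / `score` calls work unchanged whether the estimator is a linear model or an ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two very different algorithms, identical interface
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)        # train
    y_pred = model.predict(X_test)     # predict
    print(type(model).__name__, model.score(X_test, y_test))  # accuracy
```

Swapping algorithms means changing one line; the rest of the workflow stays identical.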
The Complete ML Workflow
```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Load and prepare data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churned', axis=1)
y = df['churned']

# 2. Train/test split (stratify keeps the class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Build pipeline (preprocessing + model)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# 6. Cross-validate for a more reliable estimate
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
Evaluation Metrics
Classification: Accuracy (overall correct rate), Precision (of predicted positives, how many are correct), Recall (of actual positives, how many did we find), F1 (harmonic mean of precision and recall), ROC-AUC (overall classifier performance).
Regression: MAE (mean absolute error, interpretable in original units), RMSE (penalizes large errors more), R-squared (fraction of variance explained).
For imbalanced classes (e.g., a 5% churn rate), accuracy is misleading: always predicting "no churn" scores 95%. Use F1 score or ROC-AUC instead. To improve minority-class performance, set class_weight='balanced' in the classifier (which weights each class inversely to its frequency) or oversample the minority class with SMOTE.
Hyperparameter Tuning
```python
from sklearn.model_selection import RandomizedSearchCV

# Prefix each parameter with its pipeline step name ('model__')
param_distributions = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10]
}

search = RandomizedSearchCV(
    pipeline, param_distributions, cv=5, scoring='f1',
    n_iter=20, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV F1:", search.best_score_)
```
Algorithm Cheat Sheet
| Algorithm | Best For | Scaling Needed? | Handles Nulls? |
|---|---|---|---|
| Logistic Regression | Binary classification, interpretable models | Yes | No |
| Random Forest | Tabular data, feature importance | No | No |
| Gradient Boosting | Best accuracy on tabular data | No | No (HistGradientBoosting and XGBoost: yes) |
| SVM | High-dimensional data, text classification | Yes | No |
| KNN | Simple problems, small datasets | Yes | No |
| Linear Regression | Continuous prediction, interpretable | Yes | No |
Frequently Asked Questions
What is the best algorithm to start with in scikit-learn?
RandomForestClassifier for classification and RandomForestRegressor for regression are the best starting algorithms for tabular data. They handle mixed feature types, provide feature importance scores, are robust to hyperparameter choices, and rarely overfit badly. Once you have a baseline, try XGBoost or LightGBM for potentially higher accuracy.
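A short sketch of pulling feature importances from a fitted forest, using a dataset bundled with scikit-learn (the dataset choice is illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ sums to 1; rank features most-to-least important
importances = pd.Series(rf.feature_importances_,
                        index=data.feature_names).sort_values(ascending=False)
print(importances.head(5))
```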
What is data leakage and how do I prevent it?
Data leakage occurs when information from the test set is used during model training, producing inflated performance estimates. The most common form: fitting preprocessing (StandardScaler, imputer) on the entire dataset including test rows. Prevent with scikit-learn Pipelines — fit the full pipeline on X_train only, then transform X_test with the same fitted pipeline.
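The leak is easy to demonstrate. A minimal sketch with random data: a scaler fit on all rows learns different statistics than one fit on the training rows only, which means the leaky version has peeked at the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X_train, X_test = train_test_split(X, random_state=42)

# Leaky: statistics computed from ALL rows, including the test set
leaky = StandardScaler().fit(X)

# Correct: fit on training rows only, reuse the fitted scaler on test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned means differ -> the leaky scaler saw test data
print(np.allclose(leaky.mean_, scaler.mean_))  # False
```

A Pipeline makes the correct version automatic: calling pipeline.fit(X_train, y_train) fits every step on training data only.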
How do I handle imbalanced classes in scikit-learn?
For imbalanced datasets: set class_weight='balanced' in the classifier (which weights each class inversely to its frequency), use SMOTE from the imbalanced-learn library to oversample the minority class, or under-sample the majority class. Evaluate with F1 score or ROC-AUC rather than accuracy. A model that always predicts the majority class will have high accuracy but is useless.
What is cross-validation and why does it matter?
Cross-validation evaluates model performance by training and testing on multiple different splits of the data. 5-fold CV splits the data into 5 parts, trains on 4 and tests on 1, rotates 5 times, and averages the scores. This gives a more reliable estimate of real-world performance than a single train/test split, which can be lucky or unlucky depending on which observations land in the test set.
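As a minimal sketch of 5-fold cross-validation (the iris dataset here is just a convenient bundled example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: five rotations of train-on-4-folds, test-on-1-fold
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores)                        # one accuracy score per fold
print(scores.mean(), scores.std())  # average and spread
```

The spread across folds is as informative as the mean: a large standard deviation warns that a single train/test split could have been lucky or unlucky.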
Build your first model today with scikit-learn. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.
Reserve Your Seat
Note: Information reflects early 2026.