Scikit-Learn Tutorial [2026]: Your First ML Model

Key Takeaways

Scikit-learn is the most important machine learning library in the Python ecosystem. It provides a clean, consistent API for dozens of algorithms — plus tools for preprocessing, model selection, and evaluation.

The Complete ML Workflow

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# 1. Load and prepare data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churned', axis=1)
y = df['churned']

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Build pipeline (preprocessing + model)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# 6. Cross-validate for reliable estimate
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

Evaluation Metrics

Classification: Accuracy (overall correct rate), Precision (of predicted positives, how many are correct), Recall (of actual positives, how many did we find), F1 (harmonic mean of precision and recall), ROC-AUC (overall classifier performance).
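Each of these metrics is a one-line call in sklearn.metrics. A minimal sketch on made-up toy labels (the arrays here are illustrative, not from the churn dataset):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy ground truth and predictions, purely for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted P(class=1)

acc = accuracy_score(y_true, y_pred)    # overall correct rate
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # needs probabilities, not hard labels
print(acc, prec, rec, f1, auc)
```

Note that ROC-AUC takes predicted probabilities (from `predict_proba`), while the other four take hard class labels.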

Regression: MAE (mean absolute error, interpretable in original units), RMSE (penalizes large errors more), R-squared (fraction of variance explained).
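The regression metrics follow the same pattern. A small sketch on invented numbers (RMSE is computed as the square root of MSE so it works across scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions, purely for illustration
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)             # avg |error|, original units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # large errors weigh more
r2 = r2_score(y_true, y_pred)                         # 1.0 = perfect, 0.0 = mean baseline
print(mae, rmse, r2)
```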

For imbalanced classes (e.g., 5% churn rate), accuracy is misleading. Use F1 score or ROC-AUC. Set class_weight='balanced' in the classifier or use SMOTE oversampling.
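The class_weight fix is a one-argument change. A sketch on a synthetic ~5%-positive dataset (make_classification here is an illustrative stand-in for real churn data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic dataset with roughly 5% positives, mimicking a churn rate
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight='balanced',
                              max_iter=1000).fit(X, y)

# 'balanced' weights classes inversely to frequency, so the model
# stops defaulting to the majority class
r_plain = recall_score(y, plain.predict(X))
r_balanced = recall_score(y, balanced.predict(X))
print(f"recall plain: {r_plain:.2f}, balanced: {r_balanced:.2f}")
```

The balanced model typically trades some precision for substantially higher recall on the minority class.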

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10]
}
# This grid has 3 x 3 x 3 = 27 combinations; n_iter=20 samples 20 of them
search = RandomizedSearchCV(
    pipeline, param_distributions, cv=5, scoring='f1',
    n_iter=20, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV F1:", search.best_score_)
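After fitting, `search.best_estimator_` holds the winning pipeline, already refit on all of X_train, so you can evaluate it on the held-out test set directly. A self-contained miniature of the same pattern on synthetic data (the dataset and the small parameter grid are illustrative so it runs in seconds):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic classification problem (illustrative)
X, y = make_classification(n_samples=400, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', RandomForestClassifier(random_state=42))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={'model__n_estimators': [50, 100],
                         'model__max_depth': [None, 5]},
    n_iter=4, cv=3, scoring='f1', random_state=42)
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, refit on all of X_train
test_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))
print(f"held-out F1: {test_f1:.3f}")
```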

Algorithm Cheat Sheet

| Algorithm | Best For | Scaling Needed? | Handles Nulls? |
|---|---|---|---|
| Logistic Regression | Binary classification, interpretable models | Yes | No |
| Random Forest | Tabular data, feature importance | No | No |
| Gradient Boosting | Best accuracy on tabular data | No | No (XGBoost yes) |
| SVM | High-dimensional data, text classification | Yes | No |
| KNN | Simple problems, small datasets | Yes | No |
| Linear Regression | Continuous prediction, interpretable | Yes | No |

Frequently Asked Questions

What is the best algorithm to start with in scikit-learn?

RandomForestClassifier for classification and RandomForestRegressor for regression are the best starting algorithms for tabular data. They handle mixed feature types, provide feature importance scores, are robust to hyperparameter choices, and rarely overfit badly. Once you have a baseline, try XGBoost or LightGBM for potentially higher accuracy.
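The feature importance scores mentioned above come for free after fitting. A minimal sketch on synthetic data (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with 5 features (illustrative)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# One impurity-based score per feature; the scores sum to 1
print(clf.feature_importances_)
```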

What is data leakage and how do I prevent it?

Data leakage occurs when information from the test set is used during model training, producing inflated performance estimates. The most common form: fitting preprocessing (StandardScaler, imputer) on the entire dataset including test rows. Prevent with scikit-learn Pipelines — fit the full pipeline on X_train only, then transform X_test with the same fitted pipeline.
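The wrong and right versions differ by one line. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# WRONG (leakage): fitting on train + test lets the test row
# influence the scaler's mean and std
# scaler = StandardScaler().fit(np.vstack([X_train, X_test]))

# RIGHT: fit on training data only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # scaled with the *training* mean/std
print(X_test_s)
```

A Pipeline does exactly this bookkeeping for you: `pipeline.fit(X_train, y_train)` fits every step on training data only, and `pipeline.predict(X_test)` reuses those fitted transforms.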

How do I handle imbalanced classes in scikit-learn?

For imbalanced datasets: set class_weight='balanced' in the classifier (adjusts sample weights), use SMOTE from the imbalanced-learn library to oversample the minority class, or use under-sampling of the majority class. Evaluate with F1 score or ROC-AUC rather than accuracy. A model that predicts the majority class 100% of the time will have high accuracy but is useless.
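The "majority class 100% of the time" trap is easy to demonstrate with scikit-learn's DummyClassifier. A sketch on a synthetic 5%-positive label array (illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 5% positives, mirroring the churn example
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((1000, 1))  # features don't matter for this baseline

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = dummy.predict(X)

acc = accuracy_score(y, pred)
# zero_division=0: no positives are predicted, so precision is undefined
f1 = f1_score(y, pred, zero_division=0)
print(acc, f1)  # high accuracy, zero F1
```

The baseline scores 95% accuracy while catching zero churners, which is exactly why F1 or ROC-AUC should drive model selection here.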

What is cross-validation and why does it matter?

Cross-validation evaluates model performance by training and testing on multiple different splits of the data. 5-fold CV splits the data into 5 parts, trains on 4 and tests on 1, rotates 5 times, and averages the scores. This gives a more reliable estimate of real-world performance than a single train/test split, which can be lucky or unlucky depending on which observations land in the test set.
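The fold rotation described above is what KFold produces under the hood. A minimal sketch on ten toy samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for train_idx, test_idx in kf.split(X):
    # each iteration trains on 8 samples and tests on the other 2
    test_folds.append(test_idx)
    print(f"train: {len(train_idx)}, test: {len(test_idx)}")

# every sample lands in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
```

cross_val_score wraps this loop, fitting and scoring the estimator once per fold and returning the five scores.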

Build your first model today with scikit-learn. Get the skills.

Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.

Reserve Your Seat

Note: Information reflects early 2026.


Bo Peng

AI Instructor & Founder, Precision AI Academy

Bo has trained 400+ professionals in applied AI across federal agencies and Fortune 500 companies.