Key Takeaways
- The consistent API: Every scikit-learn estimator uses fit(X_train, y_train) to train and predict(X_test) to make predictions. Learn this pattern once, apply it to every algorithm.
- Pipelines prevent data leakage: Always use Pipeline to chain preprocessing and modeling. This ensures preprocessing is fit only on training data, preventing test set information from leaking into the model.
- Cross-validation for reliable estimates: A single train/test split can be misleading. Use 5-fold cross_val_score for more reliable performance estimates before reporting results.
- Start with Random Forest: RandomForestClassifier and RandomForestRegressor are excellent default starting algorithms. They handle mixed feature types, provide feature importance, and rarely overfit badly.
Scikit-learn is the most important machine learning library in the Python ecosystem. It provides a clean, consistent API for dozens of algorithms — plus tools for preprocessing, model selection, and evaluation.
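The uniform estimator API is worth seeing concretely. A minimal sketch, using synthetic data from `make_classification`: the same `fit` / `predict` / `score` calls work unchanged whether the estimator is a linear model or an ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two very different algorithms, identical interface
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)        # train
    y_pred = model.predict(X_test)     # predict
    print(type(model).__name__, model.score(X_test, y_test))  # accuracy
```

Swapping algorithms means changing one line; the rest of the workflow stays identical.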
The Complete ML Workflow
```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Load and prepare data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churned', axis=1)
y = df['churned']

# 2. Train/test split (stratify keeps the class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Build pipeline (preprocessing + model)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# 6. Cross-validate for a more reliable estimate
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
Evaluation Metrics
Classification: Accuracy (overall correct rate), Precision (of predicted positives, how many are correct), Recall (of actual positives, how many did we find), F1 (harmonic mean of precision and recall), ROC-AUC (overall classifier performance).
Regression: MAE (mean absolute error, interpretable in original units), RMSE (penalizes large errors more), R-squared (fraction of variance explained).
For imbalanced classes (e.g., a 5% churn rate), accuracy is misleading: always predicting "no churn" scores 95%. Use F1 score or ROC-AUC instead. To improve minority-class performance, set class_weight='balanced' in the classifier (which weights each class inversely to its frequency) or oversample the minority class with SMOTE.
Hyperparameter Tuning
```python
from sklearn.model_selection import RandomizedSearchCV

# Prefix each parameter with its pipeline step name ('model__')
param_distributions = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10]
}

search = RandomizedSearchCV(
    pipeline, param_distributions, cv=5, scoring='f1',
    n_iter=20, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV F1:", search.best_score_)
```
Algorithm Cheat Sheet
| Algorithm | Best For | Scaling Needed? | Handles Nulls? |
|---|---|---|---|
| Logistic Regression | Binary classification, interpretable models | Yes | No |
| Random Forest | Tabular data, feature importance | No | No |
| Gradient Boosting | Best accuracy on tabular data | No | No (HistGradientBoosting and XGBoost: yes) |
| SVM | High-dimensional data, text classification | Yes | No |
| KNN | Simple problems, small datasets | Yes | No |
| Linear Regression | Continuous prediction, interpretable | Yes | No |
Frequently Asked Questions
What is the best algorithm to start with in scikit-learn?
RandomForestClassifier for classification and RandomForestRegressor for regression are the best starting algorithms for tabular data. They handle mixed feature types, provide feature importance scores, are robust to hyperparameter choices, and rarely overfit badly. Once you have a baseline, try XGBoost or LightGBM for potentially higher accuracy.
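A short sketch of pulling feature importances from a fitted forest, using a dataset bundled with scikit-learn (the dataset choice is illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ sums to 1; rank features most-to-least important
importances = pd.Series(rf.feature_importances_,
                        index=data.feature_names).sort_values(ascending=False)
print(importances.head(5))
```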
What is data leakage and how do I prevent it?
Data leakage occurs when information from the test set is used during model training, producing inflated performance estimates. The most common form: fitting preprocessing (StandardScaler, imputer) on the entire dataset including test rows. Prevent with scikit-learn Pipelines — fit the full pipeline on X_train only, then transform X_test with the same fitted pipeline.
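The leak is easy to demonstrate. A minimal sketch with random data: a scaler fit on all rows learns different statistics than one fit on the training rows only, which means the leaky version has peeked at the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X_train, X_test = train_test_split(X, random_state=42)

# Leaky: statistics computed from ALL rows, including the test set
leaky = StandardScaler().fit(X)

# Correct: fit on training rows only, reuse the fitted scaler on test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned means differ -> the leaky scaler saw test data
print(np.allclose(leaky.mean_, scaler.mean_))  # False
```

A Pipeline makes the correct version automatic: calling pipeline.fit(X_train, y_train) fits every step on training data only.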
How do I handle imbalanced classes in scikit-learn?
For imbalanced datasets: set class_weight='balanced' in the classifier (which weights each class inversely to its frequency), use SMOTE from the imbalanced-learn library to oversample the minority class, or under-sample the majority class. Evaluate with F1 score or ROC-AUC rather than accuracy. A model that always predicts the majority class will have high accuracy but is useless.
What is cross-validation and why does it matter?
Cross-validation evaluates model performance by training and testing on multiple different splits of the data. 5-fold CV splits the data into 5 parts, trains on 4 and tests on 1, rotates 5 times, and averages the scores. This gives a more reliable estimate of real-world performance than a single train/test split, which can be lucky or unlucky depending on which observations land in the test set.
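As a minimal sketch of 5-fold cross-validation (the iris dataset here is just a convenient bundled example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: five rotations of train-on-4-folds, test-on-1-fold
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores)                        # one accuracy score per fold
print(scores.mean(), scores.std())  # average and spread
```

The spread across folds is as informative as the mean: a large standard deviation warns that a single train/test split could have been lucky or unlucky.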
Build your first model today with scikit-learn. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. October 2026. Seats are limited.
Reserve Your Seat
Note: Information reflects early 2026.