A trained scikit-learn classifier that predicts whether a tumor is malignant or benign, evaluated with accuracy, precision, and recall, plus a reusable training pipeline you can drop into any project.
The Machine Learning Workflow
Every ML project follows the same lifecycle: get data, explore it, preprocess it, train a model, evaluate it, and deploy it. Today we'll complete that full loop on a real dataset.
Install and Load Data
We'll use scikit-learn's built-in datasets so you don't need to download anything. The breast cancer dataset is a real medical classification problem with 30 features.
pip install scikit-learn pandas numpy matplotlib
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(f"Shape: {df.shape}") # (569, 31)
print(f"Target counts:\n{df.target.value_counts()}")
print(df.describe()) # statistics for each column
# Check for missing values
print(f"Missing: {df.isnull().sum().sum()}")  # 0
Preprocess, Train, and Evaluate
Split the data, scale features (ML algorithms work better when all features are on the same scale), then train a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Features and target
X = df.drop('target', axis=1)
y = df['target']
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit AND transform train
X_test = scaler.transform(X_test) # ONLY transform test
# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 0.96 0.95 0.96 42
# 1 0.97 0.98 0.97 72
#     accuracy                          0.96       114
Critical rule: Only call fit_transform() on training data. On test data, call transform() only, using the scaler fitted on train. If you fit the scaler on test data, you're leaking information.
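One way to make this rule hard to break is to wrap the scaler and model in a scikit-learn Pipeline, so cross-validation and prediction always refit the scaler on training data only. This is a sketch that goes slightly beyond the lesson, not part of the original workflow:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# The Pipeline refits the scaler inside each CV fold, so the held-out
# fold never influences the scaling parameters: no leakage by construction.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Calling pipe.predict() later also applies the stored scaling automatically, so there is no separate transform step to forget.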
Save and Load the Model
A trained model is useless if you have to retrain every time. Save it with joblib so you can load it in your API or app.
import joblib
# Save model and scaler together
pipeline = {'model': model, 'scaler': scaler, 'features': list(X.columns)}
joblib.dump(pipeline, 'model_pipeline.pkl')
print("Model saved to model_pipeline.pkl")
# Load and predict in a different script
pipeline = joblib.load('model_pipeline.pkl')
model = pipeline['model']
scaler = pipeline['scaler']
# Predict a single sample (in a new script, scale raw input with scaler.transform first)
sample = X_test[0:1]  # already scaled here
pred = model.predict(sample)
prob = model.predict_proba(sample)
print(f"Prediction: {pred[0]} (confidence: {prob[0].max():.1%})")
What You Learned Today
- Understood the full ML workflow from data loading to evaluation
- Trained a Random Forest classifier achieving 96% accuracy
- Applied StandardScaler correctly — fit on train, transform only on test
- Saved and loaded a model pipeline with joblib
Go Further on Your Own
- Try replacing RandomForestClassifier with LogisticRegression and compare accuracy
- Add feature importance plotting: model.feature_importances_ shows which columns matter most
- Wrap the prediction in a FastAPI endpoint that accepts JSON and returns a prediction
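As a starting point for the first two suggestions, here is a minimal sketch (variable names are illustrative) that swaps in LogisticRegression on the same split and lists the forest's most important features:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train both models on the identical split so the comparison is fair
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train_s, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print(f"Random Forest:       {rf.score(X_test_s, y_test):.3f}")
print(f"Logistic Regression: {lr.score(X_test_s, y_test):.3f}")

# Top 5 features by importance according to the forest
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))
```

From here, plotting is one line: importances.sort_values().plot.barh() with matplotlib installed.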
Nice work. Keep going.
Day 2 is ready when you are.