ML Fundamentals · Day 1 of 5 · ~40 minutes

Day 1: Your First ML Model with scikit-learn

Understand the ML workflow end-to-end and train your first classification model — from raw data to predictions.

What You'll Build

A trained scikit-learn classifier that predicts whether a customer will churn, evaluated with accuracy, precision, and recall — plus a reusable training pipeline you can drop into any project.

Section 1 · 8 min

The Machine Learning Workflow

Every ML project follows the same lifecycle: get data, explore it, preprocess it, train a model, evaluate it, and deploy it. Today we'll complete that full loop on a real dataset.

ML Workflow Steps
1. Load & Explore
Understand what data you have, its shape, types, missing values
2. Preprocess
Clean nulls, encode categories, scale numbers
3. Train/Test Split
Hold out 20% of data to evaluate on — never train on test data
4. Train Model
Fit model on training data
5. Evaluate
Check accuracy, precision, recall, F1 on test data
6. Iterate
Tune hyperparameters, try different algorithms
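These six steps compress into a handful of scikit-learn calls. Here's a minimal sketch of the whole loop on the built-in iris toy dataset (not the dataset we use below) — the model choice here is just illustrative:

```python
# Minimal end-to-end sketch of the six workflow steps,
# using the iris toy dataset so it runs with no downloads.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load & explore
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4)

# 2-3. Preprocess, then hold out 20% for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 5. Evaluate on the held-out data only
print(f"accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```

Step 6 (iterate) is just running this loop again with different models or settings — which is exactly what we'll do with a real dataset next.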
Section 2 · 10 min

Install and Load Data

We'll use scikit-learn's built-in datasets so you don't need to download anything. The breast cancer dataset is a real medical classification problem with 30 features.

terminal
pip install scikit-learn pandas numpy matplotlib
train.py
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print(f"Shape: {df.shape}")            # (569, 31)
print(f"Target counts:\n{df.target.value_counts()}")
print(df.describe())                    # statistics for each column

# Check for missing values
print(f"Missing: {df.isnull().sum().sum()}")  # 0
Section 3 · 12 min

Preprocess, Train, and Evaluate

Split the data, scale the features, then train a Random Forest classifier. (Strictly speaking, tree-based models like Random Forests are scale-invariant, but many algorithms — anything distance-based or gradient-based — perform much better with scaled features, so scaling here keeps the pipeline reusable when you swap models later.)

train.py (continued)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit AND transform train
X_test = scaler.transform(X_test)          # ONLY transform test

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# precision  recall  f1-score   support
#        0       0.96      0.95      0.96        42
#        1       0.97      0.98      0.97        72
# accuracy                           0.96       114

Critical rule: Only call fit_transform() on training data. On test data, call transform() only — using the scaler fitted on train. If you fit on test data, you're leaking information.
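One way to make this rule hard to break is scikit-learn's Pipeline: calling fit on the pipeline fits the scaler on training data only, and predict applies transform automatically. A self-contained sketch (not part of train.py, but using the same dataset and settings):

```python
# A Pipeline applies the same preprocessing at fit and predict time,
# so the scaler can never be fitted on test data by accident.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),   # fitted on X_train only, inside fit()
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)         # scaler sees training data only
print(f"test accuracy: {pipe.score(X_test, y_test):.2f}")
```

A Pipeline is also a single object, so saving and loading it (next section) carries the scaler and model together automatically.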

Section 4 · 10 min

Save and Load the Model

A trained model is useless if you have to retrain every time. Save it with joblib so you can load it in your API or app.

train.py (continued)
import joblib

# Save model and scaler together
pipeline = {'model': model, 'scaler': scaler, 'features': list(X.columns)}
joblib.dump(pipeline, 'model_pipeline.pkl')
print("Model saved to model_pipeline.pkl")

# Load and predict in a different script
pipeline = joblib.load('model_pipeline.pkl')
model = pipeline['model']
scaler = pipeline['scaler']

# Predict a single sample
sample = X_test[0:1]  # already scaled
pred = model.predict(sample)
prob = model.predict_proba(sample)
print(f"Prediction: {pred[0]} (confidence: {prob[0].max():.1%})")
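Note that the sample above was already scaled; in a real serving script you'll receive raw feature values, so the saved scaler must be applied before predicting. A self-contained sketch of that path (it trains and saves a small stand-in pipeline so it runs on its own; names mirror train.py):

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Train and save a minimal pipeline (stand-in for train.py's output)
data = load_breast_cancer()
scaler = StandardScaler().fit(data.data)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    scaler.transform(data.data), data.target
)
joblib.dump({'model': model, 'scaler': scaler,
             'features': list(data.feature_names)}, 'model_pipeline.pkl')

# Later, in the serving script: raw values in, prediction out
pipeline = joblib.load('model_pipeline.pkl')
raw = data.data[0].reshape(1, -1)           # one raw (unscaled) sample
scaled = pipeline['scaler'].transform(raw)  # scale BEFORE predicting
pred = pipeline['model'].predict(scaled)
print(f"Prediction: {pred[0]}")
```

The saved 'features' list is there so the serving code can check that incoming data has the right columns in the right order before scaling.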

What You Learned Today

  • Understood the full ML workflow from data loading to evaluation
  • Trained a Random Forest classifier achieving 96% accuracy
  • Applied StandardScaler correctly — fit on train, transform only on test
  • Saved and loaded a model pipeline with joblib
Your Challenge

Go Further on Your Own

  • Try replacing RandomForestClassifier with LogisticRegression and compare accuracy
  • Add feature importance plotting: model.feature_importances_ shows which columns matter most
  • Wrap the prediction in a FastAPI endpoint that accepts JSON and returns a prediction
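As a starting point for the first challenge item, here's a sketch that trains both models on the same split and prints their test accuracy (same dataset and split settings as train.py):

```python
# Compare RandomForestClassifier vs LogisticRegression on one split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

scores = {}
for clf in (RandomForestClassifier(n_estimators=100, random_state=42),
            LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)
    print(f"{type(clf).__name__}: {scores[type(clf).__name__]:.3f}")
```

Both models score well on this dataset; the interesting part is comparing precision and recall per class with classification_report, not just accuracy.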
Day 1 Complete

Nice work. Keep going.

Day 2 is ready when you are.

Continue to Day 2