ML Fundamentals · Day 4 of 5 · ~40 minutes

Day 4: Model Evaluation and Interpretation

Go beyond accuracy — understand what your model actually learned, find where it fails, and explain predictions to non-technical stakeholders.

What You'll Build

A full model evaluation report with confusion matrix, ROC curve, precision-recall curve, feature importance chart, and SHAP values — the kind of analysis that gets models approved in enterprise settings.

Section 1 · 10 min

Confusion Matrix and Classification Metrics

Accuracy is often misleading. If 95% of transactions are legit, a model that always predicts "legit" has 95% accuracy — but catches zero fraud. Use precision, recall, and F1.
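The trap is easy to reproduce. A tiny synthetic sketch (random labels, not the lesson's model) shows a do-nothing classifier scoring ~95% accuracy while catching zero fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # ~5% of transactions are fraud
y_always_legit = np.zeros_like(y_true)           # model that always predicts "legit"

# High accuracy, zero fraud caught
print(f"accuracy: {accuracy_score(y_true, y_always_legit):.2f}")   # ~0.95
print(f"recall:   {recall_score(y_true, y_always_legit):.2f}")     # 0.00
```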

evaluate.py
from sklearn.metrics import (
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN  FP]
#  [FN  TP]]

# Plot it
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Benign', 'Malignant'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png', dpi=150)

# Full report
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
Section 2 · 10 min

ROC Curve and Precision-Recall Curve

ROC-AUC measures how well the model separates classes across all possible decision thresholds. PR-AUC is better for imbalanced datasets. Plot both to understand the tradeoffs.

evaluate.py (continued)
from sklearn.metrics import roc_curve, precision_recall_curve, auc

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], '--', color='gray', label='Random')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve')
ax1.legend()

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
ax2.plot(recall, precision, label=f'PR-AUC = {pr_auc:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve')
ax2.legend()

plt.tight_layout()
plt.savefig('curves.png', dpi=150)
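To see why PR-AUC is the more honest lens under imbalance, here is a quick synthetic check (the dataset and logistic model are illustrative, not the lesson's classifier). ROC-AUC stays flattering while average precision is dragged down by the rare positive class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~5% positives, with some label noise
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           flip_y=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# ROC-AUC looks strong; PR-AUC (average precision) tells a humbler story
print(f"ROC-AUC: {roc_auc_score(y_te, prob):.3f}")
print(f"PR-AUC:  {average_precision_score(y_te, prob):.3f}")
```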
Section 3 · 10 min

Feature Importance and SHAP Values

Feature importance tells you which columns drive predictions. SHAP (SHapley Additive exPlanations) goes further — it explains individual predictions, which is critical for medical, legal, and financial models.

terminal
pip install shap

explain.py
import shap
import pandas as pd
import matplotlib.pyplot as plt

# Built-in feature importance (Random Forest)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.savefig('feature_importance.png')

# SHAP values (exact, fast explanations for tree models via TreeExplainer)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for binary classifiers, older shap versions return a list of two
# arrays (one per class), while newer versions may return a single 3-D
# array. The [1] indexing below assumes the list-style output.

# Summary plot: global feature impact
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)

# Explain one prediction
shap.force_plot(explainer.expected_value[1],
                shap_values[1][0],
                X_test[0],
                feature_names=feature_names,
                matplotlib=True)

SHAP for trust: When a doctor asks "why did your model say cancer?", you need SHAP. A list of the features that pushed the prediction up or down gives actionable, defensible explanations.
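A useful aside for intuition: for a linear model, SHAP values (under a feature-independence assumption) have a closed form, phi_i = coef_i * (x_i - mean_i), so you can verify the "additive" part by hand without the shap library. The logistic model below is our own illustration, not the lesson's random forest:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
clf = LogisticRegression(max_iter=5000).fit(X, data.target)

# Per-feature log-odds contributions for the first sample
phi = clf.coef_[0] * (X[0] - X.mean(axis=0))

# Additivity check: contributions sum to this sample's logit minus the average logit
logit = clf.decision_function(X[[0]])[0]
base = clf.decision_function(X).mean()
print(np.allclose(phi.sum(), logit - base))   # True

# Three biggest contributors, signed
for i in np.argsort(np.abs(phi))[::-1][:3]:
    print(f"{data.feature_names[i]}: {phi[i]:+.3f}")
```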

Section 4 · 10 min

Threshold Tuning

By default, classifiers turn probabilities into labels at a 0.5 threshold, but that cutoff is yours to choose. In fraud detection you might lower it, trading precision for recall so that more fraud gets flagged; in medical screening you likewise push for high recall, because a false alarm is far cheaper than a missed case.

threshold.py
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        'precision': precision_score(y_test, y_pred_t),
        'recall': recall_score(y_test, y_pred_t),
        'f1': f1_score(y_test, y_pred_t)
    })

print(pd.DataFrame(results).to_string())

# For a cancer model, choose threshold that maximizes recall
# (better to have false alarms than miss a real cancer)
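A common production variant of this sweep is "maximize recall subject to a precision floor". A minimal self-contained sketch (the synthetic scores and the PRECISION_FLOOR name are ours, not part of the lesson's pipeline):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
# Fake, loosely informative probabilities -- for illustration only
y_prob = np.clip(y_true * 0.5 + rng.random(500) * 0.6, 0, 1)

PRECISION_FLOOR = 0.80   # business constraint: at most 1 in 5 alerts is wrong
best = None
for t in np.arange(0.05, 0.95, 0.05):
    pred = (y_prob >= t).astype(int)
    p = precision_score(y_true, pred, zero_division=0)
    r = recall_score(y_true, pred)
    if p >= PRECISION_FLOOR and (best is None or r > best[1]):
        best = (round(float(t), 2), r)

print(f"threshold with the highest recall above the precision floor: {best}")
```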

What You Learned Today

  • Built a complete evaluation suite: confusion matrix, ROC curve, and precision-recall curve
  • Computed SHAP values to explain individual predictions and overall feature impact
  • Tuned the decision threshold to optimize for precision vs recall based on business requirements
  • Understood when to prefer PR-AUC over ROC-AUC (heavily imbalanced datasets, where ROC-AUC can look deceptively good)
Your Challenge

Go Further on Your Own

  • Add calibration curves to check if your model's probability estimates are reliable
  • Try LIME (pip install lime) as an alternative to SHAP for explaining predictions
  • Build a Streamlit dashboard that shows all evaluation charts interactively
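As a starting point for the first challenge, a calibration-curve check might look like this (a sketch assuming the breast-cancer data and a random forest, as used elsewhere in the course):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for scripts
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Bucket predictions and compare predicted probability to observed frequency
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", color="gray", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.savefig("calibration_curve.png", dpi=150)
```

Points hugging the diagonal mean the model's probabilities can be trusted as probabilities, not just as rankings.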
Day 4 Complete

Nice work. Keep going.

Day 5 is ready when you are.

Continue to Day 5

