A full model evaluation report with confusion matrix, ROC curve, precision-recall curve, feature importance chart, and SHAP values — the kind of analysis that gets models approved in enterprise settings.
Confusion Matrix and Classification Metrics
Accuracy is often misleading. If 95% of transactions are legit, a model that always predicts "legit" has 95% accuracy — but catches zero fraud. Use precision, recall, and F1.
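To see the trap concretely, here is a small self-contained sketch (synthetic labels, not the lesson's dataset) where a classifier that always predicts "legit" scores ~95% accuracy yet zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: roughly 5% positives (fraud)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)
y_pred = np.zeros_like(y_true)  # always predict "legit"

print(accuracy_score(y_true, y_pred))  # high, despite catching nothing
print(recall_score(y_true, y_pred))    # 0.0: no fraud caught at all
```

Accuracy rewards the majority-class guess; recall exposes it immediately, which is why the metrics below matter.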
from sklearn.metrics import (
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt
import pandas as pd  # used below for importances and the threshold table
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN FP]
# [FN TP]]
# Plot it
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Benign', 'Malignant'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png', dpi=150)
# Full report
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
ROC Curve and Precision-Recall Curve
ROC-AUC measures how well the model separates classes across all possible decision thresholds. PR-AUC is better for imbalanced datasets. Plot both to understand the tradeoffs.
from sklearn.metrics import roc_curve, precision_recall_curve, auc
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], '--', color='gray', label='Random')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve')
ax1.legend()
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
ax2.plot(recall, precision, label=f'PR-AUC = {pr_auc:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve')
ax2.legend()
plt.tight_layout()
plt.savefig('curves.png', dpi=150)
Feature Importance and SHAP Values
Feature importance tells you which columns drive predictions. SHAP (SHapley Additive exPlanations) goes further — it explains individual predictions, which is critical for medical, legal, and financial models.
pip install shap
import shap
# Built-in feature importance (Random Forest)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.savefig('feature_importance.png')
# SHAP values (model-agnostic explanation)
explainer = shap.TreeExplainer(model)
# For a binary classifier, older SHAP versions return a list of two arrays
# (one per class); [1] selects the positive class. Newer versions may return
# a single array, so check the shape of shap_values first.
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature impact
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)
# Explain one prediction
# If X_test is a DataFrame, use X_test.iloc[0] instead of X_test[0]
shap.force_plot(explainer.expected_value[1],
                shap_values[1][0],
                X_test[0],
                feature_names=feature_names,
                matplotlib=True)
SHAP for trust: When a doctor asks "why did your model say cancer?", you need SHAP. A list of the features that pushed the prediction up or down gives actionable, defensible explanations.
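Turning SHAP output into such a list is a one-liner with pandas. A minimal self-contained sketch, using hypothetical feature names and made-up SHAP values standing in for one row of `shap_values[1]`:

```python
import numpy as np
import pandas as pd

# Hypothetical per-feature SHAP values for a single prediction; in the lesson
# this row would come from shap_values[1][0] computed above.
feature_names = ['radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean']
row_shap = np.array([0.12, -0.05, 0.30, -0.18])

contrib = pd.Series(row_shap, index=feature_names)
# Rank features by absolute impact, keeping the sign (pushes up vs pushes down)
top = contrib.reindex(contrib.abs().sort_values(ascending=False).index)
print(top)
```

Positive values pushed the prediction toward "malignant", negative toward "benign"; reading the ranked list top-down gives the explanation a clinician can check.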
Threshold Tuning
By default, classifiers use a 0.5 probability threshold, but you can adjust it. In fraud detection you might accept lower precision to catch more fraud (lower the threshold); in medical screening, high recall matters most.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        'precision': precision_score(y_test, y_pred_t),
        'recall': recall_score(y_test, y_pred_t),
        'f1': f1_score(y_test, y_pred_t)
    })
print(pd.DataFrame(results).to_string())
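Beyond eyeballing the printed table, you can pick a threshold programmatically, e.g. the most precise one that still clears a recall floor. A self-contained sketch (the synthetic `y_test`/`y_prob` below are stand-ins for the lesson's model outputs):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-ins so the sketch runs on its own
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
y_prob = np.clip(y_test * 0.6 + rng.random(500) * 0.5, 0, 1)

rows = []
for t in np.arange(0.1, 0.9, 0.05):
    y_pred_t = (y_prob >= t).astype(int)
    rows.append({'threshold': round(t, 2),
                 'precision': precision_score(y_test, y_pred_t, zero_division=0),
                 'recall': recall_score(y_test, y_pred_t, zero_division=0)})
df = pd.DataFrame(rows)

# Highest-precision threshold that still achieves at least 95% recall
ok = df[df['recall'] >= 0.95]
best = ok.loc[ok['precision'].idxmax()] if not ok.empty else df.iloc[0]
print(best)
```

The recall floor (0.95 here) is a business decision, not a statistics one; set it with the people who own the cost of a miss.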
# For a cancer model, choose threshold that maximizes recall
# (better to have false alarms than miss a real cancer)
What You Learned Today
- Built a complete evaluation suite: confusion matrix, ROC curve, and precision-recall curve
- Computed SHAP values to explain individual predictions and overall feature impact
- Tuned the decision threshold to optimize for precision vs recall based on business requirements
- Understood when ROC-AUC vs PR-AUC is the right metric (PR-AUC for imbalanced datasets, ROC-AUC for balanced ones)
Go Further on Your Own
- Add calibration curves to check if your model's probability estimates are reliable
- Try LIME (pip install lime) as an alternative to SHAP for explaining predictions
- Build a Streamlit dashboard that shows all evaluation charts interactively
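To get started on the first suggestion, here is a minimal calibration-curve sketch (synthetic, perfectly calibrated probabilities stand in for your model's `y_prob`):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic stand-ins: labels drawn so the probabilities are calibrated by design
rng = np.random.default_rng(0)
y_prob = rng.random(1000)
y_test = (rng.random(1000) < y_prob).astype(int)

# Bin predictions and compare predicted probability to observed frequency
frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(mean_pred, frac_pos, marker='o', label='Model')
plt.plot([0, 1], [0, 1], '--', color='gray', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.savefig('calibration_curve.png', dpi=150)
```

A well-calibrated model hugs the diagonal; a curve that sags below it means the model is overconfident, which matters whenever you act on the probability itself rather than the label.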
Nice work. Keep going.
Day 5 is ready when you are.
Continue to Day 5