A full model evaluation report with confusion matrix, ROC curve, precision-recall curve, feature importance chart, and SHAP values — the kind of analysis that gets models approved in enterprise settings.
Confusion Matrix and Classification Metrics
Accuracy is often misleading. If 95% of transactions are legit, a model that always predicts "legit" has 95% accuracy — but catches zero fraud. Use precision, recall, and F1.
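To see the trap concretely, here is a small self-contained sketch (synthetic labels, not the lesson's dataset) where a classifier that always predicts "legit" scores ~95% accuracy yet zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: roughly 5% positives (fraud)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)
y_pred = np.zeros_like(y_true)  # always predict "legit"

print(accuracy_score(y_true, y_pred))  # high, despite catching nothing
print(recall_score(y_true, y_pred))    # 0.0: no fraud caught at all
```

Accuracy rewards the majority-class guess; recall exposes it immediately, which is why the metrics below matter.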
from sklearn.metrics import (
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt
import pandas as pd  # used below for importances and the threshold table
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN FP]
# [FN TP]]
# Plot it
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Benign', 'Malignant'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png', dpi=150)
# Full report
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
ROC Curve and Precision-Recall Curve
ROC-AUC measures how well the model separates classes across all possible decision thresholds. PR-AUC is better for imbalanced datasets. Plot both to understand the tradeoffs.
from sklearn.metrics import roc_curve, precision_recall_curve, auc
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
ax1.plot(fpr, tpr, label=f'AUC = {roc_auc:.3f}')
ax1.plot([0, 1], [0, 1], '--', color='gray', label='Random')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve')
ax1.legend()
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)
ax2.plot(recall, precision, label=f'PR-AUC = {pr_auc:.3f}')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curve')
ax2.legend()
plt.tight_layout()
plt.savefig('curves.png', dpi=150)
Feature Importance and SHAP Values
Feature importance tells you which columns drive predictions. SHAP (SHapley Additive exPlanations) goes further — it explains individual predictions, which is critical for medical, legal, and financial models.
pip install shap
import shap
# Built-in feature importance (Random Forest)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.savefig('feature_importance.png')
# SHAP values (model-agnostic explanation)
explainer = shap.TreeExplainer(model)
# For a binary classifier, older SHAP versions return a list of two arrays
# (one per class); [1] selects the positive class. Newer versions may return
# a single array, so check the shape of shap_values first.
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature impact
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)
# Explain one prediction
# If X_test is a DataFrame, use X_test.iloc[0] instead of X_test[0]
shap.force_plot(explainer.expected_value[1],
                shap_values[1][0],
                X_test[0],
                feature_names=feature_names,
                matplotlib=True)
SHAP for trust: When a doctor asks "why did your model say cancer?", you need SHAP. A list of the features that pushed the prediction up or down gives actionable, defensible explanations.
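Turning SHAP output into such a list is a one-liner with pandas. A minimal self-contained sketch, using hypothetical feature names and made-up SHAP values standing in for one row of `shap_values[1]`:

```python
import numpy as np
import pandas as pd

# Hypothetical per-feature SHAP values for a single prediction; in the lesson
# this row would come from shap_values[1][0] computed above.
feature_names = ['radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean']
row_shap = np.array([0.12, -0.05, 0.30, -0.18])

contrib = pd.Series(row_shap, index=feature_names)
# Rank features by absolute impact, keeping the sign (pushes up vs pushes down)
top = contrib.reindex(contrib.abs().sort_values(ascending=False).index)
print(top)
```

Positive values pushed the prediction toward "malignant", negative toward "benign"; reading the ranked list top-down gives the explanation a clinician can check.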
Threshold Tuning
By default, classifiers use a 0.5 probability threshold, but you can adjust it. In fraud detection you might accept lower precision to catch more fraud (lower the threshold); in medical screening, high recall matters most.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    results.append({
        'threshold': round(t, 2),
        'precision': precision_score(y_test, y_pred_t),
        'recall': recall_score(y_test, y_pred_t),
        'f1': f1_score(y_test, y_pred_t)
    })
print(pd.DataFrame(results).to_string())
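Beyond eyeballing the printed table, you can pick a threshold programmatically, e.g. the most precise one that still clears a recall floor. A self-contained sketch (the synthetic `y_test`/`y_prob` below are stand-ins for the lesson's model outputs):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-ins so the sketch runs on its own
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
y_prob = np.clip(y_test * 0.6 + rng.random(500) * 0.5, 0, 1)

rows = []
for t in np.arange(0.1, 0.9, 0.05):
    y_pred_t = (y_prob >= t).astype(int)
    rows.append({'threshold': round(t, 2),
                 'precision': precision_score(y_test, y_pred_t, zero_division=0),
                 'recall': recall_score(y_test, y_pred_t, zero_division=0)})
df = pd.DataFrame(rows)

# Highest-precision threshold that still achieves at least 95% recall
ok = df[df['recall'] >= 0.95]
best = ok.loc[ok['precision'].idxmax()] if not ok.empty else df.iloc[0]
print(best)
```

The recall floor (0.95 here) is a business decision, not a statistics one; set it with the people who own the cost of a miss.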
# For a cancer model, choose threshold that maximizes recall
# (better to have false alarms than miss a real cancer)
What You Learned Today
- Built a complete evaluation suite: confusion matrix, ROC curve, and precision-recall curve
- Computed SHAP values to explain individual predictions and overall feature impact
- Tuned the decision threshold to optimize for precision vs recall based on business requirements
- Understood when ROC-AUC vs PR-AUC is the right metric (PR-AUC for imbalanced datasets, ROC-AUC for balanced ones)
Go Further on Your Own
- Add calibration curves to check if your model's probability estimates are reliable
- Try LIME (pip install lime) as an alternative to SHAP for explaining predictions
- Build a Streamlit dashboard that shows all evaluation charts interactively
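To get started on the first suggestion, here is a minimal calibration-curve sketch (synthetic, perfectly calibrated probabilities stand in for your model's `y_prob`):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic stand-ins: labels drawn so the probabilities are calibrated by design
rng = np.random.default_rng(0)
y_prob = rng.random(1000)
y_test = (rng.random(1000) < y_prob).astype(int)

# Bin predictions and compare predicted probability to observed frequency
frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(mean_pred, frac_pos, marker='o', label='Model')
plt.plot([0, 1], [0, 1], '--', color='gray', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.savefig('calibration_curve.png', dpi=150)
```

A well-calibrated model hugs the diagonal; a curve that sags below it means the model is overconfident, which matters whenever you act on the probability itself rather than the label.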
Nice work. Keep going.
Day 5 is ready when you are.
Continue to Day 5