Explainable Credit Default Prediction with Tree Ensembles: A Leakage-Safe Pipeline on the Credit Risk
Students & Supervisors
Student Authors
Supervisors
Abstract
This study designs an explainable, leakage-safe machine learning pipeline for predicting consumer credit defaults on the Kaggle Credit Risk Dataset. It addresses three practical problems in retail credit management: reconciling predictive accuracy with model interpretability, eliminating data leakage, and providing explanations of model behavior that regulators can review. After thorough data-quality and plausibility checks, economically meaningful features were engineered, and all preprocessing steps were wrapped in a ColumnTransformer pipeline fitted on the training data only. Regularized logistic regression was compared with tree-based ensembles (Random Forest, HistGradientBoosting, XGBoost, and LightGBM), with models tuned by stratified cross-validation on ROC-AUC and by decision-threshold optimization. The best gradient-boosting models achieved test AUCs of about 0.95, with consistent confusion-matrix profiles and learning curves, indicating strong generalization. Permutation importance and SHAP analyses both identified indebtedness, income, loan terms, employment stability, and home ownership as the dominant risk factors, demonstrating that leakage-safe tree ensembles can achieve both high predictive accuracy and economically plausible explanations.
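The train-only preprocessing principle mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only the Python standard library rather than a ColumnTransformer, and the class name `TrainOnlyScaler` and the income values are hypothetical. It shows the one property that makes preprocessing leakage-safe: imputation and scaling statistics are learned from the training split and then reused, unchanged, on held-out rows.

```python
from statistics import mean, median, pstdev


class TrainOnlyScaler:
    """Median imputation plus standardization (hypothetical helper).

    All statistics are learned in fit() and only reused in transform(),
    so held-out rows never influence the preprocessing parameters.
    """

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.median_ = median(observed)
        filled = [self.median_ if v is None else v for v in values]
        self.mean_ = mean(filled)
        self.std_ = pstdev(filled) or 1.0  # fall back to 1.0 for a constant column
        return self

    def transform(self, values):
        filled = [self.median_ if v is None else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]


# Hypothetical income column; None marks a missing value.
train_income = [30000, 45000, None, 60000, 52000]
test_income = [None, 250000]

scaler = TrainOnlyScaler().fit(train_income)  # statistics from the training split only
train_scaled = scaler.transform(train_income)
test_scaled = scaler.transform(test_income)   # test rows reuse the training statistics
```

Fitting the scaler on the combined training and test rows would let the extreme test income shift the mean and standard deviation, which is the kind of leakage a train-only pipeline is meant to prevent.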
Keywords
Publication Details
- Type of Publication:
- Conference Name: International Conference on Business Innovation for Inclusive Development (ICBIID 2026)
- Date of Conference: 24/01/2026 - 24/01/2026
- Venue: International Islamic University Chittagong, Kumira, Chittagong, Bangladesh
- Organizer: International Islamic University Chittagong