Credit Card Defaults Part 1: Classification

Classification with Imbalanced Target

Summary

The goal of this project was to build binary classification models that predict whether credit card account holders will default on their payments in the next month. The models address the imbalance in the target variable. Gradient Boosting and neural network models are highlighted. The paper and presentation walk through data understanding and preparation, the models tested, methodology, evaluation, and anticipated follow-up steps.

Tools

  • Scikit-learn
  • Keras
  • Seaborn
  • Matplotlib
  • Numpy
  • Pandas
  • Scipy

Data

UCI Machine Learning Repository

Models / Methods / Metrics

  • Gradient Boosting Classification
  • Artificial Neural Network
  • Random Forest
  • Logistic Regression / LASSO Logistic Regression
  • Receiver Operating Characteristic curve and Youden’s J statistic
  • Feature Selection:
    • Principal Component Analysis
    • ANOVA and Feature Importance Models
  • Log-Transformation and Scaling
  • GridSearch
  • Recall, Log-Loss and Binary Crossentropy Loss
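Several of the items above (log-transformation, scaling, GridSearch, and Recall as the selection metric) can be chained in one scikit-learn pipeline. The following is only a minimal sketch on synthetic data: the feature matrix, the target rule, the choice of Logistic Regression, and the grid of `C` values are all assumptions for illustration, not the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, right-skewed, non-negative features (hypothetical stand-ins
# for skewed monetary amounts); the minority class is roughly a quarter of rows.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(400, 5))
signal = np.log(X[:, 0]) + 0.5 * rng.normal(size=400)
y = (signal > 0.8).astype(int)

pipeline = Pipeline([
    ("log", FunctionTransformer(np.log1p)),      # log(1 + x) tames the skew
    ("scale", StandardScaler()),                 # zero mean, unit variance
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate regularization strengths (assumed values, for illustration only).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Select the hyperparameter by cross-validated Recall on the positive class.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=3)
search.fit(X, y)

print("best C:", search.best_params_["clf__C"])
print("cross-validated recall:", round(search.best_score_, 3))
```

Putting the transformations inside the pipeline means they are refit on each training fold, so the cross-validated score is not contaminated by statistics from the held-out fold.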

Results

Among the Gradient Boosting, Logistic Regression and Random Forest models, the Gradient Boosting Classification model had the best Recall and Log Loss scores: it correctly identified 62.43% of the actual default accounts (a Recall of .6243), with a Log Loss of .4545. The Artificial Neural Network had a Recall of .6989 and a binary crossentropy loss of .5958. These scores were obtained after addressing the imbalance in the target variable.
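To make the two metrics concrete, here is a small sketch that computes Recall and Log Loss by hand. The labels and probabilities are made-up toy values, not project data. Log Loss and binary crossentropy are the same formula; the two names come from scikit-learn and Keras respectively.

```python
import math

def recall(y_true, y_pred):
    """Share of actual positives (defaults) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary crossentropy)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Toy example: 3 defaults (1) and 5 non-defaults (0), with made-up probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.2, 0.4, 0.1, 0.3, 0.7, 0.05]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # default 0.5 cut-off

toy_recall = recall(y_true, y_pred)
toy_log_loss = log_loss(y_true, y_prob)
print("recall:", round(toy_recall, 4))
print("log loss:", round(toy_log_loss, 4))
```

Recall depends only on the hard labels, so it changes with the classification threshold; Log Loss depends only on the predicted probabilities, so it does not.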

Project Preview

Exploratory Data Analysis

The EDA shows that the default records and the non-default records are distributed differently.

[Figure: ECD]

[Figure: PAY1]

[Figure: MEAN]

Principal Component Analysis

PCA was implemented because of multicollinearity between groups of input variables.

[Figure: PCA]
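As an illustration of why PCA helps with multicollinearity, the sketch below builds a synthetic group of strongly correlated columns, standardizes them, and projects them onto principal components. The data, the 95% variance cut-off, and all variable names are assumptions for this example rather than the project's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Six columns that all track one shared factor (a hypothetical stand-in for
# a group of collinear inputs), plus two independent columns.
base = rng.normal(size=(n, 1))
correlated = base + 0.1 * rng.normal(size=(n, 6))
independent = rng.normal(size=(n, 2))
X = np.hstack([correlated, independent])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that share of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# The resulting components are mutually uncorrelated.
corr = np.corrcoef(X_pca, rowvar=False)
off_diagonal = corr - np.diag(np.diag(corr))

print("original columns:", X.shape[1])
print("components kept:", pca.n_components_)
print("largest off-diagonal correlation:", np.abs(off_diagonal).max())
```

The correlated group collapses into far fewer components, and the components fed to the downstream model no longer carry the redundancy of the original inputs.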

Modeling

The imbalanced target variable was addressed by converting the predicted probabilities of the positive outcome into class labels at the best classification threshold and, for the Artificial Neural Network, by weighting the binary target classes.
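Both ideas can be sketched briefly. The example below assumes the best threshold is the one that maximizes Youden's J statistic (J = TPR - FPR) on the ROC curve of a validation split, and it assumes scikit-learn's "balanced" heuristic for the class weights. The synthetic data, the split, and the model settings are illustrative only and are not taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (hypothetical): about 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
prob_val = model.predict_proba(X_val)[:, 1]  # probability of the positive class

# Youden's J at every candidate threshold; pick the threshold that maximizes it.
fpr, tpr, thresholds = roc_curve(y_val, prob_val)
youden_j = tpr - fpr
best_idx = int(np.argmax(youden_j))
best_threshold = thresholds[best_idx]

recall_default = recall_score(y_val, (prob_val >= 0.5).astype(int))
recall_tuned = recall_score(y_val, (prob_val >= best_threshold).astype(int))
print("best threshold:", round(float(best_threshold), 3))
print("recall at 0.5:", round(recall_default, 3))
print("recall at best threshold:", round(recall_tuned, 3))

# Class weights: "balanced" gives n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print("class weights:", class_weight)
# A dict like this can be passed to a Keras model through the
# class_weight argument of Model.fit.
```

Choosing the threshold on a validation split rather than on the final test set keeps the reported test scores from being tuned to the data they are measured on.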

Evaluation

Gradient Boosting Classification, Logistic Regression and Random Forest Models:

[Figure: RESULTS1]

Artificial Neural Networks:

[Figure: ANNRESULTS]

The Complete Project: here.
