Fraud Prediction in R: Credit Card Transactions

Unsupervised Learning for Predicting Credit Card Fraud in R

Summary

The goal of this project was to use R to design unsupervised predictive binary classification models to predict whether credit card transactions are fraudulent transactions. One Class Support Vector Machine and k Means clustering models are highlighted. The paper and presentation walk through the data understanding and preparation, different models tested, methodology, evaluation and anticipated follow-up steps to the project.

R Libraries

e1071

NLP

dbscan

fpc

MLmetrics

NBClust

factoextra

corrplot

dplyr

ggplot2

caret

Data

data

Models / Methods / Metrics

One Class Support Vector Machine
K Means Clustering
Elbow Method
DBSCAN Clustering
Logistic Regression
Support Vector Machine (Two Classes)
Feature Selection:
- Near Zero Variance
- Learning Vector Quantification
- Recursive Feature Elimination
- Boruta
- LASSO Regression
ovun.sample (Random Under-Sampling)
Recall and F1

Results

The unsupervised One Class SVM model is a viable model for predicting credit card frauds. Utilizing the k means clusters as features or labels in supervised models are also viable methods.

Project Preview

Exploratory Data Analysis

The EDA shows there are distinctions between the fraud records and the non-fraud records.

Over

K Means Clusters

elbow

two

scatter

dbscan

ten

Modeling

The unsupervised learning methods I analyzed were One Class Support Vector Machine, k means clustering and DBSCAN clustering. I compared the results of the unsupervised learning models with supervised classification models. I tested the k means models with the clusters as input features and as classification labels in supervised models. In each model, I used a dataset with all the input features except three non-normal features that I dropped. I also tested each model with only the 5 features identified as the most significant features.

In the unsupervised models, I used the original highly imbalanced proportion of fraud vs non-fraud cases. However, the imbalance must be addressed in the supervised models used for comparison. Therefore, I under-sampled the training data for the supervised models.

Evaluation

One Class Support Vector Machine:

RESULTSSVM

K Means:

Clusters as input features in the supervised model:

RESULTSfeatures

Clusters as the classificaiton labels in the supervised model:

RESULTSlabels

The Complete Project: here.

Share on

Twitter Facebook Google+ LinkedIn

Mary Donovan Martello

Fraud Prediction in R: Credit Card Transactions

Unsupervised Learning for Predicting Credit Card Fraud in R

Project Preview

Exploratory Data Analysis

K Means Clusters

Modeling

Evaluation

The Complete Project: here.

Share on

You May Also Enjoy

Credit Card Defaults Part 2: Imbalanced Data and Deployment

Credit Card Defaults Part 1: Classification

SQL

Deep Learning Projects