Insurance Fraud in Python

Classification Prediction

Summary

The goal of this project was to use Python to identify significant features of fraudulent insurance claims and to build classification models that predict whether fraud was reported on a claim. The imbalanced target variable was addressed by weighting the classes. Logistic Regression, Support Vector Machine Classification, and Random Forest models were tested. The paper walks through data understanding and preparation, the models tested, the methodology, and the evaluation of the project.
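
The sketch below illustrates the class-weighting idea in scikit-learn for the three model families mentioned above; the estimator settings are assumptions for illustration, not the project's exact configuration.

```python
# A minimal sketch (not the project's code): each scikit-learn estimator is
# given class_weight="balanced" so the minority fraud class is weighted
# inversely to its frequency. Hyperparameter values are placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "svc": SVC(class_weight="balanced", kernel="rbf"),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
}
```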

Tools

  • Scikit-learn
  • Seaborn
  • Matplotlib
  • Yellowbrick
  • Numpy
  • Pandas
  • Scipy
  • Patsy
  • Tabulate
  • Collections (Counter)

Data: claims

Methodology

Multiple versions of each model were compared, varying the techniques used for data splitting, handling the imbalanced target variable, and feature selection.
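
The sketch below illustrates one such variant: a stratified train/test split that preserves the class imbalance, paired with cross-validated recall to compare models. The `claims.csv` path and `fraud_reported` column name are placeholders, not taken from the project.

```python
# A minimal sketch, assuming a pandas DataFrame `claims` with a binary (0/1)
# target column `fraud_reported` (both names are placeholders): a stratified
# split keeps the class imbalance identical in the train and test partitions,
# and cross-validated recall compares model variants.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

claims = pd.read_csv("claims.csv")              # placeholder file name
X = claims.drop(columns="fraud_reported")       # placeholder target column
y = claims["fraud_reported"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
recall_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="recall")
print(f"Mean cross-validated recall: {recall_scores.mean():.3f}")
```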

Models / Methods / Metrics

  • Random Forest
  • Logistic Regression / LASSO Logistic Regression
  • Support Vector Classification
  • Principal Component Analysis
  • Log-Transformation and Scaling
  • GridSearch (see the sketch after this list)
  • Recall
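
The sketch below shows how several of these pieces might fit together in a single scikit-learn pipeline; the parameter grid, preprocessing order, and threshold choices are assumptions for illustration, not the project's exact setup.

```python
# A minimal sketch combining the preprocessing and tuning steps listed above:
# a log transform and standard scaling feed a class-weighted LASSO logistic
# regression, and GridSearchCV picks the regularization strength that
# maximizes cross-validated recall. The grid values are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),   # log-transformation of skewed numeric features
    ("scale", StandardScaler()),              # scaling so coefficients are comparable
    ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                               class_weight="balanced")),  # LASSO logistic regression
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10]}   # inverse regularization strengths to search
grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
# grid.fit(X_train, y_train)                  # using the split from the earlier sketch
```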

Project Preview

Exploratory Data Analysis

This project used the exploratory data analysis completed in a companion Exploratory Data Analysis and Hypothesis Testing project: EDA.

Principal Component Analysis

PCA was implemented to address multicollinearity between groups of input variables.
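
The sketch below shows one common way to apply PCA for this purpose; the 95% explained-variance threshold and the reuse of the `X_train` placeholder from the earlier sketch are assumptions.

```python
# A minimal sketch, reusing the X_train placeholder from the earlier split:
# standardize the correlated numeric inputs, fit PCA, and keep the smallest
# number of components that explains roughly 95% of the variance.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X_train)       # PCA is sensitive to scale
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95) + 1)    # 95% threshold is an assumption
X_train_pca = PCA(n_components=n_components).fit_transform(X_scaled)
```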

Evaluation

[Results figure]

The Complete Project: here.

Updated: