The dataset I used is available at: https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022?select=Base.csv It is the Base variant of the Bank Account Fraud dataset suite. It is a tabular dataset with 1 million instances and 31 features.
- First, I explored the data and found that the column “device_fraud_count” has a single value for all instances, so I dropped it.
- Then I checked for attributes whose values are mostly missing. I found two, “prev_address_months_count” and “intended_balcon_amount”, and dropped both.
- Then I imputed the remaining attributes that have missing values. Some attributes use -1 to represent a missing value, and others use any negative value. I imputed numerical attributes with the median and categorical attributes with the mode.
- After imputation, I split the data into training and test sets based on the “month” attribute: months 0 to 5 for training and months 6 to 7 for testing.
- Because the classes are imbalanced, I applied SMOTE oversampling so that the two labels have equal counts.
- Then I did feature selection based on domain knowledge and correlation.
- After that, I did 1-in-100 systematic sampling.
- After sampling, I used time-series validation.
- For modeling, I applied three techniques: Decision Tree, Random Forest, and Logistic Regression.
- For effectiveness, I compared the models using the confusion matrix, precision, recall, F1-score, ROC AUC, and the Matthews correlation coefficient.
- For efficiency, I compared each model’s execution time.
- For stability, I changed the random seed to 10, 500, and 5000 and checked how the metric results changed.
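
The data preparation steps above (marking sentinel values as missing, median/mode imputation, the month-based split, and 1-in-100 systematic sampling) could look roughly like the sketch below. This is only an illustration with pandas, not the code in this repository: the helper names and the toy columns `tenure`, `balance`, and `channel` are hypothetical, and which real columns use -1 or negative values as a missing marker is left as a parameter.

```python
import numpy as np
import pandas as pd


def mark_missing(df, minus_one_cols=(), negative_cols=()):
    """Turn sentinel codes into NaN so pandas treats them as missing."""
    out = df.copy()
    for col in minus_one_cols:
        # -1 is the missing marker in these columns
        out[col] = out[col].replace(-1, np.nan)
    for col in negative_cols:
        # any negative value is the missing marker in these columns
        out[col] = out[col].where(out[col] >= 0)
    return out


def impute(df):
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def split_by_month(df, train_months=range(0, 6), test_months=(6, 7)):
    """Months 0 to 5 go to training, months 6 to 7 go to testing."""
    train = df[df["month"].isin(list(train_months))]
    test = df[df["month"].isin(list(test_months))]
    return train, test


def systematic_sample(df, step=100, start=0):
    """Keep every `step`-th row, starting at position `start`."""
    return df.iloc[start::step]


# Toy data with hypothetical columns, only to show the calls.
toy = pd.DataFrame({
    "month": [0, 1, 2, 3, 4, 5, 6, 7],
    "tenure": [12, -1, 30, 6, -1, 18, 24, 9],  # -1 means missing
    "balance": [100.0, -5.0, 250.0, 80.0, 40.0, -1.0, 60.0, 90.0],  # negative means missing
    "channel": ["web", None, "app", "web", "web", "app", None, "web"],
    "fraud_bool": [0, 0, 1, 0, 0, 0, 1, 0],
})

clean = impute(mark_missing(toy, minus_one_cols=["tenure"], negative_cols=["balance"]))
train, test = split_by_month(clean)
sampled = systematic_sample(clean, step=4)  # 1-in-4 here; step=100 gives 1-in-100
```

The sketch imputes before splitting, which follows the order of the steps listed above. Computing the medians and modes on the training months only and reusing them for the test months is a common alternative that keeps test data out of the imputation statistics.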
All Python code is in the Code folder.
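
As a rough illustration of the effectiveness measures listed above, the sketch below derives precision, recall, F1-score, and the Matthews correlation coefficient from confusion-matrix counts using only the standard library. The function names and the toy labels are hypothetical and are not taken from the Code folder. ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def scores(tp, fp, tn, fn):
    """Compute the metrics; any ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy labels: 3 fraud cases out of 10, one missed and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)  # (tp, fp, tn, fn)
result = scores(*counts)
```

On imbalanced data such as this, MCC is informative because it uses all four cells of the confusion matrix. In practice, `sklearn.metrics` offers the same measures (`precision_score`, `recall_score`, `f1_score`, `matthews_corrcoef`, `roc_auc_score`).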
Please feel free to request the full and updated version of the report by sending your inquiry to my email address: k10lu@torontomu.ca. I encourage open communication, and I am here to assist with any information or queries you may have. Your engagement is greatly appreciated.