The complete pipeline for fraud detection. Includes data extraction, model training, API deployment.

saurabhbatra96/wmf-fraud-pipeline

WMF Fraud Pipeline

A very hacky but working implementation of the data extraction, data pre-processing and model training pipeline for WMF fraud detection.

Requirements

  • Python 2.7
  • Pip
  • Libraries as listed in requirements.txt. It is strongly recommended that you set up a virtualenv before running pip install.

Pipeline Steps - Importing a fresh/updated version of the model into the API

  • The classifier needs two kinds of data to work correctly: fraudulent and genuine.

  • On frdev, run the queries in the data-extraction folder as follows:

    1. Run $ mysql < fraud-query.sql | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > ../data/fraud-data.csv
    2. Get the number of fraud rows with $ wc -l ../data/fraud-data.csv (the count includes the header line, so the number of data rows is one less).
    3. Open genuine-query.sql in a text editor and replace "$num" in the last line (the limit clause) with the number of fraud rows.
    4. Save the file, then run $ mysql < genuine-query.sql | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > ../data/genuine-data.csv
    5. Change to the data folder: $ cd ../data
    6. Concatenate the two data files: $ head -n 1 fraud-data.csv > orig-data.csv; tail -n+2 -q genuine-data.csv >> orig-data.csv; tail -n+2 -q fraud-data.csv >> orig-data.csv
  • Once we have the combined data file, we need to pre-process it. Run $ python feature-eng.py, which should generate a file called data-eng.csv.

  • Navigate to the model-training folder with $ cd ../model-training, then run $ python model-train-gb.py

  • Copy the contents of the private folder to the private folder in the API. Our new model is loaded!
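
For reference, the data preparation in the numbered steps above (quoting the tab-separated mysql output as CSV, counting the fraud rows, and concatenating the two files under one header) can be sketched in Python. This is only an illustrative sketch and not part of the repository: the function names and the id/amount columns are hypothetical, and it is written for Python 3 even though the pipeline itself targets Python 2.7. Unlike the sed command, the csv module also doubles any double quotes embedded in a field.

```python
import csv
import io


def tsv_to_quoted_csv(tsv_text):
    """Turn tab-separated text into CSV with every field double-quoted."""
    if not tsv_text:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in tsv_text.rstrip("\n").split("\n"):
        writer.writerow(line.split("\t"))
    return out.getvalue()


def data_row_count(csv_text):
    """Number of rows excluding the header line (i.e. `wc -l` minus one)."""
    return max(len(csv_text.splitlines()) - 1, 0)


def combine(fraud_csv, genuine_csv):
    """Header from the fraud file, then genuine rows, then fraud rows."""
    fraud_lines = fraud_csv.splitlines()
    genuine_lines = genuine_csv.splitlines()
    combined = fraud_lines[:1] + genuine_lines[1:] + fraud_lines[1:]
    return "\n".join(combined) + "\n"


# Hypothetical sample data; the real columns come from the SQL queries.
fraud_csv = tsv_to_quoted_csv("id\tamount\n1\t10.00\n2\t25.50\n")
genuine_csv = tsv_to_quoted_csv("id\tamount\n7\t3.00\n8\t4.00\n")

limit = data_row_count(fraud_csv)  # value to substitute for "$num"
orig_data = combine(fraud_csv, genuine_csv)
```

Here orig_data holds what step 6 writes to orig-data.csv: a single header line followed by the genuine rows and then the fraud rows.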

Running the API

Research and Findings

A complete project summary can be found on my blog here.
