A very hacky but working implementation of the data extraction, data pre-processing and model training pipeline for WMF fraud detection.
- Python 2.7
- pip
- Libraries listed in requirements.txt. It is strongly recommended that you use a virtualenv before running pip install.
We need to obtain two kinds of data for the classifier to work correctly: fraudulent and genuine.
On frdev, run the queries given in the data-extraction folder as follows:
- Run
$ mysql < fraud-query.sql | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > ../data/fraud-data.csv
- Get the number of fraud rows returned by the command
$ wc -l ../data/fraud-data.csv
(note that wc -l also counts the CSV header line)
- Open genuine-query.sql in a text editor and replace "$num" in the last line (the LIMIT clause) with the number of fraud rows.
- Save the file and run
$ mysql < genuine-query.sql | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > ../data/genuine-data.csv
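The sed pipeline in the two commands above converts mysql's tab-separated batch output into quoted CSV rows. A rough Python illustration of that transformation (this helper is mine, not part of the repository, and it skips the single-quote handling step):

```python
# Illustrative only: mimic the sed expressions that turn each tab into
# "," and wrap the whole line in double quotes, yielding a quoted CSV row.
def tsv_line_to_csv(line):
    fields = line.rstrip("\n").split("\t")
    return '"' + '","'.join(fields) + '"'

print(tsv_line_to_csv("id\tamount\tcountry"))  # → "id","amount","country"
```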
- Change to the data folder:
$ cd ../data
- Concatenate the two data files:
$ head -n 1 fraud-data.csv > orig-data.csv; tail -n+2 -q genuine-data.csv >> orig-data.csv; tail -n+2 -q fraud-data.csv >> orig-data.csv
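The head/tail one-liner above keeps a single header row and appends only the data rows of each file. A rough Python equivalent, for orientation (illustrative only; note that the shell version takes the header from fraud-data.csv and appends the genuine rows before the fraud rows):

```python
# Illustrative only: write the header once (from the first file), then
# append only the data rows of every file, skipping each later header.
def concat_with_single_header(out_path, in_paths):
    with open(out_path, "w") as out:
        for i, path in enumerate(in_paths):
            with open(path) as f:
                lines = f.readlines()
            # keep the header line only for the first file
            out.writelines(lines if i == 0 else lines[1:])
```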
Once we have the combined data file, we need to pre-process it. Run
$ python feature-eng.py
This should generate a file called data-eng.csv.
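For orientation, here is a minimal sketch of the general shape of such a pre-processing step: read the combined CSV, derive extra feature columns, and write data-eng.csv. The real features live in feature-eng.py; the "amount" column and the derived flag below are purely hypothetical:

```python
import csv

# Illustrative sketch only: the actual transformations are defined in
# feature-eng.py. "amount" and "is_large" are hypothetical column names.
def engineer(in_path="orig-data.csv", out_path="data-eng.csv"):
    with open(in_path) as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        # hypothetical derived feature: flag unusually large amounts
        row["is_large"] = "1" if float(row["amount"]) >= 100 else "0"
    fieldnames = list(rows[0].keys()) if rows else []
    with open(out_path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```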
- Navigate to the model-training folder:
$ cd ../model-training
- Run
$ python model-train-gb.py
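The -gb suffix suggests a gradient-boosting model. A hypothetical sketch of what such a training script could look like, assuming scikit-learn; the paths, column names and hyper-parameters below are assumptions, not taken from the actual script:

```python
import csv
import pickle
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative sketch only: the real feature and label columns are
# defined in model-train-gb.py. "amount", "risk_score" and "is_fraud"
# are hypothetical names.
def train(csv_path, model_path):
    X, y = [], []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            X.append([float(row["amount"]), float(row["risk_score"])])
            y.append(int(row["is_fraud"]))
    clf = GradientBoostingClassifier(n_estimators=100)
    clf.fit(X, y)
    # persist the model so it can be copied into the API's private folder
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    return clf
```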
- Copy the contents of the private folder to the private folder in the API. Our new model is loaded!
- The steps are covered in the API README - https://github.com/saurabhbatra96/wmf-fd-api/blob/master/README.md#installation-and-usage
A complete project summary can be found on my blog here.