Hi all! I would like to present my project for the MLOps course organized by DataTalksClub.
If you are a reviewer, please take a look at the checklist below for convenience.
In this project, I used the Kaggle Fake News dataset to classify whether an article is fake, given its title, author, and text.
I built a small web service using Flask that identifies fake news.
Preprocessing:
First, the data must be cleaned, for example by removing words with little meaning such as prepositions. For that, I used the stop-words list from the nltk library. I also applied stemming to reduce each word to its root.
Since the data is text, each document must be represented numerically. To do so, I used TfidfVectorizer from the sklearn library, which converts text into a matrix of TF-IDF features.
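A minimal sketch of this preprocessing step is shown below. The column names follow the Kaggle Fake News schema, but the exact cleaning rules are an assumption; the real logic lives in the project's training code.

```python
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, drop stop words, and stem each word."""
    words = re.sub(r"[^a-zA-Z]", " ", str(text)).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)


# Columns "title", "author", "text", and "label" come from the Kaggle dataset.
df = pd.read_csv("data/train.csv").fillna("")
corpus = (df["title"] + " " + df["author"] + " " + df["text"]).map(clean_text)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF features
y = df["label"].to_numpy()
```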
Model and training:
For training, I selected Logistic Regression as the classification method, using the implementation already available in sklearn.
I trained the model with different hyperparameters using KFold cross-validation and logged the metrics to MLflow. The best model was selected according to the F1 score.
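The sketch below illustrates what such a loop can look like. It reuses X and y from the preprocessing sketch above, and the hyperparameter grid, experiment name, and metric name are illustrative assumptions rather than the exact values used in train.py.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

mlflow.set_tracking_uri("http://127.0.0.1:6677")  # MLflow runs on port 6677 in this project
mlflow.set_experiment("fake-news")

# X, y: TF-IDF features and labels from the preprocessing sketch above.
for C in (0.01, 0.1, 1.0, 10.0):  # illustrative hyperparameter values
    with mlflow.start_run():
        mlflow.log_param("C", C)
        scores = []
        for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
            model = LogisticRegression(C=C, max_iter=1000)
            model.fit(X[train_idx], y[train_idx])
            scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
        mlflow.log_metric("f1", float(np.mean(scores)))
        # Log the model from the last fold as an illustration.
        mlflow.sklearn.log_model(model, artifact_path="model")
```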
For basic orchestration I used Prefect 2.3.1, but it is not fully deployed.
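For reference, a Prefect 2 flow in this style looks roughly like the sketch below; the task breakdown is an assumption, not the exact flow defined in the project.

```python
import pandas as pd
from prefect import flow, task


@task
def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path).fillna("")


@task
def train_model(df: pd.DataFrame) -> None:
    ...  # preprocessing, KFold training, and MLflow logging as described above


@flow(name="train-fake-news-classifier")
def main(path: str = "data/train.csv") -> None:
    df = load_data(path)
    train_model(df)


if __name__ == "__main__":
    main()
```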
I built a simple web service with Flask that listens on port 9000. It is containerized with Docker and deployed locally.
It supports two simple routes (a rough sketch of the whole service follows the list below):
- "/": On a "GET" request, you get a form where you can write your own test case. Pressing the "Submit" button sends a "POST" request to the same route, where the saved preprocessors and model make a prediction. The prediction consists of the predicted label and its probability. The input data and the prediction are also stored locally for monitoring.
- "/monitor": This page is a simple monitoring service built with evidently. After enough data has been gathered, a dashboard with a DataDriftTab is created to analyze the predicted labels and their probabilities. The report is saved as an HTML file and can then be viewed with a "GET" request.
How to run:
- Download the data from Kaggle or Google Drive, then place `train.csv` and `test.csv` into the `data/` folder.
- Make sure you have pipenv installed. Then, in the project folder, run `pipenv install` and `pipenv shell` to install the packages from the Pipfile and activate the working environment.
- Run MLflow and Prefect in separate terminal windows using `make mlflow` and `make prefect`, and make sure they are not shut down during training. Please use these commands for correct reproducibility, because MLflow is not running on the usual port 5000 but on port 6677.
- Run `make train`. You can open MLflow and Prefect on localhost to view training logs, parameters, etc. Reminder: MLflow is running on port 6677 of localhost.
- At this point, you can also run some tests before deploying the model: `make test`.
- Now we want to deploy the model. Run `make deploy`. It will build a Docker image and run a Docker container that serves the Flask app described above.
- After that, simulate traffic by sending test cases from `data/test.csv` one by one to the server. To do so, run `make simulate` (a Python sketch of this step is shown after these instructions).
- Now you can check `localhost:9000` and `localhost:9000/monitor` to see the simple form for POST requests and the evidently report, respectively.
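A traffic simulation along these lines can be as simple as the sketch below; the form field and the request payload are assumptions and may differ from what `make simulate` actually sends.

```python
import time

import pandas as pd
import requests

df = pd.read_csv("data/test.csv").fillna("")

for _, row in df.iterrows():
    # Assumed form field name; the real simulation script may post a different payload.
    response = requests.post(
        "http://localhost:9000/",
        data={"text": f'{row["title"]} {row["author"]} {row["text"]}'},
    )
    print(response.status_code, response.text)
    time.sleep(0.1)  # small pause so requests arrive one by one
```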
Checklist:
- Cloud: not used; everything runs locally.
- Experiment tracking and model registry: I used MLflow, and for each experiment I log the metrics and the model artifact. After the training runs finish, the run with the highest F1 score is registered, and its weights are saved locally in the artifacts folder. The previous model is moved to "Staging". For details, check train.py.
- Workflow orchestration: I used basic Prefect flows and tasks, but the flow is not deployed.
- Model deployment: a Flask server containerized with Docker. It uses the locally saved models and writes predictions to the same folder. For details, check the Dockerfile and deploy_model.py.
- Monitoring: I used evidently to analyze data drift. The functionality is basic: it calculates the metrics and saves the report as an HTML file that can be viewed in the browser. For details, take a look at deploy_model.py.
- Reproducibility: if you have any problems with the instructions, contact me on Telegram or on the DataTalksClub Slack (@Slava Shen).
- Best practices:
  - Unit tests: check the `tests` folder
  - Integration tests
  - Linter / code formatter: I used pylint, black, and isort. Check the pyproject.toml file.
  - Makefile: used to run some of the important commands. Check the Makefile.
  - Pre-commit hooks: take a look at .pre-commit-config.yaml. It formats the code and runs the tests.
  - CI/CD