Hi all! I would like to present my project for the MLOps course organized by DataTalksClub.
If you are a reviewer, please take a look at the checklist below for convenience.
In this project, I used the Kaggle Fake News dataset to classify whether an article is fake, given its title, author, and text.
I built a small web service using Flask that identifies fake news.
Preprocessing:
First, the data must be cleaned, for example by removing words with little meaning such as prepositions. For that, I used the stop-words list from the nltk library. I also applied stemming to reduce each word to its root.
Since the data is text, each document must be represented numerically. To do so, I used TfidfVectorizer from the sklearn library, which converts text into a matrix of TF-IDF features.
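A minimal sketch of this preprocessing step is shown below. The column names follow the Kaggle Fake News schema, but the exact cleaning rules are an assumption; the real logic lives in the project's training code.

```python
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords")


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, drop stop words, and stem each word."""
    words = re.sub(r"[^a-zA-Z]", " ", str(text)).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)


# Columns "title", "author", "text", and "label" come from the Kaggle dataset.
df = pd.read_csv("data/train.csv").fillna("")
corpus = (df["title"] + " " + df["author"] + " " + df["text"]).map(clean_text)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF features
y = df["label"].to_numpy()
```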
Model and training:
For training, I selected Logistic Regression as the classification method, using the implementation already available in sklearn.
I trained the model with different hyperparameters using KFold cross-validation and logged the metrics to MLflow. The best model was selected according to the F1 score.
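The sketch below illustrates what such a loop can look like. It reuses X and y from the preprocessing sketch above, and the hyperparameter grid, experiment name, and metric name are illustrative assumptions rather than the exact values used in train.py.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

mlflow.set_tracking_uri("http://127.0.0.1:6677")  # MLflow runs on port 6677 in this project
mlflow.set_experiment("fake-news")

# X, y: TF-IDF features and labels from the preprocessing sketch above.
for C in (0.01, 0.1, 1.0, 10.0):  # illustrative hyperparameter values
    with mlflow.start_run():
        mlflow.log_param("C", C)
        scores = []
        for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
            model = LogisticRegression(C=C, max_iter=1000)
            model.fit(X[train_idx], y[train_idx])
            scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
        mlflow.log_metric("f1", float(np.mean(scores)))
        # Log the model from the last fold as an illustration.
        mlflow.sklearn.log_model(model, artifact_path="model")
```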
For basic orchestration I used Prefect 2.3.1, but it is not fully deployed.
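For reference, a Prefect 2 flow in this style looks roughly like the sketch below; the task breakdown is an assumption, not the exact flow defined in the project.

```python
import pandas as pd
from prefect import flow, task


@task
def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path).fillna("")


@task
def train_model(df: pd.DataFrame) -> None:
    ...  # preprocessing, KFold training, and MLflow logging as described above


@flow(name="train-fake-news-classifier")
def main(path: str = "data/train.csv") -> None:
    df = load_data(path)
    train_model(df)


if __name__ == "__main__":
    main()
```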
I built a simple web service with Flask that listens on port 9000. It is containerized with Docker and deployed locally.
It supports two simple routes (a rough sketch of the whole service follows the list below):
- "/": On a "GET" request, you get a form where you can write your own test case. Pressing the "Submit" button sends a "POST" request to the same route, where the saved preprocessors and model make a prediction. The prediction consists of the predicted label and its probability. The input data and the prediction are also stored locally for monitoring.
- "/monitor": This page is a simple monitoring service built with evidently. After enough data has been gathered, a dashboard with a DataDriftTab is created to analyze the predicted labels and their probabilities. The report is saved as an HTML file and can then be viewed with a "GET" request.
How to run:
- Download the data from Kaggle or Google Drive, then place `train.csv` and `test.csv` into the `data/` folder.
- Make sure you have pipenv installed. Then, in the project folder, run `pipenv install` and `pipenv shell` to install the packages from the Pipfile and activate the working environment.
- Run MLflow and Prefect in separate terminal windows using `make mlflow` and `make prefect`, and make sure they are not shut down during training. Please use these commands for correct reproducibility, because MLflow is not running on the usual port 5000 but on port 6677.
- Run `make train`. You can open MLflow and Prefect on localhost to view training logs, parameters, etc. Reminder: MLflow is running on port 6677 of localhost.
- At this point, you can also run some tests before deploying the model: `make test`.
- Now we want to deploy the model. Run `make deploy`. It will build a Docker image and run a Docker container that serves the Flask app described above.
- After that, simulate traffic by sending test cases from `data/test.csv` one by one to the server. To do so, run `make simulate` (a Python sketch of this step is shown after these instructions).
- Now you can check `localhost:9000` and `localhost:9000/monitor` to see the simple form for POST requests and the evidently report, respectively.
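A traffic simulation along these lines can be as simple as the sketch below; the form field and the request payload are assumptions and may differ from what `make simulate` actually sends.

```python
import time

import pandas as pd
import requests

df = pd.read_csv("data/test.csv").fillna("")

for _, row in df.iterrows():
    # Assumed form field name; the real simulation script may post a different payload.
    response = requests.post(
        "http://localhost:9000/",
        data={"text": f'{row["title"]} {row["author"]} {row["text"]}'},
    )
    print(response.status_code, response.text)
    time.sleep(0.1)  # small pause so requests arrive one by one
```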
Checklist:
- Cloud: not used; everything runs locally.
- Experiment tracking and model registry: I used MLflow, and for each experiment I log the metrics and the model artifact. After the training runs finish, the run with the highest F1 score is registered, and its weights are saved locally in the artifacts folder. The previous model is moved to "Staging". For details, check train.py.
- Workflow orchestration: I used basic Prefect flows and tasks, but the flow is not deployed.
- Model deployment: a Flask server containerized with Docker. It uses the locally saved models and writes predictions to the same folder. For details, check the Dockerfile and deploy_model.py.
- Monitoring: I used evidently to analyze data drift. The functionality is basic: it calculates the metrics and saves the report as an HTML file that can be viewed in the browser. For details, take a look at deploy_model.py.
- Reproducibility: if you have any problems with the instructions, contact me on Telegram or on the DataTalksClub Slack (@Slava Shen).
- Best practices:
  - Unit tests: check the `tests` folder
  - Integration tests
  - Linter / code formatter: I used pylint, black, and isort. Check the pyproject.toml file.
  - Makefile: used to run some of the important commands. Check the Makefile.
  - Pre-commit hooks: take a look at .pre-commit-config.yaml. It formats the code and runs the tests.
  - CI/CD