
vipul-shinde/toxic-comment-classification




Toxic Comment Classification using Flask & AWS 🔍



This is a toxic comment classifier web application that uses a trained Logistic Regression model to predict the toxicity levels of a given text input.

Link to the web app: 👉🏻 Toxic Comment Classifier

Disclaimer: the dataset for this project contains text that may be considered profane, vulgar, or offensive.


🧐 About

This is a multi-label classification problem: the input is a text comment, and the output is the list of toxicity labels that apply to it.

The input text data needs to be cleaned and pre-processed for it to be useful for the Machine Learning model.
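As a rough illustration of the kind of cleaning meant here, the sketch below lowercases a comment, strips URLs and non-letter characters, and drops stop words. The function name `clean_comment` and the tiny stop-word list are assumptions made for this example; the project's actual preprocessing steps are not described in this README and may differ.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch; the
# stop-word corpus from nltk, listed in the prerequisites, is more complete).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you"}


def clean_comment(text: str) -> str:
    """Lowercase, strip URLs and non-letter characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_comment("You are the WORST!!! See http://example.com :("))
# prints: worst see
```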

📊 Dataset Overview

The dataset for this problem was taken from a competition hosted by Jigsaw on Kaggle.

The input text is vectorized with both a word-based and a character-based TF-IDF vectorizer. The outputs of both are used as inputs to the model, for better performance and minimal loss of input features.
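One way to combine word-level and character-level TF-IDF features with scikit-learn is to fit two vectorizers and stack their sparse outputs side by side. The sketch below uses made-up comments, and the n-gram ranges are assumptions chosen for illustration, not the settings used in this project.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments, invented for this example.
comments = [
    "you are a wonderful person",
    "this is a stupid idea",
    "have a nice day",
]

# Word-level features capture vocabulary; character n-grams are more robust
# to misspellings and obfuscated words. The n-gram ranges are assumptions.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))

X_word = word_vec.fit_transform(comments)
X_char = char_vec.fit_transform(comments)

# Concatenate the two sparse matrices column-wise into one feature matrix.
X = hstack([X_word, X_char]).tocsr()
print(X_word.shape, X_char.shape, X.shape)
```

At prediction time the same fitted vectorizers would be applied with `transform` (not `fit_transform`) and stacked in the same order.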

The different types of target labels present are: toxic, severe-toxic, obscene, threat, insult and identity hate.


🧠 Model Building

To build the classifier, we used Logistic Regression and treated the multi-label problem as a binary problem. We chose this approach over a OneVsRest classifier because the model performed better when the problem was treated as a binary one.

Since the data is imbalanced, accuracy alone is not a strong evaluation metric, so we also used the F1-score to evaluate the model.
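To see why accuracy alone can mislead on imbalanced data, consider a toy set of 100 comments of which only 5 are toxic. The labels and predictions below are invented for illustration and are unrelated to this project's results.

```python
from sklearn.metrics import accuracy_score, f1_score

# 5 toxic comments (1) and 95 clean comments (0): a made-up, imbalanced set.
y_true = [1] * 5 + [0] * 95

# A "classifier" that always predicts the majority class.
y_all_negative = [0] * 100

# A classifier that finds 3 of the 5 toxic comments and raises 2 false alarms.
y_model = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

acc_naive = accuracy_score(y_true, y_all_negative)
f1_naive = f1_score(y_true, y_all_negative, zero_division=0)
acc_model = accuracy_score(y_true, y_model)
f1_model = f1_score(y_true, y_model)

print(acc_naive, f1_naive)  # high accuracy, but F1 of zero
print(acc_model, f1_model)  # similar accuracy, clearly better F1
```

The always-negative predictor scores 0.95 accuracy while catching no toxic comments at all, which the F1-score of 0 makes visible. The second predictor has nearly the same accuracy (0.96) but an F1-score of 0.6.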

Here are the results on validation and test datasets:

Validation Results 👇🏻

Validation Accuracy: 0.9828502793879577
Validation F1-Score: 0.9811947440446507

Test Results 👇🏻

Test Accuracy: 0.9752805651942854
Test F1-Score: 0.9747181660736461


🎯 Getting Started

Project Structure:

D:.
├───data
│   ├───cleaned-data
│   └───raw-data
├───images
├───models
├───notebooks
├───static
│   └───css
├───templates
└───__pycache__

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Flask==1.1.2
joblib==1.0.1
nltk==3.6.1
numpy==1.20.1
pandas==1.2.4
scikit_learn==1.0.2
scipy==1.6.2
swifter==1.0.9

Installing

Use Miniconda to install Python 3.8 or higher, then install the dependencies:

pip install -r requirements.txt

🎈 Usage

To run the website, navigate to the main folder of the project and run:

python app.py

The server will be available at "localhost:5000".

Go to "localhost:5000", enter a comment, and click Classify to predict its toxicity values.
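For readers unfamiliar with how a Flask app can serve predictions like this, here is a minimal sketch. It is not the project's `app.py`: the `/classify` route, the `comment` form field, and the placeholder `predict_scores` function are all hypothetical names introduced for this example, and the placeholder returns fixed scores instead of calling a trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Label names as listed in the Dataset Overview section.
LABELS = ["toxic", "severe-toxic", "obscene", "threat", "insult", "identity hate"]


def predict_scores(comment: str) -> dict:
    """Hypothetical placeholder: returns 0.0 for every label.

    A real app would vectorize the comment and call the trained model here.
    """
    return {label: 0.0 for label in LABELS}


@app.route("/classify", methods=["POST"])  # hypothetical route name
def classify():
    # Read the submitted comment from the form data (hypothetical field name).
    comment = request.form.get("comment", "")
    return jsonify(predict_scores(comment))

# To serve this sketch locally, call app.run(port=5000).
```

The sketch can be exercised without starting a server by using Flask's built-in test client, for example `app.test_client().post("/classify", data={"comment": "hello"})`.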


🚀 Deployment

The model has been deployed on an EC2 instance on AWS, and the instance's IP has been made publicly accessible. The link to the deployed web app is below:

Link: http://ec2-18-117-78-151.us-east-2.compute.amazonaws.com:8080/

🌟 Support

Please hit the ⭐ button if you like this project. 😄

Thank you!
