---
title: reddit_text_classification_app
emoji: 🐠
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 3.13.0
app_file: app.py
pinned: false
---
Link to YouTube demo:
Reddit is a place where people come together to have a variety of conversations on the internet. However, abusive language can seriously harm users in online communities. As students passionate about data science, we are interested in detecting inappropriate and unprofessional Reddit posts and warning users about explicit content in those posts.

In this project, we created a Hugging Face Spaces app with a Gradio interface that classifies not-safe-for-work (NSFW) content, specifically text that is considered inappropriate and unprofessional. We used a pre-trained DistilBERT transformer model for the sentiment analysis. The model was fine-tuned on Reddit posts and predicts two classes: NSFW and safe for work (SFW).
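As a minimal sketch of the classification flow described above: with `transformers` installed, the fine-tuned model could be called via `pipeline("text-classification", model=...)`, which returns a label and a confidence score. The helper below (our own illustrative name, not code from this repo) shows how such a prediction could be turned into the SAFE/WARNING message the app displays; the model call itself is shown only in comments because the model ID is an assumption.

```python
# The model call would look something like (model ID assumed, not from the repo):
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="<your-hf-username>/reddit-nsfw-model")
#   pred = clf(post_text)[0]   # e.g. {"label": "NSFW", "score": 0.97}

def format_warning(pred: dict) -> str:
    """Map a single text-classification prediction to the app's output message."""
    label, score = pred["label"], pred["score"]
    if label == "NSFW":
        return f"WARNING: likely NSFW content (confidence {score:.0%})"
    return f"SAFE: no explicit content detected (confidence {score:.0%})"

print(format_warning({"label": "NSFW", "score": 0.97}))
```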
- Data is pulled in the notebook `reddit_data/reddit_new.ipynb` and used to fine-tune the Hugging Face model.
## Verify GPU works in this repo

- Run the PyTorch training test: `python utils/quickstart_pytorch.py`
- Run the PyTorch CUDA test: `python utils/verify_cuda_pytorch.py`
- Run the TensorFlow training test: `python utils/quickstart_tf2.py`
- Run the NVIDIA monitoring test: `nvidia-smi -l 1`
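The checks above can also be done programmatically before launching a fine-tuning run. The sketch below (our own helper, not part of this repo's `utils/` scripts) reports whether PyTorch can see a CUDA device, and degrades gracefully when PyTorch is not installed:

```python
import importlib.util

def cuda_status() -> str:
    """Report GPU availability: 'cuda', 'cpu-only', or 'pytorch-missing'."""
    if importlib.util.find_spec("torch") is None:
        return "pytorch-missing"
    import torch  # imported lazily so the check works without PyTorch
    return "cuda" if torch.cuda.is_available() else "cpu-only"

print(cuda_status())
```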
- In a terminal, run `huggingface-cli login`
- Run `python fine_tune_berft.py` to fine-tune the model on Reddit data
- Run `rename_labels.py` to change the output labels of the classifier
- Check out the fine-tuned model here
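A label-renaming step like the one above typically amounts to rewriting the `id2label`/`label2id` entries in the fine-tuned model's `config.json`, so the classifier reports "SFW"/"NSFW" instead of the generic "LABEL_0"/"LABEL_1". The sketch below is a hypothetical illustration of that idea, not the contents of `rename_labels.py`, and the mapping is an assumption:

```python
# Assumed mapping from class index to human-readable name (not from the repo).
NEW_ID2LABEL = {"0": "SFW", "1": "NSFW"}

def rename_labels(config: dict) -> dict:
    """Return a copy of a model config with human-readable label names."""
    updated = dict(config)
    updated["id2label"] = NEW_ID2LABEL
    updated["label2id"] = {name: int(i) for i, name in NEW_ID2LABEL.items()}
    return updated

config = {"id2label": {"0": "LABEL_0", "1": "LABEL_1"}}
print(rename_labels(config)["id2label"]["1"])  # → NSFW
```

In practice the updated dict would be written back to the model's `config.json` (e.g. with `json.dump`) before pushing to the Hub.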
- In a terminal, run `python3 app.py`
- Open the app in a browser
- Paste a Reddit URL into `input_url` and get the output
- Or directly check out the Spaces app here
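One common way for an app like this to get the post text from a Reddit URL is Reddit's public JSON endpoint: the same post path with `.json` appended returns the post's title and body as JSON. The helper below is an assumed illustration of that URL conversion (the function name and any subsequent `requests.get` fetch are not taken from this repo's `app.py`):

```python
def to_json_api(post_url: str) -> str:
    """Convert a Reddit post URL to its public JSON API endpoint."""
    base = post_url.split("?", 1)[0].rstrip("/")  # drop query string and trailing slash
    return base + ".json"

print(to_json_api("https://www.reddit.com/r/aww/comments/abc123/cute_puppy/"))
```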
Example outputs: a SAFE Reddit URL and a WARNING Reddit URL.
[1] “CADD_dataset,” GitHub, Sep. 26, 2022. https://github.com/nlpcl-lab/cadd_dataset
[2] H. Song, S. H. Ryu, H. Lee, and J. Park, “A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit,” ACLWeb, Nov. 01, 2021. https://aclanthology.org/2021.conll-1.43/