spam_or_ham

This project is a spam or ham classification system built using machine learning techniques. The models are trained on a dataset of labeled emails and uses natural language processing techniques to convert the emails to vectors, and then classify them as spam or ham. This repository includes the scripts for training several models, which were then compared to each other. The 3 best models were used to be embedded in a simple chat application with a GUI that offers a spam detection service.

Dataset

The dataset used for training the models is the SMS Spam Collection Dataset from Kaggle. It contains 5,574 messages, tagged acording being ham (legitimate) or spam.

Data distribution before cleaning

Data distribution after cleaning:

Words per category

Results

Classifier	Tag	Precision	Recall	F1 Score	Support
Multinomial Naive Bayes	ham	0.97	1.0	0.98	685
Multinomial Naive Bayes	spam	0.98	0.86	0.92	140
Decision Trees	ham	0.97	0.97	0.97	685
Decision Trees	spam	0.87	0.87	0.87	140
Extra Trees	ham	0.98	0.99	0.99	685
Extra Trees	spam	0.95	0.91	0.93	140
Logistic Regression	ham	0.99	0.98	0.98	685
Logistic Regression	spam	0.91	0.93	0.92	140
Neural Network	ham	0.98	0.99	0.98	685
Neural Network	spam	0.95	0.89	0.92	140

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
data		data
models		models
plots		plots
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spam_or_ham

Dataset

Results

Confusion Matrices

Chat Application

About

Languages

dtotsila/spam_or_ham

Folders and files

Latest commit

History

Repository files navigation

spam_or_ham

Dataset

Results

Confusion Matrices

Chat Application

About

Topics

Resources

Stars

Watchers

Forks

Languages