This project is a Machine Learning application developed using Natural Language Processing (NLP) and Machine Learning techniques to classify emails as Spam or Not Spam (Ham). The application provides an interactive web interface using Streamlit for real-time email classification.
- Overview
- Technologies Used
- Features
- Setup and Installation
- How to Use
- Project Demo
- Directory Structure
- Future Enhancements
- Challenges Faced
- Acknowledgements
- License
This project demonstrates how NLP techniques such as text vectorization (using CountVectorizer
) and machine learning models can be used to classify emails as spam or ham. The application is deployed locally using Streamlit, allowing users to input email text and classify it on the fly.
- Programming Language: Python 3.8+
- Libraries:
Streamlit
: For creating the web interface.scikit-learn
: For text vectorization and machine learning model.pickle
: For saving and loading pre-trained models and vectorizers.pandas
: For data manipulation and analysis.
- Machine Learning Model: Trained using
CountVectorizer
andMultinomial Naive Bayes
.
- Email Classification: Enter email text to check whether it is spam or ham.
- Interactive Web Interface: Built using Streamlit for user-friendly interaction.
- Pre-trained Model: Leverages a pre-trained model for fast predictions.
- Real-time Feedback: Displays results instantly.
Follow these steps to run the project locally:
git clone https://github.com/AbdulSarban/P3-Spam-Email-Classification-Using-NLP-and-Machine-Learning.git
cd P3-Spam-Email-Classification-Using-NLP-and-Machine-Learning
Windows
python -m venv venv
venv\Scripts\activate
macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
To train the model yourself, run:
python train_model.py
Start the Streamlit app:
streamlit run spamDetector.py
Visit the URL provided in the terminal (e.g., http://localhost:8501) to interact with the application.
- Launch the Streamlit app.
- Enter email text in the provided input box.
- Click the "Classify" button to classify the email.
- View the result displayed as "Spam" or "Ham."
- Tokenization and removal of punctuation.
- Conversion to lowercase.
- Stopword removal.
- Used
CountVectorizer
for bag-of-words representation.
- Trained using
Multinomial Naive Bayes
for high performance on text classification tasks.
- Achieved an accuracy of ~98% on the test dataset.
P3-Spam-Email-Classification-Using-NLP-and-Machine-Learning/
│
├── spamDetector.py # Main application script
├── train_model.py # Script for training the model
├── spam.pkl # Pre-trained machine learning model
├── vectorizer.pkl # Pre-trained CountVectorizer
├── requirements.txt # List of dependencies
├── README.md # Project documentation
├── screenshot.png # Screenshot of the web interface
└── spam.csv # Dataset used for training
- Deployment: Host the app on platforms like Heroku or AWS for broader accessibility.
- Multilingual Support: Extend the classification to handle multiple languages.
- Advanced Models: Incorporate deep learning models for improved accuracy.
- Explainability: Add visualizations to explain model decisions.
- User Authentication: Implement user logins to save classification history.
- Model Overfitting: Resolved using cross-validation and hyperparameter tuning.
- Text Preprocessing: Dealt with removing noisy data and handling special characters.
- Real-time Predictions: Ensured low latency while maintaining accuracy.