This project aims to classify text-based reviews as either positive or negative using machine learning models. It includes exploratory data analysis, data preprocessing, feature engineering, model training, and evaluation.
- Python 3.9+
- Virtual environment (for example, a Conda environment with Python 3.9)
- Required libraries (specified in requirements.txt)
This project includes an Exploratory Data Analysis (EDA) Jupyter/Colab notebook located at EDA/EDA.ipynb. The EDA notebook generates a wordcloud.jpg and creates the following folders for models and metrics:
- models: Contains trained machine learning models.
- metric_pics: Contains metrics visualizations.
The project's root directory contains the following main components:
- train.py: Trains the models and generates metrics visuals, saving them to their respective folders.
- inference.py: Generates predictions, using the XGBoost classifier model by default, and saves them to a file named test_labels_pred.csv with id and sentiment columns.
The utils folder includes two Python files:
- utils_train.py: Contains various functions used in train.py.
- utils_inference.py: Contains functions used in inference.py, ensuring no additional library uploads are required.
Any additional files placed next to the inference script will be loaded automatically.
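As a rough illustration of the input/output contract described for inference.py, the sketch below reads a reviews CSV and writes a predictions CSV with id and sentiment columns. It is not the project's actual code: the input column names (`id`, `review`), the function names, and the keyword-based `predict_sentiment` placeholder are all assumptions made for the example, and the placeholder stands in for the trained XGBoost classifier.

```python
import csv
import sys


def predict_sentiment(text):
    # Hypothetical placeholder: a keyword rule standing in for the trained model.
    # The real script is described as using an XGBoost classifier instead.
    return "positive" if "good" in text.lower() else "negative"


def run_inference(input_path, output_path, text_column="review"):
    """Read reviews from input_path and write id/sentiment rows to output_path.

    Assumes the input CSV has an `id` column and a text column named
    `text_column`; both names are assumptions, not taken from the project.
    """
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "sentiment"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "id": row["id"],
                "sentiment": predict_sentiment(row[text_column]),
            })


# Mirrors the documented call shape: <script> <input csv> <output csv>
if __name__ == "__main__" and len(sys.argv) == 3:
    run_inference(sys.argv[1], sys.argv[2])
```

Taking both file names as positional arguments matches the way the script is invoked in this README, so the same command works for any input and output path.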
- Drag and drop test_reviews.csv into the root folder.
- Run the following command from your virtual environment:
pip install -r requirements.txt
- Execute the inference script with the following command (Linux and macOS):
python3 inference.py test_reviews.csv test_labels_pred.csv
or (for Windows)
python inference.py test_reviews.csv test_labels_pred.csv
- Take your generated test_labels_pred.csv file from the root folder.
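To confirm that the generated file has the expected shape before using it, a small check along these lines may help. This is a minimal sketch using only the standard library; the helper name `check_predictions` is hypothetical and not part of the project, and it only verifies that the id and sentiment columns mentioned above are present.

```python
import csv


def check_predictions(path, expected_columns=("id", "sentiment")):
    """Return the number of prediction rows, or raise if a column is missing."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file, so fall back to an empty list.
        present = set(reader.fieldnames or [])
        missing = [c for c in expected_columns if c not in present]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return sum(1 for _ in reader)
```

For example, `check_predictions("test_labels_pred.csv")` returns the row count, which can be compared against the number of reviews in test_reviews.csv.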