This repository applies different machine learning techniques to the Amazon Fine Food Reviews dataset.
Below are the data reduction, classification, and regression techniques applied to the dataset.
- KNN
- Naive Bayes
- Logistic Regression
- Support Vector Machine
- Decision Tree
- Random Forest
- SGD
- t-SNE
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- Timespan: Oct 1999 - Oct 2012

Number of attributes/columns in the data: 10
- Id
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- ProfileName
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
- Objective:
- Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).
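
The objective above amounts to mapping the Score column to a binary label. A minimal sketch follows; the function name `label_from_score` is hypothetical, and returning `None` for a score of 3 (so that such reviews can be dropped) is an assumption, since the objective only defines ratings 1, 2, 4, and 5.

```python
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a 1-5 star Score to a sentiment label."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    # Assumption: a score of 3 is neither positive nor negative,
    # so it gets no label and can be filtered out by the caller.
    return None


scores = [5, 1, 3, 4, 2]
labels = [label_from_score(s) for s in scores]
```
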
- Begin by removing the HTML tags.
- Remove punctuation and a limited set of special characters such as , . and #.
- Check that the word consists only of English letters and is not alphanumeric.
- Check that the word is longer than 2 characters (no 2-letter adjectives were found when this was researched).
- Convert the word to lowercase.
- Remove stopwords.
- Finally, apply Snowball stemming to the word (it was observed to perform better than Porter stemming).
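
The cleaning steps above could be sketched roughly as follows, using only the standard library. The names `clean_review`, `STOPWORDS`, `TAG_RE`, and `PUNCT_RE` are hypothetical, the stopword set is a tiny illustrative stand-in for a full list, and the stemmer is passed in as a plain callable that defaults to doing nothing. This is not necessarily how the notebooks in this repository implement it.

```python
import re
from typing import Callable, List

# Assumption: a tiny illustrative stopword set, not a complete list.
STOPWORDS = {"the", "and", "this", "that", "was", "are", "for", "with", "have", "you"}

TAG_RE = re.compile(r"<[^>]+>")            # matches HTML tags such as <br />
PUNCT_RE = re.compile(r"[^A-Za-z0-9\s]")   # matches punctuation and special characters


def clean_review(text: str, stem: Callable[[str], str] = lambda w: w) -> List[str]:
    """Return the cleaned tokens of one review."""
    text = TAG_RE.sub(" ", text)     # 1. remove HTML tags
    text = PUNCT_RE.sub(" ", text)   # 2. remove punctuation / special characters
    tokens = []
    for word in text.split():
        if not word.isalpha():       # 3. letters only, skip alphanumeric tokens
            continue
        if len(word) <= 2:           # 4. keep words longer than 2 characters
            continue
        word = word.lower()          # 5. lowercase
        if word in STOPWORDS:        # 6. drop stopwords
            continue
        tokens.append(stem(word))    # 7. stem (identity unless a stemmer is given)
    return tokens


tokens = clean_review("<br />This coffee is GREAT, I bought 2 bags!!")
```

If NLTK is installed, `SnowballStemmer("english").stem` from `nltk.stem` can be passed as the `stem` argument to get Snowball stemming.
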
- Bag of Words
- Bi-grams and n-grams
- TF-IDF
- Word2Vec
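
To illustrate the first three featurizations, here is a minimal standard-library sketch. The helper names (`ngrams`, `bag_of_words`, `tf_idf`) are hypothetical, and the TF-IDF weighting used here (relative term frequency times the natural log of N / document frequency) is one common variant chosen as an assumption; library implementations typically add smoothing and normalization. Word2Vec requires a trained embedding model and is not sketched here.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens: List[str], max_n: int = 2) -> Counter:
    """Count unigrams up to max_n-grams in one document."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Unsmoothed TF-IDF over unigrams: (count / doc length) * ln(N / df)."""
    bows = [bag_of_words(d, max_n=1) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    weights = []
    for bow in bows:
        total = sum(bow.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()}
        )
    return weights


docs = [["great", "coffee"], ["bad", "coffee"], ["great", "tea"]]
bow = bag_of_words(docs[0])   # unigram and bi-gram counts for the first review
weights = tf_idf(docs)        # one {term: weight} dict per review
```

In this toy corpus, "tea" appears in only one review, so it receives a higher weight in that review than "great", which appears in two.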