naiveBayes

This is a program that classifies a corpus of Amazon reviews as positive or negative. Given a set of training data and testing data, I trained a Naïve Bayes Classifier on the training data and then applied the classifier to the test data.

PREPROCESSING:

Before building the Naïve Bayes Classifier, the data from the training.txt and testing.txt files has to be cleaned and organized into a data structure. I chose to represent the text with a bag-of-words model. Since each document in the text is classified as positive or negative, I created a “bag of words” for the positive class and one for the negative class. Each “bag of words” has a hash map as its underlying data structure. Parsing through each document, every word is put into its class’s bag, with the String representation of the word serving as the key and an integer counting the number of times that word has been put into the bag as the value. A single stop word, “a”, was removed in hopes of improving the accuracy of the model, since it is used frequently and contributes little to the sentiment of a document. Therefore, if the word “a” is parsed in a document, it is skipped and not added to the hash map. A count of the number of positive documents read and the number of negative documents read is also kept so that the overall class probabilities can be computed later.
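A minimal sketch of this preprocessing step is shown below, assuming a C++ implementation (the README does not name the language) and assuming each line of training.txt is one document that starts with its class label; the label convention and all identifiers here are illustrative, not the program’s actual ones.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main() {
    // One bag of words per class, backed by a hash map: word -> count.
    std::unordered_map<std::string, int> posBag, negBag;
    int posDocs = 0, negDocs = 0;

    std::ifstream in("training.txt");
    std::string line;
    while (std::getline(in, line)) {        // assumed: one document per line
        std::istringstream doc(line);
        std::string label, word;
        doc >> label;                        // assumed: the line starts with its class label
        bool positive = (label == "1");      // assumed label convention
        (positive ? posDocs : negDocs)++;    // track class counts for P(class) later

        while (doc >> word) {
            if (word == "a") continue;                   // single stop word removed
            ++(positive ? posBag : negBag)[word];        // insert or increment in O(1)
        }
    }
    std::cout << posDocs << " positive docs, " << negDocs << " negative docs\n";
}
```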

ARCHITECTURE:

The underlying data structure for each “bag of words” is a hash map. By reading values from the hash maps, all the probabilities can be found with simple loops. A hash map was chosen over a multiset or a vector because of a hash map’s O(1) read and write time for an arbitrary key. When a word from a training document is already in the hash map, the hash function computes the string’s bucket, so the entry can be accessed immediately and its value updated. If the string is not in the hash map yet, nothing is found at that key and the new value is simply written in. The hash map’s fast reads are also useful when applying the classifier to the testing set, since the number of times a word from a test document appears in a bag of words can be read in O(1) by going to the key and returning its value.
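As a small illustration of that O(1) read at test time (again assuming C++ and hypothetical names), an unseen word simply is not found in the bag and its count is treated as zero:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// O(1) lookup of a test word's training count; a word never seen in this
// class's bag is not found and counts as zero.
int countOf(const std::unordered_map<std::string, int>& bag, const std::string& word) {
    auto it = bag.find(word);
    return it == bag.end() ? 0 : it->second;
}

int main() {
    std::unordered_map<std::string, int> posBag = {{"great", 3}, {"fun", 2}};
    std::cout << countOf(posBag, "great") << "\n";   // 3
    std::cout << countOf(posBag, "boring") << "\n";  // 0
}
```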

MODEL BUILDING:

The Naïve Bayes Classifier assumes that, given a class, each feature is independent of the others. By Bayes’ rule, the probability of a class given a set of features equals the probability of the features given that class, times the probability of the class, divided by the probability of those features. To classify a set of features (a document), we can therefore simply compare which class is more probable given the features. Since the probability of the features appears in the denominator of both comparisons, we can disregard it. Ultimately, we classify a document as positive or negative by taking the maximum over classes of P(class) multiplied by the product of P(x_i | class) over all words x_i in the document.

We train this model by finding these probabilities from training.txt. The probability that a document is positive is the number of positive lines parsed during the bag-of-words data entry divided by the total lines parsed, and the probability that a document is negative is the number of negative lines parsed divided by the total lines parsed. The probability of each word given a class is the number of times the word appeared in that class’s bag divided by the total number of words in that bag.

To account for a word in a test document that never occurs in the bags built from training.txt, we apply smoothing: we assume the word occurs at least once by adding one to both the numerator and the denominator of the conditional probability, which yields an insignificant but nonzero probability instead of a zero that would cancel out the rest of the word probabilities. Finally, to avoid floating point underflow when many small conditional probabilities are multiplied together, we take the log of all the probabilities and sum them before finding the argmax between P(positive | x) and P(negative | x).
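The sketch below shows this decision rule in C++ (assumed language), using the smoothing exactly as described above (one added to both numerator and denominator) and summing logs instead of multiplying probabilities. The function and variable names, and the tiny hard-coded bags in main, are illustrative only.

```cpp
#include <cmath>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Score for one class: log P(class) + sum over words of log P(word | class).
double logScore(const std::unordered_map<std::string, int>& bag,
                long totalWordsInBag, int classDocs, int totalDocs,
                const std::string& doc) {
    double score = std::log(static_cast<double>(classDocs) / totalDocs); // log P(class)
    std::istringstream words(doc);
    std::string w;
    while (words >> w) {
        auto it = bag.find(w);
        int count = (it == bag.end()) ? 0 : it->second;
        // Smoothing as described above: add one to numerator and denominator,
        // so an unseen word contributes a small but nonzero probability.
        score += std::log((count + 1.0) / (totalWordsInBag + 1.0));
    }
    return score;
}

int main() {
    std::unordered_map<std::string, int> posBag = {{"great", 3}, {"fun", 2}};
    std::unordered_map<std::string, int> negBag = {{"boring", 4}, {"refund", 1}};
    std::string doc = "great fun game";
    // argmax between the positive and negative log scores classifies the document.
    bool positive = logScore(posBag, 5, 10, 20, doc) >= logScore(negBag, 5, 10, 20, doc);
    std::cout << (positive ? "positive" : "negative") << "\n";
}
```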

RESULTS:

Applying the classifier trained on training.txt back onto training.txt yields an accuracy of 89.4% correctly classified documents; training and building the model takes 2 seconds on my machine. Applying the classifier to testing.txt yields an accuracy of 85.4%; running it on the testing set also takes 2 seconds on my machine. The top 10 most predictive terms were found by taking the words with the largest difference between the probability of the word appearing given the positive class and the probability of it appearing given the negative class, meaning the word’s presence in a document is a strong indicator of one class over the other. They are: “excelente”, “sigil”, “refund”, “excelent”, “emerald”, “rareware”, “garbage”, “bait”, “enforced”, “piracy”.
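One possible reading of that ranking is sketched below (C++ assumed, hypothetical names); it scores each word by the gap between its two conditional probabilities and keeps the k largest. Whether the difference is taken as an absolute value is an assumption on my part, since the list above mixes positive and negative terms.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Rank words by the gap between P(word | positive) and P(word | negative).
std::vector<std::string> topPredictive(
        const std::unordered_map<std::string, int>& posBag, long posWords,
        const std::unordered_map<std::string, int>& negBag, long negWords,
        std::size_t k) {
    std::unordered_map<std::string, double> diff;  // P(w|pos) - P(w|neg)
    for (const auto& [w, c] : posBag) diff[w] += static_cast<double>(c) / posWords;
    for (const auto& [w, c] : negBag) diff[w] -= static_cast<double>(c) / negWords;

    std::vector<std::pair<std::string, double>> ranked(diff.begin(), diff.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        return std::fabs(a.second) > std::fabs(b.second);  // largest gap first
    });

    std::vector<std::string> top;
    for (std::size_t i = 0; i < k && i < ranked.size(); ++i) top.push_back(ranked[i].first);
    return top;
}
```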

CHALLENGES:

I faced two big challenges in this machine problem. First, I didn’t understand what N represented in the denominator of the conditional probabilities. I initially used the number of documents of the respective class as the denominator of each conditional probability. I then realized that the probability of a feature given a class should be the number of times that word appeared in that class divided by the total number of words in that class, since the class is given and we know which class we are assuming the word came from. My second challenge was getting my accuracy from 0.846 past 0.850. I ultimately added a single stop-word removal for “a”, which bumped my score to 0.854.

WEAKNESSES:

The Naïve Bayes Classifier has to assume that each word is independent of every other word given a class. This is not necessarily true: seeing one word changes the probability of another word appearing in the document, since English often strings words together as phrases. One way to account for this would be to use a bigram or trigram model instead of individual words, storing phrases of two or more consecutive words in the “bag”.
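A bigram variant of the bag-building step might look like the hypothetical sketch below (C++ assumed); adjacent word pairs become the keys, so common phrases capture some of the dependence between neighboring words.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Store adjacent word pairs ("prev cur") in the bag instead of single words.
void addBigrams(std::unordered_map<std::string, int>& bag, const std::string& doc) {
    std::istringstream words(doc);
    std::string prev, cur;
    if (!(words >> prev)) return;          // empty document: nothing to add
    while (words >> cur) {
        ++bag[prev + " " + cur];           // the pair becomes a single hash map key
        prev = cur;
    }
}
```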