---
editor_options:
markdown:
wrap: 72
---
# Spam filter
## Naive Bayes: Spam or Ham?
```{julia chap_4_libraries, cache = FALSE, results = FALSE, echo = FALSE}
cd("./04_naive_bayes/")
import Pkg
Pkg.activate(".")
using CSV
using DataFrames
using Distributions
using TextAnalysis
using Languages
using MLDataUtils
using Plots
using Random
Random.seed!(123)
```
Nobody likes spam emails. How can Bayes help? In this chapter, we'll
keep expanding our data science knowledge with a practical example: a
simple yet effective way of using Bayesian probability to create a spam
filter from scratch. The filter will examine emails and classify them
as either spam or ham (the word for non-spam emails) based on their
content.

What we will be implementing here is a *supervised learning model*, in
other words, a classification model that has been trained on previously
classified data. Think of it as a machine to which you can give some
input, like an email, and which will assign some label to that input,
like spam or ham. This machine has a lot of tiny knobs, and based on
their particular configuration it will output some label for each
input. Supervised learning involves iteratively finding the right
configuration of these knobs by letting the machine make a guess on
some pre-classified data, checking whether the guess matches the true
label, and, if not, tuning the knobs in some controlled way. The way
our machine makes predictions is based on the underlying mathematical
model. For a spam filter, a *naive Bayes* approach has proven to be
effective, and you will have the opportunity to verify that yourself at
the end of the chapter. In a naive Bayes model, Bayes' theorem is the
main tool for classification, and it is *naive* because we make some
very simplistic assumptions about the data we are analyzing. This will
become clearer once we dive into the implementation.
## The Training Data
For the Bayesian spam filter to work correctly, we need to feed it some
good training data. In this context, that means having a large enough
corpus of emails that have been pre-classified as spam or ham. The
emails should be collected from a sufficiently heterogeneous group of
people. After all, spam is a somewhat subjective category: one person's
spam may be another person's ham. The proportion of spam vs. ham in our
data should also be somewhat representative of the real proportion of
emails we receive.
Fortunately, there are a lot of very good datasets available online.
We'll use the "Email Spam Classification Dataset CSV" from
[Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv),
a website where data science enthusiasts and practitioners publish
datasets, participate in competitions, and share their knowledge. The
dataset's online description helps us make sense of its contents:
> The .csv file contains 5,172 rows, one row for each email. There are
> 3,002 columns. The first column indicates Email name. The name has
> been set with numbers and not recipients' name to protect privacy. The
> last column has the labels for prediction: `1` for spam, `0` for not
> spam. The remaining 3,000 columns are the 3,000 most common words in
> all the emails, after excluding the non-alphabetical
> characters/words. For each row, the count of each word (column) in
> that email (row) is stored in the respective cells.
Let's take a look at the data. The following code snippet outputs a view
of the first and last rows of the dataset.
```{julia}
raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame)
```
As you can see, the output reports the number of rows and columns,
shows the type of each column, and lets us see a sample of the data.
## Preprocessing the Data
Before we use the data to train our filter, we need to preprocess it a
little bit. First, we should filter out very common words, such as
articles and pronouns, which will most likely add noise rather than
information to our classification algorithm.
```{julia, results = FALSE}
# The column names (excluding the email id and the label) are the words.
all_words = names(raw_df)[2:end-1]
all_words_text = join(all_words, " ")

# Wrap the words in a StringDocument and strip articles and pronouns.
document = StringDocument(all_words_text)
prepare!(document, strip_articles)
prepare!(document, strip_pronouns)
vocabulary = split(TextAnalysis.text(document))

# Keep only the remaining words and transpose the resulting matrix.
clean_words_df = raw_df[!, vocabulary]
data_matrix = Matrix(clean_words_df)'
```
In the first line, we create a variable `all_words` to store a list of
all the words present in the emails. As our dataset has a column for
each word, we do this by storing the names of every column with the
`names` function, except for the first and last columns, which hold the
email id and the spam or ham label, respectively.

The next lines filter out the common English words mentioned above,
such as articles and pronouns. For this we use two Julia packages
specially designed for working with text of any type: Languages.jl and
TextAnalysis.jl. We create a `StringDocument`, a struct provided by
TextAnalysis.jl, and use its built-in methods to remove articles and
pronouns from the list of words we created before. This is done by
calling the `prepare!` function two times, with two different flags:
`strip_articles` and `strip_pronouns`. What follows is just code to
recover our original DataFrame with only the relevant columns, i.e.,
the words that were not filtered out. A `clean_words_df` DataFrame is
created by selecting those columns only. Finally, we turn our DataFrame
into a matrix and transpose it, so that each row corresponds to a word
and each column to an email; this is the convention expected by the
tools we'll be working with: each column is one data realization.
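As a quick sanity check (a minimal sketch, assuming the code above has
run), the matrix should now have one row per kept word and one column
per email:

```julia
# Expect (number of words kept after filtering, 5172 emails).
size(data_matrix)
```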
Next, we need to divide the data in two: a training set and a testing
set. This is standard practice when working with models that learn from
data, like the one we're going to implement. We're going to train the
model on the training set, and then evaluate the model's accuracy by
having it make predictions on the testing set. In Julia, the package
MLDataUtils.jl has some nice functionalities for data manipulations like
this.
```{julia, results = FALSE}
labels = raw_df.Prediction
(x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7)
```
The function `splitobs` splits our dataset into a training set and a
testing set, while `shuffleobs` first randomizes the order of the
observations, so the split isn't biased by the original ordering of the
data. We pass the `labels` array together with the data matrix so that
each email stays aligned with its label through the shuffle and split.
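As a quick sanity check (a minimal sketch, assuming the split above has
run), roughly 70 percent of the 5,172 emails should land in the
training set:

```julia
# Number of email columns in each split, plus the matching label counts.
size(x_train, 2), size(x_test, 2)
length(y_train), length(y_test)
```

Now we can turn our attention to building the spam filter.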
## The Naive Bayes Approach
As we mentioned, what we are facing here is a *classification* problem,
and we will code a *supervised learning* algorithm from scratch to find
a solution with the help of Bayes' theorem. Specifically, we're going
to use a *naive Bayes* classifier to create our spam filter. This
method treats each email as just a collection of words, with no regard
for the order in which they appear. This means we won't take into
account semantic considerations like the particular relationship
between words and their context.
Our strategy will be to estimate a probability of an incoming email
being ham or spam and make a decision based on that. Our general
approach can be summarized as:
$P(spam|email) \propto P(email|spam)P(spam)$
$P(ham|email) \propto P(email|ham)P(ham)$
Notice we use the $\propto$ sign, meaning *proportional to*, instead of
the $=$ sign, because the denominator from Bayes' theorem is missing.
In this case, we won't need to calculate it, as it's the same for both
probabilities and all we're going to care about is comparing the two.
In this naive approach, where semantics aren't taken into account and
each email is just a collection of words, the conditional probability
$P(email|spam)$ means the probability that a given email can be
generated with the collection of words that appear in the spam category
of our data. Let's take a quick example. Imagine for a moment that our
training set of emails consists just of these three emails, all labeled
as spam:
- Email 1: 'Are you interested in buying my product?'
- Email 2: 'Congratulations! You've won \$1000!'
- Email 3: 'Check out this product!'
Also imagine we receive a new, unclassified email and we want to
discover $P(email|spam)$. The new email looks like this:
- New email: 'Apply and win all these products!'
The new email contains the words *win* and *product*, which are rather
common in our example's training data. We would therefore expect
$P(email|spam)$, the probability of the new email being generated by the
words encountered in the training spam email set, to be relatively high.
(The word *win* appears in the form *won* in the training set, but
that's OK. The standard linguistic technique of *lemmatization* groups
together any related forms of a word and treats them as the same word.)
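Our dataset already comes reduced to word counts, so we don't apply
this step ourselves, but as a hedged illustration, TextAnalysis.jl
offers *stemming*, a rougher cousin of lemmatization that chops related
word forms down to a common stem. A minimal sketch, assuming a
TextAnalysis version that exports `stem!`:

```julia
using TextAnalysis

# stem!() applies a Porter-style stemmer in place, so related forms
# such as "winning" and "wins" collapse onto the same stem ("win").
doc = StringDocument("winning wins congratulations")
stem!(doc)
TextAnalysis.text(doc)
```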
Mathematically, the way to calculate $P(email|spam)$ is to take each
word in our target email, calculate the probability of it appearing in
spam emails based on our training set, and multiply those probabilities
together.
$P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$
We use a similar calculation to determine $P(email|ham)$, the
probability of the new email being generated by the words encountered in
the training ham email set:
$P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$
The multiplication of each of the probabilities associated with a
particular word here stems from the naive assumption that all the words
in the email are statistically independent. In reality, this assumption
isn't necessarily true. In fact, it's most likely false. Words in a
language are never independent from one another, but this simple
assumption seems to be enough for the level of complexity our problem
requires.
The probability of a given word $word_i$ being in a given category is
calculated like so:
$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$
$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$
These formulas tell us exactly what we have to calculate from our data.
We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each
word, meaning the number of times that $word_i$ is used in the spam and
ham categories, respectively. $N_{spam}$ and $N_{ham}$ are the total
number of words used in the spam and ham categories (including all word
repetitions), and $N_{vocabulary}$ is the total number of unique words
in the dataset. The variable $\alpha$ is a smoothing parameter that
prevents the probability of a given word being in a given category from
going down to zero. If a given word hasn't appeared in the spam category
in our training dataset, for example, we don't want to assign it zero
probability of appearing in new spam emails.
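To make the role of $\alpha$ concrete, here is a small numeric sketch
with made-up counts (none of these numbers come from our dataset):

```julia
# Hypothetical counts, for illustration only.
N_word_spam = 3       # times the word appears in spam training emails
N_spam = 1000         # total words in the spam category
N_vocabulary = 3000   # unique words in the dataset
α = 1

p_seen   = (N_word_spam + α) / (N_spam + α * N_vocabulary)  # (3 + 1) / 4000 = 0.001
p_unseen = (0 + α) / (N_spam + α * N_vocabulary)            # (0 + 1) / 4000 = 0.00025
```

The unseen word is penalized but never zeroed out, which is exactly
what the filter needs when a new email contains vocabulary our training
set lacks.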
As all of this information will be specific to our dataset, a clever way
to aggregate it is to use a Julia *struct*, with attributes for the
pieces of data we'll need to access over and over during the prediction
process. Here's the implementation:
```{julia, results = FALSE}
mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Array{String}
    BayesSpamFilter() = new()
end
```
The relevant attributes of the struct are `words_count_ham` and
`words_count_spam`, two dictionaries containing the frequency of
appearance of each word in the ham and spam datasets; `N_ham` and
`N_spam`, the total number of words appearing in each category; and
`vocabulary`, an array of all the unique words in the dataset.
The line `BayesSpamFilter() = new()` is the constructor of this struct.
Because the constructor is empty, all the attributes will be undefined
when we instantiate the filter. We'll have to define some functions to
fill these attributes with values that are relevant to our particular
problem. First, here's a function `words_count` that counts the
occurrences of each word in the ham or spam category.
```{julia, results = FALSE}
function words_count(word_data, vocabulary, labels, spam=0)
    count_dict = Dict{String,Int64}()
    n_emails = size(word_data)[2]
    # For each word, add up its counts over all emails with the given label.
    for (i, word) in enumerate(vocabulary)
        count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam])
    end
    return count_dict
end
```
The function `words_count` counts the occurrences of each word in the
emails of one category: ham by default (`spam=0`), or spam when we pass
`spam=1`. One of its parameters is `word_data`, the matrix we built
earlier, where each column is an email and each row is a word.
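To see what `words_count` produces, here is a toy example with made-up
data: three words, four emails, the last two labeled spam.

```julia
# Toy word-count matrix: one row per word, one column per email.
toy_data = [1 0 2 0;    # counts for "win"
            0 1 0 3;    # counts for "product"
            1 1 1 1]    # counts for "meeting"
toy_vocabulary = ["win", "product", "meeting"]
toy_labels = [0, 0, 1, 1]   # ham, ham, spam, spam

words_count(toy_data, toy_vocabulary, toy_labels, 1)
# Dict("win" => 2, "product" => 3, "meeting" => 2)
```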
Next, we'll define a `fit!` function for our spam filter struct. Notice
we're using the bang (`!`) convention here to indicate a function that
modifies its arguments in-place (in this case, the spam filter struct
itself). This function *fits* our model to the data, a typical
procedure in data science and machine learning.
```{julia, results = TRUE}
function fit!(model::BayesSpamFilter, x_train, y_train, voc)
    model.vocabulary = voc
    model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
    model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1)
    model.N_ham = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    return
end
```
What we mean by fitting the model to the data is mainly filling all the
undefined parameters in our struct with values informed by the training
data. To do this, we use the `words_count` function we defined earlier.
Notice that we're only fitting the model to the training portion of the
data, since we're reserving the testing portion to evaluate the model's
accuracy.
## Training the Model
Now it's time to instantiate our spam filter and fit the model to the
training data. With the struct and helper functions we've defined, the
process is quite straightforward.
```{julia, results = FALSE}
spam_filter = BayesSpamFilter()
fit!(spam_filter, x_train, y_train, vocabulary)
```
We create an instance of our `BayesSpamFilter` struct and pass it to
our `fit!` function along with the training data and the vocabulary. As
noted above, the testing portion of the dataset is set aside to
evaluate the model's accuracy later.
## Making Predictions
Now that we have our model, we can use it to make some spam vs. ham
predictions and assess its performance. We'll define a few more
functions to help with this process. First, we need a function
implementing the per-word probability formulas we discussed earlier.
```{julia, results = TRUE}
function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    return ham_prob, spam_prob
end
```
This function calculates $P(word_i|spam)$ and $P(word_i|ham)$ for a
given word. We'll call it for each word of an incoming email within
another function, `spam_predict`, to calculate the probability of that
email being spam or ham.
```{julia, results = FALSE}
function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    ngrams_email = ngrams(StringDocument(email))
    email_words = keys(ngrams_email)
    n_vocabulary = length(model.vocabulary)
    # Prior probabilities of each category, estimated from the word counts.
    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    spam_prior = model.N_spam / (model.N_ham + model.N_spam)
    # Keep only the tol most frequent words to avoid vanishingly small products.
    if length(email_words) > tol
        word_freq = values(ngrams_email)
        sort_idx = sortperm(collect(word_freq), rev=true)
        email_words = collect(email_words)[sort_idx][1:tol]
    end
    email_ham_probability = BigFloat(1)
    email_spam_probability = BigFloat(1)
    # Multiply the per-word probabilities for every email word in our vocabulary.
    for word in intersect(email_words, model.vocabulary)
        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
        email_ham_probability *= word_ham_prob
        email_spam_probability *= word_spam_prob
    end
    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
end
```
This function takes as input a new email that we want to classify as
spam or ham, our fitted model, an $α$ value (which we've already
discussed), and a tolerance value `tol`. The latter sets the maximum
number of unique words in an email that we'll look at. We saw that the
calculations for $P(email|spam)$ and $P(email|ham)$ require the
multiplication of each $P(word_i|spam)$ and $P(word_i|ham)$ term. When
emails consist of a large number of words, this multiplication may lead
to very small probabilities, up to the point that the computer
interprets those probabilities as zero. This isn't desirable; we need
values of $P(email|spam)$ and $P(email|ham)$ that are larger than zero
in order to multiply them by $P(spam)$ and $P(ham)$, respectively, and
compare these values to make a prediction. To avoid probabilities of
zero, we'll only consider up to the `tol` most frequently used words in
the email.
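Our implementation dodges this problem with `BigFloat` accumulators
plus the `tol` cutoff. A common alternative, shown here only as a
sketch and not what the code above does, is to work in log space, where
the long product becomes a sum:

```julia
# Summing logs avoids the underflow that plagues long products:
#   log P(email|spam) = Σᵢ log P(wordᵢ|spam)
word_probs = [1e-5, 3e-4, 2e-6]    # hypothetical per-word probabilities
direct = prod(word_probs)          # underflows to 0.0 once emails get long enough
logsum = sum(log.(word_probs))     # stays comfortably inside Float64's range
exp(logsum) ≈ direct               # true: same quantity, safer arithmetic
```

Since the logarithm is monotonic, comparing log probabilities yields
the same classification as comparing the probabilities themselves.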
Finally, we arrive at the point of actually testing our model. We
create another function to manage the process. This function classifies
each email as ham (represented by the number 0) or spam (represented by
the number 1):
```{julia, results = FALSE}
function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    N = length(y_test)
    predictions = Array{Int64,1}(undef, N)
    for i in 1:N
        # Rebuild the email text by repeating each vocabulary word as many
        # times as it appears in column i of the word-count matrix.
        email = join(repeat(string(word, " "), count) for (word, count) in zip(model.vocabulary, x_test[:, i]))
        pham, pspam = spam_predict(email, model, α, tol)
        pred = argmax([pham, pspam]) - 1
        predictions[i] = pred
    end
    predictions
end
```
This function takes in the testing portion of the data and our trained
model. We call our `spam_predict` function for each email in the
testing data and use the index of the larger of the two returned
probability values (`argmax`), minus one, as the prediction (`pred`):
0 for ham, 1 for spam. We return the predictions as an array of these
values. Here we call the function to make predictions about the test
data:
```{julia, results = FALSE}
predictions = get_predictions(x_test, y_test, spam_filter, 1)
```
Let's take a look at the predicted classifications of just the first
five emails in the test data.
```{julia, results = TRUE}
predictions[1:5]
```
Of the first five emails, one (the third) was classified as spam, and
the rest were classified as ham.
## Evaluating the Accuracy
Looking at the predictions themselves is pretty meaningless; what we
really want to know is the model's accuracy. We'll define another
function to calculate this.
```{julia, results = FALSE}
function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    accuracy = correct / N
    accuracy
end
```
This function compares the predicted classifications with the actual
classifications of the test data, counts the number of correct
predictions, and divides this number by the total number of test emails,
giving us an accuracy measurement. Here we call the function:
```{julia, results = TRUE}
spam_filter_accuracy(predictions, y_test)
```
The output indicates our model is about 95 percent accurate. It appears
our model is performing very well! Such a high accuracy rate is quite
astonishing for a model so naive and simple. In fact, it may be a
little too good to be true, because we have to take one more thing into
account: our model classifies emails as spam or ham, but the number of
ham emails in our dataset is considerably higher than the number of
spam ones. Let's see the percentages:
```{julia, results = TRUE}
sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
```
To calculate the proportion of spam emails in the dataset, we sum over
the `Prediction` column, remembering that it consists only of 0s and
1s, and then divide by the total number of emails. This type of
classification problem, where there's an unequal distribution of
classes in the dataset, is called *imbalanced*. With imbalanced data, a
better way to see how the model is performing is to construct a
*confusion matrix*, an $N \times N$ matrix, where $N$ is the number of
target classes (in our case, 2, for spam and ham). The matrix compares
the actual values for each class with those predicted by the model.
Here's a function that builds a confusion matrix for our spam filter:
```{julia, results = FALSE}
function spam_filter_confusion_matrix(y_test, predictions)
    # A 2x2 matrix is instantiated with zeros
    confusion_matrix = zeros((2, 2))
    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))

    # Now we convert the confusion matrix into a DataFrame
    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
    return confusion_df
end
```
Now let's call our function to build the confusion matrix for our model.
```{julia, results = TRUE}
confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
```
Row 1 of the confusion matrix shows us all the emails our model
classified as ham; 1,056 of those classifications were correct and 36
were incorrect. Similarly, the `spam_mail` column shows us the
classifications of all the actual spam emails; 36 were misidentified as
ham, and 427 were correctly identified as spam.
Now that we have the confusion matrix, we can calculate the accuracy of
the model segmented by category.
```{julia, results = TRUE}
ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
```
```{julia, results = TRUE}
spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
```
With these values, we now have a more fine-grained measure of our
model's accuracy. We now know that our spam filter doesn't achieve the
same degree of accuracy for spam emails as for ham emails: as a
consequence of the imbalance in our data, ham emails are classified
correctly more often than spam emails. Still, with both percentages
above 90, the accuracy is pretty good for a model so simple and naive.
Models like these can be used as a baseline for building more complex
ones on top of them.
## Summary
In this chapter, we've used a naive Bayes approach to build a simple
email spam filter. We walked through the whole process of training,
testing, and evaluating a learning model. First, we obtained a dataset
of emails already classified as spam or ham and preprocessed the data.
Then we considered the theoretical framework for our naive analysis.
Using Bayes' theorem on the available data, we assigned to each word in
the dataset a probability of appearing in a spam or a ham email. The
probability of a new email being classified as spam is then
proportional to the product of the probabilities of its constituent
words. We
defined a Julia `struct` for the spam filter object and created
functions to fit the spam filter object to the data. Finally, we made
predictions on new data and evaluated our model's performance by
calculating the accuracy and making a confusion matrix.
## Appendix - A little more about alpha
As we have seen, to calculate the probability of an email being spam,
we use

$$P(email|spam) = \prod_{i=1}^{n}P(word_i|spam) = P(word_1|spam)P(word_2|spam) \ldots P(word_{np}|spam)$$

where $P(word_{np}|spam)$ stands for the probability of a word that is
not present in our dataset. What probability should be assigned to this
word? One way to handle it could be to simply ignore that term in the
multiplication, in other words, to assign $P(word_{np}|spam)=1$.
Without thinking about it too much, we can conclude that this doesn't
make any sense, since it would mean that the probability of finding
that word in a spam (or ham) email is equal to 1. A more logically
consistent approach would be to assign zero probability to that word.
But there is a problem: with $P(word_{np}|spam)=0$, we can quickly see
that

$$P(word_1|spam)P(word_2|spam) \ldots P(word_{np}|spam) = 0$$

This is the motivation for introducing the smoothing parameter $\alpha$
into our equation. In a real-world scenario, we should expect words
that are not present in our training set to appear, and although it
makes sense that they don't get a high probability, it can't be 0. When
such a word appears, the probability assigned to it will simply be

$$P(word_{np}|spam) = \frac{N_{word_{np}|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{0 + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{\alpha}{N_{spam} + \alpha N_{vocabulary}}$$

In summary, $\alpha$ is just a smoothing parameter that keeps the
probability of finding a word that is not in our dataset from going
down to 0. Since we want to keep the probability for these words low,
a small value such as $\alpha=1$ makes sense.
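As a closing numeric sketch with made-up totals (not values from our
dataset), here is how the probability assigned to an unseen word
behaves for a few values of $\alpha$: it shrinks as $\alpha$ shrinks,
but never reaches zero.

```julia
# Hypothetical totals, for illustration only.
N_spam = 1000
N_vocabulary = 3000

# Probability assigned to a word never seen in spam training emails.
for α in (0.1, 1.0, 10.0)
    p_unseen = α / (N_spam + α * N_vocabulary)
    println("α = $α  →  P(word_np|spam) = $p_unseen")
end
```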