---
editor_options:
markdown:
wrap: 72
---
# Spam filter
## Naive Bayes: Spam or Ham?
```{julia chap_4_libraries, cache = FALSE, results = FALSE, echo = FALSE}
cd("./04_naive_bayes/")
import Pkg
Pkg.activate(".")
using CSV
using DataFrames
using Distributions
using TextAnalysis
using Languages
using MLDataUtils
using Plots
using Random
Random.seed!(123)
```
Nobody likes spam emails. How can Bayes help? In this chapter, we'll
keep expanding our data science knowledge with a practical example: a
simple yet effective way of using Bayesian probability to create a spam
filter from scratch. The filter will examine emails and classify them
as either spam or ham (the word for non-spam emails) based on their
content.

What we will be implementing here is a *supervised learning model*, in
other words, a classification model that has been trained on previously
classified data. Think of it as a machine to which you can give some
input, like an email, and which will assign some label to that input,
like spam or ham. This machine has a lot of tiny knobs, and based on
their particular configuration it will output some label for each
input. Supervised learning involves iteratively finding the right
configuration of these knobs by letting the machine make a guess on
some pre-classified data, checking whether the guess matches the true
label, and, if not, tuning the knobs in some controlled way. The way
our machine makes predictions is based on the underlying mathematical
model. For a spam filter, a *naive Bayes* approach has proven to be
effective, and you will have the opportunity to verify that yourself at
the end of the chapter. In a naive Bayes model, Bayes' theorem is the
main tool for classification, and it is *naive* because we make some
very simplistic assumptions about the data we are analyzing. This will
become clearer once we dive into the implementation.
## The Training Data
For the Bayesian spam filter to work correctly, we need to feed it some
good training data. In this context, that means having a large enough
corpus of emails that have been pre-classified as spam or ham. The
emails should be collected from a sufficiently heterogeneous group of
people. After all, spam is a somewhat subjective category: one person's
spam may be another person's ham. The proportion of spam vs. ham in our
data should also be somewhat representative of the real proportion of
emails we receive.
Fortunately, there are a lot of very good datasets available online.
We'll use the "Email Spam Classification Dataset CSV" from
[Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv),
a website where data science enthusiasts and practitioners publish
datasets, participate in competitions, and share their knowledge. The
dataset's online description helps us make sense of its contents:
> The .csv file contains 5,172 rows, one row for each email. There are
> 3,002 columns. The first column indicates Email name. The name has
> been set with numbers and not recipients' name to protect privacy. The
> last column has the labels for prediction: `1` for spam, `0` for not
> spam. The remaining 3,000 columns are the 3,000 most common words in
> all the emails, after excluding the non-alphabetical
> characters/words. For each row, the count of each word (column) in
> that email (row) is stored in the respective cells.
Let's take a look at the data. The following code snippet outputs a view
of the first and last rows of the dataset.
```{julia}
raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame)
```
As you can see, the output reports the number of rows and columns,
shows the type of each column, and lets us see a sample of the data.
## Preprocessing the Data
Before we use the data to train our filter, we need to preprocess it a
little bit. First, we should filter out very common words, such as
articles and pronouns, which will most likely add noise rather than
information to our classification algorithm.
```{julia, results = FALSE}
# The column names (excluding the email id and the label) are the words.
all_words = names(raw_df)[2:end-1]
all_words_text = join(all_words, " ")

# Wrap the words in a StringDocument and strip articles and pronouns.
document = StringDocument(all_words_text)
prepare!(document, strip_articles)
prepare!(document, strip_pronouns)
vocabulary = split(TextAnalysis.text(document))

# Keep only the remaining words and transpose the resulting matrix.
clean_words_df = raw_df[!, vocabulary]
data_matrix = Matrix(clean_words_df)'
```
In the first line, we create a variable `all_words` to store a list of
all the words present in the emails. As our dataset has a column for
each word, we do this by storing the names of every column with the
`names` function, except for the first and last columns, which hold the
email id and the spam or ham label, respectively.

The next lines filter out the common English words mentioned above,
such as articles and pronouns. For this we use two Julia packages
specially designed for working with text of any type: Languages.jl and
TextAnalysis.jl. We create a `StringDocument`, a struct provided by
TextAnalysis.jl, and use its built-in methods to remove articles and
pronouns from the list of words we created before. This is done by
calling the `prepare!` function two times, with two different flags:
`strip_articles` and `strip_pronouns`. What follows is just code to
recover our original DataFrame with only the relevant columns, i.e.,
the words that were not filtered out. A `clean_words_df` DataFrame is
created by selecting those columns only. Finally, we turn our DataFrame
into a matrix and transpose it, so that each row corresponds to a word
and each column to an email; this is the convention expected by the
tools we'll be working with: each column is one data realization.
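As a quick sanity check (a minimal sketch, assuming the code above has
run), the matrix should now have one row per kept word and one column
per email:

```julia
# Expect (number of words kept after filtering, 5172 emails).
size(data_matrix)
```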
Next, we need to divide the data in two: a training set and a testing
set. This is standard practice when working with models that learn from
data, like the one we're going to implement. We're going to train the
model on the training set, and then evaluate the model's accuracy by
having it make predictions on the testing set. In Julia, the package
MLDataUtils.jl has some nice functionalities for data manipulations like
this.
```{julia, results = FALSE}
labels = raw_df.Prediction
(x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7)
```
The function `splitobs` splits our dataset into a training set and a
testing set, while `shuffleobs` first randomizes the order of the
observations, so the split isn't biased by the original ordering of the
data. We pass the `labels` array together with the data matrix so that
each email stays aligned with its label through the shuffle and split.
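As a quick sanity check (a minimal sketch, assuming the split above has
run), roughly 70 percent of the 5,172 emails should land in the
training set:

```julia
# Number of email columns in each split, plus the matching label counts.
size(x_train, 2), size(x_test, 2)
length(y_train), length(y_test)
```

Now we can turn our attention to building the spam filter.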
## The Naive Bayes Approach
As we mentioned, what we are facing here is a *classification* problem,
and we will code a *supervised learning* algorithm from scratch to find
a solution with the help of Bayes' theorem. Specifically, we're going
to use a *naive Bayes* classifier to create our spam filter. This
method treats each email as just a collection of words, with no regard
for the order in which they appear. This means we won't take into
account semantic considerations like the particular relationship
between words and their context.
Our strategy will be to estimate a probability of an incoming email
being ham or spam and make a decision based on that. Our general
approach can be summarized as:
$P(spam|email) \propto P(email|spam)P(spam)$
$P(ham|email) \propto P(email|ham)P(ham)$
Notice we use the $\propto$ sign, meaning *proportional to*, instead of
the $=$ sign, because the denominator from Bayes' theorem is missing.
In this case, we won't need to calculate it, as it's the same for both
probabilities and all we're going to care about is comparing the two.
In this naive approach, where semantics aren't taken into account and
each email is just a collection of words, the conditional probability
$P(email|spam)$ means the probability that a given email can be
generated with the collection of words that appear in the spam category
of our data. Let's take a quick example. Imagine for a moment that our
training set of emails consists just of these three emails, all labeled
as spam:
- Email 1: 'Are you interested in buying my product?'
- Email 2: 'Congratulations! You've won \$1000!'
- Email 3: 'Check out this product!'
Also imagine we receive a new, unclassified email and we want to
discover $P(email|spam)$. The new email looks like this:
- New email: 'Apply and win all these products!'
The new email contains the words *win* and *product*, which are rather
common in our example's training data. We would therefore expect
$P(email|spam)$, the probability of the new email being generated by the
words encountered in the training spam email set, to be relatively high.
(The word *win* appears in the form *won* in the training set, but
that's OK. The standard linguistic technique of *lemmatization* groups
together any related forms of a word and treats them as the same word.)
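Our dataset already comes reduced to word counts, so we don't apply
this step ourselves, but as a hedged illustration, TextAnalysis.jl
offers *stemming*, a rougher cousin of lemmatization that chops related
word forms down to a common stem. A minimal sketch, assuming a
TextAnalysis version that exports `stem!`:

```julia
using TextAnalysis

# stem!() applies a Porter-style stemmer in place, so related forms
# such as "winning" and "wins" collapse onto the same stem ("win").
doc = StringDocument("winning wins congratulations")
stem!(doc)
TextAnalysis.text(doc)
```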
Mathematically, the way to calculate $P(email|spam)$ is to take each
word in our target email, calculate the probability of it appearing in
spam emails based on our training set, and multiply those probabilities
together.
$P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$
We use a similar calculation to determine $P(email|ham)$, the
probability of the new email being generated by the words encountered in
the training ham email set:
$P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$
The multiplication of each of the probabilities associated with a
particular word here stems from the naive assumption that all the words
in the email are statistically independent. In reality, this assumption
isn't necessarily true. In fact, it's most likely false. Words in a
language are never independent from one another, but this simple
assumption seems to be enough for the level of complexity our problem
requires.
The probability of a given word $word_i$ being in a given category is
calculated like so:
$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$
$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$
These formulas tell us exactly what we have to calculate from our data.
We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each
word, meaning the number of times that $word_i$ is used in the spam and
ham categories, respectively. $N_{spam}$ and $N_{ham}$ are the total
number of words used in the spam and ham categories (including all word
repetitions), and $N_{vocabulary}$ is the total number of unique words
in the dataset. The variable $\alpha$ is a smoothing parameter that
prevents the probability of a given word being in a given category from
going down to zero. If a given word hasn't appeared in the spam category
in our training dataset, for example, we don't want to assign it zero
probability of appearing in new spam emails.
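To make the role of $\alpha$ concrete, here is a small numeric sketch
with made-up counts (none of these numbers come from our dataset):

```julia
# Hypothetical counts, for illustration only.
N_word_spam = 3       # times the word appears in spam training emails
N_spam = 1000         # total words in the spam category
N_vocabulary = 3000   # unique words in the dataset
α = 1

p_seen   = (N_word_spam + α) / (N_spam + α * N_vocabulary)  # (3 + 1) / 4000 = 0.001
p_unseen = (0 + α) / (N_spam + α * N_vocabulary)            # (0 + 1) / 4000 = 0.00025
```

The unseen word is penalized but never zeroed out, which is exactly
what the filter needs when a new email contains vocabulary our training
set lacks.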
As all of this information will be specific to our dataset, a clever way
to aggregate it is to use a Julia *struct*, with attributes for the
pieces of data we'll need to access over and over during the prediction
process. Here's the implementation:
```{julia, results = FALSE}
mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Array{String}
    BayesSpamFilter() = new()
end
```
The relevant attributes of the struct are `words_count_ham` and
`words_count_spam`, two dictionaries containing the frequency of
appearance of each word in the ham and spam datasets; `N_ham` and
`N_spam`, the total number of words appearing in each category; and
`vocabulary`, an array of all the unique words in the dataset.
The line `BayesSpamFilter() = new()` is the constructor of this struct.
Because the constructor is empty, all the attributes will be undefined
when we instantiate the filter. We'll have to define some functions to
fill these attributes with values that are relevant to our particular
problem. First, here's a function `words_count` that counts the
occurrences of each word in the ham or spam category.
```{julia, results = FALSE}
function words_count(word_data, vocabulary, labels, spam=0)
    count_dict = Dict{String,Int64}()
    n_emails = size(word_data)[2]
    # For each word, add up its counts over all emails with the given label.
    for (i, word) in enumerate(vocabulary)
        count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam])
    end
    return count_dict
end
```
The function `words_count` counts the occurrences of each word in the
emails of one category: ham by default (`spam=0`), or spam when we pass
`spam=1`. One of its parameters is `word_data`, the matrix we built
earlier, where each column is an email and each row is a word.
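To see what `words_count` produces, here is a toy example with made-up
data: three words, four emails, the last two labeled spam.

```julia
# Toy word-count matrix: one row per word, one column per email.
toy_data = [1 0 2 0;    # counts for "win"
            0 1 0 3;    # counts for "product"
            1 1 1 1]    # counts for "meeting"
toy_vocabulary = ["win", "product", "meeting"]
toy_labels = [0, 0, 1, 1]   # ham, ham, spam, spam

words_count(toy_data, toy_vocabulary, toy_labels, 1)
# Dict("win" => 2, "product" => 3, "meeting" => 2)
```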
Next, we'll define a `fit!` function for our spam filter struct. Notice
we're using the bang (`!`) convention here to indicate a function that
modifies its arguments in-place (in this case, the spam filter struct
itself). This function *fits* our model to the data, a typical
procedure in data science and machine learning.
```{julia, results = TRUE}
function fit!(model::BayesSpamFilter, x_train, y_train, voc)
    model.vocabulary = voc
    model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
    model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1)
    model.N_ham = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    return
end
```
What we mean by fitting the model to the data is mainly filling all the
undefined parameters in our struct with values informed by the training
data. To do this, we use the `words_count` function we defined earlier.
Notice that we're only fitting the model to the training portion of the
data, since we're reserving the testing portion to evaluate the model's
accuracy.
## Training the Model
Now it's time to instantiate our spam filter and fit the model to the
training data. With the struct and helper functions we've defined, the
process is quite straightforward.
```{julia, results = FALSE}
spam_filter = BayesSpamFilter()
fit!(spam_filter, x_train, y_train, vocabulary)
```
We create an instance of our `BayesSpamFilter` struct and pass it to
our `fit!` function along with the training data and the vocabulary. As
noted above, the testing portion of the dataset is set aside to
evaluate the model's accuracy later.
## Making Predictions
Now that we have our model, we can use it to make some spam vs. ham
predictions and assess its performance. We'll define a few more
functions to help with this process. First, we need a function
implementing the per-word probability formulas we discussed earlier.
```{julia, results = TRUE}
function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    return ham_prob, spam_prob
end
```
This function calculates $P(word_i|spam)$ and $P(word_i|ham)$ for a
given word. We'll call it for each word of an incoming email within
another function, `spam_predict`, to calculate the probability of that
email being spam or ham.
```{julia, results = FALSE}
function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    ngrams_email = ngrams(StringDocument(email))
    email_words = keys(ngrams_email)
    n_vocabulary = length(model.vocabulary)
    # Prior probabilities of each category, estimated from the word counts.
    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    spam_prior = model.N_spam / (model.N_ham + model.N_spam)
    # Keep only the tol most frequent words to avoid vanishingly small products.
    if length(email_words) > tol
        word_freq = values(ngrams_email)
        sort_idx = sortperm(collect(word_freq), rev=true)
        email_words = collect(email_words)[sort_idx][1:tol]
    end
    email_ham_probability = BigFloat(1)
    email_spam_probability = BigFloat(1)
    # Multiply the per-word probabilities for every email word in our vocabulary.
    for word in intersect(email_words, model.vocabulary)
        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
        email_ham_probability *= word_ham_prob
        email_spam_probability *= word_spam_prob
    end
    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
end
```
This function takes as input a new email that we want to classify as
spam or ham, our fitted model, an $α$ value (which we've already
discussed), and a tolerance value `tol`. The latter sets the maximum
number of unique words in an email that we'll look at. We saw that the
calculations for $P(email|spam)$ and $P(email|ham)$ require the
multiplication of each $P(word_i|spam)$ and $P(word_i|ham)$ term. When
emails consist of a large number of words, this multiplication may lead
to very small probabilities, up to the point that the computer
interprets those probabilities as zero. This isn't desirable; we need
values of $P(email|spam)$ and $P(email|ham)$ that are larger than zero
in order to multiply them by $P(spam)$ and $P(ham)$, respectively, and
compare these values to make a prediction. To avoid probabilities of
zero, we'll only consider up to the `tol` most frequently used words in
the email.
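Our implementation dodges this problem with `BigFloat` accumulators
plus the `tol` cutoff. A common alternative, shown here only as a
sketch and not what the code above does, is to work in log space, where
the long product becomes a sum:

```julia
# Summing logs avoids the underflow that plagues long products:
#   log P(email|spam) = Σᵢ log P(wordᵢ|spam)
word_probs = [1e-5, 3e-4, 2e-6]    # hypothetical per-word probabilities
direct = prod(word_probs)          # underflows to 0.0 once emails get long enough
logsum = sum(log.(word_probs))     # stays comfortably inside Float64's range
exp(logsum) ≈ direct               # true: same quantity, safer arithmetic
```

Since the logarithm is monotonic, comparing log probabilities yields
the same classification as comparing the probabilities themselves.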
Finally, we arrive at the point of actually testing our model. We
create another function to manage the process. This function classifies
each email as ham (represented by the number 0) or spam (represented by
the number 1):
```{julia, results = FALSE}
function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    N = length(y_test)
    predictions = Array{Int64,1}(undef, N)
    for i in 1:N
        # Rebuild the email text by repeating each vocabulary word as many
        # times as it appears in column i of the word-count matrix.
        email = join(repeat(string(word, " "), count) for (word, count) in zip(model.vocabulary, x_test[:, i]))
        pham, pspam = spam_predict(email, model, α, tol)
        pred = argmax([pham, pspam]) - 1
        predictions[i] = pred
    end
    predictions
end
```
This function takes in the testing portion of the data and our trained
model. We call our `spam_predict` function for each email in the
testing data and use the index of the larger of the two returned
probability values (`argmax`), minus one, as the prediction (`pred`):
0 for ham, 1 for spam. We return the predictions as an array of these
values. Here we call the function to make predictions about the test
data:
```{julia, results = FALSE}
predictions = get_predictions(x_test, y_test, spam_filter, 1)
```
Let's take a look at the predicted classifications of just the first
five emails in the test data.
```{julia, results = TRUE}
predictions[1:5]
```
Of the first five emails, one (the third) was classified as spam, and
the rest were classified as ham.
## Evaluating the Accuracy
Looking at the predictions themselves is pretty meaningless; what we
really want to know is the model's accuracy. We'll define another
function to calculate this.
```{julia, results = FALSE}
function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    accuracy = correct / N
    accuracy
end
```
This function compares the predicted classifications with the actual
classifications of the test data, counts the number of correct
predictions, and divides this number by the total number of test emails,
giving us an accuracy measurement. Here we call the function:
```{julia, results = TRUE}
spam_filter_accuracy(predictions, y_test)
```
The output indicates our model is about 95 percent accurate. It appears
our model is performing very well! Such a high accuracy rate is quite
astonishing for a model so naive and simple. In fact, it may be a
little too good to be true, because we have to take one more thing into
account: our model classifies emails as spam or ham, but the number of
ham emails in our dataset is considerably higher than the number of
spam ones. Let's see the percentages:
```{julia, results = TRUE}
sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
```
To calculate the proportion of spam emails in the dataset, we sum over
the `Prediction` column, remembering that it consists only of 0s and
1s, and then divide by the total number of emails. This type of
classification problem, where there's an unequal distribution of
classes in the dataset, is called *imbalanced*. With imbalanced data, a
better way to see how the model is performing is to construct a
*confusion matrix*, an $N \times N$ matrix, where $N$ is the number of
target classes (in our case, 2, for spam and ham). The matrix compares
the actual values for each class with those predicted by the model.
Here's a function that builds a confusion matrix for our spam filter:
```{julia, results = FALSE}
function spam_filter_confusion_matrix(y_test, predictions)
    # A 2x2 matrix is instantiated with zeros
    confusion_matrix = zeros((2, 2))
    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))

    # Now we convert the confusion matrix into a DataFrame
    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
    return confusion_df
end
```
Now let's call our function to build the confusion matrix for our model.
```{julia, results = TRUE}
confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
```
Row 1 of the confusion matrix shows us all the emails our model
classified as ham; 1,056 of those classifications were correct and 36
were incorrect. Similarly, the `spam_mail` column shows us the
classifications of all the actual spam emails; 36 were misidentified as
ham, and 427 were correctly identified as spam.
Now that we have the confusion matrix, we can calculate the accuracy of
the model segmented by category.
```{julia, results = TRUE}
ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
```
```{julia, results = TRUE}
spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
```
With these values, we now have a more fine-grained measure of our
model's accuracy. We now know that our spam filter doesn't achieve the
same degree of accuracy for spam emails as for ham emails: as a
consequence of the imbalance in our data, ham emails are classified
correctly more often than spam emails. Still, with both percentages
above 90, the accuracy is pretty good for a model so simple and naive.
Models like these can be used as a baseline for building more complex
ones on top of them.
## Summary
In this chapter, we've used a naive Bayes approach to build a simple
email spam filter. We walked through the whole process of training,
testing, and evaluating a learning model. First, we obtained a dataset
of emails already classified as spam or ham and preprocessed the data.
Then we considered the theoretical framework for our naive analysis.
Using Bayes' theorem on the available data, we assigned to each word in
the dataset a probability of appearing in a spam or a ham email. The
probability of a new email being classified as spam is then
proportional to the product of the probabilities of its constituent
words. We
defined a Julia `struct` for the spam filter object and created
functions to fit the spam filter object to the data. Finally, we made
predictions on new data and evaluated our model's performance by
calculating the accuracy and making a confusion matrix.
## Appendix - A little more about alpha
As we have seen, to calculate the probability of an email being spam,
we use

$$P(email|spam) = \prod_{i=1}^{n}P(word_i|spam) = P(word_1|spam)P(word_2|spam) \ldots P(word_{np}|spam)$$

where $P(word_{np}|spam)$ stands for the probability of a word that is
not present in our dataset. What probability should be assigned to this
word? One way to handle it could be to simply ignore that term in the
multiplication, in other words, to assign $P(word_{np}|spam)=1$.
Without thinking about it too much, we can conclude that this doesn't
make any sense, since it would mean that the probability of finding
that word in a spam (or ham) email is equal to 1. A more logically
consistent approach would be to assign zero probability to that word.
But there is a problem: with $P(word_{np}|spam)=0$, we can quickly see
that

$$P(word_1|spam)P(word_2|spam) \ldots P(word_{np}|spam) = 0$$

This is the motivation for introducing the smoothing parameter $\alpha$
into our equation. In a real-world scenario, we should expect words
that are not present in our training set to appear, and although it
makes sense that they don't get a high probability, it can't be 0. When
such a word appears, the probability assigned to it will simply be

$$P(word_{np}|spam) = \frac{N_{word_{np}|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{0 + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{\alpha}{N_{spam} + \alpha N_{vocabulary}}$$

In summary, $\alpha$ is just a smoothing parameter that keeps the
probability of finding a word that is not in our dataset from going
down to 0. Since we want to keep the probability for these words low,
a small value such as $\alpha=1$ makes sense.
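As a closing numeric sketch with made-up totals (not values from our
dataset), here is how the probability assigned to an unseen word
behaves for a few values of $\alpha$: it shrinks as $\alpha$ shrinks,
but never reaches zero.

```julia
# Hypothetical totals, for illustration only.
N_spam = 1000
N_vocabulary = 3000

# Probability assigned to a word never seen in spam training emails.
for α in (0.1, 1.0, 10.0)
    p_unseen = α / (N_spam + α * N_vocabulary)
    println("α = $α  →  P(word_np|spam) = $p_unseen")
end
```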