-
Notifications
You must be signed in to change notification settings - Fork 100
/
Copy pathindex.Rmd
executable file
·181 lines (135 loc) · 14.5 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
title: "Supervised Machine Learning for Text Analysis in R"
author: "Emil Hvitfeldt and Julia Silge"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
documentclass: krantz
bibliography: [book.bib]
link-citations: yes
links-as-notes: true
colorlinks: true
lot: false
lof: false
monofont: "Source Code Pro"
monofontoptions: "Scale=0.7"
github-repo: EmilHvitfeldt/smltar
description: "Supervised Machine Learning for Text Analysis in R"
cover-image: cover.jpg
url: https://smltar.com
graphics: yes
---
```{r include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
.packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
```
\mainmatter
`r if (knitr:::is_html_output()) '
# Welcome to Supervised Machine Learning for Text Analysis in R {-}
This is the [website](https://smltar.com/) for *Supervised Machine Learning for Text Analysis in R*! Visit the [GitHub repository for this site](https://github.com/EmilHvitfeldt/smltar), or buy a physical copy from [CRC Press](https://doi.org/10.1201/9781003093459), [Bookshop.org](https://bookshop.org/books/supervised-machine-learning-for-text-analysis-in-r-9780367554194/9780367554194), or [Amazon](https://amzn.to/3DaHzjF).
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This online work by [Emil Hvitfeldt](https://www.emilhvitfeldt.com/) and [Julia Silge](http://juliasilge.com/) is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
'`
\pagenumbering{roman}
\setcounter{page}{11}
# Preface {-}
Modeling as a statistical practice can encompass a wide variety of activities. This book focuses on *supervised or predictive modeling for text*, using text data to make predictions about the world around us. We use the [tidymodels](https://www.tidymodels.org/) framework for modeling, a consistent and flexible collection of R packages developed to encourage good statistical practice.
Supervised machine learning using text data involves building a statistical model to estimate some output from input that includes language. The two types of models we train in this book are regression and classification. Think of \index{regression}regression models as predicting numeric or continuous outputs, such as predicting the year of a United States Supreme Court opinion from the text of that opinion. Think of \index{classification}classification models as predicting outputs that are discrete quantities or class labels, such as predicting whether a GitHub issue is about documentation or not from the text of the issue. Models like these can be used to make predictions for new observations, to understand what features or characteristics contribute to differences in the output, and more. We can evaluate our models using performance metrics to determine which are best, which are acceptable for our specific context, and even which are fair.
```{block, type = "rmdnote"}
Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features (predictors) for machine learning from language.
```
Natural language that we as speakers and/or writers use must be dramatically transformed to a machine-readable, numeric representation to be ready for computation. In this book, we explore typical text preprocessing steps from the ground up and consider the effects of these steps. We also show how to fluently use the **textrecipes** R package [@textrecipes] to prepare text data within a modeling pipeline.
@Silge2017 provides a practical introduction to text mining with R using tidy data principles, based on the **tidytext** package. If you have already started on the path of gaining insight from your text data, a next step is using that text directly in predictive modeling. Text data contains within it latent information that can be used for insight, understanding, and better decision-making, and predictive modeling with text can bring that information and insight to light. If you have already explored how to analyze text as demonstrated in @Silge2017, this book will move one step further to show you how to *learn and make predictions* from that text data with supervised models. If you are unfamiliar with this previous work, this book will still provide a robust introduction to how text can be represented in useful ways for modeling and a diverse set of supervised modeling approaches for text.
## Outline {-}
The book is divided into three sections. We make a (perhaps arbitrary) distinction between *machine learning methods* and *deep learning methods* by defining deep learning as any kind of multilayer neural network (LSTM, bi-LSTM, CNN) and machine learning as anything else (regularized regression, naive Bayes, SVM, random forest). We make this distinction both because these different methods use separate software packages and modeling infrastructure, and from a pragmatic point of view, it is helpful to split up the chapters this way.
- **Natural language features:** How do we transform text data into a representation useful for modeling? In these chapters, we explore the most common preprocessing steps for text, when they are helpful, and when they are not.
- **Machine learning methods:** We investigate the power of some of the simpler and more lightweight models in our toolbox.
- **Deep learning methods:** Given more time and resources, we see what is possible once we turn to neural networks.
Some of the topics in the second and third sections overlap as they provide different approaches to the same tasks.
Throughout the book, we will demonstrate with examples and build models using a selection of text data sets. A description of these data sets can be found in Appendix \@ref(appendixdata).
```{block, type = "rmdnote"}
We use three kinds of info boxes throughout the book to invite attention to notes and other ideas.
```
```{block, type = "rmdwarning"}
Some boxes call out warnings or possible problems to watch out for.
```
```{block, type = "rmdpackage"}
Boxes marked with hexagons highlight information about specific R packages and how they are used. We use **bold** for the names of R packages.
```
## Topics this book will not cover {-}
This book serves as a thorough introduction to prediction and modeling with text, along with detailed practical examples, but there are many areas of natural language processing we do not cover. \index{CRAN}The [*CRAN Task View on Natural Language Processing*](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) provides details on other ways to use R for computational linguistics. Specific topics we do not cover include:
- **Reading text data into memory:** Text data may come to a data practitioner in any of a long list of heterogeneous formats. Text data exists in PDFs, databases, plain text files (single or multiple for a given project), websites, APIs, literal paper, and more. The skills needed to access and sometimes wrangle text data sets so that they are in memory and ready for analysis are so varied and extensive that we cannot hope to cover them in this book. We point readers to R packages such as **readr** [@R-readr], **pdftools** [@R-pdftools], and **httr** [@R-httr], which we have found helpful in these tasks.
- **Unsupervised machine learning for text:** @Silge2017 provide an introduction to one method of unsupervised text modeling\index{machine learning!unsupervised}, and Chapter \@ref(embeddings) does dive deep into word embeddings, which learn from the latent structure in text data. However, many more unsupervised machine learning algorithms can be used for the goal of learning about the structure or distribution of text data when there are no outcome or output variables to predict.
- **Text generation:** The deep learning model architectures we discuss in Chapters \@ref(dldnn), \@ref(dllstm), and \@ref(dlcnn) can be used to generate new text\index{text generation}, as well as to model existing text. @Chollet2018 provide details on how to use neural network architectures and training data for text generation.
- **Speech processing:** Models that detect words in audio recordings of speech\index{speech} are typically based on many of the principles outlined in this book, but the training data is _audio_ rather than written text. R users can access pre-trained speech-to-text models via large cloud providers, such as Google Cloud's Speech-to-Text API accessible in R through the **googleLanguageR** package [@R-googleLanguageR].
- **Machine translation:** Machine translation\index{translation} of text between languages, based on either older statistical methods or newer neural network methods, is a complex, involved topic. Today, the most successful and well-known implementations of machine translation are proprietary, because large tech companies have access to both the right expertise and enough data in multiple languages to train successful models for general machine translation. Google is one such example, and Google Cloud's Translation API is again available in R through the **googleLanguageR** package.
## Who is this book for? {-}
This book is designed to provide practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate text into their modeling pipelines.
We assume that the reader is somewhat familiar with R, predictive modeling concepts for non-text data, and the [**tidyverse**](https://www.tidyverse.org/) family of packages [@Wickham2019]. For users who don't have this background with tidyverse code, we recommend [*R for Data Science*](http://r4ds.had.co.nz/) [@Wickham2017]. Helpful resources for getting started with modeling and machine learning include a [free interactive course](https://supervised-ml-course.netlify.app/) developed by one of the authors (JS) and [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/) [@Boehmke2019], as well as [*An Introduction to Statistical Learning*](http://faculty.marshall.usc.edu/gareth-james/ISL/) [@James2013].
We don't assume an extensive background in text analysis, but [*Text Mining with R*](https://www.tidytextmining.com/) [@Silge2017], by one of the authors (JS) and David Robinson, provides helpful skills in exploratory data analysis for text that will promote successful text modeling. This book is more advanced than *Text Mining with R* and will help practitioners use their text data in ways not covered in that book.
## Acknowledgments {-}
We are so thankful for the contributions, help, and perspectives of people who have supported us in this project. There are several we would like to thank in particular.
We would like to thank Max Kuhn and Davis Vaughan for their investment in the **tidymodels** packages, David Robinson for his collaboration on the **tidytext** package, and Yihui Xie for his work on **knitr**, **bookdown**, and the R Markdown ecosystem. Thank you to Desirée De Leon for the site design of the online work and to Sarah Lin for the expert creation of the published work's index. We would also like to thank Carol Haney, Kasia Kulma, David Mimno, Kanishka Misra, and an additional anonymous technical reviewer for their detailed, insightful feedback that substantively improved this book, as well as our editor John Kimmel for his perspective and guidance during the process of writing and publishing.
```{r, eval = FALSE, echo = FALSE}
library(tidyverse)
contribs_all_json <- gh::gh("/repos/:owner/:repo/contributors",
owner = "EmilHvitfeldt",
repo = "smltar",
.limit = Inf
)
contribs_all <- tibble(
login = contribs_all_json %>% map_chr("login"),
n = contribs_all_json %>% map_int("contributions")
)
contribs_old <- read_csv("contributors.csv", col_types = list())
contribs_new <- contribs_all %>% anti_join(contribs_old, by = "login")
# Get info for new contributors
needed_json <- map(
contribs_new$login,
~ gh::gh("/users/:username", username = .x)
)
info_new <- tibble(
login = contribs_new$login,
name = map_chr(needed_json, "name", .default = NA),
blog = map_chr(needed_json, "blog", .default = NA)
)
info_old <- contribs_old %>% select(login, name, blog)
info_all <- bind_rows(info_old, info_new)
contribs_all <- bind_rows(contribs_old, contribs_new)
contribs_all <- contribs_all %>%
left_join(info_all, by = "login") %>%
arrange(login)
write_csv(contribs_all, "contributors.csv")
```
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)
contributors <- read.csv("contributors.csv", stringsAsFactors = FALSE)
contributors <- contributors %>%
filter(!login %in% c("EmilHvitfeldt", "juliasilge", "dcossyleon")) %>%
mutate(
login = paste0("\\@", login),
desc = ifelse(is.na(name), login, paste0(name, " (", login, ")"))
)
cat("This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks goes to the ", xfun::n2w(nrow(contributors)), " people who contributed via GitHub pull requests (in alphabetical order by username): ", sep = "")
cat(paste0(contributors$desc, collapse = ", "))
cat(".\n")
```
Note box icons by Smashicons from flaticon.com.
## Colophon {-}
This book was written in [RStudio](https://www.rstudio.com/ide/) using [**bookdown**](https://bookdown.org). The [website](https://smltar.com) is hosted via [GitHub Pages](https://pages.github.com), and the complete source is available on [GitHub](https://github.com/EmilHvitfeldt/smltar). We generated all plots in this book using [**ggplot2**](https://ggplot2.tidyverse.org) and its light theme (`theme_light()`). The `autoplot()` method for [`conf_mat()`](https://yardstick.tidymodels.org/reference/conf_mat.html) has been modified slightly to allow colors; modified code can be found [online](https://github.com/EmilHvitfeldt/smltar/blob/master/_common.R).
```{block, type = "rmdwarning"}
Because of changes in package versions since the publication of the first edition, you may notice slight differences in some results when comparing this online work and the published paper edition.
```
This version of the book was built with `r R.version.string` and the following packages:
```{r, echo = FALSE, results="asis", cache=FALSE}
deps <- desc::desc_get_deps()
pkgs <- sort(deps$package[deps$type == "Imports"])
pkgs <- sessioninfo::package_info(pkgs, dependencies = FALSE)
df <- tibble(
package = pkgs$package,
version = pkgs$ondiskversion,
source = gsub("@", "\\\\@", pkgs$source)
) %>%
mutate(source = str_remove(source, "\\\\@[a-f0-9]*"))
knitr::kable(df, format = "markdown")
```