---
title: "Web Scraping: Songs and Lyrics"
output: html_document
---
* October 24, 2019
* DATA 900 - Professor Gyory
* Jacob Mannix
#### Process Outline
1. **Wikipedia**
    1. Access the Wikipedia "Billboard Hot 100 Era" top singles by year (available years: 1958-2019)
    1. Get the table of top songs for a particular year
    1. Get a list of unique top songs with artist names for a particular year
2. **Genius**
    1. Use the Wikipedia lists to access the lyrics for each song
    1. Create lists of all song lyrics for a particular year
3. **Visualizations**
4. **Analysis**
### Loading Libraries and Variables
```{r Loading Libraries, warning=FALSE, message=FALSE}
library(rvest)
library(RSelenium)
library(tidyverse)
library(stringr)
library(tm) # text mining
library(wordcloud) # word cloud
library(RColorBrewer) # color palettes
library(SnowballC) # text stemming
library(knitr)
library(rmarkdown)
```
``` {r Loading Saved Variables, warning=FALSE, message=FALSE, cache=TRUE}
load("/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Web Scraping Assignment Data Variables/Songs1980_2015.RData") # Songs
load("/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Web Scraping Assignment Data Variables/lyrics1980_2015.RData") # Lyrics
load("/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Web Scraping Assignment Data Variables/docsAll.RData") # docs
```
## P1: Wikipedia
### P1.1: Accessing Wikipedia "Billboard Hot 100 Era" top singles by year
Read the Wikipedia page that lists the Billboard number-one singles by year and extract the years of the "Billboard Hot 100 Era".
``` {r Hot 100 Era Years, warning=FALSE, message=FALSE, cache=TRUE}
# Read the Wikipedia page listing the "Billboard Hot 100 Era" top singles
billboard_singles <- read_html("https://en.wikipedia.org/wiki/List_of_Billboard_number-one_singles")

# Get the link text from every table cell, which includes all years with a "Hot 100 Era" list
hot100_years_full <- billboard_singles %>%
  html_nodes("tbody") %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_text()

# Trim the list to the years of interest. Indices 1:62 span the full
# "Hot 100 Era" (1958-2019), so 23:62 keeps the 40 years 1980-2019.
hot100_years_all <- hot100_years_full[23:62]
hot100_years_all
```
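Trimming by position depends on how many links come before the years in the scraped vector, so the indices can shift if the page layout changes. A minimal base-R sketch of selecting the years by value instead is given here. The `link_text` vector and the `keep_years()` helper are hypothetical stand-ins for `hot100_years_full` and are not part of the original workflow.

```r
# Hypothetical stand-in for the scraped link text (years mixed with other links)
link_text <- c("Billboard", "1957", "1958", "1980", "2005", "2019", "2020", "Hot 100")

# Keep only four-digit years that fall inside the chosen range
keep_years <- function(x, from = 1958, to = 2019) {
  is_year <- grepl("^[0-9]{4}$", x)   # entries made of exactly four digits
  years <- as.integer(x[is_year])
  as.character(years[years >= from & years <= to])
}

hot100_years_demo <- keep_years(link_text)
hot100_years_demo
```

Calling `keep_years(link_text, from = 1980, to = 2015)` would narrow the selection to the range covered by the saved data files.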
<center><img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Images/Hot100Era.png" alt="drawing" width="550"/></center>
### P1.2: Get the table of top songs for a particular year
```{r Iterating, warning=FALSE, message=FALSE, cache=TRUE}
hot100_years <- "2005" # Choose the year (or a vector of years) to look at
hot100_list <- list() # Empty list to collect the chart tables

for (i in hot100_years) {
  # Start an HTML session on the overview page and follow the link for the year
  hot100_session <- html_session("https://en.wikipedia.org/wiki/List_of_Billboard_number-one_singles")
  hot100_link <- hot100_session %>%
    follow_link(i)

  # Get the song chart for that year
  hot100_chart <- hot100_link %>%
    # html_nodes("table.wikitable.plainrowheaders") %>% # 2010 and after
    html_nodes(xpath = "/html/body/div[3]/div[3]/div[4]/div/table[2]") %>% # Before 2010
    html_table(fill = TRUE, header = 1)

  hot100_list <- append(hot100_list, hot100_chart) # Append the table to the overall list
}
```
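The comments in the loop above indicate that the class-based selector applies to 2010 and later, while the absolute XPath applies to earlier years. Rather than commenting one line in and out by hand, the choice could be made from the year itself. The sketch below assumes that 2010 cutoff exactly as the comments state it; `chart_selector()` is a hypothetical helper, and the selectors have not been re-checked against the current Wikipedia pages.

```r
# Hypothetical helper: pick the selector for a given year, using the
# 2010 cutoff noted in the comments of the scraping loop
chart_selector <- function(year) {
  year <- as.integer(year)
  if (year >= 2010) {
    list(type = "css", value = "table.wikitable.plainrowheaders")
  } else {
    list(type = "xpath", value = "/html/body/div[3]/div[3]/div[4]/div/table[2]")
  }
}

sel_2005 <- chart_selector("2005")
sel_2015 <- chart_selector("2015")
sel_2005$type
sel_2015$type
```

Inside the loop, the result could be passed to `html_nodes()` through its `css` or `xpath` argument, depending on `type`.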
### P1.3: Get a list of unique top songs w/ artist names for a particular year
```{r Unique List, warning=FALSE, message=FALSE, cache=TRUE}
# Convert the list of tables to a data frame and keep the song and artist columns
hot100_df <- data.frame(hot100_list)[3:4]
hot100_df$Song.Artist <- paste(hot100_df$Song, hot100_df$Artist.s.)

# Keep each song/artist combination only once
hot100_songs_df <- unique(hot100_df["Song.Artist"])

# Remove all punctuation from the song/artist strings
hot100_songs_list <- str_replace_all(hot100_songs_df$Song.Artist, "[:punct:]", "")
# hot100_songs_list <- str_replace_all(hot100_songs_df$Song.Artist, '"', '')

# Save the song list for an individual year (rename the variable to match hot100_years)
Songs2018 <- hot100_songs_list
Songs2018
```
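To make the de-duplication and cleaning step concrete, here is a small base-R sketch on made-up rows. The song and artist names are placeholders, not scraped data, and `gsub()` with `[[:punct:]]` is used here in place of `str_replace_all()` only to keep the example free of package dependencies.

```r
# Made-up chart rows: the same song appears once for each week it was listed
chart_demo <- data.frame(
  Song = c("\"Song A\"", "\"Song A\"", "\"Song B (Remix)\""),
  Artist.s. = c("Artist X", "Artist X", "Artist Y featuring Z"),
  stringsAsFactors = FALSE
)

# Combine song and artist, drop repeated rows, then strip punctuation
search_terms <- unique(paste(chart_demo$Song, chart_demo$Artist.s.))
search_terms <- gsub("[[:punct:]]", "", search_terms)
search_terms
```

The three input rows collapse to two search strings, with the quotation marks and parentheses removed.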
<center><img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Images/SongTable.png" alt="drawing" width="750"/></center>
## P2: Genius
### P2.1: Using the cleaned list of songs and artists to get lyrics for each song
```{r Getting Lyrics, warning=FALSE, message=FALSE, results="hide", cache=TRUE}
# Pass each song name into genius.com, using RSelenium to drive the browser and read the lyrics
driver <- rsDriver(browser = c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$open()

# Loop through each song name and collect the lyrics for each song
lyrics_list <- c()
# hot100_songs_list <- Songs2018
for (i in seq_along(hot100_songs_list)) {
  remote_driver$navigate("https://genius.com")
  remote_driver$refresh() # Refresh the home page
  Sys.sleep(2)

  # Type the song and artist into the search box
  address_element <- remote_driver$findElement(using = 'xpath', value = '/html/body/div/div/div[1]/form/input')
  address_element$sendKeysToElement(list(hot100_songs_list[i]))
  Sys.sleep(2)

  # Submit the search
  button_element <- remote_driver$findElement(using = 'xpath', value = "/html/body/div/div/div[1]/form/div[2]")
  button_element$clickElement()
  Sys.sleep(2)

  # Open the first search result
  button_element2 <- remote_driver$findElement(using = 'class', value = "mini_card")
  button_element2$clickElement()
  Sys.sleep(2)

  # Read the lyrics section of the song page and append its text to the list
  lyrics_out <- remote_driver$findElement(using = "xpath", value="/html/body/routable-page/ng-outlet/song-page/div/div/div[2]/div[1]/div/defer-compile[1]/lyrics/div/div/section")
  Sys.sleep(2)
  lyrics_list_text <- lyrics_out$getElementText()
  lyrics_list <- append(lyrics_list, lyrics_list_text)
  #lyrics_list <- lyrics_list[-c(1)]
}
#driver$server$stop() # Drops the connection to the server
#Write Lyrics to CSV or text file
# write.csv(lyrics_list, file = "test1980lyrics.csv")
# lyrics1980 <- lyrics_list
# lyrics1980 <- lyrics1980[-c(6,7)]
# lyrics1985 <- lyrics1985[-c(8,16)]
# lyrics1990 <- lyrics1990[-c(3,12)]
# lyrics2015 <- lyrics2015[-c(2)]
```
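The commented lines at the end of the chunk above drop individual entries by index, apparently to discard songs that did not scrape cleanly. One way to handle this inside the loop is to wrap each lookup in `tryCatch()` so that a failure records `NA` instead of stopping the run. The sketch below assumes that approach; `fetch_lyrics()` is a hypothetical stand-in for the Selenium steps and simply fails for one made-up song.

```r
# Hypothetical stand-in for the Selenium steps; fails for one song on purpose
fetch_lyrics <- function(song) {
  if (song == "Missing Song") stop("element not found")
  paste("lyrics for", song)
}

# Return NA instead of stopping the loop when a lookup fails
safe_fetch <- function(song) {
  tryCatch(fetch_lyrics(song), error = function(e) NA_character_)
}

songs_demo <- c("Song A", "Missing Song", "Song B")
lyrics_demo <- vapply(songs_demo, safe_fetch, character(1), USE.NAMES = FALSE)

failed <- which(is.na(lyrics_demo))               # positions to review by hand
lyrics_clean <- lyrics_demo[!is.na(lyrics_demo)]  # entries that scraped cleanly
```

Keeping the failed positions in `failed` makes it possible to revisit those songs later rather than hard-coding the indices to remove.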
```{r Add Lyrics to Song Information, include=FALSE, cache=TRUE}
#mydf <- data.frame(street_names, lat_long_column) %>%
# mutate(lat_long = str_remove_all(lat_long, "\\(|\\)")) %>% # Remove the parentheses
# from the lat long string
# separate(lat_long, into = c("latitude", "longitude"), sep = ",")
```
## P3: Visualizations and Analysis
### P3.1: Preparing the Lyrics for Visualizations
``` {r Preparing the Lyrics for Visualizations, warning=FALSE, message=FALSE, cache=TRUE}
# Preparing the Lyrics for Visualizations
# text <- read.csv(file = '/Users/jacobmannix/Desktop/test1980lyrics.csv')
docs <- Corpus(VectorSource(lyrics_list))
# inspect(docs)
# Cleaning up the docs
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Cleaning up the docs further
docs <- tm_map(docs, content_transformer(tolower)) # Convert to lower case
docs <- tm_map(docs, removeNumbers) # Remove numbers
docs <- tm_map(docs, removeWords, stopwords("english")) # Remove common English stopwords
docs <- tm_map(docs, removePunctuation) # Remove punctuation
docs <- tm_map(docs, stripWhitespace) # Eliminate extra white space
docs <- tm_map(docs, removeWords, c("chorus", "verse"))
# docs <- tm_map(docs, removeWords, c()) # Remove your own stop word
# Creating a Term Document Matrix to display most frequently used words
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 15)
```
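The term-document matrix counts how often each term occurs in each document, and `rowSums()` then adds those counts across documents to give an overall frequency per word. The base-R sketch below reproduces that idea on two made-up lyric lines. It is a simplified illustration, not the `tm` pipeline: the stop-word list is hypothetical, and `tm` applies its own tokenization rules, so counts on real data may differ.

```r
# Two made-up "documents" standing in for song lyrics
lyrics_demo <- c("Love me, love me now", "Now I know: love hurts")
stop_demo <- c("me", "i")  # hypothetical stop-word list

# Lower-case, replace punctuation with spaces, split on whitespace
tokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", lyrics_demo)), "\\s+"))
tokens <- tokens[nzchar(tokens) & !tokens %in% stop_demo]

# Count each word across all documents and sort by frequency
freq <- sort(table(tokens), decreasing = TRUE)
d_demo <- data.frame(word = names(freq), freq = as.integer(freq),
                     stringsAsFactors = FALSE)
d_demo
```

The resulting data frame has the same `word` and `freq` columns as `d`, which is the shape the word cloud and bar plot expect.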
### P3.2: Wordcloud
```{r Wordclouds}
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 50, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
```
```{r Associations and Frequency, include=FALSE}
# Exploring frequent terms and their associations
# findFreqTerms(dtm, lowfreq = 4)
# findAssocs(dtm, terms = "psycho", corlimit = 0.3)
# head(d, 10)
```
### P3.3: Frequency
```{r Frequency Plot}
# Plotting word frequencies as Barplot
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
```
```{r Sentiment Analysis, include=FALSE}
# library(sentimentr)
# sentiment(dtm)
# maybe try for distinct words in songs? for different years/ decades?
# install.packages("tidytext")
# library(tidytext)
# get_sentiments("afinn")
```
### P4: Overall Analysis
#### P4.1: Number of Songs per year
```{r Songs per year count,include=FALSE}
Year <- c("1980", "1985", "1990", "1995", "2000", "2005", "2010", "2015")
Count <- c(17, 27, 26, 12, 18, 8, 17, 9)
year_song_counts <- data.frame(Year, Count)
```
```{r Number of Songs per year}
kable(year_song_counts, format='markdown')
```
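The counts in the table above are typed in by hand. If the per-year song vectors are kept in a named list, the same table could be derived from their lengths. The sketch below assumes such a list; `songs_by_year` and its contents are made up for illustration and do not reflect the saved data.

```r
# Hypothetical named list of per-year song vectors
songs_by_year <- list(
  "1980" = c("Song A", "Song B", "Song C"),
  "1985" = c("Song D", "Song E")
)

# One row per year, with the number of unique songs scraped for that year
year_song_counts_demo <- data.frame(
  Year = names(songs_by_year),
  Count = unname(lengths(songs_by_year)),
  stringsAsFactors = FALSE
)
year_song_counts_demo
```

The result can be passed to `kable()` in the same way as `year_song_counts`.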
#### P4.2: List of Song Names per year
<center><img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Songsperyear/Songsperyear1.png" alt="drawing"/>
<img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Songsperyear/Songsperyear2.png" alt="drawing"/></center>
#### P4.3: WordClouds
<center><img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/WordClouds/AllWordClouds.png" alt="drawing"/></center>
#### P4.4: Word Frequencies
<center><img src="/Users/jacobmannix/Box Sync/M.S. Analytics/Analytics Fall/DATA 900/Web Scrapping/Assignment/Frequencies/AllWordFrequencies.png" alt="drawing"/></center>
#### P4.5: Top words throughout all specified years
```{r TermDocumentMatrix for all Years, include=FALSE}
dtm <- TermDocumentMatrix(docsAll)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
```
```{r Top 20 Words for all years}
kable(head(d, 20), format = 'markdown') # Top 20 words and frequency from all specified years
```