# Webscraping LinkedIn {#webscraping-linkedin}
```{r webscraping-linkedin, include=FALSE}
chap <- 27
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
Did you know that the internet is the greatest source of publicly available data? One of the key skills for obtaining data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information.
This technique can be used to collect data from online databases or to gather data that is scattered across a website.
In the following scenario, you are asked to extract from LinkedIn the education qualifications of your main competitor's employees and later compare them with those of your own company. Your senior management is seriously worried that the employees working for your main competitor are better skilled.
You are tasked with extracting all the forenames, the surnames, the job titles and the education qualifications, including the "from" and "to" dates and the education organisations.
In our example, we have chosen to look at [Innocent](https://www.innocentdrinks.co.uk/), the famous drinks manufacturer. Innocent ranks third among "The Sunday Times 100 Best Companies to Work For".
From the [Sunday Times](https://appointments.thetimes.co.uk/article/best100companies/)
you know that Innocent has 289 employees and that brand managers earn salaries of around £50,000 plus bonuses. Furthermore, the male/female ratio is 40:60, the average age is 39, voluntary staff turnover is 15% and 70% of its staff earn over £35,000.
Innocent makes a convincing offer to ambitious types. Nicki Garland, the firm’s team leader for UK and Ireland finance, enjoys the best of both worlds at Innocent. “I wanted to work for a brand that had purpose and meaning,” she says. “When you walk in and see a wall that says the company’s mission is to ‘help people to live well and die old’, you realise this is definitely a very different business.”
This [starting page](https://www.linkedin.com/search/results/people/v2/?facetCurrentCompany=%5B%2224177%22%5D&page=1) will be the entry point of your web scraping. After a few checks, you realise that, according to LinkedIn, the company has 543 employees spread over a total of 55 pages of records. From looking at the URLs you also realise that `%5B%2224177%22%5D` is the code LinkedIn uses to identify Innocent.
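
To get a feel for the structure, you can already build the full list of search-result URLs from this pattern. Below is a minimal sketch, assuming the `page=` parameter simply counts from 1 up to the 55 pages you observed (the chunk is not evaluated here):

```{r, eval=FALSE}
# Base search URL for Innocent (company facet %5B%2224177%22%5D), minus the page number
base_url <- paste0("https://www.linkedin.com/search/results/people/v2/",
                   "?facetCurrentCompany=%5B%2224177%22%5D&page=")

# One URL per result page; 55 pages were observed for the 543 employees
search_pages <- paste0(base_url, 1:55)

head(search_pages, 3)
```
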
After lengthy googling on the internet, you find an excellent [tutorial on DataCamp for webscraping](https://www.datacamp.com/community/tutorials/r-web-scraping-rvest) which does mostly what you plan to do, but unfortunately it scrapes Trustpilot, which on top of everything is a website that looks completely different from LinkedIn. You will need to adapt the code given in that tutorial, open each individual LinkedIn profile across all 55 pages and scrape the required information. The final product has to be a table which respects ["Hadley Wickham's tidy data rule"](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf):

> Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
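
Before adapting any scraping code, it helps to write down the shape of the tidy table you are aiming for. Here is a minimal, hand-made sketch of that target structure, with invented values and assumed column names (one row per education qualification per employee):

```{r, eval=FALSE}
library(tibble)

# Hypothetical target: one row per (employee, qualification) combination
target <- tribble(
  ~forename, ~surname, ~job_title,      ~education_org,          ~from, ~to,  ~qualification,
  "Jane",    "Doe",    "Brand Manager", "University of Bristol", 2008,  2011, "BSc Economics",
  "Jane",    "Doe",    "Brand Manager", "Imperial College",      2012,  2013, "MSc Marketing"
)
target
```

Each variable (forename, surname, job title, organisation, dates, qualification) is a column and each education entry is a row, so the table satisfies the tidy data structure quoted above.
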
```{r}
library(tidyverse) # General purpose data wrangling
library(rvest)     # Parsing of html/xml files
library(rebus)     # Verbose regular expressions
library(lubridate) # Parsing of date-times (used in get_review_dates() below)
```
```{r}
get_last_page <- function(html){
  pages_data <- html %>%
    html_nodes('.pagination-page') %>% # The '.' indicates the class
    html_text()                        # Extracts the raw text as a list
  pages_data[(length(pages_data) - 1)] %>% # The second to last of the buttons is the one
    unname() %>%                           # Take the raw string
    as.numeric()                           # Convert to number
}

get_reviews <- function(html){
  html %>%
    html_nodes('.review-info__body__text') %>% # The relevant tag
    html_text() %>%
    str_trim() %>% # Trims additional white space
    unlist()       # Converts the list into a vector
}

get_reviewer_names <- function(html){
  html %>%
    html_nodes('.consumer-info__details__name') %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

get_review_dates <- function(html){
  status <- html %>%
    html_nodes('time') %>%
    html_attrs() %>% # The status information is this time a tag attribute
    map(2) %>%       # Extracts the second element
    unlist()
  dates <- html %>%
    html_nodes('time') %>%
    html_attrs() %>%
    map(1) %>%
    ymd_hms() %>% # Using a lubridate function
                  # to parse the string into a datetime object
    unlist()
  # You combine the status and the date information to filter one via the other
  return_dates <- tibble(status = status, dates = dates) %>%
    filter(status == 'ndate') %>% # Only these are actual reviews
    pull(dates) %>%               # Selects and converts to vector
    as.POSIXct(origin = '1970-01-01 00:00:00') # Converts datetimes to POSIX objects
  # The lengths still occasionally do not line up. You then arbitrarily crop the dates to fit
  # (this can cause data imperfections; however, reviews on one page are generally close in time)
  length_reviews <- length(get_reviews(html))
  return_reviews <- if (length(return_dates) > length_reviews){
    return_dates[1:length_reviews]
  } else {
    return_dates
  }
  return_reviews
}

get_star_rating <- function(html){
  # The pattern we look for
  pattern <- 'star-rating-' %R% capture(DIGIT)
  vector <- html %>%
    html_nodes('.star-rating') %>%
    html_attr('class') %>%
    str_match(pattern) %>%
    map(1) %>%
    as.numeric()
  vector <- vector[!is.na(vector)]
  vector <- vector[3:length(vector)]
  vector
}

get_data_table <- function(html, company_name){
  # Extract the basic information from the HTML
  reviews <- get_reviews(html)
  reviewer_names <- get_reviewer_names(html)
  dates <- get_review_dates(html)
  ratings <- get_star_rating(html)
  # Minimum length
  min_length <- min(length(reviews), length(reviewer_names), length(dates), length(ratings))
  # Combine into a tibble
  combined_data <- tibble(reviewer = reviewer_names[1:min_length],
                          date = dates[1:min_length],
                          rating = ratings[1:min_length],
                          review = reviews[1:min_length])
  # Tag the individual data with the company name
  combined_data %>%
    mutate(company = company_name) %>%
    select(company, reviewer, date, rating, review)
}

get_data_from_url <- function(url, company_name){
  html <- read_html(url)
  get_data_table(html, company_name)
}

scrape_write_table <- function(url, company_name){
  # Read first page
  first_page <- read_html(url)
  # Extract the number of pages that have to be queried
  latest_page_number <- get_last_page(first_page)
  # Generate the target URLs
  list_of_pages <- str_c(url, '?page=', 1:latest_page_number)
  # Apply the extraction and bind the individual results back into one table,
  # which is then written as a tsv file into the working directory
  list_of_pages %>%
    map(get_data_from_url, company_name) %>% # Apply to all URLs
    bind_rows() %>%                          # Combines the tibbles into one tibble
    write_tsv(str_c(company_name, '.tsv'))   # Writes a tab separated file
}

# url <- 'http://www.trustpilot.com/review/www.amazon.com'
# scrape_write_table(url, 'amazon')
# amazon_table <- read_tsv('amazon.tsv')
```
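
The Trustpilot code above gives you the overall pattern to imitate: one small extractor function per field, a function that combines them into a tidy table for one page, and a wrapper that maps over all page URLs. A possible skeleton for the LinkedIn adaptation is sketched below; the CSS selectors (`.profile-link`, `.profile-name`, and so on) are placeholders that you must replace with the real ones found by inspecting the LinkedIn pages, and authentication and rate limiting are deliberately left out:

```{r, eval=FALSE}
# Sketch only: the selectors below are placeholders, not real LinkedIn class names
get_profile_links <- function(search_page_html){
  search_page_html %>%
    html_nodes('.profile-link') %>% # placeholder selector for the links to each profile
    html_attr('href')
}

get_profile_table <- function(profile_url){
  profile <- read_html(profile_url)
  # Assumes the education selectors return vectors of equal length (one entry per qualification);
  # the single name and job title are recycled across those rows
  tibble(
    name          = profile %>% html_node('.profile-name') %>% html_text() %>% str_trim(),
    job_title     = profile %>% html_node('.profile-headline') %>% html_text() %>% str_trim(),
    education_org = profile %>% html_nodes('.education-school') %>% html_text() %>% str_trim(),
    qualification = profile %>% html_nodes('.education-degree') %>% html_text() %>% str_trim(),
    dates         = profile %>% html_nodes('.education-dates') %>% html_text() %>% str_trim()
  )
}

# Map over the 55 search pages, collect the profile links, then scrape every profile
# innocent <- search_pages %>%
#   map(read_html) %>%
#   map(get_profile_links) %>%
#   unlist() %>%
#   map(get_profile_table) %>%
#   bind_rows()
```

Splitting `name` into forename and surname and `dates` into separate "from" and "to" columns (for example with `tidyr::separate()`) is left as part of the exercise.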