Web scraping, archiving, and search retrieval of monsters scraped from DND Beyond
I love Dungeons & Dragons. I'm also a data-loving Data Scientist. These two worlds were inevitably bound to collide .
For those who have never heard of the game, Dungeons and Dragons (D&D) is a fantasy tabletop role-playing game that was first published in 1974, and has a tremendous amount of data associated with the game. Essentially, the characters band together to form a party and undertake adventures, ranging from finding a villager's missing prize-winning piglet to stopping the revival of an ancient Great Old One from the Far Realm that will bring about unspeakable horrors on the land. A Dungeon Master serves as the game's referee and storyteller, creating and maintaining the very world in which the adventures occur.
The large amounts of data is associated with the world-building aspect of the game, and is released by the official creators and players alike. This content shapes the encounters the players face throughout the campaign, including the monsters and enemies that inhabit the world created by the Dungeon Master.
In addition to the plethora of official books available for players, there is a fantastic site for D&D content called DND Beyond that catalogues information on characters, monsters, treasures, and spells. Unfortunately, there is no API available, but it is apparently in the works. I thought it would be a great exercise to master web scraping in R by playing around with the data on monster statistics from one of my favourite pastimes.
- Structure of the scraping task
- Scrape individual monster page URLS
- Using
RSelenium
to access pages behind login - Write a function to scrape each individual pages
- Putting it all together
If you visit https://www.dndbeyond.com/monsters, you'll see the first of many pages of monster listings. Each monster name is a link to the individual monster page that contains extensive details about the monster's statistics, abilities, and lore.
Some of the monsters, such as the Adult Black Dragon are free to view. Other monsters are not included in the Basic Rules Set and can only viewed if you have purchased the digital books associated with the monster.
library(RSelenium)
library(rvest)
library(tidyverse)
library(knitr)
library(kableExtra)
The URLs for the pages worth of monsters seem to follow a consistent format: https://www.dndbeyond.com/monsters?page=NUM
where NUM
ranges from 1 to the last page. We first need to get the total number of pages by locating the last page.
monsters <- read_html("https://www.dndbeyond.com/monsters")
num_pages <- monsters %>%
html_nodes(".b-pagination-item") %>%
html_text() %>%
as.integer() %>%
max(na.rm = TRUE)
Details tutorials for the rvest
can be found here.
Now that we know there are 53 pages worth of monsters (num_pages
), we can write a function that loops through and extracts the individual monster URLs on each page
get_monster_url <- function(url) {
page <- read_html(url)
abs_links <- page %>%
html_nodes(".link") %>%
html_attr("href") %>%
str_subset("/monsters/") %>%
str_replace_all("/monsters/", "https://www.dndbeyond.com/monsters/")
}
The get_monster_links
function takes the URL of a page of results as an input (such as https://www.dndbeyond.com/monsters?page=2). Now we can loop through all the pages of results to get the hundreds of pages of individual monsters.
all_monster_urls <- lapply(seq_len(num_pages), function(i) {
url = paste0("https://www.dndbeyond.com/monsters?page=", i)
get_monster_url(url)
}) %>% unlist
Now that we looped through pages of tables to get the individual monster URLs, we are almost ready to scrape the details. Unfortunately, most of the monster pages can only be viewed if you are logged on and have paid for the corresponding digital books. DND Beyond uses Twitch for authentication which involves a redirect. Luckily, we can use the RSelenium
package to automate web browsing.
We can start an automatic browsing session using the following:
rD <- rsDriver(browser = "chrome")
remdr <- rD[['client']]
Now we can start navigating the browser.
url <- "https://www.dndbeyond.com/login"
remdr$navigate(url)
remdr$findElement(using = "css selector", value = ".twitch-button")$clickElement()
# Uncomment below and replace USERNAME/PASSWORD to automatically login
# remdr$findElement(using = "css selector", value = ".text")$sendKeysToElement(list("USERNAME", key = "tab", "PASSWORD", "\uE007"))
Sys.sleep(1)
remdr$findElement(using = 'css selector', value = '.js-authorize-text')$clickElement()
Aaand we're in! Now that we have access to our digital books and associated monsters, we can start to scrape all the juicy details.
Click here for complete `scrape_monster_page` function
scrape_monster_page <- function(page){
# Overview
monster_name <- select_text(page, ".mon-stat-block__name-link")
monster_portrait <- page %>% html_node('.detail-content') %>% html_node('img') %>% html_attr('src')
meta_text <- select_text(page, ".mon-stat-block__meta")
meta_split1 <- str_split(meta_text, ", ")[[1]]
meta_split2 <- str_split(meta_split1, " ")[[1]]
monster_size <- tools::toTitleCase(meta_split2[1])
monster_type <- tools::toTitleCase(meta_split2[2])
monster_alignment <- tools::toTitleCase(meta_split1[2])
monster_overview <- tibble(Label = c('Name', "Size", "Type", "Alignment"),
Content = c(monster_name, monster_size, monster_type, monster_alignment))
# Monster Attributes
monster_attributes_node <- page %>% html_nodes('.mon-stat-block__attribute')
monster_attributes <- do.call(rbind, lapply(monster_attributes_node, function(node) {
attribute_label <- node %>% select_text(".mon-stat-block__attribute-label")
attribute_value <- node %>% select_text(".mon-stat-block__attribute-data-value")
attribute_extra <- node %>% select_text(".mon-stat-block__attribute-data-extra") %>% replace_if_empty("")
attribute_content <- paste(attribute_value, attribute_extra)
tibble(Label = attribute_label, Content = attribute_content)
}))
# Ability Score
ability_score_node <- page %>% html_nodes(".ability-block__stat")
monster_ability_scores <- do.call(rbind, lapply(ability_score_node, function(node) {
ability <- node %>% select_text(".ability-block__heading")
score <- node %>% select_text('.ability-block__score') %>% as.integer()
modifier <- node %>% select_text('.ability-block__modifier')
score_content <- paste(score, modifier)
tibble(Label = ability, Content = score_content)
}))
#Tidbits
tidbits_node <- page %>% html_nodes('.mon-stat-block__tidbits')
tidbits <- do.call(rbind, lapply(tidbits_node, function(node) {
label <- node %>% select_text(".mon-stat-block__tidbit-label")
value <- node %>% select_text(".mon-stat-block__tidbit-data")
tibble(Label = label, Content = value)
}))
#Description
description_node <- page %>% html_nodes('.mon-stat-block__description-block')
description_sections <- do.call(rbind, lapply(description_node, function(node) {
header <- node %>% select_text(".mon-stat-block__description-block-heading") %>% replace_if_empty("Misc Features")
content <- node %>% select_text(".mon-stat-block__description-block-content p")
tibble(Label = header, Content = content)
}))
# Extra Info
extra_info <- page %>% select_text(".mon-details__description-block-content")
# Extended Info
extended_info <- tibble(Label = c("Tags", "Environment", "Source"),
Content = c(page %>% select_text(".monster-tag") %>% paste(collapse = ",") %>% replace_if_empty("Unknown"),
page %>% select_text(".environment-tag") %>% paste(collapse = ",") %>% replace_if_empty("Unknown"),
page %>% select_text(".monster-source") %>% paste(collapse = ",") %>% replace_if_empty("Unknown")))
tibble(Portrait = monster_portrait,
Name = monster_name,
Overview = list(monster_overview),
Attributes = list(monster_attributes),
Ability_Scores = list(monster_ability_scores),
Tidbits = list(tidbits),
Description= list(description_sections),
Extra_Info = (extra_info),
Extended_info = list(extended_info))
}
The main goal of this project is to write the scrape_monster_page
function to scrape the data from each individual monster page. Click the dropdown link above for the full code or view it on Github. There are three main principles to pay attention to when writing a similar function.
If we wanted to pull the values for the Armor Class, Hit Points, and Speed, we can inspect the page source and find that specific block.
<div class="mon-stat-block__attributes">
<div class="mon-stat-block__attribute">
<span class="mon-stat-block__attribute-label">Armor Class</span>
<span class="mon-stat-block__attribute-value">
<span class="mon-stat-block__attribute-data-value">
19
</span>
<span class="mon-stat-block__attribute-data-extra">
(Natural Armor)
</span>
</span>
</div>
<div class="mon-stat-block__attribute">
<span class="mon-stat-block__attribute-label">Hit Points</span>
<span class="mon-stat-block__attribute-data">
<span class="mon-stat-block__attribute-data-value">
195
</span>
<span class="mon-stat-block__attribute-data-extra">
(17d12 + 85)
</span>
</span>
</div>
<div class="mon-stat-block__attribute">
<span class="mon-stat-block__attribute-label">Speed</span>
<span class="mon-stat-block__attribute-data">
<span class="mon-stat-block__attribute-data-value">
40 ft., fly 80 ft., swim 40 ft.
</span>
</span>
</div>
</div>
With the .mon-stat-block__attributes
class selected, we can see that each line is subdivided by spans with the classes mon-stat-block__attribute-label
, mon-stat-block__attribute-data
, and sometimes mon-stat-block__attribute-data-extra
. However, it is not as simple as scraping the each attribute
class and we'll see why below.
Using the above example of the Armor Class, Hit Points, and Speed, if we arbitrarily select the attribute
class for each attribute, this is what we get:
page %>% html_nodes(".mon-stat-block__attribute") %>% html_text()
[1] "\n Armor Class\n \n \n 19\n \n \n \n (Natural Armor) \n \n \n \n "
[2] "\n Hit Points\n \n \n 195\n \n \n (17d12 + 85)\n \n \n "
[3] "\n Speed\n \n \n 40 ft., fly 80 ft., swim 40 ft.\n \n \n \n "
While the character vector does contain all the information, it is very messy. Instead, we should access the three mon-stat-block__attribute
sections separately and rbind
them together.
page %>% html_nodes(".mon-stat-block__attribute-label") %>% html_text()
[1] "Armor Class" "Hit Points" "Speed"
page %>% html_nodes(".mon-stat-block__attribute-data-value") %>% html_text() %>% trimws()
[1] "19" "195" "40 ft., fly 80 ft., swim 40 ft."
page %>% html_nodes(".mon-stat-block__attribute-data-extra") %>% html_text() %>% trimws()
[1] "(Natural Armor)" "(17d12 + 85)"
As we noted before, the mon-stat-block__attribute-data-extra
class is not present for every attribute (Speed
does not have an extra value in this case).
The end result we want looks like this:
monster_attributes
# A tibble: 3 x 2
Label Content
<chr> <chr>
1 Armor Class 19 (Natural Armor)
2 Hit Points 195 (17d12 + 85)
3 Speed "40 ft., fly 80 ft., swim 40 ft. "
First, we need to define some helper functions to avoid redundancies.
select_text
to minimize the repetitivepage %>% html_nodes %>% html_text
select_text <- function(xml, selector, trim = TRUE) {
text <- xml %>%
html_nodes(selector) %>%
html_text %>% trimws()
text
}
replace_if_empty
to replace empty text with a string of choice
replace_if_empty <- function(text, replacement) {
if (length(text)==0) {text <- replacement}
else if (text == "") {text <- replacement}
text
}
Now we can run the loop shown below to generate the tidy monster_attributes
table:
# Monster Attributes: Armor Class, HP, and Speed
monster_attributes_node <- page %>% html_nodes('.mon-stat-block__attribute')
monster_attributes <- do.call(rbind, lapply(monster_attributes_node, function(node) {
attribute_label <- node %>% select_text(".mon-stat-block__attribute-label")
attribute_value <- node %>% select_text(".mon-stat-block__attribute-data-value")
attribute_extra <- node %>% select_text(".mon-stat-block__attribute-data-extra") %>% replace_if_empty("")
attribute_content <- paste(attribute_value, attribute_extra)
tibble(Label = attribute_label, Content = attribute_content)
}))
- Note that the
mon-stat-block__attribute-data-value
andmon-stat-block__attribute-data-extra
are merged into one column (Content
).
Nearly all of the code on the scrape_monster_page
function is a repeat of the aforementioned principles, and can be used for the majority of scraping tasks that use rvest.
Add a little knitr::kable & kableExtra
magic for HTML tables and viola!
Click here for the complete `find_monster` function to search for a monster by name and output an HTML Table
tibble_length <- function(tibs) {
(monster_table %>% .[[tibs]] %>% unlist %>% length)/2
}
find_monster <- function(monster){
monster_table <- monster_info %>% filter(Name == monster)
monster_portrait <- monster_table$Portrait
portrait_row <- tibble("Label" = " ", "Content" = sprintf("<img src=%s>)", monster_portrait))
monster_data <- unlist(monster_table[,c(3:6,9,7)], recursive = FALSE)
monster_data <- do.call("rbind", monster_data)
monster_data <- rbind(portrait_row, monster_data[2:nrow(monster_data),])
monster_summary <- kable(monster_data, row.names = F, booktabs=T, col.names=c("", "")) %>%
kable_styling("striped") %>%
group_rows(monster_table$Name, 1, 4) %>%
group_rows(index = c(" " = 4,
"Attributes" = tibble_length("Attributes"),
"Ability Score" = tibble_length("Ability_Scores"),
"Tidbits" = tibble_length("Tidbits"),
"Tags" = tibble_length("Extended_info"),
"Description" = tibble_length('Description')))
description <- monster_table[[8]] %>%
str_split("\n") %>%
simplify() %>%
str_replace("Lair Actions", "<b>Lair Actions</b>") %>%
str_replace("Regional Effects", "<b>Regional Effects</b>") %>%
kable(col.names=c("Extended Description"), escape = F, booktabs = T, format="html") %>%
kable_styling('striped', full_width = F)
print(monster_summary)
print(description)
}
find_monster("Adult Black Dragon")
Extended Description |
---|
Black dragons dwell in swamps on the frayed edges of civilization. A black dragon’s lair is a dismal cave, grotto, or ruin that is at least partially flooded, providing pools where the dragon rests, and where its victims can ferment. The lair is littered with the acid-pitted bones of previous victims and the fly-ridden carcasses of fresh kills, watched over by crumbling statues. Centipedes, scorpions, and snakes infest the lair, which is filled with the stench of death and decay. |
Lair Actions |
On initiative count 20 (losing initiative ties), the dragon takes a lair action to cause one of the following effects; the dragon can’t use the same effect two rounds in a row: |
Pools of water that the dragon can see within 120 feet of it surge outward in a grasping tide. Any creature on the ground within 20 feet of such a pool must succeed on a DC 15 Strength saving throw or be pulled up to 20 feet into the water and knocked prone. |
A cloud of swarming insects fills a 20-foot-radius sphere centered on a point the dragon chooses within 120 feet of it. The cloud spreads around corners and remains until the dragon dismisses it as an action, uses this lair action again, or dies. The cloud is lightly obscured. Any creature in the cloud when it appears must make on a DC 15 Constitution saving throw, taking 10 (3d6) piercing damage on a failed save, or half as much damage on a successful one. A creature that ends its turn in the cloud takes 10 (3d6) piercing damage. |
Magical darkness spreads from a point the dragon chooses within 60 feet of it, filling a 15-foot-radius sphere until the dragon dismisses it as an action, uses this lair action again, or dies. The darkness spreads around corners. A creature with darkvision can’t see through this darkness, and nonmagical light can’t illuminate it. If any of the effect’s area overlaps with an area of light created by a spell of 2nd level or lower, the spell that created the light is dispelled. |
Regional Effects |
The region containing a legendary black dragon’s lair is warped by the dragon’s magic, which creates one or more of the following effects: |
The land within 6 miles of the lair takes twice as long as normal to traverse, since the plants grow thick and twisted, and the swamps are thick with reeking mud. |
Water sources within 1 mile of the lair are supernaturally fouled. Enemies of the dragon that drink such water regurgitate it within minutes. |
Fog lightly obscures the land within 6 miles of the lair. |
If the dragon dies, vegetation remains as it has grown, but other effects fade over 1d10 days. |
Happy scraping!