Parse html information within esummary records
The summary records returned by some NCBI databases, notably SRA, contain data "dumped" into the record as escaped html. As an example, here is a record describing sequencing data from a metagenomic study.
library(rentrez)
bugs <- entrez_summary(db="biosample", id="2886856", rettype="JSON")
substr(bugs$sampledata, 1, 100)
[1] "<BioSample submission_date=\"2014-06-26T08:08:36.203\" last_update=\"2015-08-10T23:52:20.737\" public"
Sometimes this data is quite useful. In this case, the dumped html contains metadata about the experiment from which these sequences were generated. In order to access this information we first need to "decode" the encoded html entities (e.g. each &lt; that represents a < needs to be converted), then parse the result. We can do that using the textutils function HTMLdecode to decode the text and the XML library to parse it. (Note that the package xml2 also provides html parsing and processing functions, and may provide a more straightforward syntax for these tasks.)
raw_html <- textutils::HTMLdecode(bugs$sampledata)
parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes=TRUE)
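To see what the decoding step does in isolation, here is a minimal sketch on a tiny escaped fragment. The fragment is made up for illustration and is not taken from a real NCBI record; asText=TRUE is passed only to make it explicit that the input is a string and not a file name.

```r
# Made-up escaped fragment (hypothetical, not real NCBI output)
escaped <- "&lt;Attribute attribute_name=&quot;depth&quot;&gt;0.05&lt;/Attribute&gt;"

# Turn the html entities back into literal markup characters
decoded <- textutils::HTMLdecode(escaped)

# Parse the decoded string; the html parser lower-cases tag names,
# so <Attribute> is queried as "attribute"
doc <- XML::htmlTreeParse(decoded, asText=TRUE, useInternalNodes=TRUE)
nodes <- XML::getNodeSet(doc, "//attribute")
depth_text <- XML::xmlValue(nodes[[1]])
depth_text
```

Until the entities are decoded there are no tags for the parser to find, which is why the order of the two steps matters.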
Most of the data in the html dump is stored in attribute tags. You can see all of the attribute tags in the file using an XPATH query (here I am just showing the first 6).
head(XML::xpathApply(parsed_html, "//attribute"))
[[1]]
<attribute attribute_name="collection_date" harmonized_name="collection_date" display_name="collection date">7/24/12</attribute>
[[2]]
<attribute attribute_name="" public="">y</attribute>
[[3]]
<attribute attribute_name="tot_org_carb" harmonized_name="tot_org_carb" display_name="total organic carbon">16.58</attribute>
[[4]]
<attribute attribute_name="sample_id">1257471</attribute>
[[5]]
<attribute attribute_name="common_name">soil metagenome</attribute>
[[6]]
<attribute attribute_name="samp_size" harmonized_name="samp_size" display_name="sample size">0.1 g</attribute>
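If we want every attribute at once, not just a printed list of nodes, one option is to collect the names and values into a named character vector. The sketch below assumes each tag carries an attribute_name; the decoded string is a made-up stand-in for a real sampledata field, and the variable names are only illustrative.

```r
# Made-up stand-in for an already decoded sampledata string
decoded <- paste0(
    '<BioSample><Attributes>',
    '<Attribute attribute_name="collection_date">7/24/12</Attribute>',
    '<Attribute attribute_name="samp_size">0.1 g</Attribute>',
    '<Attribute attribute_name="depth">0.05</Attribute>',
    '</Attributes></BioSample>'
)
doc <- XML::htmlTreeParse(decoded, asText=TRUE, useInternalNodes=TRUE)

# One XPATH pass for the names, one for the text values
attr_names <- XML::xpathSApply(doc, "//attribute", XML::xmlGetAttr, "attribute_name")
attr_values <- XML::xpathSApply(doc, "//attribute", XML::xmlValue)

# Named vector, so values can be looked up by attribute name
attrs <- setNames(attr_values, attr_names)
attrs[["depth"]]
```

All values come back as text, so numeric fields still need as.numeric before they are used in calculations.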
XPATH has a predicate syntax for selecting only the attribute tags with a particular attribute_name. Here we extract just the value of the latitude field.
XML::xmlValue(parsed_html[["//attribute[@attribute_name='latitude']"]])
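For comparison, here is a rough sketch of the same lookup with xml2, which was mentioned earlier as an alternative. The decoded string is again made up for illustration. One convenient property of xml2::xml_find_first is that a query with no match yields NA when passed to xml2::xml_text, so a missing field does not stop the script.

```r
# Made-up stand-in for an already decoded sampledata string
decoded <- paste0(
    '<BioSample><Attributes>',
    '<Attribute attribute_name="latitude">40.79108</Attribute>',
    '<Attribute attribute_name="longitude">-73.96178</Attribute>',
    '</Attributes></BioSample>'
)

# read_html uses an html parser, so tag names are lower-cased here too
doc <- xml2::read_html(decoded)

# Same predicate syntax as before: select by attribute_name
lat_node <- xml2::xml_find_first(doc, "//attribute[@attribute_name='latitude']")
lat <- xml2::xml_text(lat_node)

# A field that is absent from this fragment
missing_node <- xml2::xml_find_first(doc, "//attribute[@attribute_name='depth']")
missing_value <- xml2::xml_text(missing_node)

lat
missing_value
```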
The above is a little long-winded. If we want to extract particular html-encoded data from a set of summary records, we should wrap these steps in a function that can then be applied to each record.
sample_data <- function(summ_rec){
    # Decode the escaped html in the record, then parse it
    raw_html <- textutils::HTMLdecode(summ_rec$sampledata)
    parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes=TRUE)
    # Pull out the text of each attribute tag we are interested in
    lat <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='latitude']"]])
    lon <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='longitude']"]])
    depth <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='depth']"]])
    # The values are stored as text, so convert them to numbers
    list(NCBI_ID = summ_rec$uid,
         latitude = as.numeric(lat),
         longitude = as.numeric(lon),
         depth = as.numeric(depth))
}
sample_data(bugs)
$NCBI_ID
[1] "2886856"
$latitude
[1] 40.79108
$longitude
[1] -73.96178
$depth
[1] 0.05
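To apply this idea to a set of records, something like the following sketch may be a useful starting point. It is a variant written for illustration, not part of rentrez: attr_value, sample_data_safe and fake_record are hypothetical helper names, and the two records are made up so the example runs without a network connection. In real use the list would come from entrez_summary called with several ids. The variant returns NA for a field that a record lacks, on the assumption that not every sample reports every attribute.

```r
# Hypothetical helper: text of the first matching attribute tag, or NA if none
attr_value <- function(doc, name){
    xpath <- sprintf("//attribute[@attribute_name='%s']", name)
    nodes <- XML::getNodeSet(doc, xpath)
    if (length(nodes) == 0) NA_character_ else XML::xmlValue(nodes[[1]])
}

# Variant of sample_data that returns a one-row data frame per record
sample_data_safe <- function(summ_rec){
    raw_html <- textutils::HTMLdecode(summ_rec$sampledata)
    doc <- XML::htmlTreeParse(raw_html, asText=TRUE, useInternalNodes=TRUE)
    data.frame(NCBI_ID = summ_rec$uid,
               latitude = as.numeric(attr_value(doc, "latitude")),
               longitude = as.numeric(attr_value(doc, "longitude")),
               depth = as.numeric(attr_value(doc, "depth")),
               stringsAsFactors = FALSE)
}

# Made-up records standing in for entrez_summary() output
fake_record <- function(uid, attrs){
    tags <- sprintf('<Attribute attribute_name="%s">%s</Attribute>', names(attrs), attrs)
    body <- paste0(tags, collapse="")
    list(uid = uid,
         sampledata = paste0("<BioSample><Attributes>", body, "</Attributes></BioSample>"))
}
records <- list(
    fake_record("1", c(latitude="40.79", longitude="-73.96", depth="0.05")),
    fake_record("2", c(latitude="51.5", longitude="-0.12"))  # no depth reported
)

# One row per record, bound into a single data frame
result <- do.call(rbind, lapply(records, sample_data_safe))
result
```

Returning a data frame per record makes the combined result easy to filter or plot, at the cost of fixing the set of columns in advance.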