Parse html information within esummary records

Some summary records include a "data dump" in html format

The summary records returned by some NCBI databases, notably SRA, contain a large amount of data "dumped" into the record as escaped html. As an example, here is a record describing sequencing data from a metagenomic study.

library(rentrez)

# Fetch one summary record, then look at the start of its escaped html field
bugs <- entrez_summary(db="biosample", id="2886856", rettype="JSON")
substr(bugs$sampledata, 1, 100)

[1] "&lt;BioSample submission_date=\"2014-06-26T08:08:36.203\" last_update=\"2015-08-10T23:52:20.737\" public"

Accessing this data

Sometimes this data is quite useful. In this case, the dumped html contains metadata about the experiment from which these sequences were generated. In order to access this information we first need to "decode" the encoded html entities (e.g. each &lt; that represents a < needs to be converted back), then parse the result. We can do that using the function HTMLdecode from the textutils package, and the XML package to parse the resulting text. (Note that the package xml2 also provides html parsing and processing functions, and may provide a more straightforward syntax for these tasks.)

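# Turn escaped entities such as &lt; back into literal characters
raw_html <- textutils::HTMLdecode(bugs$sampledata)
# Parse the decoded text into an internal document that supports XPath queries
parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes=TRUE)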
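
For comparison, here is a minimal sketch of the same decode-and-parse step using only xml2. It is an illustration rather than something taken from the rentrez documentation. The `escaped` string is a short hypothetical stand-in for `bugs$sampledata`. Wrapping the text in a throwaway `<x>` element is one common way to have the html parser decode the entities.

```r
library(xml2)

# Hypothetical stand-in for bugs$sampledata (not the real record)
escaped <- '&lt;biosample&gt;&lt;attribute attribute_name="latitude"&gt;40.79108&lt;/attribute&gt;&lt;/biosample&gt;'

# Decode the entities: parse the escaped text as the content of a dummy
# element, then read that element's text back out
decoded <- xml_text(read_html(paste0("<x>", escaped, "</x>")))

# Parse the decoded markup and query it with XPath
doc <- read_html(decoded)
lat <- xml_text(xml_find_first(doc, "//attribute[@attribute_name='latitude']"))
lat
```

With this approach the decoding, parsing and XPath lookup all go through one package, at the cost of parsing the text twice.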

Most of the data in the html dump is stored in attribute elements. You can see all of these elements in the document using an XPath query (here I am just showing the first 6).

head(XML::xpathApply(parsed_html, "//attribute"))

[[1]]
<attribute attribute_name="collection_date" harmonized_name="collection_date" display_name="collection date">7/24/12</attribute> 

[[2]]
<attribute attribute_name="" public="">y</attribute> 

[[3]]
<attribute attribute_name="tot_org_carb" harmonized_name="tot_org_carb" display_name="total organic carbon">16.58</attribute> 

[[4]]
<attribute attribute_name="sample_id">1257471</attribute> 

[[5]]
<attribute attribute_name="common_name">soil metagenome</attribute> 

[[6]]
<attribute attribute_name="samp_size" harmonized_name="samp_size" display_name="sample size">0.1 g</attribute>

XPath has a special syntax (a predicate in square brackets) for selecting only the attribute elements with a particular name. Here we extract just the value of the latitude field.

XML::xmlValue(parsed_html[["//attribute[@attribute_name='latitude']"]])
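
If you want every named field rather than a single one, the same idea can be generalised. The sketch below is an assumption-laden illustration: `raw_html` is a small hand-written document standing in for the decoded sample data (the three values are the ones reported for this record on this page), and the result is a named character vector of all attribute elements that carry an `attribute_name`.

```r
# Hypothetical stand-in for the decoded sample data
raw_html <- '<biosample><attributes>
  <attribute attribute_name="latitude">40.79108</attribute>
  <attribute attribute_name="longitude">-73.96178</attribute>
  <attribute attribute_name="depth">0.05</attribute>
</attributes></biosample>'
parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes = TRUE, asText = TRUE)

# Select every attribute element that has an attribute_name
nodes <- XML::getNodeSet(parsed_html, "//attribute[@attribute_name]")

# Use the element text as the values and attribute_name as the names
values <- sapply(nodes, XML::xmlValue)
names(values) <- sapply(nodes, XML::xmlGetAttr, "attribute_name")

values[["latitude"]]
```

The values come back as character strings, so convert them with `as.numeric` where a number is needed.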

Wrapping it up into a function

The above is a little long-winded. If we want to extract particular html-encoded data from a set of summary records, we should write a function that we can then apply to each record in the set.

sample_data <- function(summ_rec){
  # Decode the escaped html, then parse it so it can be queried with XPath
  raw_html <- textutils::HTMLdecode(summ_rec$sampledata)
  parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes=TRUE)
  # Pull out the text of the three attribute elements we care about
  lat <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='latitude']"]])
  lon <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='longitude']"]])
  depth <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='depth']"]])
  # Return the record ID with the values converted from text to numbers
  list(NCBI_ID = summ_rec$uid, latitude=as.numeric(lat), longitude=as.numeric(lon), depth=as.numeric(depth))
}

sample_data(bugs)

$NCBI_ID
[1] "2886856"

$latitude
[1] 40.79108

$longitude
[1] -73.96178

$depth
[1] 0.05
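
As a rough sketch of applying the function to a set of records, the block below runs `sample_data` over a list and binds the results into a data frame. To keep it runnable without a network call it uses hypothetical stand-in records built by a made-up helper, `make_rec`, with invented IDs and values. In practice the list would come from `entrez_summary` called with several IDs, assuming each record carries the `uid` and `sampledata` fields and all three attributes.

```r
# sample_data() as defined above, repeated so this block runs on its own
sample_data <- function(summ_rec){
  raw_html <- textutils::HTMLdecode(summ_rec$sampledata)
  parsed_html <- XML::htmlTreeParse(raw_html, useInternalNodes=TRUE)
  lat <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='latitude']"]])
  lon <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='longitude']"]])
  depth <- XML::xmlValue(parsed_html[["//attribute[@attribute_name='depth']"]])
  list(NCBI_ID = summ_rec$uid, latitude=as.numeric(lat), longitude=as.numeric(lon), depth=as.numeric(depth))
}

# Hypothetical helper: build a fake record whose sampledata is escaped markup
make_rec <- function(uid, lat, lon, depth){
  markup <- sprintf(
    '<BioSample><Attribute attribute_name="latitude">%s</Attribute><Attribute attribute_name="longitude">%s</Attribute><Attribute attribute_name="depth">%s</Attribute></BioSample>',
    lat, lon, depth)
  escaped <- gsub(">", "&gt;", gsub("<", "&lt;", markup, fixed = TRUE), fixed = TRUE)
  list(uid = uid, sampledata = escaped)
}

# Invented records standing in for the result of a multi-ID entrez_summary call
recs <- list(make_rec("A1", 40.1, -73.9, 0.05),
             make_rec("A2", 41.2, -72.5, 0.10))

# One row per record, combined into a single data frame
rows <- lapply(recs, function(r) as.data.frame(sample_data(r), stringsAsFactors = FALSE))
samples <- do.call(rbind, rows)
samples
```

Converting each result to a one-row data frame before binding only works when every field is present. A record missing one of the attributes would need to be handled separately.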
