Explain use-case of dataset testing
katrinleinweber committed Sep 3, 2018
1 parent 9d9b2e5 commit d3db77d
Showing 7 changed files with 124 additions and 4 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: BacDiveR
Title: A Programmatic Interface For BacDive, The DSMZ's Bacterial Diversity Metadatabase
-Version: 0.5.1
+Version: 0.6.0
Authors@R: person("Katrin", "Leinweber", email = "katrin.leinweber@tib.eu",
role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-5135-5758"))
11 changes: 11 additions & 0 deletions NEWS.md
@@ -11,6 +11,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed
### Security

## BacDiveR 0.6.0

### Added

- The vignette [Logic-Checking BacDive Datasets](https://tibhannover.github.io/BacDiveR/articles/logic-checking-bacdive-datasets.html)

### Changed

- `retrieve_search_results()` now returns `NULL` when no results are found, in
order to ease integration of datasets into `testthat` tests.

## BacDiveR 0.5.1

### Fixed
8 changes: 5 additions & 3 deletions R/retrieve_search_results.R
@@ -19,8 +19,10 @@ retrieve_search_results <- function(queryURL)
   if (!grepl(pattern = paste0("$", download_param), x = queryURL))
     queryURL <- paste0(queryURL, download_param)

-  result_IDs <-
-    strsplit(x = RCurl::getURL(queryURL), split = "\\n")[[1]]
+  payload <- RCurl::getURL(queryURL)

-  aggregate_datasets(result_IDs, from_IDs = TRUE)
+  if (grepl("^[[:digit:]]", payload))
+    aggregate_datasets(strsplit(x = payload, split = "\\n")[[1]], from_IDs = TRUE)
+  else if (grepl("^<!DOCTYPE", payload))
+    NULL # needed for logic-checking datasets, see vignette
 }
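
The branching introduced in this hunk can be tried without network access. The following is a minimal sketch, not the package's code: `classify_payload()` is a hypothetical helper that receives the already-downloaded text, it does not call `aggregate_datasets()`, and the sample payloads are made up for illustration.

```r
# Hypothetical stand-in for the branch logic in retrieve_search_results().
# It receives the response body as a string instead of downloading it.
classify_payload <- function(payload) {
  if (grepl("^[[:digit:]]", payload)) {
    # An ID list: one BacDive-ID per line.
    strsplit(x = payload, split = "\n", fixed = TRUE)[[1]]
  } else if (grepl("^<!DOCTYPE", payload)) {
    # An HTML page instead of an ID list is treated as "no results".
    NULL
  } else {
    # Assumption: anything else is unexpected and worth flagging.
    warning("Unrecognised payload; returning NULL.")
    NULL
  }
}

classify_payload("131115\n139987")         # a character vector of two IDs
classify_payload("<!DOCTYPE html><html>")  # NULL
```

In the committed function, a payload that matches neither pattern falls through both branches, so the function returns `NULL` invisibly in that case as well. The explicit `else` in the sketch is one way to make that case visible.
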
8 changes: 8 additions & 0 deletions tests/testthat/test-retrieve_search_results.R
@@ -12,3 +12,11 @@ test_that("downloading a dataset from an 'advanced search' URL works", {
expect_equal(Millers_strains[[1]], "Borrelia mayonii")
expect_equal(Millers_strains[[2]], "Bacillus wiedmannii")
})


test_that("Inconsistent datasets get corrected", {
inconsistent_data <- retrieve_search_results(
"https://bacdive.dsmz.de/advsearch?advsearch=search&site=advsearch&searchparams[20][contenttype]=text&searchparams[20][typecontent]=contains&searchparams[20][searchterm]=Sea+of+Japan&searchparams[17][searchterm]=Europe")

expect_false(is.null(inconsistent_data))
})
Binary file added vignettes/BacDive-geo-logic-fault.png
23 changes: 23 additions & 0 deletions vignettes/BacDive.bib
@@ -22,3 +22,26 @@ @article{BD16
doi = {10.1093/nar/gkv983},
URL = {https://academic.oup.com/nar/article/44/D1/D581/2503137}
}

@Article{TT,
author = {Hadley Wickham},
title = {testthat: Get Started with Testing},
journal = {The R Journal},
year = {2011},
volume = {3},
pages = {5--10},
url = {https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf},
}

@book{T,
author = {Hadley Wickham},
langid = {english},
location = {{Sebastopol, CA}},
title = {R {{Packages}}: {{Organize}}, {{Test}}, {{Document}}, and {{Share Your Code}}},
edition = {1st edition},
isbn = {978-1-4919-1059-7},
url = {http://r-pkgs.had.co.nz/},
publisher = {{O'Reilly Media}},
date = {2015-04-13}
}

76 changes: 76 additions & 0 deletions vignettes/logic-checking-bacdive-datasets.Rmd
@@ -0,0 +1,76 @@
---
title: "Logic-Checking BacDive Datasets"
author: "Katrin Leinweber"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Logic-Checking BacDive Datasets}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: inline
bibliography: BacDive.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

### Example of a data inconsistency

Just as the correctness of data analysis code should be tested automatically, the
consistency of data should be evaluated and monitored as well. Using [BacDive's advanced search](https://bacdive.dsmz.de/AdvSearch)
and [BacDiveR's `retrieve_search_results()`](https://tibhannover.github.io/BacDiveR/reference/retrieve_search_results.html),
several examples of geographic inconsistencies have been found. Presumably due to
an overly strict location-to-country-to-continent mapping, several samples collected
from seas neighbouring Russia (like the [Sea of Japan](https://bacdive.dsmz.de/advsearch?site=advsearch&searchparams%5B20%5D%5Bcontenttype%5D=text&searchparams%5B20%5D%5Btypecontent%5D=contains&searchparams%5B20%5D%5Bsearchterm%5D=Sea+of+Japan&searchparams%5B100%5D%5Bcontenttype%5D=text&searchparams%5B100%5D%5Btypecontent%5D=contains&searchparams%5B100%5D%5Bsearchterm%5D=&searchparams%5B17%5D%5Bsearchterm%5D=Europe&advsearch=search))
were assigned to Europe.

![Two datasets with a geo-logic fault (pun intended)](BacDive-geo-logic-fault.png)

While one may debate where exactly the border between Asia and Europe runs through Russia,
it is clear that its eastern shoreline is located well within Asia. These and
other datasets with East Russian locations have been reported to the BacDive team
and a portion of those was corrected in [BacDive's 04.07.2018 release](https://bacdive.dsmz.de/news).

```{r data}
library(BacDiveR)
inconsistent_data <- retrieve_search_results(
"https://bacdive.dsmz.de/advsearch?advsearch=search&site=advsearch&searchparams[20][contenttype]=text&searchparams[20][typecontent]=contains&searchparams[20][searchterm]=Sea+of+Japan&searchparams[17][searchterm]=Europe"
)
```

As long as this specific inconsistency is not fixed, the above should display:
`Data download in progress for BacDive-IDs: 131115 139987`.
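
The query URL used in the `data` chunk can also be assembled from its parts, which makes it easier to vary the search terms. This is a minimal sketch: `build_geo_query()` is a hypothetical helper, not part of BacDiveR, and the field numbers 20 and 17 are simply copied from that URL. Their meaning (location text and continent) is inferred from this vignette, not from BacDive documentation.

```r
# Hypothetical helper: assembles an advanced-search URL from a location
# substring and a continent name. Field numbers 20 and 17 are taken verbatim
# from the URL used in the `data` chunk (an assumption about BacDive's form).
build_geo_query <- function(location, continent) {
  params <- c(
    "advsearch=search",
    "site=advsearch",
    "searchparams[20][contenttype]=text",
    "searchparams[20][typecontent]=contains",
    # Spaces in the search term are encoded as "+", as in the original URL.
    paste0("searchparams[20][searchterm]=", gsub(" ", "+", location, fixed = TRUE)),
    paste0("searchparams[17][searchterm]=", continent)
  )
  paste0("https://bacdive.dsmz.de/advsearch?", paste(params, collapse = "&"))
}

build_geo_query("Sea of Japan", "Europe")
```

Called with these two arguments, the helper reproduces the URL passed to `retrieve_search_results()` in the `data` chunk. Other characters that need escaping in a URL are not handled by this sketch.
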


### How to test datasets

If a BacDive user finds an inconsistency within the datasets they use, BacDiveR's
`retrieve_search_results()` can be used to construct a test-case for such a problem.
In the following example, the test fails as long as BacDive contains datasets with
the above-described discrepancy between the `geo_loc_name` and `continent` fields.

```{r test, error=TRUE}
library(testthat)
test_that("No inconsistent datasets exist", {
expect_null(inconsistent_data)
})
```

Once the inconsistency is corrected in BacDive, the advanced search returns no
results any more, and the above test passes. It can thus be used to monitor the
resolution of such a problem after [reporting](https://bacdive.dsmz.de/?site=contact)
it. Furthermore, the user is alerted (by the test failing again) in case new
datasets appear in BacDive with the same inconsistency.
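
One way to package this monitoring pattern is sketched below. `expect_no_hits()` is a hypothetical helper, not part of BacDiveR or testthat. It takes the retrieval function as an argument so that it can be tried offline with a stub, and it relies only on the behaviour described above, namely that `retrieve_search_results()` returns `NULL` when a search finds nothing.

```r
library(testthat)

# Hypothetical helper: passes when the search behind `query_url` finds nothing.
# `retrieve` defaults to BacDiveR's function, but any function that returns
# NULL for "no results" can be substituted, e.g. a stub for offline use.
expect_no_hits <- function(query_url,
                           retrieve = BacDiveR::retrieve_search_results) {
  hits <- retrieve(query_url)
  expect_null(hits, info = paste("Search still returns datasets:", query_url))
}

# Offline demonstration with a stub that simulates a corrected database.
# The URL is a placeholder and is never requested.
stub_no_results <- function(url) NULL

test_that("the stubbed search reports no inconsistent datasets", {
  expect_no_hits("https://example.org/hypothetical-query",
                 retrieve = stub_no_results)
})
```

Because the default for `retrieve` is only evaluated when no replacement is supplied, the stubbed call works even where BacDiveR is not installed. In real use, one such expectation per reported inconsistency keeps the list of open problems visible in the test output.
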

### References

See [testthat.r-lib.org](https://testthat.r-lib.org/) and the
[related "R Packages" chapter](http://r-pkgs.had.co.nz/tests.html) to learn
more about testing in R [@TT; @T].
