Data Integrity Testing

Christoph Broschinski edited this page May 12, 2017 · 8 revisions

Background

Since most of the metadata submitted to OpenAPC has been manually created at some point in its life cycle, it will inevitably contain errors. Furthermore, even data imported from external sources like CrossRef cannot be relied on to be correct or up-to-date in all cases. We address this problem by employing a software test suite which checks the whole dataset for potential errors on a regular basis.

Technical details

The test script is written in Python and based on the pytest testing framework. When executed, the script imports both the OpenAPC core data file and the offsetting file and sends every entry through a set of test functions. After the run finishes, a report lists all errors encountered.
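The flow described above (load a file, send every row through a set of check functions, report the errors) can be sketched as follows. This is a minimal illustration that uses only the standard library rather than pytest. The sample data, the three-column layout and the names `CORE_CSV`, `check_publisher`, `check_is_hybrid` and `run_checks` are assumptions made for this example and are not taken from the actual test script.

```python
import csv
import io

# Hypothetical stand-in for a data file; the real files have more columns.
CORE_CSV = (
    "publisher,journal_full_title,is_hybrid\n"
    "Example Press,Some Journal,TRUE\n"
    ",Other Journal,maybe\n"
)

def check_publisher(row):
    """Return an error message, or None if the row passes."""
    if row["publisher"] in ("", "NA"):
        return "publisher is empty or NA"
    return None

def check_is_hybrid(row):
    """Return an error message, or None if the row passes."""
    if row["is_hybrid"] not in ("TRUE", "FALSE"):
        return "is_hybrid must be TRUE or FALSE"
    return None

CHECKS = [check_publisher, check_is_hybrid]

def run_checks(name, text):
    """Send every row of one CSV source through all checks and collect errors."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for check in CHECKS:
            message = check(row)
            if message is not None:
                errors.append(f"{name}, line {line_no}: {message}")
    return errors

report = run_checks("core", CORE_CSV)
for entry in report:
    print(entry)
```

In a pytest-based suite, each check would typically be an individual test function, and pytest would take over collecting and reporting the failures.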

The test suite has two modes of operation. First, it can be called from the command line to verify data integrity in the local git repository (this should always be done before pushing any changes to the APC data files back to GitHub!). Second, it runs automatically whenever a push or pull request occurs in the OpenAPC repository, by hooking into a continuous integration service (Travis, in our case). In that case, the test suite is executed on a remote server and the results are reported to the OpenAPC team via mail/Slack integration. A small widget on the OpenAPC README page also shows the latest test status:

Build Status

Test cases

The following tests are applied to every article (csv row) in the OpenAPC core data file and the offsetting file:

Standalone tests

These tests are independent of other lines in the file:

  • (syntax) Every line must consist of exactly 18 columns.
  • (content) The columns publisher and journal_full_title may not be empty or NA, and they may not contain leading or trailing whitespace.
  • (content) The columns is_hybrid, indexed_in_crossref and doaj must be either TRUE or FALSE.
  • (content) The column doi must either be NA or contain a valid DOI (checked against a regular expression).
  • (content) The column issn may not be empty or NA. Its content must be an ISSN, which is checked for both syntactic correctness (regular expression) and semantic correctness (ISSN check digit calculation). The other ISSN fields (issn_print, issn_electronic and issn_l) may be NA, but if they contain a value, it must pass the same checks.
  • (content) The column euro must contain a valid numerical value (dot (".") as decimal point, no thousands separator) which must be larger than 0. Entries from the offsetting file skip this test.
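
The DOI, ISSN and euro checks listed above could look roughly like the following sketch. The ISSN check digit calculation follows the standard ISSN scheme (weights 8 down to 2, modulo 11, with X standing for 10). The regular expressions and all function names are assumptions made for illustration; the patterns used by the actual test script may differ, in particular for DOIs and for which number notations the euro column accepts.

```python
import re

ISSN_RE = re.compile(r"\d{4}-\d{3}[\dX]")
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")   # assumed pattern, not the suite's own
EURO_RE = re.compile(r"\d+(\.\d+)?")      # dot as decimal point, no separators

def is_valid_issn(value):
    """Check ISSN syntax and verify the check digit."""
    if not ISSN_RE.fullmatch(value):
        return False
    digits = value.replace("-", "")
    # Weight the first seven digits with 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected

def is_valid_optional_issn(value):
    """issn_print, issn_electronic and issn_l may be NA."""
    return value == "NA" or is_valid_issn(value)

def is_valid_doi(value):
    """The doi column is either NA or matches the (assumed) DOI pattern."""
    return value == "NA" or DOI_RE.fullmatch(value) is not None

def is_valid_euro(value):
    """A plain decimal number that is larger than 0."""
    return EURO_RE.fullmatch(value) is not None and float(value) > 0

print(is_valid_issn("0378-5955"))   # correct check digit
print(is_valid_issn("0378-5954"))   # wrong check digit
print(is_valid_euro("1,234.50"))    # thousands separator is rejected
```

Keeping the syntax check and the check digit calculation in one function means a value such as 0378-5954, which looks like an ISSN but has a wrong final digit, is rejected together with plainly malformed values.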