Skip to content

Data Integrity Testing

Christoph Broschinski edited this page May 12, 2017 · 8 revisions

Background

Since most of the metadata submitted to OpenAPC has been manually created at some point in its life cycle, it will inevitably contain errors. Furthermore, even data imported from external sources like CrossRef cannot be relied on to be correct or up-to-date in all cases. We address this problem by employing a software test suite which checks the whole dataset for potential errors on a regular basis.

Technical details

The test script is written in Python and based on the pytest testing framework. Upon execution the script imports both the OpenAPC core data file and the offsetting file and sends every entry through a set of test functions. A report lists any encountered errors after finishing.

There are 2 work modes for the test suite: First, it can be simply called from the command line to verify data integrity in the local git repository (This should always be done before pushing back any changes to the APC data files back to GitHub!). Second, it is automatically called whenever a push or pull request occurs in the OpenAPC repository by hooking into a continuous integration service (Travis, in our case). The test suite is executed on a remote server and results are reported to the OpenAPC team via mail/Slack integration. A small widget on the OpenAPC README page also informs about the latest test status:

Build Status