Merge pull request epiforecasts#387 from csoneson/ref-fixes
Minor fixes to text/references
seabbs authored Jun 24, 2021
2 parents 3b18401 + a83e2e1 commit bf4b680
Showing 2 changed files with 8 additions and 7 deletions.
5 changes: 3 additions & 2 deletions inst/paper/paper.bib
@@ -72,9 +72,9 @@ @article{Davies2021
year = {2021}
}

-@misc{Dong2020,
+@article{Dong2020,
author = {Dong, Ensheng and Du, Hongru and Gardner, Lauren},
-booktitle = {The Lancet Infectious Diseases},
+journal = {The Lancet Infectious Diseases},
doi = {10.1016/S1473-3099(20)30120-1},
file = {:home/joe/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Dong, Du, Gardner - 2020 - An interactive web-based dashboard to track COVID-19 in real time(2).pdf:pdf},
issn = {14744457},
@@ -210,6 +210,7 @@ @Manual{sars2pack
author = {Sean Davis and VJ Carey},
year = {2021},
note = {R package version 0.99.2},
+url = {https://github.com/seandavi/sars2pack}
}

@article{Wahltinez2020,
10 changes: 5 additions & 5 deletions inst/paper/paper.md
@@ -50,25 +50,25 @@ link-citations: yes

# Summary

-`covidregionaldata` is an R [@Rdev:2020] package that provides an interface to subnational and national level COVID-19 data. The package provides cleaned and verified COVID-19 test-positive case counts and, where available, counts of deaths, recoveries, and hospitalisations in a consistent and fully transparent framework. The package automates common processing steps while allowing researchers to easily and transparently trace the origin of the underlying data sources. It has been designed to allow users to easily extend the packages' capabilities and contribute to shared data handling. All package code is archived on [Zenodo](https://zenodo.org/record/4718466) [@covidregionaldata] and [Github](https://github.com/epiforecasts/covidregionaldata).
+`covidregionaldata` is an R [@Rdev:2020] package that provides an interface to subnational and national level COVID-19 data. The package provides cleaned and verified COVID-19 test-positive case counts and, where available, counts of deaths, recoveries, and hospitalisations in a consistent and fully transparent framework. The package automates common processing steps while allowing researchers to easily and transparently trace the origin of the underlying data sources. It has been designed to allow users to easily extend the package's capabilities and contribute to shared data handling. All package code is archived on [Zenodo](https://zenodo.org/record/4718466) [@covidregionaldata] and [GitHub](https://github.com/epiforecasts/covidregionaldata).

# Statement of need

The onset of the COVID-19 pandemic in late 2019 has placed pressure on public health and research communities to generate evidence that can help advise national and international policy in order to reduce transmission and mitigate harm. At the same time, there has been a renewed policy and public health emphasis on localised, subnational decision making and implementation [@Hale2021; @Liu2021]. This requires reliable sources of data disaggregated to a fine spatial scale, ideally with few and/or known sources of bias.

-At a national level, epidemiological COVID-19 data is available to download from official sources such as the [World Health Organisation (WHO)](https://covid19.who.int/) [@WorldHealthOrganisation] or the [European Centre for Disease Control (ECDC)](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide%7D) [@EuropeanCentreforDiseasePreventionandControl]. Many government bodies provide a wider range of country specific data, such as [Public Health England in the United Kingdom](https://coronavirus.data.gov.uk/details/about-data) [@PublicHealthEngland], and this is often the only way to access data at a subnational scale, for example by state, district, or province.
+At a national level, epidemiological COVID-19 data is available to download from official sources such as the [World Health Organisation (WHO)](https://covid19.who.int/) [@WorldHealthOrganisation] or the [European Centre for Disease Prevention and Control (ECDC)](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide%7D) [@EuropeanCentreforDiseasePreventionandControl]. Many government bodies provide a wider range of country specific data, such as [Public Health England in the United Kingdom](https://coronavirus.data.gov.uk/details/about-data) [@PublicHealthEngland], and this is often the only way to access data at a subnational scale, for example by state, district, or province.

Sometimes collated from a range of national and subnational sources, these data come in a variety of formats, requiring users to check and standardise data before it can be combined or processed for analysis. This is a particularly time-consuming process for subnational data sets, which are often only available in the originating countries’ languages and require customised methods for downloading and processing. This generates potential for errors through programming mistakes, changes to a dependency package, or unexpected changes to a data source. This can lead to misrepresenting the data in ways which are difficult to identify. At best, an independent data processing workflow only slows down the pace of research and analysis, while at worst it can lead to misleading and erroneous results.

-Because of these issues, it is important to develop robust tools that provide cleaned, checked and standardised data from multiple sources in a transparent manner. `covidregionaldata` provides easy access to clean data using a single-argument function, ready for analysing the epidemiology of COVID-19 from local to global scales, and in a framework that is easy to trace from raw data to the final standardised data set. Additional arguments to this function support users to, amongst other options, specify the spatial level of subnational data, return data with either standardised or country-specific variable names, or to access the full pipeline from raw to clean data. By default, cleaned and processed data is returned, however, the raw data from a source can also be returned. All data sources are checked daily via Github workflows and their status reported in the documentation section 'Data Status'. `covidregionaldata` largely depends on popular packages that many researchers are familiar with (such as the `tidyverse` suite [@Wickham2019]) and can therefore be easily adopted by researchers working in R. In addition to code coverage tests, we test and report the status of all data sets daily.
+Because of these issues, it is important to develop robust tools that provide cleaned, checked and standardised data from multiple sources in a transparent manner. `covidregionaldata` provides easy access to clean data using a single-argument function, ready for analysing the epidemiology of COVID-19 from local to global scales, and in a framework that is easy to trace from raw data to the final standardised data set. Additional arguments to this function support users to, amongst other options, specify the spatial level of subnational data, return data with either standardised or country-specific variable names, or to access the full pipeline from raw to clean data. By default, cleaned and processed data is returned, however, the raw data from a source can also be returned. All data sources are checked daily via GitHub workflows and their status reported in the documentation section 'Data Status'. `covidregionaldata` largely depends on popular packages that many researchers are familiar with (such as the `tidyverse` suite [@Wickham2019]) and can therefore be easily adopted by researchers working in R. In addition to code coverage tests, we test and report the status of all data sets daily.

Currently, `covidregionaldata` provides subnational data collated by official government bodies or by credible non-governmental efforts for 15 countries, including the UK, India, USA, and Brazil. It also provides an interface to subnational data curated by Johns Hopkins University [@Dong2020], and the [Google COVID-19 open data project](https://github.com/GoogleCloudPlatform/covid-19-open-data) [@Wahltinez2020]. National-level data is provided from the World Health Organisation (WHO) [@WorldHealthOrganisation], European Centre for Disease Prevention and Control (ECDC) [@EuropeanCentreforDiseasePreventionandControl], Johns Hopkins University (JHU) [@Dong2020], and the Google COVID-19 open data project [@Wahltinez2020].

# State of the field

-Multiple organisations have built private COVID-19 data curation pipelines similar to that provided in `covidregionaldata`, including Johns Hopkins University (JHU) [@Dong2020], Google [@Wahltinez2020], and the COVID-19 Data Hub [@Guidotti2020]. However, most of these efforts aggregate the data they collate into a separate data stream, breaking the linkage with the raw data, and often do not fully surface their data processing pipeline for others to inspect. In contrast `covidregionaldata` provides a clear set of open and fully documented tools that directly operate on raw data where possible in order to make the full data cleaning process transparent to end users.
+Multiple organisations have built private COVID-19 data curation pipelines similar to that provided in `covidregionaldata`, including Johns Hopkins University (JHU) [@Dong2020], Google [@Wahltinez2020], and the COVID-19 Data Hub [@covid19datahub:2020]. However, most of these efforts aggregate the data they collate into a separate data stream, breaking the linkage with the raw data, and often do not fully surface their data processing pipeline for others to inspect. In contrast `covidregionaldata` provides a clear set of open and fully documented tools that directly operate on raw data where possible in order to make the full data cleaning process transparent to end users.

-Other interfaces to COVID-19 data are available in R, though there are fewer that provide tools for downloading subnational data for multiple countries and none that are known to the authors provide a consistent cleaning pipeline of the data sources they support. COVID-19 Data Hub [@Guidotti2020] provides cleaning functions, a wrapper to a custom database hosted by COVID-19 Data Hub, and access to snapshots of data reported historically. `Covdata` [@covdata] provides weekly COVID-19 data updates as well as mobility and activity data from [Apple](https://covid19.apple.com/mobility) [@Apple] and [Google](https://www.google.com/covid19/mobility/data_documentation.html) [@Google]. `Sars2pack` [@sars2pack] provides interfaces to a large number of data sets curated by external organisations. To our knowledge, none of these packages provide an interface to individual country data sources or a consistent set of data handling tools for both raw and processed data.
+Other interfaces to COVID-19 data are available in R, though there are fewer that provide tools for downloading subnational data for multiple countries and none that are known to the authors provide a consistent cleaning pipeline of the data sources they support. COVID-19 Data Hub [@covid19datahub:2020] provides cleaning functions, a wrapper to a custom database hosted by COVID-19 Data Hub, and access to snapshots of data reported historically. `Covdata` [@covdata] provides weekly COVID-19 data updates as well as mobility and activity data from [Apple](https://covid19.apple.com/mobility) [@Apple] and [Google](https://www.google.com/covid19/mobility/data_documentation.html) [@Google]. `Sars2pack` [@sars2pack] provides interfaces to a large number of data sets curated by external organisations. To our knowledge, none of these packages provide an interface to individual country data sources or a consistent set of data handling tools for both raw and processed data.

`covidregionaldata` has been used by researchers to source standardised data for estimating the effective reproductive number of COVID-19 in real-time both nationally and subnationally [@Abbott2020]. It has also been used in analyses comparing effective reproduction numbers from different subnational data sources in the United Kingdom [@Sherratt2020], and estimating the increase in transmission related to the B.1.1.7 variant [@Davies2021]. As well as its use in research it has also been used to visualise and explore current trends in COVID-19 case, deaths, and hospitalisations.
