Major refactor of {epiparameter} #197

joshwlambert · 2023-10-10T08:47:05Z

This PR is a rewrite of the way {epiparameter} reads in and handles epidemiological parameters.

It reduces the size of the package namespace (a point of improvement mentioned in #151), and makes the package more focussed on the <epidist> class, which is the main functional unit.

Key changes include:

Updated parameter library to be a structured JSON database which more closely resembles the structure of the <epidist> class. The modular JSON structure should allow further changes to each component of a database entry to be made without large breaking changes.
The data dictionary is updated to validate the updates to the JSON parameter library.
The <epiparam> class is removed and replaced by a list of <epidist> objects as the main method of handling epidemiological parameters. The <epiparam> methods, constructor, validator, and utility functions (e.g. bind_epiparam()) are also removed. (A minimal S3 class <multi_epidist> is introduced purely to provide cleaner printing when the list of <epidist> objects is long.)
The reading in of the data from the library is now done via a single function: epidist_db(). epidist_db() is largely rewritten and includes new internal functions (.read_epidist_db(), .filter_epidist_db(), .format_epidist(), .format_params(), .is_cond_epidist()).
The documentation, tests and vignettes have been updated.
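
As a rough illustration of the "list of objects plus a thin printing class" design described above, here is a minimal base R sketch. The names toy_epidist, as_toy_multi and toy_multi_epidist are hypothetical stand-ins and are not the package's classes or functions; the real <epidist> objects hold many more fields.

```r
# Hypothetical stand-in for an <epidist> object: a classed list with two fields.
toy_epidist <- function(disease, epi_dist) {
  structure(list(disease = disease, epi_dist = epi_dist), class = "toy_epidist")
}

# Wrap a plain list of entries in a thin S3 class that exists only for printing.
as_toy_multi <- function(entries) {
  stopifnot(is.list(entries))
  structure(entries, class = "toy_multi_epidist")
}

# Summarise the list instead of printing every entry in full.
format.toy_multi_epidist <- function(x, ...) {
  diseases <- vapply(unclass(x), function(entry) entry$disease, character(1))
  c(
    "List of <toy_epidist> objects",
    sprintf("  Number of entries: %d", length(x)),
    sprintf("  Number of diseases: %d", length(unique(diseases)))
  )
}

print.toy_multi_epidist <- function(x, ...) {
  writeLines(format(x, ...))
  invisible(x)
}

db <- as_toy_multi(list(
  toy_epidist("influenza", "serial_interval"),
  toy_epidist("COVID-19", "incubation_period"),
  toy_epidist("COVID-19", "serial_interval")
))
print(db)
```

Because the class only adds format and print methods, the object can still be indexed and iterated like an ordinary list.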

Minor changes include:

calc_dist_params() has been updated and now includes conversion from mean and dispersion.
Improved handling of NULL in mean.epidist().
create_epidist_summary_stats() is unnested and now produces a single-nested list
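
The PR does not say which distribution the mean and dispersion conversion applies to. As a hedged sketch only, assuming a negative binomial distribution with mean mu and dispersion k, the conversion to R's size/prob parameterisation is prob = k / (k + mu). The function name below is hypothetical and is not part of calc_dist_params().

```r
# Hypothetical helper: convert a negative binomial mean and dispersion (k)
# to the size/prob parameterisation used by dnbinom() and friends.
nbinom_prob_from_mean <- function(mean, dispersion) {
  stopifnot(
    is.numeric(mean), is.numeric(dispersion),
    is.finite(mean), is.finite(dispersion),
    mean > 0, dispersion > 0
  )
  c(prob = dispersion / (dispersion + mean), dispersion = dispersion)
}

params <- nbinom_prob_from_mean(mean = 2, dispersion = 0.5)

# Both parameterisations should describe the same distribution.
all.equal(
  dnbinom(0:10, size = 0.5, mu = 2),
  dnbinom(0:10, size = params[["dispersion"]], prob = params[["prob"]])
)
```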

This PR should evaluate these changes in comparison to the old implementation, and determine if these changes make the package easier to use, easier to maintain, and easier to develop.

There are further improvements to be made if this PR is merged, such as improving the speed of reading in the database and filtering. However, these should be done in a subsequent PR so that this one does not become too cumbersome.

pratikunterwegs

Thanks @joshwlambert - this is a lot of work and seems to have reduced the codebase by a bit. I've only been able to take a quick look as it's a pretty large PR with changes in a lot of files. I've put down some comments in the files as well. I would request other reviewers to also take a look to catch potential issues I've missed.

My overall thoughts are that while the backend has changed somewhat, the functionality that I use most, epidist_db() is relatively unchanged - I consider that a good thing. However, data access using epidist_db() is noticeably slower than the previous implementation, and does not improve when setting single_epidist = TRUE - would be good to try and speed this up. I've made some suggestions in the relevant file (on .read_epidist_db()).

I would actually suggest removing the <multi_epidist> class - this would further slim down the codebase.

I think it could be confusing for new users to understand the difference with <epidist> - the same issue as with <epiparam>.
It should probably be fine for epidist_db() to return a list of <epidist>.
For more compact printing, I would suggest changing the print method to use a shorter citation, and using fewer lines for the disease and pathogen. The full citation dominates the output, making it difficult to quickly grasp whether the correct study has been accessed, and what the key info is. I would go with first-author:last-name et al. (Year) Journal.
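
A minimal sketch of the shorter citation format suggested above ("first-author last name et al. (Year) Journal"), built on utils::bibentry(). The helper name short_citation is hypothetical, and the author list in the example is truncated to three names for brevity; this is not how the package formats citations.

```r
# Hypothetical helper: collapse a bibentry to "Family et al. (Year) Journal".
short_citation <- function(bib) {
  authors <- bib$author
  first_family <- authors[1]$family
  suffix <- if (length(authors) > 1) " et al." else ""
  paste0(first_family, suffix, " (", bib$year, ") ", bib$journal)
}

# Entry taken from the printed output in this thread (author list truncated).
ghani <- utils::bibentry(
  bibtype = "Article",
  author = c(
    utils::person("A", "Ghani"),
    utils::person("M", "Baguelin"),
    utils::person("J", "Griffin")
  ),
  title = "The Early Transmission Dynamics of H1N1pdm Influenza in the United Kingdom",
  journal = "PLoS Currents",
  year = "2009",
  doi = "10.1371/currents.RRN1130"
)

short_citation(ghani)
```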

R/calc_dist_params.R

R/epidist.R

R/epidist_db.R

man/figures/README-plot-epidist-1.png

Bisaloo · 2023-10-16T08:29:30Z

I have not reviewed the implementation because, at this stage, I prefer doing a full review where I can have a complete overview of the codebase rather than a partial diff.

But I agree with the design decision taken here. I find it much clearer to have a single class instead of epiparam and epidist.

jamesmbaazam · 2023-10-24T20:45:15Z

R/epidist_db.R

+#' library of epidemiological parameters is compiled from primary literature
+#' sources. The list output from [epidist_db()] can be subset by the data it
+#' contains, for example by: disease, pathogen, epidemiological distribution,
+#' sample size, region, etc.


Suggest listing everything.

It is not easy to list every possible subsetting, since the function is set up to allow flexible subsetting with the subset argument (similar to how the NSE with the subset() function works). Therefore, it is possible to subset on most elements of an <epidist> object, of which there are quite a few. I'm happy to reword the documentation to make it clearer if you suggest an alternative.
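
To make the comparison with subset() concrete, here is a minimal base R sketch of filtering a list of entries with an unevaluated expression. The function filter_entries and the entry fields are hypothetical and only illustrate the general technique; they are not the internals of epidist_db() or .filter_epidist_db().

```r
# Hypothetical helper: keep list entries for which `subset` evaluates to TRUE.
# The expression is captured unevaluated and then evaluated inside each entry,
# so bare field names (e.g. sample_size) resolve to that entry's elements.
filter_entries <- function(entries, subset) {
  expr <- substitute(subset)
  caller <- parent.frame() # lets the expression also see the caller's variables
  keep <- vapply(
    entries,
    function(entry) isTRUE(eval(expr, envir = entry, enclos = caller)),
    logical(1)
  )
  entries[keep]
}

# Toy entries; the field names are assumptions for illustration.
entries <- list(
  list(disease = "influenza", sample_size = 30),
  list(disease = "COVID-19", sample_size = 10),
  list(disease = "COVID-19", sample_size = 150)
)

min_n <- 20
large_covid <- filter_entries(entries, disease == "COVID-19" & sample_size >= min_n)
length(large_covid)
```

Any element of an entry can appear in the expression, which is why an exhaustive list of subsetting options is hard to write down.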

jamesmbaazam

Code review

Thanks for the job well done, @joshwlambert. This PR simplifies the package namespace with the goal of improving user-friendliness. I think it will be better served with rigorous and honest user testing and feedback from internal and external stakeholders looking to incorporate the package in their outbreak analytics pipelines.

From the developer perspective, I only have a few extra comments that have not already been mentioned in the PR description or by others who have taken a look, i.e., make epidist_db() faster and also consider pre-building the database for use (#198).

My comments

A quick profiling of epidist_db() shows that utils::format.bibtex(), which is called in create_epidist_citation() seems to be taking a lot of time underneath. I wonder if there are performant alternatives.
In epidist(), I wonder if the argument epi_dist is not potentially confusing with the class <epidist>. I think it's fine but I wonder if it's worth considering a different name for the argument to remove any future confusion.
I'd suggest adopting the cli package for printing the <multi_epidist> objects. There's a number of neat features that could help here including pluralizing single versus multiple entries when they're enumerated and using various formatting to emphasize various aspects of the output.

In the following example, "result(s)" and "(are) parameterised" can be pluralized with cli. The citation DOI can be formatted with cli's hyperlink features to make it clickable in the console. Moreover, parts of the output can also be formatted with cli's color formatting features to make it stand out. I recognize that it's not a priority compared to other pressing issues with the API but just registering it here for later.

epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Using Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> To retrieve the citation use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#>   shape: 2.622
#>   scale: 0.957
Created on 2023-10-24 with reprex v2.0.2
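
The comment above proposes cli for pluralisation. As a dependency-free sketch of the same idea, base R's ngettext() can pick singular or plural wording; the function name result_message is hypothetical and the wording is only an assumption modelled on the messages shown in this thread.

```r
# Hypothetical helper: build the "Returning n results" message with
# singular/plural agreement using base R's ngettext().
result_message <- function(n_results, n_parameterised) {
  sprintf(
    "Returning %d %s that %s the criteria (%d %s parameterised).",
    n_results,
    ngettext(n_results, "result", "results"),
    ngettext(n_results, "matches", "match"),
    n_parameterised,
    ngettext(n_parameterised, "is", "are")
  )
}

result_message(1, 1)
result_message(10, 10)
```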

Still on the issue of printing, taking a look at the following example,

epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation_period",
  subset = is_parameterised
)
#> Returning 10 results that match the criteria (10 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> List of <epidist> objects
#>  Number of entries in library: 10
#>  Number of studies in library: 4
#> Number of diseases: 1
#>  Number of delay distributions: 10
#>  Number of offspring distributions: 0

I think the message "Returning 10 results that match the criteria" essentially repeats the list entry, " Number of entries in library: 10". Might be worth removing one of them. I would remove the former and move the rest of the message about filtering to the bottom as extra information after printing the summary of the returned database.

In the example for epidist_db(), I see you use functional subsetting in one of the examples with is_parameterised. It might be worth explicitly namespacing it as epiparameter::is_parameterised to make it clear that the function comes from epiparameter. Is it also possible to list the other functions that exist internally and can be used with the subset argument?
With regard to functions like is_parameterised(), where there is a variation in spelling, consider providing an American alias via is_parameterized(), like dplyr::summarise() and dplyr::summarize(). I know it's a small thing, but it can reduce confusion for users who don't use tab or auto-completion.
Concerning the @return for epidist_db(), I wonder if it'll be better to list what an <epidist> contains.
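
A spelling alias of the kind suggested above can be a plain assignment. The sketch below uses a toy stand-in for is_parameterised(), not the package's implementation, and the parameters field is an assumption for illustration.

```r
# Toy stand-in for the package's generic; the default method is hypothetical.
is_parameterised <- function(x, ...) UseMethod("is_parameterised")

is_parameterised.default <- function(x, ...) {
  params <- x$parameters
  !is.null(params) && !anyNA(params)
}

# American-spelling alias: the same function object under a second name,
# so S3 dispatch still goes through the is_parameterised methods.
is_parameterized <- is_parameterised

entry <- list(parameters = c(shape = 2.622, scale = 0.957))
is_parameterized(entry)
```

In a package, both names would also need to be exported and documented, typically on the same help page.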

joshwlambert · 2023-11-13T13:49:47Z

With regard to functions like is_parameterised(), where there is a variation in spelling, consider providing an American alias via is_parameterized(), like dplyr::summarise() and dplyr::summarize(). I know it's a small thing, but it can reduce confusion for users who don't use tab or auto-completion.

American spelling has been added for is_parameterised() and discretise() in 1582803.

joshwlambert · 2023-11-13T14:08:07Z

Thanks for all the comments and suggestions @pratikunterwegs & @jamesmbaazam!

I have made some updates from comments left with commits to this branch, and some other aspects that fall outside the scope of this PR I have logged as issues and will address in a separate PR.

Responses to some comments:

There seems to be some disagreement around the use of the <multi_epidist> class and whether it should be deleted or enhanced with improved printing. I will move this to a new issue referencing people's comments and we can decide which is the best direction to go in, with a dedicated PR for this issue.
I agree that the citation length printed to console can be a large part of the <epidist> and can create issues reading the list of <epidist>s output from epidist_db(). There are some other aspects of citations that I need to improve (see Team authors requires hyphen separation #193) so I will tackle these together in a later PR.
@jamesmbaazam the first code chunk does not match what I reproduce locally. Can you check you have the most up-to-date version of {epiparameter} installed? I get

epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Returning 1 results that match the criteria (1 are parameterised). 
#> Use subset to filter by entry variables or single_epidist to return a single entry. 
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#>   shape: 2.622
#>   scale: 0.957

Created on 2023-11-13 with reprex v2.0.2

I agree the number of entries returned from epidist_db() is replicated, but there are some instances where it is not: when the output of epidist_db() is assigned to a variable, or when the number of <epidist> objects returned is less than 5. For these cases I will leave the message there for now.
Concerning the @return for epidist_db(), I wonder if it'll be better to list what an <epidist> contains.

I'm not sure what you mean. If you would prefer more information on what is in each <epidist> object I think it is best to document that in the @return field of the epidist() documentation.

pratikunterwegs

Thanks for making these changes @joshwlambert, great work, looks good to me. I've done some nitpicking but nothing really preventing this from being merged and a follow up sweep could also take care of these issues; I leave it to you when to take care of them.

R/calc_dist_params.R

R/epidist.R

R/epidist_db.R

joshwlambert added 30 commits September 29, 2023 17:11

updated parameter library JSON to modular objects

eb6084e

updated data dictionary to test modular parameter library

178818d

added enum to validate epi distributions and updated database to be valid

0a6669c

added .format_epidist function

ed91b52

added .format_params function

31e277b

added .read_epidist_db function

102ee0c

added .is_cond_epidist function

9261f4b

updated epidist_db to read in from JSON db and work with list instead of df

02ae1a1

updated epidist_db documentation

2d3ce8b

added print method for multi_epidist class

4bba713

added has_r_params function and updated epidist constructor

74bfa11

updated epidist helper functions (removed outdated nesting)

e8ac648

added median and dispersion conversion to calc_dist_params

6ef3784

updated calc_dist_params documentation

67ab1f2

removed outdated percentiles formatting in get_percentiles (no longer needed since db update)

d959b55

updated create_epidist_summary_stats documentation

83bec18

removed epiparam functions (constructor, validator, print, format, is_epiparam, summary, head, tail)

3f33977

removed epiparam_fields and epiparam_col_type functions

c0aa33d

removed epiparam utility functions (as_epidist, make_epidist, as_epiparam, add_ci_limits)

697d0a2

removed epiparam methods ([, names, $, epiparam_reconstruct, epiparam_can_reconstruct, df_reconstruct, dplyr_reconstruct.epiparam)

8ec2c53

removed bind_epiparam function

7741bc8

updated multi_epidist print method

ae58204

updated epidist_db documentation

48eca71

add filter by disease to .read_epidist_db

93946cc

removed bind_epiparam tests

bae37cc

removed epiparam tests

ae3e61a

removed epiparam utility function tests

a2bfa83

update epidist mean method for new summary stats list

8a3eb52

updated get_citation methods to work with multi_epidist and not epiparam

df115ca

updated is_parameterised methods to work with multi_epidist and not epiparam

a2925eb

updated snapshots

412a57a

joshwlambert marked this pull request as ready for review October 11, 2023 09:03

joshwlambert requested review from pratikunterwegs and Bisaloo October 11, 2023 09:03

TimTaylor mentioned this pull request Oct 11, 2023

Pre-build database? #198

Closed

jamesmbaazam self-requested a review October 11, 2023 12:28

pratikunterwegs reviewed Oct 12, 2023

jamesmbaazam reviewed Oct 24, 2023

joshwlambert added 2 commits November 10, 2023 17:00

use checkmate to simplify if statements in calc_dist_params

8c34eea

replaced logical statements with checkmate in validate_epidist

5d601d8

joshwlambert mentioned this pull request Nov 13, 2023

Improve method of reading parameters with epidist_db() #200

Closed

updated list assertion for epidist summary stats

2c6b041

joshwlambert mentioned this pull request Nov 13, 2023

Vectorised filtering for list of <epidist> #201

Open

added american spelling for is_parameterised and discretise functions

1582803

joshwlambert mentioned this pull request Nov 13, 2023

Enhance, delete or leave <multi_epidist> #202

Closed

pratikunterwegs approved these changes Nov 13, 2023

joshwlambert added 4 commits November 13, 2023 15:20

remove duplicated epidist prob_dist check in validate_epidist

fb6503c

return numeric NA in mean.epidist

ed39ed9

replace test_character with test_string when scalar character

065e1ba

check for finite parameters and positive sample size in calc_dist_params

b2fea45

joshwlambert merged commit e9a27b6 into main Nov 13, 2023

joshwlambert deleted the refctr_db branch November 13, 2023 16:04

joshwlambert mentioned this pull request Nov 14, 2023

Remove {dplyr} from Suggests #205

Merged

avallecam mentioned this pull request Dec 12, 2023

how to know which variables are available to subset by in a multi_epidist object? #224

Closed

joshwlambert mentioned this pull request Mar 1, 2024

Add as.data.frame() method for <multi_epidist> #249

Closed

joshwlambert mentioned this pull request Aug 13, 2024

Split vignette into topic-specific vignettes #75

Closed

9 tasks
