Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major refactor of {epiparameter} #197

Merged
merged 72 commits into from
Nov 13, 2023
Merged

Major refactor of {epiparameter} #197

merged 72 commits into from
Nov 13, 2023

Conversation

joshwlambert
Copy link
Member

This PR is a rewrite of the way {epiparameter} reads in and handles epidemiological parameters.

It reduces the size of the package namespace (a point of improvement mentioned in #151), and make the package more focussed on the <epidist> class which is the main functional unit.

Key changes include:

  • Updated parameter library to be a structured JSON database which more closely resembles the structure of the <epidist> class. The modular JSON structure should allow further changes to each component of a database entry to be made without large breaking changes.
  • The data dictionary is updated to validate the updates to the JSON parameter library.
  • The <epiparam> class is removed and instead replaced by a list of <epidist> objects as the main method of handling epidemiological parameters. The <epiparam> methods, constructor, validator, and utility function are also removed (e.g. bind_epiparam()). (A minimal S3 class <multi_epidist> is introduced purely to provide cleaner printing when the list of <epidist> objects is long).
  • The reading in of the data from the library is now done via a single function: epidist_db(). epidist_db() is largely rewritten and includes new internal functions (.read_epidist_db(), .filter_epidist_db(), .format_epidist(), .format_params(), .is_cond_epidist()).
  • The documentation, tests and vignettes have been updated.

Minor changes include:

  • calc_dist_params() has been updated and now includes conversion from mean and dispersion.
  • Improved handling of NULL in mean.epidist().
  • create_epidist_summary_stats() is unnested and now produces a single-nested list

This PR should evaluate these changes in comparison to the old implementation, and determine if these changes make the package easier to use, easier to maintain, and easier to develop.

There are further improvement to be made if this PR is merged, such as improving the speed of reading in the database and filtering. However, these should be done in a subsequent PR so that this one does not become too cumbersome.

…_can_reconstruct, df_reconstruct, dplyr_reconstruct.epiparam)
@joshwlambert joshwlambert marked this pull request as ready for review October 11, 2023 09:03
@TimTaylor TimTaylor mentioned this pull request Oct 11, 2023
@jamesmbaazam jamesmbaazam self-requested a review October 11, 2023 12:28
Copy link
Collaborator

@pratikunterwegs pratikunterwegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joshwlambert - this is a lot of work and seems to have reduced the codebase by a bit. I've only been able to take a quick look as it's a pretty large PR with changes in a lot of files. I've put down some comments in the files as well. I would request other reviewers to also take a look to catch potential issues I've missed.

My overall thoughts are that while the backend has changed somewhat, the functionality that I use most, epidist_db() is relatively unchanged - I consider that a good thing. However, data access using epidist_db() is noticeably slower than the previous implementation, and does not improve when setting single_epidist = TRUE - would be good to try and speed this up. I've made some suggestions in the relevant file (on .read_epidist_db()).

I would actually suggest removing the <multi_epidist> class - this would further slim down the codebase.

  • I think it could confusing for new users to understand the difference with <epidist> - the same issue as with <epiparam>.
  • It should probably be fine for epidist_db() to return a list of <epidist>.
  • For more compact printing, I would suggest changing the print method to use a shorter citation, and using fewer lines for the disease and pathogen. The full citation dominates the output, making it difficult to quickly grasp whether the correct study has been accessed, and what the key info is. I would go with first-author:last-name et al. (Year) Journal.

@Bisaloo
Copy link
Member

Bisaloo commented Oct 16, 2023

I have not reviewed the implementation because as this stage, I prefer doing a full review where I can have a complete overview of the codebase rather than a partial diff.

But I agree with the design decision taken here. I find it much clearer to have a single class instead of of epiparam and epidist.

#' library of epidemiological parameters is compiled from primary literature
#' sources. The list output from [epidist_db()] can be subset by the data it
#' contains, for example by: disease, pathogen, epidemiological distribution,
#' sample size, region, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest listing everything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not easy to list every possible subsetting, since the function is set up to allow flexible subsetting with the subset argument (similar to how the NSE with the subset() function works). Therefore, it is possible to subset on most elements of an <epidist> object, of which there are quite a few. I'm happy to reword the documentation to make it clearer if you suggest an alternative.

Copy link
Member

@jamesmbaazam jamesmbaazam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review

Thanks for the job well done, @joshwlambert. This PR simplifies the package namespace with the goal of improving user-friendliness. I think it will be better served with rigorous and honest user testing and feedback from internal and external stakeholders looking to incorporate the package in their outbreak analytics pipelines.

From the developer perspective, I only have a few extra comments that have not already been mentioned in the PR description or by others who have taken a look, i.e., make epidist_db() faster and also consider pre-building the database for use (#198).

My comments

  • A quick profiling of epidist_db() shows that utils::format.bibtex(), which is called in create_epidist_citation() seems to be taking a lot of time underneath. I wonder if there are performant alternatives.
  • In epidist(), I wonder if the argument epi_dist is not potentially confusing with the class <epidist>. I think it's fine but I wonder if it's worth considering a different name for the argument to remove any future confusion.
  • I'd suggest adopting the cli package for printing the <multi_epidist> objects. There's a number of neat features that could help here including pluralizing single versus multiple entries when they're enumerated and using various formatting to emphasize various aspects of the output.

In the following example, "result(s)" and "(are) parameterised" can be pluralized with cli. The citation DOI can be formatted with cli's hyperlink features to make it clickable in the console. Moreover, parts of the output can also be formatted with cli's color formatting features to make it stand out. I recognize that it's not a priority compared to other pressing issues with the API but just registering it here for later.

epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Using Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> To retrieve the citation use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#>   shape: 2.622
#>   scale: 0.957
Created on 2023-10-24 with reprex v2.0.2
  • Still on the issue of printing, taking a look at the following example,
epidist_db(
     disease = "COVID-19",
     epi_dist = "incubation_period",
     subset = is_parameterised
 )
#> Returning 10 results that match the criteria (10 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> List of <epidist> objects
#>  Number of entries in library: 10
#>  Number of studies in library: 4
#> Number of diseases: 1
#>  Number of delay distributions: 10
#>  Number of offspring distributions: 0

I think the message "Returning 10 results that match the criteria" essentially repeats the list entry, " Number of entries in library: 10". Might be worth removing one of them. I would remove the former and move the rest of the message about filtering to the bottom as extra information after printing the summary of the returned database.

  • In the example for epidist_db(), I see you use functional subsetting in one of the examples with is_parameterised. It might be worth explicitly namespacing it with epiparameter::is_parametrised to make it clear that the function comes from epiparameter. Is it also possible to list out other functions that exist internally that can be used with the subset argument?
  • With regards to functions like is_parametrised(), where there is a variation in spelling, consider providing an American alias via is_parameterized() like dplyr::summarise() and dplyr::summarize() I know it's a small thing but can reduce confusion for users who don't use tab or auto-completion.
  • Concerning the @return for epidist_db(), I wonder if it'll be better to list what an <epidist> contains.

@joshwlambert
Copy link
Member Author

With regards to functions like is_parametrised(), where there is a variation in spelling, consider providing an American alias via is_parameterized() like dplyr::summarise() and dplyr::summarize() I know it's a small thing but can reduce confusion for users who don't use tab or auto-completion.

American spelling has been added for is_parameterised() and discretise() in 1582803.

@joshwlambert
Copy link
Member Author

Thanks for all the comments and suggestions @pratikunterwegs & @jamesmbaazam!

I have made some updates from comments left with commits to this branch, and some other aspects that fall outside the scope of this PR I have logged as issues and will address in a separate PR.

Responses to some comments:

  • There seems to be some disagreement around the use the <multi_epidist> class and whether it should be deleted or be enhanced with improved printing. I will move this to a new issue referencing peoples comments and we can decide which is the best direction to go in, with a dedicated PR to this issue.
  • I agree that the citation length printed to console can be a large part of the <epidist> and can create issues reading the list of <epidist>s output from epidist_db(). There are some other aspects of citations that I need to improve (see Team authors requires hyphen separation #193) so I will tackle these together in a later PR.
  • @jamesmbaazam the first code chunk does not match what I reproduce locally. Can you check you have the most up-to-date version of {epiparameter} installed. I get
epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Returning 1 results that match the criteria (1 are parameterised). 
#> Use subset to filter by entry variables or single_epidist to return a single entry. 
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#>   shape: 2.622
#>   scale: 0.957

Created on 2023-11-13 with reprex v2.0.2

  • I agree the number of entries returned from epidist_db() is replicated, but there are some instances where it is not. When the output of epidist_db() is assigned to a variable, or if the number of <epidist> objects returned is less than 5. For these cases I will leave the message there for now.
  • Concerning the @return for epidist_db(), I wonder if it'll be better to list what an <epidist> contains.

I'm not sure what you mean. If you would prefer more information on what is in each <epidist> object I think it is best to document that in the @return field of the epidist() documentation.

Copy link
Collaborator

@pratikunterwegs pratikunterwegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes @joshwlambert, great work, looks good to me. I've done some nitpicking but nothing really preventing this from being merged and a follow up sweep could also take care of these issues; I leave it to you when to take care of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants