Major refactor of {epiparameter} #197
… needed since db update)
…_epiparam, summary, head, tail)
…aram, add_ci_limits)
…_can_reconstruct, df_reconstruct, dplyr_reconstruct.epiparam)
Thanks @joshwlambert - this is a lot of work and seems to have reduced the codebase by a bit. I've only been able to take a quick look as it's a pretty large PR with changes in a lot of files. I've put down some comments in the files as well. I would request other reviewers to also take a look to catch potential issues I've missed.
My overall thoughts are that while the backend has changed somewhat, the functionality that I use most, epidist_db(), is relatively unchanged - I consider that a good thing. However, data access using epidist_db() is noticeably slower than the previous implementation, and does not improve when setting single_epidist = TRUE - would be good to try and speed this up. I've made some suggestions in the relevant file (on .read_epidist_db()).
I would actually suggest removing the <multi_epidist> class - this would further slim down the codebase.
- I think it could be confusing for new users to understand the difference with <epidist> - the same issue as with <epiparam>.
- It should probably be fine for epidist_db() to return a list of <epidist>.
- For more compact printing, I would suggest changing the print method to use a shorter citation, and using fewer lines for the disease and pathogen. The full citation dominates the output, making it difficult to quickly grasp whether the correct study has been accessed, and what the key info is. I would go with first-author:last-name et al. (Year) Journal.
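The shorter citation format suggested above could be sketched roughly as follows. This is only an illustration in base R, assuming the citation is held as a utils::bibentry object; short_citation() is a hypothetical helper and not part of {epiparameter}, and the author list in the example is truncated for brevity.

```r
# Hypothetical helper (not part of {epiparameter}): build a compact
# "Surname et al. (Year) Journal" string from a bibentry object.
short_citation <- function(bib) {
  stopifnot(inherits(bib, "bibentry"))
  authors <- bib$author
  first_surname <- paste(authors[1]$family, collapse = " ")
  # Only append "et al." when there is more than one author
  author_part <- if (length(authors) > 1) {
    paste(first_surname, "et al.")
  } else {
    first_surname
  }
  paste0(author_part, " (", bib$year, ") ", bib$journal)
}

# Example entry; only the first two authors are listed here
ghani <- utils::bibentry(
  bibtype = "Article",
  author = c(utils::person("A", "Ghani"), utils::person("M", "Baguelin")),
  title = "The Early Transmission Dynamics of H1N1pdm Influenza in the United Kingdom",
  journal = "PLoS Currents",
  year = "2009"
)
short_citation(ghani)
```

A print method could show this one-line form and leave the full reference to the existing citation accessor.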
I have not reviewed the implementation because at this stage, I prefer doing a full review where I can have a complete overview of the codebase rather than a partial diff. But I agree with the design decision taken here. I find it much clearer to have a single class instead of of |
#' library of epidemiological parameters is compiled from primary literature
#' sources. The list output from [epidist_db()] can be subset by the data it
#' contains, for example by: disease, pathogen, epidemiological distribution,
#' sample size, region, etc.
Suggest listing everything.
It is not easy to list every possible subsetting, since the function is set up to allow flexible subsetting with the subset argument (similar to how non-standard evaluation (NSE) works in the subset() function). Therefore, it is possible to subset on most elements of an <epidist> object, of which there are quite a few. I'm happy to reword the documentation to make it clearer if you suggest an alternative.
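To illustrate the kind of NSE-based subsetting described here, a minimal sketch in base R follows. It is not the {epiparameter} implementation: filter_entries() and the toy entries (field names and values) are assumptions made up for the example, and it only handles an unquoted condition, not a function passed to subset.

```r
# Hypothetical sketch (not the {epiparameter} implementation): filter a list
# of entries with an unquoted condition, in the style of base::subset().
filter_entries <- function(entries, subset) {
  condition <- substitute(subset)  # capture the expression unevaluated
  caller <- parent.frame()         # where to look up any other variables
  keep <- vapply(
    entries,
    # Evaluate the condition with the entry's fields visible as variables
    function(entry) isTRUE(eval(condition, envir = entry, enclos = caller)),
    logical(1)
  )
  entries[keep]
}

# Toy entries; field names and values are illustrative assumptions
entries <- list(
  list(disease = "COVID-19", sample_size = 100),
  list(disease = "COVID-19", sample_size = 20),
  list(disease = "Influenza", sample_size = 58)
)
large <- filter_entries(entries, sample_size > 50)
length(large)
```

Because the condition is evaluated inside each entry, any named element can be used for filtering, which is why an exhaustive list of subsetting options is hard to write down.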
Code review
Thanks for the job well done, @joshwlambert. This PR simplifies the package namespace with the goal of improving user-friendliness. I think it will be better served with rigorous and honest user testing and feedback from internal and external stakeholders looking to incorporate the package in their outbreak analytics pipelines.
From the developer perspective, I only have a few extra comments that have not already been mentioned in the PR description or by others who have taken a look, i.e., make epidist_db() faster and also consider pre-building the database for use (#198).
My comments
- A quick profiling of epidist_db() shows that utils::format.bibtex(), which is called in create_epidist_citation(), seems to be taking a lot of time underneath. I wonder if there are performant alternatives.
- In epidist(), I wonder if the argument epi_dist is not potentially confusing with the class <epidist>. I think it's fine but I wonder if it's worth considering a different name for the argument to remove any future confusion.
- I'd suggest adopting the cli package for printing the <multi_epidist> objects. There's a number of neat features that could help here including pluralizing single versus multiple entries when they're enumerated and using various formatting to emphasize various aspects of the output.
In the following example, "result(s)" and "(are) parameterised" can be pluralized with cli. The citation DOI can be formatted with cli's hyperlink features to make it clickable in the console. Moreover, parts of the output can also be formatted with cli's color formatting features to make it stand out. I recognize that it's not a priority compared to other pressing issues with the API but just registering it here for later.
epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Using Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> To retrieve the citation use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#> shape: 2.622
#> scale: 0.957
Created on 2023-10-24 with reprex v2.0.2
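As a rough illustration of the pluralisation point, the sketch below builds the header message with the correct singular and plural forms. It deliberately uses base R's ngettext() as a dependency-free stand-in rather than cli, and results_message() is a hypothetical helper, not a function in {epiparameter}.

```r
# Hypothetical helper: build the header message with correct singular/plural
# forms. Base R ngettext() stands in here for cli's pluralisation.
results_message <- function(n_results, n_parameterised) {
  sprintf(
    "Returning %d %s that %s the criteria (%d %s parameterised).",
    n_results,
    ngettext(n_results, "result", "results"),
    ngettext(n_results, "matches", "match"),
    n_parameterised,
    ngettext(n_parameterised, "is", "are")
  )
}

results_message(1, 1)
results_message(10, 10)
```

With cli the same effect would come from its inline pluralisation markup instead of explicit ngettext() calls.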
- Still on the issue of printing, taking a look at the following example,
epidist_db(
  disease = "COVID-19",
  epi_dist = "incubation_period",
  subset = is_parameterised
)
#> Returning 10 results that match the criteria (10 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> List of <epidist> objects
#> Number of entries in library: 10
#> Number of studies in library: 4
#> Number of diseases: 1
#> Number of delay distributions: 10
#> Number of offspring distributions: 0
I think the message "Returning 10 results that match the criteria" essentially repeats the list entry, " Number of entries in library: 10". Might be worth removing one of them. I would remove the former and move the rest of the message about filtering to the bottom as extra information after printing the summary of the returned database.
- In the example for epidist_db(), I see you use functional subsetting in one of the examples with is_parameterised. It might be worth explicitly namespacing it with epiparameter::is_parameterised to make it clear that the function comes from epiparameter. Is it also possible to list out other functions that exist internally that can be used with the subset argument?
- With regards to functions like is_parameterised(), where there is a variation in spelling, consider providing an American alias via is_parameterized() like dplyr::summarise() and dplyr::summarize(). I know it's a small thing but can reduce confusion for users who don't use tab or auto-completion.
- Concerning the @return for epidist_db(), I wonder if it'll be better to list what an <epidist> contains.
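The spelling alias suggested in the list above can be as small as a second binding to the same function. In the sketch below the body of is_parameterised() is a stand-in written for illustration (the real check in {epiparameter} may differ), and the toy entry is an assumption.

```r
# Stand-in for the package function; the real check in {epiparameter} may
# differ. Here an entry counts as parameterised if it has non-missing parameters.
is_parameterised <- function(x) {
  !is.null(x$parameters) && !anyNA(x$parameters)
}

# American-spelling alias: the same function object under a second name
is_parameterized <- is_parameterised

# Toy entry; the structure is an illustrative assumption
entry <- list(parameters = c(shape = 2.622, scale = 0.957))
is_parameterized(entry)
```

In a package both names would need to be exported and could share one help page, as dplyr does for summarise() and summarize().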
American spelling has been added for |
Thanks for all the comments and suggestions @pratikunterwegs & @jamesmbaazam! I have made some updates from comments left with commits to this branch, and some other aspects that fall outside the scope of this PR I have logged as issues and will address in a separate PR. Responses to some comments:
epiparameter::epidist_db(disease = "influenza", epi_dist = "serial_interval")
#> Returning 1 results that match the criteria (1 are parameterised).
#> Use subset to filter by entry variables or single_epidist to return a single entry.
#> To retrieve the short citation for each use the 'get_citation' function
#> Disease: Influenza
#> Pathogen: Influenza-A-H1N1Pdm
#> Epi Distribution: serial interval
#> Study: Ghani A, Baguelin M, Griffin J, Flasche S, van Hoek AJ, Cauchemez S,
#> Donnelly C, Robertson C, White M, Truscott J, Fraser C, Garske T, White
#> P, Leach S, Hall I, Jenkins H, Ferguson N, Cooper B (2009). "The Early
#> Transmission Dynamics of H1N1pdm Influenza in the United Kingdom."
#> _PLoS Currents_. doi:10.1371/currents.RRN1130
#> <https://doi.org/10.1371/currents.RRN1130>.
#> Distribution: gamma
#> Parameters:
#> shape: 2.622
#> scale: 0.957
Created on 2023-11-13 with reprex v2.0.2
I'm not sure what you mean. If you would prefer more information on what is in each |
Thanks for making these changes @joshwlambert, great work, looks good to me. I've done some nitpicking but nothing really preventing this from being merged and a follow up sweep could also take care of these issues; I leave it to you when to take care of them.
This PR is a rewrite of the way {epiparameter} reads in and handles epidemiological parameters.
It reduces the size of the package namespace (a point of improvement mentioned in #151), and makes the package more focussed on the <epidist> class, which is the main functional unit.
Key changes include:
- <epidist> class. The modular JSON structure should allow further changes to each component of a database entry to be made without large breaking changes.
- <epiparam> class is removed and instead replaced by a list of <epidist> objects as the main method of handling epidemiological parameters. The <epiparam> methods, constructor, validator, and utility function are also removed (e.g. bind_epiparam()). (A minimal S3 class <multi_epidist> is introduced purely to provide cleaner printing when the list of <epidist> objects is long).
- epidist_db(). epidist_db() is largely rewritten and includes new internal functions (.read_epidist_db(), .filter_epidist_db(), .format_epidist(), .format_params(), .is_cond_epidist()).
Minor changes include:
- calc_dist_params() has been updated and now includes conversion from mean and dispersion.
- NULL in mean.epidist().
- create_epidist_summary_stats() is unnested and now produces a single-nested list
This PR should evaluate these changes in comparison to the old implementation, and determine if these changes make the package easier to use, easier to maintain, and easier to develop.
There are further improvements to be made if this PR is merged, such as improving the speed of reading in the database and filtering. However, these should be done in a subsequent PR so that this one does not become too cumbersome.