behavioral consistency of calculate() #356

Merged: 31 commits from obs-stat into master on Jan 17, 2021
Conversation

@simonpcouch (Collaborator) commented Jan 4, 2021

This PR makes some changes to calculate() aiming to error more informatively and behave more consistently. Briefly, from NEWS:

  • supply a consistent error when the supplied stat argument isn't well-defined for the variables specified
  • supply a message and return the test statistic when the user supplies unneeded information to calculate an observed statistic, and
  • supply a warning, assume a reasonable null value, and return the test statistic when the user does not supply sufficient information to calculate an observed statistic

Error informatively with inappropriate stat

The most confusing errors I see while teaching with calculate() predominantly arise from within the dispatched calc_impl methods. As a result, errors (or incorrect output) look different for what is essentially the same type of mistake: supplying a stat that doesn't make sense given the variable types specify()ed. For instance, on the current develop branch:

> gss %>%
+   specify(college ~ sex, success = "degree") %>%
+   calculate(stat = "prop", order = c("female", "male"))
#> Error: Only strings can be converted to symbols
#> Run `rlang::last_error()` to see where the error occurred.

or

> gss %>%
+   specify(college ~ NULL, success = "degree") %>%
+   calculate(stat = "diff in props")
#> Error: Must extract column with a single valid subscript.
#> x Subscript `as.character(attr(x, "explanatory"))` has size 0 but must be size 1.
#> Run `rlang::last_error()` to see where the error occurred.

or

> gss %>%
+   specify(hours ~ NULL) %>%
+   calculate(stat = "slope")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1    NA

We caught a few of the more cryptic errors with some conditional logic for special cases (the check_for_*_stat functions), though a good number still slip through. We ought to make sure that calculate()'s errors surface before dispatching to calc_impl().

The new generalized error takes the form:

> gss %>%
+     specify(hours ~ NULL) %>%
+     calculate(stat = "diff in means")
#> Error: A difference in means is not well-defined for a numeric response variable 
#> (hours) and no explanatory variable.

or

> gss %>%
+     specify(college ~ partyid, success = "degree") %>%
+     calculate(stat = "diff in props")
#> Error: A difference in proportions is not well-defined for a dichotomous categorical 
#> response variable (college) and a multinomial categorical explanatory variable (partyid).

I've tossed around a few different phrasings for this error, trying to convey "this doesn't make sense" in language that's precise but still approachable. Very much willing to discuss rewordings here. Specifically, are "dichotomous" and "multinomial" the most common terms here?

This implementation eliminates the need to run ad-hoc checks/transformations on the variable types/existence within the calc_impl methods that have caused issues in the past (see my two most recent PRs). Those checks now take place before dispatching. Further, errors will now be consistent for what is essentially the same mistake: specifying a stat that doesn't make sense given the variable(s) specified.

Its logic relies on a stat_types tibble in utils.R; I'd appreciate a close look at what it considers a valid test statistic given the variable types.
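To make that a bit more concrete, here's a minimal sketch of the kind of lookup I have in mind; the tibble name, columns, and rows below are illustrative only, not the actual contents of stat_types:

library(tibble)
library(dplyr)

# Hypothetical excerpt of a stat-to-variable-type lookup; the real stat_types
# tibble in utils.R may use different columns and encodings.
stat_types_sketch <- tribble(
  ~stat,            ~response_type, ~explanatory_type,
  "mean",           "numeric",      "none",
  "prop",           "dichotomous",  "none",
  "diff in means",  "numeric",      "dichotomous",
  "diff in props",  "dichotomous",  "dichotomous",
  "slope",          "numeric",      "numeric",
  "Chisq",          "multinomial",  "none"
)

# A supplied stat is well-defined only if a matching row exists for the
# specify()ed variable types.
is_well_defined <- function(stat_name, resp, expl = "none") {
  nrow(filter(
    stat_types_sketch,
    stat == stat_name,
    response_type == resp,
    explanatory_type == expl
  )) > 0
}

is_well_defined("slope", "numeric")            # FALSE: no explanatory variable
is_well_defined("slope", "numeric", "numeric") # TRUE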


Calculating observed statistics

The steps to calculate observed statistics vary unnecessarily across test statistics. Some test statistics couldn't be calculated without generate()ing first, statistics differed in their response to unneeded hypothesis information, statistics with nontrivial null hypotheses differed in how strictly they required a null, and we often failed to warn in pipelines closely resembling those needed to successfully calculate a test statistic. My two most recent PRs are relevant reference here as well, in addition to the following examples.

calculate() will now supply a message when the user supplies "too much" information to calculate the given ("untheorized") observed statistic:

> gss %>%
+   specify(response = hours) %>%
+   hypothesize(null = "point", mu = 40) %>%
+   calculate(stat = "mean")
#> Message: The point null hypothesis `mu = 40` does not inform calculation of 
#> the observed statistic (a mean) and will be ignored.
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  41.4

In previous infer versions, depending on the statistic, we used to error out or return the original x argument here. Some of my more recent PRs adjusted the latter behavior to return the statistic, without a message, for z statistics.

On the other hand, not supplying enough information (for "theorized" statistics) results in a warning:

> gss %>%
+     specify(response = sex, success = "female") %>%
+     calculate(stat = "z")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1 -1.16
#> Warning message:
#> A z statistic requires a null hypothesis to calculate the observed statistic. 
#> Output assumes the following null value: `p = .5`. 

or

> gss %>%
+   specify(response = partyid) %>%
+   calculate(stat = "Chisq")
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  334.
#> Warning message:
#> A chi-square statistic requires a null hypothesis to calculate the observed statistic. 
#> Output assumes the following null values: `p = c(dem = 0.2, ind = 0.2, rep = 0.2, other = 0.2, DK = 0.2)`. 

In the above case, depending on the statistic, we used to error (chi-square GOF), message (one-sample t), or, for other theorized statistics, either silently report the statistic or error uninformatively. I could see the argument for erroring here, though this would be a breaking change for some statistics and wrappers.
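For reference, the assumed null in the chi-square example above amounts to equal probabilities across the levels of the response. A minimal sketch of that default value (not necessarily how it's constructed internally):

library(infer)

# The warning above assumes a uniform null across the levels of partyid;
# this is just that value, not infer's internal construction of it.
lvls <- levels(factor(gss$partyid))
p_default <- setNames(rep(1 / length(lvls), length(lvls)), lvls)
p_default
# each of the five levels gets probability 0.2, matching the warning's
# `p = c(dem = 0.2, ind = 0.2, rep = 0.2, other = 0.2, DK = 0.2)`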

calculate() will behave similarly after generate()ing.
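For instance, a pipeline analogous to the z example above, but with a generate() step, should presumably surface the same warning about an assumed null before returning the distribution of statistics:

library(infer)

# Analogous to the z statistic example above, now with a generate() step;
# per the description, the warning about the assumed `p = .5` null should
# presumably appear here too before the distribution is returned.
gss %>%
  specify(response = sex, success = "female") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "z")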


Some other changes:

  • deletes code comments justifying old PRs / mentioning historical behavior
  • adds some explanatory commenting
  • increases usage of helper functions
  • removes redundant checks & simplifies conditional logic
  • relocates some helpers that are only used in one core verb
  • accepts "two.sided" as an alias for "two-sided" in *_p_value() (Allow direction = "two.sided" in get_p_value() #355); see the usage sketch after this list
  • makes some style edits, mostly adding vertical space and ensuring spaces after commas and before/after operators
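As a usage note on the *_p_value() change above, a standard pipeline now works with either spelling of the direction argument:

library(infer)

# direction = "two.sided" is now accepted as an alias for "two-sided"
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

get_p_value(null_dist, obs_stat = obs_mean, direction = "two.sided")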

I don't believe this PR breaks any code that used to work... all previous unit tests pass except for rewordings and transitions between messages/warnings/errors. See 508eefa for needed adjustments.

I don't introduce it in this PR pending more discussion, but I think it's worth mentioning that this allows for a single function to calculate observed statistics (observe()?) a la the current *_stat() functions. Not sure if this goes against the grain of infer's pedagogical approach.

moves one-off helpers into the file they're used in
this should be useful in determining whether a supplied statistic is appropriate given variable types.
also fixes a bug in copying attrs to calculate results: generate -> generated

closes #355
adjust line break, use appropriate formatting for t statistic / null mu
some for new functionality, some making up for coverage that we let slip in recent PRs
one was indistinguishable, the other had to do with copying attributes.
Turns out previous behavior wasn't to error when calculating a _distribution_ of statistics without hypothesizing. This will allow calculate to be internally consistent with its handling of null hypotheses.
@ismayc (Collaborator) commented Jan 4, 2021

I kinda do like observe(), but I'd need to better understand what the layout would look like and see some examples.

Thanks much for all the other work here! It's amazing all the little pieces that needed to be put back together and finished up that have somehow trickled their way into the package. I'm glad we have some master tidiers 😀

@ismayc (Collaborator) commented Jan 4, 2021

What I'd really like is tying the infer verbs back to Allen Downey's There Is Only One Test diagram (http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html?m=1). Adding observe() would get us closer to that.

@simonpcouch (Collaborator, Author)

Much appreciated!😄

If we decide that these changes are a go, I'll open up a separate PR/issue with some thoughts on observe(). The implementation of that function should be pretty straightforward, as we can rely on the internals of the existing core verbs for argument checking, but the integration into documentation and tests would be more significant.
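For the sake of discussion, one possible shape for that wrapper; the signature and argument handling here are purely speculative:

library(infer)

# Purely speculative sketch of an observe() wrapper around the existing core
# verbs; the eventual interface may look quite different.
observe <- function(x, formula, stat, success = NULL, order = NULL, ...) {
  x %>%
    specify(formula = formula, success = success) %>%
    calculate(stat = stat, order = order, ...)
}

# e.g., a one-liner for an observed difference in proportions:
gss %>%
  observe(college ~ sex, stat = "diff in props",
          success = "degree", order = c("female", "male"))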

@echasnovski (Collaborator)

Wow, that's a lot of good work! Thanks!
And a lot to review too :). Now I truly understand the importance of byte-sized PRs.

@echasnovski (Collaborator) left a review:


Thanks again for the great work!
After reviewing, it does look like splitting this PR into at least two parts (utility-function tidying and the calculate() update) would lead to a better review.

Review threads on R/calculate.R, R/hypothesize.R, and R/utils.R (resolved).
dispatch to helpers within calculate

avoid eval(parse( in favor of switch

some style improvements
@simonpcouch (Collaborator, Author)

@echasnovski Lots of helpful changes coming from your first round of review! Much appreciated. :-)

I hear you on keeping future PRs smaller--will do. Probably should have mentioned that switching out that one vdiffr .svg resulted in ~500 lines changed.

I’m not sure if you mean splitting up the PR itself or its review. Totally on board for the latter—feel free to take your time reviewing. We probably could/should nudge the merging of these changes until after the release of 0.5.4 (addressing the vdiffr issues) in the next few days. This PR will be pretty related to coming observe() and constructor PRs here soon, too, so maybe a 0.6.0 release in a few months along with those new features is a better home for these changes. If you meant splitting up the code in this PR into several PRs, the changes to utilities + helpers and the features in calculate() are pretty interdependent here; I think splitting them up would have the effect of decontextualizing the changes rather than easing review.

Further review threads on R/utils.R and R/calculate.R (resolved).
@echasnovski (Collaborator)

I’m not sure if you mean splitting up the PR itself or its review. Totally on board for the latter—feel free to take your time reviewing. We probably could/should nudge the merging of these changes until after the release of 0.5.4 (addressing the vdiffr issues) in the next few days. This PR will be pretty related to coming observe() and constructor PRs here soon, too, so maybe a 0.6.0 release in a few months along with those new features is a better home for these changes. If you meant splitting up the code in this PR into several PRs, the changes to utilities + helpers and the features in calculate() are pretty interdependent here; I think splitting them up would have the effect of decontextualizing the changes rather than easing review.

That was me thinking out loud, sorry. I meant that some changes (utility wrappers or renames) need a relatively small amount of targeted conversation but end up cluttering the whole PR. When looking at every change made in the PR, I had to determine whether it was an "attention needed" change or a simple move/rename. If those were in a separate PR, review would only require skimming through the changes after agreeing in a pre-conversation. I never fully internalized this until actually reviewing a rather big PR :)

Yes, making a CRAN release soon is a priority here. These changes are better left to wait until 0.6.0.

simonpcouch marked this pull request as draft on January 6, 2021 02:15
@simonpcouch (Collaborator, Author)

All for it, thanks for clarifying + dealing with the clutter in this PR.

I'll holler here and merge upstream once 0.5.4 is on CRAN! Starting that release process tonight.

@echasnovski (Collaborator)

Finally got an opportunity to write code. Made some updates myself, hope you don't mind.

echasnovski self-requested a review on January 7, 2021 13:34
@simonpcouch (Collaborator, Author)

Thanks for catching that negation for has_attr()--whoops. Those changes look great.

simonpcouch changed the base branch from develop to master on January 13, 2021 17:06
notably, fixes merge conflicts in NEWS

Merge branch 'master' into obs-stat

# Conflicts:
#	NEWS.md
simonpcouch marked this pull request as ready for review on January 13, 2021 17:13
simonpcouch merged commit e084945 into master on Jan 17, 2021
simonpcouch deleted the obs-stat branch on January 17, 2021 00:12
@github-actions (bot) commented Mar 8, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

The github-actions bot locked and limited the conversation to collaborators on Mar 8, 2021.