Merge pull request #280 from tidymodels/theory-in-docs
more documentation for theoretical methods (closes #278)
ismayc authored Jan 27, 2020
2 parents f94daf2 + 0ba7bd4 commit be264a0
Showing 7 changed files with 296 additions and 22 deletions.
26 changes: 20 additions & 6 deletions R/visualize.R
@@ -54,16 +54,16 @@
#'
#' @examples
#'
#' # ...and a null distribution
#' # find a null distribution
#' null_dist <- gss %>%
#' # ...we're interested in the number of hours worked per week
#' # we're interested in the number of hours worked per week
#' specify(response = hours) %>%
#' # hypothesizing that the mean is 40
#' hypothesize(null = "point", mu = 40) %>%
#' # generating data points for a null distribution
#' generate(reps = 10000, type = "bootstrap") %>%
#' # finding the null distribution
#' calculate(stat = "mean")
#' # calculating a distribution of t test statistics
#' calculate(stat = "t")
#'
#' # we can easily plot the null distribution by piping into visualize
#' null_dist %>%
@@ -73,8 +73,8 @@
#' # find the point estimate---mean number of hours worked per week
#' point_estimate <- gss %>%
#' specify(response = hours) %>%
#' calculate(stat = "mean") %>%
#' dplyr::pull()
#' hypothesize(null = "point", mu = 40) %>%
#' calculate(stat = "t")
#'
#' # find a confidence interval around the point estimate
#' ci <- null_dist %>%
@@ -92,6 +92,20 @@
#' null_dist %>%
#' visualize() +
#' shade_confidence_interval(ci)
#'
#' # to plot a theoretical null distribution, skip the generate()
#' # step and supply `method = "theoretical"` to `visualize()`
#' null_dist_theoretical <- gss %>%
#' specify(response = hours) %>%
#' hypothesize(null = "point", mu = 40) %>%
#' calculate(stat = "t")
#'
#' visualize(null_dist_theoretical, method = "theoretical")
#'
#' # to plot both a theory-based and simulation-based null distribution,
#' # use the simulation-based null distribution and supply
#' # `method = "both"` to `visualize()`
#' visualize(null_dist, method = "both")
#'
#' # More in-depth explanation of how to use the infer package
#' vignette("infer")
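
For orientation, the one-sample $t$ statistic requested with `stat = "t"` in the examples above is conventionally defined as $(\bar{x} - \mu_0) / (s / \sqrt{n})$. The base-R sketch below computes that quantity by hand and compares it with `t.test()`. The `hours` vector is simulated and only a hypothetical stand-in for `gss$hours`; the sketch illustrates the textbook definition and makes no claim about how infer computes the statistic internally.

```r
# Hypothetical stand-in for gss$hours (simulated, not the real data)
set.seed(4)
hours <- rnorm(100, mean = 41, sd = 14)

# Null value of the mean, matching hypothesize(null = "point", mu = 40)
mu0 <- 40

# One-sample t statistic: (sample mean - null mean) / standard error
t_by_hand <- (mean(hours) - mu0) / (sd(hours) / sqrt(length(hours)))

# The same statistic as reported by stats::t.test()
t_builtin <- unname(t.test(hours, mu = mu0)$statistic)
```

Under the null hypothesis this statistic is compared against a $t$ distribution with $n - 1$ degrees of freedom, which is the kind of reference curve a theory-based plot draws.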
26 changes: 20 additions & 6 deletions man/visualize.Rd

Generated file; diff not rendered by default.

29 changes: 26 additions & 3 deletions vignettes/anova.Rmd
@@ -58,10 +58,10 @@ observed_f_statistic <- gss %>%

The observed $F$ statistic is `r observed_f_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that age and political party affiliation are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between the two variables.

We can `generate` the null distribution using simulation. The simulation approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
We can `generate` the null distribution using randomization. The randomization approach permutes the response and explanatory variables, so that each person's age is matched up with a random political party affiliation from the sample in order to break up any association between the two.

```{r generate-null-f, warning = FALSE, message = FALSE}
# generate the null distribution using simulation
# generate the null distribution using randomization
null_distribution <- gss %>%
specify(age ~ partyid) %>%
hypothesize(null = "independence") %>%
@@ -81,7 +81,29 @@ null_distribution %>%
direction = "greater")
```

It looks like our observed test statistic would be _really_ unlikely if there were actually no association between age and political party affiliation. More exactly, we can calculate the p-value:
We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the `generate()` and `calculate()` steps when using the theoretical approach, and that we now need to provide `method = "theoretical"` to `visualize()`.

```{r visualize-f-theor, warning = FALSE, message = FALSE}
# visualize the theoretical null distribution and test statistic!
gss %>%
specify(age ~ partyid) %>%
hypothesize(null = "independence") %>%
visualize(method = "theoretical") +
shade_p_value(observed_f_statistic,
direction = "greater")
```

To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into `visualize()`, and then further provide `method = "both"` to `visualize()`.

```{r visualize-indep-both, warning = FALSE, message = FALSE}
# visualize both null distributions and the test statistic!
null_distribution %>%
visualize(method = "both") +
shade_p_value(observed_f_statistic,
direction = "greater")
```

Either way, it looks like our observed test statistic would be _really_ unlikely if there were actually no association between age and political party affiliation. More exactly, we can calculate the p-value:

```{r p-value-indep, warning = FALSE, message = FALSE}
# calculate the p value from the observed statistic and null distribution
@@ -94,4 +116,5 @@ p_value

Thus, if there were really no relationship between age and political party affiliation, the probability that we would see a statistic as or more extreme than `r observed_f_statistic` is approximately `r p_value`.


The package currently does not supply a wrapper for tidy ANOVA tests.
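
The permutation logic that the ANOVA vignette describes can be sketched in base R without infer. The example below is a minimal illustration under stated assumptions: `toy` is simulated data standing in for `gss`, `f_stat()` is a hypothetical helper defined here, and the loop is a plain reimplementation of the idea rather than infer's actual code.

```r
# Hypothetical toy data standing in for gss (simulated, three groups of 30)
set.seed(1)
toy <- data.frame(
  age = c(rnorm(30, mean = 40, sd = 10),
          rnorm(30, mean = 45, sd = 10),
          rnorm(30, mean = 50, sd = 10)),
  partyid = factor(rep(c("dem", "ind", "rep"), each = 30))
)

# Hypothetical helper: F statistic from a one-way ANOVA of response on group
f_stat <- function(response, group) {
  anova(lm(response ~ group))[["F value"]][1]
}

# Observed statistic on the unshuffled data
observed_f <- f_stat(toy$age, toy$partyid)

# Shuffle the response 1000 times to break any association with the groups
null_f <- replicate(1000, f_stat(sample(toy$age), toy$partyid))

# p-value: share of permuted statistics at least as large as the observed one
p_value <- mean(null_f >= observed_f)
```

Shuffling only the response is enough here, because what matters is the pairing between each age and its group label, not the order of either column.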
30 changes: 26 additions & 4 deletions vignettes/chi_squared.Rmd
@@ -62,18 +62,18 @@ observed_indep_statistic <- gss %>%

The observed $\chi^2$ statistic is `r observed_indep_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that these variables are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between education and income.

We can `generate` the null distribution in one of two ways---using simulation or theoretical approximation. The simulation approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
We can `generate` the null distribution in one of two ways---using randomization or theory-based methods. The randomization approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.

```{r generate-null-indep, warning = FALSE, message = FALSE}
# generate the null distribution using simulation
# generate the null distribution using randomization
null_distribution_simulated <- gss %>%
specify(college ~ finrela) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
```

Note that, in the line `specify(college ~ finrela)` above, we could use the equivalent syntax `specify(response = college, explanatory = finrela)`. The same goes in the code below, which generates the null distribution using theoretical approximation instead of simulation.
Note that, in the line `specify(college ~ finrela)` above, we could use the equivalent syntax `specify(response = college, explanatory = finrela)`. The same goes in the code below, which generates the null distribution using theory-based methods instead of randomization.

```{r generate-null-indep-t, warning = FALSE, message = FALSE}
# generate the null distribution by theoretical approximation
@@ -94,7 +94,29 @@ null_distribution_simulated %>%
direction = "greater")
```

It looks like our observed test statistic would be _really_ unlikely if there were actually no association between education and income. More exactly, we can calculate the p-value:
We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the `generate()` and `calculate()` steps when using the theoretical approach, and that we now need to provide `method = "theoretical"` to `visualize()`.

```{r visualize-indep-theor, warning = FALSE, message = FALSE}
# visualize the theoretical null distribution and test statistic!
gss %>%
specify(college ~ finrela) %>%
hypothesize(null = "independence") %>%
visualize(method = "theoretical") +
shade_p_value(observed_indep_statistic,
direction = "greater")
```
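
As a rough illustration of what the theoretical approach compares against, the base-R sketch below computes a $\chi^2$ statistic for a two-way table and looks up its upper-tail probability in a $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom. The `toy` data frame is simulated and only a hypothetical stand-in for `gss`; this shows the textbook calculation, not infer's internals.

```r
# Hypothetical toy data standing in for gss (simulated, independent by design)
set.seed(3)
toy <- data.frame(
  college = factor(sample(c("degree", "no degree"), 200, replace = TRUE)),
  finrela = factor(sample(c("below average", "average", "above average"),
                          200, replace = TRUE))
)

# Two-way contingency table of the two categorical variables
tab <- table(toy$college, toy$finrela)

# Pearson chi-squared test of independence (no continuity correction)
test <- chisq.test(tab, correct = FALSE)
observed_chisq <- unname(test$statistic)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability under the theoretical chi-squared distribution
theoretical_p <- pchisq(observed_chisq, df = df, lower.tail = FALSE)
```

The upper tail is used because larger values of the statistic indicate stronger departures from independence, which matches `direction = "greater"` in the chunks above.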

To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into `visualize()`, and further provide `method = "both"`.

```{r visualize-indep-both, warning = FALSE, message = FALSE}
# visualize both null distributions and the test statistic!
null_distribution_simulated %>%
visualize(method = "both") +
shade_p_value(observed_indep_statistic,
direction = "greater")
```

Either way, it looks like our observed test statistic would be _really_ unlikely if there were actually no association between education and income. More exactly, we can calculate the p-value:

```{r p-value-indep, warning = FALSE, message = FALSE}
# calculate the p value from the observed statistic and null distribution
49 changes: 47 additions & 2 deletions vignettes/infer.Rmd
@@ -156,7 +156,7 @@ gss %>%
calculate("diff in means", order = c("degree", "no degree"))
```

### Other Utilities in {infer}
### Other Utilities

`infer` also offers several utilities to extract the meaning out of summary statistics and null distributions---the package provides functions to visualize where a statistic is relative to a distribution (with `visualize()`), calculate p-values (with `get_p_value()`), and calculate confidence intervals (with `get_confidence_interval()`).

@@ -222,4 +222,49 @@ null_dist %>%

As you can see, 40 hours per week is not contained in this interval, which aligns with our previous conclusion that this finding is significant at the significance level $\alpha = .05$.

This vignette covers almost all of the key functionality of infer. See `help(package = "infer")` for a full list of functions and vignettes.
### Theoretical Methods

{infer} also provides functionality to use theoretical methods for `"Chisq"`, `"F"` and `"t"` test statistics.

Generally, to find a null distribution using theory-based methods, use the same code that you would use to find the null distribution using randomization-based methods, but skip the `generate()` step. For example, if we wanted to find a null distribution for the relationship between age (`age`) and party identification (`partyid`) using randomization, we could write:

```{r, message = FALSE, warning = FALSE}
null_f_distn <- gss %>%
specify(age ~ partyid) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
```

To find the null distribution using theory-based methods, instead, skip the `generate()` step entirely:

```{r, message = FALSE, warning = FALSE}
null_f_distn_theoretical <- gss %>%
specify(age ~ partyid) %>%
hypothesize(null = "independence") %>%
calculate(stat = "F")
```

We'll calculate the observed statistic to make use of in the following visualizations---this procedure is the same, regardless of the methods used to find the null distribution.

```{r, message = FALSE, warning = FALSE}
F_hat <- gss %>%
specify(age ~ partyid) %>%
calculate(stat = "F")
```

Now, instead of just piping the null distribution into `visualize()`, as we would do if we wanted to visualize the randomization-based null distribution, we also need to provide `method = "theoretical"` to `visualize()`.

```{r, message = FALSE, warning = FALSE}
visualize(null_f_distn_theoretical, method = "theoretical") +
shade_p_value(obs_stat = F_hat, direction = "greater")
```

To get a sense of how the theory-based and randomization-based null distributions relate, we can pipe the randomization-based null distribution into `visualize()` and also specify `method = "both"`.

```{r, message = FALSE, warning = FALSE}
visualize(null_f_distn, method = "both") +
shade_p_value(obs_stat = F_hat, direction = "greater")
```

That's it! This vignette covers almost all of the key functionality of infer. See `help(package = "infer")` for a full list of functions and vignettes.
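
For readers who want to see what a theory-based $F$ comparison amounts to numerically, the base-R sketch below fits a one-way ANOVA and computes the upper-tail probability from an $F$ distribution with $k - 1$ and $n - k$ degrees of freedom. The `toy` data frame is simulated and only a hypothetical stand-in for `gss`; the sketch assumes the usual one-way ANOVA reference distribution and does not reproduce infer's own code.

```r
# Hypothetical toy data standing in for gss (simulated, three groups of 30)
set.seed(2)
toy <- data.frame(
  age = rnorm(90, mean = 45, sd = 12),
  partyid = factor(rep(c("dem", "ind", "rep"), each = 30))
)

# One-way ANOVA table for age by party
fit <- anova(lm(age ~ partyid, data = toy))

# Observed F statistic and its degrees of freedom
observed_f <- fit[["F value"]][1]
df1 <- fit[["Df"]][1]  # k - 1, number of groups minus one
df2 <- fit[["Df"]][2]  # n - k, observations minus number of groups

# Upper-tail probability under the theoretical F distribution
theoretical_p <- pf(observed_f, df1, df2, lower.tail = FALSE)
```

The result agrees with the `Pr(>F)` column of the ANOVA table, which gives a quick check that the degrees of freedom were picked up correctly.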