Merge pull request #280 from tidymodels/theory-in-docs

more documentation for theoretical methods (closes #278)
tidymodels · Jan 27, 2020 · be264a0 · be264a0
2 parents f94daf2 + 0ba7bd4
Showing 7 changed files with 296 additions and 22 deletions.
diff --git a/R/visualize.R b/R/visualize.R
@@ -54,16 +54,16 @@
 #'
 #' @examples
 #'   
-#' # ...and a null distribution
+#' # find a null distribution
 #' null_dist <- gss %>%
-#'   # ...we're interested in the number of hours worked per week
+#'   # we're interested in the number of hours worked per week
 #'   specify(response = hours) %>%
 #'   # hypothesizing that the mean is 40
 #'   hypothesize(null = "point", mu = 40) %>%
 #'   # generating data points for a null distribution
 #'   generate(reps = 10000, type = "bootstrap") %>%
-#'   # finding the null distribution
-#'   calculate(stat = "mean")
+#'   # calculating a distribution of t test statistics
+#'   calculate(stat = "t")
 #'   
 #' # we can easily plot the null distribution by piping into visualize
 #' null_dist %>%
@@ -73,8 +73,8 @@
 #' # find the point estimate---mean number of hours worked per week
 #' point_estimate <- gss %>%
 #'   specify(response = hours) %>%
-#'   calculate(stat = "mean") %>%
-#'   dplyr::pull()
+#'   hypothesize(null = "point", mu = 40) %>%
+#'   calculate(stat = "t")
 #'   
 #' # find a confidence interval around the point estimate
 #' ci <- null_dist %>%
@@ -92,6 +92,20 @@
 #' null_dist %>%
 #'   visualize() +
 #'   shade_confidence_interval(ci)
+#'   
+#' # to plot a theoretical null distribution, skip the generate()
+#' # step and supply `method = "theoretical"` to `visualize()`
+#' null_dist_theoretical <- gss %>%
+#'   specify(response = hours) %>%
+#'   hypothesize(null = "point", mu = 40) %>%
+#'   calculate(stat = "t") 
+#'   
+#' visualize(null_dist_theoretical, method = "theoretical")
+#' 
+#' # to plot both a theory-based and simulation-based null distribution,
+#' # use the simulation-based null distribution and supply
+#' # `method = "both"` to `visualize()`
+#' visualize(null_dist, method = "both")
 #'
 #' # More in-depth explanation of how to use the infer package
 #' vignette("infer")
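
A minimal base-R sketch of what the theory-based null distribution in the example above corresponds to, assuming it is Student's t with n - 1 degrees of freedom for a one-sample test of the mean. The `hours` vector is simulated stand-in data (not the real `gss` sample), and `t_obs`, `grid`, `null_density`, and `p_theory` are illustrative names that do not appear in the package.

```r
# Simulated stand-in for gss$hours (hypothetical data, not the GSS sample)
set.seed(1)
hours <- rnorm(500, mean = 41, sd = 14)

mu0 <- 40            # null value, as in hypothesize(null = "point", mu = 40)
n <- length(hours)

# One-sample t statistic: (sample mean - null mean) / standard error
t_obs <- (mean(hours) - mu0) / (sd(hours) / sqrt(n))

# Cross-check against base R's t.test()
stopifnot(isTRUE(all.equal(t_obs, unname(t.test(hours, mu = mu0)$statistic))))

# Theoretical null density (assumed: t with n - 1 df), evaluated on a grid
grid <- seq(-4, 4, length.out = 201)
null_density <- dt(grid, df = n - 1)

# Two-sided p-value under that theoretical distribution
p_theory <- 2 * pt(-abs(t_obs), df = n - 1)
```

The density values in `null_density` are the kind of curve a theory-based plot would draw, and `t_obs` is where the observed statistic would be marked on it.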

diff --git a/man/visualize.Rd b/man/visualize.Rd
diff --git a/vignettes/anova.Rmd b/vignettes/anova.Rmd
@@ -58,10 +58,10 @@ observed_f_statistic <- gss %>%
 
 The observed $F$ statistic is `r observed_f_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that age and political party affiliation are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between the two variables.
 
-We can `generate` the null distribution using simulation. The simulation approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
+We can `generate` the null distribution using randomization. The randomization approach permutes the response and explanatory variables, so that each person's age is matched up with a random party affiliation from the sample in order to break up any association between the two.
 
 ```{r generate-null-f, warning = FALSE, message = FALSE}
-# generate the null distribution using simulation
+# generate the null distribution using randomization
 null_distribution <- gss %>%
   specify(age ~ partyid) %>%
   hypothesize(null = "independence") %>%
@@ -81,7 +81,29 @@ null_distribution %>%
                 direction = "greater")
 ```
 
-It looks like our observed test statistic would be _really_ unlikely if there were actually no association between age and political party affiliation. More exactly, we can calculate the p-value:
+We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the `generate()` and `calculate()` steps when using the theoretical approach, and that we now need to provide `method = "theoretical"` to `visualize()`.
+
+```{r visualize-f-theor, warning = FALSE, message = FALSE}
+# visualize the theoretical null distribution and test statistic!
+gss %>%
+  specify(age ~ partyid) %>%
+  hypothesize(null = "independence") %>%
+  visualize(method = "theoretical") + 
+  shade_p_value(observed_f_statistic,
+                direction = "greater")
+```
+
+To visualize both the randomization-based and theoretical null distributions and get a sense of how the two relate, we can pipe the randomization-based null distribution into `visualize()` and provide `method = "both"`.
+
+```{r visualize-indep-both, warning = FALSE, message = FALSE}
+# visualize both null distributions and the test statistic!
+null_distribution %>%
+  visualize(method = "both") + 
+  shade_p_value(observed_f_statistic,
+                direction = "greater")
+```
+
+Either way, it looks like our observed test statistic would be _really_ unlikely if there were actually no association between age and political party affiliation. More exactly, we can calculate the p-value:
 
 ```{r p-value-indep, warning = FALSE, message = FALSE}
 # calculate the p value from the observed statistic and null distribution
@@ -94,4 +116,5 @@ p_value
 
 Thus, if there were really no relationship between age and political party affiliation, the probability that we would see a statistic as or more extreme than `r observed_f_statistic` is approximately `r p_value`.
 
+
 The package currently does not supply a wrapper for tidy ANOVA tests.
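
As a rough illustration of the theory-based approach in this vignette, the sketch below computes an F statistic and its upper-tail p-value with base R only. It assumes the theoretical null is the F distribution with k - 1 and n - k degrees of freedom. The `age` and `partyid` vectors are simulated stand-ins for the `gss` columns, and `f_obs`, `df1`, `df2`, and `p_theory` are illustrative names.

```r
set.seed(2)
# Hypothetical stand-ins for gss$age and gss$partyid
partyid <- factor(sample(c("dem", "ind", "rep"), 300, replace = TRUE))
age <- rnorm(300, mean = 45, sd = 15)

# One-way ANOVA table from a linear model
fit <- anova(lm(age ~ partyid))
f_obs <- fit[["F value"]][1]
df1 <- fit[["Df"]][1]   # k - 1 (number of groups minus one)
df2 <- fit[["Df"]][2]   # n - k (residual degrees of freedom)

# Upper-tail probability, matching direction = "greater"
p_theory <- pf(f_obs, df1, df2, lower.tail = FALSE)

# The ANOVA table reports the same p-value
stopifnot(isTRUE(all.equal(p_theory, fit[["Pr(>F)"]][1])))
```

Only the upper tail is used because larger F values are the ones that indicate stronger evidence against independence.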
diff --git a/vignettes/chi_squared.Rmd b/vignettes/chi_squared.Rmd
@@ -62,18 +62,18 @@ observed_indep_statistic <- gss %>%
 
 The observed $\chi^2$ statistic is `r observed_indep_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that these variables are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between education and income.
 
-We can `generate` the null distribution in one of two ways---using simulation or theoretical approximation. The simulation approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
+We can `generate` the null distribution in one of two ways---using randomization or theory-based methods. The randomization approach permutes the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
 
 ```{r generate-null-indep, warning = FALSE, message = FALSE}
-# generate the null distribution using simulation
+# generate the null distribution using randomization
 null_distribution_simulated <- gss %>%
   specify(college ~ finrela) %>%
   hypothesize(null = "independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "Chisq")
 ```
 
-Note that, in the line `specify(college ~ finrela)` above, we could use the equivalent syntax `specify(response = college, explanatory = finrela)`. The same goes in the code below, which generates the null distribution using theoretical approximation instead of simulation.
+Note that, in the line `specify(college ~ finrela)` above, we could use the equivalent syntax `specify(response = college, explanatory = finrela)`. The same goes in the code below, which generates the null distribution using theory-based methods instead of randomization.
 
 ```{r generate-null-indep-t, warning = FALSE, message = FALSE}
 # generate the null distribution by theoretical approximation
@@ -94,7 +94,29 @@ null_distribution_simulated %>%
                 direction = "greater")
 ```
 
-It looks like our observed test statistic would be _really_ unlikely if there were actually no association between education and income. More exactly, we can calculate the p-value:
+We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the `generate()` and `calculate()` steps when using the theoretical approach, and that we now need to provide `method = "theoretical"` to `visualize()`.
+
+```{r visualize-indep-theor, warning = FALSE, message = FALSE}
+# visualize the theoretical null distribution and test statistic!
+gss %>%
+  specify(college ~ finrela) %>%
+  hypothesize(null = "independence") %>%
+  visualize(method = "theoretical") + 
+  shade_p_value(observed_indep_statistic,
+                direction = "greater")
+```
+
+To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into `visualize()`, and further provide `method = "both"`.
+
+```{r visualize-indep-both, warning = FALSE, message = FALSE}
+# visualize both null distributions and the test statistic!
+null_distribution_simulated %>%
+  visualize(method = "both") + 
+  shade_p_value(observed_indep_statistic,
+                direction = "greater")
+```
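
To make the comparison between the two kinds of null distribution concrete, here is a base-R sketch that builds a permutation null by hand and sets its quantiles next to those of a chi-squared distribution. It assumes the theoretical null has (rows - 1) * (cols - 1) degrees of freedom. The data are simulated stand-ins for `college` and `finrela`, and `chisq_stat`, `null_perm`, and `comparison` are hypothetical names.

```r
set.seed(3)
n <- 400
# Hypothetical stand-ins for gss$college and gss$finrela
college <- factor(sample(c("degree", "no degree"), n, replace = TRUE))
finrela <- factor(sample(c("below", "average", "above"), n, replace = TRUE))

# Pearson chi-squared statistic for a two-way table
chisq_stat <- function(x, y) {
  unname(chisq.test(table(x, y), correct = FALSE)$statistic)
}

# Randomization-based null: shuffle the response to break any association
null_perm <- replicate(1000, chisq_stat(sample(college), finrela))

# Theory-based null (assumed): chi-squared with (rows - 1) * (cols - 1) df
df <- (nlevels(college) - 1) * (nlevels(finrela) - 1)

# Put selected quantiles of the two distributions side by side
probs <- c(0.5, 0.9, 0.95)
comparison <- rbind(permutation = quantile(null_perm, probs),
                    theoretical = qchisq(probs, df = df))
comparison
```

With reasonably large expected cell counts, the two rows should be close, which is the agreement an overlaid plot of both distributions is meant to show.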
+
+Either way, it looks like our observed test statistic would be _really_ unlikely if there were actually no association between education and income. More exactly, we can calculate the p-value:
 
 ```{r p-value-indep, warning = FALSE, message = FALSE}
 # calculate the p value from the observed statistic and null distribution

diff --git a/vignettes/infer.Rmd b/vignettes/infer.Rmd
@@ -156,7 +156,7 @@ gss %>%
   calculate("diff in means", order = c("degree", "no degree"))
 ```
 
-### Other Utilities in {infer}
+### Other Utilities
 
 `infer` also offers several utilities to extract the meaning out of summary statistics and null distributions---the package provides functions to visualize where a statistic is relative to a distribution (with `visualize()`), calculate p-values (with `get_p_value()`), and calculate confidence intervals (with `get_confidence_interval()`).
 
@@ -222,4 +222,49 @@ null_dist %>%
 
 As you can see, 40 hours per week is not contained in this interval, which aligns with our previous conclusion that this finding is significant at the confidence level $\alpha = .05$.
 
-This vignette covers most all of the key functionality of infer. See `help(package = "infer")` for a full list of functions and vignettes.
+### Theoretical Methods
+
+{infer} also provides functionality to use theoretical methods for `"Chisq"`, `"F"` and `"t"` test statistics. 
+
+Generally, to find a null distribution using theory-based methods, use the same code that you would use to find the null distribution using randomization-based methods, but skip the `generate()` step. For example, if we wanted to find a null distribution for the relationship between age (`age`) and party identification (`partyid`) using randomization, we could write:
+
+```{r, message = FALSE, warning = FALSE}
+null_f_distn <- gss %>%
+   specify(age ~ partyid) %>%
+   hypothesize(null = "independence") %>%
+   generate(reps = 1000, type = "permute") %>%
+   calculate(stat = "F")
+```
+
+To find the null distribution using theory-based methods, instead, skip the `generate()` step entirely:
+
+```{r, message = FALSE, warning = FALSE}
+null_f_distn_theoretical <- gss %>%
+   specify(age ~ partyid) %>%
+   hypothesize(null = "independence") %>%
+   calculate(stat = "F")
+```
+
+We'll calculate the observed statistic to use in the following visualizations---this procedure is the same, regardless of the methods used to find the null distribution.
+
+```{r, message = FALSE, warning = FALSE}
+F_hat <- gss %>% 
+  specify(age ~ partyid) %>%
+  calculate(stat = "F")
+```
+
+Now, instead of just piping the null distribution into `visualize()`, as we would do if we wanted to visualize the randomization-based null distribution, we also need to provide `method = "theoretical"` to `visualize()`.
+
+```{r, message = FALSE, warning = FALSE}
+visualize(null_f_distn_theoretical, method = "theoretical") +
+  shade_p_value(obs_stat = F_hat, direction = "greater")
+```
+
+To get a sense of how the theory-based and randomization-based null distributions relate, we can pipe the randomization-based null distribution into `visualize()` and specify `method = "both"`.
+
+```{r, message = FALSE, warning = FALSE}
+visualize(null_f_distn, method = "both") +
+  shade_p_value(obs_stat = F_hat, direction = "greater")
+```
+
+That's it! This vignette covers most of the key functionality of infer. See `help(package = "infer")` for a full list of functions and vignettes.