vignettes/getting-started.Rmd

---
title: "Analysing amplicon data with DivNet"
author: "Amy Willis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css, echo=FALSE}
.note-box {
  background-color: lightgreen;
  border: 3px solid green;
  font-weight: bold;
}
```

## Vignette Info

Mike Lee, a bioinformatics wizard, recently put together some fantastic tutorials on bioinformatics for analysing microbiome data at [link](astrobiomike.github.io). Mike has very kindly allowed us to distribute the phyloseq data object from this tutorial via DivNet. In this tutorial we're going to analyse this data set.

## Preliminaries

Let's load some required packages and check our session info:

```{r}
library(magrittr)
if ("speedyseq" %in% row.names(installed.packages())) { 
  library(speedyseq) # if speedyseq is installed, use it for tax_glom 
} else {
  library(phyloseq) # if not, use phyloseq for tax_glom
}
library(breakaway)
library(DivNet) # last checked with version 0.3.5
sessionInfo()
```

Let's take a quick look at the Lee et al dataset.

```{r}
data(Lee)
Lee
```

16 samples, 1490 ASVs, and taxonomy information. Fantastic!

## What does DivNet do that I can't do already?

The idea behind `DivNet`, just like the idea behind `breakaway`, is that you don't care about your samples. You care about the population that your samples were drawn from. So Mike doesn't care about the X mL of sea scum that he scraped off these basalts, he cares about all of the microbes that live on those basalts. For example, if we knew the true relative abundances of all of the phyla living on seafloor basalts, we could calculate the true Shannon diversity at the phylum level of microbes living on seafloor basalt, and compare it to the, e.g., true Shannon diversity of microbes living on land basalts. In order to do this comparison, we need an estimate of the Shannon diversity of microbes living on seafloor basalts.

One approach to getting such an estimate is to consider the  Shannon diversity of the samples that we took:

```{r}
tax_glom(Lee, taxrank="Phylum") %>%
  sample_richness %>%
  plot
```

These numbers tell us something about the X mL of  microbial matter that we observed. However, while we hope that they are reflective of all microbial matter living on those rocks, the samples were inexhaustive (we didn't get all the microbes on the rocks into the MiSeq), and so the relative abundances that we observed are definitely not the relative abundances of all phyla on the rock. There are likely missing taxa (that are on the rocks but not in the sample), over sampled taxa (that were observed in greater proportion in the sample than live on the rock), and under sampled taxa (that were observed in lower proportion).

By taking advantage of biological replicates and covariate information, DivNet works to get a more complete picture of the diversity of the microbes on the rocks. Let's take a look to see how it works.

## Using DivNet

It's tempting to jump right in and run `divnet(Lee)`, but DivNet gets more expensive as the number of taxa increases. For this reason, were going to analyse the data at a higher taxonomic level than ASVs. Let's look at the phylum level (just to illustrate!).  

```{r}
lee_phylum <- tax_glom(Lee, taxrank="Phylum")
lee_phylum
```

20 taxa is incredibly manageable! Let's go ahead and run `divnet`. My computer has 4 cores, so I am just going to run in parallel with the `ncores` argument.

**Sidenote for the reproducibility nerds**: `DivNet`'s model fitting procedure has a Metropolis Hastings step to estimate a complex integral using Monte Carlo estimation, which introduces some randomness into the results. Under the default settings the results should change very little between runs (especially if you set `tuning = "careful"` or increase `EMiter` or `MCiter` from the defaults, which are `tuning = list(EMiter = 10, EMburn = 5, MCiter = 1000, MCburn = 500, stepsize = 0.01)`). To make sure that they don't change between runs, we *could* set the random number seed... but this would only work if we were running everything on one core. Since we want to distribute the process over 4 cores, and setting seeds over multiple cores is [kind of a pain](https://www.r-bloggers.com/%F0%9F%8C%B1-setting-a-seed-in-r-when-using-parallel-simulation/), I'm just going to up the ante with `tuning = "careful"` and hope you'll forgive me this time. If you're a less powerful machine feel free to remove this argument (default is `tuning = "fast"`). 


```{r, results = "hide"}
set.seed(20200318)
divnet_phylum <- lee_phylum %>%
  divnet(ncores = 4, tuning = "careful")
```

Hopefully that didn't take too long! If you don't want to run in parallel, you can ignore the `ncores`  argument.

Let's take a look at what the output of DivNet is: a list of diversity indices and some variances.

```{r}
divnet_phylum %>% names
```

For each of the 4 diversity indices mentioned above, we have an estimate of the diversity index of the population from which that sample was drawn. So the estimated Shannon index is

```{r}
divnet_phylum$shannon %>% head
```

and the variance of the estimate is also shown.

Why are the estimates all different? We didn't tell DivNet about any covariate information, so it just assumes that all the samples are from different populations. But we have information about the samples and the conditions under which they observed:

```{r}
lee_phylum %>% sample_data
```

Let's use this to estimate the Shannon diversity of basalts of different characteristics (`char`) using DivNet. There are several ways we can accomplish this using the `divnet` function. If we take a look at a description of the function using 

```{r}
?DivNet::divnet
```

We see that `divnet()` requires an abundance table `W` where taxa are columns and samples are rows; or a phyloseq object. If your input is a phyloseq object, as is the case with our example using `lee_phylum`, then there are two ways you could provide covariate information such as (`char`) to DivNet: 

```{r, results = "hide"}
divnet_phylum_char <- lee_phylum %>%
  divnet(X = "char", ncores = 4)
```

or

```{r, results = "hide"}
divnet_phylum_char2 <- lee_phylum %>%
  divnet(formula = ~ char, ncores = 4)
```

The first way uses `X = "char"` to indicate using the `char` variable as a covariate. Using this approach you can only specify a single covariate. If you use `formula` to specify your model of interest you can indicate one or more covariates. For example, if you were interested in using more than one covariate such as `char` and `temp` you would use `formula` to specify your model: 

```{r, results = "hide"}
divnet_phylum_char_temp <- lee_phylum %>%
  divnet(formula = ~ char + temp, ncores = 4)
```

If your abundance table `W` is not a phyloseq object and you want to utilize covariate information with DivNet you will need to provide a matrix of covariates to the `X` argument where samples are rows and variables are columns.  You can additionally indicate your model of interest using `formula` in the same manner illustrated above if you wanted to model a subset of variables in your covariate matrix. 

Now that we have `divnet_phylum_char`, let's now compare the plug-in Shannon index with the divnet estimates

```{r}
library(ggplot2)
divnet_phylum_char$shannon %>% 
  plot(lee_phylum, color = "char") +
  xlab("Basalt characteristic") +
  ylab("Shannon diversity estimate\n(phylum level)") +
  ylim(c(0,2))
```

You will notice that the plug-in  estimates of Shannon diversity are different for each sample, but there is only a single DivNet estimate for each  characteristic (along with error bars). For characteristics for which many samples were observed, there are smaller error bars than for samples for which there was only one sample (seems reasonable -- we had less data).

## Where do the variance estimates come from?

When taxa cluster together (spatially), you can imagine that your samples are more likely to be different from each other. For example, you may happen upon a patch of Microbe A in your first sample, but get a patch of Microbe B in your second sample. The statistics word for this is variance, and it is well studied that dependence structures (like spatial organisation) lead to greater variance. Plug-in estimates of diversity (the ones you're familiar with: where you just take the sample data and calculate the diversity index on it) ignore any dependence structure. This leads to understatements of variance (samples are more varying than the model that underpins the estimate allows), and exaggerated risk of concluding significant differences when none exist (p values are smaller than they should be, leading to an increased Type 1 error rate).
DivNet addresses this by explicitly estimating those dependence structures, and then using them to come up with variance estimates (useful for the error bars). Let's see that in action.

## Hypothesis testing with DivNet

I think the most common way ecologists currently test hypotheses about diversity is with a t-test. For example, if biological replicates are available, they would take the estimated index for one group and compared to the estimated index of the second group (Don't stop reading! You shouldn't do this!):

```{r}
plugin <- tax_glom(Lee, taxrank="Phylum") %>%
            estimate_richness(measures = "Shannon") %$% Shannon
char <- Lee %>% sample_data %$% char
t.test(plugin[char == "altered"],
       plugin[char == "glassy"])
```

Underpinning this approach is a major problem: the plug-in estimates of diversity are not very good because they don't account for missing taxa or undersampling or oversampling of taxa. For the Shannon index, estimated diversity is too low, with a known negative bias. They are also based on the multinomial model, which ignores taxon-taxon interactions and the additional variance attributable to them.

DivNet addresses these problems by accounting for oversampling and undersampling, and modelling these interactions. This gives us better variance estimates with which to do hypothesis testing.

Let's say we want to compare the Shannon index across the different characteristics. The package `breakaway` provides an implementation for statistical inference for alpha diversity called `betta`. We are going to use this function here for inference on the Shannon index.

 To set this up, we need a vector of our alpha diversity estimates, a vector of the standard errors in these estimates, and a design matrix

```{r}
estimates <- divnet_phylum_char$shannon %>% summary %$% estimate
ses <- sqrt(divnet_phylum_char$`shannon-variance`)
X <- breakaway::make_design_matrix(lee_phylum, "char")
betta(estimates, ses, X)$table
```

The intercept term is our altered basalts (they are the only ones that aren't listed), and so all comparisons are made to this baseline. We see that using DivNet we conclude that there is a significant differences between altered and glassy basalts, unlike what we saw with the t-test (DivNet p value is `r betta(estimates, ses, X)$table["predictorsglassy",3]` not `r round(t.test(plugin[char == "altered"], plugin[char == "glassy"])$p.value, 2)`). Note that the estimated differences between altered and glassy are similar (DivNet estimates a change in Shannon diversity of `r round(betta(estimates, ses, X)$table["predictorsglassy", 1], 3)` while the t-test estimated `r round(mean(plugin[char == "glassy"])-mean(plugin[char == "altered"]), 3)`), but the difference is the lower standard error (greater precision in estimating the difference) from DivNet. 

We also see some differences in alpha diversity across the other different groups: biofilm basalts have significantly different diversity (at the phylum level) than altered basalts. If we want to do the global test of whether or not there are differences across categories, we can get the p value as follows:

```{r}
betta(estimates, ses, X)$global[2]
```

`breakaway` also implements several other hypothesis testing functions. `betta_random` will let you fit a model with a random effect. `betta_lincom` will generate confidence intervals for and test hypotheses about linear combinations of fixed effects estimated with `betta` or `betta_random`. `test_submodel` will test whether your mean diversity index varies by any level of your covariate of interest by using an F test to compare a full model to a submodel. Examples of these functions can be found in this [`breakaway` tutorial](https://github.com/adw96/breakaway/blob/main/vignettes/diversity-hypothesis-testing.Rmd).

If you use our tools, please don't forget to cite them!

- `breakaway`: Willis & Bunge. (2015). *Estimating diversity via frequency ratios*. Biometrics. [doi:10.1111/biom.12332](https://doi.org/10.1111/biom.12332).
- `DivNet`: Willis & Martin. (2020). *DivNet: Estimating diversity in networked communities*. Biostatistics. [doi:10.1093/biostatistics/kxaa015](https://doi.org/10.1093/biostatistics/kxaa015).
- `betta`: Willis, Bunge & Whitman. (2016). *Improved detection of changes in species richness in high diversity microbial communities*. Journal of the Royal Statistical Society: Series C. [doi:10.1111/rssc.12206](https://doi.org/10.1111/rssc.12206).