-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathvignette3.rmd
342 lines (250 loc) · 15 KB
/
vignette3.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: "Introduction to gllvm Part 1: Ordination"
author: "Jenni Niku"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to gllvm Part 1: Ordination}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
warning = FALSE,
fig.width=5, fig.height=5,
fig.align = "center",
dev = "png",
fig.pos = 'H'
)
```
# Introduction to gllvm
## R package gllvm
- **R** package **gllvm** fits Generalized linear latent variable models (GLLVM) for multivariate data^[Niku, J., F.K.C. Hui, S. Taskinen, and D.I. Warton. 2019. Gllvm - Fast Analysis of Multivariate Abundance Data with Generalized Linear Latent Variable Models in R. 10. Methods in Ecology and Evolution: 2173–82].
- Developed by J. Niku, W.Brooks, R. Herliansyah, F.K.C. Hui, S. Taskinen, D.I. Warton, B. van der Veen.
- The package available in
- GitHub: <https://github.com/JenniNiku/gllvm>
- CRAN: <https://CRAN.R-project.org/package=gllvm>
- Package installation:
```{r, eval = FALSE, echo=TRUE, warning=FALSE}
# From CRAN
install.packages(gllvm)
# OR
# From GitHub using devtools package's function install_github
devtools::install_github("JenniNiku/gllvm")
```
<details><summary><span style="color:red"> Problems? </span></summary>
<span style="color:red"> **gllvm** package depends on R packages **TMB** and **mvabund**, try to install these first.</span>
</details>
- GLLVMs are computationally intensive to fit due the integral in log-likelihood.
- **gllvm** package overcomes computational problems by applying closed form approximations to log-likelihood and using automatic differentiation in C++ to accelerate computation times (**TMB**^[Kasper Kristensen, Anders Nielsen, Casper W. Berg, Hans Skaug, Bradley M. Bell (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21]).
- Estimation is performed using either variational approximation (VA^[Hui, F. K. C., Warton, D., Ormerod, J., Haapaniemi, V., and Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics. Journal of Computational and Graphical Statistics, 26:35-43]), extended variational approximation method (EVA^[Korhonen, P., Hui, F. K. C., Niku, J., and Taskinen, S. (2021). Fast, universal estimation of latent variable models using extended variational approximations, arXiv:2107.02627 .]) or Laplace approximation (LA^[Niku, J., Warton, D. I., Hui, F. K. C., and Taskinen, S. (2017). Generalized linear latent variable models for multivariate count and biomass data in ecology. Journal of Agricultural, Biological, and Environmental Statistics, 22:498-522.]) method implemented via **R** package **TMB**.
- VA method is faster and more accurate than LA, but not applicable for all distributions and link functions.
- Using **gllvm** we can fit
- GLLVM without covariates gives model-based ordination and biplots
- GLLVM with environmental covariates for studying factors explaining species abundance
- Fourth corner models with latent variables for studying environmental-trait interactions
- GLLVM without latent variables fits basic multivariate GLMs
- Additional tools: model checking, model selection, inference, visualization.
## Distributions
| Response | Distribution | Method | Link |
| ----------- |:------------:|:------- |:------- |
|Counts | Poisson | VA/LA |log |
| | NB | VA/LA |log |
| | ZIP | LA |log |
|Binary | Bernoulli | VA/LA |probit |
| | | EVA/LA |logit |
|Ordinal | Ordinal | VA |probit |
|Normal | Gaussian | VA/LA |identity|
|Positive continuous| Gamma | VA/LA |log|
|Non-negative continuous| Exponential | VA/LA |log|
|Biomass | Tweedie | LA/EVA |log |
|Percent cover| beta | LA/EVA |probit/logit |
## Data input
Main function of the **gllvm** package is `gllvm()`, which can be used to fit GLLVMs for multivariate data with the most important arguments listed in the following:
```{r, eval = FALSE, echo=TRUE}
gllvm(y = NULL, X = NULL, TR = NULL, family, num.lv = 2,
formula = NULL, method = "VA", row.eff = FALSE, n.init=1, starting.val ="res", ...)
```
- y: matrix of abundances
- X: matrix or data.frame of environmental variables
- TR: matrix or data.frame of trait variables
- family: distribution for responses
- num.lv: number of latent variables
- method: approximation used "VA" or "LA"
- row.eff: type of community level row effects
- n.init: number of random starting points for latent variables
- starting.val: starting value method
```{r, eval = TRUE, echo=TRUE}
library(gllvm)
```
## Example: Spiders
- Abundances of 12 hunting spider species measured as a count at 28 sites^[van der Aart, P. J. M., and Smeenk-Enserink, N. (1975) Correlations between distributions of hunting spiders (Lycosidae, Ctenidae) and environmental characteristics in a dune area. Netherlands Journal of Zoology 25, 1-45.].
- Six environmental variables measured at each site.
* `soil.dry`: Soil dry mass
* `bare.sand`: cover of bare sand
* `fallen.leaves`: cover of fallen leaves/twigs
* `moss`: cover of moss
* `herb.layer`: cover of herb layer
* `reflection`: reflection of the soil surface with a cloudless sky
## Data fitting
Fitting basic GLLVM $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$ with **gllvm**:
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
data("spider", package = "mvabund")
library(gllvm)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
fitnb
```
## Residual analysis
- Residual analysis can be used to assess the appropriateness of the fitted model (eg. in terms of mean-variance relationship).
- Randomized quantile/Dunn-Smyth residuals^[Dunn, P. K., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5, 236-244.] are used in the package, as they provide standard normal distributed residuals, even for discrete responses, in the case of a proper model.
```{r, eval = TRUE, echo=TRUE, fig.width=8}
par(mfrow = c(1,2))
plot(fitnb, which = 1:2)
```
## Model selection
- Information criteria can be used for model selection.
- For example, compare distributions or choose suitable number of latent variables.
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
fitp <- gllvm(y = spider$abund, family = poisson(), num.lv = 2)
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
AIC(fitp)
AIC(fitnb)
```
# Exercises
Try to do these exercises for the next 10 minutes, as many as time is enough for.
<span style="color:blue"> E1. Load spider data from **mvabund** package and take a look at the dataset. </span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
library(gllvm)
data("spider", package = "mvabund")
# more info:
# ?spider
```
<details><summary><span style="color:red"> Show the answers. </span></summary>
<span style="color:red"> 1. Print the data and covariates and draw a boxplot of the data. </span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
# response matrix:
spider$abund
# Environmental variables
spider$x
# Plot data using boxplot:
boxplot(spider$abund)
```
</details>
<span style="color:blue"> E2. Fit GLLVM to spider data with a suitable distribution. Data consists of counts of spider species.</span>
```{r, eval = FALSE, echo=TRUE, warning=FALSE}
# Take a look at the function documentation for help:
?gllvm
```
<details><summary><span style="color:red"> Show the answers. </span></summary>
<span style="color:red"> 2. Response variables in spider data are counts, so Poisson, negative binomial and zero inflated Poisson are possible. However, ZIP is implemented only with Laplace method, so it need to be noticed, that if models are fitted with different methods they can not be compared with information criteria. Let's try just with a Poisson and NB.</span>
<span style="color:red"> **NOTE THAT** the results may not be exactly the same as below, as the initial values for each model fit are slightly different, so the results may also differ slightly.</span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE, fig.width=8, fig.height=5}
# Fit a GLLVM to data
fitp <- gllvm(y = spider$abund, family = poisson(), num.lv = 2)
fitp
fitnb <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 2)
fitnb
```
Based on AIC, NB distribution suits better. How about residual analysis:
<span style="color:red"> **NOTE THAT** The package uses randomized quantile residuals so each time you plot the residuals, they look a little different.</span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
# Fit a GLLVM to data
plot(fitp)
plot(fitnb)
```
You could do these comparisons with Laplace method as well, using the code below, and it would give the same conclusion that NB distribution suits best:
```{r, eval = FALSE, echo=TRUE, warning=FALSE}
fitLAp <- gllvm(y = spider$abund, family = poisson(), method = "LA", num.lv = 2)
fitLAnb <- gllvm(y = spider$abund, family = "negative.binomial", method = "LA", num.lv = 2)
fitLAzip <- gllvm(y = spider$abund, family = "ZIP", method = "LA", num.lv = 2)
AIC(fitLAp)
AIC(fitLAnb)
AIC(fitLAzip)
```
</details>
<span style="color:blue"> E3. Explore the fitted model. Where are the estimates for parameters? What about predicted latent variables? Standard errors?</span>
<details><summary><span style="color:red"> Show the answers. </span></summary>
<span style="color:red"> 3. Lets explore the fitted model: </span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
# Parameters:
coef(fitnb)
# Where are the predicted latent variable values? just fitp$lvs or
getLV(fitnb)
# Standard errors for parameters:
fitnb$sd
```
</details>
<span style="color:blue"> E4. Fit model with different numbers of latent variables.</span>
<details><summary><span style="color:red"> Show the answers. </span></summary>
<span style="color:red"> 4. Default number of latent variables is 2. Let's try 1 and 3 latent variables as well: </span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
# In exercise 2, we fitted GLLVM with two latent variables
fitnb
# How about 1 or 3 LVs
fitnb1 <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 1)
fitnb1
getLV(fitnb1)
fitnb3 <- gllvm(y = spider$abund, family = "negative.binomial", num.lv = 3)
fitnb3
getLV(fitnb3)
```
</details>
<span style="color:blue"> E5. Include environmental variables to the GLLVM and explore the model fit. </span>
<details><summary><span style="color:red"> Show the answers. </span></summary>
<span style="color:red"> 5. Environmental variables can be included with an argument `X`: </span>
```{r, eval = TRUE, echo=TRUE, warning=FALSE}
fitnbx <- gllvm(y = spider$abund, X = spider$x, family = "negative.binomial", seed = 123, num.lv = 2)
fitnbx
coef(fitnbx)
# confidence intervals for parameters:
confint(fitnbx)
```
</details>
<details><summary><span style="color:red"> Problems? See hints:</span></summary>
<span style="color:red"> I have problems in model fitting. My model converges to infinity or local maxima: </span>
GLLVMs are complex models where starting values have a big role. Choosing a different starting value method (see argument `starting.val`) or use multiple runs and pick up the one giving highest log-likelihood value using argument `n.init`. More variation to the starting points can be added with `jitter.var`.
<span style="color:red"> My results does not look the same as in answers:</span>
The results may not be exactly the same as in the answers, as the initial values for each model fit are slightly different, so the results may also differ slightly.
</details>
# Ordination
## GLLVM as a model based ordination method
- GLLVMs can be used as a model-based approach to unconstrained ordination by including two latent variables in the model: $g(E(y_{ij})) = \beta_{0j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$
- The latent variable term try to capture the underlying factors driving species abundances at sites.
- Predictions for the two latent variables, $\boldsymbol{\hat u}_i=(\hat u_{i1}, \hat u_{i2})$, then provide coordinates for sites in the ordination plot and then provides a graphical representation of which sites are similar in terms of their species composition.
## Ordination plot
- `ordiplot()` produces ordination plots based on fitted GLLVMs.
- Uncertainty of the ordination points in model based ordination can be assessed with prediction regions based on the prediction errors of latent variables.
- Prediction regions may also help interpreting which differences between ordination points are really a sign of the difference between species composition at those sites.
```{r, eval = TRUE, echo=FALSE, fig.width=5, fig.height=5}
par(mfrow=c(1,1))
ordiplot(fitnb, predict.region = TRUE, ylim=c(-2.5,2.5), xlim=c(-2,3))
```
## Biplot
- Between species correlations can be visualized with biplot^[Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467] by adding latent variable loadings $\boldsymbol{\theta}_j$ to the ordination of sites, by producing a biplot, (argument `biplot = TRUE` in `ordiplot()`).
- In a biplot latent variables and their loadings are rotated so that the LV loadings of the species are in the same direction with the sites where they are most abundant.
- Biplot can be used for finding groups of correlated species or finding indicator species common at specific sites.
- For example, species named Pardlugu is common in the group of sites located on the bottom (eg sites 8, 19, 20 and 21).
- x and y axes may help the interpretation.
```{r, eval = TRUE, echo=TRUE}
ordiplot(fitnb, biplot = TRUE)
abline(h = 0, v = 0, lty=2)
```
## Environmental gradients
- The potential impact of environmental variables on species communities can be viewed by coloring ordination points according to the variables.
- For example, species named Pardlugu seems to prefer sites with lot of dry soil mass and low reflection of the soil surface with a cloudless sky.
```{r, eval = TRUE, echo=TRUE, fig.width=6, fig.height=9}
# Arbitrary color palette, a vector length of 20. Can use, for example, colorRampPalette from package grDevices
rbPal <- c("#00FA9A", "#00EC9F", "#00DFA4", "#00D2A9", "#00C5AF", "#00B8B4", "#00ABB9", "#009DBF", "#0090C4", "#0083C9", "#0076CF", "#0069D4", "#005CD9", "#004EDF", "#0041E4", "#0034E9", "#0027EF", "#001AF4", "#000DF9", "#0000FF")
X <- spider$x
par(mfrow = c(3,2), mar=c(4,4,2,2))
for(i in 1:ncol(X)){
Col <- rbPal[as.numeric(cut(X[,i], breaks = 20))]
ordiplot(fitnb, symbols = T, s.colors = Col, main = colnames(X)[i],
biplot = TRUE)
}
```
- Here environmental gradients stand out quite clearly, indicating that, at least, some of the differences in species compositions at sites can be explained by the differences in environmental conditions.
- The next step would be to include covariates to the model to study more precisely the effects of environmental variables:
$g(E(y_{ij})) = \beta_{0j} + \boldsymbol{x}_i'\boldsymbol{\beta}_{j} + \boldsymbol{u}_i'\boldsymbol{\theta}_j$