-
Notifications
You must be signed in to change notification settings - Fork 51
/
AMOVA.Rmd
179 lines (140 loc) · 7.85 KB
/
AMOVA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
```{r,message=FALSE,echo=FALSE}
html <- TRUE
library("knitcitations")
library("knitr")
cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
bib <- read.bibtex("bibtexlib.bib")
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE,
fig.width = 10, fig.height = 6, cache = TRUE)
if (html) opts_chunk$set(out.width = "700px", dpi = 300)
# use this to set knitr options:
# http://yihui.name/knitr/options #chunk_options
```
---
title: "AMOVA"
subtitle: "*ZN Kamvar, SE Everhart and NJ Grünwald*"
---
In this chapter, we will utilize AMOVA to analyze our populations. AMOVA stands
for **A**nalysis of **MO**lecular **VA**riance and is a method to detect
population differentiation utilizing molecular markers `r citep(bib[c("excoffier1992analysis")])`. This procedure was initially implemented for DNA haplotypes, but applies to any marker system. The implementation of AMOVA in *poppr* requires two very basic components: (1) A distance matrix derived from the data and (2) a separate table used to partition the data into different stratifications.
The distance matrix can be calculated using any distance as long as it is euclidean. The distance that is used in the program Arlequin is the opposite of the Kronecker Delta function that counts the number of differences summed over $L$ loci:
$$
\delta_{l,m} = \begin{cases}
1 \text{ if } m = l,\\
0 \text{ if } m \neq l
\end{cases}
$$
$$
d_{i,j} = \sum_{L = 1}^L 1 - \delta_{i,j}
$$
Data set
----
To calculate AMOVA in *poppr*, one simply needs to supply a data set with
stratifications. We will use the *Aphanomyces euteiches* data set from
`r citep(bib["grunwald2006hierarchical"])`.
```{r}
library("poppr")
data("Aeut")
strata(Aeut) <- data.frame(other(Aeut)$population_hierarchy)
Aeut <- as.genclone(Aeut)
Aeut
```
We can see that this data set contains clonal data and has three stratifications where the first is really a combination of the other levels. We can take a look
at the different stratifications, populations or subpopulations:
```{r}
table(strata(Aeut, ~Pop)) # Populations
table(strata(Aeut, ~Pop/Subpop, combine = FALSE)) # Subpopulations
```
In this example, we have a data set of 187 individuals sampled from two fields located in Athena or Mt. Vernon over 8 or 10 different soil samples within each field. We want to see if most of the variance is observed at the sample, field, or regional level.
Analysis
----
In panmictic populations, we would expect to see most of the variance arise from within samples. If we see that the most of the variance occurs among samples within populations or among populations, then there is evidence that we have some sort of population structure. In the case of clonal organisms, this would help support a hypothesis of clonal reproduction. Since *Aphanomyces eutieches* is known to be clonal, we would not expect most of the variance to come from within samples.
Let's invoke the AMOVA functions with and without clone correction:
```{r, eval = FALSE}
Aeutamova <- poppr.amova(Aeut, ~Pop/Subpop)
Aeutamovacc <- poppr.amova(Aeut, ~Pop/Subpop, clonecorrect = TRUE)
```
```{r, echo = FALSE}
Aeutamova <- poppr.amova(Aeut, ~Pop/Subpop, quiet = TRUE)
Aeutamovacc <- poppr.amova(Aeut, ~Pop/Subpop, clonecorrect = TRUE, quiet = TRUE)
```
We'll look at the AMOVA results for both analyses.
```{r}
Aeutamova
Aeutamovacc
```
The first thing to look at are the `$results` element. The degrees of freedom
(the `Df` column) should match what we expect from our (not clone-corrected) data. The number in the `Total` row should equal 186 or $N - 1$, where values are calculated from pooled data. Note that here, "samples" actually refers to
subpopulations since we cannot asses within-sample variance of dominant data.
The `$componentsofcovariance` table shows how much
variance is detected at each stratification. We expect variations within samples to
give the greatest amount of variation for populations that are not significantly
differentiated. `Sigma` represents the variance, $\sigma$, for each hierarchical
level and to the right is the percent of the total.
Finally, `$statphi` provides the $\phi$ population differentiation statistics. These are used to test hypotheses about population differentiation. We would expect a higher $\phi$
statistic to represent a higher amount of differentiation.
Note, if you want to make a table of any of these components, you can isolate them by using the `$` operator and then export it to a table with `write.table`. Here's an example with the components of covariance:
```{r, eval = FALSE}
write.table(Aeutamova$componentsofcovariance, sep = ",", file = "~/Documents/AeutiechesAMOVA.csv")
```
Significance testing
----
To test if populations are significantly different, we perform a randomization test using the function `randtest()` from the *ade4* package. This will randomly permute the sample matrices as described in `r citep(bib["excoffier1992analysis"])`.
```{r, first_randtest, cache=TRUE}
set.seed(1999)
Aeutsignif <- randtest(Aeutamova, nrepet = 999)
Aeutccsignif <- randtest(Aeutamovacc, nrepet = 999)
```
```{r, Aeut_random_plot}
plot(Aeutsignif)
Aeutsignif
```
From this output, you can see three histograms representing the distribution of
the randomized strata. The black line represents the observed data. You can
see a table of observed results in the output showing that there is significant
population structure considering all levels of the population strata. Of
course, this could be due to the presence of clones, so let's visualize the
results from the clone corrected data set below:
```{r, Aeut_clonecorrect_random_plot}
plot(Aeutccsignif)
Aeutccsignif
```
The above graphs show significant population differentiation at all levels given that the observed $\phi$ does not fall within the distribution expected from the permutation. Compare the results of our AMOVA analysis to those published in `r citep(bib["grunwald2006hierarchical"])`. They should be identical.
Randomized population structure
----
Since AMOVA is used to detect whether or not there is significant population
structure, we can see what happens when we randomly shuffle the population
assignments in our data. Here we will show what the populations look like before
and after shuffling:
```{r}
Aeut.new <- Aeut
head(strata(Aeut)[, -1])
set.seed(9001)
head(strata(Aeut)[sample(nInd(Aeut)), -1])
```
Here we see that the populations are completely shuffled, so in the next step,
we will reassign the strata with these newly shuffled populations and rerun
the AMOVA analysis.
```{r, randomized_strata, cache=TRUE}
set.seed(9001)
strata(Aeut.new) <- strata(Aeut)[sample(nInd(Aeut)), -1]
Aeut.new.amova <- poppr.amova(Aeut.new, ~Pop/Subpop)
Aeut.new.amova
Aeut.new.amova.test <- randtest(Aeut.new.amova, nrepet = 999)
Aeut.new.amova.test
```
```{r, randomized_strata_plot}
plot(Aeut.new.amova.test)
```
We see that there now is no significant population structure.
Conclusions
----
AMOVA is a powerful tool that can help support hypotheses of population structure due to clonal reproduction or isolation without making assumptions about Hardy-Weinberg equilibrium. We have shown that we can reject the
$H_o$ of random mating between the two populations and have strong evidence that
these populations are significantly differentiated at all stratifications `r citep(bib["grunwald2006hierarchical"])`. From these results, we can investigate
hypotheses as to why these populations are significantly differentiated.
This example was performed with a data set of dominant (AFLP)
markers, but it can also be performed on codominant markers such as SNPs or SSRs. These provide more information because within sample (individual) variance is also assessed. If one wants to utilize a genetic distance that has biological relevance, a different distance matrix can be chosen. See `help('amova', package = 'poppr')` for more details.
References
----------
<!----------->