Validity of gllvm (and copulas) on compositional data #126
-
Hi there! @dwarton @JenniNiku @BertvanderVeen

One way to look at this: if one species' (sp. A) actual abundance in the environment increases while all other taxa (spp. B-Z) stay at the same abundance, sequencing data would show a negative correlation between A and all of B-Z, even though this is just an artifact of the sequencing method. The relative abundance of A increases while the relative abundances of B-Z decrease, even though their absolute abundances haven't changed. Unfortunately, the number of reads returned for a sample has little to do with the number of microbial cells, so this causes a whole host of statistical issues and requires all methods to treat the data as compositional rather than as absolute counts.

Included are some plots I've generated for the same dataset using various ordination methods; they seem to be relatively consistent in the pattern they are showing. Therefore, I am curious whether the dependence structure of this data type is actually leading to issues with model-based ordinations, or whether most of the spurious correlations between species "cancel each other out". I'm not entirely sure how this works out mathematically. This is just a single case example, but I've seen it in some of my other datasets too.

Just for background information, these are 16S rRNA microbial data taken along two rivers in Florida, Spring Creek and the Imperial River, on three dates. Orange and black points are from the "dry season" from both rivers and blue points are from the wet season in one river. Samples on the same date were collected at somewhat evenly distributed sampling points along the rivers from downstream to upstream. There was a marked salinity gradient in the physicochemical dataset, particularly in the dry season samples, and it shows up along at least one axis of each ordination.
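To make the artifact described at the top of this post concrete, here is a minimal sketch with made-up numbers. The taxa names and cell counts are hypothetical and are not taken from the river dataset; only taxon A changes in absolute terms.

```python
def relative_abundance(counts):
    """Convert absolute counts per taxon into proportions that sum to one."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances (cells per sample), invented for illustration.
before = {"A": 100, "B": 100, "C": 100, "D": 100}
after = dict(before, A=400)  # only A increases; B, C and D are unchanged

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)

for taxon in before:
    print(
        f"{taxon}: absolute {before[taxon]} -> {after[taxon]}, "
        f"relative {rel_before[taxon]:.3f} -> {rel_after[taxon]:.3f}"
    )
```

The absolute counts of B, C and D stay at 100, yet their proportions drop once A increases, which is the apparent negative association that arises purely from closing the data to a constant sum.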
NMDS with relative abundance data with "Bray-Curtis" (really percentage difference):

PCoA with relative abundance data with "Bray-Curtis" (really percentage difference):

Distance-based redundancy analysis (DB-RDA) with relative abundance data with "Bray-Curtis" (really percentage difference) and four constraining environmental variables (nitrate+nitrite, electrical conductivity, total organic nitrogen, and sucralose):

Unconstrained GLLVM with random intercepts using the negative binomial distribution and sequencing depth of each sample as an offset:

set.seed(1)

Copula using the negative binomial distribution and sequencing depth of each sample as an offset:
Replies: 4 comments 6 replies
-
Thanks Mike! Since this is not an issue, but a discussion, I will convert it to that (see discussions tab). I will have a more thorough read of your post tomorrow!
-
I don't know how to give any more definitive an answer than I already have, I feel like this is going around in circles. Maybe you missed a previous post on this? gllvms and copulas, with a row effect in the model, are fine for compositional data. More than fine - I see adding a row effect to a count dataset (with a log-link) as the appropriate way to handle compositional counts. Doing this conditions on library size hence all other terms in the mean model can be interpreted as describing effects on relative abundance not total abundance, hence are modelling compositional effects.

I think the confusion comes from a sub-literature that talks about composition as inducing negative correlation (as mentioned in the above post) if we model proportions or other transformations of the data to a compositional scale. Sure, it does. But we don't model the proportions in a count model, we model the raw counts (that have no sum-to-one constraint), and include terms (row effects in this case) in the model to control for library size. So the negative correlation is pretty much being handled by terms in the mean model, the row effects, rather than by distributional assumptions on the response.

But note that these methods assume a (reduced rank) unstructured covariance matrix across responses so correlation across responses was never going to be an issue for them anyway. Negative correlation across responses is something that could only become a problem if you ignored it and did separate univariate analyses.

Regarding the code and results you showed - as you say all methods seemed to be able to separate the different samples as expected (?). I didn't see any qualitative differences in results, except maybe better separation of Imperial from Spring in model-based ordinations.
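A short sketch of the algebra behind the row-effect argument may help. The notation is an assumption introduced here for illustration and does not come from the post: $\mu_{ij}$ is the mean count of species $j$ in sample $i$, $\alpha_i$ the row effect, $\beta_{0j}$ a species intercept, and $\mathbf{u}_i^\top \boldsymbol{\gamma}_j$ the latent-variable term.

```latex
\begin{align*}
\log \mu_{ij} &= \alpha_i + \eta_{ij},
  \qquad \eta_{ij} = \beta_{0j} + \mathbf{u}_i^\top \boldsymbol{\gamma}_j \\
\mu_{ij} &= e^{\alpha_i}\, e^{\eta_{ij}} \\
\frac{\mu_{ij}}{\sum_{k} \mu_{ik}}
  &= \frac{e^{\alpha_i}\, e^{\eta_{ij}}}{e^{\alpha_i} \sum_{k} e^{\eta_{ik}}}
   = \frac{e^{\eta_{ij}}}{\sum_{k} e^{\eta_{ik}}}
\end{align*}
```

Under a log-link, $\alpha_i$ multiplies every species mean in sample $i$ by the same factor, so it can absorb the sample total. The expected share of species $j$ in sample $i$ does not depend on $\alpha_i$, which is why the remaining terms in $\eta_{ij}$ can be read as describing relative rather than total abundance.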
-
@dwarton thanks for your explanation. While we're on this, I have a related question re: Dirichlet-multinomial. Say we fit a Dirichlet-multinomial to model relative abundances instead. Would you still include the row effects? I wasn't sure about this because the trial size and/or the sum-to-one constraint of the Dirichlet-multinomial could already be controlling for library size / sampling effort. Thanks.
-
Thanks Mike!