Validity of gllvm (and copulas) on compositional data #126

mike-kratz · 2023-06-08T17:10:39Z

mike-kratz
Jun 8, 2023

Hi there!

@dwarton @JenniNiku @BertvanderVeen
I know people have asked this question in the past, but I'm not sure I ever saw a definitive answer. Microbial 16S sequencing data are compositional (i.e., they can only be interpreted in terms of relative rather than absolute abundance), so none of the species/taxa are independent. Does this lead to invalid covariance/correlation structures from gllvms and copulas?

One way to look at this: if the actual abundance of one species (sp. A) in the environment increases while all other taxa (spp. B-Z) stay at the same abundance, sequencing data would show a negative correlation between A and all of B-Z, even though this is just an artifact of the sequencing method. The relative abundance of A increases while the relative abundances of B-Z decrease, even though their absolute abundances haven't changed. Unfortunately, the number of reads returned for a sample has little to do with the number of microbial cells, which causes a whole host of statistical issues and requires all methods to treat the data as compositional rather than as absolute counts.
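A small simulation illustrates the artifact described above. It is only a sketch with made-up lognormal abundances and hypothetical taxon names A-E, not the river dataset:

```r
set.seed(1)
n_samples <- 200

# Hypothetical absolute abundances: taxon A varies a lot between samples,
# taxa B-E fluctuate only slightly and independently of A.
abs_abund <- cbind(
  A = rlnorm(n_samples, meanlog = 5, sdlog = 1),
  B = rlnorm(n_samples, meanlog = 5, sdlog = 0.1),
  C = rlnorm(n_samples, meanlog = 5, sdlog = 0.1),
  D = rlnorm(n_samples, meanlog = 5, sdlog = 0.1),
  E = rlnorm(n_samples, meanlog = 5, sdlog = 0.1)
)

# Closure: rescale each sample to proportions, as relative abundance data are.
rel_abund <- abs_abund / rowSums(abs_abund)

# Correlation of A with every other taxon, on each scale.
others <- c("B", "C", "D", "E")
cor_abs <- cor(abs_abund)["A", others]
cor_rel <- cor(rel_abund)["A", others]
print(round(rbind(absolute = cor_abs, relative = cor_rel), 2))
```

On the absolute scale the correlations with A are close to zero by construction, while on the relative scale they are all negative, purely because the proportions must sum to one.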

Included are some plots I've generated for the same dataset using various ordination methods; they seem relatively consistent in the pattern they show. I am therefore curious whether the dependence structure of this data type actually causes issues with model-based ordinations, or whether most of the spurious correlations among species "cancel each other out"; I'm not entirely sure how this works out mathematically. This is just a single case example, but I've seen it in some of my other datasets too. For background, these are 16S rRNA microbial data from samples taken along two rivers in Florida, Spring Creek and the Imperial River, on three dates. Orange and black points are from the "dry season" in both rivers, and blue points are from the wet season in one river. Samples from the same date were collected at roughly evenly spaced sampling points along the rivers, from downstream to upstream. The physicochemical dataset showed a marked salinity gradient, particularly in the dry season samples, and this gradient appears along at least one axis of each ordination.

NMDS with relative abundance data with "Bray-Curtis" (really percentage difference):
library(vegan)
filt.bray.ra = vegdist(filt.ra,
                       method = "bray")
set.seed(1)
NMDS.filt.ra = metaMDS(filt.bray.ra,
                       k = 2,
                       autotransform = FALSE)
# Stress = 0.071 (obviously this depends on sample size, but it is sometimes useful)

PCoA with relative abundance data with "Bray-Curtis" (really percentage difference):

Distance-based redundancy analysis (DB-RDA) with relative abundance data with "Bray-Curtis" (really percentage difference) and four constraining environmental variables: Nitrate+Nitrite, electrical conductivity, total organic nitrogen, and sucralose:
final.db.rda = vegan::dbrda(filt.ra ~ EC + NOx + TON + Sucr,
                            data = impute.sim.meta$ximp,
                            distance = "bray")

Unconstrained GLLVM with random intercepts using the negative binomial distribution and sequencing depth of each sample as an offset:
Mean-variance plot

library(gllvm)
set.seed(1)
fit.gllvm.genera = gllvm(y = filt.counts,
                         family = "negative.binomial",
                         offset = log(filt.meta$library_size),
                         row.eff = "random",
                         num.lv = 2,
                         sd.errors = FALSE)

Copula using the negative binomial distribution and sequencing depth of each sample as an offset:
library(mvabund)
set.seed(1)
my.glms = manyglm(my.mvabund ~ 1 + offset(log(library_size)),
                  data = sim.meta,
                  family = "negative.binomial")
my.copula = ecoCopula::cord(my.glms)


BertvanderVeen · 2023-06-08T17:15:51Z

BertvanderVeen
Jun 8, 2023
Collaborator

Thanks Mike! Since this is not an issue, but a discussion, I will convert it to that (see discussions tab). I will have a more thorough read of your post tomorrow!

1 reply

mike-kratz Jun 8, 2023
Author

@BertvanderVeen Oh that makes sense, thank you for your help!

dwarton · 2023-06-08T23:23:46Z

dwarton
Jun 8, 2023

I don't know how to give any more definitive an answer than I already have, I feel like this is going around in circles. Maybe you missed a previous post on this? gllvms and copulas, with a row effect in the model, are fine for compositional data. More than fine - I see adding a row effect to a count dataset (with a log-link) as the appropriate way to handle compositional counts. Doing this conditions on library size hence all other terms in the mean model can be interpreted as describing effects on relative abundance not total abundance, hence are modelling compositional effects.
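The conditioning argument can be sketched in symbols. The notation here is assumed for illustration and is not taken from the gllvm documentation: $\mu_{ij}$ is the mean count of taxon $j$ in sample $i$, $\alpha_i$ is the row effect, $\beta_{0j}$ is a taxon intercept, and $\boldsymbol{z}_i^\top \boldsymbol{\gamma}_j$ is the latent-variable term. With a log-link,

$$
\log \mu_{ij} = \alpha_i + \beta_{0j} + \boldsymbol{z}_i^\top \boldsymbol{\gamma}_j ,
$$

so the expected relative abundance of taxon $j$ in sample $i$ is

$$
\frac{\mu_{ij}}{\sum_k \mu_{ik}}
= \frac{\exp(\alpha_i)\,\exp(\beta_{0j} + \boldsymbol{z}_i^\top \boldsymbol{\gamma}_j)}{\exp(\alpha_i)\sum_k \exp(\beta_{0k} + \boldsymbol{z}_i^\top \boldsymbol{\gamma}_k)}
= \frac{\exp(\beta_{0j} + \boldsymbol{z}_i^\top \boldsymbol{\gamma}_j)}{\sum_k \exp(\beta_{0k} + \boldsymbol{z}_i^\top \boldsymbol{\gamma}_k)} .
$$

The row effect $\alpha_i$ cancels, which is the sense in which it absorbs sample-level sequencing depth and leaves the remaining terms to describe composition.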

I think the confusion comes from a sub-literature that talks about composition as inducing negative correlation (as mentioned in the above post) if we model proportions or other transformations of the data to a compositional scale. Sure, it does. But we don't model the proportions in a count model, we model the raw counts (that have no sum-to-one constraint), and include terms (row effects in this case) in the model to control for library size. So the negative correlation is pretty much being handled by terms in the mean model, the row effects, rather than by distributional assumptions on the response. But note that these methods assume a (reduced rank) unstructured covariance matrix across responses so correlation across responses was never going to be an issue for them anyway. Negative correlation across responses is something that could only become a problem if you ignored it and did separate univariate analyses.

Regarding the code and results you showed - as you say all methods seemed to be able to separate the different samples as expected (?). I didn't see any qualitative differences in results, except maybe better separation of Imperial from Spring in model-based ordinations.

2 replies

dwarton Jun 8, 2023

Oh, also, I wouldn't bother with an offset for library size in the gllvm if you already have row effects in there, because the row effects capture the library size effect (that is their purpose).
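A quick way to check this redundancy is sketched below. It deliberately swaps in a simpler stand-in for gllvm: a Poisson GLM from base R with fixed row effects, fitted to simulated counts (all object names are hypothetical). With fixed row effects the offset is absorbed exactly; with random row effects, as in the gllvm call above, the equivalence should be expected to hold only approximately.

```r
set.seed(1)
n_samples <- 12
n_taxa <- 6

# Simulated counts: a sample-level depth factor times a taxon-level mean.
depth <- runif(n_samples, min = 0.5, max = 5)
taxon_mean <- c(5, 10, 20, 40, 80, 160)
counts <- matrix(rpois(n_samples * n_taxa, lambda = outer(depth, taxon_mean)),
                 nrow = n_samples, ncol = n_taxa)

# Long format: one row per (sample, taxon) cell, in column-major order.
long <- data.frame(
  count  = as.vector(counts),
  sample = factor(rep(seq_len(n_samples), times = n_taxa)),
  taxon  = factor(rep(seq_len(n_taxa), each = n_samples))
)
long$library_size <- rowSums(counts)[as.integer(long$sample)]

# Fixed row effects only, versus fixed row effects plus a library size offset.
fit_row <- glm(count ~ sample + taxon, family = poisson, data = long)
fit_row_offset <- glm(count ~ sample + taxon + offset(log(library_size)),
                      family = poisson, data = long)

# The offset is constant within a sample, so it only shifts the sample terms.
taxon_terms <- grep("^taxon", names(coef(fit_row)), value = TRUE)
print(all.equal(fitted(fit_row), fitted(fit_row_offset), tolerance = 1e-4))
print(round(cbind(row_only = coef(fit_row)[taxon_terms],
                  row_plus_offset = coef(fit_row_offset)[taxon_terms]), 4))
```

The fitted values and taxon coefficients agree between the two fits; only the sample (row) coefficients differ, by the amount the offset would otherwise have supplied.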

mike-kratz Jun 9, 2023
Author

Hi @dwarton, thank you for your helpful reply! I'm sorry this was a redundant question. I had checked the other composition discussion on this GitHub page last week and didn't realize until now that people were actively asking questions in there again, so I'm sorry you had to repeat what you said in that discussion. I'm glad that the correlation structure induced by the row effect doesn't lead to issues in the ordination output. It makes sense that the negative correlation effect is only a problem for univariate rather than multivariate analyses. Yes, the ordinations all showed what was expected (although it's an exploratory study), and the gllvm typically led to the most separation. I will switch to just using row effects instead of including the offset; thank you for the advice.

I look forward to teaching my other lab members about using gllvm for their research now that my concerns about using it are gone!

hrlai · 2023-06-08T23:35:48Z

hrlai
Jun 8, 2023

@dwarton thanks for your explanation. While we're on this, I have a related question re: Dirichlet-multinomial. Say we fit a Dirichlet-multinomial to model relative abundances instead. Would you still include the row effects? I wasn't sure about this, because the trial size and/or the sum-to-one constraint of the Dirichlet-multinomial could already be controlling for library size / sampling effort. Thanks.

3 replies

dwarton Jun 8, 2023

No, I wouldn't use a row effect then, because the multinomial has already conditioned on trial size (which is what the row effect is intended for). Note that the multinomial with trial size N assumes independence of each of the N events being counted, which I'm not sure would be a good plan here.
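A standard result makes the link between conditioning on trial size and a count model explicit. As an illustration only, assume independent Poisson counts $y_{ij}$ with means $\mu_{ij}$ (the Poisson and the symbols are assumptions made here; the models in this thread use the negative binomial). Conditioning on the row total $N_i = \sum_j y_{ij}$ gives

$$
(y_{i1}, \dots, y_{ip}) \mid N_i \sim \text{Multinomial}(N_i;\ \pi_{i1}, \dots, \pi_{ip}),
\qquad
\pi_{ij} = \frac{\mu_{ij}}{\sum_k \mu_{ik}} .
$$

Any factor common to a whole row of means cancels in $\pi_{ij}$, so once the total $N_i$ is conditioned on there is nothing left for a row effect to estimate. The independence of the $N_i$ individual events is inherited from the Poisson assumption, which is the part that may be questionable for sequencing reads.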

hrlai Jun 9, 2023

Thanks @dwarton . Would you be able to explain how the counts are "coupled" / non-independent in, say, the negative binomial case? My best guess is that across columns the counts are only linked by stuff in the mean model (i.e., species/column intercepts and latent variables when included), but the negative binomial distribution does not induce non-independence across columns? If so, although the multinomial assumes independence among N events, can we still join them together using column/species effects? Sorry if I missed anything...

dwarton Jun 9, 2023

I guess there are two possible places where dependence can arise: dependence of counts across columns, and dependence of events within a count. For events within a count, a Poisson or multinomial assumes these events are independent, whereas a negative binomial makes some allowance for clustering. For dependence of counts across columns, yes, you are right that the mean model is where this is done. Specifically, a gllvm/factor analytic copula model uses latent variables to induce correlation. Marginally we then have a multivariate lognormal-negative binomial distribution (if using normally distributed factor scores), with the factor analytic structure inducing correlation but also some additional overdispersion. Presumably it would also be possible to use a Dirichlet structure in a similar way, depending on how it was implemented.
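The two effects of the latent variable, correlation across columns and extra overdispersion, can be seen in a short simulation. This is a sketch with hypothetical loadings, intercept and dispersion, using one normally distributed latent variable and conditionally independent negative binomial counts; it does not use gllvm itself.

```r
set.seed(1)
n_samples <- 5000
size <- 2          # negative binomial dispersion parameter (hypothetical)
intercept <- 2     # common taxon intercept on the log scale (hypothetical)
loadings <- c(taxon1 = 1, taxon2 = -1, taxon3 = 0.8)

# One latent variable per sample; it enters the mean model through the loadings.
z <- rnorm(n_samples)
mu <- exp(intercept + outer(z, loadings))

# Given z, the counts are drawn independently across taxa.
counts <- matrix(rnbinom(length(mu), size = size, mu = mu),
                 nrow = n_samples,
                 dimnames = list(NULL, names(loadings)))

# Marginally (z integrated out), the columns are correlated.
count_cor <- cor(counts)
print(round(count_cor, 2))

# Marginal variance versus the variance of a plain negative binomial
# with the same mean and dispersion.
m1 <- mean(counts[, "taxon1"])
var_observed <- var(counts[, "taxon1"])
var_plain_nb <- m1 + m1^2 / size
print(c(observed = var_observed, plain_nb = var_plain_nb))
```

Taxa whose loadings share a sign come out positively correlated and taxa with opposite signs negatively correlated, and the observed variance exceeds what a negative binomial with the same mean would give.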

adamgender0 · 2023-06-27T20:52:25Z

adamgender0
Jun 27, 2023

Thanks Mike!
