Use vegan in rarefy #561

TuomasBorman · 2024-05-28T15:25:06Z

Should we switch using vegan::rrafefy (it is fast since it is programmed with C)? One downside is that it does not support replace. Moreover, one thing to discuss (as I am not the expert of the terminology), could it make more sense to do iterations? As I have understood, the criticism towards rafefying is mostly related to bias and missing data caused by rarefaction single time. However, if enough iterations is done, that would solve the issue.

@antagomir

The text was updated successfully, but these errors were encountered:

antagomir · 2024-05-28T16:16:07Z

vegan::rrarefy : missing support for replace is not good (rarefaction is often done with replacement) but perhaps not so critical; our case does seem prohibitively slow and subsampling is a simple operation so I am not sure if this is needed. Fast is good but it would help to know how much faster, if there is real advantage (in this or other regards).
iterations: this was validated by Pat Schloss in the context of alpha and beta diversity calculation over several iterations and those PRs are already open. I am not familiar with supporting iterations to just get rarefied data. The iterations would not solve the issue that the samples with more reads will have a higher resolution for that information even after iterations. I would refrain from implementing this until there is reference or experiments demonstrating benefits in a general case.

antagomir · 2024-08-05T20:32:04Z

Action point: compare the standard rarefaction and rrarefy. If there are significant gains in speed with essentially similar outcomes then consider switching.

TuomasBorman · 2024-08-05T20:48:28Z

Maybe @Daenarys8 when time allows?

Daenarys8 · 2024-08-07T08:31:42Z

Here is some update:

>library(mia)
>library(vegan)
>library(microbenchmark)
> data("GlobalPatterns")
> tse <- GlobalPatterns
> benchmark <- microbenchmark(
     vegan_run = rrarefy(assay(tse, "counts"), 1000),
     mia_run = rarefyAssay(tse, 1000, assay.type = "counts"),
     times = 100
 )
> benchmark
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
 vegan_run 205.8761 232.9813 256.7301 247.6536 262.2308 1071.452   100  a 
   mia_run 699.3646 754.3515 844.8951 806.7249 841.4128 1639.997   100   b

> vegan_run_1 <- rrarefy(assay(tse, "counts"), 1000)
> dim(vegan_run_1)
[1] 19216    26
> mia_run_1 <- rarefyAssay(tse, 1000, assay.type = "counts")
> dim(assay(mia_run_1, "subsampled"))
[1] 12630    26

Using 100 evaluations, the benchmark indicates that vegan::rrarefy is at least 3x faster that mia::rarefyAssay. The latter removes samples for each run e.g 6479 features removed because they are not present in all samples after subsampling. which could reduce the quality of data though I'm not certain to what extend this applies to our sub-sampling needs.

antagomir · 2024-08-07T08:50:00Z

Great!

Some points:

It would be useful to switch to rrarefy (or another fast rarification method, if such are available?)
However, one question was that rrarefy does sampling without replacement and this potential shortcoming (I think current mia implementation does with replacement); could you additionally compare the resulting abundance profiles (e.g. just scatterplot + correlation for some randomly selected values from the matrices), and/or alpha diversity measures calculated after each version? It would be helpful to see how well the two versions will match
Regardless of the specific implementation, it would be imo useful to make sure that the implementation returns assay that has the same dimensions than the original assay. Even if some rows are empty. We could then have a separate argument for dropping empty rows or something but changing the assay dimensions can be problematic as default behavior.

Daenarys8 · 2024-08-08T09:12:04Z

For reference this is original count alpha diversity index:

antagomir · 2024-08-08T09:43:17Z

Instead of showing comparable figures, could just directly compare the Shannon indices generated with different methods? That will much more directly show if there are actual differences.

Daenarys8 · 2024-08-08T10:54:19Z

here is the comparison. I am not sure if this is the comparison you meant, but I can provide more otherwise.

  mia_shan_index vegan_shan_index
1        6.530872         7.434587
2        6.740669         7.666198
3        6.455570         7.083545
4        3.804205         5.942800
5        3.276185         5.725215
6        4.271874         6.230349
7        4.836306         6.521300
8        4.870576         6.122613
9        2.671566         5.235301
10       3.880199         5.717510
11       3.073093         5.969551
12       3.626456         5.786333
13       3.520547         7.296105
14       3.362396         7.358466
15       4.005807         7.333662
16       4.212748         5.610023
17       4.474838         6.312663
18       4.552245         6.069222
19       6.129420         6.473994
20       4.834957         6.597324
21       5.429571         6.532632
22       4.104690         5.895798
23       3.438804         5.714199
24       4.069417         6.112186
25       3.946154         5.949208
26       3.986965         5.841633

antagomir · 2024-08-08T12:59:33Z

Great!

Unless I am mistaken, with mia you can do rarefaction both with and without replacement. These are expected to differ a bit. Is it possible to show both of those mia rarefaction versions in the comparisons?

Daenarys8 · 2024-08-08T16:18:43Z

data
  mia_shan_index mia_shan_index_no_replace vegan_shan_index
1        6.530872                  6.522523         7.434587
2        6.740669                  6.725551         7.666198
3        6.455570                  6.462569         7.083545
4        3.804205                  3.810478         5.942800
5        3.276185                  3.279123         5.725215
6        4.271874                  4.269514         6.230349
7        4.836306                  4.812235         6.521300
8        4.870576                  4.852545         6.122613
9        2.671566                  2.658947         5.235301
10       3.880199                  3.904216         5.717510
11       3.073093                  3.074214         5.969551
12       3.626456                  3.630293         5.786333
13       3.520547                  3.521134         7.296105
14       3.362396                  3.327065         7.358466
15       4.005807                  3.981118         7.333662
16       4.212748                  4.202858         5.610023
17       4.474838                  4.462008         6.312663
18       4.552245                  4.547767         6.069222
19       6.129420                  6.157462         6.473994
20       4.834957                  4.845896         6.597324
21       5.429571                  5.429087         6.532632
22       4.104690                  4.123554         5.895798
23       3.438804                  3.450313         5.714199
24       4.069417                  4.058023         6.112186
25       3.946154                  3.948297         5.949208
26       3.986965                  3.995497         5.841633
>

Daenarys8 · 2024-08-08T16:19:09Z

antagomir · 2024-08-09T08:36:32Z

Hmm interesting. W/o replacement seems to generate similar outcomes with mia. Could you show the code, it would be more relevant than showing the numbers (the figures are enough for that now)

To check: what other differences there are besides w/o replacement?

Daenarys8 · 2024-08-09T09:14:58Z

Hmm interesting. W/o replacement seems to generate similar outcomes with mia. Could you show the code, it would be more relevant than showing the numbers (the figures are enough for that now)

To check: what other differences there are besides w/o replacement?

library(mia)
library(vegan)

# load data
data("GlobalPatterns")
tse <- GlobalPatterns

# vegan::rrarefy
vegan_mat <- rrarefy(assay(tse, assay.type = "counts"), 1000)
vegan_run <- tse
assay(vegan_run, "vegan_mat") <- vegan_mat

# mia::rarefyAssay no_replace
mia_run_no_replace <- rarefyAssay(tse, assay.type = "counts", name = "mia_mat_no_replace", 1000, replace = FALSE)

# mia::rarefyAssay replace
mia_run <- rarefyAssay(tse, assay.type = "counts", name = "mia_mat", 1000)

# Calculate alpha-diversity of the matrices
    mia_run_no_replace <- .estimate_diversity(mia_run_no_replace, assay.type = "mia_mat_no_replace", index = 
                                       "shannon", name = "mia_shannon_no_replace")
    mia_run <- .estimate_diversity(mia_run, assay.type = "mia_mat", index = 
                                       "shannon", name = "mia_shannon")
    vegan_run <- .estimate_diversity(vegan_run, assay.type = "vegan_mat", index = 
                                         "shannon", name = "vegan_shannon")

mia_shan_index_no_replace <- colData(mia_run_no_replace)[, "mia_shannon_no_replace"]
mia_shan_index <- colData(mia_run)[, "mia_shannon"]
vegan_shan_index <- colData(vegan_run)[, "vegan_shannon"]

data <- data.frame(mia_shan_index, vegan_shan_index, mia_shan_index_no_replace )
pairs(data)

antagomir · 2024-08-09T13:18:02Z

Hmm.. .estimate_diversity is an internal function in mia and we are not supposed to call that (it is not exported).

Can we directly run addAlpha() function for the rarified assays?
However, addAlpha() also supports rarefaction directly (see argument "niter" https://microbiome.github.io/mia/reference/addAlpha.html ).

I didnt check if the user can pass the arguments to do this both with and without replacement but ideally that should be possible. Can we include in the comparisons both (1) and (2)? In addition to vegan rrarefy.

Could you check if there any explanation for the differing results between mia and vegan could be found?

Daenarys8 · 2024-08-10T07:44:25Z

Yes we can directly run addAlpha()
Yes, user can pass replace param to rarefyAssay

....
mia_run <- tse
mia_run_no_replace <- tse
vegan_run <- tse

# vegan::rrarefy
vegan_mat <- rrarefy(assay(vegan_run, assay.type = "counts"), 1000)
assay(vegan_run, "vegan_mat") <- vegan_mat

# mia::rarefyAssay no-replace
mia_run_no_replace <- rarefyAssay(mia_run_no_replace, assay.type = "counts", name = "mia_mat_no_replace", 1000, replace = FALSE)

# mia::rarefyAssay replace
mia_run <- rarefyAssay(mia_run, assay.type = "counts", name = "mia_mat", 1000)

# addAlpha
vegan_run <- addAlpha(vegan_run, index = "shannon", assay.type = "vegan_mat")
mia_run<- addAlpha(mia_run, index = "shannon", assay.type = "mia_mat")
mia_run_no_replace <- addAlpha(mia_run_no_replace, index = "shannon", assay.type = "mia_mat_no_replace")


mia_alpha_rarefy <- tse
mia_alpha_rarefy_no_replace <- tse

# addAlpha rarefaction
mia_alpha_rarefy_no_replace <- addAlpha(mia_alpha_rarefy_no_replace, index = "shannon", niter = 1, sample=1000, replace=FALSE)
mia_alpha_rarefy <- addAlpha(mia_alpha_rarefy, index = "shannon", niter = 1, sample=1000)

data <- data.frame(vegan_run$shannon, mia_run$shannon, mia_run_no_replace$shannon, mia_alpha_rarefy$shannon, mia_alpha_rarefy_no_replace$shannon)
pairs(data)

A possible explanation for this differences could be;

in vegan, rowSums less than the intended sample are not not rarefied.
in mia, features not present in any sample after sampling are dropped.

antagomir · 2024-08-10T08:04:35Z

Great.

When using mia and alpha diversity calculations, let's stick to using addAlpha with rarefaction arguments, whenever we can. This is more straightfwd. The rarefyAssay function is available for more general purposes as needed.

Regarding the explanations:

Samples that have rowSums less than intended sample should be dropped out by default, I expect that at least mia does this (and should throw a warning in such case). But do we have any cases like this in our current example above, or is this a more theoretical issue?
if features not present in any sample after sampling are dropped by default in mia, is this something the user can modify? Can we run mia without dropping those samples and then compare with vegan?

Daenarys8 · 2024-09-17T11:04:58Z

vegan and mia each handle randomness differently, leading to variations in their results. In vegan, the process involves generating a random index and then decrementing the count of the species associated with that index. This approach selects items sequentially from the counts available at each step.. This results in a rarefied sample where counts are adjusted directly according to the random draws and decrementing iteration.

On the other hand, mia constructs a vector where indices are repeated according to their counts and then uses the base sample function to draw with(using probability weights) or without replacement. Although the sampling for the diversity calculation above is done without replacement( also features are not dropped) in both methods, the way mia builds its sample pool by expanding the vector can lead to different random results.

if vegan is a benchmark, we can consider wrapping the rrarefy (which is much faster with the c implementation of do_rrarefy)

antagomir · 2024-11-17T21:02:51Z

Sounds good to use rrarefy, in principle.

Some remarks:

the results from mia using with and without replacement seem identical. Is this the case? If yes, does it work as it should (I am expecting to see differences)
we could enable both with and without replacement; for the latter one we could use vegan::rrarefy if this is straightfwd to switch (no major rewriting). That could be made a default (without changing output dimension). But i would keep the option of with replacement, too. Unit tests should be included to make sure all goes as expected.

Could we do & complete this PR?

TuomasBorman · 2024-11-18T07:06:59Z

Sounds feasible, however, it must be clearly stated which function is utilized

antagomir mentioned this issue Aug 10, 2024

New dissimilarity functions #609

Merged

TuomasBorman linked a pull request Jan 8, 2025 that will close this issue

Utilize vegan::rrarefy #670

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use vegan in rarefy #561

Use vegan in rarefy #561

TuomasBorman commented May 28, 2024

antagomir commented May 28, 2024

antagomir commented Aug 5, 2024

TuomasBorman commented Aug 5, 2024

Daenarys8 commented Aug 7, 2024

antagomir commented Aug 7, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 9, 2024

Daenarys8 commented Aug 9, 2024 •

edited

Loading

antagomir commented Aug 9, 2024

Daenarys8 commented Aug 10, 2024

antagomir commented Aug 10, 2024

Daenarys8 commented Sep 17, 2024

antagomir commented Nov 17, 2024

TuomasBorman commented Nov 18, 2024

Use vegan in rarefy #561

Use vegan in rarefy #561

Comments

TuomasBorman commented May 28, 2024

antagomir commented May 28, 2024

antagomir commented Aug 5, 2024

TuomasBorman commented Aug 5, 2024

Daenarys8 commented Aug 7, 2024

antagomir commented Aug 7, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

Daenarys8 commented Aug 8, 2024

antagomir commented Aug 9, 2024

Daenarys8 commented Aug 9, 2024 • edited Loading

antagomir commented Aug 9, 2024

Daenarys8 commented Aug 10, 2024

antagomir commented Aug 10, 2024

Daenarys8 commented Sep 17, 2024

antagomir commented Nov 17, 2024

TuomasBorman commented Nov 18, 2024

Daenarys8 commented Aug 9, 2024 •

edited

Loading