Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use vegan in rarefy #561

Open
TuomasBorman opened this issue May 28, 2024 · 19 comments · May be fixed by #670
Open

Use vegan in rarefy #561

TuomasBorman opened this issue May 28, 2024 · 19 comments · May be fixed by #670

Comments

@TuomasBorman
Copy link
Contributor

Should we switch using vegan::rrafefy (it is fast since it is programmed with C)? One downside is that it does not support replace. Moreover, one thing to discuss (as I am not the expert of the terminology), could it make more sense to do iterations? As I have understood, the criticism towards rafefying is mostly related to bias and missing data caused by rarefaction single time. However, if enough iterations is done, that would solve the issue.

@antagomir

@antagomir
Copy link
Member

  1. vegan::rrarefy : missing support for replace is not good (rarefaction is often done with replacement) but perhaps not so critical; our case does seem prohibitively slow and subsampling is a simple operation so I am not sure if this is needed. Fast is good but it would help to know how much faster, if there is real advantage (in this or other regards).
  2. iterations: this was validated by Pat Schloss in the context of alpha and beta diversity calculation over several iterations and those PRs are already open. I am not familiar with supporting iterations to just get rarefied data. The iterations would not solve the issue that the samples with more reads will have a higher resolution for that information even after iterations. I would refrain from implementing this until there is reference or experiments demonstrating benefits in a general case.

@antagomir
Copy link
Member

Action point: compare the standard rarefaction and rrarefy. If there are significant gains in speed with essentially similar outcomes then consider switching.

@TuomasBorman
Copy link
Contributor Author

Maybe @Daenarys8 when time allows?

@Daenarys8
Copy link
Collaborator

Here is some update:

>library(mia)
>library(vegan)
>library(microbenchmark)
> data("GlobalPatterns")
> tse <- GlobalPatterns
> benchmark <- microbenchmark(
     vegan_run = rrarefy(assay(tse, "counts"), 1000),
     mia_run = rarefyAssay(tse, 1000, assay.type = "counts"),
     times = 100
 )
> benchmark
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
 vegan_run 205.8761 232.9813 256.7301 247.6536 262.2308 1071.452   100  a 
   mia_run 699.3646 754.3515 844.8951 806.7249 841.4128 1639.997   100   b

> vegan_run_1 <- rrarefy(assay(tse, "counts"), 1000)
> dim(vegan_run_1)
[1] 19216    26
> mia_run_1 <- rarefyAssay(tse, 1000, assay.type = "counts")
> dim(assay(mia_run_1, "subsampled"))
[1] 12630    26

Using 100 evaluations, the benchmark indicates that vegan::rrarefy is at least 3x faster that mia::rarefyAssay. The latter removes samples for each run e.g 6479 features removed because they are not present in all samples after subsampling. which could reduce the quality of data though I'm not certain to what extend this applies to our sub-sampling needs.

@antagomir
Copy link
Member

Great!

Some points:

  1. It would be useful to switch to rrarefy (or another fast rarification method, if such are available?)
  2. However, one question was that rrarefy does sampling without replacement and this potential shortcoming (I think current mia implementation does with replacement); could you additionally compare the resulting abundance profiles (e.g. just scatterplot + correlation for some randomly selected values from the matrices), and/or alpha diversity measures calculated after each version? It would be helpful to see how well the two versions will match
  3. Regardless of the specific implementation, it would be imo useful to make sure that the implementation returns assay that has the same dimensions than the original assay. Even if some rows are empty. We could then have a separate argument for dropping empty rows or something but changing the assay dimensions can be problematic as default behavior.

@Daenarys8
Copy link
Collaborator

For reference this is original count alpha diversity index:
count_shannon

@antagomir
Copy link
Member

Instead of showing comparable figures, could just directly compare the Shannon indices generated with different methods? That will much more directly show if there are actual differences.

@Daenarys8
Copy link
Collaborator

here is the comparison. I am not sure if this is the comparison you meant, but I can provide more otherwise.

pairs

  mia_shan_index vegan_shan_index
1        6.530872         7.434587
2        6.740669         7.666198
3        6.455570         7.083545
4        3.804205         5.942800
5        3.276185         5.725215
6        4.271874         6.230349
7        4.836306         6.521300
8        4.870576         6.122613
9        2.671566         5.235301
10       3.880199         5.717510
11       3.073093         5.969551
12       3.626456         5.786333
13       3.520547         7.296105
14       3.362396         7.358466
15       4.005807         7.333662
16       4.212748         5.610023
17       4.474838         6.312663
18       4.552245         6.069222
19       6.129420         6.473994
20       4.834957         6.597324
21       5.429571         6.532632
22       4.104690         5.895798
23       3.438804         5.714199
24       4.069417         6.112186
25       3.946154         5.949208
26       3.986965         5.841633

@antagomir
Copy link
Member

Great!

Unless I am mistaken, with mia you can do rarefaction both with and without replacement. These are expected to differ a bit. Is it possible to show both of those mia rarefaction versions in the comparisons?

@Daenarys8
Copy link
Collaborator

data
  mia_shan_index mia_shan_index_no_replace vegan_shan_index
1        6.530872                  6.522523         7.434587
2        6.740669                  6.725551         7.666198
3        6.455570                  6.462569         7.083545
4        3.804205                  3.810478         5.942800
5        3.276185                  3.279123         5.725215
6        4.271874                  4.269514         6.230349
7        4.836306                  4.812235         6.521300
8        4.870576                  4.852545         6.122613
9        2.671566                  2.658947         5.235301
10       3.880199                  3.904216         5.717510
11       3.073093                  3.074214         5.969551
12       3.626456                  3.630293         5.786333
13       3.520547                  3.521134         7.296105
14       3.362396                  3.327065         7.358466
15       4.005807                  3.981118         7.333662
16       4.212748                  4.202858         5.610023
17       4.474838                  4.462008         6.312663
18       4.552245                  4.547767         6.069222
19       6.129420                  6.157462         6.473994
20       4.834957                  4.845896         6.597324
21       5.429571                  5.429087         6.532632
22       4.104690                  4.123554         5.895798
23       3.438804                  3.450313         5.714199
24       4.069417                  4.058023         6.112186
25       3.946154                  3.948297         5.949208
26       3.986965                  3.995497         5.841633
> 

@Daenarys8
Copy link
Collaborator

shan_index

@antagomir
Copy link
Member

Hmm interesting. W/o replacement seems to generate similar outcomes with mia. Could you show the code, it would be more relevant than showing the numbers (the figures are enough for that now)

To check: what other differences there are besides w/o replacement?

@Daenarys8
Copy link
Collaborator

Daenarys8 commented Aug 9, 2024

Hmm interesting. W/o replacement seems to generate similar outcomes with mia. Could you show the code, it would be more relevant than showing the numbers (the figures are enough for that now)

To check: what other differences there are besides w/o replacement?

library(mia)
library(vegan)

# load data
data("GlobalPatterns")
tse <- GlobalPatterns

# vegan::rrarefy
vegan_mat <- rrarefy(assay(tse, assay.type = "counts"), 1000)
vegan_run <- tse
assay(vegan_run, "vegan_mat") <- vegan_mat

# mia::rarefyAssay no_replace
mia_run_no_replace <- rarefyAssay(tse, assay.type = "counts", name = "mia_mat_no_replace", 1000, replace = FALSE)

# mia::rarefyAssay replace
mia_run <- rarefyAssay(tse, assay.type = "counts", name = "mia_mat", 1000)

# Calculate alpha-diversity of the matrices
    mia_run_no_replace <- .estimate_diversity(mia_run_no_replace, assay.type = "mia_mat_no_replace", index = 
                                       "shannon", name = "mia_shannon_no_replace")
    mia_run <- .estimate_diversity(mia_run, assay.type = "mia_mat", index = 
                                       "shannon", name = "mia_shannon")
    vegan_run <- .estimate_diversity(vegan_run, assay.type = "vegan_mat", index = 
                                         "shannon", name = "vegan_shannon")

mia_shan_index_no_replace <- colData(mia_run_no_replace)[, "mia_shannon_no_replace"]
mia_shan_index <- colData(mia_run)[, "mia_shannon"]
vegan_shan_index <- colData(vegan_run)[, "vegan_shannon"]

data <- data.frame(mia_shan_index, vegan_shan_index, mia_shan_index_no_replace )
pairs(data)

@antagomir
Copy link
Member

Hmm.. .estimate_diversity is an internal function in mia and we are not supposed to call that (it is not exported).

  1. Can we directly run addAlpha() function for the rarified assays?

  2. However, addAlpha() also supports rarefaction directly (see argument "niter" https://microbiome.github.io/mia/reference/addAlpha.html ).

I didnt check if the user can pass the arguments to do this both with and without replacement but ideally that should be possible. Can we include in the comparisons both (1) and (2)? In addition to vegan rrarefy.

Could you check if there any explanation for the differing results between mia and vegan could be found?

@Daenarys8
Copy link
Collaborator

  1. Yes we can directly run addAlpha()
  2. Yes, user can pass replace param to rarefyAssay
....
mia_run <- tse
mia_run_no_replace <- tse
vegan_run <- tse

# vegan::rrarefy
vegan_mat <- rrarefy(assay(vegan_run, assay.type = "counts"), 1000)
assay(vegan_run, "vegan_mat") <- vegan_mat

# mia::rarefyAssay no-replace
mia_run_no_replace <- rarefyAssay(mia_run_no_replace, assay.type = "counts", name = "mia_mat_no_replace", 1000, replace = FALSE)

# mia::rarefyAssay replace
mia_run <- rarefyAssay(mia_run, assay.type = "counts", name = "mia_mat", 1000)

# addAlpha
vegan_run <- addAlpha(vegan_run, index = "shannon", assay.type = "vegan_mat")
mia_run<- addAlpha(mia_run, index = "shannon", assay.type = "mia_mat")
mia_run_no_replace <- addAlpha(mia_run_no_replace, index = "shannon", assay.type = "mia_mat_no_replace")


mia_alpha_rarefy <- tse
mia_alpha_rarefy_no_replace <- tse

# addAlpha rarefaction
mia_alpha_rarefy_no_replace <- addAlpha(mia_alpha_rarefy_no_replace, index = "shannon", niter = 1, sample=1000, replace=FALSE)
mia_alpha_rarefy <- addAlpha(mia_alpha_rarefy, index = "shannon", niter = 1, sample=1000)

data <- data.frame(vegan_run$shannon, mia_run$shannon, mia_run_no_replace$shannon, mia_alpha_rarefy$shannon, mia_alpha_rarefy_no_replace$shannon)
pairs(data)

plot

A possible explanation for this differences could be;

  • in vegan, rowSums less than the intended sample are not not rarefied.
  • in mia, features not present in any sample after sampling are dropped.

@antagomir
Copy link
Member

Great.

When using mia and alpha diversity calculations, let's stick to using addAlpha with rarefaction arguments, whenever we can. This is more straightfwd. The rarefyAssay function is available for more general purposes as needed.

Regarding the explanations:

  • Samples that have rowSums less than intended sample should be dropped out by default, I expect that at least mia does this (and should throw a warning in such case). But do we have any cases like this in our current example above, or is this a more theoretical issue?
  • if features not present in any sample after sampling are dropped by default in mia, is this something the user can modify? Can we run mia without dropping those samples and then compare with vegan?

@Daenarys8
Copy link
Collaborator

Rplot

vegan and mia each handle randomness differently, leading to variations in their results. In vegan, the process involves generating a random index and then decrementing the count of the species associated with that index. This approach selects items sequentially from the counts available at each step.. This results in a rarefied sample where counts are adjusted directly according to the random draws and decrementing iteration.

On the other hand, mia constructs a vector where indices are repeated according to their counts and then uses the base sample function to draw with(using probability weights) or without replacement. Although the sampling for the diversity calculation above is done without replacement( also features are not dropped) in both methods, the way mia builds its sample pool by expanding the vector can lead to different random results.

if vegan is a benchmark, we can consider wrapping the rrarefy (which is much faster with the c implementation of do_rrarefy)

@antagomir
Copy link
Member

Sounds good to use rrarefy, in principle.

Some remarks:

  1. the results from mia using with and without replacement seem identical. Is this the case? If yes, does it work as it should (I am expecting to see differences)

  2. we could enable both with and without replacement; for the latter one we could use vegan::rrarefy if this is straightfwd to switch (no major rewriting). That could be made a default (without changing output dimension). But i would keep the option of with replacement, too. Unit tests should be included to make sure all goes as expected.

Could we do & complete this PR?

@TuomasBorman
Copy link
Contributor Author

Sounds feasible, however, it must be clearly stated which function is utilized

@TuomasBorman TuomasBorman linked a pull request Jan 8, 2025 that will close this issue
@TuomasBorman TuomasBorman linked a pull request Jan 8, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants