
massive slowdown read.genepop from 2.0.0 to 2.0.1 #114

Closed
thierrygosselin opened this issue Dec 3, 2015 · 22 comments

@thierrygosselin
Collaborator

I apologize for the long message.

I'm using the latest github version of adegenet (v.2.0.1)

  1. First problem: urgent
    adegenet functions that use df2genind (genetix, Fstat, Genepop, STRUCTURE) lost the argument 'missing' in v2.0.0. This argument was used inside df2genind (now NA.char). Consequently, they are all currently treating missing values as real data!

    This might not be a huge problem with microsatellite markers, but it is for GBS/RAD data that have intrinsically high levels of missing values...

    This also impacts loci2genind in the pegas package @emmanuelparadis and could be problematic in some new functions in hierfstat @jgx65 that use genind objects directly.

    Quick solution: my way around this problem was to use the genind constructor directly, with allele counts, instead of using df2genind.

  2. Second problem: could we make df2genind faster?

    I know most of adegenet is coded in base R. Very purist :) But some code inside df2genind is suboptimal for large GBS/RAD datasets with lots of missing data, e.g. in import.R, the loop on lines 293 to 300. That part takes forever with some of the datasets I tested (below).

  3. Downloading the files:

    # For 1000 SNP and 170 individuals in 17 pops:
    https://www.dropbox.com/s/x1c3n9hin8zr5bf/test.df2genind.gbs.data.subsample1000.tsv?dl=0
    # For 4000 SNP and 170 individuals in 17 pops: 
    https://www.dropbox.com/s/biy6y35lpm6kocv/test.df2genind.gbs.data.subsample4000.tsv?dl=0
    # For 9000 SNP and 170 individuals in 17 pops: 
    https://www.dropbox.com/s/ww9hxyngmctz4dz/test.df2genind.gbs.data.subsample9000.tsv?dl=0
  4. Details about the data: it's originally a VCF made tidy (each variable in a separate column and each observation in a row). The GT field in the VCF was converted this way:

    # might not be optimal coding for conversion, but for more than 500 individuals and 10 000 markers it takes 25s to convert...
    require(dplyr)
    require(stringi)
    require(reshape2) # needed for dcast() below
    gbs.data <- vcf %>% 
    mutate(
      REF= stri_replace_all_fixed(str = REF, pattern = c("A", "C", "G", "T"), replacement = c("1", "2", "3", "4"), vectorize_all = FALSE), # replace nucleotide with numbers
      ALT = stri_replace_all_fixed(str = ALT, pattern = c("A", "C", "G", "T"), replacement = c("1", "2", "3", "4"), vectorize_all = FALSE),# replace nucleotide with numbers
      GT = ifelse(GT == "0/0", stri_c(REF, REF, sep = "_"),
                  ifelse(GT == "1/1",  stri_c(ALT, ALT, sep = "_"),
                         ifelse(GT == "0/1", stri_c(REF, ALT, sep = "_"),
                                ifelse(GT == "1/0", stri_c(ALT, REF, sep = "_"), "0_0")
                         )
                  )
      )
    ) %>% 
    arrange(MARKERS, POP_ID) %>% 
    select(-c(REF, ALT))%>%
    dcast(INDIVIDUALS + POP_ID ~ MARKERS, value.var = "GT") %>% # make a wide format
    arrange(POP_ID, INDIVIDUALS)
  5. Importing the files

    # Here with 1000 SNP
    require(readr)
    gbs.data <- read_tsv(file = "test.df2genind.gbs.data.subsample1000.tsv", col_names = TRUE, progress = interactive())
  6. Conversion to genind using adegenet::df2genind

    require(dplyr)
    ind <- as.character(gbs.data$INDIVIDUALS)
    pop <- gbs.data$POP_ID
    genind.df <- gbs.data %>% select(-c(INDIVIDUALS, POP_ID))
    rownames(genind.df) <- ind
    loc.names <- colnames(genind.df)
    system.time(res <- adegenet::df2genind(X = genind.df, sep = "_", ncode = 2, ind.names = ind, loc.names = loc.names, pop = pop, NA.char = "0_0", ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL))
    res # works
  7. Results

    Timing goes from 21 s with 1000 SNPs (no problem) to more than 10 min with the larger datasets.
  8. My way around this is shown below. I'll use a dataset with 586 individuals and 10156 markers. It takes 30 s on my MacBook Pro to import the data in R, change the coding of the tidy VCF (0/1, ./., etc.) to allele counts and create the genind object! Similar code for multiallelic markers is also fast.

    https://www.dropbox.com/s/qkvqk647li1giuc/tidy_vcf.tsv?dl=0
    genind.prep <- read_tsv(file = "tidy_vcf.tsv", col_names = TRUE, progress = interactive()) %>%
    mutate(GT = stri_replace_all_fixed(str = GT, pattern = "/", replacement = ":", vectorize_all = FALSE)) %>% 
    mutate(GT = stri_replace_all_fixed(str = GT, pattern = c("0:0", "1:1", "0:1", "1:0", ".:."), replacement = c("2_0", "0_2", "1_1", "1_1", "NA_NA"), vectorize_all = FALSE)) %>% 
    select(-c(REF, ALT)) %>%
    tidyr::separate(col = GT, into = c("A1", "A2"), sep = "_", extra = "drop", remove = TRUE) %>%
    tidyr::gather(key = ALLELES, value = COUNT, -c(MARKERS, INDIVIDUALS, POP_ID)) %>% # make tidy
    tidyr::unite(MARKERS_ALLELES, MARKERS, ALLELES, sep = ".") %>%
    mutate(COUNT = replace(COUNT, which(COUNT == "NA"), NA)) %>% 
    group_by(POP_ID, INDIVIDUALS) %>%
    tidyr::spread(data = ., key = MARKERS_ALLELES, value = COUNT)
    # convert to genind with genind constructor directly
    ind <- as.character(genind.prep$INDIVIDUALS)
    pop <- genind.prep$POP_ID
    genind.df <- genind.prep %>%
    select(-c(INDIVIDUALS, POP_ID))
    rownames(genind.df) <- ind
    prevcall <- match.call()
    res <- adegenet::genind(tab = genind.df, pop = pop, prevcall = prevcall, ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL)

Hope this helps!
Best,
Thierry

@thibautjombart
Owner

OK, let's deal with one thing at a time. Missing data and speed issues should be in 2 different threads - one is a potential bug, the other an improvement.

We have been fixing a bunch of NA related issues over the last weeks. In the current version, you can try:

library(adegenet)
example(read.fstat)
summary(obj)

and get

> summary(obj)

// Number of individuals: 237
// Group sizes: 10 11 20 14 13 17 11 12 13 22 12 23 15 11 14 10 9
// Number of alleles per locus: 16 11 10 9 12 8 12 12 18
// Number of alleles per group: 36 46 70 52 44 61 42 40 35 53 50 67 48 56 42 54 43
// Percentage of missing data: 2.34 %
// Observed heterozygosity: 0.67 0.67 0.68 0.71 0.63 0.57 0.65 0.62 0.45
// Expected heterozygosity: 0.87 0.79 0.8 0.76 0.87 0.69 0.82 0.76 0.61
> 

I have the same thing with the other read.[...].
Can you update to the current devel and check if the NA problem is still there?
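
For reference, the devel version can be installed straight from GitHub. A minimal sketch, assuming the devtools package is available:

    # Install the development version of adegenet from GitHub (devtools assumed).
    install.packages("devtools")                  # if not already installed
    devtools::install_github("thibautjombart/adegenet")
    library(adegenet)
    packageVersion("adegenet")                    # should now report the devel version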

@thierrygosselin
Collaborator Author

Hello Thibaut, my mistake! Sorry.

So the problem with NA is in adegenet CRAN v.2.0.0 not in the devel version 2.0.1.
Sorry about that.

So what remains is an enhancement issue.
With the genepop file linked below, version 2.0.1 takes 100x more time (1578 sec) to process the file than version 2.0.0 (15 sec)... but I do see the percentage of missing values with the summary function.

https://www.dropbox.com/s/aqdsk2ibzz2uhti/genepop.8pop.gen?dl=0

Best
Thierry

@thibautjombart changed the title from "missing values problems: df2genind and adegenet functions to import genetix, Fstat, Genepop and STRUCTURE files" to "massive slowdown read.genepop from 2.0.0 to 2.0.1" on Dec 7, 2015
@thibautjombart
Owner

OK I have renamed the issue accordingly. Thanks @thierrygosselin for the report. We need to track this down before submitting the new version to CRAN.

@thibautjombart
Owner

@zkamvar @romunov do you remember making any change to df2genind / genind constructor that may affect speed? I suspect I have, but worth asking.

@romunov
Collaborator

romunov commented Dec 7, 2015

I did change the way an object is subsetted, which may affect speed, but I would have to run a few tests to make sure.

@thibautjombart
Owner

I don't think these changes (the NA replacement fix) can be responsible for the slowdown, but I can't see the subsetting change in this commit?

@romunov
Collaborator

romunov commented Dec 7, 2015

Technically, the only thing that changed is the way the subsetting vector is calculated.

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I'll check it with a profiler to see where the bottleneck is when I get into the office today.
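
A minimal profiling sketch with base R's Rprof, reusing the df2genind call from the comments above (not the exact commands used here):

    # Profile df2genind on one of the test datasets to see where the time goes.
    # genind.df, ind and pop refer to the objects built in the earlier comment.
    Rprof("df2genind.prof")
    res <- adegenet::df2genind(X = genind.df, sep = "_", ncode = 2,
                               ind.names = ind, pop = pop,
                               NA.char = "0_0", ploidy = 2, type = "codom")
    Rprof(NULL)
    head(summaryRprof("df2genind.prof")$by.self)  # top functions by self time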

@thibautjombart
Owner

Thanks!

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

@romunov, I hate to be the bearer of bad news, but I think your change may be the culprit:

[profiler output screenshot]

I believe we may be able to fix this in C. I will look into this a bit later.

@romunov
Collaborator

romunov commented Dec 7, 2015

The road to hell is paved with good intentions. :) Perhaps there's another way of doing robust subsetting.

@romunov
Collaborator

romunov commented Dec 7, 2015

Another option would be to introduce internal rownames (something that can't be coerced to numeric, e.g. "sam1", "sam2", ...) and revert to the original names just before finishing. In that case the old code would probably work.
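
A rough sketch of that mechanism with a toy matrix (all object and locus names below are illustrative, not taken from the package code):

    # Swap in synthetic rownames that cannot be coerced to numeric, do the NA
    # replacement against them, then restore the original names at the end.
    out <- matrix(1, nrow = 3, ncol = 2,
                  dimnames = list(c("1", "2", "10"), c("loc1.01", "loc1.02")))
    NA.ind <- "10"                                      # individual with a missing genotype

    orig.names <- rownames(out)
    rownames(out) <- paste0("sam", seq_len(nrow(out)))  # "sam1", "sam2", "sam3"
    NA.ind <- paste0("sam", match(NA.ind, orig.names))  # remap to the synthetic names

    out[NA.ind, ] <- NA                                 # replacement now subsets by a safe name
    rownames(out) <- orig.names                         # revert just before finishing
    out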

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

The road to hell is paved with good intentions. :) Perhaps there's another way of doing robust subsetting.

Indeed. Besides, the issue you fixed with that commit was complicated as hell, so there's no worries here. It's better to have the package slow at getting the right result than fast at getting the wrong one 😆

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I started a new branch at f8ba946 called speedup-missing-import

zkamvar referenced this issue Dec 7, 2015:
Since each sample will have a unique name, I replaced checking the rownames with simply subsetting by the name. On my machine, the 1000-locus sample goes from 27 s to 14 s.
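
A tiny, self-contained illustration of what that commit message describes (the matrix and names below are made up, not the actual diff): since each sample name is unique, a row can be indexed directly by its name instead of building a logical vector over every rowname on each iteration.

    out <- matrix(0, nrow = 3, ncol = 4,
                  dimnames = list(c("indA", "indB", "indC"), paste0("loc1.0", 1:4)))

    out[rownames(out) %in% "indB", 1:2] <- NA  # before: logical scan of all rownames
    out["indC", 3:4] <- NA                     # after: direct subsetting by the name
    out
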
@KlausVigo
Collaborator

Hi guys,
I think the problem is that rownames(out) %in% NA.ind[i] creates a logical vector, and something like which(... == TRUE) has to be run on it internally to get the index back, which likely adds O(n) work or worse in each iteration. You can compute the indices outside the loop; this should speed things up.
So replace

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        for (i in 1:length(NA.ind)) {
            loc <- paste0(NA.locus[i], "\\.")
            out[rownames(out) %in% NA.ind[i], grep(loc, out.colnames)] <- NA
        }
    }

with

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        # one could check if this produces NA itself
        loc <- paste0(NA.locus, "\\.")
        for (i in 1:length(NA.ind)) {
            out[NA.row[i], grep(loc[i], out.colnames)] <- NA
        }
    }

or maybe even faster is computing the column indices only once outside the loop:

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)
        for (i in 1:length(NA.ind)) {
            out[NA.row[i], loc.list[[ NA.col[i] ]]] <- NA
        }
    }

and finally you can get rid of the loop, if you like:

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)
        out[cbind(rep(NA.row, unlist(lapply(loc.list, length))[NA.col]),
                  unlist(loc.list[NA.col]))] <- NA
    }

Results should be identical, but I have not tested which one is the fastest. It may depend on the size and distribution of the NAs.

Cheers,
Klaus

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I like your thinking, @KlausVigo. I'll set up a switch in the function and do some tests.

May the fastest algorithm win!

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

As it turns out (and as expected), the one without the loop is the fastest!

It turns out that you can use a two-column matrix as coordinates for subsetting another matrix! I'm going to test some further improvements, push the results to the branch, and then make a pull request.
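
A quick, self-contained illustration of that feature (arbitrary values): each row of coords is a (row, column) pair, and assigning to m[coords] sets exactly those cells.

    m <- matrix(1:12, nrow = 3)
    coords <- cbind(c(1, 3), c(2, 4))  # cells (1, 2) and (3, 4)
    m[coords] <- NA
    m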

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

@KlausVigo, I have modified your last algorithm to make what's happening a bit more explicit:

      if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)

        # Coordinates for missing rows
        missing.ind <- vapply(loc.list, length, integer(1))[NA.col]
        missing.ind <- rep(NA.row, missing.ind)
        # Coordinates for missing columns
        missing.loc <- unlist(loc.list[NA.col], use.names = FALSE)

        missing_coordinates <- matrix(0L, nrow = length(missing.ind), ncol = 2L)
        missing_coordinates[, 1] <- missing.ind
        missing_coordinates[, 2] <- missing.loc

        out[missing_coordinates] <- NA
      }

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

As of 0113ce6, the timings are now:

1k loci:

   user  system elapsed 
  1.477   0.042   1.538 

4k loci:

   user  system elapsed 
 18.210   0.242  18.655

9k loci:

   user  system elapsed 
 89.109   1.475  92.881

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

Of course, 90 seconds is still not the 30 seconds @thierrygosselin reported for his implementation. The only catch is that his implementation would require additionally importing tidyr.

@thierrygosselin
Collaborator Author

The timing will likely change with the amount of missing data.
I'll test the new implementation with different datasets.
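
For instance, one could vary the proportion of missing data on simulated genotypes coded as in the examples above and time df2genind on each. A sketch only; simulate_df is a made-up helper, not from this thread:

    library(adegenet)

    # Simulate diploid genotypes coded like the examples above ("1_2", "0_0" = missing).
    simulate_df <- function(n.ind = 170, n.loc = 1000, na.rate = 0.1) {
      geno <- matrix(
        paste(sample(1:4, n.ind * n.loc, replace = TRUE),
              sample(1:4, n.ind * n.loc, replace = TRUE), sep = "_"),
        nrow = n.ind, ncol = n.loc,
        dimnames = list(paste0("ind", seq_len(n.ind)), paste0("loc", seq_len(n.loc)))
      )
      geno[sample(length(geno), floor(na.rate * length(geno)))] <- "0_0"
      as.data.frame(geno, stringsAsFactors = FALSE)
    }

    for (na.rate in c(0.05, 0.25, 0.50)) {
      df <- simulate_df(na.rate = na.rate)
      cat("missing-data rate:", na.rate, "\n")
      print(system.time(
        df2genind(df, sep = "_", NA.char = "0_0", ploidy = 2, type = "codom")
      ))
    }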

Thanks!

@zkamvar
Collaborator

zkamvar commented Feb 1, 2016

@thierrygosselin, has this issue been resolved to your satisfaction?
