
massive slowdown read.genepop from 2.0.0 to 2.0.1 #114

Closed
thierrygosselin opened this issue Dec 3, 2015 · 22 comments

@thierrygosselin
Collaborator

I apologize for the long message.

I'm using the latest github version of adegenet (v.2.0.1)

  1. First problem: urgent
    adegenet functions that use df2genind (genetix, Fstat, Genepop, STRUCTURE) lost the argument 'missing' in v2.0.0. This argument was used inside df2genind (now NA.char). Consequently, they are all currently treating missing values as real data!

    This might not be a huge problem with microsatellite markers, but it is for GBS/RAD data that have intrinsically high levels of missing values...

    This also impacts loci2genind in the pegas package @emmanuelparadis and could be problematic in some new functions in hierfstat @jgx65 that use genind objects directly.

    Quick solution: my way around this problem was to use the genind constructor directly, with allele counts, instead of using df2genind.

  2. Second problem: could we make df2genind faster?

    I know most of adegenet is coded in base R. Very purist :) But some code inside df2genind is suboptimal for large GBS/RAD datasets with lots of missing data, e.g. in import.R, the loop on lines 293 to 300. That part takes forever with some of the datasets I tested (below).

  3. Downloading the files:

    # For 1000 SNP and 170 individuals in 17 pops:
    https://www.dropbox.com/s/x1c3n9hin8zr5bf/test.df2genind.gbs.data.subsample1000.tsv?dl=0
    # For 4000 SNP and 170 individuals in 17 pops: 
    https://www.dropbox.com/s/biy6y35lpm6kocv/test.df2genind.gbs.data.subsample4000.tsv?dl=0
    # For 9000 SNP and 170 individuals in 17 pops: 
    https://www.dropbox.com/s/ww9hxyngmctz4dz/test.df2genind.gbs.data.subsample9000.tsv?dl=0
  4. Details about the data: it's originally a VCF made tidy (each variable in a separate column and each observation in a row). The GT field in the VCF was converted this way:

    # might not be optimal coding for conversion, but for more than 500 individuals and 10 000 markers it takes 25s to convert...
    require(dplyr)
    require(stringi)
    require(reshape2) # needed for dcast() below
    gbs.data <- vcf %>% 
    mutate(
      REF= stri_replace_all_fixed(str = REF, pattern = c("A", "C", "G", "T"), replacement = c("1", "2", "3", "4"), vectorize_all = FALSE), # replace nucleotide with numbers
      ALT = stri_replace_all_fixed(str = ALT, pattern = c("A", "C", "G", "T"), replacement = c("1", "2", "3", "4"), vectorize_all = FALSE),# replace nucleotide with numbers
      GT = ifelse(GT == "0/0", stri_c(REF, REF, sep = "_"),
                  ifelse(GT == "1/1",  stri_c(ALT, ALT, sep = "_"),
                         ifelse(GT == "0/1", stri_c(REF, ALT, sep = "_"),
                                ifelse(GT == "1/0", stri_c(ALT, REF, sep = "_"), "0_0")
                         )
                  )
      )
    ) %>% 
    arrange(MARKERS, POP_ID) %>% 
    select(-c(REF, ALT))%>%
    dcast(INDIVIDUALS + POP_ID ~ MARKERS, value.var = "GT") %>% # make a wide format
    arrange(POP_ID, INDIVIDUALS)
  5. Importing the files

    # Here with 1000 SNP
    require(readr)
    gbs.data <- read_tsv(file = "test.df2genind.gbs.data.subsample1000.tsv", col_names = TRUE, progress = interactive())
  6. Conversion to genind using adegenet::df2genind

    require(dplyr)
    ind <- as.character(gbs.data$INDIVIDUALS)
    pop <- gbs.data$POP_ID
    genind.df <- gbs.data %>% select(-c(INDIVIDUALS, POP_ID))
    rownames(genind.df) <- ind
    loc.names <- colnames(genind.df)
    system.time(res <- adegenet::df2genind(X = genind.df, sep = "_", ncode = 2, ind.names = ind, loc.names = loc.names, pop = pop, NA.char = "0_0", ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL))
    res # works
  7. Results

    Timing goes from 21 s with 1000 SNPs (no problem) to more than 10 min with the larger datasets.
  8. My way around this is shown below. I'll use a dataset with 586 individuals and 10156 markers. It takes 30 s on my MacBook Pro to import the data in R, change the coding of the tidy VCF (0/1, ./., etc.) to allele counts and create the genind object! Similar code for multiallelic markers is also fast.

    https://www.dropbox.com/s/qkvqk647li1giuc/tidy_vcf.tsv?dl=0
    genind.prep <- read_tsv(file = "tidy_vcf.tsv", col_names = TRUE, progress = interactive()) %>%
    mutate(GT = stri_replace_all_fixed(str = GT, pattern = "/", replacement = ":", vectorize_all = FALSE)) %>% 
    mutate(GT = stri_replace_all_fixed(str = GT, pattern = c("0:0", "1:1", "0:1", "1:0", ".:."), replacement = c("2_0", "0_2", "1_1", "1_1", "NA_NA"), vectorize_all = FALSE)) %>% 
    select(-c(REF, ALT)) %>%
    tidyr::separate(col = GT, into = c("A1", "A2"), sep = "_", extra = "drop", remove = TRUE) %>%
    tidyr::gather(key = ALLELES, value = COUNT, -c(MARKERS, INDIVIDUALS, POP_ID)) %>% # make tidy
    tidyr::unite(MARKERS_ALLELES, MARKERS, ALLELES, sep = ".") %>%
    mutate(COUNT = replace(COUNT, which(COUNT == "NA"), NA)) %>% 
    group_by(POP_ID, INDIVIDUALS) %>%
    tidyr::spread(data = ., key = MARKERS_ALLELES, value = COUNT)
    # convert to genind with genind constructor directly
    ind <- as.character(genind.prep$INDIVIDUALS)
    pop <- genind.prep$POP_ID
    genind.df <- genind.prep %>%
    select(-c(INDIVIDUALS, POP_ID))
    rownames(genind.df) <- ind
    prevcall <- match.call()
    res <- adegenet::genind(tab = genind.df, pop = pop, prevcall = prevcall, ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL)

Hope this helps!
Best,
Thierry

@thibautjombart
Owner

OK, let's deal with one thing at a time. Missing data and speed issues should be in 2 different threads - one is a potential bug, the other an improvement.

We have been fixing a bunch of NA related issues over the last weeks. In the current version, you can try:

library(adegenet)
example(read.fstat)
summary(obj)

and get

> summary(obj)

// Number of individuals: 237
// Group sizes: 10 11 20 14 13 17 11 12 13 22 12 23 15 11 14 10 9
// Number of alleles per locus: 16 11 10 9 12 8 12 12 18
// Number of alleles per group: 36 46 70 52 44 61 42 40 35 53 50 67 48 56 42 54 43
// Percentage of missing data: 2.34 %
// Observed heterozygosity: 0.67 0.67 0.68 0.71 0.63 0.57 0.65 0.62 0.45
// Expected heterozygosity: 0.87 0.79 0.8 0.76 0.87 0.69 0.82 0.76 0.61
> 

I have the same thing with the other read.[...].
Can you update to the current devel and check if the NA problem is still there?
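
For reference, the devel version can be installed straight from GitHub. A minimal sketch, assuming the devtools package is available:

    # Install the development version of adegenet from GitHub (devtools assumed).
    install.packages("devtools")                  # if not already installed
    devtools::install_github("thibautjombart/adegenet")
    library(adegenet)
    packageVersion("adegenet")                    # should now report the devel version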

@thierrygosselin
Collaborator Author

Hello Thibaut, my mistake! Sorry.

So the problem with NA is in adegenet CRAN v.2.0.0 not in the devel version 2.0.1.
Sorry about that.

So what remains is an enhancement issue.
With the genepop file linked below, version 2.0.1 takes 100x more time (1578 sec) to process the file than version 2.0.0 (15 sec)... but I do see the percentage of missing values with the summary function.

https://www.dropbox.com/s/aqdsk2ibzz2uhti/genepop.8pop.gen?dl=0

Best
Thierry

@thibautjombart changed the title from "missing values problems: df2genind and adegenet functions to import genetix, Fstat, Genepop and STRUCTURE files" to "massive slowdown read.genepop from 2.0.0 to 2.0.1" on Dec 7, 2015
@thibautjombart
Owner

OK I have renamed the issue accordingly. Thanks @thierrygosselin for the report. We need to track this down before submitting the new version to CRAN.

@thibautjombart
Owner

@zkamvar @romunov do you remember making any change to df2genind / genind constructor that may affect speed? I suspect I have, but worth asking.

@romunov
Collaborator

romunov commented Dec 7, 2015

I did change the way an object is subsetted, which may affect speed, but I would have to run a few tests to make sure.

@thibautjombart
Owner

I don't think these changes (the NA replacement fix) can be responsible for the slowdown, but I can't see the subsetting change in this commit?

@romunov
Collaborator

romunov commented Dec 7, 2015

Technically, the only thing that changed is the way the subsetting vector is calculated.

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I'll check it with a profiler to see where the bottleneck is when I get into the office today.
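
A minimal profiling sketch with base R's Rprof, reusing the df2genind call from the comments above (not the exact commands used here):

    # Profile df2genind on one of the test datasets to see where the time goes.
    # genind.df, ind and pop refer to the objects built in the earlier comment.
    Rprof("df2genind.prof")
    res <- adegenet::df2genind(X = genind.df, sep = "_", ncode = 2,
                               ind.names = ind, pop = pop,
                               NA.char = "0_0", ploidy = 2, type = "codom")
    Rprof(NULL)
    head(summaryRprof("df2genind.prof")$by.self)  # top functions by self time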

@thibautjombart
Owner

Thanks!

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

@romunov, I hate to be the bearer of bad news, but I think your change may be the culprit:

[profiler output screenshot]

I believe we may be able to fix this in C. I will look into this a bit later.

@romunov
Collaborator

romunov commented Dec 7, 2015

The road to hell is paved with good intentions. :) Perhaps there's another way of doing robust subsetting.

@romunov
Collaborator

romunov commented Dec 7, 2015

Another option would be to introduce internal rownames (something that can't be coerced to numeric, e.g. "sam1", "sam2", ...) and revert to the original names just before finishing. In that case the old code would probably work.
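
A rough sketch of that mechanism with a toy matrix (all object and locus names below are illustrative, not taken from the package code):

    # Swap in synthetic rownames that cannot be coerced to numeric, do the NA
    # replacement against them, then restore the original names at the end.
    out <- matrix(1, nrow = 3, ncol = 2,
                  dimnames = list(c("1", "2", "10"), c("loc1.01", "loc1.02")))
    NA.ind <- "10"                                      # individual with a missing genotype

    orig.names <- rownames(out)
    rownames(out) <- paste0("sam", seq_len(nrow(out)))  # "sam1", "sam2", "sam3"
    NA.ind <- paste0("sam", match(NA.ind, orig.names))  # remap to the synthetic names

    out[NA.ind, ] <- NA                                 # replacement now subsets by a safe name
    rownames(out) <- orig.names                         # revert just before finishing
    out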

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

The road to hell is paved with good intentions. :) Perhaps there's another way of doing robust subsetting.

Indeed. Besides, the issue you fixed with that commit was complicated as hell, so there's no worries here. It's better to have the package slow at getting the right result than fast at getting the wrong one 😆

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I started a new branch at f8ba946 called speedup-missing-import

zkamvar referenced this issue Dec 7, 2015:
Since each sample will have a unique name, I replaced checking the rownames with simply subsetting by the name. On my machine, the 1000-locus sample goes from 27 s to 14 s.
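
A tiny, self-contained illustration of what that commit message describes (the matrix and names below are made up, not the actual diff): since each sample name is unique, a row can be indexed directly by its name instead of building a logical vector over every rowname on each iteration.

    out <- matrix(0, nrow = 3, ncol = 4,
                  dimnames = list(c("indA", "indB", "indC"), paste0("loc1.0", 1:4)))

    out[rownames(out) %in% "indB", 1:2] <- NA  # before: logical scan of all rownames
    out["indC", 3:4] <- NA                     # after: direct subsetting by the name
    out
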
@KlausVigo
Collaborator

Hi guys,
I think the problem is that rownames(out) %in% NA.ind[i] creates a logical vector, and something like which(... == TRUE) has to be run on it internally to get the index back, which likely adds O(n) work or worse in each iteration. You can compute the indices outside the loop; this should speed things up.
So replace

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        for (i in 1:length(NA.ind)) {
            loc <- paste0(NA.locus[i], "\\.")
            out[rownames(out) %in% NA.ind[i], grep(loc, out.colnames)] <- NA
        }
    }

with

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        # one could check if this produces NA itself
        loc <- paste0(NA.locus, "\\.")
        for (i in 1:length(NA.ind)) {
            out[NA.row[i], grep(loc[i], out.colnames)] <- NA
        }
    }

or maybe even faster is computing the column indices only once outside the loop:

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)
        for (i in 1:length(NA.ind)) {
            out[NA.row[i], loc.list[[ NA.col[i] ]]] <- NA
        }
    }

and finally you can get rid of the loop, if you like:

    if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)
        out[cbind(rep(NA.row, unlist(lapply(loc.list, length))[NA.col]),
                  unlist(loc.list[NA.col]))] <- NA
    }

Results should be identical, but I have not tested which one is the fastest. It may depend on the size and distribution of the NAs.

Cheers,
Klaus

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

I like your thinking, @KlausVigo. I'll set up a switch in the function and do some tests.

May the fastest algorithm win!

@zkamvar
Collaborator

zkamvar commented Dec 7, 2015

As it turns out (and as expected), the one without the loop is the fastest!

It turns out that you can use a two-column matrix as coordinates for subsetting another matrix! I'm going to test some further improvements, push the results to the branch, and then make a pull request.
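
A quick, self-contained illustration of that feature (arbitrary values): each row of coords is a (row, column) pair, and assigning to m[coords] sets exactly those cells.

    m <- matrix(1:12, nrow = 3)
    coords <- cbind(c(1, 3), c(2, 4))  # cells (1, 2) and (3, 4)
    m[coords] <- NA
    m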

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

@KlausVigo, I have modified your last algorithm to make what's happening a bit more explicit:

      if (length(NA.posi) > 0) {
        out.colnames <- colnames(out)
        NA.row <- match(NA.ind, rownames(out))
        loc <- paste0(NA.locus, "\\.")
        uloc <- unique(loc)
        loc.list <- lapply(uloc, grep, out.colnames)
        NA.col <- match(loc, uloc)

        # Coordinates for missing rows
        missing.ind <- vapply(loc.list, length, integer(1))[NA.col]
        missing.ind <- rep(NA.row, missing.ind)
        # Coordinates for missing columns
        missing.loc <- unlist(loc.list[NA.col], use.names = FALSE)

        missing_coordinates <- matrix(0L, nrow = length(missing.ind), ncol = 2L)
        missing_coordinates[, 1] <- missing.ind
        missing_coordinates[, 2] <- missing.loc

        out[missing_coordinates] <- NA
      }

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

As of 0113ce6, the timings are now:

1k loci:

   user  system elapsed 
  1.477   0.042   1.538 

4k loci:

   user  system elapsed 
 18.210   0.242  18.655

9k loci:

   user  system elapsed 
 89.109   1.475  92.881

@zkamvar
Collaborator

zkamvar commented Dec 8, 2015

Of course, 90 seconds is still not the 30 seconds @thierrygosselin reported for his implementation. The only catch is that his implementation would require additionally importing tidyr.

@thierrygosselin
Collaborator Author

The timing will likely change with the amount of missing data.
I'll test the new implementation with different datasets.
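
For instance, one could vary the proportion of missing data on simulated genotypes coded as in the examples above and time df2genind on each. A sketch only; simulate_df is a made-up helper, not from this thread:

    library(adegenet)

    # Simulate diploid genotypes coded like the examples above ("1_2", "0_0" = missing).
    simulate_df <- function(n.ind = 170, n.loc = 1000, na.rate = 0.1) {
      geno <- matrix(
        paste(sample(1:4, n.ind * n.loc, replace = TRUE),
              sample(1:4, n.ind * n.loc, replace = TRUE), sep = "_"),
        nrow = n.ind, ncol = n.loc,
        dimnames = list(paste0("ind", seq_len(n.ind)), paste0("loc", seq_len(n.loc)))
      )
      geno[sample(length(geno), floor(na.rate * length(geno)))] <- "0_0"
      as.data.frame(geno, stringsAsFactors = FALSE)
    }

    for (na.rate in c(0.05, 0.25, 0.50)) {
      df <- simulate_df(na.rate = na.rate)
      cat("missing-data rate:", na.rate, "\n")
      print(system.time(
        df2genind(df, sep = "_", NA.char = "0_0", ploidy = 2, type = "codom")
      ))
    }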

Thanks!

@zkamvar
Collaborator

zkamvar commented Feb 1, 2016

@thierrygosselin, has this issue been resolved to your satisfaction?
