massive slowdown read.genepop from 2.0.0 to 2.0.1 #114
OK, let's deal with one thing at a time. Missing data and speed issues should be in two different threads: one is a potential bug, the other an improvement. We have been fixing a bunch of NA-related issues over the last weeks. In the current version, you can try:
library(adegenet)
example(read.fstat)
summary(obj)
and get:
> summary(obj)
// Number of individuals: 237
// Group sizes: 10 11 20 14 13 17 11 12 13 22 12 23 15 11 14 10 9
// Number of alleles per locus: 16 11 10 9 12 8 12 12 18
// Number of alleles per group: 36 46 70 52 44 61 42 40 35 53 50 67 48 56 42 54 43
// Percentage of missing data: 2.34 %
// Observed heterozygosity: 0.67 0.67 0.68 0.71 0.63 0.57 0.65 0.62 0.45
// Expected heterozygosity: 0.87 0.79 0.8 0.76 0.87 0.69 0.82 0.76 0.61
> I have the same thing with the other
Hello Thibault, my mistake! Sorry. So the problem with NA is in adegenet CRAN v2.0.0, not in the devel version 2.0.1. So what remains is an enhancement issue. https://www.dropbox.com/s/aqdsk2ibzz2uhti/genepop.8pop.gen?dl=0 Best
OK, I have renamed the issue accordingly. Thanks @thierrygosselin for the report. We need to track this down before submitting the new version to CRAN.
I did change the way an object is subsetted, which may affect speed, but I would have to run a few tests to make sure.
I don't think these changes (the NA replacement fix) can be responsible for the slowdown, but I can't see any subsetting in this commit?
Technically, the only thing that changed is the way the subsetting vector is calculated.
I'll check it with a profiler to see where the bottleneck is when I get into the office today.
Thanks!
@romunov, I hate to be the bearer of bad news, but I think your change may be the culprit. I believe we MAY be able to actually fix this in C, perhaps. I will look into this a bit later.
The road to hell is paved with good intentions. :) Perhaps there's another, more robust way of doing the subsetting.
Another option would be to introduce internal rownames (something that can't be coerced to numeric, e.g. "sam1", "sam2"...) and revert back to the original names just before finishing. In that case the old code would probably work.
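Something along these lines, as a rough sketch (x stands in for the matrix being processed; this is hypothetical, not actual package code):
orig.names <- rownames(x)                       # keep the user-supplied names
rownames(x) <- paste0("sam", seq_len(nrow(x)))  # temporary names that cannot be coerced to numeric
# ... perform the subsetting / NA replacement here ...
rownames(x) <- orig.names                       # restore the original names just before finishing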
Indeed. Besides, the issue you fixed with that commit was complicated as hell, so there's no worries here. It's better to have the package slow at getting the right result than fast at getting the wrong one 😆
I started a new branch at f8ba946 called speedup-missing-import
Since each sample will have a unique name, I replaced checking the rownames with simply subsetting by the name. On my machine, this cuts the time for the 1000-locus sample from 27 s to 14 s.
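For illustration only, a toy sketch of what subsetting by name means here (not the actual commit):
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("sam1", "sam2", "sam3"), c("loc1.01", "loc1.02")))
m[which(rownames(m) == "sam2"), "loc1.01"] <- NA  # locating the row by checking the rownames first
m["sam2", "loc1.02"] <- NA                        # versus subsetting by the sample name directly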
Hi guys,
with
or maybe even faster is computing the column indices only once outside the loop:
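A minimal sketch of what that could look like (not the exact code from this comment; the variable names out, NA.posi, NA.ind and NA.locus are assumed from the modified version quoted further down):
if (length(NA.posi) > 0) {
  out.colnames <- colnames(out)
  uloc <- unique(NA.locus)
  # one grep per locus, computed once, instead of one grep per missing genotype
  loc.list <- lapply(paste0(uloc, "\\."), grep, out.colnames)
  names(loc.list) <- uloc
  for (i in seq_along(NA.ind)) {
    out[NA.ind[i], loc.list[[NA.locus[i]]]] <- NA
  }
}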
and finally you can get rid of the loop, if you like:
Results should be identical, but I have not tested which one is the fastest. It may depend on the size and distribution of NAs. Cheers,
I like your thinking, @KlausVigo. I'll set up a switch in the function and do some tests. May the fastest algorithm win!
As it turns out (and as expected), the one without the loop is the fastest! You can use a two-column matrix as coordinates for subsetting another matrix! I'm going to test some further improvements, push the results to the branch, and then make a pull request.
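A quick base R illustration of that indexing feature:
m <- matrix(1:9, nrow = 3)
coords <- cbind(c(1, 3), c(2, 1))  # row/column pairs: (1, 2) and (3, 1)
m[coords] <- NA                    # replaces exactly those two cells in a single call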
@KlausVigo, I have modified your last algorithm to make it a bit more explicit what's happening:
if (length(NA.posi) > 0) {
out.colnames <- colnames(out)
NA.row <- match(NA.ind, rownames(out))
loc <- paste0(NA.locus, "\\.")
uloc <- unique(loc)
loc.list <- lapply(uloc, grep, out.colnames)
NA.col <- match(loc, uloc)
# Coordinates for missing rows
missing.ind <- vapply(loc.list, length, integer(1))[NA.col]
missing.ind <- rep(NA.row, missing.ind)
# Coordinates for missing columns
missing.loc <- unlist(loc.list[NA.col], use.names = FALSE)
missing_coordinates <- matrix(0L, nrow = length(missing.ind), ncol = 2L)
missing_coordinates[, 1] <- missing.ind
missing_coordinates[, 2] <- missing.loc
out[missing_coordinates] <- NA
}
As of 0113ce6, the timings are now:
1k loci:
4k loci:
9k loci:
Of course, 90 seconds is not the 30 seconds that @thierrygosselin reported for his implementation. The only catch is that his implementation would require additionally importing tidyr.
The timing will likely change with the amount of missing data. Thanks!
@thierrygosselin, has this issue been resolved to your satisfaction?
I apologize for the long message.
I'm using the latest github version of adegenet (v.2.0.1)
First problem: urgent
adegenet functions that use df2genind (genetix, Fstat, Genepop, STRUCTURE) lost the argument 'missing' in v2.0.0. This argument was used inside df2genind (now NA.char). Consequently, they are all currently treating missing values as real data! This might not be a huge problem with microsatellite markers, but it is for GBS/RAD data that have intrinsically high levels of missing values...
This also impacts loci2genind in the package pegas @emmanuelparadis and could be problematic in some new functions in hierfstat @jgx65 that use genind objects directly.
Quick solution: the way around this problem, for me, was to use the genind constructor directly, with allele counts, instead of using df2genind.
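A rough sketch of that workaround, assuming the genind() constructor in adegenet 2.x accepts an individuals-by-alleles count matrix as its first (tab) argument; the exact call is an assumption, not the code used in the original workflow:
library(adegenet)
# toy data: two diploid individuals typed at one biallelic locus (alleles loc1.01 and loc1.02)
counts <- matrix(c(2L, 0L,
                   1L, 1L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("ind1", "ind2"), c("loc1.01", "loc1.02")))
obj <- genind(counts, ploidy = 2)  # assumed signature; a missing genotype would appear as NAs in its locus columns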
Second problem:
df2genind, could we make df2genind faster?
I know most of adegenet coding is done with base R. Very purist :) but some code inside df2genind is less optimal for large GBS/RAD datasets with lots of missing data, e.g. in the file import.R, the loop in lines 293 to 300. That part takes forever with some of the datasets I tested (below).
Downloading the files:
Details about the data: it's originally a VCF made tidy (each variable in a separate column and each observation in a row). The GT in the VCF were converted this way:
importing the files
conversion to genind using adegenet::df2genind
Results
My way around this is shown below. I'll use a dataset with 586 individuals and 10156 markers. It takes 30 s on my MacBook Pro to import the data into R, change the coding of the tidy VCF (0/1, ./., etc.) to allele counts and create the genind object! Similar coding for multiallelic markers is also fast.
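As a generic illustration of that recoding step for a biallelic SNP (a hedged sketch only; the original tidy-VCF code may differ):
gt <- c("0/0", "0/1", "1/1", "./.")                           # VCF-style genotypes at one locus
lookup <- c("0/0" = 2L, "0/1" = 1L, "1/1" = 0L, "./." = NA)
ref <- unname(lookup[gt])                 # copies of the reference allele carried by each individual
alt <- 2L - ref                           # copies of the alternate allele; "./." propagates as NA
counts <- cbind(loc1.ref = ref, loc1.alt = alt)               # two allele-count columns for this locus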
Hope this helps!
Best,
Thierry