nullmodel changes in 2.5-0 #255
Some information about timing. We already had some speed-up within 2.4-0 development, and in 2.5-0 came the major changes. Benchmarking was performed with the microbenchmark package; times are averages of 100 replicates and give the average running time in seconds per 100 simulations.
The `nullmodel()` functions have changed much during 2.5-0 development, and they are still partly volatile: some functions and choices may be removed before the release, and some choices may need rethinking. Here is an update of the major changes against 2.4-x:
Compiled code was moved to the `.Call()` interface with proper registration. The effect on speed is probably marginal, but the interface is much cleaner.
There is a speed-up in `quasiswap`, `swap`, `tswap`, `quasiswap_count`, `swap_count` and `abuswap_*`, plus methods using `quasiswap` internally (`swsh_*`). Profiling (also of compiled code) showed that these methods spent most of their time generating random numbers. We used four random numbers for a 2x2 submatrix (two row indices, two column indices). Now we have two alternative schemes:
The `"3minus"` scheme finds the first element directly from the matrix and then a second row (2 random numbers); if these give a submatrix that cannot be swapped (both 1 or both 0) or quasiswapped (both 0), it skips finding the second column and starts a new cycle. This uses 3 or 2 random numbers per cycle (see analysis in commit message 83281ae). In the `"2x2"` scheme we find two diagonal elements directly with 2 random numbers and take the antidiagonal elements from their row and column indices (implemented in 33e9813). This always uses only 2 random numbers. It was implemented in the quantitative swap methods (810d902), but surprisingly it was usually slower, and at best only marginally faster, in binary quasiswaps, so there we use the `"3minus"` scheme (analysis in commit message 72b2d93).
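To make the two sampling schemes concrete, here is an illustrative Python sketch (not the vegan C code; the function names are mine, and the `pick_3minus` test shown is the stricter swap condition, whereas a quasiswap would skip only when both cells are 0):

```python
import random

def pick_2x2(nr, nc):
    """'2x2' scheme: draw two diagonal cells with 2 random numbers and
    derive the antidiagonal cells from their row and column indices."""
    i1, j1 = divmod(random.randrange(nr * nc), nc)  # first diagonal cell
    i2, j2 = divmod(random.randrange(nr * nc), nc)  # second diagonal cell
    return (i1, j1), (i2, j2), (i1, j2), (i2, j1)

def pick_3minus(m, nr, nc):
    """'3minus' scheme: draw one cell and then a second row (2 numbers);
    if the two cells in that column cannot be swapped (both 1 or both 0),
    skip drawing the second column and start a new cycle, so each cycle
    uses 3 or 2 random numbers."""
    i1, j1 = divmod(random.randrange(nr * nc), nc)
    i2 = random.randrange(nr)
    if m[i1][j1] == m[i2][j1]:       # no swap possible in this column pair
        return None                  # only 2 random numbers consumed
    j2 = random.randrange(nc)        # third random number
    return (i1, j1), (i2, j2), (i1, j2), (i2, j1)
```

The point of the comparison above is only the count of random draws per cycle: `pick_2x2` always consumes 2, while `pick_3minus` consumes 2 or 3.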
In the new `greedyqswap` the first element is picked from >1 elements, increasing the chances of sum-of-squares-reducing quasiswaps. The search for the last >1 elements takes most of the time in `quasiswap`, and being greedy gives a huge speed-up. However, greedy steps are biased; if we thin to, say, 1% greedy steps among ordinary quasiswaps, we can still double the speed with little risk of bias. Another method is `boostedqswap`, which is based on the same idea as `curveball`: in `curveball` we find the set of unique species that occur in only one of two compared sites, and in `boostedqswap` we find species that are more abundant in one of the sites (1 against 0, 2 against 1, etc.) and quasiswap an equal number of these up and down on the two rows. My first tests indicate that this is biased (although the similar `curveball` should be unbiased); if so, it will be removed and not released. Both methods need testing, and neither will be released if they appear to be biased (we do not have a lack of poor null models in this world).
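The up/down pairing idea can be illustrated roughly on two count rows. This is my own Python sketch under the description above, not the package implementation, and `boosted_step` is a hypothetical name:

```python
import random

def boosted_step(r1, r2):
    """One 'boostedqswap'-style step on two count rows: find species
    more abundant in one row than in the other, then move an equal
    number of units up and down, preserving row and column sums."""
    up = [j for j in range(len(r1)) if r1[j] > r2[j]]    # more abundant in row 1
    down = [j for j in range(len(r1)) if r1[j] < r2[j]]  # more abundant in row 2
    k = min(len(up), len(down))
    if k == 0:
        return r1, r2                    # nothing to swap on these rows
    k = random.randint(1, k)
    for j in random.sample(up, k):       # move one unit row 1 -> row 2
        r1[j] -= 1
        r2[j] += 1
    for j in random.sample(down, k):     # move one unit row 2 -> row 1
        r1[j] += 1
        r2[j] -= 1
    return r1, r2
```

Each step moves `k` units in each direction, so both row sums and all column sums are invariant; whether the resulting stationary distribution is unbiased is exactly the open question discussed above.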
`backtracking` now uses compiled code, with a huge speed-up. Backtracking is biased, but it is a classic method that may be useful for comparative purposes.
There is a new function for filling matrices with given row and column sums. It is not in the `make.commsim()` interface, but it can be called as `.Call("do_rcfill", n, rs, cs)`, where `n` is the number of simulated matrices, and `rs` and `cs` are the row and column sums. The main reason for implementing this method first was that posts in R-devel and StackOverflow claimed that `stats::r2dtable()` (which we use much internally) gives too regular data. I checked the Miklós & Podani paper and found that they used a function giving more dispersed data as the initial matrix in their quasiswap, and implemented that method. My analysis indicates that the huge number of steps we need in quasiswap guarantees that the initial matrix does not influence the result, but the new function is faster than `r2dtable()` and may speed up simulations. Another use for this function would be as a null model that generates count data with larger variance than the `r2dtable` model we now have. However, I still hesitate to release this function, because we really do not have a lack of poor null models (and by now I think all quantitative null models for counts are poor).
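For illustration, a naive margin-preserving fill can be sketched as follows. This is not Patefield's algorithm behind `r2dtable()`, nor the Miklós & Podani method; it only shows the interface idea of generating `n` count matrices from row sums `rs` and column sums `cs`, and the name `rcfill` is mine:

```python
import random

def rcfill(n, rs, cs):
    """Generate n random count matrices with the given row sums (rs)
    and column sums (cs).  Individuals are placed one at a time into a
    row and a column drawn proportionally to the remaining marginal
    sums; the margins are satisfied exactly, but the distribution of
    tables differs from Patefield's r2dtable algorithm."""
    assert sum(rs) == sum(cs)
    out = []
    for _ in range(n):
        r, c = list(rs), list(cs)
        m = [[0] * len(cs) for _ in rs]
        for _ in range(sum(rs)):
            i = random.choices(range(len(r)), weights=r)[0]
            j = random.choices(range(len(c)), weights=c)[0]
            m[i][j] += 1
            r[i] -= 1                 # shrink remaining margins
            c[j] -= 1
        out.append(m)
    return out
```

Because the remaining row and column totals always agree, the weighted draws never deadlock and every matrix reproduces `rs` and `cs` exactly.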