PERF: use sparse matrix for table.collapse(..., one_to_many=True) #884
Conversation
The test failure is due to the doc build with Sphinx on py3.7; attempting to work through it.
...okay, now waiting on resources to test this. @qiyunzhu, Antonio is out-of-office for a while. Is there any chance you could review this PR? The change is relatively straightforward: instead of aggregating data into a dense numpy matrix, we aggregate into a "dict of keys" sparse matrix and then convert it to compressed sparse column.
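The aggregation strategy described in the comment above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual biom-format implementation; the function name, `row_map` argument, and shapes are all hypothetical.

```python
import numpy as np
from scipy.sparse import dok_matrix


def collapse_one_to_many(data, row_map, n_groups):
    """Sum source rows into destination groups via a sparse accumulator.

    Illustrative sketch only. ``data`` is a (n_rows, n_cols) array;
    ``row_map[i]`` lists the group indices that source row ``i``
    contributes to (the one-to-many part).
    """
    n_cols = data.shape[1]
    # Accumulate into a "dict of keys" matrix: cheap incremental updates,
    # no dense (n_groups, n_cols) allocation.
    out = dok_matrix((n_groups, n_cols), dtype=np.float64)
    for src_row, groups in enumerate(row_map):
        for g in groups:
            for col in np.nonzero(data[src_row])[0]:
                out[g, col] += data[src_row, col]
    # Convert to compressed sparse column for efficient downstream use.
    return out.tocsc()


data = np.array([[1.0, 0.0, 2.0],
                 [0.0, 3.0, 0.0]])
# Row 0 maps to groups 0 and 1; row 1 maps to group 1 only.
collapsed = collapse_one_to_many(data, [[0, 1], [1]], 2)
```

DOK is a good fit for the accumulation phase because random writes are O(1); CSC is the better format once the matrix is read-mostly.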
@wasade This looks impressive! It is great that you are continuing to develop the BIOM package to enable high-performance table operations. The use of SciPy's sparse matrices could mean a significant advantage over other implementations (especially the R solutions).
I have one question concerning the data types: woltka's one-to-many collapsing can optionally divide read counts by feature counts. If the current method has a similar mechanism, then it could mean that integers become floats. Will that cause any problem in BIOM, both practically and theoretically (since the result would no longer be a "contingency table")?
In line 2673, I think `np.float` should be okay. But just a reminder, if relevant, that you may consider casting to a more explicit type (like `np.float64`, which is basically the same as Python `float`), and think a bit to make sure there isn't a concern of overflow in extreme cases (for example, if the data type before casting is `np.uint64`, since all numbers are non-negative).
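As a side note on the reviewer's point: casting `np.uint64` to `np.float64` cannot overflow (float64 reaches ~1.8e308), but it can silently lose integer precision above 2**53. A small demonstration of that boundary:

```python
import numpy as np

# float64 represents every integer up to 2**53 exactly.
exact = np.float64(np.uint64(2**53))

# 2**53 + 1 is the first integer float64 cannot represent;
# the cast rounds it back down to 2**53.
inexact = np.float64(np.uint64(2**53 + 1))

# Even the maximum uint64 stays finite after the cast (no overflow),
# it just loses low-order bits.
max_u64 = np.float64(np.uint64(2**64 - 1))
```

So for count data the practical risk is precision loss on very large counts, not overflow.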
The current method supports divide, and it is not a problem for BIOM. Good call. The divide may trigger an invalid-value warning (e.g., `nan` from `inf / inf`) rather than an overflow, in which case numpy would warn:

```
In [2]: v = np.float64(1e1000)

In [3]: v /= 1e1000
<ipython-input-3-4eac3ab25e8a>:1: RuntimeWarning: invalid value encountered in double_scalars
  v /= 1e1000

In [4]: v
Out[4]: nan
```
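If silent `nan` propagation from a case like the session above is a concern, numpy can be configured to raise instead of warn. A minimal sketch using `np.errstate` (the array and values are illustrative):

```python
import numpy as np

# 1e1000 overflows a float64 literal to inf.
v = np.array([np.float64(1e1000)])

try:
    # Ask numpy to raise on invalid operations (such as inf / inf)
    # instead of emitting a RuntimeWarning and producing nan.
    with np.errstate(invalid="raise"):
        v /= 1e1000
    result = "no error"
except FloatingPointError:
    result = "raised"
```

This turns the silent `nan` into a `FloatingPointError` at the point of the division, which is often easier to debug.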
Looks great to me now!
Thanks! My test is still running, so I will hold off on merging for the time being.
Avoid dense representation on `one_to_many`. This code path was implicated in `woltka collapse`.

Note this is a WIP; I am confirming the memory reduction in `woltka collapse` right now. However, the high memory requirement is almost certainly due to the dense memory representation previously used.

cc @antgonza @qiyunzhu
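A rough illustration of why the dense accumulator was memory-hungry: a dense array pays for every cell up front, while a sparse matrix only pays for stored entries. The shapes and counts below are illustrative, not taken from the actual woltka workload:

```python
import numpy as np
from scipy.sparse import dok_matrix

n = 2_000  # hypothetical table dimension

# Dense accumulator: 2000 x 2000 float64 == 32 MB, even if mostly zeros.
dense = np.zeros((n, n))
dense_bytes = dense.nbytes

# Sparse accumulator: cost scales with the number of nonzero entries.
sparse = dok_matrix((n, n), dtype=np.float64)
sparse[0, 0] = 1.0  # a single stored entry
csc = sparse.tocsc()
sparse_bytes = csc.data.nbytes + csc.indices.nbytes + csc.indptr.nbytes
```

For very sparse contingency tables (the common case in amplicon/metagenomic data), the difference grows with table size, which matches the memory reduction this PR targets.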