-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyranges slow on many chromosomes. Use chromsweep algorithm? #44
Comments
I think I have succeeded 🥳 This is something we should polish and include in the paper @marco-mariotti. These are insane speedups. There is nothing revolutionary in my algorithm, it is just very optimized. The nice thing about my algorithm is that it is just as easy to use with polars as with pandas. See https://github.com/endrebak/ruranges for a beginning. Previous records, 10 mill reads, hg38, max read length 1000: pyranges: 18.22 New record, ruranges (this includes getting the data from python to rust, doing the operation, getting the result back to python): 2.48 seconds Previous records, 10 mill reads, proteome (11k chroms!), max read length 100: genomicranges: 36.70 New record, ruranges (this includes getting the data from python to rust, doing the operation, getting the result back to python): 7.8 seconds |
Note that ruranges is a supporting library I'll use with pyranges, but it should be made available to use directly from rust (and R?) too. |
The memory usage is the exact same as before. |
Thanks Endre! Let's do a zoom soon after the Christmas break so you catch
me up and we plan the paper.
Ciao!
Marco Mariotti
…On Wed, Dec 25, 2024, 12:28 endrebak.ada ***@***.***> wrote:
The memory usage is the exact same as before.
—
Reply to this email directly, view it on GitHub
<#44 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACTXRIHKUGG77HWYLRSW46D2HKJFTAVCNFSM6AAAAABUFMLNUKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRRHA2DQMZXGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Yes, let's do so. Sorting is now 4 times faster than the old pyranges for 20 mill intervals - 3 vs 13 seconds Closest is 6 times faster than bioframe for 20 mill intervals (k=2, include overlaps) - 7 vs 42 seconds. The two hardest algorithms are now implemented (overlaps and k-nearest). I think the only other binary algorithms missing are subtract and complement. The rest can be implemented using these. |
This slowness is due to the groupby that splits on chromosome and handles each df individually.
I should try implemeting the chromsweep algorithm in rust. We can sort the entire dataframes with radix sort and do a quick chromsweep pass. This would incur no penalty for multiple chromosomes.
The text was updated successfully, but these errors were encountered: