Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyranges slow on many chromosomes. Use chromsweep algorithm? #44

Open
endrebak opened this issue Dec 25, 2024 · 5 comments
Open

pyranges slow on many chromosomes. Use chromsweep algorithm? #44

endrebak opened this issue Dec 25, 2024 · 5 comments

Comments

@endrebak
Copy link
Collaborator

This slowness is due to the groupby that splits on chromosome and handles each df individually.

I should try implemeting the chromsweep algorithm in rust. We can sort the entire dataframes with radix sort and do a quick chromsweep pass. This would incur no penalty for multiple chromosomes.

@endrebak
Copy link
Collaborator Author

endrebak commented Dec 25, 2024

I think I have succeeded 🥳 This is something we should polish and include in the paper @marco-mariotti. These are insane speedups.

There is nothing revolutionary in my algorithm, it is just very optimized. The nice thing about my algorithm is that it is just as easy to use with polars as with pandas. See https://github.com/endrebak/ruranges for a beginning.

Previous records, 10 mill reads, hg38, max read length 1000:

pyranges: 18.22
genomicranges: 27.85
bioframe: 60.19
bedtools: 95.08

New record, ruranges (this includes getting the data from python to rust, doing the operation, getting the result back to python):

2.48 seconds

Previous records, 10 mill reads, proteome (11k chroms!), max read length 100:

genomicranges: 36.70
pyranges: 95.45
bedtools: 274.37
bioframe: 320.50

New record, ruranges (this includes getting the data from python to rust, doing the operation, getting the result back to python):

7.8 seconds

@endrebak
Copy link
Collaborator Author

Note that ruranges is a supporting library I'll use with pyranges, but it should be made available to use directly from rust (and R?) too.

@endrebak
Copy link
Collaborator Author

The memory usage is the exact same as before.

@marco-mariotti
Copy link
Member

marco-mariotti commented Dec 25, 2024 via email

@endrebak
Copy link
Collaborator Author

Yes, let's do so.

Sorting is now 4 times faster than the old pyranges for 20 mill intervals - 3 vs 13 seconds

Closest is 6 times faster than bioframe for 20 mill intervals (k=2, include overlaps) - 7 vs 42 seconds.

The two hardest algorithms are now implemented (overlaps and k-nearest). I think the only other binary algorithms missing are subtract and complement. The rest can be implemented using these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants