Scalability issues: Timeout error for large data set snakemake pipeline #529

Open
jolvhull opened this issue Jan 6, 2025 · 0 comments
jolvhull commented Jan 6, 2025

Dear SCENIC+ team,

I was wondering what options there are to speed up, or improve the scalability of, the SCENIC+ Snakemake pipeline.
I have a data set of around 117,000 cells and 340,000 peaks. I was able to run the pycisTopic pipeline using the updated code from the polars_1xx branch, which resolved the out-of-memory issues. I am now trying to run the Snakemake pipeline, but the job does not finish within 72 hours (it runs out of time), which is the maximum wall time I can request on our HPC infrastructure.

I provided 48 cores, 480 GB RAM and 72:00:00 wall time. In issue #453 I saw that fewer resources were recommended for more cells, but in my case the process is killed at the localrule region_to_gene step at only 51% because of a Timeout (> 72:00:00) error. I have tried running the Snakemake step with both the development and main branch versions of scenicplus, but the time issue remains the same.

It is possible that the speed issues are partly HPC-related, but I was wondering whether there are ways to split the pipeline into multiple steps, or to resume it where it stopped, in order to work around the 72-hour time limit (see the sketch below for what I am imagining).
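This is roughly what I have in mind; the rule name comes from the Snakemake log above, and I am not sure whether the SCENIC+ Snakefile is meant to be driven this way:

```bash
# Hypothetical sketch: run the workflow only up to the rule that times out,
# so a single HPC job covers a smaller part of the DAG ...
snakemake --cores 48 --until region_to_gene

# ... and in a later job, rerun the same workflow; Snakemake should skip
# outputs that already exist and continue from where it stopped.
snakemake --cores 48 --rerun-incomplete
```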

Would running on a GPU cluster speed up the pipeline, or are there other ways to speed up the Snakemake pipeline?

As an alternative, I also tried to continue the pipeline locally (where I don't have a time limit) by copying the ACC_GEX.h5mu file that had already been generated in the first steps of the pipeline. Unfortunately, because the file is >160 GB, I get an 'ArrayMemoryError: Unable to allocate 150 GiB for an array with shape (117000x340000) and data type int32' error, since I only have 128 GB of RAM available locally. Is there a way to, for example, read this file in chunks to prevent running out of memory? A sketch of what I mean is below.
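For reference, a minimal sketch of what I mean by chunked/backed reading; the backed argument and the slicing are assumptions on my side, and I have not verified that this works for this file:

```python
import mudata as md

# What I tried locally: a full in-memory read, which fails with the
# ArrayMemoryError above on a 128 GB machine.
# mdata = md.read_h5mu("ACC_GEX.h5mu")

# What I am hoping is possible (assuming backed/lazy reading is supported
# for .h5mu files): keep the matrices on disk and only materialize a slice
# of cells at a time.
mdata = md.read_h5mu("ACC_GEX.h5mu", backed="r")
chunk_size = 10_000
for start in range(0, mdata.n_obs, chunk_size):
    chunk = mdata[start:start + chunk_size]  # view over a slice of cells
    # ... process the chunk ...
```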

Thank you in advance!

Version information
SCENIC+: 1.0a1
