Scalability issues: Timeout error for large data set snakemake pipeline #529

Open
jolvhull opened this issue Jan 6, 2025 · 0 comments
jolvhull commented Jan 6, 2025

Dear SCENIC+ team,

I was wondering what options there are to speed up, or improve the scalability of, the SCENIC+ Snakemake pipeline.
I have a data set of around 117,000 cells and 340,000 peaks. I was able to run the pycisTopic pipeline using the updated code from the polars_1xx branch, which resolved the out-of-memory issues. I am now trying to run the Snakemake pipeline, but the job does not finish within 72 hours (it runs out of time), which is the maximum wall time I can request on our HPC infrastructure.

I provided 48 cores, 480 GB RAM and 72:00:00 wall time. In issue #453 I saw that fewer resources were recommended for more cells, but in my case the process is killed at the localrule region_to_gene step at only 51% because of a Timeout (> 72:00:00) error. I have tried running the Snakemake step with both the development and main branch versions of scenicplus, but the time issue remains the same.

It is possible that the speed issues are partly HPC-related, but I was wondering whether there are ways to split the pipeline into multiple steps, or to resume it where it stopped, in order to work around the 72-hour time limit (see the sketch below for what I am imagining).
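This is roughly what I have in mind; the rule name comes from the Snakemake log above, and I am not sure whether the SCENIC+ Snakefile is meant to be driven this way:

```bash
# Hypothetical sketch: run the workflow only up to the rule that times out,
# so a single HPC job covers a smaller part of the DAG ...
snakemake --cores 48 --until region_to_gene

# ... and in a later job, rerun the same workflow; Snakemake should skip
# outputs that already exist and continue from where it stopped.
snakemake --cores 48 --rerun-incomplete
```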

Would running on a GPU cluster speed up the pipeline, or are there other ways to speed up the Snakemake pipeline?

As an alternative, I also tried to continue the pipeline locally (where I don't have a time limit) by copying the ACC_GEX.h5mu file that had already been generated in the first steps of the pipeline. Unfortunately, because the file is >160 GB, I get an 'ArrayMemoryError: Unable to allocate 150 GiB for an array with shape (117000x340000) and data type int32' error, since I only have 128 GB of RAM available locally. Is there a way to, for example, read this file in chunks to prevent running out of memory? A sketch of what I mean is below.
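For reference, a minimal sketch of what I mean by chunked/backed reading; the backed argument and the slicing are assumptions on my side, and I have not verified that this works for this file:

```python
import mudata as md

# What I tried locally: a full in-memory read, which fails with the
# ArrayMemoryError above on a 128 GB machine.
# mdata = md.read_h5mu("ACC_GEX.h5mu")

# What I am hoping is possible (assuming backed/lazy reading is supported
# for .h5mu files): keep the matrices on disk and only materialize a slice
# of cells at a time.
mdata = md.read_h5mu("ACC_GEX.h5mu", backed="r")
chunk_size = 10_000
for start in range(0, mdata.n_obs, chunk_size):
    chunk = mdata[start:start + chunk_size]  # view over a slice of cells
    # ... process the chunk ...
```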

Thank you in advance!

Version information
SCENIC+: 1.0a1
