find_highly_variable_features [PERFORMANCE] #156

maithermbarros · 2024-08-21T19:31:12Z

Hello. Thanks for developing SCENIC+, it is super dope and it is giving me nice results so far.

What type of problem are you experiencing and which function is your problem related to
While preparing multiome datasets to run SCENIC+, when running find_highly_variable_features my python process gets killed. I am running a python script directly in my workstation which has good memory:

Is this problem data set related? If so, provide information on the problematic data set
It works without issues on another dataset of mine (~26k cells, ~480k regions), but not on this larger dataset with ~45k cells and ~540k regions.

Describe alternatives you've considered
I also tried running this step within RStudio through reticulate and in a Python Jupyter notebook, but it gets killed there too because of memory.

Additional context
I am running all of this in a conda environment where I installed scenicplus, pycistopic and pycistarget.
I tried running this step of the pipeline using a python script:

import scenicplus
import pycisTopic 
import scanpy as sc
import pandas as pd
import os
import pickle
import numpy as np

from pycisTopic.diff_features import (
  impute_accessibility,
  normalize_scores,
  find_highly_variable_features,
  find_diff_features)

import matplotlib.pyplot as plt

# Directories and working paths
projDir = "/mnt/data/SCENICplus_P1-B1-B2P1-B2P3R2/"
outDir = "/mnt/data/SCENICplus_P1-B1-B2P1-B2P3R2/output"
work_dir = "/mnt/data/SCENICplus_P1-B1-B2P1-B2P3R2/"
tmpDir = "/mnt/scratch/SCENICplus_temp/"

# Load the imputed accessibility object
with open(os.path.join(outDir, 'DARs', 'imputed_acc_obj.pkl'), 'rb') as infile:
    imputed_acc_obj = pickle.load(infile)

# Normalize the imputed data
normalized_imputed_acc_obj = normalize_scores(imputed_acc_obj, scale_factor=10**4)

# Find highly variable features and save the plot as a PDF
# (plot=True, so the plot is still generated in this run)
variable_regions = find_highly_variable_features(
    normalized_imputed_acc_obj,
    min_disp=0.05,
    min_mean=0.0125,
    max_mean=3,
    max_disp=np.inf,
    n_bins=20,
    n_top_features=None,
    plot=True,
    save=os.path.join(outDir, "DARs", "HVR_plot.pdf"),
)

# Save the results
with open(os.path.join(outDir, "DARs", "variable_regions.pkl"), "wb") as outfile:
    pickle.dump(variable_regions, outfile)

Then it gets killed:

I also tried running normalize_scores first and saving the output as a pkl file, then running find_highly_variable_features on it separately, but that doesn't work either.

Version information
Report versions of modules relevant to this error

Any help/insight would be greatly appreciated as I really need to finish preparing this file to then run SCENIC+. Thank you!

ghuls · 2024-08-22T13:25:29Z

There is some ongoing work to improve memory usage for this code and some other memory intensive functions in pycisTopic that will eventually appear in the polars_1xx branch of pycisTopic.
https://github.com/aertslab/pycisTopic/tree/polars_1xx

maithermbarros · 2024-08-23T15:18:24Z

Thanks for getting back to me. Do you have any workarounds for the meantime? I would really like to be able to run SCENIC+ on this dataset and to do so I need to run this step. I tried using multiprocessing in python but it still doesn't work.

ghuls · 2024-08-29T13:06:27Z

> Thanks for getting back to me. Do you have any workarounds for the meantime? I would really like to be able to run SCENIC+ on this dataset and to do so I need to run this step. I tried using multiprocessing in python but it still doesn't work.

No workarounds for now, but likely in a few weeks. Topic modeling with Mallet got some speedup and reduced memory usage. diff_features code will be next.

eliascrapa · 2024-09-15T17:02:57Z

Hi,
just wanted to ask if there are any updates. I used 780GB of memory on the HPC but am still not able to get normalize_scores running:
"Unable to allocate 710. GiB for an array with shape (680104, 140085) and data type float64."

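The size in that error message can be checked by hand: a dense float64 array needs 8 bytes per value, so the shape alone determines the allocation. The helper below is only an illustrative sketch (`dense_array_gib` is a made-up name, not part of pycisTopic), and the float32 line is a comparison, not something the library is known to offer.

```python
import math


def dense_array_gib(shape, bytes_per_value=8):
    """Return the size in GiB of a dense array with the given shape."""
    return math.prod(shape) * bytes_per_value / 2**30


# Shape taken from the error message above (float64 = 8 bytes per value).
shape = (680104, 140085)
print(f"float64: {dense_array_gib(shape):.1f} GiB")

# The same shape stored as float32, purely for comparison.
print(f"float32: {dense_array_gib(shape, bytes_per_value=4):.1f} GiB")
```

The float64 figure comes out at roughly 710 GiB, matching the message. That is for this one array alone, so a 780GB allocation leaves little room for the input data and any other copies.
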
ghuls · 2024-09-19T12:29:07Z

Not yet. Over the last few weeks, other projects had higher priority.

yojetsharma · 2024-10-11T11:29:12Z

Would reading the data in chunks help for now? Or downsampling the number of cells before running this step?
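
To make the chunking idea concrete: per-region statistics such as mean and variance only need one block of regions in memory at a time. The sketch below is a generic NumPy illustration, not pycisTopic's implementation. `chunked_row_stats` is a hypothetical helper, and it assumes a regions x cells matrix that supports row slicing (for example a `numpy.memmap`). Whether this helps in practice depends on where the library allocates its dense array.

```python
import numpy as np


def chunked_row_stats(matrix, chunk_size=10_000):
    """Compute per-row mean and sample variance one block of rows at a time."""
    n_rows = matrix.shape[0]
    means = np.empty(n_rows, dtype=np.float64)
    variances = np.empty(n_rows, dtype=np.float64)
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Only this block is converted to float64, never the full matrix.
        block = np.asarray(matrix[start:stop], dtype=np.float64)
        means[start:stop] = block.mean(axis=1)
        variances[start:stop] = block.var(axis=1, ddof=1)
    return means, variances


# Toy regions x cells matrix standing in for the real data.
rng = np.random.default_rng(0)
toy = rng.random((2_500, 40), dtype=np.float32)
means, variances = chunked_row_stats(toy, chunk_size=1_000)
print(means.shape, variances.shape)
```

Peak extra memory is about `chunk_size * n_cells * 8` bytes for the block, plus two float64 vectors with one entry per region.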

ghuls · 2024-10-14T12:49:34Z

> Would reading the data in chunks help for now? Or downsampling the number of cells before running this step?

Downsampling the number of cells would help.
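
One way to pick the cells to keep is a reproducible random sample of barcodes. This is a minimal sketch: `downsample_cells` and the toy barcodes are hypothetical, and the thread does not say how the selection is applied to the pycisTopic object, so that step is left out.

```python
import numpy as np


def downsample_cells(cell_names, n_keep, seed=0):
    """Pick n_keep cell names at random without replacement, keeping their original order."""
    cell_names = np.asarray(cell_names)
    if n_keep >= cell_names.size:
        return cell_names.tolist()
    rng = np.random.default_rng(seed)
    keep_idx = np.sort(rng.choice(cell_names.size, size=n_keep, replace=False))
    return cell_names[keep_idx].tolist()


# Toy example: 45,000 made-up barcodes reduced to 26,000, roughly the size
# of the dataset that ran without problems in the original report.
all_cells = [f"cell_{i}" for i in range(45_000)]
kept = downsample_cells(all_cells, n_keep=26_000, seed=42)
print(len(kept))
```

A uniform random sample can thin out rare cell types. Sampling within each cluster or cell type instead is a possible variation when those populations matter.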

yojetsharma · 2024-10-14T12:52:58Z

I managed to run it as a job on the HPC, though, without having to downsample :) Thank you for your response!

ghuls · 2024-11-06T16:21:19Z

I finally had some time to work on it. In theory, it should now even be possible to run it on a laptop:
#179 (comment)
