1Tb memory is insufficient for normalize_scores #179

Open
JinKyu-Cheong opened this issue Oct 19, 2024 · 8 comments

Comments

@JinKyu-Cheong

Hi

I tried to run the Snakemake pipeline without cell type DAR BED files, but it didn't work. To generate the cell type DARs, I needed to run impute_accessibility and then run normalize_scores on imputed_acc_obj. However, normalize_scores keeps failing due to insufficient memory. I tried with a maximum of 1050 GB, but it still does not work.
My data has 271,624 cells and 504,783 regions. Is there a way to run this function, or to skip this step and still compute cell type DARs?

Thank you!

@yojetsharma

What worked for me was reducing the number of cores and requesting more vmem.
Alternatively, you could dump the imputed object to a pickle file and load it again for the normalising step. This is a memory-intensive step.

@JinKyu-Cheong
Author

> What worked for me was reducing the cores and demanding more vmem to get this done. Or you could dump the imputed pkl obj and load it for the normalising step. This is a memory intensive step.

Thank you for your suggestion.
I’ll try using fewer cores. I’ve been requesting 72 cores with 15 GB each. How would I use vmem?

I successfully saved the imputed matrix, but running normalize_scores on the loaded imputed matrix still does not work.

Thank you

@yojetsharma

I ran this on Sun Grid Engine, which has an option for allocating virtual memory (-l h_vmem).
You could also try converting the imputed matrix to a sparse matrix, imputed_obj.mtx = csr_matrix(imputed_obj.mtx) (with csr_matrix from scipy.sparse),
then saving the object as a pickle and loading it again for the normalising step.
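The sparse conversion and pickle round trip suggested above can be sketched as follows. This is only an illustration on a small made-up array: `dense` stands in for the imputed matrix, and the file name is hypothetical. CSR storage only saves memory when most entries are zero, so it is worth comparing the two sizes before relying on it.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for an imputed matrix (assumption: mostly zeros after truncation).
rng = np.random.default_rng(0)
dense = np.floor(rng.random((200, 300)) * 1.2).astype(np.float32)

# Convert to CSR and compare the memory footprint of both representations.
sparse = csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

# Save the sparse matrix as a pickle and load it again in a later step.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "imputed_sparse.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(sparse, fh, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

If the sparse size is not clearly smaller than the dense size, the conversion will not help with the memory limit.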

@yojetsharma

Were you able to solve this? If so, how?

@JinKyu-Cheong
Author

> Were you able to solve this? If so, how?

I downsampled the adata and used its barcodes to subset the cisTopic object. That is how I generated the normalized imputed data and the cell type DARs.
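One way to pick the barcodes for such a downsampling is sketched below. It is not the code used above: it assumes you want to cap the number of cells per cell type so that small cell types stay represented, and the function and argument names are hypothetical.

```python
import random
from collections import defaultdict


def downsample_barcodes(barcode_to_celltype, max_cells_per_type, seed=0):
    """Return barcodes with at most `max_cells_per_type` cells per cell type."""
    rng = random.Random(seed)

    # Group barcodes by their cell type annotation.
    by_type = defaultdict(list)
    for barcode, celltype in barcode_to_celltype.items():
        by_type[celltype].append(barcode)

    kept = []
    for celltype in sorted(by_type):
        barcodes = sorted(by_type[celltype])  # sorted for reproducibility
        if len(barcodes) > max_cells_per_type:
            barcodes = rng.sample(barcodes, max_cells_per_type)
        kept.extend(barcodes)
    return kept


# Toy annotation: 100 cells of type "A" and 5 cells of type "B".
annotation = {f"A_{i}": "A" for i in range(100)}
annotation.update({f"B_{i}": "B" for i in range(5)})
selected = downsample_barcodes(annotation, max_cells_per_type=10)
print(len(selected))
```

The selected barcodes can then be used to subset both the adata and the cisTopic object with whatever subsetting method each object provides.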

@ghuls
Member

ghuls commented Nov 6, 2024

Good news for you all.

I finally had some time to rewrite the old code and it is a lot more efficient and would be able to run even on a modest computer.

Maximum memory usage in GiB is now determined by `(chunk_size * n_cells * 4) / 1024^3`, no matter how many regions you have.

Getting highly variable regions only requires region_topic (as a numpy float32 array), cell_topic (as a numpy float32 array) and the region names from the pycisTopic object. Imputed accessibility and normalized imputed accessibility are now calculated internally in chunks.
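To illustrate the chunking idea, here is a minimal sketch, not the actual pycisTopic code. It follows the steps named in the log output further down (impute a chunk, scale, keep the integer part, normalize by the per-cell total, log-transform, compute per-region statistics). The function name is hypothetical, dispersion is assumed to be variance divided by mean, the filtering of all-zero regions is left out, and every cell is assumed to have a non-zero total.

```python
import numpy as np


def chunked_region_stats(region_topic, cell_topic, chunk_size=1000,
                         scale_factor1=10**6, scale_factor2=10**4):
    """Per-region mean and dispersion, materializing one region chunk at a time."""
    n_regions = region_topic.shape[0]
    n_cells = cell_topic.shape[1]

    def imputed_chunk(start):
        # Imputed accessibility for one block of regions (regions x cells).
        block = region_topic[start:start + chunk_size] @ cell_topic
        block *= scale_factor1
        np.floor(block, out=block)  # only keep the integer part
        return block

    # Pass 1: total imputed accessibility per cell, summed over all chunks.
    total_per_cell = np.zeros(n_cells, dtype=np.float64)
    for start in range(0, n_regions, chunk_size):
        total_per_cell += imputed_chunk(start).sum(axis=0, dtype=np.float64)
    total_per_cell /= scale_factor2

    # Pass 2: recompute each chunk, normalize it and reduce it to two vectors.
    mean = np.empty(n_regions, dtype=np.float64)
    dispersion = np.empty(n_regions, dtype=np.float64)
    for start in range(0, n_regions, chunk_size):
        block = imputed_chunk(start)
        stop = start + block.shape[0]
        block /= total_per_cell      # broadcasts over the cell axis
        np.log1p(block, out=block)   # pseudocount of 1, then log
        mean[start:stop] = block.mean(axis=1, dtype=np.float64)
        var = block.var(axis=1, dtype=np.float64)
        # Assumed definition: dispersion = variance / mean (NaN if mean is 0).
        with np.errstate(divide="ignore", invalid="ignore"):
            dispersion[start:stop] = var / mean[start:stop]
    return mean, dispersion
```

Each chunk is computed twice (once per pass) instead of being stored, which trades extra CPU time for a peak memory that grows with chunk_size * n_cells rather than n_regions * n_cells.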

For example, for a very big dataset we have (1292285 regions x 1483207 cells), just one copy of the imputed accessibility matrix would require about 7,140 GiB (roughly 6.97 TiB) of memory at 4 bytes per value (and the old code would need this more than once).

With the new code and chunk_size=20000 we only needed 111GiB:

CPU times: user 5h 13min 20s, sys: 1h 27min, total: 6h 40min 20s
Wall time: 2h 36min 59s

With the new code and chunk_size=1000 we only needed 5.5GiB.
With the new code and chunk_size=500 we only needed 2.7GiB.

For benchmarking only 40000 regions were used, but more regions would scale linearly (just would result in more chunks to process).
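The memory formula above can be turned into two small helpers, one to estimate the peak for a given chunk size and one to pick a chunk size for a memory budget. The function names are hypothetical, and the estimate only covers the formula itself, not any other memory the process uses.

```python
def peak_chunk_memory_gib(chunk_size, n_cells, bytes_per_value=4):
    """Estimated peak memory in GiB: chunk_size * n_cells * 4 / 1024^3."""
    return chunk_size * n_cells * bytes_per_value / 1024**3


def largest_chunk_size(memory_budget_gib, n_cells, bytes_per_value=4):
    """Largest chunk_size whose estimated peak fits in the given budget."""
    return int(memory_budget_gib * 1024**3 // (n_cells * bytes_per_value))


# The benchmark dataset above has 1483207 cells.
for chunk_size in (20000, 1000, 500):
    gib = peak_chunk_memory_gib(chunk_size, n_cells=1483207)
    print(f"chunk_size={chunk_size}: {gib:.2f} GiB")
```

For the 271,624 cells reported at the top of this issue, the same formula gives about 20.2 GiB for chunk_size=20000.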

# Memory usage of full code (only 2 chunk iterations for testing (40000 regions))
40000 regions and 1483207 cells
(randomly generated, no pycistopic object loaded).

import numpy as np

# Random float32 stand-ins for the topic matrices (80 topics).
region_topic = np.random.rand(40000*80).astype(np.float32).reshape(40000, 80)
cell_topic = np.random.rand(80*1483207).astype(np.float32).reshape(80, 1483207)

# Running with only 20000 regions in one chunk ==> maximum 111G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
   ...: (
   ...:         impute_acc_per_region_mean,
   ...:         impute_acc_per_region_dispersion,
   ...:         region_names_to_keep,
   ...:         region_idx_to_keep,
   ...: ) = impute_acc_normalized_stats_trial8(
   ...:     region_topic=region_topic,
   ...:     cell_topic=cell_topic,
   ...:     region_names=[f"reg{i}" for i in range(1,40001)],
   ...:     scale_factor1 = 10**6,
   ...:     scale_factor2 = 10**4,
   ...:     chunk_size=20000,
   ...: )
   ...: 
   ...: 
region_topic shape (40000, 80)
2024-11-06 12:02:28,580 cisTopic     INFO     Calculate total imputed accessibility per cell.
2024-11-06 12:02:28,580 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 0-20000 (out of 40000).
2024-11-06 12:02:28,580 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:02:39,095 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:02:49,811 cisTopic     INFO       - Only keep integer part.
2024-11-06 12:03:00,834 cisTopic     INFO       - Get non-zero regions.
2024-11-06 12:03:00,893 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 12:03:36,336 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 20000-40000 (out of 40000).
2024-11-06 12:03:36,336 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:03:46,828 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:03:57,408 cisTopic     INFO       - Only keep integer part.
2024-11-06 12:04:08,164 cisTopic     INFO       - Get non-zero regions.
2024-11-06 12:04:08,229 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 12:04:43,070 cisTopic     INFO     Keeping 40000 of 40000 (non_zero) regions.
region_idx_to_keep shape before removing non_zeros (40000,)
region_idx_to_keep shape after removing non_zeros (40000,)
region_topic shape after removing non_zeros (40000, 80)
2024-11-06 12:04:43,076 cisTopic     INFO     Scale total imputed accessibility per cell by dividing by 10000.
2024-11-06 12:04:43,080 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility per region.
2024-11-06 12:04:43,080 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 0-20000 (out of 40000).
2024-11-06 12:04:43,080 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:04:53,547 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:05:04,296 cisTopic     INFO       - Only keep integer part.
2024-11-06 12:05:15,042 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 12:05:30,262 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 12:05:57,162 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 12:06:06,830 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 20000-40000 (out of 40000).
2024-11-06 12:06:06,830 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:06:17,222 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:06:27,923 cisTopic     INFO       - Only keep integer part.
2024-11-06 12:06:38,633 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 12:06:53,764 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 12:07:20,275 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 12:07:29,441 cisTopic     INFO     Finished Calculating  mean and dispersion of imputed accessibility per region.
CPU times: user 11min 47s, sys: 2min 22s, total: 14min 10s
Wall time: 5min

# Running with only 1000 regions in one chunk ==> maximum 5.5G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
   ...: (
   ...:         impute_acc_per_region_mean,
   ...:         impute_acc_per_region_dispersion,
   ...:         region_names_to_keep,
   ...:         region_idx_to_keep,
   ...: ) = impute_acc_normalized_stats_trial8(
   ...:     region_topic=region_topic,
   ...:     cell_topic=cell_topic,
   ...:     region_names=[f"reg{i}" for i in range(1,40001)],
   ...:     scale_factor1 = 10**6,
   ...:     scale_factor2 = 10**4,
   ...:     chunk_size=1000,
   ...: )
   ...: 
   ...: 
...
CPU times: user 12min 22s, sys: 3min 4s, total: 15min 26s
Wall time: 4min 54s


# Running with only 500 regions in one chunk ==> maximum 2.76G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
   ...: (
   ...:         impute_acc_per_region_mean,
   ...:         impute_acc_per_region_dispersion,
   ...:         region_names_to_keep,
   ...:         region_idx_to_keep,
   ...: ) = impute_acc_normalized_stats_trial8(
   ...:     region_topic=region_topic,
   ...:     cell_topic=cell_topic,
   ...:     region_names=[f"reg{i}" for i in range(1,40001)],
   ...:     scale_factor1 = 10**6,
   ...:     scale_factor2 = 10**4,
   ...:     chunk_size=500,
   ...: )
   ...: 
   ...: 
...
2024-11-06 14:14:41,916 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 39000-39500 (out of 40000).
2024-11-06 14:14:41,916 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:14:42,228 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:14:42,501 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:14:42,770 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:14:43,155 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:14:43,710 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:14:43,941 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 39500-40000 (out of 40000).
2024-11-06 14:14:43,941 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:14:44,253 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:14:44,527 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:14:44,797 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:14:45,183 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:14:45,738 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:14:45,970 cisTopic     INFO     Finished Calculating  mean and dispersion of imputed accessibility per region.
CPU times: user 13min 24s, sys: 3min 58s, total: 17min 22s
Wall time: 5min 1s

@ghuls
Member

ghuls commented Nov 7, 2024

You can find it now in the polars_1xx branch:
2fa4464#diff-b1472e8f1bcc609e65e2db670ba21144389567776cf0a1089ecf4e0c5749fa22R1024-R1050

@ghuls
Member

ghuls commented Dec 12, 2024

@JinKyu-Cheong @yojetsharma Running impute_accessibility and normalize_scores is no longer necessary.

See: #195 (comment) for instructions.
