1Tb memory is insufficient for normalize_scores #179
Comments
What worked for me was reducing the number of cores and requesting more vmem.
Thank you for your suggestion. I successfully saved the imputed matrix, but running normalize_scores on the loaded imputed matrix still does not work. Thank you.
I ran this on Sun Grid Engine, which has an option for allocating virtual memory (`-l h_vmem`).
Were you able to solve this? If so, how?
I downsampled the adata and used its barcodes to subset the cistopic object. That is how I generated the normalized imputed data and the celltype DARs.
Good news for you all. I finally had some time to rewrite the old code; it is a lot more efficient and should run even on a modest computer. Maximum memory usage in GiB is now determined by `(chunk_size * n_cells * 4) / 1024^3`, no matter how many regions you have. Getting highly variable regions only requires
For example, for a very big dataset we have (1292285 regions x 1483207 cells), just one copy of the imputed accessibility matrix would require
With the new code and
For benchmarking, only 40000 regions were used, but more regions would scale linearly (it would just result in more chunks to process).
# Memory usage of full code (only 2 chunk iterations for testing (40000 regions))
40000 regions and 1483207 cells
(randomly generated, no pycistopic object loaded).
region_topic = np.random.rand(40000*80).astype(np.float32).reshape(40000, 80)
cell_topic = np.random.rand(80*1483207).astype(np.float32).reshape(80, 1483207)
# Running with only 20000 regions in one chunk ==> maximum 111G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
...: (
...: impute_acc_per_region_mean,
...: impute_acc_per_region_dispersion,
...: region_names_to_keep,
...: region_idx_to_keep,
...: ) = impute_acc_normalized_stats_trial8(
...: region_topic=region_topic,
...: cell_topic=cell_topic,
...: region_names=[f"reg{i}" for i in range(1,40001)],
...: scale_factor1 = 10**6,
...: scale_factor2 = 10**4,
...: chunk_size=20000,
...: )
...:
...:
region_topic shape (40000, 80)
2024-11-06 12:02:28,580 cisTopic INFO Calculate total imputed accessibility per cell.
2024-11-06 12:02:28,580 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 0-20000 (out of 40000).
2024-11-06 12:02:28,580 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:02:39,095 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:02:49,811 cisTopic INFO - Only keep integer part.
2024-11-06 12:03:00,834 cisTopic INFO - Get non-zero regions.
2024-11-06 12:03:00,893 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 12:03:36,336 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 20000-40000 (out of 40000).
2024-11-06 12:03:36,336 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:03:46,828 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:03:57,408 cisTopic INFO - Only keep integer part.
2024-11-06 12:04:08,164 cisTopic INFO - Get non-zero regions.
2024-11-06 12:04:08,229 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 12:04:43,070 cisTopic INFO Keeping 40000 of 40000 (non_zero) regions.
region_idx_to_keep shape before removing non_zeros (40000,)
region_idx_to_keep shape after removing non_zeros (40000,)
region_topic shape after removing non_zeros (40000, 80)
2024-11-06 12:04:43,076 cisTopic INFO Scale total imputed accessibility per cell by dividing by 10000.
2024-11-06 12:04:43,080 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility per region.
2024-11-06 12:04:43,080 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 0-20000 (out of 40000).
2024-11-06 12:04:43,080 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:04:53,547 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:05:04,296 cisTopic INFO - Only keep integer part.
2024-11-06 12:05:15,042 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 12:05:30,262 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 12:05:57,162 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 12:06:06,830 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 20000-40000 (out of 40000).
2024-11-06 12:06:06,830 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 12:06:17,222 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 12:06:27,923 cisTopic INFO - Only keep integer part.
2024-11-06 12:06:38,633 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 12:06:53,764 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 12:07:20,275 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 12:07:29,441 cisTopic INFO Finished Calculating mean and dispersion of imputed accessibility per region.
CPU times: user 11min 47s, sys: 2min 22s, total: 14min 10s
Wall time: 5min
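The log messages above suggest a two-pass scheme over region chunks: a first pass that accumulates the total imputed accessibility per cell, and a second pass that normalizes each chunk and computes per-region statistics. A minimal NumPy sketch of that idea follows. It is not the pycisTopic implementation: the function name `chunked_region_stats` is hypothetical, and taking dispersion as variance divided by mean is an assumption.

```python
import numpy as np


def chunked_region_stats(region_topic, cell_topic, chunk_size,
                         scale_factor1=10**6, scale_factor2=10**4):
    """Hypothetical two-pass sketch; not the pycisTopic implementation."""
    n_regions, n_cells = region_topic.shape[0], cell_topic.shape[1]

    def imputed_chunk(start, stop):
        # Only this (stop - start) x n_cells block is held in memory at once.
        chunk = region_topic[start:stop] @ cell_topic
        chunk *= scale_factor1        # scale the imputed values
        np.trunc(chunk, out=chunk)    # only keep the integer part
        return chunk

    # Pass 1: per-cell totals and non-zero regions, accumulated chunk by chunk.
    total_per_cell = np.zeros(n_cells, dtype=np.float64)
    nonzero = np.zeros(n_regions, dtype=bool)
    for start in range(0, n_regions, chunk_size):
        stop = min(start + chunk_size, n_regions)
        chunk = imputed_chunk(start, stop)
        nonzero[start:stop] = chunk.any(axis=1)
        total_per_cell += chunk.sum(axis=0, dtype=np.float64)

    # Assumption: every cell has a non-zero total (otherwise this divides by 0).
    dtype = np.result_type(region_topic, cell_topic)
    size_factor = (total_per_cell / scale_factor2).astype(dtype)

    # Pass 2: recompute each chunk, normalize, log-transform, summarize.
    mean = np.zeros(n_regions, dtype=np.float64)
    dispersion = np.zeros(n_regions, dtype=np.float64)
    for start in range(0, n_regions, chunk_size):
        stop = min(start + chunk_size, n_regions)
        chunk = imputed_chunk(start, stop)
        chunk /= size_factor          # broadcasts over the cell axis
        np.log1p(chunk, out=chunk)    # pseudocount of 1, then log
        m = chunk.mean(axis=1, dtype=np.float64)
        v = chunk.var(axis=1, dtype=np.float64)
        mean[start:stop] = m
        # Assumption: dispersion is taken as variance / mean.
        dispersion[start:stop] = np.divide(v, m, out=np.zeros_like(v), where=m > 0)

    keep = np.flatnonzero(nonzero)
    return mean[keep], dispersion[keep], keep
```

Recomputing each chunk in the second pass trades extra CPU time for memory: the full regions x cells matrix is never materialized, so peak usage depends on `chunk_size` and the number of cells rather than on the number of regions.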
# Running with only 1000 regions in one chunk ==> maximum 5.5G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
...: (
...: impute_acc_per_region_mean,
...: impute_acc_per_region_dispersion,
...: region_names_to_keep,
...: region_idx_to_keep,
...: ) = impute_acc_normalized_stats_trial8(
...: region_topic=region_topic,
...: cell_topic=cell_topic,
...: region_names=[f"reg{i}" for i in range(1,40001)],
...: scale_factor1 = 10**6,
...: scale_factor2 = 10**4,
...: chunk_size=1000,
...: )
...:
...:
...
CPU times: user 12min 22s, sys: 3min 4s, total: 15min 26s
Wall time: 4min 54s
# Running with only 500 regions in one chunk ==> maximum 2.76G (= chunk_size * n_cells *4 / 1024^3):
In [4]: %%time
...: (
...: impute_acc_per_region_mean,
...: impute_acc_per_region_dispersion,
...: region_names_to_keep,
...: region_idx_to_keep,
...: ) = impute_acc_normalized_stats_trial8(
...: region_topic=region_topic,
...: cell_topic=cell_topic,
...: region_names=[f"reg{i}" for i in range(1,40001)],
...: scale_factor1 = 10**6,
...: scale_factor2 = 10**4,
...: chunk_size=500,
...: )
...:
...:
...
2024-11-06 14:14:41,916 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 39000-39500 (out of 40000).
2024-11-06 14:14:41,916 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:14:42,228 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:14:42,501 cisTopic INFO - Only keep integer part.
2024-11-06 14:14:42,770 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:14:43,155 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:14:43,710 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:14:43,941 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 39500-40000 (out of 40000).
2024-11-06 14:14:43,941 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:14:44,253 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:14:44,527 cisTopic INFO - Only keep integer part.
2024-11-06 14:14:44,797 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:14:45,183 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:14:45,738 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:14:45,970 cisTopic INFO Finished Calculating mean and dispersion of imputed accessibility per region.
CPU times: user 13min 24s, sys: 3min 58s, total: 17min 22s
Wall time: 5min 1s
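The memory formula quoted in this comment, `(chunk_size * n_cells * 4) / 1024^3`, can be checked with a few lines of Python. The helper name `chunk_memory_gib` is hypothetical, and the factor 4 is assumed to be the size in bytes of one float32 value, matching the `np.float32` arrays used in the benchmark.

```python
def chunk_memory_gib(chunk_size, n_cells, bytes_per_value=4):
    """GiB needed for one dense chunk of chunk_size regions x n_cells cells."""
    return chunk_size * n_cells * bytes_per_value / 1024**3


n_cells = 1483207  # number of cells in the benchmark above

# Chunk sizes used in the three benchmark runs.
for chunk_size in (20000, 1000, 500):
    print(f"chunk_size={chunk_size}: {chunk_memory_gib(chunk_size, n_cells):.2f} GiB")

# Passing the full region count as chunk_size gives the size of one dense copy
# of the whole imputed accessibility matrix.
print(f"full matrix: {chunk_memory_gib(1292285, n_cells):.0f} GiB")
```

For the three chunk sizes this gives values that round to the 111G, 5.5G and 2.76G quoted in the run headers. The same helper can be used to pick a `chunk_size` that fits a given machine before starting a run.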
You can find it now in the
@JinKyu-Cheong @yojetsharma Running See: #195 (comment) for instructions.
Hi,
I tried to run the Snakemake pipeline without celltype DAR bed files, but it didn't work. To generate celltype DARs, I needed to run impute_accessibility and then run normalize_scores on imputed_acc_obj. However, normalize_scores keeps failing due to insufficient memory. I tried with a maximum of 1050 GB, but it still does not work.
My data has 271624 cells and 504783 regions. Is there a way to run this function, or to skip this step and still compute celltype DARs?
Thank you!