Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pycisTopic find_diff_features function stopping abruptly at Starting a local Ray instance #195

Open
ptk1601 opened this issue Dec 3, 2024 · 4 comments

Comments

@ptk1601
Copy link

ptk1601 commented Dec 3, 2024

Hello!

I am having issues with the find_diff_features function in the pycisTopic library as a part of the SCENIC+ workflow. I have not had any issues generating the previous, required dataframes in the workflow according to the live lecture aside from perhaps a few NaNs in the Seurat_cell_type column of my cistopic_obj. But for some reason when I run find_diff_features, Jupyter Notebook tells me that a local Ray instance has started but then cuts off right there without any additional processes, and I am left with an empty markers_dict. Is there something I can check to fix this? Here is the code I am running.

import ray

markers_dict= find_diff_features(
cistopic_obj,
imputed_acc_obj,
variable='Seurat_cell_type',
var_features=variable_regions,
contrasts=None,
adjpval_thr=0.05,
log2fc_thr=np.log2(1.5),
n_cpu=5,
_temp_dir='/tmp',
split_pattern = None
)

@SeppeDeWinter
Copy link
Collaborator

Hi @ptk1601

Can you check if you are running out of memory perhaps?
@ghuls has also been updating this code. It should run much smoother with his new version.

Best,

Seppe

@ptk1601
Copy link
Author

ptk1601 commented Dec 5, 2024

Hi! I actually managed to figure this one out, turns out there was a discrepancy between barcode names that was causing this issue!

@JulieDeMan
Copy link

Hi @ptk1601,

Here's an example on finding variable features with the updated code from @ghuls. This code is faster, uses less memory and handles the imputed accessibility differently. The code for identifying differentially accessible regions (DARs) between groups is not compatible yet with this new implementation, but will be updated soon.

For now, it means that for DAR calculation, the original impute_accessibility function is still required. This function should only use the calculated variable regions, which already saves quite some memory and time compared to using the full region set.

EXAMPLE CODE

Requirements: https://github.com/aertslab/pycisTopic/tree/polars_1xx

## load object and initialise variables
from pycisTopic.cistopic_class import *
from pycisTopic.diff_features import *
import pickle

infile = open('cistopic_object.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()

topic_region_ = cistopic_obj.selected_model.topic_region.to_numpy(dtype=np.float32)
cell_topic_ = cistopic_obj.selected_model.cell_topic.to_numpy(dtype=np.float32)
regions = cistopic_obj.selected_model.topic_region.index.tolist()

## Get mean and dispersion of normalized imputed accessibility per region.
(region_names_to_keep,
per_region_means_on_normalized_imputed_acc,
per_region_dispersions_on_normalized_imputed_acc,) = 
calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc(
          region_topic=topic_region_,
          cell_topic=cell_topic_,
          region_names=regions,
          scale_factor1 = 10**6,
          scale_factor2 = 10**4,
          regions_chunk_size=20000,)

## Optional: save output
import numpy as np
import pandas as pd
np.savez_compressed("impute_acc_per_region_mean.npz", arr=impute_acc_per_region_mean)
np.savez_compressed("impute_acc_per_region_dispersion.npz", arr=impute_acc_per_region_dispersion)
np.savez_compressed(" region_idx_to_keep.npz", arr=region_idx_to_keep)
pd.DataFrame(region_names_to_keep, columns=["regions"]).to_csv("region_names_to_keep.csv", index=False)
region_topic shape (1292285, 80)
2024-11-06 13:38:12,600 cisTopic     INFO     Calculate total imputed accessibility per cell.
2024-11-06 13:38:12,600 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 0-20000 (out of 1292285).
2024-11-06 13:38:12,601 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:38:21,890 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:38:33,077 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:38:44,036 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:38:44,242 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:39:19,566 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 20000-40000 (out of 1292285).
2024-11-06 13:39:19,567 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:39:28,928 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:39:40,103 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:39:51,093 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:39:51,146 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:40:25,185 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 40000-60000 (out of 1292285).
2024-11-06 13:40:25,185 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:40:34,535 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:40:45,140 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:40:57,225 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:40:58,244 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:41:32,441 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 60000-80000 (out of 1292285).
2024-11-06 13:41:32,441 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:41:42,069 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:41:53,982 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:42:07,238 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:42:07,416 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:42:43,969 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 80000-100000 (out of 1292285).
2024-11-06 13:42:43,970 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:42:53,047 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:43:03,882 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:43:17,189 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:43:17,828 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:43:56,578 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 100000-120000 (out of 1292285).
2024-11-06 13:43:56,579 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:44:05,636 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:44:16,481 cisTopic     INFO       - Only keep integer part.
2024-11-06 13:44:27,535 cisTopic     INFO       - Get non-zero regions.
2024-11-06 13:44:27,898 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
...
2024-11-06 14:49:23,279 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 1260000-1280000 (out of 1292285).
2024-11-06 14:49:23,280 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:49:31,821 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:49:42,047 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:49:53,865 cisTopic     INFO       - Get non-zero regions.
2024-11-06 14:49:54,500 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 14:50:31,988 cisTopic     INFO     Calculate (partial) imputed accessibility per cell for regions 1280000-1300000 (out of 1292285).
2024-11-06 14:50:31,989 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:50:37,342 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:50:44,149 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:50:52,495 cisTopic     INFO       - Get non-zero regions.
2024-11-06 14:50:53,017 cisTopic     INFO       - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 14:51:13,964 cisTopic     INFO     Keeping 1290880 of 1292285 (non_zero) regions.
region_idx_to_keep shape before removing non_zeros (1292285,)
region_idx_to_keep shape after removing non_zeros (1290880,)
region_topic shape after removing non_zeros (1290880, 80)
2024-11-06 14:51:14,135 cisTopic     INFO     Scale total imputed accessibility per cell by dividing by 10000.
2024-11-06 14:51:14,140 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility per region.
2024-11-06 14:51:14,140 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 0-20000 (out of 1290880).
2024-11-06 14:51:14,141 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:51:22,786 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:51:33,128 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:51:44,702 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:52:00,313 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:52:17,304 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:52:26,246 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 20000-40000 (out of 1290880).
2024-11-06 14:52:26,246 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:52:34,903 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:52:45,269 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:52:56,430 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:53:12,427 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:53:31,357 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:53:39,282 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 40000-60000 (out of 1290880).
2024-11-06 14:53:39,283 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:53:47,914 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:53:58,282 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:54:09,453 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:54:25,709 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:54:44,669 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:54:52,616 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 60000-80000 (out of 1290880).
2024-11-06 14:54:52,618 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:55:01,233 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:55:12,710 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:55:23,373 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:55:38,710 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:55:56,254 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:56:03,917 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 80000-100000 (out of 1290880).
2024-11-06 14:56:03,918 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:56:13,061 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:56:24,124 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:56:38,653 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:56:56,205 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:57:12,180 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:57:20,304 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 100000-120000 (out of 1290880).
2024-11-06 14:57:20,307 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:57:29,095 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:57:40,723 cisTopic     INFO       - Only keep integer part.
2024-11-06 14:57:52,244 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:58:06,778 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:58:24,206 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
..
2024-11-06 16:13:07,266 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 1260000-1280000 (out of 1290880).
2024-11-06 16:13:07,268 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 16:13:16,119 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 16:13:26,693 cisTopic     INFO       - Only keep integer part.
2024-11-06 16:13:40,386 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 16:14:02,385 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 16:14:19,737 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 16:14:27,796 cisTopic     INFO     Calculate mean and dispersion of normalized imputed accessibility for regions 1280000-1300000 (out of 1290880).
2024-11-06 16:14:27,796 cisTopic     INFO       - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 16:14:32,622 cisTopic     INFO       - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 16:14:38,472 cisTopic     INFO       - Only keep integer part.
2024-11-06 16:14:44,173 cisTopic     INFO       - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 16:14:53,695 cisTopic     INFO       - Add pseudocount of 1 and apply log normalization.
2024-11-06 16:15:07,633 cisTopic     INFO       - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 16:15:11,838 cisTopic     INFO     Finished Calculating  mean and dispersion of imputed accessibility per region.
CPU times: user 5h 13min 20s, sys: 1h 27min, total: 6h 40min 20s
Wall time: 2h 36min 59s
## Find highly variable features.
var_features = find_highly_variable_features(
          features=region_names_to_keep,
          per_region_means_on_normalized_imputed_acc=per_region_means_on_normalized_imputed_acc,
          per_region_dispersions_on_normalized_imputed_acc=per_region_dispersions_on_normalized_imputed_acc,
          min_disp = 0.05,
          min_mean = 0.0125,
          max_disp = float("inf"),
          max_mean = 3,
          n_bins = 20,
          n_top_features = None,
          plot = True,)

## Optional: save regions
pd.DataFrame(var_features, columns=["regions"]).to_csv("var_features.csv", index=False)
2024-11-06 16:29:23,756 cisTopic     INFO     Done!

Calculate DARs (old code still, WIP):

## read data
infile = open('cistopic_object.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()

var_features_path = "var_features.csv"
var_features_df = pd.read_csv(var_features_path)
var_features = var_features_df['regions'].tolist()

## use old impute_accessibility function with variable features only
## note that this step may still require quite some resources
## this is only an intermediate solution and will be updated soon 
imputed_acc_obj = impute_accessibility(
          cistopic_obj,
          selected_cells=None,
          selected_regions=var_features,
          scale_factor=10**6
)

## calculate DARs
markers_dict= find_diff_features(
          cistopic_obj,
          imputed_acc_obj,
          variable='cell_type',
          var_features=var_features,
          contrasts = None,
          adjpval_thr=0.05,
          log2fc_thr=np.log2(1.5),
          n_cpu=1,
          _temp_dir='tmp_dir'
)

## Optional: save DARs
with open('DARs.pkl', 'wb') as f:
          pickle.dump(markers_dict, f)

@genomic-coder
Copy link

Hi,
I am new to this, How do i install https://github.com/aertslab/pycisTopic/tree/polars_1xx?

I understand its a branch of pycistopic, but how do i access this branch when i have the latest main branch of scenic+ installed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants