Incorrect preparation of non-multiome data causes error in the step 'Calculating region to gene importance, using GBM method' #371

AthanasiaSt · 2024-04-25T16:46:38Z

Hello,

Thank you for developing scenicplus and for its helpful documentation.

I am trying to use scenicplus on non-multiome data that I previously integrated using ArchR. Following the corresponding tutorials for scRNA and scATAC preprocessing, I reached the point of running scenicplus through the Snakefile, after adjusting the config.yaml file accordingly. Most importantly, I made sure that the anndata and cistopic objects both contain a variable named 'ACC:RNA_barcodes' holding the same cell_names, based on the integration of the two modalities (5956 cells in total).

The pipeline is progressing smoothly until it reaches the point of calculating the region to gene importance, when it gives out the following error:

2024-04-25 16:01:14,238 SCENIC+      INFO     Reading search space
2024-04-25 16:01:14,741 R2G          INFO     Calculating region to gene importances, using GBM method
Running using 20 cores:   1%|▌                                          | 180/12438 [00:06<04:14, 48.10it/s]Traceback (most recent call last):
  File "/home/astavropoulou/anaconda3/envs/scenicplus/bin/scenicplus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
    args.func(args)
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 328, in TF_to_gene
    infer_region_to_gene(
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 501, in infer_region_to_gene
    adj = calculate_regions_to_genes_relationships(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/enhancer_to_gene.py", line 261, in calculate_regions_to_genes_relationships
    region_to_gene_importances = _score_regions_to_genes(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/scenicplus/enhancer_to_gene.py", line 219, in _score_regions_to_genes
    joblib.Parallel(
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/astavropoulou/anaconda3/envs/scenicplus/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
ValueError: Found array with 0 sample(s) (shape=(0, 121)) while a minimum of 1 is required by GradientBoostingRegressor.

While trying to figure out what went wrong, I realized that the resulting ACC_GEX.h5mu file, which should contain the two modalities, is not prepared correctly: it seems to lack both the cell names and the expression/fragment matrices, as shown here:


MuData object with n_obs × n_vars = 0 × 316886 backed at '/home/astavropoulou/scenicplus_final/scplus_pipeline/Snakemake/ACC_GEX.h5mu'
  2 modalities
    scRNA:	0 x 16049
      obs:	'ACC:RNA_barcodes'
      var:	'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    scATAC:	0 x 300837
      obs:	'ACC:RNA_barcodes'
      var:	'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'

No error occurred during the step of preparing the non-multiome data, as all cells were found in both modalities, but there is a suspicious output during the procedure of ingestion as shown here:

[Thu Apr 25 15:56:12 2024]
Finished job 8.
2 of 14 steps (14%) done
Select jobs to execute...
2024-04-25 15:56:17,072 Ingesting non-multiome data INFO     Automatically set `nr_metacells` to: AKP_APC_AOMDSS_cnt_aom_AAACCCAGTCCACGCA-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACCCATCAAGCTTG-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGAACACACCAGC-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGAACATGACTGT-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGAAGTCCCACGA-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGAAGTGACGCCT-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGAATCGTACCTC-1: 0, AKP_APC_AOMDSS_cnt_aom_AAACGCTTCAAGTGGG-1: 0, AKP_APC_AOMDSS_cnt_aom_AAAGGATAGCCTGCCA-1: 0, AKP_APC_AOMDSS_cnt_aom_AAAGGATAGTACTCGT-1: 0, AKP_APC_AOMDSS_cnt_aom_AAAGGATAGTTAACGA-1: 0, AKP_APC_AOMDSS_cnt_aom_AAAGGGCAGCTGGCTC-1: 0, AKP_APC_AOMDSS_cnt_aom_AAAGGGCAGGGAGAAT-1: 0....

Do the zeros next to the cell names mean that no metacells and no pseudo multi-ome data are created?

I tried to run the pipeline with slight modifications in the anndata and cistopic objects but could not figure out the problem.

Do you maybe have any idea on why this problem comes up?
Also, could you describe in more detail how to prepare non-multiome data for scenicplus? Should the anndata and cistopic objects have exactly the same cell_names in their matrices, or is a single variable/column shared by the two modalities enough to combine them?

Thank you!

I am using:
Python version: Python 3.11.8
scenicplus version: 1.0a1

SeppeDeWinter · 2024-04-29T16:11:27Z

Maintainer

Hi @AthanasiaSt

For non-multiome data, the cell barcodes of the RNA and ATAC modalities do not match. For that reason, we integrate the samples based on shared cell type (or state) labels. This label is the variable you should provide in the yaml file (where you provided ACC:RNA_barcodes). Based on these labels, multiome data is simulated by sampling cells from each label for each modality.
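
As a rough illustration of the label-based sampling described above, here is a minimal sketch in plain Python. It is not the scenicplus implementation: the function names, the rounding rule used to pick the number of metacells per label, and the toy barcodes are all assumptions made for this example. It also suggests why a label that is unique to each cell (such as a barcode) can end up with zero metacells, since every group then contains a single cell.

```python
import random
from collections import defaultdict


def group_by_label(cell_to_label):
    """Map each label to the sorted list of barcodes carrying it."""
    groups = defaultdict(list)
    for barcode, label in cell_to_label.items():
        groups[label].append(barcode)
    return {label: sorted(cells) for label, cells in groups.items()}


def simulate_metacells(rna_labels, atac_labels, cells_per_metacell=10, seed=0):
    """Pair randomly sampled RNA and ATAC cells that share a label (hypothetical helper)."""
    rng = random.Random(seed)
    rna_groups = group_by_label(rna_labels)
    atac_groups = group_by_label(atac_labels)
    metacells = []
    for label in sorted(set(rna_groups) & set(atac_groups)):
        rna_cells, atac_cells = rna_groups[label], atac_groups[label]
        # Assumed rule: smaller group size divided by cells per metacell, rounded.
        n_metacells = round(min(len(rna_cells), len(atac_cells)) / cells_per_metacell)
        for i in range(n_metacells):
            metacells.append({
                "name": f"{label}_{i}",
                "rna": rng.sample(rna_cells, min(cells_per_metacell, len(rna_cells))),
                "atac": rng.sample(atac_cells, min(cells_per_metacell, len(atac_cells))),
            })
    return metacells


# Toy data with cell type labels: 30 RNA and 20 ATAC cells per type.
rna = {f"rna_{i}": ("T_cell" if i < 30 else "B_cell") for i in range(60)}
atac = {f"atac_{i}": ("T_cell" if i < 20 else "B_cell") for i in range(40)}
by_type = simulate_metacells(rna, atac)

# Toy data where the label is unique per cell, as with a barcode column.
rna_unique = {f"cell_{i}": f"cell_{i}" for i in range(60)}
atac_unique = {f"cell_{i}": f"cell_{i}" for i in range(60)}
by_barcode = simulate_metacells(rna_unique, atac_unique)

print(len(by_type), len(by_barcode))
```

With cell type labels each group is large enough to yield metacells, whereas with per-cell labels every group has size one and the rounded count is zero under the assumed rule.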

If you did the integration elsewhere and have one-to-one cell barcode matches, as I believe is the case for you (?), you can run the analysis as if the data were multiome. In that case, however, the cell barcodes of both modalities must have exactly the same names.
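
Before running in multiome mode, it may be worth checking that the barcodes really are identical in both objects. The sketch below is one way to do that with plain Python sets. The helper name and the example barcodes are hypothetical; in practice the two lists would presumably come from the anndata object's `obs_names` and the cistopic object's `cell_names`.

```python
def compare_barcodes(rna_barcodes, atac_barcodes):
    """Report which barcodes are shared and which occur in only one modality."""
    rna, atac = set(rna_barcodes), set(atac_barcodes)
    return {
        "shared": sorted(rna & atac),
        "rna_only": sorted(rna - atac),
        "atac_only": sorted(atac - rna),
    }


# Hypothetical barcodes: the third one differs only in its separator character.
rna_barcodes = ["sampleA_AAAC-1", "sampleA_AAAG-1", "sampleA_AACC-1"]
atac_barcodes = ["sampleA_AAAC-1", "sampleA_AAAG-1", "sampleA#AACC-1"]

report = compare_barcodes(rna_barcodes, atac_barcodes)
print(len(report["shared"]), report["rna_only"], report["atac_only"])
```

Any entries under `rna_only` or `atac_only` point to naming differences (prefixes, suffixes, separators) that would need to be harmonized first.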

I hope this helps?

All the best,

Seppe

2 replies

AthanasiaSt Apr 30, 2024
Author

Hi @SeppeDeWinter

Thank you very much for your help!

Indeed, I have one-to-one cell matches between the two modalities, in addition to the cell type labels transferred from the RNA to the ATAC, so I got a bit confused about which 'matching labels' to use.

The pipeline finished successfully by using the cell type labels with the non-multiome mode!

Since the dataset I am using contains mixed conditions (e.g. normal vs tumor), would you recommend creating a variable that captures both the cell type and the corresponding condition? Otherwise, cells from different conditions will probably be matched. I guess this is a naive question, but I thought I should ask anyway.

Best,
Athanasia

SeppeDeWinter May 2, 2024
Maintainer

Hi @AthanasiaSt

You are welcome!

"Since in the dataset that I am using there are mixed conditions (e.g. normal vs tumor), would you recommend to create a variable that characterizes both the celltype and the corresponding condition? Otherwise cells that correspond to different conditions will probably be matched."

Yes, that sounds like a good idea. I would do this indeed.
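
A combined label of this kind could be built along the following lines. This is only a sketch using pandas: the DataFrame stands in for the per-cell metadata table of either modality, and the column names `cell_type`, `condition` and `celltype_condition` are hypothetical. The same column, with the same values, would need to be present in both the RNA and the ATAC metadata.

```python
import pandas as pd

# Stand-in for a per-cell metadata table; column names are hypothetical.
obs = pd.DataFrame(
    {
        "cell_type": ["Epithelial", "Epithelial", "T_cell", "T_cell"],
        "condition": ["normal", "tumor", "normal", "tumor"],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Concatenate cell type and condition so that sampling stays within one condition.
obs["celltype_condition"] = (
    obs["cell_type"].astype(str) + "_" + obs["condition"].astype(str)
)

print(obs["celltype_condition"].tolist())
```

The combined column would then be the label given in the yaml file in place of the plain cell type column.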

Best,

Seppe
