adata.raw overwritten with normalized counts in SCENIC+ pipeline #484

Open
cjiang310437 opened this issue Oct 17, 2024 · 1 comment


@cjiang310437

Hello,

First of all, thank you for developing such a powerful and intuitive tool.

Describe the bug
According to the documentation, the RNA count input for the SCENIC+ pipeline should consist of raw counts, which are expected to be stored in the adata.raw slot. However, after following the tutorial's steps, I observed that the adata.raw slot seems to be overwritten with normalized counts instead of retaining the raw counts.

Here are the key details:

  • I confirmed that my raw count matrix was correctly loaded into the AnnData object initially.
  • After running the normalization steps as described in the tutorial, I noticed that adata.raw now contains the normalized data, not the raw counts.
  • This appears to contradict the documentation, which specifies that the adata.raw slot should contain raw counts and that these should be used as input for the SCENIC+ pipeline.
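
One way to check which kind of values a matrix holds, as a minimal sketch: raw counts should be non-negative and integer-valued, whereas log-normalized values generally are not. The helper name `looks_like_raw_counts` is hypothetical and is not part of scanpy or SCENIC+; the two test values are the maxima reported under "Error output" in this issue.

```python
import numpy as np


def looks_like_raw_counts(X) -> bool:
    """Heuristic: True if all values are non-negative integers.

    Hypothetical helper, not part of scanpy or SCENIC+.
    Accepts a dense array or a scipy sparse matrix.
    """
    # Sparse matrices keep their stored values in .data; implicit zeros
    # are valid counts, so checking the stored values is enough.
    values = X.data if hasattr(X, "tocsr") else np.asarray(X)
    values = np.asarray(values, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# The two maxima reported in this issue:
print(looks_like_raw_counts(np.array([[0.0, 1384.0]])))             # True
print(looks_like_raw_counts(np.array([[0.0, 6.714874659931793]])))  # False
```

This is only a heuristic: it tells you whether a matrix could be raw counts, not where it came from.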

Additionally, I tested running the pipeline using both raw and normalized RNA counts, and the results were significantly different. The results generated using normalized counts seem more promising. Could you kindly clarify which input (raw or normalized counts) is appropriate for running the SCENIC+ pipeline? It would also be helpful to understand why one input should be preferred over the other and how this impacts the pipeline results.

I appreciate your guidance and look forward to your response. Thank you again for your continued efforts in developing and maintaining this tool.

To Reproduce

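import scanpy as sc

# adata.X holds the raw counts at this point
adata.raw = adata
print(adata.raw.X.max())  # maximum before normalization
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
print(adata.raw.X.max())  # maximum after normalization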

Error output
1384.0
6.714874659931793

Expected behavior
The contents of adata.raw should be identical before and after normalization.

Screenshots
image

Version:

  • Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
  • Scanpy version: 1.8.2
  • SCENIC+ version: 1.0a1
@Neofita22

First, thank you @SeppeDeWinter and Aertslab for your amazing tool!

I have the same question. I was also comparing my gene expression matrix before and after running the perturbation simulation. I noticed that the data going into the perturbation is already normalized. I had expected that raw data would be used for this analysis as well, and that the function would normalize and/or transform it internally:

raw_data = adata.to_df()
[image: scanpy_gex]
gex_scenic = scplus_mdata["scRNA_counts"].to_df()
[image: base]

The perturbation simulation uses this data as its basis, and the perturbed data, I think, also appears to be already normalized:

simulation_scenic = perturbation_over_iter[5]
[image: simulation]

So I am not sure whether starting the analysis with already normalized data significantly alters the perturbation analysis, or whether it is fine to use it this way. I don't have much experience analyzing data, especially knockouts, so any information would be very useful.
Thank you.
