-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve scanpy compatibility #774
Conversation
Sync changes from `development` into `main`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this all looks good, I just found a few places we might want to add or update checks.
I also think it would be good to test this using the default run IDs that we have in the CCDL profile if you haven't already.
bin/reformat_anndata.py
Outdated
adata.var["highly_variable"] = adata.var.gene_ids.isin( | ||
adata.uns["highly_variable_genes"] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might want to only do this if args.mask_var != "None"
? And does this fail if adata.uns["highly_variable_genes"]
does not exist? If so, we should add a check for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct, as I realized when looking at the first comment!
move_counts_anndata.py --anndata_file ${rna_h5ad_file} | ||
${has_adt ? "move_counts_anndata.py --anndata_file ${feature_h5ad_file}" : ''} | ||
reformat_anndata.py --anndata_file ${rna_h5ad_file} | ||
${has_adt ? "reformat_anndata.py --anndata_file ${feature_h5ad_file}" : ''} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to include the PCA information in these objects too like you do for the processed individual objects? I believe we do re-calculate PCs after merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't planning to change anything for the merged files. We don't move the reduced dims to X_PCA
and X_UMAP
for those, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do move them in those objects.
Line 70 in f215046
reducedDimNames(sce) <- glue::glue("X_{reducedDimNames(sce)}") |
I can see both arguments, but I think for consistency between objects we would want to do this for merged objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should make the names match. I think I don't want to add the .uns["pca"]
object for merged data though, since we use a different function for the PCA calculation for merged data that I am not sure would be equivalent to what ScanPy would do. So I think probably better to just leave that off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. So for the merged object the only change should be the lower case naming.
Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>
Test run here completed successfully, and at first examination the outputs are as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Closes #773
I'm filing this as a draft because I have not yet really tested it, but I wanted to get something up while it was fresh in my mind.
The goal here is to improve the compatibility with scanpy as described in #773, so I have done 4 main things:
X_pca
andX_umap
to match scanpy.highly_variable
column to thevar
tableuns
The last step involves exporting the variance explained data and then importing that separately, with the assumption that the PCA was centered and highly variable genes were used. I could (and probably should) make those assumptions into arguments for the script just to be safe and maybe a bit future-proof.
There are probably also a few places where I am making other assumptions that I should check more explicitly. As I said, a draft...
I also renamed the script to be a bit more generally named, but annoyingly there were apparently enough changes that GitHub is not displaying it as a rename but as a deletion and recreation. Probably because it was run through a code formatter automatically.