non-Rhabdomyosarcoma Soft Tissue Sarcoma Dataset Annotation (SCPCP000013) #604

yutarohtanaka · 2024-07-12T17:13:56Z

Proposed analysis

We plan to annotate snRNA-seq samples from different non-rhabdomyosarcoma soft tissue sarcomas in the SCPCP000013 dataset (n=34). Our processing and annotation workflow will include removing ambient and background RNA, filtering out low-quality nuclei and doublets, annotating cell types, and annotating malignant cells.

Scientific goals

To share a validated, curated set of cell type annotations for the non-rhabdomyosarcoma soft tissue sarcoma samples in this dataset.

Methods or approach

Filtering for Ambient RNA
CellBender is a computational tool that removes ambient / background RNA from count matrices. We will compare CellBender against DropletUtils::emptyDropsCellRanger() (which we understand was used to produce the “filtered counts” provided) to determine which method performs best on this data, and remove as much potential background RNA as possible.
(We are more than happy to skip this step if using the emptyDrops-filtered matrices would be preferable.)
Filtering for Low Quality Nuclei
Here, “low quality nuclei” are defined as nuclei with fewer than 300 genes or 500 UMI counts detected, or with more than 6,000 genes or 50,000 UMI counts detected. Additionally, we filter out nuclei that have no ribosomal gene expression, or in which mitochondrial or hemoglobin genes make up more than 20% or 5% of total expression, respectively. We use scanpy built-in functions to perform this.
We will also filter out any sparsely expressed genes that are detected in fewer than 5 cells.
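To make the thresholds above concrete, here is a minimal NumPy-only sketch of the filtering logic. It is not our scanpy code: the function names and the boolean gene masks are hypothetical, and it assumes the mitochondrial and hemoglobin percentages are computed over total UMI counts per nucleus.

```python
import numpy as np

def qc_keep_mask(counts, is_mito, is_ribo, is_hb):
    """Boolean mask of nuclei that pass the QC thresholds listed above.

    counts: (n_nuclei, n_genes) array of raw UMI counts.
    is_mito, is_ribo, is_hb: boolean gene masks (hypothetical inputs).
    """
    counts = np.asarray(counts)
    n_genes = (counts > 0).sum(axis=1)      # genes detected per nucleus
    total = counts.sum(axis=1)              # UMI counts per nucleus
    safe_total = np.maximum(total, 1)       # avoid division by zero
    pct_mito = 100 * counts[:, is_mito].sum(axis=1) / safe_total
    pct_hb = 100 * counts[:, is_hb].sum(axis=1) / safe_total
    ribo = counts[:, is_ribo].sum(axis=1)
    return (
        (n_genes >= 300) & (n_genes <= 6000)
        & (total >= 500) & (total <= 50_000)
        & (ribo > 0)
        & (pct_mito <= 20) & (pct_hb <= 5)
    )

def gene_keep_mask(counts, min_cells=5):
    """Boolean mask of genes detected in at least `min_cells` cells."""
    return (np.asarray(counts) > 0).sum(axis=0) >= min_cells
```

In practice the same per-nucleus metrics would come from scanpy's QC utilities; the sketch only shows how the cutoffs combine.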
Filtering for Doublets
We have primarily used scrublet in our prior work and found that it identifies doublets (and multiplets) with reasonable confidence. Here, we plan to use scrublet to call and filter out potential doublets in each sample.
Annotating Cell Types
We will use two separate methods of cell type annotation, a manually curated marker gene based approach and a supervised machine learning approach, to increase the confidence and granularity of the annotated cell types.

Marker Gene Based Approach
We will use lists of 5-10 marker genes for each cell type, and use the decoupler AUCell tool to score every cell in each sample on these gene lists. The genes for each cell type will be curated from existing literature and datasets such as CellxGene. Each cell will be assigned the cell type on which it scores highest.
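As a rough illustration of score-then-assign, the sketch below uses a simplified rank-based score as a stand-in for AUCell: the fraction of a cell type's markers that land among a cell's top-expressed genes. It is not decoupler's implementation, and the function names, the `top_frac` parameter, and the marker dictionary format are assumptions made for the example.

```python
import numpy as np

def marker_scores(expr, gene_names, markers, top_frac=0.05):
    """Score each cell on each marker list (simplified stand-in for AUCell).

    expr: (n_cells, n_genes) expression matrix.
    markers: dict mapping cell type -> list of marker gene names.
    """
    expr = np.asarray(expr, dtype=float)
    n_top = max(1, int(round(top_frac * expr.shape[1])))
    # Indices of the n_top most highly expressed genes in each cell.
    order = np.argsort(-expr, axis=1, kind="stable")[:, :n_top]
    in_top = np.zeros(expr.shape, dtype=bool)
    np.put_along_axis(in_top, order, True, axis=1)
    in_top &= expr > 0  # undetected genes never count as "top"
    index = {g: i for i, g in enumerate(gene_names)}
    cell_types = list(markers)
    scores = np.zeros((expr.shape[0], len(cell_types)))
    for j, ct in enumerate(cell_types):
        cols = [index[g] for g in markers[ct] if g in index]
        if cols:
            scores[:, j] = in_top[:, cols].mean(axis=1)
    return cell_types, scores

def assign_highest(cell_types, scores):
    """Assign each cell the cell type with the highest score."""
    return [cell_types[k] for k in scores.argmax(axis=1)]
```

Ties resolve to the first cell type in the dictionary, so a real workflow would want to flag tied or all-zero cells rather than accept the argmax blindly.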
Supervised Machine Learning
We plan to use the CellTypist package, a supervised, adaptable cell type prediction method. We will run this by using the package-provided pretrained models for the immune, epithelial, and endothelial cells, and will fine-tune an additional model to provide annotations for cells on a compiled single-cell expression object of CellxGene kidney datasets.
Each cell will be assigned the cell type that the model returns the highest confidence score for.
In our experience, these two approaches largely agree (especially when annotating non-malignant cells), and using two complementary approaches has increased our confidence in the resulting annotations. We can provide a detailed cell type table with calls from each of these methods as supplementary data, if needed. When the two methods assign conflicting cell types to a cell, we will conduct further manual review using an expanded set of marker genes and re-examine the confidence scores computed by the machine learning method.
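The comparison of the two sets of calls could look something like the sketch below. It assumes both methods' labels have already been mapped to a shared vocabulary; `reconcile_calls` is a hypothetical helper, not part of decoupler or CellTypist.

```python
def reconcile_calls(marker_calls, model_calls):
    """Compare per-nucleus labels from the two annotation methods.

    Returns (consensus, to_review): consensus holds the agreed label or
    None where the methods disagree; to_review lists the indices of
    nuclei that need manual review.
    """
    if len(marker_calls) != len(model_calls):
        raise ValueError("both methods must label the same nuclei")
    consensus, to_review = [], []
    for i, (marker_label, model_label) in enumerate(zip(marker_calls, model_calls)):
        if marker_label == model_label:
            consensus.append(marker_label)
        else:
            consensus.append(None)   # left unresolved until reviewed
            to_review.append(i)
    return consensus, to_review
```

For example, `reconcile_calls(["T cell", "Fibroblast"], ["T cell", "Endothelial"])` keeps the first call and flags index 1 for review.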

Validating Cell Types

In the cell type annotation step, we annotate cell types on a per-nucleus basis. To confirm these annotations, we will perform PCA, Leiden clustering, and UMAP visualization. We expect nuclei of the same annotated type to cluster together. If nuclei cluster with nuclei annotated as a different type, we will refine those annotations.
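One way to turn this check into a concrete list of nuclei to revisit is sketched below. It assumes cluster assignments (for example from Leiden clustering) are already available as a plain list; the helper name and the `min_majority` cutoff are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def flag_cluster_outliers(clusters, labels, min_majority=0.5):
    """Flag nuclei whose annotation differs from their cluster's majority label.

    clusters, labels: equal-length sequences, one entry per nucleus.
    A cluster only gets a majority label if that label covers more than
    `min_majority` of its nuclei; otherwise nothing in it is flagged.
    """
    by_cluster = defaultdict(list)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster].append(label)
    majority = {}
    for cluster, members in by_cluster.items():
        label, count = Counter(members).most_common(1)[0]
        majority[cluster] = label if count / len(members) > min_majority else None
    flagged = [
        i for i, (cluster, label) in enumerate(zip(clusters, labels))
        if majority[cluster] is not None and label != majority[cluster]
    ]
    return majority, flagged
```

Flagged nuclei are candidates for refinement, not automatic relabels; clusters without a clear majority would need to be inspected as a whole.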

Identifying Malignant Cells

We will run inferCNVpy per sample to infer copy number alterations. The "non-malignant" cell types used as the reference will be determined on a per-cancer type basis, taking into account each cancer's known cell type of origin.
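For readers unfamiliar with expression-based CNV inference, the core idea can be sketched as follows: center each gene on the mean of the reference (non-malignant) cells, then smooth along genomic position. This is only an illustration of the principle under simplifying assumptions (one chromosome, genes already ordered by position, a fixed window); it is not inferCNVpy's implementation, and `cnv_profile` is a hypothetical name.

```python
import numpy as np

def cnv_profile(log_expr, is_reference, window=3):
    """Reference-centered, position-smoothed expression per cell.

    log_expr: (n_cells, n_genes) log-normalized expression, with genes
              ordered by genomic position on a single chromosome.
    is_reference: boolean mask marking the non-malignant reference cells.
    Returns an array of shape (n_cells, n_genes - window + 1).
    """
    log_expr = np.asarray(log_expr, dtype=float)
    is_reference = np.asarray(is_reference, dtype=bool)
    ref_mean = log_expr[is_reference].mean(axis=0)   # per-gene baseline
    centered = log_expr - ref_mean                   # deviation from reference
    kernel = np.ones(window) / window                # moving-average window
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, centered
    )
```

Sustained positive or negative stretches in a cell's profile suggest gains or losses relative to the reference, which is why the choice of reference cell types matters.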

Validation through Cohort Merging

After the normal and tumor cell types have been annotated per sample in the steps above, we will merge the whole cohort into one expression matrix, and perform PCA, Leiden clustering, and UMAP visualization on the whole cohort object.
Given known inter-tumor heterogeneity, we expect normal cells of the same type to cluster together across samples, and tumor cells to cluster predominantly by individual sample.
We will adjust existing cell type annotations for any cells that cluster with a different cell type.
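A simple way to quantify this expectation is the fraction of each cohort-level cluster that comes from its single most common sample. The sketch below is a hypothetical helper, assuming cluster and sample IDs are available as plain lists; no particular cutoff is implied.

```python
from collections import Counter, defaultdict

def dominant_sample_fraction(clusters, samples):
    """Fraction of each cluster contributed by its most common sample.

    Values near 1.0 mean the cluster is essentially one sample (expected
    for tumor cells); lower values mean samples are mixed (expected for
    normal cell types).
    """
    by_cluster = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        by_cluster[cluster][sample] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in by_cluster.items()
    }
```

A cluster annotated as a normal cell type but dominated by one sample, or a tumor cluster that mixes many samples, would be a prompt to revisit the underlying annotations.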

Existing modules

This processing and cell type annotation workflow largely follows the existing documentation in #292 (Ewing Sarcoma), with some adaptations. We expect to follow the same processes as described in #601 and #602.

Input data

We will start with the count matrices (the .h5ad “unfiltered counts file”) provided in the SCPCP000013 data repository. The analysis will be conducted using publicly available packages, and we will provide a final curated table of cell type markers, including the references used, along with the cell type annotation files.

Scientific literature

(CellBender) https://www.nature.com/articles/s41592-023-01943-7
(scanpy) https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0
(decouplerpy) https://doi.org/10.1093/bioadv/vbac016
(CellTypist) https://www.cell.com/cell/fulltext/S0092-8674(23)01312-0
(CellxGene) https://doi.org/10.1101/2023.10.30.563174
(inferCNVpy) https://github.com/icbi-lab/infercnvpy

Other details

All of this analysis can be performed in our local and cloud environments, and will predominantly be conducted in Python.
We plan to have all of this annotation performed and available to share within the next two months.
