Explore expression of marker genes for tumor cell states in 2 samples #981

allyhawkins · 2025-01-10T18:28:47Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #940

What is the goal of this pull request?

Here I'm adding a notebook that looks at expression of marker genes for tumor cell states in 2 different Ewing libraries, SCPCL000822 and SCPCL000824. Generally I am using this notebook to answer the following questions:

Do we see expression of the genes in the tumor cell state marker gene list in our samples?
Do we see variation in expression of these genes across tumor cells?
Using these marker gene lists are there any clear cell states that we can define?

Briefly describe the general approach you took to achieve this goal.

The first thing I did was pick out the tumor cells. To do this I'm looking at the output from aucell-singler-annotation.sh and evaluate-clusters.sh. The aucell-singler-annotation.sh workflow identifies tumor cells using both SingleR and AUCell and the clustering workflow looks at different cluster assignments and gene expression for tumor marker genes across the clusters. I use the output of both to pick out clusters of cells that are tumor cells.

After getting the cell type assignments, I looked at the expression of the marker genes in the tumor-cell-state-markers.tsv file and the genes in the custom gene signatures we added as part of #971. To do this I calculated the sum of all marker genes in the gene set and used that for plotting throughout the notebook. I noticed that for both samples there was a cluster of cells that had high expression of EWS-FLI1 low targets, so I pulled those out as tumor cells too.

Once I had the tumor cells, I re-did clustering on just those cells. Then I looked at expression of the gene sets in each of those clusters to see if I could assign clusters to different cell states (mainly EWS-FLI1 low/high) based on expression of that gene set. I was able to easily pick out low vs. high cells, but there looks like there might be some variation and potentially a group of "EWS-FLI1 middle" cells that express EWS-FLI1 high marker genes but to a lesser degree.

A few key observations that I made along the way:

There isn't much expression of the proliferative gene set, so I'm not sure that's really going to be useful.
There is really clear separation between low and high, but then there is some in-between expression that's hard to call. Expression of the EWS-FLI1 high targets has been shown to lie on a continuum so this is what I would expect. If we want to come up with more classifications we would have to think about how to bin the cells into different EWS-FLI1 levels and do things that are outside the scope of this notebook.
I did think about using something like AUCell, but you really need some clear separation of gene expression and we don't have a lot of that outside of EWS high/low, which may be fine 🤷‍♀️ . I also thought this was getting a little long, so AUCell should be a separate notebook if we do want to use it.
I think the two custom gene sets that we identified from Aynaud and Wrenn are the most helpful. We see more variation in expression in those than the gene list we came up with, so I'm inclined to use those moving forward.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

I think the first thing we want to do is finalize cell type annotations using the SingleR/ clustering results and then classify cells as EWS-FLI1 high/low. I think we can add in the "middle" group later.

Result

What types of results does your code produce (e.g., table, figure)?

Here's the rendered copy of the notebook for review:
06-tumor-cell-state-assignment.html.zip

As part of the notebook I also saved the annotations to a TSV file.

What is your summary of the results?

There are clear groups of EWS-FLI1 high and low cells
It's hard to identify any proliferative cells since expression of those genes is pretty non-existent
There may be a group of cells that lie in the middle. They don't have expression of the low marker genes and they don't have quite as high expression of the high marker genes. I think if we want to classify these moving forward we should make some rules/ be more stringent about how we do it.

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

…plore-cell-states-ewings

sjspielman

Overall I think this is a really reasonable exploration of results! I very much agree that if you want to consider AUCell, that's a separate notebook; this is long enough.

One additional item I wanted to point out beyond the comments I left was that, I'm not sure we can rule out proliferative being useful based on 2 libraries alone, although I do agree based on the density plots that there seems to be a lot less going in within that gene set here. I did leave a comment about the relationship between proliferative and high, which might also help to get a sense of how much we want to use proliferative going forward if at all!

sjspielman · 2025-01-10T18:59:57Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+)
+
+# set seed
+set.seed(2024)


the future is now 🎉 (obviously this is not a needed change though)

Suggested change

set.seed(2024)

set.seed(2025)

Maybe by February, I'll remember it's a new year 😂

I am very much still at the <types 2024...deletes the 4 and types 5> stage! 🥴

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

analyses/cell-type-ewings/exploratory_analysis/README.md

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

sjspielman · 2025-01-10T19:22:28Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+        aspect.ratio = 1
+      )
+  }) |> 
+  patchwork::wrap_plots(ncol = 2)


I'm not sure I'd wrap these vs just print them one on each line, since they are a bit small in the HTML compared to other plots, and these are important ones to be able to see

sjspielman · 2025-01-10T19:45:42Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+I'm using the same parameters we use for the whole sample, leiden, modularity, resolution of 0.5, and nearest neighbors of 20. 
+I think if we really wanted to be stringent we would evaluate clusters here and find the best parameters to use, but I think starting with these is good. 


I don't think there are "perfect" answers here since clustering is its own special can of worms. But, I do want to say that I wonder what the difference would be between using the same parameters on subsetted data, vs using the full dataset with a higher resolution. I agree this is a decent enough place to start though, since it's anyone's guess how to subset clusters anyways!

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

sjspielman · 2025-01-10T19:55:40Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+
+```
+
+It looks like for both samples we have some clear division between EWS high and EWS low cells. 


Hmm.. what do we make of the fact that it barely looks like annotation (which is the sub clusters - can you modify the heatmap code to get that labeled as "sub_cluster" and not as "annotation"?) is clustered on the 2nd heatmap? Of course, the hierachical clustering algorithm that the heatmap is using is decently different from the one used to cluster in the first place, but the total lack of relationship I find odd. I'm curious, what does the heatmap look like if we first arrange the data by subcluster, and then turn off x-axis clustering so that the subclusters indeed are grouped together?

Ok, now I'm looking at the next version of this heatmap after you group into putative cell states, and I feel less concerned! The rainbow look, shall we say, was probably just brought on by some very similar sub-clusters.

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

allyhawkins · 2025-01-10T22:26:07Z

One additional item I wanted to point out beyond the comments I left was that, I'm not sure we can rule out proliferative being useful based on 2 libraries alone, although I do agree based on the density plots that there seems to be a lot less going in within that gene set here. I did leave a comment about the relationship between proliferative and high, which might also help to get a sense of how much we want to use proliferative going forward if at all!

This comment made me take another look at this and I do see some expression of the proliferative gene set, but it looks so much lower because the values are lower. This is probably a side effect of having marker gene sets with varying numbers of genes in them. Because of that, I actually think we want to use the mean here. It doesn't change the overall conclusions and the plots mostly look the same, but it does keep things on a similar scale. It also makes it easier to see all the gene sets in the heatmaps.

In revisiting this, I think there is one group of cells that are "proliferative" in SCPCL000824 that also line up with EWS-high. I feel comfortable labeling these as such. The one thing that does change when using mean is that the differences between EWS-mid and EWS-high are not as striking to see in the heatmap, but I see it in the UMAP. Maybe that's okay and we could also just label cells as low, high, or proliferative?

Just an FYI that I wrote a new function to calculate means rather than modifying the existing function to calculate sums because we use that function in lots of places and I was trying not to break any previous things that use it.

I also updated the heatmap to not cluster the columns, but I wasn't able to change the name of the annotation bar. That also gets used by a lot of other notebooks and would require moving the definition of the ComplexHeatmap annotation out of the function. If we really wanted to do it then I would need to move the code into this notebook and there's already so much in this notebook. I did add a note about what the annotation label means and can add it if you feel like it's really necessary.

New rendered notebook:
06-tumor-cell-state-assignment.html.zip

sjspielman

The changes you've made here seem good to me (and totally fine about the annotation label, it's good enough!)!

The one thing that does change when using mean is that the differences between EWS-mid and EWS-high are not as striking to see in the heatmap, but I see it in the UMAP. Maybe that's okay and we could also just label cells as low, high, or proliferative?

I like the idea of labeling as high-proliferative as you have done (for that one cluster at least), but I think we should bear in mind the caveat that this does mean other high cells aren't also proliferative. That said, I was intrigued that a cluster came out for proliferative here at all, and in reminding myself of the marker genes we're using, I wondered if we just don't have enough proliferative markers for sufficient power. So I did another quick search and uncovered* the contender VRK1 from https://doi.org/10.1016/j.ccell.2014.10.004. Locally, I added this to the marker gene TSV and re-rendered the notebook. I think this may be a contender to include and achieve a bit more signal, what do you think?
06-tumor-cell-state-assignment-with-VRK1.html.zip

analyses/cell-type-ewings/scripts/utils/tumor-validation-helpers.R

sjspielman · 2025-01-13T12:51:17Z

analyses/cell-type-ewings/scripts/utils/tumor-validation-helpers.R

+
+  df <- data.frame(
+    barcodes = names(mean_exp),
+    sum_exp = mean_exp


this name will change anyway, but for clarity in the meantime...

Suggested change

sum_exp = mean_exp

mean_exp = mean_exp

sjspielman · 2025-01-13T12:52:47Z

analyses/cell-type-ewings/exploratory_analysis/README.md

@@ -20,6 +20,9 @@ This includes comparing annotations to those obtained from marker gene annotatio
 5. `05-cluster-exploration.Rmd`: This notebook looks at clustering across different parameters to choose the most optimal clusters for `SCPCL000822` and `SCPCL000824`. 
 Additionally, we look at expression of marker genes from `references/visser-all-marker-genes.tsv` across all clusters and use that to refine tumor cell annotations obtained from running `aucell-singler-annotation.sh`. 

+6. `06-tumor-cell-state-assignments.Rmd`: This notebook looks at assigning tumor cell states in `SCPCL000822` and `SCPCL000824`. 
+Tumor cells are grouped into `EWS-FLI` high, `EWS-FLI` low, and `EWS-FLI` middle based on expression of marker genes. 


and high proliferative?

sjspielman · 2025-01-13T12:55:17Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+## Conclusions 
+
+- There is a clear division between cells that are `EWS-FLI` high and `EWS-FLI1` low. 
+- Most tumor cells in both samples are `EWS-FLI` high, which is consistent with the literature. 


sub-bullet for proliferative

sjspielman · 2025-01-13T15:31:40Z

analyses/cell-type-ewings/exploratory_analysis/06-tumor-cell-state-assignment.Rmd

+Based on this I would make the following assignments: 
+
+`SCPCL000822`: 
+`EWS-FLI1` low  - 2
+`EWS-FLI1` middle - 1 & 4
+`EWS-FLI1` high - 3 
+
+`SCPCL000824` : 
+`EWS-FLI1` low - 1
+`EWS-FLI1` middle - 6 
+`EWS-FLI1` high - 3, 4, 5, 7
+`EWS-FLI1` high proliferative - 2


Forgot to add this comment before - this is each appearing on one line so let's actually make it bullets

Suggested change

Based on this I would make the following assignments:

`SCPCL000822`:

`EWS-FLI1` low - 2

`EWS-FLI1` middle - 1 & 4

`EWS-FLI1` high - 3

`SCPCL000824` :

`EWS-FLI1` low - 1

`EWS-FLI1` middle - 6

`EWS-FLI1` high - 3, 4, 5, 7

`EWS-FLI1` high proliferative - 2

Based on this I would make the following assignments:

`SCPCL000822`:

* `EWS-FLI1` low - 2

* `EWS-FLI1` middle - 1 & 4

* `EWS-FLI1` high - 3

`SCPCL000824` :

* `EWS-FLI1` low - 1

* `EWS-FLI1` middle - 6

* `EWS-FLI1` high - 3, 4, 5, 7

* `EWS-FLI1` high proliferative - 2

allyhawkins · 2025-01-13T16:57:47Z

I like the idea of labeling as high-proliferative as you have done (for that one cluster at least), but I think we should bear in mind the caveat that this does mean other high cells aren't also proliferative. That said, I was intrigued that a cluster came out for proliferative here at all, and in reminding myself of the marker genes we're using, I wondered if we just don't have enough proliferative markers for sufficient power. So I did another quick search and uncovered* the contender VRK1 from https://doi.org/10.1016/j.ccell.2014.10.004.

I think we should figure out what exactly we mean as "proliferative". It's been shown that EWS-FLI1 high cells have a more proliferative phenotype and that EWS-FLI1 low cells have a more metastatic phenotype (see https://www.nature.com/articles/onc2016498). So if we use that logic, then all EWS-FLI1 high cells should be proliferative. And then if you add in VRK1, an EWS-FLI1 target which has been shown to contribute to tumor cell growth and proliferation in the paper you posted, then you see more of the EWS-FLI1 high cells overlapping with proliferative cells. I think if we want to use that logic, then we don't need to have the "proliferative" category at all and just categorize by EWS-FLI1 levels.

However, those studies were all done in bulk and all cell lines. The "proliferative" class that I was trying to characterize is based on the "Ewing sarcoma proliferating cells" labeled in https://www.biorxiv.org/content/10.1101/2024.01.18.576251v2.full. These were specifically identified by looking at the two markers that we have on the list and are a subset of Ewing tumor cells in patient tumor cells using single-cell. That doesn't mean it should be the gold standard, but I think if we want to isolate a specific "proliferating" population, then we can't just use EWS-FLI1 targets, since those should be high regardless.

I did revisit Figure 2 from Goodspeed et al,. and I wonder if we want to add TOP2A to the list?
I think if we want to add VRK1 it should be to the EWS-FLI high list.

Also, I think we should add ICAM1 based on Franzetti et al. I completely forgot about that paper until now!

Here's an example of the report with ICAM1, TOP2A, and VRK1.
06-tumor-cell-state-assignment.html.zip

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

sjspielman

That doesn't mean it should be the gold standard, but I think if we want to isolate a specific "proliferating" population, then we can't just use EWS-FLI1 targets, since those should be high regardless.

Got it, thanks for the biology! Totally makes sense.

I did revisit Figure 2 from Goodspeed et al,. and I wonder if we want to add TOP2A to the list? I think if we want to add VRK1 it should be to the EWS-FLI high list.

If we're including the others from Goodspeed, then I agree TOP2A should be in there per that linked heatmap, and I agree with adding VRK1 to the high list.

Looking at the updated report, it all looks good to me at this point actually and I think we have a pretty solid gene set here to move forward with. The only changes to make are to add the additional DOI we have to the references/README.md where the marker gene file is described, and we do want to get bullets in that one spot I commented on since otherwise it's all on the same line in the report -

Otherwise, good to go! 🐈

allyhawkins · 2025-01-13T18:18:01Z

Looking at the updated report, it all looks good to me at this point actually and I think we have a pretty solid gene set here to move forward with. The only changes to make are to add the additional DOI we have to the references/README.md where the marker gene file is described, and we do want to get bullets in that one spot I commented on since otherwise it's all on the same line in the report -

Sorry about that, I thought I had added the bullets but they should be there now! And then I updated the README to include the new references we're using.

Since I made some small changes to functions that are used in other places used by some of the scripts run in CI, I'm going to make sure checks pass before merging this.

allyhawkins added 6 commits January 9, 2025 09:53

initiate notebook for looking at tumor cell states

c53fac5

add some helpful images for defining tumor cells

988db2c

subclustering and more plots

294b8dd

render the notebook

2f599ef

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/ex…

be1c672

…plore-cell-states-ewings

update readme

13f71b0

allyhawkins requested a review from jaclyn-taroni as a code owner January 10, 2025 18:28

allyhawkins requested review from sjspielman and removed request for jaclyn-taroni January 10, 2025 18:29

sjspielman reviewed Jan 10, 2025

View reviewed changes

allyhawkins and others added 4 commits January 10, 2025 14:36

Apply suggestions from code review

5af762f

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

turn off clustering

70c00e2

update umaps and heatmaps

e2e8a1a

use mean

becb6af

allyhawkins requested a review from sjspielman January 10, 2025 22:26

sjspielman reviewed Jan 13, 2025

View reviewed changes

allyhawkins added 3 commits January 13, 2025 10:55

respond to reviews

51ad6e2

add top2a, icam1 and vrk1

811b3ad

re-render

85386f6

allyhawkins requested a review from sjspielman January 13, 2025 16:57

Use type

dc3ec76

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

sjspielman approved these changes Jan 13, 2025

View reviewed changes

allyhawkins added 2 commits January 13, 2025 12:14

add new refs to readme

b402db7

actually use bullets and re-render

7888eb9

allyhawkins merged commit 40d6db1 into AlexsLemonade:main Jan 14, 2025
3 checks passed

allyhawkins deleted the allyhawkins/explore-cell-states-ewings branch January 14, 2025 15:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explore expression of marker genes for tumor cell states in 2 samples #981

Explore expression of marker genes for tumor cell states in 2 samples #981

allyhawkins commented Jan 10, 2025

sjspielman left a comment

sjspielman Jan 10, 2025

allyhawkins Jan 10, 2025

sjspielman Jan 10, 2025

sjspielman Jan 10, 2025

sjspielman Jan 10, 2025

sjspielman Jan 10, 2025

sjspielman Jan 10, 2025

allyhawkins commented Jan 10, 2025

sjspielman left a comment •

edited

Loading

sjspielman Jan 13, 2025

sjspielman Jan 13, 2025

sjspielman Jan 13, 2025

sjspielman Jan 13, 2025

allyhawkins commented Jan 13, 2025

sjspielman left a comment

allyhawkins commented Jan 13, 2025

		I'm using the same parameters we use for the whole sample, leiden, modularity, resolution of 0.5, and nearest neighbors of 20.
		I think if we really wanted to be stringent we would evaluate clusters here and find the best parameters to use, but I think starting with these is good.


		```

		It looks like for both samples we have some clear division between EWS high and EWS low cells.

Explore expression of marker genes for tumor cell states in 2 samples #981

Explore expression of marker genes for tumor cell states in 2 samples #981

Conversation

allyhawkins commented Jan 10, 2025

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Result

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Author checklists

Analysis module and review

Reproducibility checklist

sjspielman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allyhawkins commented Jan 10, 2025

sjspielman left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allyhawkins commented Jan 13, 2025

sjspielman left a comment

Choose a reason for hiding this comment

allyhawkins commented Jan 13, 2025

sjspielman left a comment •

edited

Loading