Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore expression of marker genes for tumor cell states in 2 samples #981

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #940

What is the goal of this pull request?

Here I'm adding a notebook that looks at expression of marker genes for tumor cell states in 2 different Ewing libraries, SCPCL000822 and SCPCL000824. Generally I am using this notebook to answer the following questions:

  • Do we see expression of the genes in the tumor cell state marker gene list in our samples?
  • Do we see variation in expression of these genes across tumor cells?
  • Using these marker gene lists are there any clear cell states that we can define?

Briefly describe the general approach you took to achieve this goal.

The first thing I did was pick out the tumor cells. To do this I'm looking at the output from aucell-singler-annotation.sh and evaluate-clusters.sh. The aucell-singler-annotation.sh workflow identifies tumor cells using both SingleR and AUCell and the clustering workflow looks at different cluster assignments and gene expression for tumor marker genes across the clusters. I use the output of both to pick out clusters of cells that are tumor cells.

After getting the cell type assignments, I looked at the expression of the marker genes in the tumor-cell-state-markers.tsv file and the genes in the custom gene signatures we added as part of #971. To do this I calculated the sum of all marker genes in the gene set and used that for plotting throughout the notebook. I noticed that for both samples there was a cluster of cells that had high expression of EWS-FLI1 low targets, so I pulled those out as tumor cells too.

Once I had the tumor cells, I re-did clustering on just those cells. Then I looked at expression of the gene sets in each of those clusters to see if I could assign clusters to different cell states (mainly EWS-FLI1 low/high) based on expression of that gene set. I was able to easily pick out low vs. high cells, but there looks like there might be some variation and potentially a group of "EWS-FLI1 middle" cells that express EWS-FLI1 high marker genes but to a lesser degree.

A few key observations that I made along the way:

  • There isn't much expression of the proliferative gene set, so I'm not sure that's really going to be useful.
  • There is really clear separation between low and high, but then there is some in-between expression that's hard to call. Expression of the EWS-FLI1 high targets has been shown to lie on a continuum so this is what I would expect. If we want to come up with more classifications we would have to think about how to bin the cells into different EWS-FLI1 levels and do things that are outside the scope of this notebook.
  • I did think about using something like AUCell, but you really need some clear separation of gene expression and we don't have a lot of that outside of EWS high/low, which may be fine 🤷‍♀️ . I also thought this was getting a little long, so AUCell should be a separate notebook if we do want to use it.
  • I think the two custom gene sets that we identified from Aynaud and Wrenn are the most helpful. We see more variation in expression in those than the gene list we came up with, so I'm inclined to use those moving forward.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

I think the first thing we want to do is finalize cell type annotations using the SingleR/ clustering results and then classify cells as EWS-FLI1 high/low. I think we can add in the "middle" group later.

Result

What types of results does your code produce (e.g., table, figure)?

Here's the rendered copy of the notebook for review:
06-tumor-cell-state-assignment.html.zip

As part of the notebook I also saved the annotations to a TSV file.

What is your summary of the results?

  • There are clear groups of EWS-FLI1 high and low cells
  • It's hard to identify any proliferative cells since expression of those genes is pretty non-existent
  • There may be a group of cells that lie in the middle. They don't have expression of the low marker genes and they don't have quite as high expression of the high marker genes. I think if we want to classify these moving forward we should make some rules/ be more stringent about how we do it.

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins allyhawkins requested review from sjspielman and removed request for jaclyn-taroni January 10, 2025 18:29
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is a really reasonable exploration of results! I very much agree that if you want to consider AUCell, that's a separate notebook; this is long enough.

One additional item I wanted to point out beyond the comments I left was that, I'm not sure we can rule out proliferative being useful based on 2 libraries alone, although I do agree based on the density plots that there seems to be a lot less going in within that gene set here. I did leave a comment about the relationship between proliferative and high, which might also help to get a sense of how much we want to use proliferative going forward if at all!

)

# set seed
set.seed(2024)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the future is now 🎉 (obviously this is not a needed change though)

Suggested change
set.seed(2024)
set.seed(2025)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe by February, I'll remember it's a new year 😂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very much still at the <types 2024...deletes the 4 and types 5> stage! 🥴

analyses/cell-type-ewings/exploratory_analysis/README.md Outdated Show resolved Hide resolved
aspect.ratio = 1
)
}) |>
patchwork::wrap_plots(ncol = 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'd wrap these vs just print them one on each line, since they are a bit small in the HTML compared to other plots, and these are important ones to be able to see

Comment on lines +381 to +382
I'm using the same parameters we use for the whole sample, leiden, modularity, resolution of 0.5, and nearest neighbors of 20.
I think if we really wanted to be stringent we would evaluate clusters here and find the best parameters to use, but I think starting with these is good.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there are "perfect" answers here since clustering is its own special can of worms. But, I do want to say that I wonder what the difference would be between using the same parameters on subsetted data, vs using the full dataset with a higher resolution. I agree this is a decent enough place to start though, since it's anyone's guess how to subset clusters anyways!


```

It looks like for both samples we have some clear division between EWS high and EWS low cells.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. what do we make of the fact that it barely looks like annotation (which is the sub clusters - can you modify the heatmap code to get that labeled as "sub_cluster" and not as "annotation"?) is clustered on the 2nd heatmap? Of course, the hierachical clustering algorithm that the heatmap is using is decently different from the one used to cluster in the first place, but the total lack of relationship I find odd. I'm curious, what does the heatmap look like if we first arrange the data by subcluster, and then turn off x-axis clustering so that the subclusters indeed are grouped together?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I'm looking at the next version of this heatmap after you group into putative cell states, and I feel less concerned! The rainbow look, shall we say, was probably just brought on by some very similar sub-clusters.

allyhawkins and others added 4 commits January 10, 2025 14:36
@allyhawkins
Copy link
Member Author

One additional item I wanted to point out beyond the comments I left was that, I'm not sure we can rule out proliferative being useful based on 2 libraries alone, although I do agree based on the density plots that there seems to be a lot less going in within that gene set here. I did leave a comment about the relationship between proliferative and high, which might also help to get a sense of how much we want to use proliferative going forward if at all!

This comment made me take another look at this and I do see some expression of the proliferative gene set, but it looks so much lower because the values are lower. This is probably a side effect of having marker gene sets with varying numbers of genes in them. Because of that, I actually think we want to use the mean here. It doesn't change the overall conclusions and the plots mostly look the same, but it does keep things on a similar scale. It also makes it easier to see all the gene sets in the heatmaps.

In revisiting this, I think there is one group of cells that are "proliferative" in SCPCL000824 that also line up with EWS-high. I feel comfortable labeling these as such. The one thing that does change when using mean is that the differences between EWS-mid and EWS-high are not as striking to see in the heatmap, but I see it in the UMAP. Maybe that's okay and we could also just label cells as low, high, or proliferative?

Just an FYI that I wrote a new function to calculate means rather than modifying the existing function to calculate sums because we use that function in lots of places and I was trying not to break any previous things that use it.

I also updated the heatmap to not cluster the columns, but I wasn't able to change the name of the annotation bar. That also gets used by a lot of other notebooks and would require moving the definition of the ComplexHeatmap annotation out of the function. If we really wanted to do it then I would need to move the code into this notebook and there's already so much in this notebook. I did add a note about what the annotation label means and can add it if you feel like it's really necessary.

New rendered notebook:
06-tumor-cell-state-assignment.html.zip

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes you've made here seem good to me (and totally fine about the annotation label, it's good enough!)!

The one thing that does change when using mean is that the differences between EWS-mid and EWS-high are not as striking to see in the heatmap, but I see it in the UMAP. Maybe that's okay and we could also just label cells as low, high, or proliferative?

I like the idea of labeling as high-proliferative as you have done (for that one cluster at least), but I think we should bear in mind the caveat that this does mean other high cells aren't also proliferative. That said, I was intrigued that a cluster came out for proliferative here at all, and in reminding myself of the marker genes we're using, I wondered if we just don't have enough proliferative markers for sufficient power. So I did another quick search and uncovered* the contender VRK1 from https://doi.org/10.1016/j.ccell.2014.10.004. Locally, I added this to the marker gene TSV and re-rendered the notebook. I think this may be a contender to include and achieve a bit more signal, what do you think?
06-tumor-cell-state-assignment-with-VRK1.html.zip


df <- data.frame(
barcodes = names(mean_exp),
sum_exp = mean_exp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this name will change anyway, but for clarity in the meantime...

Suggested change
sum_exp = mean_exp
mean_exp = mean_exp

@@ -20,6 +20,9 @@ This includes comparing annotations to those obtained from marker gene annotatio
5. `05-cluster-exploration.Rmd`: This notebook looks at clustering across different parameters to choose the most optimal clusters for `SCPCL000822` and `SCPCL000824`.
Additionally, we look at expression of marker genes from `references/visser-all-marker-genes.tsv` across all clusters and use that to refine tumor cell annotations obtained from running `aucell-singler-annotation.sh`.

6. `06-tumor-cell-state-assignments.Rmd`: This notebook looks at assigning tumor cell states in `SCPCL000822` and `SCPCL000824`.
Tumor cells are grouped into `EWS-FLI` high, `EWS-FLI` low, and `EWS-FLI` middle based on expression of marker genes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and high proliferative?

## Conclusions

- There is a clear division between cells that are `EWS-FLI` high and `EWS-FLI1` low.
- Most tumor cells in both samples are `EWS-FLI` high, which is consistent with the literature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sub-bullet for proliferative

Comment on lines 509 to 520
Based on this I would make the following assignments:

`SCPCL000822`:
`EWS-FLI1` low - 2
`EWS-FLI1` middle - 1 & 4
`EWS-FLI1` high - 3

`SCPCL000824` :
`EWS-FLI1` low - 1
`EWS-FLI1` middle - 6
`EWS-FLI1` high - 3, 4, 5, 7
`EWS-FLI1` high proliferative - 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to add this comment before - this is each appearing on one line so let's actually make it bullets

Suggested change
Based on this I would make the following assignments:
`SCPCL000822`:
`EWS-FLI1` low - 2
`EWS-FLI1` middle - 1 & 4
`EWS-FLI1` high - 3
`SCPCL000824` :
`EWS-FLI1` low - 1
`EWS-FLI1` middle - 6
`EWS-FLI1` high - 3, 4, 5, 7
`EWS-FLI1` high proliferative - 2
Based on this I would make the following assignments:
`SCPCL000822`:
* `EWS-FLI1` low - 2
* `EWS-FLI1` middle - 1 & 4
* `EWS-FLI1` high - 3
`SCPCL000824` :
* `EWS-FLI1` low - 1
* `EWS-FLI1` middle - 6
* `EWS-FLI1` high - 3, 4, 5, 7
* `EWS-FLI1` high proliferative - 2

@allyhawkins
Copy link
Member Author

I like the idea of labeling as high-proliferative as you have done (for that one cluster at least), but I think we should bear in mind the caveat that this does mean other high cells aren't also proliferative. That said, I was intrigued that a cluster came out for proliferative here at all, and in reminding myself of the marker genes we're using, I wondered if we just don't have enough proliferative markers for sufficient power. So I did another quick search and uncovered* the contender VRK1 from https://doi.org/10.1016/j.ccell.2014.10.004.

I think we should figure out what exactly we mean as "proliferative". It's been shown that EWS-FLI1 high cells have a more proliferative phenotype and that EWS-FLI1 low cells have a more metastatic phenotype (see https://www.nature.com/articles/onc2016498). So if we use that logic, then all EWS-FLI1 high cells should be proliferative. And then if you add in VRK1, an EWS-FLI1 target which has been shown to contribute to tumor cell growth and proliferation in the paper you posted, then you see more of the EWS-FLI1 high cells overlapping with proliferative cells. I think if we want to use that logic, then we don't need to have the "proliferative" category at all and just categorize by EWS-FLI1 levels.

However, those studies were all done in bulk and all cell lines. The "proliferative" class that I was trying to characterize is based on the "Ewing sarcoma proliferating cells" labeled in https://www.biorxiv.org/content/10.1101/2024.01.18.576251v2.full. These were specifically identified by looking at the two markers that we have on the list and are a subset of Ewing tumor cells in patient tumor cells using single-cell. That doesn't mean it should be the gold standard, but I think if we want to isolate a specific "proliferating" population, then we can't just use EWS-FLI1 targets, since those should be high regardless.

I did revisit Figure 2 from Goodspeed et al,. and I wonder if we want to add TOP2A to the list?
I think if we want to add VRK1 it should be to the EWS-FLI high list.

Also, I think we should add ICAM1 based on Franzetti et al. I completely forgot about that paper until now!

Here's an example of the report with ICAM1, TOP2A, and VRK1.
06-tumor-cell-state-assignment.html.zip

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't mean it should be the gold standard, but I think if we want to isolate a specific "proliferating" population, then we can't just use EWS-FLI1 targets, since those should be high regardless.

Got it, thanks for the biology! Totally makes sense.

I did revisit Figure 2 from Goodspeed et al,. and I wonder if we want to add TOP2A to the list? I think if we want to add VRK1 it should be to the EWS-FLI high list.

If we're including the others from Goodspeed, then I agree TOP2A should be in there per that linked heatmap, and I agree with adding VRK1 to the high list.

Looking at the updated report, it all looks good to me at this point actually and I think we have a pretty solid gene set here to move forward with. The only changes to make are to add the additional DOI we have to the references/README.md where the marker gene file is described, and we do want to get bullets in that one spot I commented on since otherwise it's all on the same line in the report -
Screenshot 2025-01-13 at 1 00 06 PM

Otherwise, good to go! 🐈

@allyhawkins
Copy link
Member Author

Looking at the updated report, it all looks good to me at this point actually and I think we have a pretty solid gene set here to move forward with. The only changes to make are to add the additional DOI we have to the references/README.md where the marker gene file is described, and we do want to get bullets in that one spot I commented on since otherwise it's all on the same line in the report -

Sorry about that, I thought I had added the bullets but they should be there now! And then I updated the README to include the new references we're using.

Since I made some small changes to functions that are used in other places used by some of the scripts run in CI, I'm going to make sure checks pass before merging this.

@allyhawkins allyhawkins merged commit 40d6db1 into AlexsLemonade:main Jan 14, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/explore-cell-states-ewings branch January 14, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Notebook to explore identifying tumor cell states in Ewing samples using marker gene lists
2 participants