This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Mapping high level histology color labels -- Part 1 #899

Merged
merged 27 commits into from
Jan 15, 2021

Conversation

cansavvy
Collaborator

@cansavvy cansavvy commented Jan 12, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR addresses Step 1 of 4 discussed in #897. It adds the main notebook described there, in which we attempt to group biospecimens into "high level" histology groups that will be more practical for plotting purposes.

Note that subsequent PRs will deal with updating the README and retiring histology_color_palette.tsv in favor of the mapping table, palettes/histology_label_color_table.tsv, that is created in this notebook.

What was your approach?

Copied from the notebook added in this PR itself:

# Purpose: 

The histology label variables included in `pbta-histologies.tsv` from data releases are not always useful for visualizing the full set of biospecimens due to the large number of different values.
Having too many different possible values makes the colors harder to distinguish.
In addition, there are some groups that are represented by only a very few samples; giving such groups a distinct color may be counterproductive.

The goal of this notebook is to use the currently existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.

## The output table

The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains the following fields:

**Copied from `pbta-histologies.tsv`**:    
- `Kids_First_Biospecimen_ID` (from `pbta-histologies.tsv`)  
- All the original histology label variables (`broad_histology`, `short_histology`, etc.)  
  
**Created in this notebook**:  
- `display_group` - the high-level histology labels that should be used for plotting   
- `hex_codes` - the hex color codes that should be used for plotting  

With this info, `histology_label_color_table.tsv` can be used by all plots and figures that summarize high-level data while displaying histology information.
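
A minimal sketch of how a figure script might consume this table, assuming the column names listed above. The rows below are made-up stand-ins for the real TSV (the biospecimen IDs and hex values are placeholders, not values from the notebook).

```r
library(dplyr)

# Toy stand-in for palettes/histology_label_color_table.tsv (placeholder values)
histology_label_color_table <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_0001", "BS_0002", "BS_0003"),
  broad_histology = c("Embryonal tumor", "Ependymal tumor", "Embryonal tumor"),
  display_group = c("Embryonal tumor", "Ependymal tumor", "Embryonal tumor"),
  hex_codes = c("#1b9e77", "#d95f02", "#1b9e77"),
  stringsAsFactors = FALSE
)

# Reduce to one row per display_group and its color
palette_df <- histology_label_color_table %>%
  dplyr::distinct(display_group, hex_codes)

# Named vector: names are display groups, values are hex codes
group_colors <- stats::setNames(palette_df$hex_codes, palette_df$display_group)

print(group_colors)
```

A named vector like `group_colors` can be passed as `values` to `ggplot2::scale_fill_manual()` or `ggplot2::scale_color_manual()`, so every figure maps the same group to the same color.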

# How `display_group` is made:

Here's how `broad_histology` groups are [combined into the higher-level groupings of `display_group`](#declare-new-equivalent-groups).

1) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases so we will group these into `Low-grade astrocytic tumor` in `display_group`. 

2) `Germ cell tumor` samples are combined into the `Non-CNS tumor` group. 

3) `Benign tumor` and `Non-tumor` biospecimens are combined into a `Benign` group. 

4) Anything not in the above categories gets its `broad_histology` label carried over. 
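
The four rules above could be sketched with `dplyr::case_when()` as follows. This is an illustration on a toy data frame, not the notebook's actual code, and it encodes the groupings exactly as written in this description, which are the part reviewers are asked to scrutinize.

```r
library(dplyr)

# Toy input: one row per broad_histology value we want to exercise
histologies <- data.frame(
  broad_histology = c(
    "Neuronal and mixed neuronal-glial tumor",
    "Diffuse astrocytic and oligodendroglial tumor",
    "Germ cell tumor",
    "Benign tumor",
    "Non-tumor",
    "Ependymal tumor"
  ),
  stringsAsFactors = FALSE
)

lgat_groups <- c(
  "Neuronal and mixed neuronal-glial tumor",
  "Diffuse astrocytic and oligodendroglial tumor"
)

histologies <- histologies %>%
  dplyr::mutate(
    display_group = dplyr::case_when(
      # Rule 1: previously labeled as LGAT
      broad_histology %in% lgat_groups ~ "Low-grade astrocytic tumor",
      # Rule 2: germ cell tumors join the Non-CNS group
      broad_histology == "Germ cell tumor" ~ "Non-CNS tumor",
      # Rule 3: benign and non-tumor biospecimens
      broad_histology %in% c("Benign tumor", "Non-tumor") ~ "Benign",
      # Rule 4: everything else keeps its broad_histology label
      TRUE ~ broad_histology
    )
  )

print(histologies)
```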

What GitHub issue does your pull request address?

#897

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • How do we feel about this notebook and its ability to deal with upcoming releases' versions of pbta-histologies.tsv?
  • How do we feel about these broad histology groupings? In other words, how scientifically wrong are they? (Note that we've always known "high level" groupings will not be more accurate or specific; that is not the goal here. The goal is to be broad and have something easier to visualize in a plot.)

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes. The ones included can be.

Results

What is your summary of the results?

We have a list of colors and 16 + Normal groups. I have not tried it on an official plot yet. (upcoming PR)

Reproducibility Checklist

Documentation Checklist

  • This analysis module has a README and it is up to date. -- No, this will be in a subsequent PR. The README will require quite a bit of updating.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date. -- It is not, because it's not located in the analyses folder.
  • The analytical code is documented and contains comments.

@cansavvy cansavvy changed the title from "WIP: Mapping high level histology color labels -- Part 1" to "Mapping high level histology color labels -- Part 1" Jan 12, 2021
@cansavvy cansavvy marked this pull request as ready for review January 12, 2021 21:44
@cansavvy cansavvy requested a review from jashapiro January 12, 2021 21:51
Member

@jashapiro jashapiro left a comment

This looks like a good start. I think my main comment is that we should try to be a bit more intentional about grouping things before resorting to "Other". There may well be reasonable groupings that take some of the small-count histologies and link them to large-count histologies that they are related to rather than just to other small-count groups. It may be helpful to have a table which shows the integrated_diagnosis values for each broad_histology to assist with such grouping.
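
A minimal sketch of such a table, assuming `broad_histology` and `integrated_diagnosis` are columns of the histologies data frame. The rows and diagnosis labels below are placeholders for illustration only.

```r
library(dplyr)

# Toy stand-in for pbta-histologies.tsv (diagnosis labels are placeholders)
histologies <- data.frame(
  broad_histology = c("Embryonal tumor", "Embryonal tumor",
                      "Embryonal tumor", "Lymphoma"),
  integrated_diagnosis = c("diagnosis A", "diagnosis A",
                           "diagnosis B", "diagnosis C"),
  stringsAsFactors = FALSE
)

# One row per broad_histology / integrated_diagnosis pair with its sample count,
# largest diagnoses first within each broad_histology
diagnosis_table <- histologies %>%
  dplyr::count(broad_histology, integrated_diagnosis) %>%
  dplyr::arrange(broad_histology, dplyr::desc(n))

print(diagnosis_table)
```

Printing the whole table, rather than a summary of it, keeps the small-count diagnoses visible next to the larger histologies they might be merged into.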

My other suggestions are mostly about display. This is a notebook for internal evaluation, so having more easily visible data at each stage in tables is more valuable than some of the summaries or figures.

I also suggest treating "Other" as a special group, with a defined color that remains constant.

With regards to your question:

> For ease of downstream use, I'm wondering if we should drop everything that's not Kids_First_Biospecimen_ID, summary_group, and hex_codes, because downstream we'll need to drop everything else.

I think we should keep as much information as possible for now. It is potentially easier to evaluate groupings using the full table, and the cost of keeping it is essentially zero.

summary_groups <- histology_table %>%
  dplyr::count(summary_group)

nrow(summary_groups)
Member

I would here again print out the full table to make it easier to evaluate what has happened.

set.seed(2021)

# Sample from the 18 colors
subset_colors <- sample(color_palette, n_colors)
Member

One thing when making the colors is that I feel like we should treat "Other" as special, and always assign it a standard color, something neutral & grey like #A0A0A0 (though I am not sure how that fits with colorblind-friendly schemes).

Collaborator Author

Alternatively, we could drop this "Other" group from the large summary figures.

Collaborator

We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.

@cansavvy
Collaborator Author

@jharenza , we could use some more domain knowledge about these histologies at this point.
In particular, which of these display_groups (which are mostly from broad_histology) would be most reasonable to combine into other groups?

| display_group | n |
| --- | ---: |
| Lymphoma | 2 |
| Melanocytic tumor | 2 |
| Other tumor | 2 |
| Other astrocytic tumor | 6 |
| Metastatic tumors | 9 |
| Tumor of pineal region | 10 |
| Histiocytic tumor | 12 |
| Choroid plexus tumor | 21 |
| Pre-cancerous lesion | 27 |
| Non-CNS tumor | 29 |
| Mesenchymal non-meningothelial tumor | 48 |
| Meningioma | 59 |
| Tumors of sellar region | 73 |
| Benign | 78 |
| Tumor of cranial and paraspinal nerves | 83 |
| Ependymal tumor | 174 |
| Embryonal tumor | 339 |
| NA | 833 |
| Low-grade astrocytic tumor and adjacent | 1034 |

@jaclyn-taroni
Member

Without looking at the code, I wanted to point out that the NA should be normal samples. So you may want to include some kind of filtering to tumor samples if you have not already.

@cansavvy
Collaborator Author

> Without looking at the code, I wanted to point out that the NA should be normal samples. So you may want to include some kind of filtering to tumor samples if you have not already.

I did have that previously but with the changes I made most recently with Josh's review we lost that. I'll add it back.
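
A minimal sketch of that filter, assuming the histologies table has a `sample_type` column whose values include "Tumor" and "Normal". The column name, its values, and the rows below are assumptions for illustration, not taken from this thread.

```r
library(dplyr)

# Toy stand-in for pbta-histologies.tsv; normal samples have no broad_histology
histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_0001", "BS_0002", "BS_0003"),
  sample_type = c("Tumor", "Normal", "Tumor"),
  broad_histology = c("Embryonal tumor", NA, "Meningioma"),
  stringsAsFactors = FALSE
)

# Keep only tumor biospecimens before building display groups
tumor_histologies <- histologies %>%
  dplyr::filter(sample_type == "Tumor")

# Confirm that no NA histology labels remain after filtering
stopifnot(!any(is.na(tumor_histologies$broad_histology)))

print(tumor_histologies)
```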

@cansavvy cansavvy requested a review from jharenza January 13, 2021 15:06

There's a handful of very small groups (many are n = 2).

## Declare new equivalent groups
Collaborator Author

@jharenza this is where we set up the equivalent groups which are then recoded in the next section.

@jharenza
Collaborator

jharenza commented Jan 13, 2021

Hi @cansavvy -

> 1. "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases so we will group these into Low-grade astrocytic tumor in display_group.

This is not true. These are three separate entities and really should not be combined. The neuronal and mixed neuronal-glial tumors comprise many distinct tumors, so they had a wider array of short_histology values. The Diffuse astrocytic and oligodendroglial tumors were all HGAT (High-grade astrocytic tumors) and should be kept as a separate group. So, I would keep those three groups, especially since they account for the bulk of the tumors.

> In particular, which of these display_groups (which are mostly from broad_histology) would be most reasonable to combine into other groups?

I think we can combine:

Lymphoma	2
Melanocytic tumor	2
Other tumor	2
Metastatic tumors	9
Non-CNS tumor	29

to Non-CNS or other tumor

I think you can combine Other astrocytic tumor into LGAT for now, since it is a low-grade astrocytoma. Possibly, we can update that broad_histology in v19.

I think that will leave us with 15.
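
A rough sketch of the two merges suggested above, applied to the counts of the six affected groups from the table earlier in this thread. The exact label "Non-CNS or other tumor" and the use of `case_when()` are assumptions for illustration; the notebook may name or implement this differently.

```r
library(dplyr)

# Counts for the affected groups, copied from the display_group table above
group_counts <- data.frame(
  display_group = c("Lymphoma", "Melanocytic tumor", "Other tumor",
                    "Other astrocytic tumor", "Metastatic tumors",
                    "Non-CNS tumor"),
  n = c(2, 2, 2, 6, 9, 29),
  stringsAsFactors = FALSE
)

# Small groups proposed for merging into one combined label
non_cns_or_other <- c("Lymphoma", "Melanocytic tumor", "Other tumor",
                      "Metastatic tumors", "Non-CNS tumor")

regrouped <- group_counts %>%
  dplyr::mutate(
    display_group = dplyr::case_when(
      display_group %in% non_cns_or_other ~ "Non-CNS or other tumor",
      display_group == "Other astrocytic tumor" ~ "Low-grade astrocytic tumor",
      TRUE ~ display_group
    )
  ) %>%
  # Re-total the sample counts under the new labels
  dplyr::group_by(display_group) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::ungroup()

print(regrouped)
```

With these counts, the combined "Non-CNS or other tumor" group holds 44 biospecimens, and the 6 "Other astrocytic tumor" biospecimens are added to the low-grade astrocytic group.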

@cansavvy
Collaborator Author

@jharenza, thanks for these comments. I'm going to implement them and then re-request a review from you when it's ready.

@cansavvy
Collaborator Author

> I think that will leave us with 15.

This brings us to 16 + Normal. Are there any other combos that seem reasonable or should we leave it at that?

@cansavvy
Collaborator Author

I would be more prone to dropping Benign tumor first, then Other if we want to drop more.

If we are talking about dropping that completely from the figures, then I think we'd probably keep this data in the table for now, but change that in the figure scripts?

But I was more talking about groups we could combine into other groups.

@cansavvy cansavvy requested a review from jashapiro January 13, 2021 20:21
@jharenza
Collaborator

Ah, yeah I think that is as far as we can combine right now.

Collaborator

@jharenza jharenza left a comment

Hi @cansavvy! I made a few minor comments, but otherwise it looks good!

@jashapiro
Member

> We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.

This makes sense to me. Maybe make both Benign and Other "special" so that their colors aren't random?

@jharenza
Collaborator

Sure!
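
One way the "special" groups could be handled, as a sketch: reserve fixed colors for Benign and Other, and sample the remaining colors from the palette as the notebook already does. The group list, the palette values, and the light grey chosen for Benign are placeholders; only `#A0A0A0` for Other was floated in this thread.

```r
# Hypothetical inputs: a few display groups and a palette of arbitrary hex values
display_groups <- c("Benign", "Embryonal tumor", "Ependymal tumor",
                    "Meningioma", "Other")
color_palette <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e")

# Reserved colors that never change between runs (Benign value is a placeholder)
special_colors <- c("Other" = "#A0A0A0", "Benign" = "#D3D3D3")

# Only the non-special groups receive sampled colors
random_groups <- setdiff(display_groups, names(special_colors))

set.seed(2021)
sampled_colors <- sample(color_palette, length(random_groups))

# Named vector holding sampled colors plus the fixed special ones
group_colors <- c(stats::setNames(sampled_colors, random_groups), special_colors)

print(group_colors)
```

Because the special groups are excluded before sampling, changing the seed or the palette reshuffles the tumor group colors without touching the greys.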

Member

@jashapiro jashapiro left a comment

Looks good. Just a question about what to do with "Normal" colorwise, and some concern about capitalization consistency between files.

@cansavvy cansavvy requested a review from jashapiro January 15, 2021 17:38
Member

@jashapiro jashapiro left a comment

One typo, but looks good!

4 participants