Mapping high level histology color labels -- Part 1 #899

cansavvy · 2021-01-12T18:29:27Z

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR addresses Step 1 of 4 discussed in #897. It adds the main notebook described there, in which we attempt to group biospecimens into "high level" histology groups that will be more practical for plotting purposes.

Note that subsequent PRs will deal with updating the README and retiring histology_color_palette.tsv in favor of the mapping table, palettes/histology_label_color_table.tsv, that is created in this notebook.

What was your approach?

Copied from the notebook added in this PR itself:

# Purpose: 

The histology label variables included in `pbta-histologies.tsv` from data releases are not always useful for visualizing the full set of biospecimens due to the large number of different values.
Having too many different possible values makes the colors harder to distinguish.
In addition, there are some groups that are represented by only a very few samples; giving such groups a distinct color may be counterproductive.

The goal of this notebook is to use the currently existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.

## The output table

The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains the following fields:

**Copied from `pbta-histologies.tsv`**:    
- `Kids_First_Biospecimen_ID` (from `pbta-histologies.tsv`)  
- All the original histology label variables (`broad_histology`, `short_histology`, etc.)  
  
**Created in this notebook**:  
- `display_group` - the high-level histology labels that should be used for plotting  
- `hex_codes` - the hex color codes that should be used for plotting  

With this info, `histology_label_color_table.tsv` can be used by all plots and figures that summarize high level data while displaying histology information.

# How `display_group` is made:

Here's how `broad_histology` groups are [combined into the higher-level groupings of `display_group`](#declare-new-equivalent-groups).

1) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases so we will group these into `Low-grade astrocytic tumor` in `display_group`. 

2) `Germ cell tumor` samples are combined into the `Non-CNS tumor` group. 

3) `Benign tumor` and `Non-tumor` biospecimens are combined into a `Benign` group. 

4) Anything not in the above categories gets its `broad_histology` label carried over.
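
The four rules above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy data frame `histology_df` and its biospecimen IDs are hypothetical stand-ins for `pbta-histologies.tsv`, and it assumes `dplyr` is installed.

```r
# Hypothetical toy rows standing in for pbta-histologies.tsv
histology_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5", "BS_6"),
  broad_histology = c(
    "Neuronal and mixed neuronal-glial tumor",
    "Diffuse astrocytic and oligodendroglial tumor",
    "Germ cell tumor",
    "Benign tumor",
    "Non-tumor",
    "Ependymal tumor"
  ),
  stringsAsFactors = FALSE
)

# Recode broad_histology into display_group, one branch per rule
histology_df <- dplyr::mutate(
  histology_df,
  display_group = dplyr::case_when(
    # Rule 1: two groups folded into the low-grade astrocytic label
    broad_histology %in% c(
      "Neuronal and mixed neuronal-glial tumor",
      "Diffuse astrocytic and oligodendroglial tumor"
    ) ~ "Low-grade astrocytic tumor",
    # Rule 2: germ cell tumors go to the Non-CNS group
    broad_histology == "Germ cell tumor" ~ "Non-CNS tumor",
    # Rule 3: benign and non-tumor biospecimens share one group
    broad_histology %in% c("Benign tumor", "Non-tumor") ~ "Benign",
    # Rule 4: everything else keeps its broad_histology label
    TRUE ~ broad_histology
  )
)

print(histology_df[, c("broad_histology", "display_group")])
```

Keeping every rule in a single `case_when()` call makes it easy to add, remove, or reorder groupings as the definitions change in review.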

What GitHub issue does your pull request address?

#897

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

- How do we feel about this notebook and its ability to deal with upcoming releases' versions of pbta-histologies.tsv?
- How do we feel about these broad histology groupings? In other words, how scientifically wrong are they? (Note that we've always known "high level" groupings will not be more accurate or specific; that is not the goal here. The goal is to be broad and have something easier to visualize in a plot.)

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes. The ones included can be.

Results

What is your summary of the results?

We have a list of colors and 16 + Normal groups. I have not tried it on an official plot yet. (upcoming PR)

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile. -- no new packages are needed.
This analysis has been added to continuous integration. -- I've added it at the very top because the first few CI tests will need it once #898 (Make mapping histology groups for plotting -- Part 2 implementing the use of the mapping table) is addressed and those plots use the table created here.

Documentation Checklist

This analysis module has a README and it is up to date. -- No, this will be in a subsequent PR. The README will require quite a bit of updating.
This analysis is recorded in the table in analyses/README.md and the entry is up to date. -- It is not, because it's not located in the analyses folder.
The analytical code is documented and contains comments.

figures/mapping-histology-labels.Rmd

jashapiro

This looks like a good start. I think my main comment is that we should try to be a bit more intentional about grouping things before resorting to "Other". There may well be reasonable groupings that take some of the small-count histologies and link them to large-count histologies that they are related to rather than just to other small-count groups. It may be helpful to have a table which shows the integrated_diagnosis values for each broad_histology to assist with such grouping.

My other suggestions are mostly about display. This is a notebook for internal evaluation, so having more easily visible data at each stage in tables is more valuable than some of the summaries or figures.

I also suggest treating "Other" as a special group, with a defined color that remains constant.

With regards to your question:

> For ease of downstream use, I'm wondering if we should drop everything that's not Kids_First_Biospecimen_ID, summary_group, and hex_codes, because downstream we'll need to drop everything.

I think we should keep as much information as possible for now. It is potentially easier to evaluate groupings using the full table, and the cost of keeping it is essentially zero.

jashapiro · 2021-01-13T02:39:34Z

figures/mapping-histology-labels.Rmd

+summary_groups <- histology_table %>% 
+  dplyr::count(summary_group)
+
+nrow(summary_groups)


I would here again print out the full table to make it easier to evaluate what has happened.
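
A minimal sketch of what printing the full table could look like, assuming `dplyr` is installed. The toy `histology_table` below is hypothetical; only the `summary_group` column name and the `dplyr::count()` call come from the diff above.

```r
# Hypothetical toy data; the real histology_table comes from pbta-histologies.tsv
histology_table <- data.frame(
  summary_group = c(
    "Ependymal tumor", "Ependymal tumor", "Meningioma",
    "Lymphoma", "Ependymal tumor", "Meningioma"
  ),
  stringsAsFactors = FALSE
)

# Count biospecimens per group, as in the diff
summary_groups <- dplyr::count(histology_table, summary_group, name = "n")

# Sort so the smallest groups are listed first
summary_groups <- dplyr::arrange(summary_groups, n)

# Show every row rather than only the number of groups
print(summary_groups, row.names = FALSE)
```

Sorting by `n` puts the small-count groups, the likeliest candidates for merging, at the top of the printed table.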

jashapiro · 2021-01-13T13:55:07Z

figures/mapping-histology-labels.Rmd

+set.seed(2021)
+
+# Sample from the 18 colors
+subset_colors <- sample(color_palette, n_colors)


One thing when making the colors is that I feel like we should treat "Other" as special, and always assign it a standard color, something neutral & grey like #A0A0A0 (though I am not sure how that fits with colorblind-friendly schemes).

Alternatively, we could drop this "Other" group from the large summary figures.

cansavvy · 2021-01-13T14:59:06Z

@jharenza , we could use some more domain knowledge about these histologies at this point.
In particular, which of these display_groups (which are mostly from broad_histology) would be most reasonable to combine into other groups?

| display_group | n |
| --- | --- |
| Lymphoma | 2 |
| Melanocytic tumor | 2 |
| Other tumor | 2 |
| Other astrocytic tumor | 6 |
| Metastatic tumors | 9 |
| Tumor of pineal region | 10 |
| Histiocytic tumor | 12 |
| Choroid plexus tumor | 21 |
| Pre-cancerous lesion | 27 |
| Non-CNS tumor | 29 |
| Mesenchymal non-meningothelial tumor | 48 |
| Meningioma | 59 |
| Tumors of sellar region | 73 |
| Benign | 78 |
| Tumor of cranial and paraspinal nerves | 83 |
| Ependymal tumor | 174 |
| Embryonal tumor | 339 |
| NA | 833 |
| Low-grade astrocytic tumor and adjacent | 1034 |

jaclyn-taroni · 2021-01-13T15:00:37Z

Without looking at the code, I wanted to point out that the NA should be normal samples. So you may want to include some kind of filtering to tumor samples if you have not already.

cansavvy · 2021-01-13T15:01:33Z

> Without looking at the code, I wanted to point out that the NA should be normal samples. So you may want to include some kind of filtering to tumor samples if you have not already.

I did have that previously but with the changes I made most recently with Josh's review we lost that. I'll add it back.
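
A rough sketch of the two options mentioned here, assuming `dplyr` is installed and, as stated above, that an `NA` histology marks a normal sample. The toy `histology_df` and the "Normal" label text are illustrative assumptions, not the notebook's actual code.

```r
# Hypothetical toy rows; an NA broad_histology is assumed to mark a normal sample
histology_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3"),
  broad_histology = c("Ependymal tumor", NA, "Meningioma"),
  stringsAsFactors = FALSE
)

# Option A: keep every row and give the NA rows an explicit label
labeled_df <- dplyr::mutate(
  histology_df,
  display_group = dplyr::if_else(is.na(broad_histology), "Normal", broad_histology)
)

# Option B: restrict the table to tumor samples before counting groups
tumor_df <- dplyr::filter(histology_df, !is.na(broad_histology))

print(labeled_df)
print(tumor_df)
```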

cansavvy · 2021-01-13T15:13:42Z

figures/mapping-histology-labels.Rmd

+
+There's handful of very small groups (many are n = 2). 
+
+## Declare new equivalent groups


@jharenza this is where we set up the equivalent groups which are then recoded in the next section.

jharenza · 2021-01-13T18:27:15Z

Hi @cansavvy -

> "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases so we will group these into Low-grade astrocytic tumor in display_group.

This is not true. These are three separate entities and really should not be combined. The neuronal and mixed glial tumors comprise many distinct tumors, so they had a wider array of short_histology values. The Diffuse astrocytic and oligodendroglial tumors were all HGAT (High-grade astrocytic tumors) and should be kept as a separate group. So, I would keep those three groups, especially since they contain the bulk of the tumors.

> In particular, which of these display_groups (which are mostly from broad_histology) would be most reasonable to combine into other groups?

I think we can combine:

| display_group | n |
| --- | --- |
| Lymphoma | 2 |
| Melanocytic tumor | 2 |
| Other tumor | 2 |
| Metastatic tumors | 9 |
| Non-CNS tumor | 29 |

to Non-CNS or other tumor

I think you can combine Other astrocytic tumor into LGAT for now, since it is a low-grade astrocytoma. Possibly, we can update that broad_histology in v19.

I think that will leave us with 15.

cansavvy · 2021-01-13T19:31:13Z

@jharenza, thanks for these comments. I'm going to implement them and then re-request you for a review when it's ready.

cansavvy · 2021-01-13T19:42:38Z

> I think that will leave us with 15.

This brings us to 16 + Normal. Are there any other combos that seem reasonable or should we leave it at that?

cansavvy · 2021-01-13T20:17:07Z

> I would be more prone to dropping Benign tumor first, then Other if we want to drop more.

If we are talking about dropping that completely from the figures, then I think we'd probably keep this data in the table for now, but change that in the figure scripts?

But I was more talking about groups we could combine into other groups.

jharenza · 2021-01-13T20:30:39Z

Ah, yeah I think that is as far as we can combine right now.

jharenza

Hi @cansavvy! I made a few minor comments, but otherwise it looks good!

jharenza · 2021-01-14T13:51:57Z

figures/mapping-histology-labels.Rmd

+set.seed(2021)
+
+# Sample from the 18 colors
+subset_colors <- sample(color_palette, n_colors)


We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.

jashapiro · 2021-01-14T14:07:22Z

> We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.

This makes sense to me. Maybe make both Benign and Other "special" so that their colors aren't random?
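
One way this could look, as a hedged sketch in base R rather than the notebook's actual code: the "special" groups get fixed greys and only the remaining groups draw from the sampled palette. The group names, the palette values, and the grey chosen for Benign are illustrative assumptions; `#A0A0A0` for Other is the value floated earlier in this thread.

```r
set.seed(2021)

# Hypothetical subset of display groups
display_groups <- c("Benign", "Embryonal tumor", "Ependymal tumor", "Meningioma", "Other")

# Fixed colors for the special groups; names are groups, values are hex codes
special_colors <- c("Benign" = "#C0C0C0", "Other" = "#A0A0A0")

# Illustrative palette for everything else
color_palette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2")

# Only the non-special groups get a randomly sampled color
random_groups <- setdiff(display_groups, names(special_colors))
random_colors <- sample(color_palette, length(random_groups))
names(random_colors) <- random_groups

# Combine and reorder to match display_groups
hex_codes <- c(special_colors, random_colors)[display_groups]

color_table <- data.frame(
  display_group = display_groups,
  hex_codes = unname(hex_codes),
  stringsAsFactors = FALSE
)

print(color_table)
```

Because the special groups are assigned before sampling, changing the seed or the palette size does not move Benign or Other off their greys.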

jharenza · 2021-01-14T14:36:35Z

Sure!

jashapiro

Looks good. Just a question about what to do with "Normal" colorwise, and some concern about capitalization consistency between files.

figures/mapping-histology-labels.Rmd

jashapiro

One typo, but looks good!

Push the basic notebook so far

e708bb1

cansavvy commented Jan 12, 2021

cansavvy added 3 commits January 12, 2021 15:40

Make color palette smaller and move set.seed()

d2e0fc7

Polishing polishing polishing

f571738

One more formatting edit

15a3651

cansavvy changed the title ~~WIP: Mapping high level histology color labels -- Part 1~~ Mapping high level histology color labels -- Part 1 Jan 12, 2021

Add to CI testing

a02c616

cansavvy marked this pull request as ready for review January 12, 2021 21:44

cansavvy added 3 commits January 12, 2021 16:44

Merge branch 'master' into cansavvy/mapping-table

bc4935a

Fix formatting in CI test file whoops

5e00fe2

Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…

4580439

…ansavvy/mapping-table

cansavvy requested a review from jashapiro January 12, 2021 21:51

cansavvy commented Jan 13, 2021

jashapiro reviewed Jan 13, 2021

cansavvy added 5 commits January 13, 2021 09:17

Incorporate part of jashapiro review

959b683

Push a few other changes from jashapiro review

a173f91

re-run notebook

a3c165b

Update documentation

81b3004

Merge branch 'master' into cansavvy/mapping-table

f339b42

cansavvy added 2 commits January 13, 2021 10:04

Add back the Normal labels

566a7c6

Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…

3ceaaf3

…ansavvy/mapping-table

cansavvy requested a review from jharenza January 13, 2021 15:06

cansavvy commented Jan 13, 2021

Switch to jharenza recomendations (also do capitalization thing)

4e4a7f0

cansavvy requested a review from jashapiro January 13, 2021 20:21

Merge branch 'master' into cansavvy/mapping-table

8896f0a

jharenza approved these changes Jan 14, 2021

cansavvy added 4 commits January 14, 2021 10:28

Incorporate jharenza review -- make Other and Benign gray

709e623

Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…

a0603b7

…ansavvy/mapping-table

Some tidyverse rearranging because hex_codes weren't actually saving …

6ac8a8e

…whoops

Okay had my hex_code list/names backwards now we're good

35a0700

This was referenced Jan 14, 2021

Using mapping histology groups for plotting -- Part 2 implemention (PR 1 of 4) #904

Merged

Using mapping histology groups for plotting -- Part 2 implemention (PR 2 of 4) #911

Merged

jashapiro reviewed Jan 15, 2021

cansavvy added 2 commits January 15, 2021 12:18

Don't do tolower on all those variables!

a479b34

Make Normal samples have hex code NA

9e79a70

cansavvy requested a review from jashapiro January 15, 2021 17:38

cansavvy added 2 commits January 15, 2021 13:06

Use black hex code for normal

242c21e

re-run notebook again

449cb0f

jashapiro reviewed Jan 15, 2021

jashapiro approved these changes Jan 15, 2021

Fix typo and re-run

ec68189

jaclyn-taroni merged commit f574c9d into AlexsLemonade:master Jan 15, 2021

This was referenced Jan 20, 2021

Using mapping histology groups for plotting implementation (PR 3 of 4) #918

Merged

Using mapping histology groups for transcriptomic-overview plot (PR 4 of 5) #921

Merged

cansavvy deleted the cansavvy/mapping-table branch January 25, 2021 16:25

This was referenced Jan 25, 2021

Using mapping histology groups for plotting implementation (PR 5 of 5) #927

Merged

Splitting up #921: GSEA module changes #928

Merged

Splitting up #921: Immune-deconv changes #929

Merged

cbethell mentioned this pull request Apr 5, 2021

Refactor oncoprint module to include plotting by broad histology #983

Merged
