This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Mapping high level histology color labels -- Part 1 #899

Merged
merged 27 commits into from
Jan 15, 2021
Changes from 15 commits
e708bb1
Push the basic notebook so far
cansavvy Jan 12, 2021
d2e0fc7
Make color palette smaller and move set.seed()
cansavvy Jan 12, 2021
f571738
Polishing polishing polishing
cansavvy Jan 12, 2021
15a3651
One more formatting edit
cansavvy Jan 12, 2021
a02c616
Add to CI testing
cansavvy Jan 12, 2021
bc4935a
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 12, 2021
5e00fe2
Fix formatting in CI test file whoops
cansavvy Jan 12, 2021
4580439
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 12, 2021
959b683
Incorporate part of jashapiro review
cansavvy Jan 13, 2021
a173f91
Push a few other changes from jashapiro review
cansavvy Jan 13, 2021
a3c165b
re-run notebook
cansavvy Jan 13, 2021
81b3004
Update documentation
cansavvy Jan 13, 2021
f339b42
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 13, 2021
566a7c6
Add back the `Normal` labels
cansavvy Jan 13, 2021
3ceaaf3
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 13, 2021
4e4a7f0
Switch to jharenza recomendations (also do capitalization thing)
cansavvy Jan 13, 2021
465477e
Get rid of small_groups_cutoff remnant
cansavvy Jan 13, 2021
8896f0a
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 14, 2021
709e623
Incorporate jharenza review -- make Other and Benign gray
cansavvy Jan 14, 2021
a0603b7
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 14, 2021
6ac8a8e
Some tidyverse rearranging because hex_codes weren't actually saving …
cansavvy Jan 14, 2021
35a0700
Okay had my hex_code list/names backwards now we're good
cansavvy Jan 14, 2021
a479b34
Don't do tolower on all those variables!
cansavvy Jan 15, 2021
9e79a70
Make Normal samples have hex code NA
cansavvy Jan 15, 2021
242c21e
Use black hex code for normal
cansavvy Jan 15, 2021
449cb0f
re-run notebook again
cansavvy Jan 15, 2021
ec68189
Fix typo and re-run
cansavvy Jan 15, 2021
8 changes: 6 additions & 2 deletions .circleci/config.yml
@@ -16,6 +16,10 @@ jobs:
name: List Data Directory Contents
command: ./scripts/run_in_ci.sh ls data/testing

- run:
name: High level histology grouping for plot labels
command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)"

- run:
name: Sample Distribution Analyses
command: ./scripts/run_in_ci.sh bash "analyses/sample-distribution-analysis/run-sample-distribution.sh"
@@ -191,10 +195,10 @@ jobs:
- run:
name: Gene set enrichment analysis to generate GSVA scores
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

- run:
name: Gene set enrichment analysis to generate GSVA scores FOR BASE SUBTYPING
command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

- run:
name: Add Shatterseek
232 changes: 232 additions & 0 deletions figures/mapping-histology-labels.Rmd
@@ -0,0 +1,232 @@
---
title: "Mapping histology labels for plots"
output:
html_notebook:
toc: true
toc_float: true
author: Candace Savonen for ALSF - CCDL
date: 2020
---

# Purpose:

The histology label variables included in `pbta-histologies.tsv` from data releases are not always useful for visualizing the full set of biospecimens because they take a large number of different values.
Having too many possible values makes the colors harder to distinguish.
In addition, some groups are represented by only a few samples; giving such groups a distinct color may be counterproductive.

The goal of this notebook is to use the existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.

## The output table

The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains the following fields:

**Copied from `pbta-histologies.tsv`**:
- `Kids_First_Biospecimen_ID`
- All the original histology label variables (`broad_histology`, `short_histology`, etc.)

**Created in this notebook**:
- `display_group` - the high-level histology labels that should be used for plotting
- `hex_codes` - the hex color codes that should be used for plotting

With this info, `histology_label_color_table.tsv` can be used by all plots and figures that summarize high level data while displaying histology information.
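
As a rough sketch of how a plotting script might consume this table, the base R snippet below builds a named color vector with one entry per `display_group`. The data frame is a made-up stand-in for a few rows of the TSV (the IDs and colors are illustrative, not taken from a data release), and `group_palette` is a hypothetical name.

```r
# Toy stand-in for a few rows of histology_label_color_table.tsv
# (IDs, groups, and colors are made up for illustration)
color_table <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  display_group = c("Benign", "Non-CNS tumor", "Benign", "Normal"),
  hex_codes = c("#737899", "#0085a6", "#737899", "#000000"),
  stringsAsFactors = FALSE
)

# Keep one row per group, then build a named vector:
# names are the groups, values are the hex codes
group_colors <- unique(color_table[, c("display_group", "hex_codes")])
group_palette <- setNames(group_colors$hex_codes, group_colors$display_group)

# Count biospecimens per group and look up each group's color by name
group_counts <- table(color_table$display_group)
bar_colors <- group_palette[names(group_counts)]
```

A named vector like `group_palette` can be passed as the `values` argument of `ggplot2::scale_fill_manual()` or `ggplot2::scale_color_manual()`, or `bar_colors` can be used directly as `col` in base graphics.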

# How `display_group` is made:

Here's how `broad_histology` groups are [combined into the higher-level groupings of `display_group`](#declare-new-equivalent-groups).

1) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases, so we group these with `Low-grade astrocytic tumor` into `Low-grade astrocytic tumor and adjacent` in `display_group`.

2) `Germ cell tumor` samples are combined into the `Non-CNS tumor` group.

3) `Benign tumor` and `Non-tumor` biospecimens are combined into a `Benign` group.

4) Anything not in the above categories gets its `broad_histology` label carried over.
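
The four rules above can be sketched as a plain lookup in base R. This is only an illustration on made-up input: `"Some other histology"` is a hypothetical label that no rule touches, `NA` stands in for a sample without a histology, and the combined LGAT label is the one used in the notebook's recoding chunk. The notebook itself does the recoding with `forcats::fct_collapse()`.

```r
# Illustrative broad_histology values (not real data)
broad_histology <- c(
  "Neuronal and mixed neuronal-glial tumor",
  "Germ cell tumor",
  "Non-tumor",
  "Some other histology",
  NA
)

# Rules 1-3 as a lookup: names are the old labels, values are the new groups
regroup <- c(
  "Low-grade astrocytic tumor" = "Low-grade astrocytic tumor and adjacent",
  "Neuronal and mixed neuronal-glial tumor" = "Low-grade astrocytic tumor and adjacent",
  "Diffuse astrocytic and oligodendroglial tumor" = "Low-grade astrocytic tumor and adjacent",
  "Germ cell tumor" = "Non-CNS tumor",
  "Benign tumor" = "Benign",
  "Non-tumor" = "Benign"
)

# Rule 4: start from the original label, then overwrite where a rule applies
display_group <- broad_histology
has_rule <- broad_histology %in% names(regroup)
display_group[has_rule] <- unname(regroup[broad_histology[has_rule]])

# Samples without a histology are treated as Normal, as in the notebook
display_group[is.na(display_group)] <- "Normal"
```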

# Usage

This notebook can be run via the command line from the top directory of the
repository as follows:

```
Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd',
clean = TRUE)"
```

## Set Up

```{r}
# Magrittr pipe
`%>%` <- dplyr::`%>%`
```

Groups with fewer biospecimens than this cutoff will be put into an `Other` group in `display_group`.

```{r}
small_groups_cutoff <- 12
```
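
One way such a cutoff could be applied is sketched below in base R. This is an assumption about the intended behavior rather than code from the notebook; the group labels and sizes are made up, and "smaller than" is read as strictly fewer biospecimens than the cutoff.

```r
small_groups_cutoff <- 12

# Made-up group labels and sizes for illustration
display_group <- c(rep("Group A", 20), rep("Group B", 12), rep("Group C", 2))

# Count biospecimens per group and find the groups below the cutoff
group_sizes <- table(display_group)
small_groups <- names(group_sizes)[group_sizes < small_groups_cutoff]

# Relabel members of the small groups as "Other"
display_group[display_group %in% small_groups] <- "Other"
```

With these toy sizes only `Group C` falls below the cutoff; `Group B` has exactly 12 members and keeps its own label.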

### Directories and Files

```{r}
# Path to input directory
input_dir <- file.path("..", "data")
# Path to output directory
output_dir <- "palettes"
```

# Read in metadata

Which variables are we keeping for this table?

```{r}
histology_variables <-
  c("Kids_First_Biospecimen_ID",
    "sample_type",
    "integrated_diagnosis",
    "Notes",
    "harmonized_diagnosis",
    "broad_histology",
    "short_histology")
```

Let's read in the current release's `pbta-histologies.tsv` file and select the histology variables we mentioned above.

```{r}
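metadata <-
  readr::read_tsv(file.path(input_dir, "pbta-histologies.tsv"), guess_max = 10000) %>%
  # all_of() makes it explicit that `histology_variables` is a character
  # vector of column names rather than a column in the data
  dplyr::select(dplyr::all_of(histology_variables))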
```

# Take a look at how many biospecimens per `broad_histology` group

Let's summarize `broad_histology`.
Because the `Normal` samples don't have histologies, we'll look at just the `Tumor` samples for this summary.

```{r}
broad_summary <- metadata %>%
  dplyr::filter(sample_type == "Tumor") %>%
  dplyr::count(broad_histology) %>%
  dplyr::arrange(n)
```

Let's print out the summary.

```{r}
broad_summary %>%
  knitr::kable()
```

There are a handful of very small groups (many have n = 2).

## Declare new equivalent groups
Collaborator Author

@jharenza this is where we set up the equivalent groups which are then recoded in the next section.


Previously, samples with these labels were generally considered `LGAT` groups, so we will group them back together in `display_group`.

```{r}
lgat_adjacent <- c("Low-grade astrocytic tumor",
                   "Neuronal and mixed neuronal-glial tumor",
                   "Diffuse astrocytic and oligodendroglial tumor")
```

We'll combine these groups as non-CNS.

```{r}
non_cns <- c("Non-CNS tumor", "Germ cell tumor")
```

We'll combine these groups as benign.

```{r}
benign <- c("Benign tumor", "Non-tumor")
```

# Make new `display_group`

```{r}
histology_table <- metadata %>%
  dplyr::mutate(
    # NAs are really Normals
    display_group = tidyr::replace_na(broad_histology, "Normal"),
    # Now do the group combining
    display_group = forcats::fct_collapse(display_group,
      "Low-grade astrocytic tumor and adjacent" = lgat_adjacent,
      "Non-CNS tumor" = non_cns,
      "Benign" = benign
    )
  )
```

Print out the number of biospecimens in each `display_group` (including `Normal`).

```{r}
display_group_df <- histology_table %>%
  dplyr::count(display_group) %>%
  dplyr::arrange(n)

knitr::kable(display_group_df)
```

Make this notebook stop if there are more than 15 groups.

```{r}
if (nrow(display_group_df) > 15) {
  stop("There are more than 15 `display_group` values; you may want to re-evaluate the high-level histology groupings")
}
```

# Add hex codes

These hex codes were retrieved from http://phrogz.net/css/distinct-colors.html using the default settings for 18 colors.

```{r}
color_palette <-
  c("#ff0000", "#cc0000", "#995200", "#bfb300", "#fffbbf",
    "#2e7300", "#00e65c", "#00ffee", "#103d40", "#0085a6",
    "#003380", "#4073ff", "#737899", "#70008c", "#f2b6ee",
    "#ff40bf", "#8c0038", "#330d12"
  )
```

Declare how many colors we need.

```{r}
n_colors <- nrow(display_group_df)
```

Make a named vector color key where the `display_group` values are the names.

```{r}
# Set seed so the colors are consistent upon re-run
set.seed(2021)

# Sample from the 18 colors
subset_colors <- sample(color_palette, n_colors)
Member

One thing when making the colors is that I feel like we should treat "Other" as special, and always assign it a standard color, something neutral & grey like #A0A0A0 (though I am not sure how that fits with colorblind-friendly schemes).

Collaborator Author

Alternatively, we could drop this "Other" group from the large summary figures.

Collaborator

We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.

names(subset_colors) <- display_group_df$display_group
```
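
The review suggestion to give a group such as `Other` a fixed neutral color could look roughly like the sketch below. It is a hypothetical variation, not the notebook's code: the group labels are illustrative, the palette is a subset of the one above, and `neutral_groups`, `neutral_color`, and `color_key` are made-up names. The gray `#A0A0A0` is the value floated in the review comment.

```r
# Hypothetical group labels; in the notebook these come from display_group_df
groups <- c("Benign", "Non-CNS tumor", "Other", "Some other histology")

# A few colors taken from the notebook's palette
color_palette <- c("#ff0000", "#995200", "#2e7300", "#0085a6", "#70008c")

# Groups that get a fixed neutral color instead of a sampled one
neutral_groups <- "Other"
neutral_color <- "#A0A0A0"

# Set seed so the sampled colors are consistent upon re-run
set.seed(2021)

# Sample distinct colors only for the groups that are not neutral
colored_groups <- setdiff(groups, neutral_groups)
color_key <- sample(color_palette, length(colored_groups))
names(color_key) <- colored_groups

# Append the neutral color under the reserved group name(s)
color_key <- c(
  color_key,
  setNames(rep(neutral_color, length(neutral_groups)), neutral_groups)
)
```

Keeping the neutral groups out of the sampling step means their color does not change when the seed, the palette, or the number of other groups changes.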

Use the `pie()` function to preview what these colors look like.

```{r}
pie(rep(1, n_colors),
    col = subset_colors,
    labels = display_group_df$display_group)
```

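Add the hex codes to the `histology_table`.

```{r}
histology_table <- histology_table %>%
  # Names of `subset_colors` are the `display_group` values to match,
  # and its values are the hex codes to assign
  dplyr::mutate(hex_codes = dplyr::recode(as.character(display_group), !!!subset_colors))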
```
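
For comparison, the same group-to-color lookup can be written in base R by indexing the named vector, followed by a quick sanity check on the result. This is a sketch on made-up values rather than part of the notebook. Note the `as.character()` call: indexing with a factor would use its integer codes instead of its labels.

```r
# Made-up color key and groups for illustration
subset_colors <- c("Normal" = "#103d40", "Benign" = "#737899")
display_group <- factor(c("Normal", "Benign", "Normal"))

# Look up each biospecimen's color by its group label
hex_codes <- unname(subset_colors[as.character(display_group)])

# Every biospecimen should end up with a six-digit hex color
all_valid <- all(grepl("^#[0-9A-Fa-f]{6}$", hex_codes))
if (!all_valid) {
  stop("Some `display_group` values were not assigned a hex code")
}
```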

## Save to TSV

```{r}
readr::write_tsv(histology_table, file.path(output_dir, "histology_label_color_table.tsv"))
```

# Session Info

```{r}
sessionInfo()
```
3,453 changes: 3,453 additions & 0 deletions figures/mapping-histology-labels.nb.html

Large diffs are not rendered by default.
