This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Mapping high level histology color labels -- Part 1 #899

Merged
merged 27 commits into from
Jan 15, 2021
Changes from 8 commits
Commits
e708bb1
Push the basic notebook so far
cansavvy Jan 12, 2021
d2e0fc7
Make color palette smaller and move set.seed()
cansavvy Jan 12, 2021
f571738
Polishing polishing polishing
cansavvy Jan 12, 2021
15a3651
One more formatting edit
cansavvy Jan 12, 2021
a02c616
Add to CI testing
cansavvy Jan 12, 2021
bc4935a
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 12, 2021
5e00fe2
Fix formatting in CI test file whoops
cansavvy Jan 12, 2021
4580439
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 12, 2021
959b683
Incorporate part of jashapiro review
cansavvy Jan 13, 2021
a173f91
Push a few other changes from jashapiro review
cansavvy Jan 13, 2021
a3c165b
re-run notebook
cansavvy Jan 13, 2021
81b3004
Update documentation
cansavvy Jan 13, 2021
f339b42
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 13, 2021
566a7c6
Add back the `Normal` labels
cansavvy Jan 13, 2021
3ceaaf3
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 13, 2021
4e4a7f0
Switch to jharenza recomendations (also do capitalization thing)
cansavvy Jan 13, 2021
465477e
Get rid of small_groups_cutoff remnant
cansavvy Jan 13, 2021
8896f0a
Merge branch 'master' into cansavvy/mapping-table
cansavvy Jan 14, 2021
709e623
Incorporate jharenza review -- make Other and Benign gray
cansavvy Jan 14, 2021
a0603b7
Merge remote-tracking branch 'cansavvy/cansavvy/mapping-table' into c…
cansavvy Jan 14, 2021
6ac8a8e
Some tidyverse rearranging because hex_codes weren't actually saving …
cansavvy Jan 14, 2021
35a0700
Okay had my hex_code list/names backwards now we're good
cansavvy Jan 14, 2021
a479b34
Don't do tolower on all those variables!
cansavvy Jan 15, 2021
9e79a70
Make Normal samples have hex code NA
cansavvy Jan 15, 2021
242c21e
Use black hex code for normal
cansavvy Jan 15, 2021
449cb0f
re-run notebook again
cansavvy Jan 15, 2021
ec68189
Fix typo and re-run
cansavvy Jan 15, 2021
8 changes: 6 additions & 2 deletions .circleci/config.yml
@@ -16,6 +16,10 @@ jobs:
          name: List Data Directory Contents
          command: ./scripts/run_in_ci.sh ls data/testing

      - run:
          name: High level histology grouping for plot labels
          command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)"

      - run:
          name: Sample Distribution Analyses
          command: ./scripts/run_in_ci.sh bash "analyses/sample-distribution-analysis/run-sample-distribution.sh"
@@ -199,10 +203,10 @@ jobs:
      - run:
          name: Gene set enrichment analysis to generate GSVA scores
          command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

      - run:
          name: Gene set enrichment analysis to generate GSVA scores FOR BASE SUBTYPING
          command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

      - run:
          name: Add Shatterseek
237 changes: 237 additions & 0 deletions figures/mapping-histology-labels.Rmd
@@ -0,0 +1,237 @@
---
title: "Mapping histology labels for plots"
output:
  html_notebook:
    toc: true
    toc_float: true
author: Candace Savonen for ALSF - CCDL
date: 2020
---

# Purpose:

The histology label variables included in `pbta-histologies.tsv` from data releases, while perhaps specific and accurate, are not useful for summarizing the biospecimens for visuals.

The goal of this notebook is to use the existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.

## The output table

The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains...

**Copied from `pbta-histologies.tsv`**:
- `Kids_First_Biospecimen_ID`
- All the original histology label variables (`broad_histology`, `short_histology`, etc.)

**Created in this notebook**:
- `summary_group` - the high-level histology labels that should be used for plotting
- `hex_codes` - the hex color codes that should be used for plotting

With this info, `histology_label_color_table.tsv` should be used by all plots and figures that need to summarize biospecimens at a high level.
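
As an illustration of how a downstream plot might consume this table, here is a minimal base R sketch. The data frame is a made-up stand-in for `histology_label_color_table.tsv` (the biospecimen IDs, group labels, and hex codes are invented for the example), and `color_key` is a hypothetical name, not something defined in this notebook.

```r
# Toy stand-in for palettes/histology_label_color_table.tsv (all values are made up)
histology_table <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_0001", "BS_0002", "BS_0003", "BS_0004"),
  summary_group = c("Embryonal tumor", "Other", "Embryonal tumor", "Normal"),
  hex_codes = c("#ff0000", "#808080", "#ff0000", "#000000"),
  stringsAsFactors = FALSE
)

# Keep one row per group so each summary_group maps to exactly one color
color_key_df <- unique(histology_table[, c("summary_group", "hex_codes")])

# Named vector: names are the group labels, values are the hex codes
color_key <- setNames(color_key_df$hex_codes, color_key_df$summary_group)

# With ggplot2, a vector like this could be passed to
# ggplot2::scale_fill_manual(values = color_key)
color_key
```

A named vector keeps the label-to-color mapping explicit, so every figure that uses it shows a given group in the same color regardless of which groups appear in that figure.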

# How `summary_group` is made:

Here's how `broad_histology` is [recoded into the high-level groupings of `summary_group`](#make-new-summary_group).

1) All biospecimens in `broad_histology` groups with fewer than `small_groups_cutoff` biospecimens will be grouped together as `Other` in `summary_group`.

2) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases, so we will group these into `Low-grade astrocytic tumor` in `summary_group`.

3) So it's clear why they are `NA`s, we will label all `Normal` samples (non-tumor samples) as `Normal` in `summary_group`.

4) Anything not in the above categories gets its `broad_histology` label carried over.

# Usage

This notebook can be run via the command line from the top directory of the
repository as follows:

```
Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd',
clean = TRUE)"
```

## Set Up

```{r}
# Magrittr pipe
`%>%` <- dplyr::`%>%`
```

Groups that are smaller than this cutoff will be put into an `Other` group in `summary_group`.

```{r}
small_groups_cutoff <- 12
```

### Directories and Files

```{r}
# Path to input directory
input_dir <- file.path("..", "data")

# Path to output directory
output_dir <- "palettes"
```

# Read in metadata

Which variables are we keeping for this table?

```{r}
histology_variables <-
  c("Kids_First_Biospecimen_ID",
    "sample_type",
    "integrated_diagnosis",
    "Notes",
    "harmonized_diagnosis",
    "broad_histology",
    "short_histology")
```

Let's read in the current release's `pbta-histologies.tsv` file and select the histology variables we mentioned above.

```{r}
metadata <-
  readr::read_tsv(file.path(input_dir, "pbta-histologies.tsv"), guess_max = 10000) %>%
  # all_of() makes it explicit that we are selecting by a character vector
  dplyr::select(dplyr::all_of(histology_variables))
```

# Take a look at how many biospecimens per `broad_histology` group

Let's summarize `broad_histology`.
Because the `Normal` samples don't have histologies, we'll look at just the `Tumor` samples for this summary.

```{r}
broad_summary <- metadata %>%
  dplyr::filter(sample_type == "Tumor") %>%
  dplyr::count(broad_histology)
```

Now let's get a vector of those small `broad_histology` groups based on the `small_groups_cutoff`.

```{r}
small_groups <- broad_summary %>%
  dplyr::filter(n < small_groups_cutoff) %>%
  dplyr::pull(broad_histology)
```

Let's take a look at how big all these groups are with a barplot.
We'll mark which small histology groups are going to be labeled as "Other".

```{r}
broad_summary %>%
  dplyr::mutate(now_labeled_other = broad_histology %in% small_groups) %>%
  ggplot2::ggplot(ggplot2::aes(x = reorder(broad_histology, -n), y = n, fill = now_labeled_other)) +
  ggplot2::geom_bar(position = "dodge", stat = "identity") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)) +
  ggplot2::xlab("broad_histology label") +
  ggplot2::ggtitle(paste("Broad histologies: groups <", small_groups_cutoff, "will be 'Other' in `summary_group`"))
```

There's a handful of very small groups (many are n = 2).

# LGAT adjacent groups

Previously, samples with these labels were generally considered `LGAT` groups, so we will group them back together in `summary_group`.

```{r}
lgat_adjacent <- c("Neuronal and mixed neuronal-glial tumor",
                   "Diffuse astrocytic and oligodendroglial tumor")
```

Stop if these strings aren't detected.

```{r}
if (!all(lgat_adjacent %in% metadata$broad_histology)) {
  stop("LGAT adjacent groups' text strings are not detected in this version of pbta-histologies' broad_histology variable")
}
```

# Make new `summary_group`

This [logic is described at the top of the notebook](#how-summary_group-is-made).

```{r}
histology_table <- metadata %>%
  dplyr::mutate(
    summary_group = dplyr::case_when(
      broad_histology %in% small_groups ~ "Other",
      broad_histology %in% lgat_adjacent ~ "Low-grade astrocytic tumor",
      sample_type == "Normal" ~ "Normal",
      TRUE ~ broad_histology)
  )
```

Print out the number of `summary_groups` (including `Normal`)!

```{r}
summary_groups <- histology_table %>%
dplyr::count(summary_group)

nrow(summary_groups)
```

Review comment (Member): I would here again print out the full table to make it easier to evaluate what has happened.
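
One way the review suggestion could be addressed is sketched below in base R. The counts are hypothetical stand-ins for the notebook's `summary_groups` table, and `summary_groups_sorted` is a name introduced only for this illustration.

```r
# Hypothetical counts standing in for the notebook's `summary_groups` table
summary_groups <- data.frame(
  summary_group = c("Embryonal tumor", "Normal", "Other"),
  n = c(5, 12, 3),
  stringsAsFactors = FALSE
)

# Sort the largest groups first so the table is easy to scan
summary_groups_sorted <-
  summary_groups[order(summary_groups$n, decreasing = TRUE), ]

# Print every row of the table, then the number of groups
print(summary_groups_sorted, row.names = FALSE)
nrow(summary_groups_sorted)
```

Printing the whole table alongside the count lets a reader see which labels survived the recoding and how many biospecimens each one holds.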

Make this notebook stop if there are more than 15 groups.

```{r}
if (nrow(summary_groups) > 15) {
  stop("There are more than 15 `summary_groups`; you may want to re-evaluate the high-level histology groupings")
}
```

# Add hex codes

These hex codes were retrieved from http://phrogz.net/css/distinct-colors.html with the settings on default for 18 colors.

```{r}
color_palette <-
  c("#ff0000", "#cc0000", "#995200", "#bfb300", "#fffbbf",
    "#2e7300", "#00e65c", "#00ffee", "#103d40", "#0085a6",
    "#003380", "#4073ff", "#737899", "#70008c", "#f2b6ee",
    "#ff40bf", "#8c0038", "#330d12"
  )
```

Declare how many colors we need.

```{r}
n_colors <- nrow(summary_groups)
```

Make a named vector color key where the `summary_group` labels are the names.

```{r}
# Set seed so the colors are consistent upon re-run
set.seed(2021)

# Sample from the 18 colors
subset_colors <- sample(color_palette, n_colors)
names(subset_colors) <- summary_groups$summary_group
```

Review comment (Member): One thing when making the colors is that I feel like we should treat "Other" as special, and always assign it a standard color, something neutral & grey like #A0A0A0 (though I am not sure how that fits with colorblind-friendly schemes).

Reply (Collaborator Author): Alternatively, we could drop this "Other" group from the large summary figures.

Reply (Collaborator): We consider treating "Benign" as the grey group @jashapiro and @cansavvy - because it's more like a control and we would be less interested (maybe) in those genomics.
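
The idea raised in the review thread, giving certain groups a fixed neutral color and sampling colors only for the rest, could look roughly like the base R sketch below. The group labels, the shortened palette, and the names `fixed_colors` and `sampled_groups` are assumptions made for illustration. The grey `#A0A0A0` is the value floated in the review comment, not necessarily the one the notebook ended up with.

```r
# Hypothetical group labels standing in for summary_groups$summary_group
groups <- c("Benign", "Embryonal tumor", "Low-grade astrocytic tumor", "Other")

# A shortened palette for the example
color_palette <- c("#ff0000", "#995200", "#2e7300", "#0085a6", "#70008c")

# Groups that always get the same neutral color (assumption from the review thread)
fixed_colors <- c("Other" = "#A0A0A0", "Benign" = "#A0A0A0")

# Only the remaining groups draw a color from the palette
sampled_groups <- setdiff(groups, names(fixed_colors))

# Set seed so the sampled colors are consistent upon re-run
set.seed(2021)
sampled_colors <- sample(color_palette, length(sampled_groups))
names(sampled_colors) <- sampled_groups

# Combine both sets and restore the original group order
subset_colors <- c(
  sampled_colors,
  fixed_colors[names(fixed_colors) %in% groups]
)[groups]
```

Keeping the fixed colors out of the sampling step means the neutral groups never change color when the seed or the number of groups changes.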

Use the `pie` function to preview what these colors look like.

```{r}
pie(rep(1, n_colors),
    col = subset_colors,
    labels = summary_groups$summary_group)
```

Add the hex codes to the `histology_table`.

```{r}
histology_table <- histology_table %>%
  # Splice the named color vector into recode() so each group maps to its hex code
  dplyr::mutate(hex_codes = dplyr::recode(summary_group, !!!subset_colors))
```

## Save to TSV

```{r}
readr::write_tsv(histology_table, file.path(output_dir, "histology_label_color_table.tsv"))
```

# Session Info

```{r}
sessionInfo()
```
3,306 changes: 3,306 additions & 0 deletions figures/mapping-histology-labels.nb.html

Large diffs are not rendered by default.