AlexsLemonade · jaclyn-taroni · Jan 15, 2021 · Jan 12, 2021 · Jan 12, 2021 · Jan 12, 2021
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -16,6 +16,10 @@ jobs:
           name: List Data Directory Contents
           command: ./scripts/run_in_ci.sh ls data/testing
 
+      - run:
+          name: High level histology grouping for plot labels
+          command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)"
+
       - run:
           name: Sample Distribution Analyses
           command: ./scripts/run_in_ci.sh bash "analyses/sample-distribution-analysis/run-sample-distribution.sh"
@@ -199,10 +203,10 @@ jobs:
       - run:
           name: Gene set enrichment analysis to generate GSVA scores
           command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
-          
+
       - run:
           name: Gene set enrichment analysis to generate GSVA scores FOR BASE SUBTYPING
-          command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"    
+          command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
 
       - run:
           name: Add Shatterseek

diff --git a/figures/mapping-histology-labels.Rmd b/figures/mapping-histology-labels.Rmd
@@ -0,0 +1,237 @@
+---
+title: "Mapping histology labels for plots"
+output:   
+  html_notebook: 
+    toc: true
+    toc_float: true
+author: Candace Savonen for ALSF - CCDL
+date: 2020
+---
+
+# Purpose: 
+
+The histology label variables included in `pbta-histologies.tsv` from data releases, while perhaps specific and accurate, are not useful for summarizing the biospecimens for visuals. 
+
+The goal of this notebook is to use the existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.
+
+## The output table
+
+The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains...
+
+**Copied from `pbta-histologies.tsv`**:    
+- `Kids_First_Biospecimen_ID` (from `pbta-histologies.tsv`)  
+- All the original histology label variables (`broad_histology`, `short_histology`, etc.)  
+
+**Created in this notebook**:  
+- `summary_group` - the high-level histology labels that should be used for plotting   
+- `hex_codes` - the hex color codes that should be used for plotting  
+
+All plots and figures that summarize histologies at a high level should use `histology_label_color_table.tsv`. 
+
+# How `summary_group` is made:
+
+Here's how `broad_histology` is [recoded into the high-level groupings of `summary_group`](#make-new-summary_group).
+
+1) All biospecimens in `broad_histology` groups with fewer than `small_groups_cutoff` biospecimens are grouped together as `Other` in `summary_group`. 
+
+2) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases so we will group these into `Low-grade astrocytic tumor` in `summary_group`. 
+
+3) So that it's clear why their histologies are `NA`, we will label all `Normal` samples (non-tumor samples) as `Normal` in `summary_group`. 
+
+4) Anything not in the above categories gets its `broad_histology` label carried over. 
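+
+As a minimal sketch of these four rules, here is the same kind of recoding applied to a toy data frame. The toy labels and the `toy_` object names are made up for illustration and are not drawn from `pbta-histologies.tsv`; the real recoding happens further down in this notebook.
+
+```r
+# Toy example: hypothetical labels, not real PBTA data
+toy_metadata <- data.frame(
+  sample_type = c("Tumor", "Tumor", "Tumor", "Normal"),
+  broad_histology = c("Rare tumor",
+                      "Neuronal and mixed neuronal-glial tumor",
+                      "Embryonal tumor",
+                      NA),
+  stringsAsFactors = FALSE
+)
+
+# Pretend "Rare tumor" fell below `small_groups_cutoff`
+toy_small_groups <- c("Rare tumor")
+toy_lgat_adjacent <- c("Neuronal and mixed neuronal-glial tumor")
+
+# Rules are checked in order; the first match wins
+toy_table <- dplyr::mutate(
+  toy_metadata,
+  summary_group = dplyr::case_when(
+    broad_histology %in% toy_small_groups ~ "Other",
+    broad_histology %in% toy_lgat_adjacent ~ "Low-grade astrocytic tumor",
+    sample_type == "Normal" ~ "Normal",
+    TRUE ~ broad_histology
+  )
+)
+
+toy_table$summary_group
+```
+
+In this toy case the `Normal` row has an `NA` histology, so it falls through the first two rules and is caught by the third.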
+
+# Usage
+
+This notebook can be run via the command line from the top directory of the 
+repository as follows:
+
+```
+Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', 
+                              clean = TRUE)"
+```
+
+## Set Up
+
+```{r}
+# Magrittr pipe
+`%>%` <- dplyr::`%>%`
+```
+
+Groups smaller than this cutoff will be put into an `Other` group in `summary_group`. 
+
+```{r}
+small_groups_cutoff <- 12
+```
+
+### Directories and Files
+
+```{r}
+# Path to input directory
+input_dir <- file.path("..", "data")
+output_dir <- "palettes"
+```
+
+# Read in metadata 
+
+Which variables are we keeping for this table? 
+
+```{r}
+histology_variables <- 
+  c("Kids_First_Biospecimen_ID",
+    "sample_type", 
+    "integrated_diagnosis", 
+    "Notes", 
+    "harmonized_diagnosis",
+    "broad_histology", 
+    "short_histology")
+
+```
+
+Let's read in the current release's `pbta-histologies.tsv` file and select the histology variables we mentioned above. 
+
+```{r}
+metadata <-
+  readr::read_tsv(file.path(input_dir, "pbta-histologies.tsv"), guess_max = 10000) %>% 
+  dplyr::select(histology_variables)
+```
+
+# Take a look at how many biospecimens per `broad_histology` group
+
+Let's summarize `broad_histology`. 
+Because the `Normal` samples don't have histologies, we'll look at just the `Tumor` samples for this summary. 
+
+```{r}
+broad_summary <- metadata %>% 
+  dplyr::filter(sample_type == "Tumor") %>%
+  dplyr::count(broad_histology) 
+```
+
+Now let's get a vector of those small `broad_histology` groups based on the `small_groups_cutoff`. 
+
+```{r}
+small_groups <- broad_summary %>% 
+  dplyr::filter(n < small_groups_cutoff) %>% 
+  dplyr::pull(broad_histology)
+```
+
+Let's take a look at how big all these groups are with a barplot.
+We'll mark which small histology groups are going to be labeled as "Other". 
+
+```{r}
+broad_summary %>% 
+  dplyr::mutate(now_labeled_other = broad_histology %in% small_groups) %>% 
+  ggplot2::ggplot(ggplot2::aes(x = reorder(broad_histology, -n), y = n, fill = now_labeled_other)) +
+  ggplot2::geom_bar(position = "dodge", stat = "identity") +
+  ggplot2::theme_classic() +
+  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)) + 
+  ggplot2::xlab("broad_histology label") +
+  ggplot2::ggtitle(paste("Broad histologies: groups <", small_groups_cutoff, "will be 'Other' in `summary_group`"))
+```
+
+There's a handful of very small groups (many are n = 2). 
+
+# LGAT adjacent groups 
+
+Previously, samples with these labels were generally considered `LGAT` groups, so we will group them back together in `summary_group`. 
+
+```{r}
+lgat_adjacent <- c("Neuronal and mixed neuronal-glial tumor", 
+                   "Diffuse astrocytic and oligodendroglial tumor")
+```
+
+Stop if these strings aren't detected. 
+
+```{r}
+if (!all(lgat_adjacent %in% metadata$broad_histology)) {
+  stop("LGAT adjacent groups' text strings are not detected in this version of pbta-histologies' broad_histology variable")
+}
+```
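+
+If this check fails, it can help to know which label is missing. A minimal sketch of one way to report that with `setdiff()`, shown on toy vectors (the `toy_` names and values are hypothetical stand-ins for `lgat_adjacent` and `metadata$broad_histology`):
+
+```r
+# Toy stand-ins for the expected labels and the observed column
+toy_expected <- c("Group A", "Group B")
+toy_observed <- c("Group A", "Group C", NA)
+
+# Labels we expected but did not find
+toy_missing <- setdiff(toy_expected, toy_observed)
+
+if (length(toy_missing) > 0) {
+  message("Not found in broad_histology: ", paste(toy_missing, collapse = ", "))
+}
+```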
+
+# Make new `summary_group`
+
+This [logic is described at the top of the notebook](#how-summary_group-is-made).
+
+```{r}
+histology_table <- metadata %>% 
+  dplyr::mutate(
+    summary_group = dplyr::case_when(
+      broad_histology %in% small_groups ~ "Other",
+      broad_histology %in% lgat_adjacent ~ "Low-grade astrocytic tumor", 
+      sample_type == "Normal" ~ "Normal",
+      TRUE ~ broad_histology)
+    )
+```
+
+Print out the number of `summary_groups` (including `Normal`)!
+
+```{r}
+summary_groups <- histology_table %>% 
+  dplyr::count(summary_group)
+
+nrow(summary_groups)
+```
+
+Make this notebook stop if there are more than 15 groups. 
+
+```{r}
+if (nrow(summary_groups) > 15) {
+  stop("There are more than 15 `summary_groups`; you may want to re-evaluate the high-level histology groupings")
+}
+```
+
+# Add hex codes 
+
+These hex codes were retrieved from http://phrogz.net/css/distinct-colors.html with the settings on default for 18 colors.
+
+```{r}
+color_palette <- 
+  c("#ff0000", "#cc0000", "#995200", "#bfb300", "#fffbbf", 
+    "#2e7300", "#00e65c", "#00ffee", "#103d40", "#0085a6", 
+    "#003380", "#4073ff", "#737899", "#70008c", "#f2b6ee", 
+    "#ff40bf", "#8c0038", "#330d12"
+)
+```
+
+Declare how many colors we need. 
+
+```{r}
+n_colors <- nrow(summary_groups)
+```
+
+Make a named vector to use as the color key, where the histologies are the names. 
+
+```{r}
+# Set seed so the colors are consistent upon re-run
+set.seed(2021)
+
+# Sample from the 18 colors
+subset_colors <- sample(color_palette, n_colors)
+names(subset_colors) <- summary_groups$summary_group
+```
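+
+Because `sample()` draws without replacement by default, every group gets a distinct color as long as the palette holds at least `n_colors` colors. A small sketch of sanity checks one might add, shown on a toy palette (the `toy_` objects are hypothetical):
+
+```r
+toy_palette <- c("#ff0000", "#2e7300", "#003380", "#70008c")
+toy_groups <- c("Group A", "Group B", "Normal")
+
+# sample() would error if we asked for more colors than exist
+stopifnot(length(toy_groups) <= length(toy_palette))
+
+# Set seed so the draw is the same on every run
+set.seed(2021)
+toy_colors <- sample(toy_palette, length(toy_groups))
+names(toy_colors) <- toy_groups
+
+# Each group has its own color, and each color is a 6-digit hex code
+stopifnot(!any(duplicated(toy_colors)))
+stopifnot(all(grepl("^#[0-9a-fA-F]{6}$", toy_colors)))
+```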
+
+Use `pie` function to preview what these look like.
+
+```{r}
+pie(rep(1, n_colors), 
+    col = subset_colors, 
+    labels = summary_groups$summary_group)
+```
+
+Add the hex codes to the `histology_table`. 
+
+```{r}
+histology_table <- histology_table %>% 
+  # Splice the named color vector into recode(); its names are matched to summary_group
+  dplyr::mutate(hex_codes = dplyr::recode(summary_group, !!!subset_colors))
+```
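+
+`dplyr::recode()` treats the names of a spliced vector as the old values and its elements as the replacements. A minimal sketch on toy values (the `toy_` names are hypothetical) suggesting that this gives the same result as indexing the named vector directly:
+
+```r
+toy_colors <- c("Group A" = "#ff0000", "Normal" = "#003380")
+toy_summary_group <- c("Normal", "Group A", "Normal")
+
+# Splice the named vector into recode(): names are matched, values are returned
+recoded <- dplyr::recode(toy_summary_group, !!!toy_colors)
+
+# Base R alternative: look up each label by name
+indexed <- unname(toy_colors[toy_summary_group])
+
+identical(recoded, indexed)
+```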
+
+## Save to TSV 
+
+```{r}
+readr::write_tsv(histology_table, file.path(output_dir, "histology_label_color_table.tsv"))
+```
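+
+A sketch of how a downstream plot might consume this table. The toy table below stands in for `histology_label_color_table.tsv` (its contents are invented), and the `color_key` and `plot_colors` names are hypothetical:
+
+```r
+# Toy stand-in for the table; a real script would read the TSV with readr::read_tsv()
+toy_table <- data.frame(
+  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3"),
+  summary_group = c("Group A", "Group A", "Normal"),
+  hex_codes = c("#ff0000", "#ff0000", "#003380"),
+  stringsAsFactors = FALSE
+)
+
+# One color per group, as a named vector
+color_key <- unique(toy_table[, c("summary_group", "hex_codes")])
+plot_colors <- setNames(color_key$hex_codes, color_key$summary_group)
+
+# Use the named vector as a manual fill scale
+ggplot2::ggplot(toy_table, ggplot2::aes(x = summary_group, fill = summary_group)) +
+  ggplot2::geom_bar() +
+  ggplot2::scale_fill_manual(values = plot_colors)
+```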
+
+# Session Info
+
+```{r}
+sessionInfo()
+```
diff --git a/figures/mapping-histology-labels.nb.html b/figures/mapping-histology-labels.nb.html