AlexsLemonade · jaclyn-taroni · Jan 15, 2021 · Jan 12, 2021 · Jan 12, 2021 · Jan 12, 2021
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -16,6 +16,10 @@ jobs:
           name: List Data Directory Contents
           command: ./scripts/run_in_ci.sh ls data/testing
 
+      - run:
+          name: High level histology grouping for plot labels
+          command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)"
+
       - run:
           name: Sample Distribution Analyses
           command: ./scripts/run_in_ci.sh bash "analyses/sample-distribution-analysis/run-sample-distribution.sh"
@@ -191,10 +195,10 @@ jobs:
       - run:
           name: Gene set enrichment analysis to generate GSVA scores
           command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
-          
+
       - run:
           name: Gene set enrichment analysis to generate GSVA scores FOR BASE SUBTYPING
-          command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"    
+          command: OPENPBTA_TESTING=1 OPENPBTA_BASE_SUBTYPING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
 
       - run:
           name: Add Shatterseek

diff --git a/figures/mapping-histology-labels.Rmd b/figures/mapping-histology-labels.Rmd
@@ -0,0 +1,232 @@
+---
+title: "Mapping histology labels for plots"
+output:   
+  html_notebook: 
+    toc: true
+    toc_float: true
+author: Candace Savonen for ALSF - CCDL
+date: 2020
+---
+
+# Purpose: 
+
+The histology label variables included in `pbta-histologies.tsv` from data releases are not always useful for visualizing the full set of biospecimens due to the large number of different values.
+Having too many different possible values makes the colors harder to distinguish.
+In addition, some groups are represented by only a few samples; giving such groups a distinct color may be counterproductive.
+
+The goal of this notebook is to use the existing `broad_histology` groups from `pbta-histologies.tsv` to form 10-15 "high level histology" group labels that can be used for plotting purposes.
+
+## The output table
+
+The output of this notebook is a TSV file: `palettes/histology_label_color_table.tsv` that contains the following fields:
+
+**Copied from `pbta-histologies.tsv`**:    
+- `Kids_First_Biospecimen_ID` (from `pbta-histologies.tsv`)  
+- All the original histology label variables (`broad_histology`, `short_histology`, etc.)  
+
+**Created in this notebook**:  
+- `display_group` - the high-level histology labels that should be used for plotting   
+- `hex_codes` - the hex color codes that should be used for plotting  
+
+With this info, `histology_label_color_table.tsv` can be used by all plots and figures that summarize high-level data while displaying histology information. 
+
+# How `display_group` is made:
+
+Here's how `broad_histology` groups are [combined into the higher-level groupings of `display_group`](#declare-new-equivalent-groups).
+
+1) "Neuronal and mixed neuronal-glial tumor" and "Diffuse astrocytic and oligodendroglial tumor" were labeled as LGAT in previous releases, so we group them with `Low-grade astrocytic tumor` into `Low-grade astrocytic tumor and adjacent` in `display_group`. 
+
+2) `Germ cell tumor` samples are combined into the `Non-CNS tumor` group. 
+
+3) `Benign tumor` and `Non-tumor` biospecimens are combined into a `Benign` group. 
+
+4) Anything not in the above categories gets its `broad_histology` label carried over. 
+
+# Usage
+
+This notebook can be run via the command line from the top directory of the 
+repository as follows:
+
+```
+Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', 
+                              clean = TRUE)"
+```
+
+## Set Up
+
+```{r}
+# Magrittr pipe
+`%>%` <- dplyr::`%>%`
+```
+
+Groups with fewer biospecimens than this cutoff will be put into an `Other` group in `display_group`. 
+
+```{r}
+small_groups_cutoff <- 12
+```
+
+### Directories and Files
+
+```{r}
+# Path to input directory
+input_dir <- file.path("..", "data")
+output_dir <- "palettes"
+```
+
+# Read in metadata 
+
+Which variables are we keeping for this table? 
+
+```{r}
+histology_variables <- 
+  c("Kids_First_Biospecimen_ID",
+    "sample_type", 
+    "integrated_diagnosis", 
+    "Notes", 
+    "harmonized_diagnosis",
+    "broad_histology", 
+    "short_histology")
+
+```
+
+Let's read in the current release's `pbta-histologies.tsv` file and select the histology variables we mentioned above. 
+
+```{r}
+metadata <-
+  readr::read_tsv(file.path(input_dir, "pbta-histologies.tsv"), guess_max = 10000) %>% 
+  dplyr::select(dplyr::all_of(histology_variables))
+```
+
+# Take a look at how many biospecimens per `broad_histology` group
+
+Let's summarize `broad_histology`. 
+Because the `Normal` samples don't have histologies, we'll look at just the `Tumor` samples for this summary. 
+
+```{r}
+broad_summary <- metadata %>% 
+  dplyr::filter(sample_type == "Tumor") %>%
+  dplyr::count(broad_histology) %>% 
+  dplyr::arrange(n)
+```
+
+Let's print out the summary. 
+
+```{r}
+broad_summary %>% 
+  knitr::kable()
+```
+
+There are a handful of very small groups (many are n = 2). 
+
+## Declare new equivalent groups
+
+Previously, samples with these labels were generally considered part of the `LGAT` group, so we will group them back together in `display_group`. 
+
+```{r}
+lgat_adjacent <- c("Low-grade astrocytic tumor",
+                   "Neuronal and mixed neuronal-glial tumor",
+                   "Diffuse astrocytic and oligodendroglial tumor")
+```
+
+We'll combine these groups as non-CNS. 
+
+```{r}
+non_cns <- c("Non-CNS tumor", "Germ cell tumor")
+```
+
+We'll combine these groups as benign.
+
+```{r}
+benign <- c("Benign tumor", "Non-tumor")
+```
+
+# Make new `display_group`
+
+```{r}
+histology_table <- metadata %>% 
+  dplyr::mutate(
+    # NAs are really Normals
+    display_group = tidyr::replace_na(broad_histology, "Normal"),
+    # Now do the group combining
+    display_group = forcats::fct_collapse(display_group,
+      "Low-grade astrocytic tumor and adjacent" = lgat_adjacent,
+      "Non-CNS tumor" = non_cns,
+      "Benign" = benign
+    )
+  )
+```
+
+Print out the number of biospecimens in each `display_group` (including `Normal`).
+
+```{r}
+display_group_df <- histology_table %>% 
+  dplyr::count(display_group) %>% 
+  dplyr::arrange(n)
+
+knitr::kable(display_group_df)
+```
+
+Make this notebook stop if there are more than 15 groups. 
+
+```{r}
+if (nrow(display_group_df) > 15) {
+  stop("There are more than 15 `display_group` values; you may want to re-evaluate the high-level histology groupings")
+}
+```
+
+# Add hex codes 
+
+These hex codes were retrieved from http://phrogz.net/css/distinct-colors.html with the settings on default for 18 colors.
+
+```{r}
+color_palette <- 
+  c("#ff0000", "#cc0000", "#995200", "#bfb300", "#fffbbf", 
+    "#2e7300", "#00e65c", "#00ffee", "#103d40", "#0085a6", 
+    "#003380", "#4073ff", "#737899", "#70008c", "#f2b6ee", 
+    "#ff40bf", "#8c0038", "#330d12"
+)
+```
+
+Declare how many colors we need. 
+
+```{r}
+n_colors <- nrow(display_group_df)
+```
+
+Make a named vector color key where histologies are the names. 
+
+```{r}
+# Set seed so the colors are consistent upon re-run
+set.seed(2021)
+
+# Sample from the 18 colors
+subset_colors <- sample(color_palette, n_colors)
+names(subset_colors) <- display_group_df$display_group
+```
+
+Use the `pie` function to preview what these look like.
+
+```{r}
+pie(rep(1, n_colors), 
+    col = subset_colors, 
+    labels = display_group_df$display_group)
+```
+
+Add the hex codes to the `histology_table`. 
+
+```{r}
+histology_table <- histology_table %>% 
+  dplyr::mutate(hex_codes = dplyr::recode(display_group, !!!subset_colors))
+```
+
+## Save to TSV 
+
+```{r}
+readr::write_tsv(histology_table, file.path(output_dir, "histology_label_color_table.tsv"))
+```
+
+# Session Info
+
+```{r}
+sessionInfo()
+```
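
Note on `small_groups_cutoff`: the notebook text says that groups smaller than the cutoff go into an `Other` group. As a minimal sketch of what that lumping step might look like, here is a dependency-free base R version. The helper name `lump_small_groups` and the toy data are hypothetical and are not part of this PR; this is one possible approach, not necessarily the one the authors intend.

```r
# Hypothetical helper (not part of the PR): relabel every group that has
# fewer than `cutoff` members as `other_label`.
lump_small_groups <- function(groups, cutoff, other_label = "Other") {
  groups <- as.character(groups)
  # Count how many members each group has
  group_counts <- table(groups)
  # Names of the groups that fall below the cutoff
  small_groups <- names(group_counts)[group_counts < cutoff]
  groups[groups %in% small_groups] <- other_label
  groups
}

# Toy labels, made up for illustration only
toy_groups <- c(rep("Benign", 15), rep("Chordoma", 3),
                rep("Normal", 20), "Lymphoma")
lumped <- lump_small_groups(toy_groups, cutoff = 12)
table(lumped)
```

In the notebook, a step like this could be applied to `display_group` after the `fct_collapse()` call, passing `small_groups_cutoff` as the cutoff. `forcats::fct_lump_min()` provides similar behavior for factors and may fit the existing tidyverse style better.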
diff --git a/figures/mapping-histology-labels.nb.html b/figures/mapping-histology-labels.nb.html