AlexsLemonade · jaclyn-taroni · Sep 27, 2020 · Sep 19, 2020 · Sep 20, 2020 · Sep 20, 2020
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -36,11 +36,9 @@ jobs:
 
      ### MOLECULAR SUBTYPING ###
 
-      # TODO: This is currently broken because of a change from glioma_brain_region to CNS_brain_region
-      # The fix is tracked in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754#issuecomment-691525827
-      # - run:
-      #     name: Molecular Subtyping - HGG
-      #     command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-HGG/run-molecular-subtyping-HGG.sh
+      - run:
+          name: Molecular Subtyping - HGG
+          command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-HGG/run-molecular-subtyping-HGG.sh
 
       - run:
           name: Molecular subtyping - Non-MB/Non-ATRT Embryonal tumors

diff --git a/analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd b/analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
@@ -0,0 +1,164 @@
+---
+title: "Select pathology diagnoses for inclusion"
+output: 
+  html_notebook:
+    toc: TRUE
+    toc_float: TRUE
+author: Jaclyn Taroni for ALSF CCDL
+date: 2020
+---
+
+## Background
+
+Originally, we subtyped tumors in this module if the specimen satisfied one of the following criteria:
+
+1. A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V)
+2. The `short_histology` was `HGAT`.
+
+In an upcoming release, `integrated_diagnosis`, which can be updated as the result of subtyping, will be used to populate the `short_histology` column (see [#748](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/748)).
+Thus, molecular subtyping modules need to be upstream of `short_histology` and use the `pathology_diagnosis` and `pathology_free_text_diagnosis` fields.
+This change for this module is tracked in [#754](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754).
+
+Filtering on the basis of `short_histology == HGAT` is more straightforward than using the pathology diagnosis fields, so we use this notebook to assemble the terms in `pathology_diagnosis` and `pathology_free_text_diagnosis` that identify samples for subtyping.
+
+We will use the 2016 WHO Classification as our guide ([Louis et al. _Acta Neuropathol._ doi: 10.1007/s00401-016-1545-1](https://doi.org/10.1007/s00401-016-1545-1)) and take a look at the current version of the histology file (`release-v17-20200908`).
+
+## Set up
+
+```{r}
+library(tidyverse)
+```
+
+### Directories and files
+
+We're going to tie this to a specific release.
+
+```{r}
+data_dir <- file.path("..", "..", "data", "release-v17-20200908")
+histologies_file <- file.path(data_dir, "pbta-histologies.tsv")
+```
+
+We're going to save the pathology diagnosis information we'll use to generate the subset files in a directory `hgg-subset`.
+
+```{r}
+output_dir <- "hgg-subset"
+output_file <- file.path(output_dir, 
+                         "hgg_subtyping_path_dx_strings.json")
+```
+
+## Read in data
+
+```{r}
+histologies_df <- read_tsv(histologies_file)
+```
+
+## Explore the pathology diagnoses
+
+### `short_histology == HGAT`
+
+In the current histologies file, if we filter based on `short_histology` as we did originally, what is in the pathology diagnosis fields?
+Note that some of the `short_histology` values will have been altered based on earlier subtyping efforts.
+(That's why we're doing this!)
+
+```{r}
+histologies_df %>% 
+  filter(short_histology == "HGAT") %>%
+  count(pathology_diagnosis) %>%
+  arrange(desc(n))
+```
+
+For the most part, this is as we would expect given the 2016 WHO classifications. 
+In an initial round of subtyping, PNET specimens were reclassified (see [this comment on #609](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/609#issuecomment-602821376)).
+We should not and will not include PNET samples in the criteria used to detect samples for subtyping from pathology diagnosis fields.
+Instead, the samples that were reclassified earlier should be included downstream on the basis of defining lesions.
+
+Although some of these terms appear to be part of a defined vocabulary, there are others like `High Grade Glial Neoplasma` that were not subject to the same harmonization.
+These are likely from the completed PNOC trial (see [#754 (comment)](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754#issuecomment-697004412)).
+Let's take a look if we filter by `cohort == "PNOC003"`.
+
+
+```{r}
+histologies_df %>%
+  filter(cohort == "PNOC003",
+         # Filter out normal WGS rows
+         sample_type != "Normal") %>%
+  count(pathology_diagnosis) %>%
+  arrange(desc(n))
+```
+
+As anticipated, these values all indicate samples that should be included for subtyping, but they are not harmonized.
+
+Let's take a look at the free text field when filtering with `short_histology == HGAT`.
+
+```{r}
+histologies_df %>% 
+  filter(short_histology == "HGAT") %>%
+  group_by(pathology_free_text_diagnosis) %>%
+  tally() %>%
+  arrange(desc(n))
+```
+
+As we might expect from a free text field, this is even less uniform.
+
+## Pathology diagnosis strings for inclusion
+
+For the CBTTC samples, the `pathology_diagnosis` fields are harmonized, so we can use the terms below to look for exact matches.
+
+```{r}
+exact_path_dx <- c(
+  "High-grade glioma/astrocytoma (WHO grade III/IV)",
+  "Brainstem glioma- Diffuse intrinsic pontine glioma",
+  "Gliomatosis Cerebri"
+)
+```
+
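+All samples from the PNOC003 trial should also be included; because their `pathology_diagnosis` values are not harmonized, we select them by `cohort` rather than by exact match.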
+
+Let's take a look at a first attempt using these terms as described above.
+
+```{r}
+filtered_on_dx_df <- histologies_df %>%
+  filter(pathology_diagnosis %in% exact_path_dx | 
+            cohort == "PNOC003",
+         # Exclude normal samples when filtering on cohort
+         sample_type != "Normal") %>%
+  select(Kids_First_Biospecimen_ID, 
+         sample_id, 
+         Kids_First_Participant_ID,
+         pathology_diagnosis,
+         pathology_free_text_diagnosis,
+         integrated_diagnosis, 
+         short_histology)
+
+filtered_on_dx_df
+```
+
+Let's tally the values in `pathology_diagnosis` in this data frame.
+
+```{r}
+filtered_on_dx_df %>%
+  count(pathology_diagnosis) %>%
+  arrange(desc(n))
+```
+
+We are not including any samples with pathology diagnoses outside of what we should include for subtyping.
+
+### Save the strings we'll use downstream
+
+Create a list with the strings we'll use for inclusion.
+
+```{r}
+terms_list <- list(exact_path_dx = exact_path_dx)
+```
+
+Save this list as JSON.
+
+```{r}
+writeLines(jsonlite::prettify(jsonlite::toJSON(terms_list)), output_file)
+```
+
+## Session Info
+
+```{r}
+sessionInfo()
+```
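
Note on `00-HGG-select-pathology-dx.Rmd`: the notebook saves the inclusion strings as JSON so that a later step can build the subset files. A minimal sketch of how a downstream script might read those strings back and apply the same filter follows. The toy data frame, the biospecimen IDs, and the temporary file are made up for illustration; only the column names, the three diagnosis strings, and the filter logic come from the notebook.

```r
library(dplyr)
library(jsonlite)

# Toy stand-in for pbta-histologies.tsv; every row here is invented
toy_histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_0001", "BS_0002", "BS_0003", "BS_0004"),
  pathology_diagnosis = c("Gliomatosis Cerebri",
                          "Ependymoma",
                          "High Grade Glial Neoplasma",
                          NA),
  cohort = c("CBTTC", "CBTTC", "PNOC003", "PNOC003"),
  sample_type = c("Tumor", "Tumor", "Tumor", "Normal"),
  stringsAsFactors = FALSE
)

# Write the terms the same way the notebook does, but to a temporary file
# (assumption: the real pipeline would read hgg_subtyping_path_dx_strings.json)
terms_file <- tempfile(fileext = ".json")
terms_list <- list(exact_path_dx = c(
  "High-grade glioma/astrocytoma (WHO grade III/IV)",
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Gliomatosis Cerebri"
))
writeLines(jsonlite::prettify(jsonlite::toJSON(terms_list)), terms_file)

# Read the strings back; a JSON array of strings becomes a character vector
path_dx_list <- jsonlite::fromJSON(terms_file)

# Exact matches on harmonized terms OR any PNOC003 sample, never normals
hgg_ids <- toy_histologies %>%
  filter(pathology_diagnosis %in% path_dx_list$exact_path_dx |
           cohort == "PNOC003",
         sample_type != "Normal") %>%
  pull(Kids_First_Biospecimen_ID)
```

Because `%in%` returns `FALSE` for `NA`, rows with a missing `pathology_diagnosis` are kept only when they come from PNOC003 and are not normal samples.
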
diff --git a/analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.nb.html b/analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.nb.html
diff --git a/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd b/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd
@@ -45,19 +45,11 @@ if (!dir.exists(results_dir)) {
 
 # Read in metadata
 metadata <-
-  readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv"), guess_max = 10000) %>%
+  readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv"), 
+                  guess_max = 10000) %>%
   dplyr::filter(sample_type == "Tumor",
                 composition == "Solid Tissue") 
 
-# Select wanted columns in metadata for merging and assign to a new object
-select_metadata <- metadata %>%
-  dplyr::select(Kids_First_Participant_ID,
-                sample_id,
-                Kids_First_Biospecimen_ID,
-                broad_histology,
-                short_histology,
-                integrated_diagnosis)
-
 # Read in snv consensus mutation data
 snv_df <-
   data.table::fread(file.path(root_dir,
@@ -111,51 +103,39 @@ snv_lesions_df <- snv_lesions_df %>%
     )
   ) %>%
   dplyr::mutate_all(function(x) tidyr::replace_na(x, "No"))
+```
+
+Add a column that keeps track of the presence of any defining lesion.
+We'll use this to create subset files in the next step.
+
+```{r}
+snv_lesions_df <- snv_lesions_df %>%
+  dplyr::mutate(
+    defining_lesion = dplyr::case_when(
+      H3F3A.K28M == "Yes" ~ TRUE,
+      HIST1H3B.K28M == "Yes" ~ TRUE,
+      HIST1H3C.K28M == "Yes" ~ TRUE,
+      HIST2H3C.K28M == "Yes" ~ TRUE,
+      H3F3A.G35R == "Yes" ~ TRUE,
+      H3F3A.G35V == "Yes" ~ TRUE,
+      TRUE ~ FALSE
+    )
+  )
+```
+
+Add other identifiers and sort.
 
-# Join the selected variables from the metadata with the snv consensus mutation
-# and defining lesions data.frame
-snv_lesions_df <- select_metadata %>%
+```{r}
+snv_lesions_df <- metadata %>%
+  dplyr::select(Kids_First_Participant_ID, 
+                sample_id, 
+                Kids_First_Biospecimen_ID) %>%
   dplyr::inner_join(snv_lesions_df,
                     by = c("Kids_First_Biospecimen_ID" = "Tumor_Sample_Barcode")) %>%
-  dplyr::select(
-    dplyr::ends_with("ID"),
-    dplyr::starts_with("H"),
-    broad_histology,
-    short_histology,
-    integrated_diagnosis
-  ) %>%
-  dplyr::mutate(
-    disease_type_reclassified = dplyr::case_when(
-      H3F3A.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
-      HIST1H3B.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
-      HIST1H3C.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
-      HIST2H3C.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
-      H3F3A.G35R == "Yes" ~ "High-grade glioma, H3 G35 mutant",
-      H3F3A.G35V == "Yes" ~ "High-grade glioma, H3 G35 mutant",
-      TRUE ~ as.character(integrated_diagnosis)),
-    short_histology_reclassified = dplyr::case_when(
-      H3F3A.K28M == "Yes" ~ "HGAT",
-      HIST1H3B.K28M == "Yes" ~ "HGAT",
-      HIST1H3C.K28M == "Yes" ~ "HGAT",
-      HIST2H3C.K28M == "Yes" ~ "HGAT",
-      H3F3A.G35R == "Yes" ~ "HGAT",
-      H3F3A.G35V == "Yes" ~ "HGAT",
-      TRUE ~ as.character(short_histology)),
-    broad_histology_reclassified = dplyr::case_when(
-      H3F3A.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      HIST1H3B.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      HIST1H3C.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      HIST2H3C.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      H3F3A.G35R == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      H3F3A.G35V == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
-      TRUE ~ as.character(broad_histology)),
-    ) %>%
   dplyr::arrange(Kids_First_Participant_ID, sample_id)
-
-# Display `snv_lesions_df`
-snv_lesions_df 
 ```
 
+
 ## Save final table of results
 
 ```{r}
@@ -164,23 +144,9 @@ readr::write_tsv(snv_lesions_df,
                  file.path(results_dir, "HGG_defining_lesions.tsv"))
 ```
 
-## Inconsistencies in disease classification
-
-```{r}
-# Isolate the samples with the specified mutations that were not classified
-# as HGG or DIPG
-snv_lesions_df %>%
-  dplyr::filter(
-    grepl("High-grade glioma|Diffuse midline glioma", disease_type_reclassified) &
-      !(integrated_diagnosis %in% c("High-grade glioma", 
-                                "Brainstem glioma- Diffuse intrinsic pontine glioma"))
-  )
-```
-
-# Session Info
+## Session Info
 
 ```{r}
 # Print the session information
 sessionInfo()
 ```
-
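
Note on `01-HGG-molecular-subtyping-defining-lesions.Rmd`: the new `defining_lesion` column is `TRUE` whenever any of the six H3 lesion columns is `"Yes"`. The sketch below shows an equivalent row-wise check on toy data, as an illustration of what the `case_when()` computes rather than a proposed replacement. The sample barcodes and values are invented; the lesion column names are the ones used in the notebook.

```r
# Toy lesion table; barcodes and Yes/No values are made up
toy_lesions <- data.frame(
  Tumor_Sample_Barcode = c("BS_0001", "BS_0002", "BS_0003"),
  H3F3A.K28M    = c("Yes", "No", "No"),
  HIST1H3B.K28M = c("No",  "No", "No"),
  HIST1H3C.K28M = c("No",  "No", "No"),
  HIST2H3C.K28M = c("No",  "No", "No"),
  H3F3A.G35R    = c("No",  "No", "Yes"),
  H3F3A.G35V    = c("No",  "No", "No"),
  stringsAsFactors = FALSE
)

lesion_cols <- c("H3F3A.K28M", "HIST1H3B.K28M", "HIST1H3C.K28M",
                 "HIST2H3C.K28M", "H3F3A.G35R", "H3F3A.G35V")

# A sample has a defining lesion if at least one lesion column is "Yes"
toy_lesions$defining_lesion <-
  unname(rowSums(toy_lesions[lesion_cols] == "Yes") > 0)
```

This relies on the earlier `replace_na(x, "No")` step in the notebook: with no `NA` values left in the lesion columns, the row sums are always defined.
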