AlexsLemonade · jharenza · Dec 1, 2020 · Nov 24, 2020
diff --git a/analyses/molecular-subtyping-pathology/01-compile-subtyping-results.Rmd b/analyses/molecular-subtyping-pathology/01-compile-subtyping-results.Rmd
@@ -4,18 +4,23 @@ output:
   html_notebook:
     toc: true
     toc_float: true
-author: Jaclyn Taroni for CCDL
+author: Jaclyn Taroni for CCDL, Jo Lynne Rokita for D3b
 date: 2020
 params:
    is_ci: FALSE
 ---
 
 The purpose of this notebook is to aggregate molecular subtyping results from the following mature analysis modules:
 
-* [`molecular-subtyping-EWS`](https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/645-pathology-feedback/analyses/molecular-subtyping-EWS)
-* [`molecular-subtyping-HGG`](https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/645-pathology-feedback/analyses/molecular-subtyping-HGG)
-* [`molecular-subtyping-LGAT`](https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/645-pathology-feedback/analyses/molecular-subtyping-LGAT)
-* [`molecular-subtyping-embryonal`](https://github.com/jaclyn-taroni/OpenPBTA-analysis/tree/645-pathology-feedback/analyses/molecular-subtyping-embryonal)
+* [`molecular-subtyping-EWS`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EWS)
+* [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG)
+* [`molecular-subtyping-LGAT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-LGAT)
+* [`molecular-subtyping-embryonal`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal)
+* [`molecular-subtyping-CRANIO`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-CRANIO)
+* [`molecular-subtyping-EPN`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EPN)
+* [`molecular-subtyping-MB`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-MB)
+* [`molecular-subtyping-neurocytoma`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-neurocytoma)
+
 
 ## Set up
 
@@ -74,10 +79,14 @@ data_dir <- file.path("..", "..", "data")
 analyses_dir <- ".."
 
 # directories for upstream subtyping modules
+cranio_dir <- file.path(analyses_dir, "molecular-subtyping-CRANIO")
 ews_dir <- file.path(analyses_dir, "molecular-subtyping-EWS")
+epn_dir <- file.path(analyses_dir, "molecular-subtyping-EPN")
 hgg_dir <- file.path(analyses_dir, "molecular-subtyping-HGG")
 lgat_dir <- file.path(analyses_dir, "molecular-subtyping-LGAT")
+mb_dir <- file.path(analyses_dir, "molecular-subtyping-MB")
 embryonal_dir <- file.path(analyses_dir, "molecular-subtyping-embryonal")
+neurocytoma_dir <- file.path(analyses_dir, "molecular-subtyping-neurocytoma")
 
 # the folder that contains the tabular results is standardized across modules
 results_dir <- "results"
@@ -95,19 +104,23 @@ When we run this locally, we want to tie it to a specific version of the histolo
 if (running_in_ci) {
   histologies_file <- file.path(data_dir, "pbta-histologies.tsv")
 } else {
-  histologies_file <- file.path(data_dir, "release-v15-20200228",
+  histologies_file <- file.path(data_dir, "release-v17-20200908",
                                 "pbta-histologies.tsv")
 }
 ```
 
 Results files from individual modules.
 
 ```{r}
+cranio_results_file <- file.path(cranio_dir, results_dir, "CRANIO_molecular_subtype.tsv")
 ews_results_file <- file.path(ews_dir, results_dir, "EWS_results.tsv")
+epn_results_file <- file.path(epn_dir, results_dir, "EPN_all_data_withsubgroup.tsv")
 hgg_results_file <- file.path(hgg_dir, results_dir, "HGG_molecular_subtype.tsv")
 lgat_results_file <- file.path(lgat_dir, results_dir, "lgat_subtyping.tsv")
+mb_results_file <- file.path(mb_dir, results_dir, "MB_molecular_subtype.tsv")
 embryonal_results_file <- file.path(embryonal_dir, results_dir,
                                     "embryonal_tumor_molecular_subtypes.tsv")
+neurocytoma_results_file <- file.path(neurocytoma_dir, results_dir, "neurocytoma_subtyping.tsv")
 ```
 
 #### Output file
@@ -119,18 +132,23 @@ output_file <- file.path(results_dir, "compiled_molecular_subtypes.tsv")
 ## Read in data
 
 ```{r message=FALSE}
+# read in the histologies file and the results from each subtyping module
 histologies_df <- read_tsv(histologies_file, guess_max = 10000)
+cranio_results_df <- read_tsv(cranio_results_file)
 ews_results_df <- read_tsv(ews_results_file)
+epn_results_df <- read_tsv(epn_results_file)
 hgg_results_df <- read_tsv(hgg_results_file)
 lgat_results_df <- read_tsv(lgat_results_file)
+mb_results_df <- read_tsv(mb_results_file)
+neurocytoma_results_df <- read_tsv(neurocytoma_results_file)
 embryonal_results_df <- read_tsv(embryonal_results_file)
 ```
 
 ## Compile the subtyping results
 
 ### Handling non-ATRT/non-MB embryonal tumors
 
-The molecular subtyping information from these tumors went into the v15 release, so we can use the `integrated_diagnosis`, `short_histology`, `broad_histology`, and `Notes` columns from the histologies file from that release.
+The molecular subtyping information from these tumors will go into the v18 release. Until the SQL rules PR is merged, we use the `integrated_diagnosis`, `short_histology`, `broad_histology`, and `Notes` columns from the v17 histologies file.
 
 ```{r}
 embryonal_results_df <- bind_exp_strategies(embryonal_results_df) %>%
@@ -140,23 +158,42 @@ embryonal_results_df <- bind_exp_strategies(embryonal_results_df) %>%
                     short_histology,
                     broad_histology,
                     Notes),
-             by = "Kids_First_Biospecimen_ID")
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = 
+           ifelse(molecular_subtype == "CNS Embryonal, NOS", "CNS Embryonal Tumor, NOS",
+                  ifelse(molecular_subtype == "CNS HGNET-MN1", "CNS Embryonal Tumor, HGNET-MN1",
+                         ifelse(molecular_subtype == "CNS NB-FOXR2", "CNS neuroblastoma",
+                                ifelse(molecular_subtype == "ETMR, C19MC-altered", "Embryonal tumor with multilayer rosettes, C19MC-altered",
+                                       ifelse(molecular_subtype == "ETMR, NOS",
+                "Embryonal tumor with multilayer rosettes, NOS", NA))))),
+        short_histology = 
+          ifelse(molecular_subtype %in% c("ETMR, C19MC-altered", "ETMR, NOS"),
+                "ETMR", "Embryonal Tumor"),
+        broad_histology = "Embryonal Tumor")
 ```
 
 ### Handling EWS
 
-The EWS results were post-v15 and come with their own `Notes` column.
+The EWS results were updated in v18. For now, we add the integrated diagnosis, broad histology, and short histology here.
 
 ```{r}
+# Add EWS integrated diagnosis, broad histology, short histology
 ews_results_df <- bind_exp_strategies(ews_results_df) %>%
-  rename(integrated_diagnosis = integrated_diagnosis_reclassified,
-         short_histology = short_histology_reclassified,
-         broad_histology = broad_histology_reclassified) 
+  inner_join(select(histologies_df,
+                    Kids_First_Biospecimen_ID,
+                    integrated_diagnosis,
+                    short_histology,
+                    broad_histology,
+                    Notes),
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = "Ewing sarcoma",
+         broad_histology = "Mesenchymal non-meningothelial tumor",
+         short_histology = "EWS")
 ```
 
 ### Handling HGG
 
-Like the non-ATRT/non-MB embryonal tumors, HGG subtyping was performed prior to v15.
+HGG subtyping was updated in v18.
 
 ```{r}
 hgg_results_df <- bind_exp_strategies(hgg_results_df) %>%
@@ -166,30 +203,119 @@ hgg_results_df <- bind_exp_strategies(hgg_results_df) %>%
                     short_histology,
                     broad_histology,
                     Notes),
-             by = "Kids_First_Biospecimen_ID") 
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = 
+           ifelse(molecular_subtype == "DMG, H3 K28", "Diffuse midline glioma, H3 K28-mutant",
+         ifelse(molecular_subtype == "DMG, H3 K28, BRAF V600E", "Diffuse midline glioma, H3 K28-mutant, BRAF V600E",
+                ifelse(molecular_subtype == "HGG, BRAF V600E", "High-grade glioma/astrocytoma, BRAF V600E", 
+                       ifelse(molecular_subtype == "HGG, H3 G35", "High-grade glioma/astrocytoma, H3 G35-mutant",
+                         ifelse(molecular_subtype == "HGG, H3 wildtype", "High-grade glioma/astrocytoma, H3 wildtype",
+                                ifelse(molecular_subtype == "HGG, IDH", "High-grade glioma/astrocytoma, IDH-mutant", NA)))))),
+         broad_histology = "Diffuse astrocytic and oligodendroglial tumor",
+         short_histology = "HGAT")
 ```
 
 ### Handling LGAT
 
-No columns that are disease labels have been changed yet.
-
 ```{r}
 lgat_results_df <- bind_exp_strategies(lgat_results_df) %>%
   inner_join(select(histologies_df,
                     Kids_First_Biospecimen_ID,
+                    pathology_diagnosis,
                     integrated_diagnosis,
                     short_histology,
                     broad_histology,
                     Notes),
-             by = "Kids_First_Biospecimen_ID") 
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = 
+           ifelse(molecular_subtype == "LGG, BRAF fusion", "Low-grade glioma/astrocytoma, BRAF fusion",
+                  ifelse(molecular_subtype == "LGG, BRAF V600E", "Low-grade glioma/astrocytoma, BRAF V600E", 
+                         ifelse(molecular_subtype == "LGG, BRAF wildtype", "Low-grade glioma/astrocytoma, BRAF wildtype", NA))),
+         broad_histology = "Low-grade astrocytic tumor",
+         short_histology = ifelse(pathology_diagnosis == "Ganglioglioma", "Ganglioglioma", "LGAT"))
 ```
 
 ### Handling EPN 
 
+```{r}
+epn_results_df <- bind_exp_strategies(epn_results_df) %>%
+  inner_join(select(histologies_df,
+                    Kids_First_Biospecimen_ID,
+                    integrated_diagnosis,
+                    short_histology,
+                    broad_histology,
+                    Notes),
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(molecular_subtype = subgroup) %>%
+  mutate(molecular_subtype = ifelse(is.na(molecular_subtype), "EPN, To be classified", molecular_subtype), 
+         integrated_diagnosis = ifelse(molecular_subtype == "PT_EPN_A", 
+                                       "Posterior Fossa Ependymoma, Type A",
+                                       ifelse(molecular_subtype == "ST_EPN_RELA",
+                                       "Supratentorial Ependymoma, RELA fusion positive",
+                                       ifelse(molecular_subtype == "ST_EPN_YAP1", "Supratentorial Ependymoma, YAP1 fusion positive", NA))),
+         broad_histology = "Ependymal Tumor",
+         short_histology = "Ependymoma")
 ```
-# TODO
+
+### Handling MB
+
+```{r}
+mb_results_df <- bind_exp_strategies(mb_results_df) %>%
+  inner_join(select(histologies_df,
+                    Kids_First_Biospecimen_ID,
+                    integrated_diagnosis,
+                    short_histology,
+                    broad_histology,
+                    Notes),
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = ifelse(molecular_subtype == "MB, SHH",
+                                       "Medulloblastoma, SHH-activated",
+                                       ifelse(molecular_subtype == "MB, WNT","Medulloblastoma, WNT-activated",
+                                              ifelse(molecular_subtype == "MB, Group3", 
+                                                     "Medulloblastoma, group 3",
+                                                     ifelse(molecular_subtype == "MB, Group4",
+                                                            "Medulloblastoma, group 4", NA)))),
+         broad_histology = "Embryonal Tumor",
+         short_histology = "Medulloblastoma")
+```
+
+
+### Handling CRANIO
+
+```{r}
+cranio_results_df <- bind_exp_strategies(cranio_results_df) %>%
+  inner_join(select(histologies_df,
+                    Kids_First_Biospecimen_ID,
+                    integrated_diagnosis,
+                    short_histology,
+                    broad_histology,
+                    Notes),
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = ifelse(molecular_subtype == "CRANIO, ADAM",
+                                       "Adamantinomatous craniopharyngioma",
+                                       ifelse(molecular_subtype == "CRANIO, PAP","Papillary craniopharyngioma", NA)),
+         broad_histology = "Tumors of sellar region",
+         short_histology = "Craniopharyngioma")
 ```
 
+### Handling Neurocytoma
+```{r}
+neurocytoma_results_df <- neurocytoma_results_df %>%
+  inner_join(select(histologies_df,
+                    Kids_First_Biospecimen_ID,
+                    integrated_diagnosis,
+                    short_histology,
+                    broad_histology,
+                    Notes),
+             by = "Kids_First_Biospecimen_ID") %>%
+  mutate(integrated_diagnosis = ifelse(molecular_subtype == "CNC",
+                                       "Central Neurocytoma",
+                                       ifelse(molecular_subtype == "EVN","Extraventricular Neurocytoma", NA)),
+         broad_histology = "Neuronal and mixed neuronal-glial tumor",
+         short_histology = "Neurocytoma")
+```
+
+
 ### All results
 
 Compile results, sort, and write to file
@@ -198,7 +324,13 @@ Compile results, sort, and write to file
 all_results_df <- bind_rows(embryonal_results_df,
                             ews_results_df,
                             hgg_results_df,
-                            lgat_results_df) %>%
+                            lgat_results_df,
+                            epn_results_df,
+                            cranio_results_df,
+                            mb_results_df,
+                            neurocytoma_results_df) %>%
+  select(Kids_First_Participant_ID, sample_id, Kids_First_Biospecimen_ID, molecular_subtype,
+         integrated_diagnosis, short_histology, broad_histology, Notes) %>%
   arrange(Kids_First_Participant_ID, sample_id) %>%
   write_tsv(output_file)
 ```
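
The nested `ifelse()` calls above map each `molecular_subtype` value to an `integrated_diagnosis` label and fall back to `NA` when nothing matches. As one possible alternative, here is a minimal sketch of the HGG mapping written with `dplyr::case_when()`; it is not the notebook's code. `toy_hgg_df`, its biospecimen IDs, and the unmapped subtype string are made up for illustration, while the label strings are copied from the HGG chunk.

```r
library(dplyr)

# Toy stand-in for hgg_results_df (hypothetical IDs; the real table has more columns)
toy_hgg_df <- tibble(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C"),
  molecular_subtype = c("DMG, H3 K28", "HGG, IDH", "made-up unmapped subtype")
)

labeled_hgg_df <- toy_hgg_df %>%
  mutate(
    # each condition is checked in order; the first match wins
    integrated_diagnosis = case_when(
      molecular_subtype == "DMG, H3 K28" ~ "Diffuse midline glioma, H3 K28-mutant",
      molecular_subtype == "DMG, H3 K28, BRAF V600E" ~ "Diffuse midline glioma, H3 K28-mutant, BRAF V600E",
      molecular_subtype == "HGG, BRAF V600E" ~ "High-grade glioma/astrocytoma, BRAF V600E",
      molecular_subtype == "HGG, H3 G35" ~ "High-grade glioma/astrocytoma, H3 G35-mutant",
      molecular_subtype == "HGG, H3 wildtype" ~ "High-grade glioma/astrocytoma, H3 wildtype",
      molecular_subtype == "HGG, IDH" ~ "High-grade glioma/astrocytoma, IDH-mutant",
      # anything unmapped (including NA subtypes) gets NA, as in the ifelse() chain
      TRUE ~ NA_character_
    ),
    # a single `=` assigns the column; `==` would only compare values
    broad_histology = "Diffuse astrocytic and oligodendroglial tumor",
    short_histology = "HGAT"
  )

print(labeled_hgg_df)
```

Each subtype sits on its own line next to its label, so adding or auditing a mapping does not require counting closing parentheses.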

diff --git a/analyses/molecular-subtyping-pathology/01-compile-subtyping-results.nb.html b/analyses/molecular-subtyping-pathology/01-compile-subtyping-results.nb.html
diff --git a/analyses/molecular-subtyping-pathology/02-incorporate-clinical-feedback.nb.html b/analyses/molecular-subtyping-pathology/02-incorporate-clinical-feedback.nb.html
diff --git a/analyses/molecular-subtyping-pathology/03-incorporate-pathology-feedback.Rmd b/analyses/molecular-subtyping-pathology/03-incorporate-pathology-feedback.Rmd
@@ -4,7 +4,7 @@ output:
   html_notebook:
     toc: TRUE
     toc_float: TRUE
-author: Jaclyn Taroni for ALSF CCDL
+author: Jaclyn Taroni for ALSF CCDL, Jo Lynne Rokita for D3b
 date: 2020
 params:
   is_ci: FALSE
@@ -51,7 +51,7 @@ And the notes:
 > #### Few notes:
 > 1. `PT_7E3V3JFX` specimens were consistent with the original EPN dx, so pathology would call this a rare EPN, H3 K28 mutated tumor, rather than DMG.
 > 2. `PT_AQWDQW27` specimen was consistent with meningioma, even though it has a hallmark EPN fusion, so pathology would also call this a rare meningioma with a _YAP1_ fusion.
-> 3. Because 1 is a rare tumor (maybe first seen), the logic of searching for all H3 K28 mutations in [HGG subtyping](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) would convert this sample by default - how to handle this? 
+> 3. Because 1 is a rare tumor (maybe first seen), the logic of searching for all H3 K28 mutations in [HGG subtyping](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) would convert this sample by default.
 >  4. Pathology confirmed this HGG BRAF V600E mutant tumor, [`BS_H1XPVS9A`](https://cbethell.github.io/open-pbta-output/09-HGG-with-braf-clustering.nb.html#identify_sample_that_clusters_with_lgat), to be a LGAT (PXA). I updated `molecular_subtype` here based on what it would look like, but this should come through via the LGAT [subtyping](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/631) ticket. How should we add this info?
 
 ## Set up 
@@ -92,7 +92,7 @@ When we run this locally, we want to tie it to a specific version of the histolo
 if (running_in_ci) {
   histologies_file <- file.path(data_dir, "pbta-histologies.tsv")
 } else {
-  histologies_file <- file.path(data_dir, "release-v15-20200228",
+  histologies_file <- file.path(data_dir, "release-v17-20200908",
                                 "pbta-histologies.tsv")
 }
 ```
@@ -217,7 +217,9 @@ compiled_df <- compiled_df %>%
   )
 ```
 
-### HGG BRAF V600E
+### HGG BRAF V600E 
+
+We will be removing this from subtyping, so this section can be left as is for now.
 
 The following point comes from another issue [#627 (comment)](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/627#issuecomment-598789232):
 
@@ -244,9 +246,28 @@ compiled_df <- compiled_df %>%
 
 The `molecular-subtyping-EPN` module has not been completed yet, but the logic that is in that module may mean that we need to include revising the labels of `PT_AQWDQW27`.
 
+```{r}
+compiled_df %>%
+  filter(Kids_First_Participant_ID == "PT_AQWDQW27")
+# This sample is missing from the EPN table, but it should be there - will have to investigate and update this later.
+```
+
+### `PT_6Q0NPVP3`
+
+The specimens for this patient, `BS_5JM573JC` and `BS_E5H6CFYT`, were classified as HGAT due to the presence of a histone mutation, but with the removal of LGAT from the HGAT module, they will no longer show up in two modules.
+```{r}
+compiled_df %>%
+  filter(Kids_First_Participant_ID == "PT_6Q0NPVP3")
 ```
-# TODO: do we need to update PT_AQWDQW27 once molecular-subtyping-EPN is 
-# complete?
+
+### Are there any other duplicate subtypes?
+```{r}
+unique_subtypes <- compiled_df %>%
+  select(Kids_First_Participant_ID, sample_id, molecular_subtype) %>%
+  distinct()
+
+unique_subtypes[duplicated(unique_subtypes$sample_id), ]
+# Duplicates found: PT_KTRJ8TFY (fixed in clinical feedback) and PT_6Q0NPVP3 (fixed by removing LGAT from the HGG module)
 ```
 
 ### Write revised table to file
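
The duplicate-subtype check in `03-incorporate-pathology-feedback.Rmd` relies on `duplicated()`, which returns only the second and later occurrences of a `sample_id`. A minimal sketch of a variant that returns every conflicting row, so the competing subtypes can be read side by side, might look like the following. `toy_compiled_df` and all IDs in it are made up; only the column names come from the notebook.

```r
library(dplyr)

# Toy stand-in for compiled_df (hypothetical participants and samples)
toy_compiled_df <- tibble(
  Kids_First_Participant_ID = c("PT_1", "PT_1", "PT_2", "PT_2", "PT_3"),
  sample_id = c("S1", "S1", "S2", "S2", "S3"),
  molecular_subtype = c("HGG, H3 wildtype", "LGG, BRAF V600E",
                        "MB, SHH", "MB, SHH", "MB, WNT")
)

conflicting_subtypes <- toy_compiled_df %>%
  # collapse rows that repeat the same subtype (e.g., one row per biospecimen)
  distinct(Kids_First_Participant_ID, sample_id, molecular_subtype) %>%
  group_by(sample_id) %>%
  # keep samples that still have more than one distinct subtype
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(sample_id, molecular_subtype)

print(conflicting_subtypes)
```

In this toy table only `S1` is reported, with both of its subtypes; `S2` is not flagged because its two rows agree.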

diff --git a/analyses/molecular-subtyping-pathology/03-incorporate-pathology-feedback.nb.html b/analyses/molecular-subtyping-pathology/03-incorporate-pathology-feedback.nb.html