This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
Update molecular-subtyping-HGG to use pathology diagnosis fields #786
Merged: jaclyn-taroni merged 24 commits into AlexsLemonade:master from jaclyn-taroni:jaclyn-taroni/hgg-path-dx on Sep 27, 2020
b3eefa4 jaclyn-taroni: Remove disease label reclassification step
8f3e832 jaclyn-taroni: Add a notebook that preps strings for inclusion/exclusion
e20e3db jaclyn-taroni: Add a column for TRUE/FALSE presence of defining lesion
d445b5b jaclyn-taroni: Moar identifiers
e26dad0 jaclyn-taroni: Add a couple more path dx strings
91ba044 jaclyn-taroni: Update subsetting to use the path dx fields
e8480cd jaclyn-taroni: Add intermediate metadata file for included specimens
c808a51 jaclyn-taroni: Use intermediate metadata file for CNV cleaning
ec24122 jaclyn-taroni: Remove defining_lesion from cleaned mutation table
6881692 jaclyn-taroni: glioma_brain_region -> CNS_region
a61f921 jaclyn-taroni: Run the whole module with modifications
886ffa6 jaclyn-taroni: Uncomment HGG subtyping step in CI
37dbc24 jaclyn-taroni: Finish sentence
2640fda jaclyn-taroni: Update documentation to reflect changes
b80a69a jaclyn-taroni: Apply suggestions from code review
1ce3ed8 jaclyn-taroni: Docs and terms changes from code review
b55fa9d jaclyn-taroni: All lowercase for matching path dx
f65c803 jaclyn-taroni: Rerun the entire module
4ceefd0 jaclyn-taroni: Exact path dx matches for CBTTC samples
f4b35e0 jaclyn-taroni: Rerun entire module
1b66ce0 jaclyn-taroni: Update docs to reflect exact match changes
37a07d2 jaclyn-taroni: Merge branch 'master' into jaclyn-taroni/hgg-path-dx
bee18c7 jaclyn-taroni: Merge branch 'master' into jaclyn-taroni/hgg-path-dx
f833ac7 jaclyn-taroni: Add back in Gliomatosis Cerebri; rerun
164 changes: 164 additions & 0 deletions
analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
--- | ||
title: "Select pathology diagnoses for inclusion" | ||
output: | ||
html_notebook: | ||
toc: TRUE | ||
toc_float: TRUE | ||
author: Jaclyn Taroni for ALSF CCDL | ||
date: 2020 | ||
--- | ||
|
||
## Background | ||
|
||
Originally, we subtyped tumors in this module if the specimen satisfied one of the following criteria: | ||
|
||
1. A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V) | ||
2. The `short_histology` was `HGAT`. | ||
|
||
In an upcoming release, `integrated_diagnosis`, which can be updated as the result of subtyping, will be used to populate the `short_histology` column (see [#748](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/748)). | ||
Thus, molecular subtyping modules need to be upstream of `short_histology` and use the `pathology_diagnosis` and `pathology_free_text_diagnosis` fields. | ||
This change for this module is tracked in [#754](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754). | ||
|
||
Filtering on the basis of `short_histology == HGAT` is more straightforward than using the pathology diagnosis fields, so we include this notebook to put together the terms in `pathology_diagnosis` and `pathology_free_text_diagnosis`. | ||
|
||
We will use the 2016 WHO Classification as our guide ([Louis et al. _Acta Neuropathol._ doi: 10.1007/s00401-016-1545-1](https://doi.org/10.1007/s00401-016-1545-1)) and take a look at the current version of the histology file (`release-v17-20200908`). | ||
|
||
## Set up | ||
|
||
```{r} | ||
library(tidyverse) | ||
``` | ||
|
||
### Directories and files | ||
|
||
We're going to tie this to a specific release. | ||
|
||
```{r} | ||
data_dir <- file.path("..", "..", "data", "release-v17-20200908") | ||
histologies_file <- file.path(data_dir, "pbta-histologies.tsv") | ||
``` | ||
|
||
We're going to save the pathology diagnosis information we'll use to generate the subset files in a directory `hgg-subset`. | ||
|
||
```{r} | ||
output_dir <- "hgg-subset" | ||
output_file <- file.path(output_dir, | ||
"hgg_subtyping_path_dx_strings.json") | ||
``` | ||
|
||
## Read in data | ||
|
||
```{r} | ||
histologies_df <- read_tsv(histologies_file) | ||
``` | ||
|
||
## Explore the pathology diagnoses | ||
|
||
### `short_histology == HGAT` | ||
|
||
In the current histologies file, if we filter based on `short_histology` as we did originally, what is in the pathology diagnosis fields? | ||
Note that some of the `short_histology` values will have been altered based on earlier subtyping efforts. | ||
(That's why we're doing this!) | ||
|
||
```{r} | ||
histologies_df %>% | ||
filter(short_histology == "HGAT") %>% | ||
count(pathology_diagnosis) %>% | ||
arrange(desc(n)) | ||
``` | ||
|
||
For the most part, this is as we would expect given the 2016 WHO classifications. | ||
In an initial round of subtyping, PNET specimens were reclassified (see [this comment on #609](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/609#issuecomment-602821376)). | ||
We should not and will not include PNET samples in the criteria used to detect samples for subtyping from the pathology diagnosis fields. | ||
Instead, the samples that were reclassified earlier should be included downstream on the basis of defining lesions. | ||
|
||
Although some of these terms appear to be part of a defined vocabulary, there are others like `High Grade Glial Neoplasma` that were not subject to the same harmonization. | ||
These are likely from the completed PNOC trial (see [#754 (comment)](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754#issuecomment-697004412)). | ||
Let's take a look if we filter by `cohort == "PNOC003"`. | ||
|
||
|
||
```{r} | ||
histologies_df %>% | ||
filter(cohort == "PNOC003", | ||
# Filter out normal WGS rows | ||
sample_type != "Normal") %>% | ||
count(pathology_diagnosis) %>% | ||
arrange(desc(n)) | ||
``` | ||
|
||
As anticipated, these are all indicative of samples that should be included for subtyping, but they are not harmonized. | ||
|
||
Let's take a look at the free text field when filtering with `short_histology == HGAT`. | ||
|
||
```{r} | ||
histologies_df %>% | ||
filter(short_histology == "HGAT") %>% | ||
group_by(pathology_free_text_diagnosis) %>% | ||
tally() %>% | ||
arrange(desc(n)) | ||
``` | ||
|
||
As we might expect from a free text field, this is even less uniform. | ||
|
||
## Pathology diagnosis strings for inclusion | ||
|
||
For the CBTTC samples, the `pathology_diagnosis` fields are harmonized, so we can use the terms below to look for exact matches. | ||
|
||
```{r} | ||
exact_path_dx <- c( | ||
"High-grade glioma/astrocytoma (WHO grade III/IV)", | ||
"Brainstem glioma- Diffuse intrinsic pontine glioma", | ||
"Gliomatosis Cerebri" | ||
) | ||
``` | ||
|
||
And all samples from the PNOC003 trial should be included. | ||
|
||
Let's take a look at a first attempt using these terms as described above. | ||
|
||
```{r} | ||
filtered_on_dx_df <- histologies_df %>% | ||
filter(pathology_diagnosis %in% exact_path_dx | | ||
cohort == "PNOC003", | ||
# Exclude normal samples when filtering on cohort | ||
sample_type != "Normal") %>% | ||
select(Kids_First_Biospecimen_ID, | ||
sample_id, | ||
Kids_First_Participant_ID, | ||
pathology_diagnosis, | ||
pathology_free_text_diagnosis, | ||
integrated_diagnosis, | ||
short_histology) | ||
|
||
filtered_on_dx_df | ||
``` | ||
|
||
Let's tally the values in `pathology_diagnosis` in this data frame. | ||
|
||
```{r} | ||
filtered_on_dx_df %>% | ||
count(pathology_diagnosis) %>% | ||
arrange(desc(n)) | ||
``` | ||
|
||
We are not including any samples with pathology diagnoses outside of what we should include for subtyping. | ||
|
||
### Save the strings we'll use downstream | ||
|
||
Create a list with the strings we'll use for inclusion. | ||
|
||
```{r} | ||
terms_list <- list(exact_path_dx = exact_path_dx) | ||
``` | ||
|
||
Save this list as JSON. | ||
|
||
```{r} | ||
writeLines(jsonlite::prettify(jsonlite::toJSON(terms_list)), output_file) | ||
``` | ||
|
||
## Session Info | ||
|
||
```{r} | ||
sessionInfo() | ||
``` |
3,250 changes: 3,250 additions & 0 deletions
analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.nb.html
Large diffs are not rendered by default.
I can see why this makes sense to do, but it also seems like something that could easily be missed in updates (I see that it is discussed in the readme, but still worry). I'm wondering if this could be put in the RMD `params` for easier future updates?

Maybe this is not needed, as this is meant to be a run-once notebook, but how are we planning to handle updates to the JSON if needed? If the vocabulary were to change, would we update that manually and deprecate this notebook?
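For illustration only, here is a minimal sketch of what the `params` idea could look like, assuming the comment refers to the hard-coded release directory in the notebook's set up section. The parameter name `release` is hypothetical and is not part of this pull request.

```r
# Hypothetical sketch: in a parameterized notebook, knitr exposes a `params`
# list built from the YAML header, for example:
#   params:
#     release: release-v17-20200908
# A stand-in list is created here so the snippet runs on its own.
params <- list(release = "release-v17-20200908")

# Same paths as in the notebook, but driven by the (assumed) parameter
data_dir <- file.path("..", "..", "data", params$release)
histologies_file <- file.path(data_dir, "pbta-histologies.tsv")

histologies_file
```

Under this assumption, a different release could be supplied at render time through the `params` argument of `rmarkdown::render()` rather than by editing the notebook body.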
I don't think we will be in a situation where we want to look at the contents of the histologies files on the basis of `short_histology == "HGAT"` again, once `short_histology` fully depends on `molecular_subtype` (planned for next release). I can definitely see a situation where we would update the JSON (that's the rationale for including it rather than hardcoding in `02`!) but I don't think the process for getting those terms would be the same.

All that to say - I think you're correct here.
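To make the "JSON rather than hardcoding" point concrete, the following is a rough sketch of how a downstream step might read the saved terms and apply the same inclusion logic as the notebook. It is not the actual code in `02`. The toy data frame, the `BS_*` identifiers, and the non-matching diagnosis strings are made up for illustration, and base R `subset()` stands in for the `dplyr::filter()` call used in the notebook.

```r
library(jsonlite)

# Write the terms the same way the notebook does, but to a temporary file
exact_path_dx <- c(
  "High-grade glioma/astrocytoma (WHO grade III/IV)",
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Gliomatosis Cerebri"
)
terms_file <- tempfile(fileext = ".json")
writeLines(prettify(toJSON(list(exact_path_dx = exact_path_dx))), terms_file)

# A downstream step would start here: read the terms back in
path_dx_list <- fromJSON(terms_file)

# Toy stand-in for the histologies file (all values are hypothetical)
toy_histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  pathology_diagnosis = c("Gliomatosis Cerebri", "Ependymoma",
                          "High Grade Glioma", NA),
  cohort = c("CBTTC", "CBTTC", "PNOC003", "PNOC003"),
  sample_type = c("Tumor", "Tumor", "Tumor", "Normal"),
  stringsAsFactors = FALSE
)

# Exact matches on pathology_diagnosis OR any PNOC003 sample,
# excluding normal samples in both cases
included_df <- subset(
  toy_histologies_df,
  (pathology_diagnosis %in% path_dx_list$exact_path_dx |
     cohort == "PNOC003") &
    sample_type != "Normal"
)

included_df$Kids_First_Biospecimen_ID
```

Keeping the strings in a JSON file means a vocabulary change would only require editing that file, while the filtering code stays the same.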