This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Update molecular-subtyping-HGG to use pathology diagnosis fields #786

Merged
24 commits
b3eefa4
Remove disease label reclassification step
jaclyn-taroni Sep 19, 2020
8f3e832
Add a notebook that preps strings for inclusion/exclusion
jaclyn-taroni Sep 20, 2020
e20e3db
Add a column for TRUE/FALSE presence of defining lesion
jaclyn-taroni Sep 20, 2020
d445b5b
Moar identifiers
jaclyn-taroni Sep 20, 2020
e26dad0
Add a couple more path dx strings
jaclyn-taroni Sep 20, 2020
91ba044
Update subsetting to use the path dx fields
jaclyn-taroni Sep 20, 2020
e8480cd
Add intermediate metadata file for included specimens
jaclyn-taroni Sep 20, 2020
c808a51
Use intermediate metadata file for CNV cleaning
jaclyn-taroni Sep 20, 2020
ec24122
Remove defining_lesion from cleaned mutation table
jaclyn-taroni Sep 20, 2020
6881692
glioma_brain_region -> CNS_region
jaclyn-taroni Sep 20, 2020
a61f921
Run the whole module with modifications
jaclyn-taroni Sep 20, 2020
886ffa6
Uncomment HGG subtyping step in CI
jaclyn-taroni Sep 20, 2020
37dbc24
Finish sentence
jaclyn-taroni Sep 20, 2020
2640fda
Update documentation to reflect changes
jaclyn-taroni Sep 20, 2020
b80a69a
Apply suggestions from code review
jaclyn-taroni Sep 21, 2020
1ce3ed8
Docs and terms changes from code review
jaclyn-taroni Sep 21, 2020
b55fa9d
All lowercase for matching path dx
jaclyn-taroni Sep 21, 2020
f65c803
Rerun the entire module
jaclyn-taroni Sep 21, 2020
4ceefd0
Exact path dx matches for CBTTC samples
jaclyn-taroni Sep 23, 2020
f4b35e0
Rerun entire module
jaclyn-taroni Sep 23, 2020
1b66ce0
Update docs to reflect exact match changes
jaclyn-taroni Sep 23, 2020
37a07d2
Merge branch 'master' into jaclyn-taroni/hgg-path-dx
jaclyn-taroni Sep 23, 2020
bee18c7
Merge branch 'master' into jaclyn-taroni/hgg-path-dx
jaclyn-taroni Sep 25, 2020
f833ac7
Add back in Gliomatosis Cerebri; rerun
jaclyn-taroni Sep 25, 2020
8 changes: 3 additions & 5 deletions .circleci/config.yml
@@ -36,11 +36,9 @@ jobs:

### MOLECULAR SUBTYPING ###

# TODO: This is currently broken because of a change from glioma_brain_region to CNS_brain_region
# The fix is tracked in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754#issuecomment-691525827
# - run:
# name: Molecular Subtyping - HGG
# command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-HGG/run-molecular-subtyping-HGG.sh
- run:
name: Molecular Subtyping - HGG
command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-HGG/run-molecular-subtyping-HGG.sh

- run:
name: Molecular subtyping - Non-MB/Non-ATRT Embryonal tumors
164 changes: 164 additions & 0 deletions analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
@@ -0,0 +1,164 @@
---
title: "Select pathology diagnoses for inclusion"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: Jaclyn Taroni for ALSF CCDL
date: 2020
---

## Background

Originally, we subtyped tumors in this module if the specimen satisfied one of the following criteria:

1. A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V)
2. The `short_histology` was `HGAT`.

In an upcoming release, `integrated_diagnosis`, which can be updated as the result of subtyping, will be used to populate the `short_histology` column (see [#748](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/748)).
Thus, molecular subtyping modules need to be upstream of `short_histology` and use the `pathology_diagnosis` and `pathology_free_text_diagnosis` fields.
This change for this module is tracked in [#754](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754).

Filtering on the basis of `short_histology == HGAT` is more straightforward than using the pathology diagnosis fields, so we include this notebook to put together the terms in `pathology_diagnosis` and `pathology_free_text_diagnosis`.

We will use the 2016 WHO Classification as our guide ([Louis et al. _Acta Neuropathol._ doi: 10.1007/s00401-016-1545-1](https://doi.org/10.1007/s00401-016-1545-1)) and take a look at the current version of the histology file (`release-v17-20200908`).

## Set up

```{r}
library(tidyverse)
```

### Directories and files

We're going to tie this to a specific release.
Member


I can see why this makes sense to do, but it also seems like something that could easily be missed in updates (I see that it is discussed in the readme, but still worry). I'm wondering if this could be put in the RMD params for easier future updates?

Maybe this is not needed, as this is meant to be a run-once notebook, but how are we planning to handle updates to the JSON if needed? If the vocabulary were to change, would we update that manually and deprecate this notebook?

Member Author


I don't think we will be in a situation where we want to look at the contents of the histologies files on the basis of short_histology == "HGAT" again, once short_histology fully depends on molecular_subtype (planned for next release). I can definitely see a situation where we would update the JSON (that's the rationale for including it rather than hardcoding in 02!) but I don't think the process for getting those terms would be the same.

> If the vocabulary were to change, would we update that manually and deprecate this notebook?

All that to say - I think you're correct here.
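
The parameter idea raised in this thread could look roughly like the sketch below. It is only an illustration, not part of the PR: the `release` parameter name is hypothetical, and the `params` list is built by hand here to stand in for the one R Markdown creates from a `params:` entry in the YAML header.

```r
# In a parameterized notebook, the YAML header would declare something like:
#   params:
#     release: "release-v17-20200908"
# and knitr would supply `params` at render time. Here we build it by hand
# (hypothetical parameter name) so the sketch runs on its own.
params <- list(release = "release-v17-20200908")

# Same path construction as in the notebook, with the release taken from params
data_dir <- file.path("..", "..", "data", params$release)
histologies_file <- file.path(data_dir, "pbta-histologies.tsv")

histologies_file
```

With this layout, pointing the notebook at a different release would mean changing one value in the header rather than editing the chunk.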


```{r}
data_dir <- file.path("..", "..", "data", "release-v17-20200908")
histologies_file <- file.path(data_dir, "pbta-histologies.tsv")
```

We're going to save the pathology diagnosis information we'll use to generate the subset files in a directory `hgg-subset`.

```{r}
output_dir <- "hgg-subset"
output_file <- file.path(output_dir,
"hgg_subtyping_path_dx_strings.json")
```

## Read in data

```{r}
histologies_df <- read_tsv(histologies_file)
```

## Explore the pathology diagnoses

### `short_histology == HGAT`

In the current histologies file, if we filter based on `short_histology` as we did originally, what is in the pathology diagnosis fields?
Note that some of the `short_histology` values will have been altered based on earlier subtyping efforts.
(That's why we're doing this!)

```{r}
histologies_df %>%
filter(short_histology == "HGAT") %>%
count(pathology_diagnosis) %>%
arrange(desc(n))
```

For the most part, this is as we would expect given the 2016 WHO classifications.
In an initial round of subtyping, PNET specimens were reclassified (see [this comment on #609](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/609#issuecomment-602821376)).
We should not and will not include PNET samples in the criteria used to detect samples for subtyping from the pathology diagnosis fields.
Instead, the samples that were reclassified earlier should be included downstream on the basis of defining lesions.

Although some of these terms appear to be part of a defined vocabulary, there are others like `High Grade Glial Neoplasma` that were not subject to the same harmonization.
These are likely from the completed PNOC trial (see [#754 (comment)](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/754#issuecomment-697004412)).
Let's see what we get when we filter by `cohort == "PNOC003"`.


```{r}
histologies_df %>%
filter(cohort == "PNOC003",
# Filter out normal WGS rows
sample_type != "Normal") %>%
count(pathology_diagnosis) %>%
arrange(desc(n))
```

As anticipated, these diagnoses all indicate samples that should be included for subtyping, but they are not harmonized.

Let's take a look at the free text field when filtering with `short_histology == HGAT`.

```{r}
histologies_df %>%
filter(short_histology == "HGAT") %>%
group_by(pathology_free_text_diagnosis) %>%
tally() %>%
arrange(desc(n))
```

As we might expect from a free text field, this is even less uniform.

## Pathology diagnosis strings for inclusion

For the CBTTC samples, the `pathology_diagnosis` field is harmonized, so we can use the terms below to look for exact matches.

```{r}
exact_path_dx <- c(
"High-grade glioma/astrocytoma (WHO grade III/IV)",
"Brainstem glioma- Diffuse intrinsic pontine glioma",
"Gliomatosis Cerebri"
)
```

In addition, all samples from the PNOC003 trial should be included.

Let's take a look at a first attempt using these terms as described above.

```{r}
filtered_on_dx_df <- histologies_df %>%
filter(pathology_diagnosis %in% exact_path_dx |
cohort == "PNOC003",
# Exclude normal samples when filtering on cohort
sample_type != "Normal") %>%
select(Kids_First_Biospecimen_ID,
sample_id,
Kids_First_Participant_ID,
pathology_diagnosis,
pathology_free_text_diagnosis,
integrated_diagnosis,
short_histology)

filtered_on_dx_df
```
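
To make the behavior of the filter above concrete, here is a minimal sketch on a toy table. Every row, including the biospecimen IDs and the non-harmonized diagnosis string, is invented for illustration and does not come from the histologies file.

```r
library(dplyr)

exact_path_dx <- c(
  "High-grade glioma/astrocytoma (WHO grade III/IV)",
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Gliomatosis Cerebri"
)

# Toy rows (all values made up) covering each case the filter has to handle
toy_df <- tibble::tribble(
  ~Kids_First_Biospecimen_ID, ~cohort,   ~sample_type, ~pathology_diagnosis,
  "BS_TOY1",                  "CBTTC",   "Tumor",      "High-grade glioma/astrocytoma (WHO grade III/IV)",
  "BS_TOY2",                  "CBTTC",   "Tumor",      "Low-grade glioma/astrocytoma (WHO grade I/II)",
  "BS_TOY3",                  "PNOC003", "Tumor",      "High Grade Glial Neoplasm",
  "BS_TOY4",                  "PNOC003", "Normal",     NA,
  "BS_TOY5",                  "CBTTC",   "Normal",     NA
)

toy_included <- toy_df %>%
  filter(
    # Exact match on the harmonized term OR membership in the PNOC003 cohort
    pathology_diagnosis %in% exact_path_dx | cohort == "PNOC003",
    # Normal rows are dropped regardless of how they matched
    sample_type != "Normal"
  )

toy_included$Kids_First_Biospecimen_ID
```

Because `%in%` returns `FALSE` rather than `NA` for a missing diagnosis, rows without a `pathology_diagnosis` are only considered through the cohort condition, and the `sample_type` condition then removes the normal PNOC003 row.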

Let's tally the values in `pathology_diagnosis` in this data frame.

```{r}
filtered_on_dx_df %>%
count(pathology_diagnosis) %>%
arrange(desc(n))
```

We are not including any samples with pathology diagnoses outside of what we should include for subtyping.

### Save the strings we'll use downstream

Create a list with the strings we'll use for inclusion.

```{r}
terms_list <- list(exact_path_dx = exact_path_dx)
```

Save this list as JSON.

```{r}
writeLines(jsonlite::prettify(jsonlite::toJSON(terms_list)), output_file)
```
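
A downstream step could read the saved terms back in along these lines. This is a sketch under assumptions: it writes to a temporary file instead of `hgg-subset/hgg_subtyping_path_dx_strings.json`, and the name `path_dx_list` is made up for the example.

```r
library(jsonlite)

# Same terms as defined earlier in the notebook
exact_path_dx <- c(
  "High-grade glioma/astrocytoma (WHO grade III/IV)",
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Gliomatosis Cerebri"
)
terms_list <- list(exact_path_dx = exact_path_dx)

# A temporary file stands in for the real output file in this sketch
terms_file <- tempfile(fileext = ".json")
writeLines(prettify(toJSON(terms_list)), terms_file)

# Reading the file back yields a named list holding a character vector
path_dx_list <- fromJSON(terms_file)
path_dx_list$exact_path_dx
```

Keeping the terms in JSON means a later script can pick up any change to the vocabulary without code edits, which is the rationale given in the review thread for not hardcoding them.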

## Session Info

```{r}
sessionInfo()
```
3,250 changes: 3,250 additions & 0 deletions analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.nb.html

Large diffs are not rendered by default.

@@ -45,19 +45,11 @@ if (!dir.exists(results_dir)) {

# Read in metadata
metadata <-
readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv"), guess_max = 10000) %>%
readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv"),
guess_max = 10000) %>%
dplyr::filter(sample_type == "Tumor",
composition == "Solid Tissue")

# Select wanted columns in metadata for merging and assign to a new object
select_metadata <- metadata %>%
dplyr::select(Kids_First_Participant_ID,
sample_id,
Kids_First_Biospecimen_ID,
broad_histology,
short_histology,
integrated_diagnosis)

# Read in snv consensus mutation data
snv_df <-
data.table::fread(file.path(root_dir,
@@ -111,51 +103,39 @@ snv_lesions_df <- snv_lesions_df %>%
)
) %>%
dplyr::mutate_all(function(x) tidyr::replace_na(x, "No"))
```

Add a column that keeps track of the presence of any defining lesion.
We'll use this to create subset files in the next step.

```{r}
snv_lesions_df <- snv_lesions_df %>%
dplyr::mutate(
defining_lesion = dplyr::case_when(
H3F3A.K28M == "Yes" ~ TRUE,
HIST1H3B.K28M == "Yes" ~ TRUE,
HIST1H3C.K28M == "Yes" ~ TRUE,
HIST2H3C.K28M == "Yes" ~ TRUE,
H3F3A.G35R == "Yes" ~ TRUE,
H3F3A.G35V == "Yes" ~ TRUE,
TRUE ~ FALSE
)
)
```

Add other identifiers and sort.

# Join the selected variables from the metadata with the snv consensus mutation
# and defining lesions data.frame
snv_lesions_df <- select_metadata %>%
```{r}
snv_lesions_df <- metadata %>%
dplyr::select(Kids_First_Participant_ID,
sample_id,
Kids_First_Biospecimen_ID) %>%
dplyr::inner_join(snv_lesions_df,
by = c("Kids_First_Biospecimen_ID" = "Tumor_Sample_Barcode")) %>%
dplyr::select(
dplyr::ends_with("ID"),
dplyr::starts_with("H"),
broad_histology,
short_histology,
integrated_diagnosis
) %>%
dplyr::mutate(
disease_type_reclassified = dplyr::case_when(
H3F3A.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
HIST1H3B.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
HIST1H3C.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
HIST2H3C.K28M == "Yes" ~ "Diffuse midline glioma, H3 K28 mutant",
H3F3A.G35R == "Yes" ~ "High-grade glioma, H3 G35 mutant",
H3F3A.G35V == "Yes" ~ "High-grade glioma, H3 G35 mutant",
TRUE ~ as.character(integrated_diagnosis)),
short_histology_reclassified = dplyr::case_when(
H3F3A.K28M == "Yes" ~ "HGAT",
HIST1H3B.K28M == "Yes" ~ "HGAT",
HIST1H3C.K28M == "Yes" ~ "HGAT",
HIST2H3C.K28M == "Yes" ~ "HGAT",
H3F3A.G35R == "Yes" ~ "HGAT",
H3F3A.G35V == "Yes" ~ "HGAT",
TRUE ~ as.character(short_histology)),
broad_histology_reclassified = dplyr::case_when(
H3F3A.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
HIST1H3B.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
HIST1H3C.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
HIST2H3C.K28M == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
H3F3A.G35R == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
H3F3A.G35V == "Yes" ~ "Diffuse astrocytic and oligodendroglial tumor",
TRUE ~ as.character(broad_histology)),
) %>%
dplyr::arrange(Kids_First_Participant_ID, sample_id)

# Display `snv_lesions_df`
snv_lesions_df
```


## Save final table of results

```{r}
@@ -164,23 +144,9 @@ readr::write_tsv(snv_lesions_df,
file.path(results_dir, "HGG_defining_lesions.tsv"))
```

## Inconsistencies in disease classification

```{r}
# Isolate the samples with the specified mutations that were not classified
# as HGG or DIPG
snv_lesions_df %>%
dplyr::filter(
grepl("High-grade glioma|Diffuse midline glioma", disease_type_reclassified) &
!(integrated_diagnosis %in% c("High-grade glioma",
"Brainstem glioma- Diffuse intrinsic pontine glioma"))
)
```

# Session Info
## Session Info

```{r}
# Print the session information
sessionInfo()
```
