Update molecular-subtyping-HGG to use pathology diagnosis fields #786

jaclyn-taroni · 2020-09-20T03:26:17Z

Purpose/implementation Section

Previously, samples were included for subtyping in the molecular-subtyping-HGG module if they met at least one of the criteria below:

A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V)
The short_histology value was HGAT.

In an upcoming release, short_histology will rely on the output of the molecular subtyping modules (#748) and therefore be downstream of the molecular subtyping module.

We need to use pathology_diagnosis and pathology_free_text_diagnosis fields to replace the filtering on the basis of short_histology prior to that change being made.

What was your approach?

Identifying terms in `pathology_diagnosis` and `pathology_free_text_diagnosis`

The issue that tracks this change (#754) doesn't yet have specific information about values of pathology_diagnosis and pathology_free_text_diagnosis that should be used to include samples.
For what I think is a reasonable first pass, I filtered the release-v17-20200908 pbta-histologies.tsv to rows where the short_histology value is HGAT and examined the pathology_diagnosis and pathology_free_text_diagnosis values of the samples that passed filtering.

I then created a list of strings to be used for filtering the pathology_diagnosis and pathology_free_text_diagnosis in the downstream inclusion/exclusion steps and saved that information in analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json.

This takes place in the 00-HGG-select-pathology-dx notebook that has the current data release, release-v17-20200908, hardcoded and is not run via the shell script for the module. It is designed to capture this moment in pbta-histologies.tsv and not be updated.

High-level summary: For the most part, terms in the pathology diagnosis fields are what I, a non-expert, would expect on the basis of the 2016 WHO classification (Louis et al. Acta Neuropathol. doi: 10.1007/s00401-016-1545-1). There are a few unexpected pathology_diagnosis values when filtering by short_histology == "HGAT" (e.g., PNET) but these appear to be due to changes in short_histology introduced into the histologies file due to earlier subtyping efforts (#609 (comment)). There are several terms under Diffuse astrocytic and oligodendroglial tumor, e.g., anaplastic astrocytoma, that when used for detection in the pathology free text diagnosis field return LGG samples (based on the pathology diagnosis).

Inclusion/exclusion of samples

The logic for including samples for subtyping in this module is:

A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V) - this is unchanged, but how that's accomplished differs slightly (see the next section).
Filter to samples that have one of the strings in analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json in pathology_diagnosis or pathology_free_text_diagnosis
Once the filtering on the basis of inclusion occurs, exclude LGG samples based on the pathology_diagnosis field.
Combine the biospecimens identified by the defining lesion and pathology diagnosis steps and remove duplicates (biospecimens that meet both criteria).
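
The inclusion/exclusion steps above can be sketched roughly as follows. This is a minimal illustration in base R on a toy table, not the code in 02-HGG-molecular-subtyping-subset-files.R (which uses `filter()` and `str_detect()`). The `biospecimen_id` column, the toy rows, the term lists, `lesion_ids`, and the `detect_any()` helper are all hypothetical; only the three list names mirror the JSON file.

```r
# Toy stand-in for pbta-histologies.tsv; every value here is made up.
histologies_df <- data.frame(
  biospecimen_id = c("BS_1", "BS_2", "BS_3", "BS_4"),
  pathology_diagnosis = c("High-grade glioma", "Low-grade glioma",
                          "Ependymoma", "PNET"),
  pathology_free_text_diagnosis = c("glioblastoma", "anaplastic astrocytoma",
                                    "ependymoma", "pnet"),
  stringsAsFactors = FALSE
)

# Hypothetical term lists, named after the three entries in the JSON file.
path_dx_list <- list(
  include_path_dx = c("high-grade glioma", "gliomatosis cerebri"),
  include_free_text = c("anaplastic astrocytoma", "glioblastoma"),
  exclude_path_dx = c("low-grade glioma")
)

# Step 1: biospecimens with a defining lesion (assumed output of the 01 notebook).
lesion_ids <- c("BS_1", "BS_4")

# TRUE where any term occurs in x; terms are collapsed with "|" and x is lower-cased.
detect_any <- function(x, terms) {
  grepl(paste0(terms, collapse = "|"), tolower(x))
}

# Step 2: include on the basis of either pathology field.
included <- detect_any(histologies_df$pathology_diagnosis,
                       path_dx_list$include_path_dx) |
  detect_any(histologies_df$pathology_free_text_diagnosis,
             path_dx_list$include_free_text)

# Step 3: after inclusion, exclude LGG based on pathology_diagnosis.
excluded <- detect_any(histologies_df$pathology_diagnosis,
                       path_dx_list$exclude_path_dx)
path_dx_ids <- histologies_df$biospecimen_id[included & !excluded]

# Step 4: combine both sets and drop biospecimens that meet both criteria twice.
hgg_ids <- unique(c(lesion_ids, path_dx_ids))
print(hgg_ids)
```

In this toy table, BS_2 matches a free-text inclusion term but is removed by the exclusion step, which is the situation the exclusion list is meant to handle.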

Other changes

There are a few other updates to this module that were necessary or a good idea while addressing #754 that I'll summarize below.

Removal of the reclassification of short histology, broad histology, etc. from the first notebook (01-HGG-molecular-subtyping-defining-lesions.Rmd) where we handle the defining lesions. The reclassification step meant things like molecular subtyping: BS_N6N147BY #783 cropped up and I'm fairly sure that was only in there in the first place to accommodate the old way of incorporating subtyping information into the histologies file. @jharenza mentioned leaving those fields blank in molecular subtyping: BS_N6N147BY #783 (comment); I think leaving this in would be both unnecessary and confusing. Instead, I add a logical column (defining_lesion) in analyses/molecular-subtyping-HGG/results/HGG_defining_lesions.tsv that indicates the presence of one of the defining lesions described above that's used for including samples downstream. The change to 04-HGG-molecular-subtyping-mutation.Rmd included here is to drop that column from the cleaned mutation table.
The step where we generate the subset files (02) also now outputs the subset of the pbta-histologies.tsv file for biospecimens that are included for subtyping (analyses/molecular-subtyping-HGG/hgg-subset/hgg_metadata.tsv). I thought this would be useful for inspection while we're getting the details hammered out, but this also prevents us from having to repeat the logic for inclusion/exclusion in 03-HGG-molecular-subtyping-cnv.Rmd. You can see these changes in this PR as well.
The required changes to 07-HGG-molecular-subtyping-combine-table.Rmd to accommodate glioma_brain_region being renamed as CNS_region in
Uncomment this step so it runs in CI.

What GitHub issue does your pull request address?

Closes #754 - update HGG subtyping module to use pathology diagnosis information

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json contains the following:

include_path_dx: the strings that are being used to detect samples that should be included for subtyping, filtering on pathology_diagnosis
include_free_text: the strings that are being used to detect samples that should be included for subtyping, filtering on pathology_free_text_diagnosis
exclude_path_dx: the strings that are being used to detect samples that should be excluded from subtyping, filtering on pathology_diagnosis. (This exclusion step happens after the inclusion steps above.)

As a reminder, the rationale for the choice of those strings is captured in the 00-HGG-select-pathology-dx notebook (HTML preview). You can see the implementation of the filtering in 02-HGG-molecular-subtyping-subset-files.R.

See questions below.

Is there anything that you want to discuss further?

I'm making an assumption that the pathology_diagnosis and pathology_free_text_diagnosis are in essence frozen as of release-v17-20200908. Is this an appropriate assumption? If not, we need to have a larger discussion about how and why we're making the molecular subtyping changes to use the pathology diagnosis values.
Are the terms being used to perform the string detection for inclusion and exclusion appropriate?
Are there any issues with the string detection methodology that could cause unexpected behavior that I have not handled?

Please also take a look at the documentation in the module README to see if it's sufficient.

Results

What types of results are included (e.g., table, figure)?

Cleaned molecular data in results and ultimately the table with the subtyping results: analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv.

What is your summary of the results?

Below I'm summarizing the net result of the changes to this module (e.g., samples now being included).

BS_N6N147BY is now included in RNA-seq files (analyses/molecular-subtyping-HGG/results/HGG_cleaned_expression.stranded.tsv [changes z-score values slightly] and analyses/molecular-subtyping-HGG/results/HGG_cleaned_fusion.tsv). This sample is the subject of molecular subtyping: BS_N6N147BY #783. It was reclassified due to efforts in molecular-subtyping-embryonal and therefore the short_histology has been ETMR in the last couple of releases. That's why it's not in the files on master. The pathology diagnosis information all points to HGG, though, so the inclusion of this biospecimen is appropriate.
The PT_RGX23JFP and 7316-2901 participant ID, sample id pair is now included in the subset files (biospecimen IDs: BS_ZZWMD6FA and BS_XW26N96W). The pathology_diagnosis indicates Ganglioglioma, but the pathology_free_text_diagnosis is ganglioglioma and high-grade glioma. Not sure what the right call is here in terms of inclusion or not, but, if the general methodology is deemed appropriate in this module, it may be appropriate to sort that out downstream in molecular-subtyping-pathology. Notably, there's another sample from PT_RGX23JFP (7316-156) where pathology_free_text_diagnosis is ganglioglioma. Is this indicative of a problem upstream or expected?
For samples that were already included, there was no change in the molecular subtype labels.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

jaclyn-taroni · 2020-09-20T17:04:17Z

I'm going to request reviews from @jharenza and @jashapiro, to comment on the appropriateness of the pathology diagnoses used for inclusion/exclusion, the methodology for how those terms got chosen, and the implementation of the filtering. Apologies for the length of the initial comment, but there were a lot of decisions and required downstream changes that, in my opinion, needed context.

jashapiro

This approach, defining inclusion and exclusion terms for the initial list, seems right to me. I had some questions about the way those lists were developed, and whether the ones that use pathology_diagnosis should be exact matches. I tend to think they should, despite the fact that it might be more brittle. My reasoning is that it will be more transparent if we have the full list of included terms for this field, rather than some that are partial matches to potentially more than one of the defined terms.

On the other hand, if we are trying to avoid brittleness, then the comparisons should be case-insensitive, which they are not at the moment (I made this suggestion to be sure the free text was robust, but not for the defined vocab).

Otherwise, this looks like a good model of the approach for other similar modules. One concern is how the JSON file might be updated in the future, but since the 00 notebook is not part of the script, I am not too concerned about changes being made and accidentally undone.

jashapiro · 2020-09-21T15:02:03Z

analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd

+
+### Directories and files
+
+We're going to tie this to a specific release.


I can see why this makes sense to do, but it also seems like something that could easily be missed in updates (I see that it is discussed in the readme, but still worry). I'm wondering if this could be put in the RMD params for easier future updates?

Maybe this is not needed, as this is meant to be a run-once notebook, but how are we planning to handle updates to the JSON if needed? If the vocabulary were to change, would we update that manually and deprecate this notebook?

I don't think we will be in a situation where we want to look at the contents of the histologies files on the basis of short_histology == "HGAT" again, once short_histology fully depends on molecular_subtype (planned for next release). I can definitely see a situation where we would update the JSON (that's the rationale for including it rather than hardcoding in 02!) but I don't think the process for getting those terms would be the same.

> If the vocabulary were to change, would we update that manually and deprecate this notebook?

All that to say - I think you're correct here.

analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd

jashapiro · 2020-09-21T15:18:43Z

analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd

+
+## Pathology diagnosis strings for inclusion
+
+These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column.


Can you say more about why some of the terms are excluded? For example: High Grade Glial Neoplasma. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?

Also, more generally, why not use exact matching here? If this is a defined vocabulary, why not use it with the fully defined values and %in%?

It's not a defined vocabulary as far as I can tell (this is my assumption based on the presence of both Brainstem glioma- Diffuse intrinsic pontine glioma and Infiltrating Dipg which may be incorrect), which is why I hesitate to use exact matching. But you are right that it could be more robust if made case insensitive. I'll go that route unless @jharenza replies and tells us that the values for pathology_diagnosis are a controlled vocabulary.
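
To make the trade-off concrete, here is a small sketch contrasting exact matching with `%in%` against case-insensitive substring matching. It uses base R (`grepl()` standing in for `str_detect()`); the two diagnosis values are the ones quoted above, while the term lists are hypothetical and are not the contents of the JSON file.

```r
# The two pathology_diagnosis values mentioned in this thread.
pathology_diagnosis <- c(
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Infiltrating Dipg"
)

# Hypothetical term lists for illustration only.
exact_terms <- c("Brainstem glioma- Diffuse intrinsic pontine glioma")
partial_terms <- c("diffuse intrinsic pontine glioma", "dipg")

# Exact matching only catches values spelled exactly as listed.
exact_match <- pathology_diagnosis %in% exact_terms

# Substring matching on lower-cased values tolerates spelling variants.
partial_match <- grepl(paste0(partial_terms, collapse = "|"),
                       tolower(pathology_diagnosis))

print(data.frame(pathology_diagnosis, exact_match, partial_match))
```

Exact matching is more transparent about what is included, but it misses the second value unless every variant is listed. Substring matching catches both at the cost of possibly matching values nobody anticipated.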

> Can you say more about why some of the terms are excluded? For example: High Grade Glial Neoplasma. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?

Regarding the PNET samples, there's nothing in the strings used for matching that is designed to capture these PNET. The PNET samples that were reclassified should be captured by the defining lesions step in 02. My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.

You're right that High Grade Glial Neoplasma should probably be included. I will make that change.

> My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.

Just catching up on this, but yes, @jaclyn-taroni you are correct here.

jashapiro · 2020-09-21T15:39:53Z

analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd

+These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column.
+
+```{r}
+path_dx_terms <- c(


If not matching exactly, do we want to use all lower case here (or during matching)?

analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd

jashapiro · 2020-09-21T15:48:40Z

analyses/molecular-subtyping-HGG/02-HGG-molecular-subtyping-subset-files.R

+  filter(str_detect(pathology_diagnosis, 
+                    paste0(path_dx_list$include_path_dx, collapse = "|")) | 


Just tagging here to note that discussion of full text vs. subseqs applies here and could result in changes.

jashapiro · 2020-09-21T15:51:33Z

analyses/molecular-subtyping-HGG/02-HGG-molecular-subtyping-subset-files.R

+# Now samples on the basis of the defining lesions
+lesions_df <- tumor_metadata_df %>%
+  filter(sample_id %in% hgg_lesions_df$sample_id)


For clarity, this seems like it should happen around the same time as filtering hgg_lesions_df. Maybe move this up, or move lines 121-127 down?

I approved, but realized you didn't implement this change... up to you.

I did something I think is better in lines 150-155.

Probably more clear/helpful in the diff for b55fa9d

Looks good! For whatever reason I missed that? I don't know. I blame kids.

jashapiro · 2020-09-21T16:29:16Z

analyses/molecular-subtyping-HGG/02-HGG-molecular-subtyping-subset-files.R

+  # pathology_free_text_diagnosis
+  filter(str_detect(pathology_diagnosis, 
+                    paste0(path_dx_list$include_path_dx, collapse = "|")) | 
+           str_detect(pathology_free_text_diagnosis, 


These should be all lower case anyway, but maybe make it robust?

Suggested change

-           str_detect(pathology_free_text_diagnosis,
+           str_detect(str_to_lower(pathology_free_text_diagnosis),
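
As a quick illustration of why lower-casing before detection makes the filter more robust, the sketch below compares case-sensitive and case-insensitive detection in base R. `tolower()` and `grepl()` stand in for `str_to_lower()` and `str_detect()`, and the free-text values and terms are made up.

```r
# Made-up free-text values with inconsistent capitalization.
free_text <- c("Glioblastoma", "GLIOBLASTOMA multiforme", "ependymoma")

# Hypothetical lower-case terms collapsed into one pattern with "|".
pattern <- paste0(c("glioblastoma", "anaplastic astrocytoma"), collapse = "|")

# Case-sensitive detection misses both capitalized variants.
case_sensitive <- grepl(pattern, free_text)

# Lower-casing the field first catches them.
lowered <- grepl(pattern, tolower(free_text))

# ignore.case = TRUE gives the same result without modifying the field.
ignore_case <- grepl(pattern, free_text, ignore.case = TRUE)

print(rbind(case_sensitive, lowered, ignore_case))
```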

analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd


jaclyn-taroni · 2020-09-21T20:58:49Z

I don't believe pathology_diagnosis uses a controlled vocabulary, so I don't think exact matches are necessarily the way to go. With that in mind, I believe I've addressed all of your comments, @jashapiro. I've rerun the module with those changes and the set of included samples does not change. Re-requesting your review!

jashapiro

LGTM!

jaclyn-taroni · 2020-09-23T02:11:24Z

@jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.

jharenza · 2020-09-23T21:13:30Z

> @jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.

Just double-confirming this. They are no longer a tumor, but a growth pattern. However, they are associated with tumors that could have mutations, so I just want to double-check that we don't need to subtype.

jharenza · 2020-09-24T20:09:09Z

> > @jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.
>
> Just double-confirming this. They are no longer a tumor, but a growth pattern. However, they are associated with tumors that could have mutations, so I just want to double-check that we don't need to subtype.

@jaclyn-taroni Cassie says we should subtype these. So, we can add Gliomatosis Cerebri back to the pathology_diagnosis terms. Sorry about that!

jaclyn-taroni · 2020-09-25T19:12:40Z

With f833ac7, the Gliomatosis Cerebri samples are included. We don't see any additions or deletions of samples from the final subtyping table, just a few small ordering changes.

jaclyn-taroni added 14 commits September 19, 2020 19:14

Remove disease label reclassification step

b3eefa4

Add a notebook that preps strings for inclusion/exclusion

8f3e832

Add a column for TRUE/FALSE presence of defining lesion

e20e3db

Moar identifiers

d445b5b

Add a couple more path dx strings

e26dad0

Update subsetting to use the path dx fields

91ba044

Add intermediate metadata file for included specimens

e8480cd

Use intermediate metadata file for CNV cleaning

c808a51

Remove defining_lesion from cleaned mutation table

ec24122

glioma_brain_region -> CNS_region

6881692

Run the whole module with modifications

a61f921

Uncomment HGG subtyping step in CI

886ffa6

Finish sentence

37dbc24

Update documentation to reflect changes

2640fda

jaclyn-taroni marked this pull request as ready for review September 20, 2020 17:02

jaclyn-taroni requested review from jashapiro and jharenza September 20, 2020 17:04

jaclyn-taroni changed the title ~~[WIP] Update molecular-subtyping-HGG to use pathology diagnosis fields~~ Update molecular-subtyping-HGG to use pathology diagnosis fields Sep 20, 2020

This was referenced Sep 20, 2020

Updated analysis: molecular subtyping for cell lines #509

Closed

Updated analysis: MB subtyping to use pathology diagnosis #756

Closed

jashapiro reviewed Sep 21, 2020


Apply suggestions from code review

b80a69a

Co-authored-by: jashapiro <josh.shapiro@ccdatalab.org>

jaclyn-taroni mentioned this pull request Sep 21, 2020

MB subtyping update to pathology_diagnosis #787

Merged

5 tasks

jaclyn-taroni added 3 commits September 21, 2020 16:35

Docs and terms changes from code review

1ce3ed8

All lowercase for matching path dx

b55fa9d

Rerun the entire module

f65c803

jaclyn-taroni requested a review from jashapiro September 21, 2020 20:58

jashapiro approved these changes Sep 21, 2020


cbethell mentioned this pull request Sep 22, 2020

Update molecular subtyping embryonal module to use pathology diagnosis fields #788

Merged

5 tasks

jaclyn-taroni mentioned this pull request Sep 22, 2020

Updated analysis: HGG subtyping to use pathology diagnosis #754

Closed

jaclyn-taroni added 2 commits September 22, 2020 21:44

Exact path dx matches for CBTTC samples

4ceefd0

Rerun entire module

f4b35e0

jaclyn-taroni added 2 commits September 22, 2020 22:29

Update docs to reflect exact match changes

1b66ce0

Merge branch 'master' into jaclyn-taroni/hgg-path-dx

37a07d2

jaclyn-taroni added 2 commits September 25, 2020 14:51

Merge branch 'master' into jaclyn-taroni/hgg-path-dx

bee18c7

Add back in Gliomatosis Cerebri; rerun

f833ac7

jharenza approved these changes Sep 25, 2020


jaclyn-taroni merged commit 4626dbc into AlexsLemonade:master Sep 27, 2020

jaclyn-taroni deleted the jaclyn-taroni/hgg-path-dx branch September 27, 2020 19:35
