This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Update molecular-subtyping-HGG to use pathology diagnosis fields #786

Merged

@jaclyn-taroni jaclyn-taroni commented Sep 20, 2020

Purpose/implementation Section

Previously, samples were included for subtyping in the molecular-subtyping-HGG module if they met at least one of the criteria below:

  1. A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V)
  2. The short_histology value was HGAT.

In an upcoming release, short_histology will rely on the output of the molecular subtyping modules (#748) and therefore be downstream of the molecular subtyping module.

We need to use pathology_diagnosis and pathology_free_text_diagnosis fields to replace the filtering on the basis of short_histology prior to that change being made.

What was your approach?

Identifying terms in pathology_diagnosis and pathology_free_text_diagnosis

The issue that tracks this change (#754) doesn't yet have specific information about values of pathology_diagnosis and pathology_free_text_diagnosis that should be used to include samples.
For what I think is a reasonable first pass, I filtered the release-v17-20200908 pbta-histologies.tsv to rows where the short_histology value is HGAT and examined the pathology_diagnosis and pathology_free_text_diagnosis values of the samples that passed filtering.

I then created a list of strings to be used for filtering the pathology_diagnosis and pathology_free_text_diagnosis in the downstream inclusion/exclusion steps and saved that information in analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json.

This takes place in the 00-HGG-select-pathology-dx notebook that has the current data release, release-v17-20200908, hardcoded and is not run via the shell script for the module. It is designed to capture this moment in pbta-histologies.tsv and not be updated.

High-level summary: For the most part, terms in the pathology diagnosis fields are what I, a non-expert, would expect on the basis of the 2016 WHO classification (Louis et al. Acta Neuropathol. doi: 10.1007/s00401-016-1545-1). There are a few unexpected pathology_diagnosis values when filtering by short_histology == "HGAT" (e.g., PNET) but these appear to be due to changes in short_histology introduced into the histologies file due to earlier subtyping efforts (#609 (comment)). There are several terms under Diffuse astrocytic and oligodendroglial tumor, e.g., anaplastic astrocytoma, that when used for detection in the pathology free text diagnosis field return LGG samples (based on the pathology diagnosis).

Inclusion/exclusion of samples

The methodology/logic for including samples for subtyping in this module is:

  1. A defining lesion was identified in the SNV consensus file (H3 K28M or G35R/V) - this is unchanged, but how that's accomplished differs slightly (see the next section).
  2. Filter to samples that have one of the strings in analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json in pathology_diagnosis or pathology_free_text_diagnosis
  3. Once the filtering on the basis of inclusion occurs, exclude LGG samples based on the pathology_diagnosis field.
  4. Combine the biospecimens based on the defining lesion and pathology diagnosis steps and filter out duplicates (biospecimens meet both criteria).
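
The four steps above can be sketched in base R as follows. This is a minimal illustration, not the code in 02-HGG-molecular-subtyping-subset-files.R (which uses dplyr/stringr): the toy data frame, the `Kids_First_Biospecimen_ID` column, the term lists, the `detect_any()` helper, and the lesion sample IDs are all assumptions made up for the example.

```r
# Toy stand-in for pbta-histologies.tsv; every value here is hypothetical.
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C", "BS_D", "BS_E"),
  sample_id = c("S1", "S2", "S3", "S4", "S5"),
  pathology_diagnosis = c(
    "High-grade glioma/astrocytoma (WHO grade III/IV)",
    "Low-grade glioma/astrocytoma (WHO grade I/II)",
    "Ganglioglioma",
    "Medulloblastoma",
    "Ependymoma"
  ),
  pathology_free_text_diagnosis = c(
    "glioblastoma",
    "anaplastic astrocytoma",
    "ganglioglioma and high-grade glioma",
    "medulloblastoma",
    NA
  ),
  stringsAsFactors = FALSE
)

# Hypothetical term lists with the same three names as the JSON file.
path_dx_list <- list(
  include_path_dx = c("high-grade glioma"),
  include_free_text = c("high-grade glioma", "anaplastic astrocytoma"),
  exclude_path_dx = c("low-grade glioma")
)

# Hypothetical helper: collapse terms with "|" and detect them
# case-insensitively. grepl() returns FALSE for NA input.
detect_any <- function(x, terms) {
  grepl(paste0(terms, collapse = "|"), x, ignore.case = TRUE)
}

# Step 1: samples with a defining lesion (IDs made up for the example).
lesion_sample_ids <- c("S1", "S5")
lesions_df <- histologies_df[histologies_df$sample_id %in% lesion_sample_ids, ]

# Step 2: include on either pathology field.
is_included <-
  detect_any(histologies_df$pathology_diagnosis,
             path_dx_list$include_path_dx) |
  detect_any(histologies_df$pathology_free_text_diagnosis,
             path_dx_list$include_free_text)
path_dx_df <- histologies_df[is_included, ]

# Step 3: after inclusion, exclude LGG based on pathology_diagnosis.
is_excluded <- detect_any(path_dx_df$pathology_diagnosis,
                          path_dx_list$exclude_path_dx)
path_dx_df <- path_dx_df[!is_excluded, ]

# Step 4: combine both routes and drop biospecimens that met both criteria.
hgg_metadata_df <- unique(rbind(lesions_df, path_dx_df))
```

In this toy data, S2 is picked up by the free text term but removed again by the LGG exclusion, and S1 meets both criteria but appears only once in the combined table.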

Other changes

There are a few other updates to this module that were necessary or a good idea while addressing #754 that I'll summarize below.

  • Removal of the reclassification of short histology, broad histology, etc. from the first notebook (01-HGG-molecular-subtyping-defining-lesions.Rmd) where we handle the defining lesions. The reclassification step meant things like molecular subtyping: BS_N6N147BY #783 cropped up and I'm fairly sure that was only in there in the first place to accommodate the old way of incorporating subtyping information into the histologies file. @jharenza mentioned leaving those fields blank in molecular subtyping: BS_N6N147BY #783 (comment); I think leaving this in would be both unnecessary and confusing. Instead, I add a logical column (defining_lesion) in analyses/molecular-subtyping-HGG/results/HGG_defining_lesions.tsv that indicates the presence of one of the defining lesions described above that's used for including samples downstream. The change to 04-HGG-molecular-subtyping-mutation.Rmd included here is to drop that column from the cleaned mutation table.
  • The step where we generate the subset files (02) also now outputs the subset of the pbta-histologies.tsv file for biospecimens that are included for subtyping (analyses/molecular-subtyping-HGG/hgg-subset/hgg_metadata.tsv). I thought this would be useful for inspection while we're getting the details hammered out, but this also prevents us from having to repeat the logic for inclusion/exclusion in 03-HGG-molecular-subtyping-cnv.Rmd. You can see these changes in this PR as well.
  • The required changes to 07-HGG-molecular-subtyping-combine-table.Rmd to accommodate glioma_brain_region being renamed as CNS_region in
  • Uncomment this step so it runs in CI.

What GitHub issue does your pull request address?

Closes #754 - update HGG subtyping module to use pathology diagnosis information

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json contains:

  • include_path_dx: the strings that are being used to detect samples that should be included for subtyping, filtering on pathology_diagnosis
  • include_free_text: the strings that are being used to detect samples that should be included for subtyping, filtering on pathology_free_text_diagnosis
  • exclude_path_dx: the strings that are being used to detect samples that should be excluded for subtyping, filtering on pathology_diagnosis. (This exclusion step happens after the inclusion steps above.)

As a reminder, the rationale for the choice of those strings is captured in the 00-HGG-select-pathology-dx notebook (HTML preview). You can see the implementation of the filtering in 02-HGG-molecular-subtyping-subset-files.R.

See questions below.

Is there anything that you want to discuss further?

  • I'm making an assumption that the pathology_diagnosis and pathology_free_text_diagnosis are in essence frozen as of release-v17-20200908. Is this an appropriate assumption? If not, we need to have a larger discussion about how and why we're making the molecular subtyping changes to use the pathology diagnosis values.
  • Are the terms being used to perform the string detection for inclusion and exclusion appropriate?
  • Are there any issues with the string detection methodology that could cause unexpected behavior that I have not handled?
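
On the last question, one possible source of unexpected behavior is regex metacharacters. When terms are collapsed with `|` and used as a pattern, characters such as parentheses are interpreted as regex syntax rather than literal text. The sketch below shows the effect in base R with a made-up term; whether any term in the JSON file actually contains such characters is not something this example assumes, and `escape_regex()` is a hypothetical helper, not part of the module.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that each
# term is matched literally.
escape_regex <- function(terms) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", terms)
}

# Made-up term and values, for illustration only.
term <- "Glioma (WHO grade III/IV)"
values <- c("Glioma (WHO grade III/IV)", "Glioma WHO grade III/IV")

# Unescaped: the parentheses form a regex group, so the literal
# parentheses in the first value are not matched.
unescaped <- grepl(term, values, ignore.case = TRUE)

# Escaped: the parentheses are matched literally.
escaped <- grepl(escape_regex(term), values, ignore.case = TRUE)
```

Here `unescaped` is `FALSE TRUE` while `escaped` is `TRUE FALSE`, so the unescaped pattern both misses the intended value and matches an unintended one.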

Please also take a look at the documentation in the module README to see if it's sufficient.

Results

What types of results are included (e.g., table, figure)?

Cleaned molecular data in results and ultimately the table with the subtyping results: analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv.

What is your summary of the results?

Below I'm summarizing the net result of the changes to this module (e.g., samples now being included).

  • BS_N6N147BY is now included in RNA-seq files (analyses/molecular-subtyping-HGG/results/HGG_cleaned_expression.stranded.tsv [changes z-score values slightly] and analyses/molecular-subtyping-HGG/results/HGG_cleaned_fusion.tsv). This sample is the subject of molecular subtyping: BS_N6N147BY #783. It was reclassified due to efforts in molecular-subtyping-embryonal and therefore the short_histology has been ETMR in the last couple of releases. That's why it's not in the files on master. The pathology diagnosis information all points to HGG, though, so the inclusion of this biospecimen is appropriate.
  • The PT_RGX23JFP and 7316-2901 participant ID, sample id pair is now included in the subset files (biospecimen IDs: BS_ZZWMD6FA and BS_XW26N96W). The pathology_diagnosis indicates Ganglioglioma, but the pathology_free_text_diagnosis is ganglioglioma and high-grade glioma. Not sure what the right call is here in terms of inclusion or not, but, if the general methodology is deemed appropriate in this module, it may be appropriate to sort that out downstream in molecular-subtyping-pathology. Notably, there's another sample from PT_RGX23JFP (7316-156) where pathology_free_text_diagnosis is ganglioglioma. Is this indicative of a problem upstream or expected?
  • For samples that were already included, there was no change in the molecular subtype labels.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jaclyn-taroni jaclyn-taroni marked this pull request as ready for review September 20, 2020 17:02
@jaclyn-taroni
Member Author

I'm going to request reviews from @jharenza and @jashapiro, to comment on the appropriateness of the pathology diagnoses used for inclusion/exclusion, the methodology for how those terms got chosen, and the implementation of the filtering. Apologies for the length of the initial comment, but there was a lot of decision making and required downstream changes that required context in my opinion.

@jaclyn-taroni jaclyn-taroni changed the title from "[WIP] Update molecular-subtyping-HGG to use pathology diagnosis fields" to "Update molecular-subtyping-HGG to use pathology diagnosis fields" Sep 20, 2020
Member

@jashapiro jashapiro left a comment

This approach, defining inclusion and exclusion terms for the initial list, seems right to me. I had some questions about the way those lists were developed, and whether the ones that use pathology_diagnosis should be exact matches. I tend to think they should, despite the fact that it might be more brittle. My reasoning is that it will be more transparent if we have the full list of included terms for this field, rather than some that are partial matches to potentially more than one of the defined terms.

On the other hand, if we are trying to avoid brittleness, then the comparisons should be case-insensitive, which they are not at the moment (I made this suggestion to be sure the free text was robust, but not for the defined vocab).

Otherwise, this looks like a good model of the approach for other similar modules. One concern is how the JSON file might be updated in the future, but since the 00 notebook is not part of the script, I am not too concerned about changes being made and accidentally undone.


### Directories and files

We're going to tie this to a specific release.
Member

I can see why this makes sense to do, but it also seems like something that could easily be missed in updates (I see that it is discussed in the readme, but still worry). I'm wondering if this could be put in the RMD params for easier future updates?

Maybe this is not needed, as this is meant to be a run-once notebook, but how are we planning to handle updates to the JSON if needed? If the vocabulary were to change, would we update that manually and deprecate this notebook?

Member Author

I don't think we will be in a situation where we want to look at the contents of the histologies files on the basis of short_histology == "HGAT" again, once short_histology fully depends on molecular_subtype (planned for next release). I can definitely see a situation where we would update the JSON (that's the rationale for including it rather than hardcoding in 02!) but I don't think the process for getting those terms would be the same.

> If the vocabulary were to change, would we update that manually and deprecate this notebook?

All that to say - I think you're correct here.


## Pathology diagnosis strings for inclusion

These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column.
Member

Can you say more about why some of the terms are excluded? For example: High Grade Glial Neoplasma. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?

Also, more generally, why not use exact matching here? If this is a defined vocabulary, why not use it with the fully defined values and %in%?

Member Author

It's not a defined vocabulary as far as I can tell (this is my assumption based on the presence of both Brainstem glioma- Diffuse intrinsic pontine glioma and Infiltrating Dipg which may be incorrect), which is why I hesitate to use exact matching. But you are right that it could be more robust if made case insensitive. I'll go that route unless @jharenza replies and tells us that the values for pathology_diagnosis are a controlled vocabulary.
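
To make the trade-off in this thread concrete, here is a small base R comparison of exact matching, case-sensitive substring matching, and case-insensitive substring matching. The `pathology_diagnosis` values are the ones mentioned in this PR; the include terms are assumptions chosen for illustration and are not the contents of the JSON file.

```r
# Values mentioned in this PR discussion.
pathology_diagnosis <- c(
  "Brainstem glioma- Diffuse intrinsic pontine glioma",
  "Infiltrating Dipg",
  "Gliomatosis Cerebri"
)

# Hypothetical include terms, for illustration only.
include_terms <- c("diffuse intrinsic pontine glioma", "dipg")
pattern <- paste0(include_terms, collapse = "|")

# Exact matching requires the full value to appear in the term list.
exact_match <- pathology_diagnosis %in% include_terms

# Substring matching is case-sensitive by default.
partial_match <- grepl(pattern, pathology_diagnosis)

# Case-insensitive substring matching catches both DIPG spellings.
partial_match_ci <- grepl(pattern, pathology_diagnosis, ignore.case = TRUE)
```

With these inputs, only the case-insensitive version flags the first two values. Exact matching would need every observed spelling listed in full, which is the brittleness being weighed against transparency above.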

Member Author

@jaclyn-taroni jaclyn-taroni Sep 21, 2020

> Can you say more about why some of the terms are excluded? For example: High Grade Glial Neoplasma. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?

Regarding the PNET samples, there's nothing in the strings used for matching that is designed to capture these PNET samples. The PNET samples that were reclassified should be captured by the defining lesions step in 02. My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.

You're right that High Grade Glial Neoplasma should probably be included. I will make that change.

Collaborator

> My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.

Just catching up on this, but yes, @jaclyn-taroni you are correct here.

These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column.

```{r}
path_dx_terms <- c(
Member

If not matching exactly, do we want to use all lower case here (or during matching)?

Comment on lines 146 to 147
filter(str_detect(pathology_diagnosis,
paste0(path_dx_list$include_path_dx, collapse = "|")) |
Member

Just tagging here to note that discussion of full text vs. subseqs applies here and could result in changes.

Comment on lines 156 to 158
# Now samples on the basis of the defining lesions
lesions_df <- tumor_metadata_df %>%
filter(sample_id %in% hgg_lesions_df$sample_id)
Member

For clarity, this seems like it should happen around the same time as filtering hgg_lesions_df. Maybe move this up, or move lines 121-127 down?

Member

I approved, but realized you didn't implement this change... up to you.

Member Author

I did something I think is better in lines 150-155.

Member Author

Probably more clear/helpful in the diff for b55fa9d

Member

Looks good! For whatever reason I missed that? I don't know. I blame kids.

# pathology_free_text_diagnosis
filter(str_detect(pathology_diagnosis,
paste0(path_dx_list$include_path_dx, collapse = "|")) |
str_detect(pathology_free_text_diagnosis,
Member

These should be all lower case anyway, but maybe make it robust?

Suggested change
- str_detect(pathology_free_text_diagnosis,
+ str_detect(str_to_lower(pathology_free_text_diagnosis),

Co-authored-by: jashapiro <josh.shapiro@ccdatalab.org>
@jaclyn-taroni
Member Author

Given that I don't believe we're using a controlled vocabulary for pathology_diagnosis, and therefore don't think exact matches are necessarily the way to go, I believe that I've addressed all of your comments @jashapiro. With those changes, I've rerun the module and the set of included samples does not change. Re-requesting your review!

Member

@jashapiro jashapiro left a comment

LGTM!

@jaclyn-taroni
Member Author

@jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.

@jharenza
Collaborator

> @jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.

Just double-confirming this. They are no longer a tumor but a growth pattern; however, they are associated with tumors that could have mutations, so I just want to double-confirm that we don't need to subtype.

@jharenza
Collaborator

> @jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have HGAT for short_histology and are in the CBTTC cohort (7316-3817 and 7316-71) because they are classified as Gliomatosis Cerebri in pathology_diagnosis. Based on your comment, that seems correct but wanted to call out that change in the final subtype table specifically.
>
> Just double-confirming this. They are no longer a tumor but a growth pattern; however, they are associated with tumors that could have mutations, so I just want to double-confirm that we don't need to subtype.

@jaclyn-taroni Cassie says we should subtype these. So, we can add Gliomatosis Cerebri back to the pathology_diagnosis terms. Sorry about that!

@jaclyn-taroni
Member Author

With f833ac7, the Gliomatosis Cerebri samples are included. We don't see any additions or deletions of samples from the final subtyping table, just a few small ordering changes.
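
For anyone repeating this kind of check, a comparison that ignores row order can be sketched in base R as below. The biospecimen IDs are placeholders, not values from HGG_molecular_subtype.tsv, and this is only one way to do the comparison, not necessarily how it was done here.

```r
# Placeholder IDs standing in for the ID column of the subtyping table
# before and after a change.
before_ids <- c("BS_1", "BS_2", "BS_3")
after_ids <- c("BS_3", "BS_1", "BS_2")

# Additions and deletions, independent of ordering.
added <- setdiff(after_ids, before_ids)
removed <- setdiff(before_ids, after_ids)

# Same membership, but not the same order.
same_members <- setequal(before_ids, after_ids)
same_order <- identical(before_ids, after_ids)
```

An empty `added` and `removed` with `same_order` being `FALSE` corresponds to the "only ordering changes" outcome described above.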

@jaclyn-taroni jaclyn-taroni merged commit 4626dbc into AlexsLemonade:master Sep 27, 2020
@jaclyn-taroni jaclyn-taroni deleted the jaclyn-taroni/hgg-path-dx branch September 27, 2020 19:35
Development

Successfully merging this pull request may close these issues.

Updated analysis: HGG subtyping to use pathology diagnosis