-
Notifications
You must be signed in to change notification settings - Fork 83
Update molecular-subtyping-HGG to use pathology diagnosis fields #786
Update molecular-subtyping-HGG to use pathology diagnosis fields #786
Conversation
I'm going to request reviews from @jharenza and @jashapiro, to comment on the appropriateness of the pathology diagnoses used for inclusion/exclusion, the methodology for how those terms got chosen, and the implementation of the filtering. Apologies for the length of the initial comment, but there was a lot of decision making and required downstream changes that required context in my opinion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach, defining inclusion and exclusion terms for the initial list, seems right to me. I had some questions about the way those lists were developed, and whether the ones that use pathology_diagnosis
should be exact matches. I tend to think they should, despite the fact that it might be more brittle. My reasoning is that it will be more transparent if we have the full list of included terms for this field, rather than some that are partial matches to potentially more than one of the defined terms.
On the other hand, if we are trying to avoid brittleness, then the comparisons should be case-insensitive, which they are not at the moment (I made this suggestion to be sure the free text was robust, but not for the defined vocab).
Otherwise, this looks like a good model of the approach for other similar modules. One concern is how the JSON file might be updated in the future, but since the 00
notebook is not part of the script, I am not too concerned about changes being made and accidentally undone.
|
||
### Directories and files | ||
|
||
We're going to tie this to a specific release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why this makes sense to do, but it also seems like something that could easily be missed in updates (I see that it is discussed in the readme, but still worry). I'm wondering if this could be put in the RMD params
for easier future updates?
Maybe this is not needed, as this is meant to be a run-once notebook, but how are we planning to handle updates to the JSON if needed? If the vocabulary were to change, would we update that manually and deprecate this notebook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we will be in a situation where we want to look at the contents of the histologies files on the basis of short_histology == "HGAT"
again, once short_histology
fully depends on molecular_subtype
(planned for next release). I can definitely see a situation where we would update the JSON (that's the rationale for including it rather than hardcoding in 02
!) but I don't think the process for getting those terms would be the same.
If the vocabulary were to change, would we update that manually and deprecate this notebook?
All that to say - I think you're correct here.
analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
Outdated
Show resolved
Hide resolved
analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
Outdated
Show resolved
Hide resolved
|
||
## Pathology diagnosis strings for inclusion | ||
|
||
These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you say more about why some of the terms are excluded? For example: High Grade Glial Neoplasma
. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?
Also, more generally, why not use exact matching here? If this is a defined vocabulary, why not use it with the fully defined values and %in%
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a defined vocabulary as far as I can tell (this is my assumption based on the presence of both Brainstem glioma- Diffuse intrinsic pontine glioma
and Infiltrating Dipg
which may be incorrect), which is why I hesitate to use exact matching. But you are right that it could be more robust if made case insensitive. I'll go that route unless @jharenza replies and tells us that the values for pathology_diagnosis
are a controlled vocabulary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you say more about why some of the terms are excluded? For example:
High Grade Glial Neoplasma
. I see that PNET samples were reclassified later, but are there PNET samples that were not? If we include them here are we getting false positives?
Regarding the PNET samples, there's nothing in the strings used for matching that is designed to capture these PNET. The PNET samples that were reclassified should be captured by the defining lesions step in 02
. My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.
You're right that High Grade Glial Neoplasma
should probably be included. I will make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My (non-expert) understanding is that the tumors designated as PNET are more appropriately subtyped in the non-ATRT/non-MB embryonal module. I can add text at line 72 to make excluding PNETs more clear.
Just catching up on this, but yes, @jaclyn-taroni you are correct here.
These are the terms that we'll collapse together with `|` to detect strings in the `pathology_diagnosis` column. | ||
|
||
```{r} | ||
path_dx_terms <- c( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not matching exactly, do we want to use all lower case here (or during matching)?
analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd
Outdated
Show resolved
Hide resolved
filter(str_detect(pathology_diagnosis, | ||
paste0(path_dx_list$include_path_dx, collapse = "|")) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tagging here to note that discussion of full text vs. subseqs applies here and could result in changes.
# Now samples on the basis of the defining lesions | ||
lesions_df <- tumor_metadata_df %>% | ||
filter(sample_id %in% hgg_lesions_df$sample_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, this seems like it should happen around the same time as filtering hgg_lesions_df
. Maybe move this up, or move lines 121-127 down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approved, but realized you didn't implement this change... up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did something I think is better in lines 150-155.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more clear/helpful in the diff for b55fa9d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! For whatever reason I missed that? I don't know. I blame kids.
# pathology_free_text_diagnosis | ||
filter(str_detect(pathology_diagnosis, | ||
paste0(path_dx_list$include_path_dx, collapse = "|")) | | ||
str_detect(pathology_free_text_diagnosis, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be all lower case anyway, but maybe make it robust?
str_detect(pathology_free_text_diagnosis, | |
str_detect(str_to_lower(pathology_free_text_diagnosis), |
analyses/molecular-subtyping-HGG/00-HGG-select-pathology-dx.Rmd
Outdated
Show resolved
Hide resolved
Co-authored-by: jashapiro <josh.shapiro@ccdatalab.org>
With the fact that I don't believe we're using a controlled vocabulary for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@jharenza I made the changes you laid out in #754 (comment). Overall, we are dropping two samples that currently have |
Just double-confirming this. They are no longer a tumor, but a growth pattern, however, they are associated with tumors that could have mutations, so I just want to double confirm that we don't need to subtype. |
@jaclyn-taroni Cassie says we should subtype these. So, we can add |
With f833ac7, the |
Purpose/implementation Section
Previously, samples were included for subtyping in the
molecular-subtyping-HGG
module if they met at least one of the criteria below:short_histology
value wasHGAT
.In an upcoming release,
short_histology
will rely on the output of the molecular subtyping modules (#748) and therefore be downstream of the molecular subtyping module.We need to use
pathology_diagnosis
andpathology_free_text_diagnosis
fields to replace the filtering on the basis ofshort_histology
prior to that change being made.What was your approach?
Identifying terms in
pathology_diagnosis
andpathology_free_text_diagnosis
The issue that tracks this change (#754) doesn't yet have specific information about values of
pathology_diagnosis
andpathology_free_text_diagnosis
that should be used to include samples.For what I think is a reasonable first pass, I filtered the
release-v17-20200908
pbta-histologies.tsv
to rows where theshort_histology
value isHGAT
and examined what samples that passed filtering had in thepathology_diagnosis
andpathology_free_text_diagnosis
.I then created a list of strings to be used for filtering the
pathology_diagnosis
andpathology_free_text_diagnosis
in the downstream inclusion/exclusion steps and saved that information inanalyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json
.This takes place in the
00-HGG-select-pathology-dx
notebook that has the current data release,release-v17-20200908
, hardcoded and is not run via the shell script for the module. It is designed to capture this moment inpbta-histologies.tsv
and not be updated.High-level summary: For the most part, terms in the pathology diagnosis fields are what I, a non-expert, would expect on the basis of the 2016 WHO classification (Louis et al. Acta Neuropathol. doi: 10.1007/s00401-016-1545-1). There are a few unexpected
pathology_diagnosis
values when filtering byshort_histology == "HGAT"
(e.g., PNET) but these appear to be due to changes inshort_histology
introduced into the histologies file due to earlier subtyping efforts (#609 (comment)). There are several terms under Diffuse astrocytic and oligodendroglial tumor, e.g.,anaplastic astrocytoma
, that when used for detection in the pathology free text diagnosis field return LGG samples (based on the pathology diagnosis).Inclusion/exclusion of samples
The methodology/logic for including samples for subtyping in this module are:
analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json
inpathology_diagnosis
orpathology_free_text_diagnosis
pathology_diagnosis
field.Other changes
There are a few other updates to this module that were necessary or a good idea while addressing #754 that I'll summarize below.
01-HGG-molecular-subtyping-defining-lesions.Rmd
) where we handle the defining lesions. The reclassification step meant things like molecular subtyping: BS_N6N147BY #783 cropped up and I'm fairly sure that was only in there in the first place to accommodate the old way of incorporating subtyping information into the histologies file. @jharenza mentioned leaving those fields blank in molecular subtyping: BS_N6N147BY #783 (comment); I think leaving this in would be both unnecessary and confusing. Instead, I add a logical column (defining_lesion
) inanalyses/molecular-subtyping-HGG/results/HGG_defining_lesions.tsv
that indicates the presence of one of the defining lesions described above that's used for including samples downstream. The change to04-HGG-molecular-subtyping-mutation.Rmd
included here is to drop that column from the cleaned mutation table.02
) also now outputs the subset of thepbta-histologies.tsv
file for biospecimens that are included for subtyping (analyses/molecular-subtyping-HGG/hgg-subset/hgg_metadata.tsv
). I thought this would be useful for inspection while we're getting the details hammered out, but this also prevents us from having to repeat the logic for inclusion/exclusion in03-HGG-molecular-subtyping-cnv.Rmd
. You can see these changes in this PR as well.07-HGG-molecular-subtyping-combine-table.Rmd
to accommodateglioma_brain_region
being renamed asCNS_region
inWhat GitHub issue does your pull request address?
Closes #754 - update HGG subtyping module to use pathology diagnosis information
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
analyses/molecular-subtyping-HGG/hgg-subset/hgg_subtyping_path_dx_strings.json
contains the:include_path_dx
: the strings that are being used to detect samples that should be included for subtyping, filtering onpathology_diagnosis
include_free_text
: the strings that are being used to detect samples that should be included for subtyping, filtering onpathology_free_text_diagnosis
exclude_path_dx
: the strings that are being used to detect samples that should be excluded for subtyping, filtering onpathology_diagnosis
. (This exclusion step happens after the inclusion steps above.)As a reminder, the rationale for the choice of those strings is captured in the
00-HGG-select-pathology-dx
notebook (HTML preview). You can see the implementation of the filtering in02-HGG-molecular-subtyping-subset-files.R
.See questions below.
Is there anything that you want to discuss further?
pathology_diagnosis
andpathology_free_text_diagnosis
are in essence frozen as ofrelease-v17-20200908
. Is this an appropriate assumption? If not, we need to have a larger discussion about how and why we're making the molecular subtyping changes to use the pathology diagnosis values.Please also take a look at the documentation in the module README to see if it's sufficient.
Results
What types of results are included (e.g., table, figure)?
Cleaned molecular data in
results
and ultimately the table with the subtyping results:analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv
.What is your summary of the results?
Below I'm summarizing the net result of the changes to this module (e.g., samples now being included).
BS_N6N147BY
is now included in RNA-seq files (analyses/molecular-subtyping-HGG/results/HGG_cleaned_expression.stranded.tsv
[changes z-score values slightly] andanalyses/molecular-subtyping-HGG/results/HGG_cleaned_fusion.tsv
). This sample is the subject of molecular subtyping: BS_N6N147BY #783. It was reclassified due to efforts inmoelcular-subtyping-embryonal
and therefore theshort_histology
has beenETMR
in the last couple of releases. That's why it's not in the files onmaster
. The pathology diagnosis information all points to HGG, though, so the inclusion of this biospecimen is appropriate.PT_RGX23JFP
and7316-2901
participant ID, sample id pair is now included in the subset files (biospecimen IDs:BS_ZZWMD6FA
andBS_XW26N96W
). Thepathology_diagnosis
indicatesGanglioglioma
, but thepathology_free_text_diagnosis
isganglioglioma and high-grade glioma
. Not sure what the right call is here in terms of inclusion or not, but, if the general methodology is deemed appropriate in this module, it may be appropriate to sort that out downstream inmolecular-subtyping-pathology
. Notably, there's another sample fromPT_RGX23JFP
(7316-156
) wherepathology_free_text_diagnosis
isganglioglioma
. Is this indicative of a problem upstream or expected?Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.