PR 1 of n: Molecular Subtyping - HGG (Defining Lesions) #352
- add this analysis to `.circleci`
- rerun notebook
- rerun notebook
…into hgg-molecular-subtyping-data-prep
@cbethell my read of #249 is that we want a column that tells us about the presence or absence of the following specific mutations:
Where the first step is to check all samples for these and then the subsequent steps should include all samples already classified as HGG + any that would be reclassified on the basis of the presence of these mutations. We also want to check for the presence or absence of
and looks like IDH1 R172, too. EDIT: Check BRAF V600E as well. It should only be present in LGG.
…m/cbethell/OpenPBTA-analysis into hgg-molecular-subtyping-data-prep
- rerun notebook
Hi @cbethell,
I'm going to outline my interpretation of what #249 is asking for and you, @jharenza, and I can go back and forth as needed.
I would split this up into (at least) two stages: 1) check every sample for the defining lesions outlined in #249 and 2) wrangle the other information for all the HGG samples (identified via `disease_type_new`) plus any samples that were not yet classified as HGG but should be, based on the defining lesion.
So, the first notebook looks at the defining lesions and essentially produces the following table where the mutation columns are a binary outcome:
| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | H3F3A K28M | HIST1H3B K28M | H3F3A G35R | H3F3A G35V |
|---|---|---|---|---|---|---|
This notebook should note any inconsistencies, e.g., samples that would need to be reclassified.
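A minimal sketch of how such a binary table could be built, assuming the SNV calls follow the MAF convention (`Tumor_Sample_Barcode`, `Hugo_Symbol`, `HGVSp_Short`). The toy data, the helper `has_lesion()`, and the `"Yes"`/`"No"` coding are illustrative assumptions, not the notebook's actual code:

```r
library(dplyr)

# Toy stand-in for the consensus SNV calls (MAF-style column names are assumed).
snv_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Hugo_Symbol = c("H3F3A", "HIST1H3B", "H3F3A", "TP53"),
  HGVSp_Short = c("p.K28M", "p.K28M", "p.G35R", "p.R175H"),
  stringsAsFactors = FALSE
)

# Every sample in the cohort, including ones with no call of interest.
all_samples <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "Yes" if the sample carries the given protein change.
has_lesion <- function(gene, change) {
  hits <- snv_df$Tumor_Sample_Barcode[
    snv_df$Hugo_Symbol == gene & snv_df$HGVSp_Short == change
  ]
  ifelse(all_samples$Kids_First_Biospecimen_ID %in% hits, "Yes", "No")
}

lesions_df <- all_samples %>%
  mutate(
    H3F3A.K28M = has_lesion("H3F3A", "p.K28M"),
    HIST1H3B.K28M = has_lesion("HIST1H3B", "p.K28M"),
    H3F3A.G35R = has_lesion("H3F3A", "p.G35R"),
    H3F3A.G35V = has_lesion("H3F3A", "p.G35V")
  )
```

Starting from the full sample list rather than from the mutation calls keeps samples without any defining lesion in the table as all-`"No"` rows.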
I think what would come next is a script that subsets the HGG files, much like the approach you took with the ATRT subset files, that is not run in CI. The subset files should contain samples already labeled as HGG and those that were "picked up" because of the presence of a defining lesion.
You would then use those subset files to address part 2, where I think the final table will look like:
| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | age at diagnosis (days) | glioma brain region | H3F3A K28M | HIST1H3B K28M | H3F3A G35R/V | ACVR1 mutated | TP53 mutated | ATRX mutated | PDGFRA copy status | PTEN copy status | FGFR1 mutated or fused | SETD2 mutated | NTRK fused | FOXG1 z-score | OLIG2 z-score | chr7 status | chr10 status | IDH1 R132 | MYCN copy status | TERT mutated | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Where the `* mutated` and `* fused` columns are binary outcomes. At first, I would limit the files that you look at for an individual gene to those that are explicitly mentioned in #249, e.g., only look at NTRK in the fusion file. The rationale is that this is going to be a lot of information to digest and we can always go back and look at additional data types if something is ambiguous or if it is requested. As for the chr7 and chr10 status, I think you can use the `broad_values_by_arm.txt` file from GISTIC (related: #344 (comment)).
You might want to play with transposing this table. Alternatively, you may want to split this up into multiple tables, one for each of the named subtypes, which I think might accomplish something similar to your `Cooccuring_lesions` column. We'll have to figure out how the information is best presented.
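As a rough sketch of the chr7 and chr10 step, the following turns an arm-level matrix into per-sample gain/loss labels. It assumes the table has one row per arm (a `Chromosome Arm` column) and one column per sample; the toy values, the function `arm_status()`, and the zero threshold are all assumptions for illustration:

```r
# Toy arm-level matrix: one row per arm, one column per sample (layout assumed).
broad_df <- data.frame(
  `Chromosome Arm` = c("7p", "7q", "10p", "10q"),
  BS_1 = c(0.8, 0.9, -0.7, -0.6),
  BS_2 = c(0, 0, 0, 0),
  check.names = FALSE
)

# Hypothetical rule: positive values are gains, negative values are losses.
arm_status <- function(x, threshold = 0) {
  ifelse(x > threshold, "gain", ifelse(x < -threshold, "loss", "neutral"))
}

values <- as.matrix(broad_df[, -1])
rownames(values) <- broad_df[["Chromosome Arm"]]

# Transpose so that samples are rows and arms are columns, then label each cell.
status <- arm_status(t(values))
```

Here `status["BS_1", ]` holds the labels for all four arms of one sample, which could then be joined to the main table by biospecimen ID.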
"HIST1H3B",
"ACVR1",
"ATRX",
"PDGFRA",
I interpret `PDGFRA amplification; PTEN loss` to mean copy number changes, not mutations.
)

# Read in consensus mutation data
tmb_df <-
This is not reading in the tumor mutation burden data, which is what I would expect based on `tmb_df`.
"IDH1",
"BRAF")

H3_G35 <- c("H3F3A", "SETD2", "NTRK", "IDH1", "ATRX", "DAXX")
Similarly, I would look in the fusions file for the NTRK information, not the mutations file.
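A sketch of what deriving a binary `NTRK fused` column from a fusion table might look like. The column names `Sample` and `FusionName`, the `--` separator between partner genes, and the toy rows are assumptions, not the layout of the actual fusion file:

```r
library(dplyr)

# Toy fusion calls; column names and the "--" separator are assumed.
fusion_df <- data.frame(
  Sample = c("BS_1", "BS_2", "BS_3"),
  FusionName = c("ETV6--NTRK3", "KIAA1549--BRAF", "TPM3--NTRK1"),
  stringsAsFactors = FALSE
)

all_samples <- c("BS_1", "BS_2", "BS_3", "BS_4")

# Samples where either fusion partner is an NTRK family gene.
ntrk_samples <- fusion_df %>%
  filter(grepl("(^|--)NTRK[0-9]", FusionName)) %>%
  pull(Sample) %>%
  unique()

ntrk_df <- data.frame(
  Kids_First_Biospecimen_ID = all_samples,
  NTRK_fused = ifelse(all_samples %in% ntrk_samples, "Yes", "No"),
  stringsAsFactors = FALSE
)
```

Matching on the fusion name rather than on an exact gene symbol picks up NTRK1, NTRK2, and NTRK3, which a literal `"NTRK"` entry in a gene list would not.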
) %>%
dplyr::group_by(sample_id) %>%
dplyr::mutate(
OLIG2_expression = paste(sort(unique(OLIG2)), collapse = ", "),
Why is this step necessary?
The purpose of this step was to ensure that there was only one row per `sample_id`; however, this step was taken out of this particular PR and will be revisited in an upcoming PR (I believe there were duplicate rows for a reason other than the expression values).
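For reference, a grouped `mutate()` on its own keeps every input row, so collapsing to one row per `sample_id` needs `summarise()` (or a later `distinct()`). A minimal sketch with toy data, where the values and the name `collapsed_df` are illustrative only:

```r
library(dplyr)

# Toy expression values with a duplicated sample_id.
expr_df <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  OLIG2 = c(1.2, -0.4, 0.3),
  stringsAsFactors = FALSE
)

# summarise() returns one row per group, unlike a grouped mutate().
collapsed_df <- expr_df %>%
  group_by(sample_id) %>%
  summarise(
    OLIG2_expression = paste(sort(unique(OLIG2)), collapse = ", "),
    .groups = "drop"
  )
```

Note that pasting values together turns the column into text, so if the duplicates have another cause, resolving them upstream would keep the expression values numeric.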
```{r warning = FALSE}
# Filter manta SV data for the target lesions and join this data.frame with
# the selected variables of the metadata
sv_df_filtered <- sv_df %>%
Based on what is in #249, I would expect that "Co-deletion of chr 1p and 19q (LOH, loss of heterozygosity of both) results in translocation t(1p;19q)" is what would be extracted from the Manta file. But I see that these are not necessarily required (#249 (comment)). I would add that you are using files that have already been annotated (`controlfreec_annotated_cn_autosomes.tsv.gz`), so I don't know that the SV files are more straightforward to use than the CNV files at this point. Because we expect to have consensus copy number files (#128) that will probably get used for all the subtyping, I would recommend sticking with the files from `focal-cn-file-preparation`. It's not clear to me that we will use AnnotSV, which I believe is what adds the gene name to the Manta output, on the consensus file.
…into hgg-molecular-subtyping-data-prep
- remove `results/HGG_molecular_subtypes.tsv`
- new output file `results/HGG_defining_lesions.tsv` contains binary columns for all samples distinguishing whether or not they contain any of the four HGG defining lesions
- rename `01` nb to better represent its purpose/content
- rename object `tmb_df` to `snv_df`
Per @jaclyn-taroni's suggestions in this comment, this PR is now a notebook that looks only at the HGG defining lesions across all samples. The upcoming PR will be a script that subsets HGG samples, and a third PR will then incorporate all other relevant data (e.g., fusion, CN, RNA expression data).
hgg_samples <- snv_lesions_df %>%
dplyr::filter(
disease_type_reclassified == "High-grade glioma" &
disease_type_new != "High-grade glioma;astrocytoma (WHO grade III/IV)"
With the v12 data I think this should be
disease_type_new != "High-grade glioma;astrocytoma (WHO grade III/IV)" | |
disease_type_new != "High-grade glioma" |
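If the intent is to keep only samples whose original label is neither of the two HGG labels, one hedged way to write the filter is with `%in%`. Note that two `!=` comparisons joined with `|` would be TRUE for every row, so they would need `&`, or `%in%` as in this sketch. The toy rows and the names `hgg_labels` and `reclassified_df` are assumptions:

```r
library(dplyr)

# Both labels that already denote HGG in the metadata (per the comment above).
hgg_labels <- c(
  "High-grade glioma;astrocytoma (WHO grade III/IV)",
  "High-grade glioma"
)

# Toy data: S2 is the only sample newly picked up by a defining lesion.
snv_lesions_df <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  disease_type_new = c(
    "High-grade glioma",
    "Ependymoma",
    "High-grade glioma;astrocytoma (WHO grade III/IV)",
    "Ganglioglioma"
  ),
  disease_type_reclassified = c(
    "High-grade glioma", "High-grade glioma",
    "High-grade glioma", "Ganglioglioma"
  ),
  stringsAsFactors = FALSE
)

# Reclassified as HGG, but not labeled HGG originally.
reclassified_df <- snv_lesions_df %>%
  filter(
    disease_type_reclassified == "High-grade glioma",
    !(disease_type_new %in% hgg_labels)
  )
```

Keeping the labels in one vector also means a future data release with another HGG label needs a change in a single place.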
@cbethell this looks great so far, and looks like you found an ependymoma and ganglioglioma that should be reclassified. I am requesting one change, based on the bolded subtypes I had described in #249. This is the detail we will want in the molecular_subtype column. Other than that, I think this is good to go as a first step.
dplyr::mutate(
disease_type_reclassified = dplyr::case_when(
H3F3A.K28M == "Yes" |
HIST1H3B.K28M == "Yes" |
H3F3A.G35R == "Yes" |
H3F3A.G35V == "Yes" ~ "High-grade glioma",
TRUE ~ as.character(disease_type_new)
)
)
Can you add an additional column here for the `molecular_subtype` - something like:
- `HGG, H3 K28 mutant` or `High-grade glioma, H3 K28 mutant`
- `HGG, H3 G35 mutant` or `High-grade glioma, H3 G35 mutant`
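A sketch of what such a column could look like, reusing the `"Yes"`/`"No"` lesion columns from the snippet under review. The toy rows, the `HGG, ...` spelling of the labels, and the choice to check K28 before G35 are assumptions, not the change made in the PR:

```r
library(dplyr)

# Toy lesion table using the column names from the diff above.
lesions_df <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  H3F3A.K28M = c("Yes", "No", "No"),
  HIST1H3B.K28M = c("No", "No", "No"),
  H3F3A.G35R = c("No", "Yes", "No"),
  H3F3A.G35V = c("No", "No", "No"),
  stringsAsFactors = FALSE
)

subtyped_df <- lesions_df %>%
  mutate(
    molecular_subtype = case_when(
      # case_when() uses the first matching condition, so K28 wins over G35 here.
      H3F3A.K28M == "Yes" | HIST1H3B.K28M == "Yes" ~ "HGG, H3 K28 mutant",
      H3F3A.G35R == "Yes" | H3F3A.G35V == "Yes" ~ "HGG, H3 G35 mutant",
      TRUE ~ NA_character_
    )
  )
```

Samples without a defining lesion get `NA` in this sketch, leaving room for later steps to assign other subtypes.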
I've made that update in a10cded. Note that this table is not the final table from this module, but an interim product #352 (review).
👍 LGTM, let's subset some files 💪
Comments have been addressed a10cded
Purpose/implementation Section
To molecularly subtype HGG samples.
What scientific question is your analysis addressing?
What are the samples in the OpenPBTA dataset that fit into the HGG molecular subtypes?
What was your approach?
I joined together the data that is relevant to molecular subtyping HGG samples, including the metadata, RNA expression, SNV, and CN data.
I began by following the plan in the comment here.
What GitHub issue does your pull request address?
This PR addresses issue #249.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is there anything that you want to discuss further?
Note: A heatmap displaying the data in this PR is upcoming.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes, this PR is ready for review.
Results
What types of results are included (e.g., table, figure)?
A `tsv` file with the data in the final data.frame of the R notebook in this PR can be found in the `results` directory of this module at `results/HGG_molecular_subtypes.tsv`. The table can also be viewed on the html output here.
What is your summary of the results?
I have not yet developed a summary of the current results beyond the final data.frame.
Reproducibility Checklist
PR Checklist