PR 1 of n: Molecular Subtyping - HGG (Defining Lesions) #352

cbethell · 2019-12-18T17:07:27Z

Purpose/implementation Section

To molecularly subtype HGG samples.

What scientific question is your analysis addressing?

What are the samples in the OpenPBTA dataset that fit into the HGG molecular subtypes?

What was your approach?

I joined the data relevant to molecularly subtyping HGG samples, including the metadata, RNA expression, SNV, and CN data.

I began by following the plan in the comment here.

What GitHub issue does your pull request address?

This PR addresses issue #249.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Does this analysis appear to be correct?
Do the variables in the final data.frame seem sufficient to molecularly subtype the HGG samples?
Is there any obvious refactoring needed?

Is there anything that you want to discuss further?

Note: A heatmap displaying the data in this PR is upcoming.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this PR is ready for review.

Results

What types of results are included (e.g., table, figure)?

A tsv file with the data in the final data.frame of the R notebook in this PR can be found in the results directory of this module at results/HGG_molecular_subtypes.tsv.

The table can also be viewed on the html output here.

What is your summary of the results?

I have not yet developed a summary of the current results beyond the final data.frame.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

PR Checklist

Run a linter
Set the seed (NA)
Comments and/or documentation up to date
Double check your paths
Spell check any Rmd file or md file
Restart R and run all notebooks fresh and save

- add this analysis to `.circleci`

- rerun notebook

…into hgg-molecular-subtyping-data-prep

jaclyn-taroni · 2019-12-19T19:23:01Z

@cbethell my read of #249 is that we want a column that tells us about the presence or absence of the following specific mutations:

H3F3A K28M, H3F3A G35R/V or HIST1H3B K28M

Where the first step is to check all samples for these and then the subsequent steps should include all samples already classified as HGG + any that would be reclassified on the basis of the presence of these mutations.

We also want to check for the presence or absence of

IDH1 R132H

and looks like IDH1 R172, too.

EDIT: Check BRAF V600E as well. It should only be present in LGG.

…m/cbethell/OpenPBTA-analysis into hgg-molecular-subtyping-data-prep

- rerun notebook

jaclyn-taroni

Hi @cbethell,

I'm going to outline my interpretation of what #249 is asking for and you, @jharenza, and I can go back and forth as needed.

I would split this up into (at least) two stages: 1) check every sample for the defining lesions outlined in #249 and 2) wrangle the other information for all the HGG samples identified via disease_type_new, plus any samples that were not yet classified as HGG but should be based on a defining lesion.

So, the first notebook looks at the defining lesions and essentially produces the following table where the mutation columns are a binary outcome:

| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | H3F3A K28M | HIST1H3B K28M | H3F3A G35R | H3F3A G35V |
|---|---|---|---|---|---|---|

This notebook should note any inconsistencies, e.g., samples that would need to be reclassified.
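As a rough illustration of the table described above, here is a minimal Python sketch (the notebook itself is in R, so this shows only the logic). It assumes the SNV data are available as records with MAF-style fields `Tumor_Sample_Barcode`, `Hugo_Symbol`, and `HGVSp_Short`. Those field names, the `p.K28M`-style notation, and the example biospecimen IDs are assumptions, not taken from this PR.

```python
# Defining lesions from the table header above, keyed by output column name.
# The (gene, protein change) encoding is an assumption about the SNV file.
DEFINING_LESIONS = {
    "H3F3A K28M": ("H3F3A", "p.K28M"),
    "HIST1H3B K28M": ("HIST1H3B", "p.K28M"),
    "H3F3A G35R": ("H3F3A", "p.G35R"),
    "H3F3A G35V": ("H3F3A", "p.G35V"),
}


def defining_lesion_table(snv_records, biospecimen_ids):
    """Return one row per biospecimen with a True/False column per lesion."""
    # Collect every (sample, gene, protein change) triple seen in the SNV data.
    observed = {
        (r["Tumor_Sample_Barcode"], r["Hugo_Symbol"], r["HGVSp_Short"])
        for r in snv_records
    }
    table = []
    for bs_id in biospecimen_ids:
        row = {"Kids_First_Biospecimen_ID": bs_id}
        for column, (gene, change) in DEFINING_LESIONS.items():
            row[column] = (bs_id, gene, change) in observed
        table.append(row)
    return table


# Hypothetical example records; real IDs and calls come from the project files.
snv_records = [
    {"Tumor_Sample_Barcode": "BS_A", "Hugo_Symbol": "H3F3A", "HGVSp_Short": "p.K28M"},
    {"Tumor_Sample_Barcode": "BS_B", "Hugo_Symbol": "TP53", "HGVSp_Short": "p.R175H"},
]
lesion_table = defining_lesion_table(snv_records, ["BS_A", "BS_B", "BS_C"])
```

Iterating over every biospecimen ID, rather than only those present in the SNV records, keeps samples with no calls in the table as explicit absences.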

I think what would come next is a script that subsets the HGG files, much like the approach you took with the ATRT subset files, that is not run in CI. The subset files should contain samples already labeled as HGG and those that were "picked up" because of the presence of a defining lesion.
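The selection logic for the subset files could be sketched roughly as follows, in Python for illustration only. It assumes the metadata has a `disease_type_new` field (mentioned above) and that a lesion table with one True/False column per defining lesion already exists. The label string passed as `hgg_label` and all example IDs are hypothetical.

```python
LESION_COLUMNS = ["H3F3A K28M", "HIST1H3B K28M", "H3F3A G35R", "H3F3A G35V"]


def hgg_subset_ids(metadata, lesion_table, hgg_label):
    """Return (ids to include in the subset files, ids that need reclassifying)."""
    # Samples already labeled as HGG in the metadata.
    labeled = {
        m["Kids_First_Biospecimen_ID"]
        for m in metadata
        if m["disease_type_new"] == hgg_label
    }
    # Samples carrying at least one defining lesion, whatever their label.
    with_lesion = {
        row["Kids_First_Biospecimen_ID"]
        for row in lesion_table
        if any(row[c] for c in LESION_COLUMNS)
    }
    # "Picked up" samples: lesion present but not currently labeled HGG.
    reclassify = with_lesion - labeled
    return sorted(labeled | with_lesion), sorted(reclassify)


# Hypothetical inputs; the label value is a placeholder, not the real one.
metadata = [
    {"Kids_First_Biospecimen_ID": "BS_A", "disease_type_new": "HGG"},
    {"Kids_First_Biospecimen_ID": "BS_B", "disease_type_new": "LGG"},
    {"Kids_First_Biospecimen_ID": "BS_C", "disease_type_new": "LGG"},
]
lesions = [
    {"Kids_First_Biospecimen_ID": "BS_A", "H3F3A K28M": False,
     "HIST1H3B K28M": False, "H3F3A G35R": False, "H3F3A G35V": False},
    {"Kids_First_Biospecimen_ID": "BS_B", "H3F3A K28M": True,
     "HIST1H3B K28M": False, "H3F3A G35R": False, "H3F3A G35V": False},
    {"Kids_First_Biospecimen_ID": "BS_C", "H3F3A K28M": False,
     "HIST1H3B K28M": False, "H3F3A G35R": False, "H3F3A G35V": False},
]
subset_ids, reclassify_ids = hgg_subset_ids(metadata, lesions, hgg_label="HGG")
```

Returning the reclassification candidates separately makes it easy to report them as inconsistencies in the first notebook.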

You would then use those subset files to address part 2, where I think the final table will look like:

| Kids_First_Participant_ID | sample_id | Kids_First_Biospecimen_ID | age at diagnosis (days) | glioma brain region | H3F3A K28M | HIST1H3B K28M | H3F3A G35R/V | ACVR1 mutated | TP53 mutated | ATRX mutated | PDGFRA copy status | PTEN copy status | FGFR1 mutated or fused | SETD2 mutated | NTRK fused | FOXG1 z-score | OLIG2 z-score | chr7 status | chr10 status | IDH1 R132 | MYCN copy status | TERT mutated | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Where the * mutated and * fused columns are binary outcomes. At first, I would limit the files that you look at for an individual gene to those that are explicitly mentioned in #249, e.g., only look at NTRK in the fusion file. The rationale is that this is going to be a lot of information to digest and we can always go back and look at additional data types if something is ambiguous or if it is requested. As for the chr7 and chr10 status, I think you can use the broad_values_by_arm.txt file from GISTIC (related: #344 (comment)).
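For the FOXG1 and OLIG2 z-score columns, one possible calculation is sketched below in Python, for illustration only. It assumes the z-score is taken per gene across whatever set of samples is chosen. Which samples to include, and whether to log-transform expression first, are open choices not settled in this thread. The sample IDs and values are made up.

```python
import statistics


def expression_z_scores(expression_by_sample):
    """Z-score one gene's expression values across samples.

    expression_by_sample maps a biospecimen ID to that gene's expression value.
    Uses the sample standard deviation (n - 1 denominator).
    """
    values = list(expression_by_sample.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {
        bs_id: (value - mean) / sd
        for bs_id, value in expression_by_sample.items()
    }


# Hypothetical expression values for a single gene.
foxg1_expression = {"BS_A": 1.0, "BS_B": 2.0, "BS_C": 3.0}
foxg1_z = expression_z_scores(foxg1_expression)
```

With these toy values the mean is 2.0 and the sample standard deviation is 1.0, so the z-scores come out as -1.0, 0.0, and 1.0.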

You might want to play with transposing this table. Alternatively, you may want to split this up into multiple tables, one for each of the named subtypes, which I think might accomplish something similar to your Cooccuring_lesions column. We'll have to figure out how the information is best presented.

jaclyn-taroni · 2019-12-30T18:00:08Z