Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

Merged
merged 25 commits into from
Feb 10, 2021

Conversation

kgaonkar6
Copy link
Collaborator

@kgaonkar6 kgaonkar6 commented Dec 24, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

In this PR we will add TP53 alteration status as discussed in #837:

TP53 altered - loss, if :

  • A sample contains a TP53 hotspot SNV mutation. (Cancer hotspot database and downloadable file available). Please also crosscheck that all mutations from this table are included.
  • A sample contains two TP53 alterations, suggesting (but not confirming) that both alleles are affected (SNV+SNV, CNV+SNV).
  • A sample contains one alteration (SNV or CNV) + has cancer_predispositions == "Li-Fraumeni syndrome", suggesting there is a germline variant in addition to the somatic variant we observe.
  • A sample does not have a TP53 alterations, but has cancer_predispositions == "Li-Fraumeni syndrome" and TP53 classifier score for matched RNA-Seq > 0.5 (or higher cutoff we decide upon later).

TP53 altered - activated, if:

  • A sample contains one of the two TP53 activating mutations R273C and R248W. Reference and reference.

Note this annotation is also applicable to #807 HGG TP53 annotation

What was your approach?

Steps:

  • gather consensus SNV for TP53 removing Silent and Intron
  • gather filtered CNV losses
  • gather values for cancer_predispositions and tp53 classififer scores for TP53
  • For annotation purposes I used counts of SNV and CNVs as values since we wanted to check for presence/absence and multi-hit TP53 variants. If AA position overlaps statistically significant SNV/Indels hotspots is assigned "Yes", if AA position overlaps gain-of-function sites according to literature activating column is assigned "Yes".

What GitHub issue does your pull request address?

#837

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

While matching by sample_id in these annotations should cell-lines be included in this annotation?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

Is there anything that you want to discuss further?

I added a couple plots to visualize the distributions of scores, do we want to discuss those further in this PR?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

yes

Results

What types of results are included (e.g., table, figure)?

table , notebook figures

What is your summary of the results?

In OpenPBTA have 26 sample_ids with putative gain-of-function mutations, 96 with multiple loss of function mutations/CNVs and 980 as unknown/no evidence for tp53 altered status.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@kgaonkar6 kgaonkar6 changed the title Integrate tp53 alt Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) Jan 4, 2021
@jharenza
Copy link
Collaborator

jharenza commented Jan 4, 2021

Thanks for this @kgaonkar6 - looks like a great start!

I have a few comments:

While matching by sample_id in these annotations should cell-lines be included in this annotation?

yes, please include

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

Yes, it looks like I forgot that category above.

I added a couple plots to visualize the distributions of scores, do we want to discuss those further in this PR?

Yes, these are great!

It is interesting that the samples with activating mutations also have "high" scores. Based on the description of the classifier predicting inactivation, I would have thought that they would have been low scores. I wonder if this really means the high scores are more suggestive of "oncogenic" TP53? In your plot "Explore distribution of tp53_altered status vs tp53 inactivation scores
", will you plot RNA expression of these same samples and groups? I wonder if those with activating mutations will be high expression outliers? cc @cgreene and @gwaygenomics

Re: "Check if other cancer predisposition have high TP53 inactivation scores" (image below)

  • Samples with LFS and TP53 loss have very high scores (looks like ~0.75+) and there are two samples with no evidence for TP53 loss, which also have low scores. We should first check those samples to see if we can find TP53 alterations which were presumably the basis of the LFS diagnosis. They may be outside of the hotspot region, or may not have been caught in consensus calls, or just not exist in our data. I think this in itself is a really cool result, either way, because it could possibly suggest a misdiagnosis of LFS or a means of monitoring LFS patients and their TP53 functional status. Presuming, these patients still have non-oncogenic, functional TP53, which is an advantage. We should dig more into this. cc @cgreene and @adamcresnick

Screen Shot 2021-01-04 at 2 18 43 PM

The other plots look as expected - let's see how they look with the updated condition above (if we see more of the "non-altered" samples with high scores being removed from that group).

When you have a chance, will you also update the README with the full links for the hotspots and Chen source information, since these are used as input files? Thanks!

@jharenza jharenza self-requested a review January 4, 2021 19:30
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments!

@kgaonkar6
Copy link
Collaborator Author

kgaonkar6 commented Jan 4, 2021

Thanks for the review @jharenza! With regards to including cell-lines I also need to add the cell-line composition file like we used in molecular-subtyping-HGG since they have the same sample_id, matching by sample_id causes some duplicates in the outputs.

And to explore the expression for activating/loss samples the direct visualization that came to my mind was a PCA but since it seems we are expecting to look at high outliers specifically were you expecting like a boxplot/scatter with specific labels given a gene list instead?

@jharenza
Copy link
Collaborator

jharenza commented Jan 5, 2021

And to explore the expression for activating/loss samples the direct visualization that came to my mind was a PCA but since it seems we are expecting to look at high outliers specifically were you expecting like a boxplot/scatter with specific labels given a gene list instead?

I just was thinking of this plot with TP53 expression on the y-axis:
Screen Shot 2021-01-05 at 1 37 09 PM

@gwaybio
Copy link
Collaborator

gwaybio commented Jan 6, 2021

It is interesting that the samples with activating mutations also have "high" scores. Based on the description of the classifier predicting inactivation, I would have thought that they would have been low scores. I wonder if this really means the high scores are more suggestive of "oncogenic" TP53? In your plot "Explore distribution of tp53_altered status vs tp53 inactivation scores", will you plot RNA expression of these same samples and groups? I wonder if those with activating mutations will be high expression outliers? cc @cgreene and @gwaygenomics

Cool! The classifier as an oncogenic TP53 detector seems like a reasonable hypothesis to me. In the original model, we did not disambiguate activating vs. inactivating mutations during training. One could go back to the original analysis and make the binary classifier multiclass. This would require bit of time annotating the TP53 mutations though, and it can get tricky since some variants are uncertain.

@kgaonkar6
Copy link
Collaborator Author

kgaonkar6 commented Jan 6, 2021

Here's the distribution of TP53 for activated and loss annotated samples:
image

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :
Screen Shot 2021-01-06 at 1 26 27 PM

@jharenza
Copy link
Collaborator

jharenza commented Jan 11, 2021

Here's the distribution of TP53 for activated and loss annotated samples:
image

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :
Screen Shot 2021-01-06 at 1 26 27 PM

This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here and add to the notebook?

I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that.

Did you have a chance to annotate these samples?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

@jharenza jharenza closed this Jan 11, 2021
@jharenza jharenza reopened this Jan 11, 2021
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the distribution of TP53 for activated and loss annotated samples:
image

This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here? I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that.

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :
Screen Shot 2021-01-06 at 1 26 27 PM

I think this could be something we work on later, but not within this paper.

Did you have a chance to annotate these samples?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

@jharenza jharenza self-requested a review January 11, 2021 22:02
kgaonkar6 and others added 2 commits January 11, 2021 17:17
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
@kgaonkar6
Copy link
Collaborator Author

kgaonkar6 commented Jan 22, 2021

good catch! I haven't run the classifier on v18 was just adding the TP53 annotations I'll re-run 01 to get the updated scores.

For the multiple Kids_First_Biospecimen_ID_RNA , I could comma collapse the ids then do we plot the mean value of tp53_score to be plotted? 7316-87 for example has quite different score for each bs id :

sample_id Kids_First_Biospecimen_ID_DNA Kids_First_Biospecimen_ID_RNA cancer_predispositions tp53_score SNV_indel_counts CNV_loss_counts HGVSp_Short copy_number hotspot activating tp53_altered
7316-87 BS_4DY83R02 BS_1HQ76V6D None documented 0.44161220 0 0 NA NA     Other
7316-87 BS_4DY83R02 BS_JGA9BP3A None documented 0.06645831 0 0 NA NA     Other

With 7316-1463 type situation with multiple WGS and multiple RNA per sample_id I think maybe selecting ids from the independent might help?

sample_id Kids_First_Biospecimen_ID_DNA Kids_First_Biospecimen_ID_RNA cancer_predispositions tp53_score SNV_indel_counts CNV_loss_counts HGVSp_Short copy_number hotspot activating tp53_altered
7316-1463 BS_KH3859M5 BS_2HN96129 NF-2,Schwannomatosis 0.1688424 0 0 NA NA     Other
7316-1463 BS_KH3859M5 BS_RR12T74P NF-2,Schwannomatosis 0.1513061 0 0 NA NA     Other
7316-1463 BS_NASADC3P BS_2HN96129 NF-2,Schwannomatosis 0.1688424 0 0 NA NA     Other
7316-1463 BS_NASADC3P BS_RR12T74P NF-2,Schwannomatosis 0.1513061 0 0 NA NA     Other

@jharenza
Copy link
Collaborator

jharenza commented Jan 22, 2021

@kgaonkar6

7316-87

Interesting - these are both initial tumor, but one is polyA and one is stranded, the latter having the higher score.

do we plot the mean value of tp53_score to be plotted

No - I would treat each BS as unique (rather than patient), since in many cases, these may be different regions of tissue or different phases of therapy. For 7316-87, I am not sure what to make of that - we could assess the paired stranded/polyA and see if there is a trend for higher scores in stranded. Perhaps that can be a part of QC, and we might favor the stranded (due to larger N of a cohort) and just drop polyA.

@kgaonkar6 kgaonkar6 mentioned this pull request Jan 22, 2021
5 tasks
@kgaonkar6
Copy link
Collaborator Author

I've now updated the tp53 annotation to v18. There are still duplicated RNA-Seqs in the ouptut either because samples have stranded and polya RNA or there are multiple bs_ids per sample_id (maybe because of multiple locations?).

With regards to the classifier score in the QC script we did plot the stranded and polya samples, the scores don't seem to have a drastically different trends per library.
image

image

I can look into plotting sample_id-matched stranded and polya scores if that would be interesting to look at for selection?

@jharenza
Copy link
Collaborator

jharenza commented Jan 29, 2021

I can look into plotting sample_id-matched stranded and polya scores if that would be interesting to look at for selection?

Oh yes, this is what I meant - if we plot all pairs side by side, or just plot the polyA-scores vs stranded scores, is there a clear difference in score by library (ie - are stranded always higher)?

@kgaonkar6
Copy link
Collaborator Author

kgaonkar6 commented Feb 1, 2021

@jharenza here's a sample_id matched plot:

image

On average it seems that polya samples on average have lower scores compared to sample_id matched stranded bs ids. Since we use a cut off tp53_score >0.5 the RNA_library only affects annotation for 7316-161 and 7316-1455 where stranded samples have score higher than 0.5 but polya samples don't.

@kgaonkar6
Copy link
Collaborator Author

Also there was a comment before to update the format of output so I've added the SNV evidence and CNS copy number as well in columns HGVSp_Short and copy_number respectively. Does ⤵️ look right?

sample_id Kids_First_Biospecimen_ID_DNA Kids_First_Biospecimen_ID_RNA cancer_predispositions tp53_score SNV_indel_counts CNV_loss_counts HGVSp_Short copy_number hotspot activating tp53_altered

@jharenza
Copy link
Collaborator

jharenza commented Feb 1, 2021

Also there was a comment before to update the format of output so I've added the SNV evidence and CNS copy number as well in columns HGVSp_Short and copy_number respectively. Does ⤵️ look right?

sample_id Kids_First_Biospecimen_ID_DNA Kids_First_Biospecimen_ID_RNA cancer_predispositions tp53_score SNV_indel_counts CNV_loss_counts HGVSp_Short copy_number hotspot activating tp53_altered

Yes - just a few questions:

  1. are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?
  2. Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

@kgaonkar6
Copy link
Collaborator Author

  1. are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?
  2. Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

CNV_loss_counts is the number of copy number losses found for tp53 in the sample and copy_number is the actually value of copy loss for tp53. So it is redundant information but I guess I added copy_number as the evidence column for CNV_loss_counts . For example BS_ZV21J6YW is the only sample that has copy_number ==0 overlapping P53_tetramer, P53, P53_TAD domains of TP53 which would be lost with just binary column? Maybe I can change the name from copy_number to CNV_loss_evidence?

    SNV_indel_counts= length(unique(HGVSp_Short[!is.na(HGVSp_Short)])), 
    CNV_loss_counts = length(unique(copy_number[!is.na(copy_number)])),
    HGVSp_Short = toString(unique(HGVSp_Short)),
    copy_number = toString(unique(copy_number))

@jharenza
Copy link
Collaborator

jharenza commented Feb 2, 2021

  1. are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?
  2. Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

CNV_loss_counts is the number of copy number losses found for tp53 in the sample and copy_number is the actually value of copy loss for tp53. So it is redundant information but I guess I added copy_number as the evidence column for CNV_loss_counts . For example BS_ZV21J6YW is the only sample that has copy_number ==0 overlapping P53_tetramer, P53, P53_TAD domains of TP53 which would be lost with just binary column? Maybe I can change the name from copy_number to CNV_loss_evidence?

    SNV_indel_counts= length(unique(HGVSp_Short[!is.na(HGVSp_Short)])), 
    CNV_loss_counts = length(unique(copy_number[!is.na(copy_number)])),
    HGVSp_Short = toString(unique(HGVSp_Short)),
    copy_number = toString(unique(copy_number))

Ok, sure, sounds good.

@kgaonkar6
Copy link
Collaborator Author

We have these multiple RNA matched with multiple DNA I don't think there are unique values in columns to match, I used "sample_id",
"cell_line_composition",
"cancer_predispositions",
"sample_type",
"Kids_First_Participant_ID"

multi_rows_RNA.txt

@jharenza
Copy link
Collaborator

jharenza commented Feb 3, 2021

We have these multiple RNA matched with multiple DNA I don't think there are unique values in columns to match, I used "sample_id",
"cell_line_composition",
"cancer_predispositions",
"sample_type",
"Kids_First_Participant_ID"

multi_rows_RNA.txt

I think having these in the output file should be OK, however, when we go to do any plots for ROC or other, we should make sure we are using unique RNA biospecimens.

@kgaonkar6
Copy link
Collaborator Author

@jharenza thanks! I've also updated the readme with descriptions for each column now.

@jaclyn-taroni jaclyn-taroni self-requested a review February 6, 2021 21:34
@kgaonkar6 kgaonkar6 requested a review from jharenza February 8, 2021 14:21
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ready!

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I did not run this myself but the logic seems to do what is described in the comments.

@jaclyn-taroni jaclyn-taroni merged commit e301f69 into AlexsLemonade:master Feb 10, 2021
@kgaonkar6 kgaonkar6 deleted the integrate_tp53_alt branch May 11, 2021 13:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants