Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

kgaonkar6 · 2020-12-24T19:00:44Z

Purpose/implementation Section

What scientific question is your analysis addressing?

In this PR we will add TP53 alteration status as discussed in #837:

TP53 altered - loss, if :

A sample contains a TP53 hotspot SNV mutation. (Cancer hotspot database and downloadable file available). Please also crosscheck that all mutations from this table are included.
A sample contains two TP53 alterations, suggesting (but not confirming) that both alleles are affected (SNV+SNV, CNV+SNV).
A sample contains one alteration (SNV or CNV) + has cancer_predispositions == "Li-Fraumeni syndrome", suggesting there is a germline variant in addition to the somatic variant we observe.
A sample does not have a TP53 alterations, but has cancer_predispositions == "Li-Fraumeni syndrome" and TP53 classifier score for matched RNA-Seq > 0.5 (or higher cutoff we decide upon later).

TP53 altered - activated, if:

A sample contains one of the two TP53 activating mutations R273C and R248W. Reference and reference.

Note this annotation is also applicable to #807 HGG TP53 annotation

What was your approach?

Steps:

gather consensus SNV for TP53 removing Silent and Intron
gather filtered CNV losses
gather values for cancer_predispositions and tp53 classififer scores for TP53
For annotation purposes I used counts of SNV and CNVs as values since we wanted to check for presence/absence and multi-hit TP53 variants. If AA position overlaps statistically significant SNV/Indels hotspots is assigned "Yes", if AA position overlaps gain-of-function sites according to literature activating column is assigned "Yes".

What GitHub issue does your pull request address?

#837

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

While matching by sample_id in these annotations should cell-lines be included in this annotation?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

Is there anything that you want to discuss further?

I added a couple plots to visualize the distributions of scores, do we want to discuss those further in this PR?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

yes

Results

What types of results are included (e.g., table, figure)?

table , notebook figures

What is your summary of the results?

In OpenPBTA have 26 sample_ids with putative gain-of-function mutations, 96 with multiple loss of function mutations/CNVs and 980 as unknown/no evidence for tp53 altered status.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

jharenza · 2021-01-04T19:28:00Z

Thanks for this @kgaonkar6 - looks like a great start!

I have a few comments:

While matching by sample_id in these annotations should cell-lines be included in this annotation?

yes, please include

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

Yes, it looks like I forgot that category above.

I added a couple plots to visualize the distributions of scores, do we want to discuss those further in this PR?

Yes, these are great!

It is interesting that the samples with activating mutations also have "high" scores. Based on the description of the classifier predicting inactivation, I would have thought that they would have been low scores. I wonder if this really means the high scores are more suggestive of "oncogenic" TP53? In your plot "Explore distribution of tp53_altered status vs tp53 inactivation scores
", will you plot RNA expression of these same samples and groups? I wonder if those with activating mutations will be high expression outliers? cc @cgreene and @gwaygenomics

Re: "Check if other cancer predisposition have high TP53 inactivation scores" (image below)

Samples with LFS and TP53 loss have very high scores (looks like ~0.75+) and there are two samples with no evidence for TP53 loss, which also have low scores. We should first check those samples to see if we can find TP53 alterations which were presumably the basis of the LFS diagnosis. They may be outside of the hotspot region, or may not have been caught in consensus calls, or just not exist in our data. I think this in itself is a really cool result, either way, because it could possibly suggest a misdiagnosis of LFS or a means of monitoring LFS patients and their TP53 functional status. Presuming, these patients still have non-oncogenic, functional TP53, which is an advantage. We should dig more into this. cc @cgreene and @adamcresnick

The other plots look as expected - let's see how they look with the updated condition above (if we see more of the "non-altered" samples with high scores being removed from that group).

When you have a chance, will you also update the README with the full links for the hotspots and Chen source information, since these are used as input files? Thanks!

jharenza

See comments!

kgaonkar6 · 2021-01-04T20:53:03Z

Thanks for the review @jharenza! With regards to including cell-lines I also need to add the cell-line composition file like we used in molecular-subtyping-HGG since they have the same sample_id, matching by sample_id causes some duplicates in the outputs.

And to explore the expression for activating/loss samples the direct visualization that came to my mind was a PCA but since it seems we are expecting to look at high outliers specifically were you expecting like a boxplot/scatter with specific labels given a gene list instead?

jharenza · 2021-01-05T18:37:59Z

And to explore the expression for activating/loss samples the direct visualization that came to my mind was a PCA but since it seems we are expecting to look at high outliers specifically were you expecting like a boxplot/scatter with specific labels given a gene list instead?

I just was thinking of this plot with TP53 expression on the y-axis:

gwaybio · 2021-01-06T18:24:26Z

It is interesting that the samples with activating mutations also have "high" scores. Based on the description of the classifier predicting inactivation, I would have thought that they would have been low scores. I wonder if this really means the high scores are more suggestive of "oncogenic" TP53? In your plot "Explore distribution of tp53_altered status vs tp53 inactivation scores", will you plot RNA expression of these same samples and groups? I wonder if those with activating mutations will be high expression outliers? cc @cgreene and @gwaygenomics

Cool! The classifier as an oncogenic TP53 detector seems like a reasonable hypothesis to me. In the original model, we did not disambiguate activating vs. inactivating mutations during training. One could go back to the original analysis and make the binary classifier multiclass. This would require bit of time annotating the TP53 mutations though, and it can get tricky since some variants are uncertain.

kgaonkar6 · 2021-01-06T18:28:05Z

Here's the distribution of TP53 for activated and loss annotated samples:

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :

jharenza · 2021-01-11T21:08:02Z

Here's the distribution of TP53 for activated and loss annotated samples:

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :

This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here and add to the notebook?

I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that.

Did you have a chance to annotate these samples?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

jharenza

Here's the distribution of TP53 for activated and loss annotated samples:

This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here? I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that.

In addition, I did some comparison of genes which have non zero coefficient [tp53_classifier_coefficients.tsv] (https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
which was used while applying the classifier plus TP53 in stranded samples . This might give some insight in how the expression of these genes looks like for these subset of tp53 activated/loss status? For example these genes seem to have significant differences when tested with wilcox test so it seems some genes might help with the differentiating different classes of TP53 status :

I think this could be something we work on later, but not within this paper.

Did you have a chance to annotate these samples?

I think 1 condition to be discussed for annotation would be CNV/SNV loss + >0.5 TP53 score , should this also included as tp53_altered=="loss"?

analyses/tp53_nf1_score/04-tp53-altered-annotation.Rmd

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

…nPBTA-analysis into integrate_tp53_alt

kgaonkar6 · 2021-01-22T20:50:51Z

good catch! I haven't run the classifier on v18 was just adding the TP53 annotations I'll re-run 01 to get the updated scores.

For the multiple Kids_First_Biospecimen_ID_RNA , I could comma collapse the ids then do we plot the mean value of tp53_score to be plotted? 7316-87 for example has quite different score for each bs id :

sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	SNV_indel_counts	CNV_loss_counts	HGVSp_Short	copy_number	hotspot	activating	tp53_altered
7316-87	BS_4DY83R02	BS_1HQ76V6D	None documented	0.44161220	0	0	NA	NA			Other
7316-87	BS_4DY83R02	BS_JGA9BP3A	None documented	0.06645831	0	0	NA	NA			Other

With 7316-1463 type situation with multiple WGS and multiple RNA per sample_id I think maybe selecting ids from the independent might help?

sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	HGVSp_Short	copy_number	tp53_altered
7316-1463	BS_KH3859M5	BS_2HN96129	NF-2,Schwannomatosis	0.1688424	NA	NA	Other
7316-1463	BS_KH3859M5	BS_RR12T74P	NF-2,Schwannomatosis	0.1513061	NA	NA	Other
7316-1463	BS_NASADC3P	BS_2HN96129	NF-2,Schwannomatosis	0.1688424	NA	NA	Other
7316-1463	BS_NASADC3P	BS_RR12T74P	NF-2,Schwannomatosis	0.1513061	NA	NA	Other

jharenza · 2021-01-22T20:53:46Z

@kgaonkar6

7316-87

Interesting - these are both initial tumor, but one is polyA and one is stranded, the latter having the higher score.

do we plot the mean value of tp53_score to be plotted

No - I would treat each BS as unique (rather than patient), since in many cases, these may be different regions of tissue or different phases of therapy. For 7316-87, I am not sure what to make of that - we could assess the paired stranded/polyA and see if there is a trend for higher scores in stranded. Perhaps that can be a part of QC, and we might favor the stranded (due to larger N of a cohort) and just drop polyA.

kgaonkar6 · 2021-01-26T15:57:07Z

I've now updated the tp53 annotation to v18. There are still duplicated RNA-Seqs in the ouptut either because samples have stranded and polya RNA or there are multiple bs_ids per sample_id (maybe because of multiple locations?).

With regards to the classifier score in the QC script we did plot the stranded and polya samples, the scores don't seem to have a drastically different trends per library.

I can look into plotting sample_id-matched stranded and polya scores if that would be interesting to look at for selection?

jharenza · 2021-01-29T00:31:01Z

I can look into plotting sample_id-matched stranded and polya scores if that would be interesting to look at for selection?

Oh yes, this is what I meant - if we plot all pairs side by side, or just plot the polyA-scores vs stranded scores, is there a clear difference in score by library (ie - are stranded always higher)?

kgaonkar6 · 2021-02-01T16:20:11Z

@jharenza here's a sample_id matched plot:

On average it seems that polya samples on average have lower scores compared to sample_id matched stranded bs ids. Since we use a cut off tp53_score >0.5 the RNA_library only affects annotation for 7316-161 and 7316-1455 where stranded samples have score higher than 0.5 but polya samples don't.

kgaonkar6 · 2021-02-01T16:43:07Z

Also there was a comment before to update the format of output so I've added the SNV evidence and CNS copy number as well in columns HGVSp_Short and copy_number respectively. Does ⤵️ look right?

sample_id	Kids_First_Biospecimen_ID_DNA	Kids_First_Biospecimen_ID_RNA	cancer_predispositions	tp53_score	SNV_indel_counts	CNV_loss_counts	HGVSp_Short	copy_number	hotspot	activating	tp53_altered

jharenza · 2021-02-01T21:12:30Z

Also there was a comment before to update the format of output so I've added the SNV evidence and CNS copy number as well in columns HGVSp_Short and copy_number respectively. Does ⤵️ look right?

sample_id Kids_First_Biospecimen_ID_DNA Kids_First_Biospecimen_ID_RNA cancer_predispositions tp53_score SNV_indel_counts CNV_loss_counts HGVSp_Short copy_number hotspot activating tp53_altered

Yes - just a few questions:

are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?
Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

kgaonkar6 · 2021-02-02T15:12:41Z

are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?

Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

CNV_loss_counts is the number of copy number losses found for tp53 in the sample and copy_number is the actually value of copy loss for tp53. So it is redundant information but I guess I added copy_number as the evidence column for CNV_loss_counts . For example BS_ZV21J6YW is the only sample that has copy_number ==0 overlapping P53_tetramer, P53, P53_TAD domains of TP53 which would be lost with just binary column? Maybe I can change the name from copy_number to CNV_loss_evidence?

    SNV_indel_counts= length(unique(HGVSp_Short[!is.na(HGVSp_Short)])), 
    CNV_loss_counts = length(unique(copy_number[!is.na(copy_number)])),
    HGVSp_Short = toString(unique(HGVSp_Short)),
    copy_number = toString(unique(copy_number))

jharenza · 2021-02-02T20:10:42Z

are CNV_loss_counts and copy_number sort of redundant information? Should we just keep one of those columns?

Should we make all binary columns binary for consistency (eg hotspot and activating as well?)

CNV_loss_counts is the number of copy number losses found for tp53 in the sample and copy_number is the actually value of copy loss for tp53. So it is redundant information but I guess I added copy_number as the evidence column for CNV_loss_counts . For example BS_ZV21J6YW is the only sample that has copy_number ==0 overlapping P53_tetramer, P53, P53_TAD domains of TP53 which would be lost with just binary column? Maybe I can change the name from copy_number to CNV_loss_evidence?
    SNV_indel_counts= length(unique(HGVSp_Short[!is.na(HGVSp_Short)])), 
    CNV_loss_counts = length(unique(copy_number[!is.na(copy_number)])),
    HGVSp_Short = toString(unique(HGVSp_Short)),
    copy_number = toString(unique(copy_number))

Ok, sure, sounds good.

kgaonkar6 · 2021-02-02T21:55:55Z

We have these multiple RNA matched with multiple DNA I don't think there are unique values in columns to match, I used "sample_id",
"cell_line_composition",
"cancer_predispositions",
"sample_type",
"Kids_First_Participant_ID"

multi_rows_RNA.txt

jharenza · 2021-02-03T18:32:44Z

We have these multiple RNA matched with multiple DNA I don't think there are unique values in columns to match, I used "sample_id",
"cell_line_composition",
"cancer_predispositions",
"sample_type",
"Kids_First_Participant_ID"

multi_rows_RNA.txt

I think having these in the output file should be OK, however, when we go to do any plots for ROC or other, we should make sure we are using unique RNA biospecimens.

kgaonkar6 · 2021-02-05T21:56:07Z

@jharenza thanks! I've also updated the readme with descriptions for each column now.

jharenza

looks ready!

jaclyn-taroni

👍 I did not run this myself but the logic seems to do what is described in the comments.

kgaonkar6 added 3 commits December 23, 2020 17:22

tp53 alt integrate

a2d8c1e

adding tp53 altered status

cd4d443

update README

d89858b

kgaonkar6 changed the title ~~Integrate tp53 alt~~ Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) Jan 4, 2021

jharenza self-requested a review January 4, 2021 19:30

jharenza suggested changes Jan 4, 2021

View reviewed changes

match with cell line composition

7bcce0e

kgaonkar6 added the review after release label Jan 5, 2021

adding plots for activated/loss samples

9480803

jharenza closed this Jan 11, 2021

jharenza reopened this Jan 11, 2021

jharenza reviewed Jan 11, 2021

View reviewed changes

jharenza self-requested a review January 11, 2021 22:02

jharenza reviewed Jan 11, 2021

View reviewed changes

analyses/tp53_nf1_score/04-tp53-altered-annotation.Rmd Outdated Show resolved Hide resolved

analyses/tp53_nf1_score/04-tp53-altered-annotation.Rmd Outdated Show resolved Hide resolved

kgaonkar6 and others added 2 commits January 11, 2021 17:17

Update analyses/tp53_nf1_score/04-tp53-altered-annotation.Rmd

d66ad7e

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

Update analyses/tp53_nf1_score/04-tp53-altered-annotation.Rmd

a7abbb3

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

jaclyn-taroni removed the review after release label Jan 12, 2021

kgaonkar6 and others added 6 commits January 19, 2021 17:32

add CNV/SNV+ >0.5 score as loss

b833720

Merge branch 'integrate_tp53_alt' of https://github.com/kgaonkar6/Ope…

4659f10

…nPBTA-analysis into integrate_tp53_alt

re-run with rremote changes

9b8092e

Merge branch 'master' into integrate_tp53_alt

63a01da

add tp53 code edits to remove coef

1a73d3b

Merge branch 'integrate_tp53_alt' of https://github.com/kgaonkar6/Ope…

9c8f8dd

…nPBTA-analysis into integrate_tp53_alt

jaclyn-taroni requested a review from jharenza January 20, 2021 18:53

kgaonkar6 mentioned this pull request Jan 22, 2021

re-run tp53 classifier with v18 #922

Merged

5 tasks

kgaonkar6 and others added 2 commits January 26, 2021 09:53

Merge branch 'master' into integrate_tp53_alt

f40b134

updated with v18

29748e9

Merge branch 'master' into integrate_tp53_alt

8239a11

This was referenced Feb 1, 2021

Updated analysis: investigate putative functionality of TP53 SNV and CNV calls #837

Closed

#837 updated: RNA QC compare stranded vs polya matched sample_ids #930

Merged

kgaonkar6 and others added 3 commits February 2, 2021 16:12

Merge branch 'master' into integrate_tp53_alt

689787b

all binary

94f61d3

save changes

d72a4f3

kgaonkar6 added 2 commits February 5, 2021 16:54

Update README.md

04b9bd6

Update README.md

0fbbb75

jaclyn-taroni self-requested a review February 6, 2021 21:34

kgaonkar6 requested a review from jharenza February 8, 2021 14:21

jharenza approved these changes Feb 9, 2021

View reviewed changes

Merge branch 'master' into integrate_tp53_alt

7fbb3b0

jaclyn-taroni approved these changes Feb 10, 2021

View reviewed changes

jaclyn-taroni merged commit e301f69 into AlexsLemonade:master Feb 10, 2021

kgaonkar6 deleted the integrate_tp53_alt branch May 11, 2021 13:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

kgaonkar6 commented Dec 24, 2020 •

edited

Loading

jharenza commented Jan 4, 2021 •

edited

Loading

jharenza left a comment

kgaonkar6 commented Jan 4, 2021 •

edited

Loading

jharenza commented Jan 5, 2021

gwaybio commented Jan 6, 2021

kgaonkar6 commented Jan 6, 2021 •

edited

Loading

jharenza commented Jan 11, 2021 •

edited

Loading

jharenza left a comment •

edited

Loading

kgaonkar6 commented Jan 22, 2021 •

edited

Loading

jharenza commented Jan 22, 2021 •

edited

Loading

kgaonkar6 commented Jan 26, 2021

jharenza commented Jan 29, 2021 •

edited

Loading

kgaonkar6 commented Feb 1, 2021 •

edited

Loading

kgaonkar6 commented Feb 1, 2021

jharenza commented Feb 1, 2021

kgaonkar6 commented Feb 2, 2021

jharenza commented Feb 2, 2021

kgaonkar6 commented Feb 2, 2021

jharenza commented Feb 3, 2021

kgaonkar6 commented Feb 5, 2021

jharenza left a comment

jaclyn-taroni left a comment

Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886

Conversation

kgaonkar6 commented Dec 24, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

jharenza commented Jan 4, 2021 • edited Loading

jharenza left a comment

Choose a reason for hiding this comment

kgaonkar6 commented Jan 4, 2021 • edited Loading

jharenza commented Jan 5, 2021

gwaybio commented Jan 6, 2021

kgaonkar6 commented Jan 6, 2021 • edited Loading

jharenza commented Jan 11, 2021 • edited Loading

jharenza left a comment • edited Loading

Choose a reason for hiding this comment

kgaonkar6 commented Jan 22, 2021 • edited Loading

jharenza commented Jan 22, 2021 • edited Loading

kgaonkar6 commented Jan 26, 2021

jharenza commented Jan 29, 2021 • edited Loading

kgaonkar6 commented Feb 1, 2021 • edited Loading

kgaonkar6 commented Feb 1, 2021

jharenza commented Feb 1, 2021

kgaonkar6 commented Feb 2, 2021

jharenza commented Feb 2, 2021

kgaonkar6 commented Feb 2, 2021

jharenza commented Feb 3, 2021

kgaonkar6 commented Feb 5, 2021

jharenza left a comment

Choose a reason for hiding this comment

jaclyn-taroni left a comment

Choose a reason for hiding this comment

kgaonkar6 commented Dec 24, 2020 •

edited

Loading

jharenza commented Jan 4, 2021 •

edited

Loading

kgaonkar6 commented Jan 4, 2021 •

edited

Loading

kgaonkar6 commented Jan 6, 2021 •

edited

Loading

jharenza commented Jan 11, 2021 •

edited

Loading

jharenza left a comment •

edited

Loading

kgaonkar6 commented Jan 22, 2021 •

edited

Loading

jharenza commented Jan 22, 2021 •

edited

Loading

jharenza commented Jan 29, 2021 •

edited

Loading

kgaonkar6 commented Feb 1, 2021 •

edited

Loading