Integrate tp53 alteration annotation as multi hits (molecular and/or classifier and/or germline predisposition) #886
Thanks for this @kgaonkar6 - looks like a great start! I have a few comments:
yes, please include
Yes, it looks like I forgot that category above.
Yes, these are great! It is interesting that the samples with activating mutations also have "high" scores. Since the classifier is described as predicting inactivation, I would have expected them to have low scores. I wonder if this really means the high scores are more suggestive of "oncogenic" TP53? In your plot "Explore distribution of tp53_altered status vs tp53 inactivation scores", re: "Check if other cancer predisposition have high TP53 inactivation scores" (image below)
The other plots look as expected - let's see how they look with the updated condition above (whether more of the "non-altered" samples with high scores are removed from that group). When you have a chance, will you also update the README with the full links for the hotspots and Chen source information, since these are used as input files? Thanks!
See comments!
Thanks for the review @jharenza! With regard to including cell lines, I also need to add the cell-line composition file like we used in molecular-subtyping-HGG: since cell lines share the same sample_id, matching by sample_id causes some duplicates in the outputs. To explore the expression for activating/loss samples, the direct visualization that came to my mind was a PCA, but since it seems we expect to look at high outliers specifically, were you expecting something like a boxplot/scatter with specific labels given a gene list instead?
Cool! The classifier as an oncogenic TP53 detector seems like a reasonable hypothesis to me. In the original model, we did not disambiguate activating vs. inactivating mutations during training. One could go back to the original analysis and make the binary classifier multiclass. This would require a bit of time annotating the TP53 mutations though, and it can get tricky since some variants are uncertain.
Here's the distribution of TP53 for activated and loss annotated samples: In addition, I did some comparison of genes which have non-zero coefficients in [tp53_classifier_coefficients.tsv](https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv)
This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here and add to the notebook? I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that. Did you have a chance to annotate these samples?
Here's the distribution of TP53 for activated and loss annotated samples:
This looks good! Would you replot as a boxplot with jitter or use the Enhanced Volcano package to add a boxplot in the volcano to show the medians here? I may want to follow up with the samples annotated as GOF, and if so, will submit a new issue for that.
In addition, I did some comparison of the genes which have non-zero coefficients in [tp53_classifier_coefficients.tsv](https://github.com/kgaonkar6/OpenPBTA-analysis/blob/integrate_tp53_alt/analyses/tp53_nf1_score/reference/tp53_classifier_coefficients.tsv),
which was used while applying the classifier, plus TP53, in stranded samples. This might give some insight into how the expression of these genes looks for this subset of tp53 activated/loss samples. For example, these genes seem to have significant differences when tested with a Wilcoxon test, so some genes might help differentiate the classes of TP53 status:
I think this could be something we work on later, but not within this paper.
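As a rough illustration of the per-gene comparison mentioned above, the sketch below runs a two-sample Wilcoxon rank-sum test for each gene between "activated" and "loss" samples and adjusts the p-values. The gene names (`GENE_A`, `GENE_B`), the expression values, and the Benjamini-Hochberg adjustment are all assumptions made for the example; the thread does not say which genes were tested or whether any correction was applied.

```r
# Toy expression table: one row per sample. All values are made up.
expr <- data.frame(
  tp53_altered = rep(c("activated", "loss"), each = 5),
  GENE_A = c(5.1, 5.3, 4.9, 5.0, 5.2, 7.1, 6.8, 7.4, 7.0, 6.9),  # shifted between groups
  GENE_B = c(1, 4, 5, 8, 9, 2, 3, 6, 7, 10),                     # interleaved, no shift
  stringsAsFactors = FALSE
)

genes <- c("GENE_A", "GENE_B")  # hypothetical gene list

# Two-sided Wilcoxon rank-sum test per gene: activated vs loss.
p_values <- sapply(genes, function(g) {
  activated <- expr[[g]][expr$tp53_altered == "activated"]
  loss      <- expr[[g]][expr$tp53_altered == "loss"]
  wilcox.test(activated, loss)$p.value
})

# Multiple-testing adjustment (Benjamini-Hochberg is an assumption here).
p_adjusted <- p.adjust(p_values, method = "BH")

print(data.frame(gene = genes, p_value = p_values, p_adjusted = p_adjusted))
```

With many genes tested at once, the adjusted p-values are probably the safer ones to report.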
Did you have a chance to annotate these samples?
I think one condition to discuss for annotation is CNV/SNV loss plus a TP53 score > 0.5: should this also be included as tp53_altered == "loss"?
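To make the question concrete, here is a minimal sketch of just the condition being proposed: at least one SNV/indel or CNV loss hit, combined with a classifier score above 0.5. It is not the module's annotation logic. The function name `is_loss_candidate` and the placeholder label `"other"` are hypothetical, and the remaining rules for `tp53_altered` are not reproduced here.

```r
# Hypothetical helper: flags samples meeting the proposed condition only.
# "other" is a placeholder label, not a category used by the module.
is_loss_candidate <- function(SNV_indel_counts, CNV_loss_counts, tp53_score,
                              cutoff = 0.5) {
  has_dna_evidence <- (SNV_indel_counts + CNV_loss_counts) >= 1
  # Missing scores are treated as not passing the cutoff.
  has_high_score <- !is.na(tp53_score) & tp53_score > cutoff
  ifelse(has_dna_evidence & has_high_score, "loss", "other")
}

# Toy inputs: one element per sample.
is_loss_candidate(
  SNV_indel_counts = c(1, 0, 1, 0),
  CNV_loss_counts  = c(0, 0, 0, 1),
  tp53_score       = c(0.8, 0.9, 0.3, NA)
)
```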
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
…nPBTA-analysis into integrate_tp53_alt
Good catch! I haven't run the classifier on v18; I was just adding the TP53 annotations. I'll re-run 01 to get the updated scores. For the multiple Kids_First_Biospecimen_ID_RNA, I could comma-collapse the ids, but then do we plot the mean value of tp53_score? 7316-87, for example, has a quite different score for each BS id:
For a 7316-1463-type situation, with multiple WGS and multiple RNA per sample_id, I think maybe selecting ids from the independent might help?
Interesting - these are both initial tumor, but one is polyA and one is stranded, the latter having the higher score.
No - I would treat each BS as unique (rather than patient), since in many cases, these may be different regions of tissue or different phases of therapy. For 7316-87, I am not sure what to make of that - we could assess the paired stranded/polyA and see if there is a trend for higher scores in stranded. Perhaps that can be a part of QC, and we might favor the stranded (due to larger N of a cohort) and just drop polyA.
Oh yes, this is what I meant - if we plot all pairs side by side, or just plot the polyA scores vs stranded scores, is there a clear difference in score by library (i.e., are stranded always higher)?
@jharenza here's a sample_id-matched plot: on average, polyA samples have lower scores than the sample_id-matched stranded BS ids. Since we use a cutoff of tp53_score > 0.5, the RNA_library only affects the annotation for 7316-161 and 7316-1455, where the stranded samples have scores higher than 0.5 but the polyA samples don't.
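The paired comparison described above could be checked along these lines: match stranded and polyA scores by `sample_id`, take the difference, and flag pairs where the 0.5 cutoff gives different calls. This is a sketch on made-up data. The sample ids and scores are invented, and the library labels `"stranded"` and `"poly-A"` are assumed spellings of the `RNA_library` values.

```r
# Toy scores: one row per RNA biospecimen. All values are made up.
scores <- data.frame(
  sample_id   = c("A", "A", "B", "B", "C"),
  RNA_library = c("stranded", "poly-A", "stranded", "poly-A", "stranded"),
  tp53_score  = c(0.62, 0.41, 0.30, 0.22, 0.90),
  stringsAsFactors = FALSE
)

stranded <- scores[scores$RNA_library == "stranded", c("sample_id", "tp53_score")]
polya    <- scores[scores$RNA_library == "poly-A",   c("sample_id", "tp53_score")]

# Inner join: only sample_ids with both libraries are kept.
paired <- merge(stranded, polya, by = "sample_id",
                suffixes = c("_stranded", "_polya"))

# Positive difference means the stranded score is higher.
paired$score_diff <- paired$tp53_score_stranded - paired$tp53_score_polya

# TRUE when the two libraries fall on different sides of the cutoff.
cutoff <- 0.5
paired$discordant <- (paired$tp53_score_stranded > cutoff) !=
  (paired$tp53_score_polya > cutoff)

print(paired)
print(mean(paired$score_diff))
```

If a `sample_id` has several biospecimens of the same library, `merge()` returns every stranded/polyA combination, so such samples would need to be handled before interpreting the mean difference.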
Also there was a comment before to update the format of output so I've added the SNV evidence and CNS copy number as well in columns HGVSp_Short and copy_number respectively. Does
Yes - just a few questions:
CNV_loss_counts is the number of copy number losses found for tp53 in the sample, and copy_number is the actual value of the copy loss for tp53. So it is redundant information, but I guess I added
SNV_indel_counts = length(unique(HGVSp_Short[!is.na(HGVSp_Short)])),
CNV_loss_counts = length(unique(copy_number[!is.na(copy_number)])),
HGVSp_Short = toString(unique(HGVSp_Short)),
copy_number = toString(unique(copy_number))
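The four expressions above look like arguments to a grouped summary, though that is an assumption since the surrounding call is not shown. The sketch below applies the same expressions per `sample_id` in base R so that it runs without extra packages. The input table and the helper name `summarise_sample` are hypothetical.

```r
# Toy input: one row per (sample_id, alteration). All values are made up.
tp53_hits <- data.frame(
  sample_id   = c("S1", "S1", "S1", "S2"),
  HGVSp_Short = c("p.R175H", "p.R175H", NA, NA),
  copy_number = c(1, NA, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse one sample's rows into a single summary row
# using the same four expressions quoted in the comment.
summarise_sample <- function(rows) {
  data.frame(
    sample_id        = rows$sample_id[1],
    SNV_indel_counts = length(unique(rows$HGVSp_Short[!is.na(rows$HGVSp_Short)])),
    CNV_loss_counts  = length(unique(rows$copy_number[!is.na(rows$copy_number)])),
    HGVSp_Short      = toString(unique(rows$HGVSp_Short)),
    copy_number      = toString(unique(rows$copy_number)),
    stringsAsFactors = FALSE
  )
}

per_sample <- do.call(
  rbind,
  lapply(split(tp53_hits, tp53_hits$sample_id), summarise_sample)
)
rownames(per_sample) <- NULL
print(per_sample)
```

Two things stand out in this form. `CNV_loss_counts` counts distinct non-missing `copy_number` values rather than individual loss events. And because `unique()` keeps `NA`, the collapsed strings can contain the literal text `NA` unless missing values are dropped first.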
Ok, sure, sounds good.
We have these multiple RNA ids matched with multiple DNA ids, and I don't think there are unique values in the columns to match on, so I used "sample_id",
I think having these in the output file should be OK; however, when we go to do any plots for ROC or other, we should make sure we are using unique RNA biospecimens.
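One way to get unique RNA biospecimens before plotting is to keep a single row per RNA id, as in the minimal sketch below. It assumes the score is identical across the repeated rows for a given RNA biospecimen, which is what happens when one RNA sample is matched to several DNA samples. The column `Kids_First_Biospecimen_ID_DNA` and all ids and values are made up for the example.

```r
# Toy output table: BS_1 appears twice because it matched two DNA biospecimens.
tp53_table <- data.frame(
  sample_id                     = c("X", "X", "Y"),
  Kids_First_Biospecimen_ID_RNA = c("BS_1", "BS_1", "BS_2"),
  Kids_First_Biospecimen_ID_DNA = c("BS_D1", "BS_D2", "BS_D3"),  # assumed column name
  tp53_score                    = c(0.7, 0.7, 0.2),
  stringsAsFactors = FALSE
)

# Keep the first row for each RNA biospecimen and only the columns
# needed for score-based plots.
keep <- !duplicated(tp53_table$Kids_First_Biospecimen_ID_RNA)
rna_unique <- tp53_table[keep, c("Kids_First_Biospecimen_ID_RNA", "tp53_score")]

# Guard against accidental duplicates before plotting.
stopifnot(anyDuplicated(rna_unique$Kids_First_Biospecimen_ID_RNA) == 0)
print(rna_unique)
```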
@jharenza thanks! I've also updated the README with descriptions for each column now.
looks ready!
👍 I did not run this myself but the logic seems to do what is described in the comments.
Purpose/implementation Section
What scientific question is your analysis addressing?
In this PR we will add TP53 alteration status as discussed in #837:
TP53 altered - loss, if :
TP53 altered - activated, if:
Note this annotation is also applicable to #807 HGG TP53 annotation
What was your approach?
Steps:
What GitHub issue does your pull request address?
#837
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
When matching by sample_id in these annotations, should cell lines be included?
I think one condition to discuss for annotation is CNV/SNV loss plus a TP53 score > 0.5: should this also be included as tp53_altered == "loss"?
Is there anything that you want to discuss further?
I added a couple of plots to visualize the distributions of scores. Do we want to discuss those further in this PR?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
yes
Results
What types of results are included (e.g., table, figure)?
table, notebook figures
What is your summary of the results?
OpenPBTA has 26 sample_ids with putative gain-of-function mutations, 96 with multiple loss-of-function mutations/CNVs, and 980 with unknown/no evidence for tp53 altered status.
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.