-
Notifications
You must be signed in to change notification settings - Fork 83
#932 Part1 : Recurrence snv/indels in OpenPBTA using all calls from strelka2,mutect2,vardict and lancet #938
Conversation
calls_recurrence <- calls_recurrence %>% | ||
left_join(hotspot_database,by=c("Amino_Acid_Position","Hugo_Symbol"))%>% | ||
write_tsv(file.path(results_dir,"snv_recurrence.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kgaonkar6! Nice work on this. I have a few more suggestions.
After this step, I think it would be good to add some summary information. For ex, when I looked at the mutations not in the MSK list, I see 270 and 61 genes. It would also be good to look at distribution of VAFs.
For genes that have multiple mutations, I think we should plot some lollipop diagrams.
Another QC step we should do is make sure that none of these are SNPs.
One thing I noticed when inspecting AKT1 for ex, was that many amino acid positions are NA and one is -
. In looking at pedcbio, it looks like many of these are splice variants. All of them are at low VAF as well, so maybe we should discuss with David on our call tomorrow.
Thanks for this @kgaonkar6 ! Per our call with St. Jude last week, will you also:
From this, I think that the samples I really need to inspect are the 26 in brain-goi only. To the |
I separated the plots into SNP and INDEL and it does look like majority of the indel counts come from vardict AND OR lancet. Output has rsIDs are added now. |
Thanks for adding this, but I was asking more specifically for the 26 mutations that were novel (brain-goi only). Do we see them per patient only in 1 caller? If yes, I think that this could be a FP and I would suggest for those novel mutations that perhaps we require that the mutations be present in two callers for us to call it a novel hotspot. Then, of these, I will review dbSNP and manually inspect. |
Just want to add that there is an update to the results: Recurrence sites (now 192 ) have changed because now I'm using an
Note , the counts in plotVaf() per gene will be different compared to what is found in |
## Read database | ||
```{r} | ||
|
||
db_file <- file.path(root_dir, params$db_file) | ||
|
||
# Start up connection | ||
con <- DBI::dbConnect(RSQLite::SQLite(), db_file,path = ":memory:") | ||
|
||
``` | ||
|
||
## Designate caller tables from SQL file | ||
```{r} | ||
strelka <- dplyr::tbl(con, "strelka") | ||
mutect <- dplyr::tbl(con, "mutect") | ||
vardict <- dplyr::tbl(con, "vardict") | ||
lancet <- dplyr::tbl(con, "lancet") | ||
|
||
``` | ||
|
||
## Subset maf per caller | ||
```{r} | ||
source ("utils/prepMaf.R") | ||
strelka_subset<-prepMaf(strelka, gene_table = gene_table) %>% | ||
mutate(caller="strelka") | ||
mutect_subset<-prepMaf(mutect, gene_table = gene_table) %>% | ||
mutate(caller="mutect") | ||
vardict_subset<-prepMaf(vardict, gene_table = gene_table) %>% | ||
mutate(caller="vardict") | ||
lancet_subset<-prepMaf(lancet, gene_table = gene_table) %>% | ||
mutate(caller="lancet") | ||
|
||
# combine | ||
combined_maf_subsets <-bind_rows(strelka_subset, | ||
mutect_subset, | ||
vardict_subset, | ||
lancet_subset) | ||
|
||
|
||
# save for future runs | ||
saveRDS(combined_maf_subsets, file.path("results","combined_maf_subsets.RDS")) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgaonkar6 now that #946 is in, can we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to clarify, are you suggesting we don't save the RDS file or the whole code chunk? Only the db_file is created in #946 we need to combine the callers and format it with prepMaf() for our purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, gotcha (I should have looked at the database PR - was assuming this was a part of that) - then I think this prep work should be moved into its own PR to make this PR more manageable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! I'll divide the PR
|
||
## Get filtered numbers | ||
|
||
Upset plots to visualize caller contributions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not seeing these rendered in the HTML file - maybe committed too fast? Can you update?
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
getUpset(uniqhits,c("INS","DEL")) | ||
``` | ||
Vardict and Lancet again uniquely support a lot of novel indels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is NA
in this upset plot? Should that be one of the variant callers?
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Closing this PR for #947 |
Purpose/implementation Section
What scientific question is your analysis addressing?
In general the question in #932 is to generate/check for hotspots in our data specially to look for sites we don't see them in MSKCC cancer database downloaded here .
In this PR I'm using strelka2,mutect2,vardict and lancet calls
Filter for oncogene,transcriptionfactor, kinase, TSGs genes list + brain-goi from @jharenza
Find recurrence
Filter Tumor_Sample_Barcode to keep samples to independent-specimens.wgswxs.primary-plus.tsv to include primary tumors when available, if not select other tumor_descriptor types. Now any genomic site found in 2 or more independent sample and annotated as brain-goi is checked for recurrence.
What was your approach?
I used 01-setup_db.py from snv-callers by @jashapiro and @cansavvy to set up sql database of all the callers. I just updated
needed_cols
to include IMPACT, HGVSp_Short,VAF and Protein_position in all mafs which were needed for the filtering hotspots detections.01-reccurence-hotspot-overlap.Rmd filters and combines calls from all callers and summarises the data to save recurrent mutated hotspots and additionally annotates the site as seen in MSKCC cancer hotspot database (file in input folder)
What GitHub issue does your pull request address?
#932
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The recurrent counts were obtained per unique
Chromosome,Start_Position,End_Position,Amino_Acid_Position,Hugo_Symbol
I kept the chrom start/end mostly because Amino_Acid_Position is derived from the Protein_position column in maf which we might change depending on transcript used for annotation. But also as in previous slack convos we discussed that there might be multiple different nucleotide changes could lead to the same amino acid change. Let me know if this column selection should be updated.Is there anything that you want to discuss further?
Columns
type
was added to annotate gene types , and to filter genes of interest (here we are only keeping genes annotated as "brain-goi" ) and columnhotspot-database
was added only as annotation to show if sites that are recurrently mutated in genes of interest in OpenPBTA are found in MSKCC cancer hotspot database or Cosmic Census gene listIs the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
tables
What is your summary of the results?
309 sites are recurrent with the current filtering hotspots-detection/results/snv_recurrence.tsv
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.