#932 Part1 : Recurrence snv/indels in OpenPBTA using all calls from strelka2,mutect2,vardict and lancet #938

kgaonkar6 · 2021-02-09T15:43:54Z

Purpose/implementation Section

What scientific question is your analysis addressing?

The broader question in #932 is to identify mutation hotspots in our data, and in particular to look for recurrent sites that are not present in the MSKCC cancer hotspot database downloaded here.

In this PR I'm using the strelka2, mutect2, vardict and lancet calls.

Filtering for deleterious

Filter IMPACT = 'HIGH|MODERATE'
Filter Variant_Classification = 'Missense_Mutation|Splice_Region|In_Frame_Del|Frame_Shift_Del|Splice_Site|Nonsense_Mutation|Nonstop_Mutation|In_Frame_Ins|Frame_Shift_Ins'

Filter for genes in the oncogene, transcription factor, kinase and TSG gene lists, plus brain-goi from @jharenza
Find recurrence
Filter Tumor_Sample_Barcode to the samples in independent-specimens.wgswxs.primary-plus.tsv, which keeps the primary tumor when available and otherwise selects another tumor_descriptor type. Any genomic site that is found in 2 or more independent samples and annotated as brain-goi is then checked for recurrence.
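
The filtering and recurrence steps above can be sketched as follows. This is a minimal illustration in plain Python, not the code in the Rmd: the function name `recurrent_sites`, the toy records and the gene names `GENE_A`/`GENE_B` are hypothetical, and the `Frame_Shift_Ins` spelling follows the MAF convention as an assumption about the intended filter value.

```python
from collections import defaultdict

# Filter values taken from the PR description above.
KEEP_IMPACT = {"HIGH", "MODERATE"}
KEEP_CLASSIFICATION = {
    "Missense_Mutation", "Splice_Region", "In_Frame_Del", "Frame_Shift_Del",
    "Splice_Site", "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Ins",
    "Frame_Shift_Ins",
}


def recurrent_sites(calls, independent_samples, goi_genes, min_samples=2):
    """Return {site: n_samples} for sites seen in at least min_samples samples."""
    samples_per_site = defaultdict(set)
    for call in calls:
        if call["IMPACT"] not in KEEP_IMPACT:
            continue
        if call["Variant_Classification"] not in KEEP_CLASSIFICATION:
            continue
        if call["Hugo_Symbol"] not in goi_genes:
            continue
        if call["Tumor_Sample_Barcode"] not in independent_samples:
            continue
        site = (call["Chromosome"], call["Start_Position"],
                call["End_Position"], call["Hugo_Symbol"])
        # A set means the same sample reported by several callers counts once.
        samples_per_site[site].add(call["Tumor_Sample_Barcode"])
    return {site: len(samples)
            for site, samples in samples_per_site.items()
            if len(samples) >= min_samples}


def make_call(sample, gene, chrom, pos, impact, classification):
    """Build one toy MAF-like record (hypothetical helper for this example)."""
    return {"Tumor_Sample_Barcode": sample, "Hugo_Symbol": gene,
            "Chromosome": chrom, "Start_Position": pos, "End_Position": pos,
            "IMPACT": impact, "Variant_Classification": classification}


# Toy data, invented for illustration only.
toy_calls = [
    make_call("S1", "GENE_A", "chr1", 100, "MODERATE", "Missense_Mutation"),
    make_call("S2", "GENE_A", "chr1", 100, "MODERATE", "Missense_Mutation"),
    make_call("S2", "GENE_A", "chr1", 100, "MODERATE", "Missense_Mutation"),  # second caller
    make_call("S3", "GENE_A", "chr1", 100, "LOW", "Silent"),                  # not deleterious
    make_call("S4", "GENE_A", "chr1", 100, "HIGH", "Nonsense_Mutation"),      # not independent
    make_call("S1", "GENE_B", "chr2", 200, "HIGH", "Nonsense_Mutation"),      # single sample
]
toy_result = recurrent_sites(toy_calls, {"S1", "S2", "S3"}, {"GENE_A", "GENE_B"})
print(toy_result)
```

Counting distinct samples per site, and not rows, keeps a site from looking recurrent only because several callers reported it in one sample.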

What was your approach?

I used 01-setup_db.py from snv-callers by @jashapiro and @cansavvy to set up a SQL database of all the callers. I only updated needed_cols to include IMPACT, HGVSp_Short, VAF and Protein_position from all MAFs, which are needed for the filtering and hotspot detection.

01-reccurence-hotspot-overlap.Rmd filters and combines the calls from all callers, summarises the data to save recurrently mutated hotspots, and additionally annotates whether each site is seen in the MSKCC cancer hotspot database (file in the input folder).

What GitHub issue does your pull request address?

#932

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The recurrence counts were obtained per unique Chromosome, Start_Position, End_Position, Amino_Acid_Position, Hugo_Symbol. I kept the chromosome start/end mostly because Amino_Acid_Position is derived from the Protein_position column in the MAF, which might change depending on the transcript used for annotation. Also, as discussed in previous Slack conversations, multiple different nucleotide changes can lead to the same amino acid change. Let me know if this column selection should be updated.
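
To make the effect of the column selection concrete, here is a small sketch that counts distinct samples under two different grouping keys. It is only an illustration under assumed toy data: `count_by`, the sample IDs and `GENE_A` are hypothetical, and the two genomic positions are invented to stand for different nucleotide changes that map to the same amino acid position.

```python
from collections import Counter


def count_by(rows, key_cols):
    """Count distinct samples per combination of the given key columns."""
    seen = {(tuple(row[col] for col in key_cols), row["Tumor_Sample_Barcode"])
            for row in rows}
    return Counter(key for key, _sample in seen)


# Toy data: two samples, neighbouring nucleotides, same amino acid position.
toy_rows = [
    {"Tumor_Sample_Barcode": "S1", "Chromosome": "chr1", "Start_Position": 100,
     "End_Position": 100, "Amino_Acid_Position": "34", "Hugo_Symbol": "GENE_A"},
    {"Tumor_Sample_Barcode": "S2", "Chromosome": "chr1", "Start_Position": 101,
     "End_Position": 101, "Amino_Acid_Position": "34", "Hugo_Symbol": "GENE_A"},
]

genomic_key = ["Chromosome", "Start_Position", "End_Position",
               "Amino_Acid_Position", "Hugo_Symbol"]
protein_key = ["Amino_Acid_Position", "Hugo_Symbol"]

by_genomic = count_by(toy_rows, genomic_key)
by_protein = count_by(toy_rows, protein_key)
print(by_genomic)
print(by_protein)
```

With the genomic key each site is seen in one sample, so neither passes a 2-sample threshold; with the protein-level key the same two calls collapse into one site seen in two samples. Which behaviour is wanted is exactly the question raised for reviewers.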

Is there anything that you want to discuss further?

The column type was added to annotate gene types and to filter for genes of interest (here we only keep genes annotated as "brain-goi"). The column hotspot-database was added only as an annotation, to show whether sites that are recurrently mutated in genes of interest in OpenPBTA are found in the MSKCC cancer hotspot database or the Cosmic Census gene list.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

309 sites are recurrent with the current filtering; see hotspots-detection/results/snv_recurrence.tsv

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

jharenza · 2021-02-11T20:18:27Z

analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

+calls_recurrence <- calls_recurrence %>%
+	left_join(hotspot_database,by=c("Amino_Acid_Position","Hugo_Symbol"))%>%
+	write_tsv(file.path(results_dir,"snv_recurrence.tsv"))


Hi @kgaonkar6! Nice work on this. I have a few more suggestions.

After this step, I think it would be good to add some summary information. For ex, when I looked at the mutations not in the MSK list, I see 270 and 61 genes. It would also be good to look at distribution of VAFs.

For genes that have multiple mutations, I think we should plot some lollipop diagrams.

Another QC step we should do is make sure that none of these are SNPs.

One thing I noticed when inspecting AKT1 for ex, was that many amino acid positions are NA and one is -. In looking at pedcbio, it looks like many of these are splice variants. All of them are at low VAF as well, so maybe we should discuss with David on our call tomorrow.
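
The annotation step in the diff and the summary suggested in this comment could look roughly like the sketch below. It is written in plain Python for illustration and is not the Rmd code: `annotate`, `summarize_novel` and the toy rows are hypothetical, and it assumes the hotspot database holds at most one entry per (Amino_Acid_Position, Hugo_Symbol) pair, so the join never duplicates rows.

```python
def annotate(recurrence_rows, hotspot_db):
    """Left join on (Amino_Acid_Position, Hugo_Symbol); unmatched rows get None."""
    annotated = []
    for row in recurrence_rows:
        key = (row["Amino_Acid_Position"], row["Hugo_Symbol"])
        annotated.append({**row, "hotspot_database": hotspot_db.get(key)})
    return annotated


def summarize_novel(annotated):
    """Summarise the sites that have no match in the hotspot database."""
    novel = [row for row in annotated if row["hotspot_database"] is None]
    return {
        "n_sites": len(novel),
        "n_genes": len({row["Hugo_Symbol"] for row in novel}),
        # Positions recorded as missing or "-" are worth a separate look.
        "n_missing_aa": sum(row["Amino_Acid_Position"] in (None, "-")
                            for row in novel),
    }


# Toy data, invented for illustration only.
toy_hotspot_db = {("34", "GENE_A"): "MSKCC"}
toy_recurrence = [
    {"Hugo_Symbol": "GENE_A", "Amino_Acid_Position": "34"},
    {"Hugo_Symbol": "GENE_A", "Amino_Acid_Position": "50"},
    {"Hugo_Symbol": "GENE_B", "Amino_Acid_Position": None},
    {"Hugo_Symbol": "GENE_B", "Amino_Acid_Position": "-"},
]

toy_annotated = annotate(toy_recurrence, toy_hotspot_db)
toy_summary = summarize_novel(toy_annotated)
print(toy_summary)
```

Reporting the number of rows with a missing or "-" amino acid position alongside the site and gene counts makes cases like the splice variants mentioned above visible without opening the table.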

…A-analysis into recurrence-snv

jharenza · 2021-02-16T22:07:01Z

Thanks for this @kgaonkar6 ! Per our call with St. Jude last week, will you also:

Add an upset plot or table for just the splice variants to visualize which callers they came from? ie - are they all from VarDict and maybe FP?
For any of the sites that we are seeing multiple times, are they always low VAF or do we see some instances of high VAF?

From this, I think that the samples I really need to inspect are the 26 in brain-goi only. To the snv_recurrence.tsv file, will you add the rsIDs of the samples so I can inspect?

kgaonkar6 · 2021-02-17T22:23:31Z

I separated the plots into SNP and INDEL, and it does look like the majority of the indel counts come from vardict and/or lancet.
For the next question about the sites and VAF, I've added the VAF plots along with some comments summarizing what we see in terms of VAFs in our dataset. Is there any other dataset I should look at?

The rsIDs are now added to the output.

jharenza · 2021-02-18T21:04:58Z

Thanks for adding this, but I was asking more specifically about the 26 mutations that were novel (brain-goi only). Per patient, do we see them in only 1 caller? If yes, I think these could be FPs, and I would suggest that for those novel mutations we require the mutation to be present in two callers before we call it a novel hotspot. Then, of these, I will review dbSNP and manually inspect.
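
One way to check the per-patient caller support suggested here is sketched below. It is a hedged illustration only: `callers_per_call`, `multi_caller_calls` and the toy calls are hypothetical, and keying a call on sample, position and tumor allele is an assumption about what "the same mutation" should mean.

```python
from collections import defaultdict


def callers_per_call(calls):
    """Map each (sample, site, tumor allele) to the set of callers reporting it."""
    support = defaultdict(set)
    for call in calls:
        key = (call["Tumor_Sample_Barcode"], call["Chromosome"],
               call["Start_Position"], call["End_Position"],
               call["Tumor_Seq_Allele2"])
        support[key].add(call["caller"])
    return support


def multi_caller_calls(calls, min_callers=2):
    """Keep only calls supported by at least min_callers distinct callers."""
    return {key: sorted(callers)
            for key, callers in callers_per_call(calls).items()
            if len(callers) >= min_callers}


def toy_call(sample, caller):
    """Build one toy call at a fixed, invented site."""
    return {"Tumor_Sample_Barcode": sample, "Chromosome": "chr1",
            "Start_Position": 100, "End_Position": 100,
            "Tumor_Seq_Allele2": "T", "caller": caller}


# Toy data: S1 is supported by two callers, S2 by vardict only.
toy_support_calls = [
    toy_call("S1", "strelka"),
    toy_call("S1", "mutect"),
    toy_call("S2", "vardict"),
]
toy_supported = multi_caller_calls(toy_support_calls)
print(toy_supported)
```

Calls that drop out at `min_callers=2` are the single-caller candidates that would need manual review before being called a novel hotspot.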

kgaonkar6 · 2021-02-22T16:04:20Z

Just want to add that there is an update to the results:

The recurrent sites (now 192) have changed because I'm now using an independent sample set (primary; if no primary is found, add recurrence). I'm also using the following columns to be equivalent to the MAFs used in the plotVaf() and plottiTv() functions.

          Chromosome = Chromosome with the mutation
          Start_Position = Genomic start position of the mutation
          End_Position = Genomic end position of the mutation
          Amino_Acid_Position = Amino acid position extracted from Protein_position maf column
          Hugo_Symbol = gene symbol
          type = gene annotation type
          gnomad_AF_common = gnomad_AF > 0.001 
          hotspot_database = site found in MSKCC cancer hotspot database or gene in Cancer Census gene list
          Variant_Classification = VEP variant classification
          dbSNP_RS = dbSNP ids
          Reference_Allele = Reference allele at site
          Tumor_Seq_Allele2 = Tumor allele detected at the site
          HGVSp_Short = short protein change annotation 
          Variant_Type = variant type description (SNP, Indel)
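
The gnomad_AF_common flag in the column list above can be sketched as a simple threshold check. This is an illustration only: the function name is hypothetical, and treating a missing gnomad_AF as "not common" is an assumption made here, not something stated in the PR.

```python
def gnomad_af_common(gnomad_af, threshold=0.001):
    """Return True when the allele frequency is strictly above the threshold.

    Assumption: a missing frequency (None or empty string) counts as not common.
    """
    if gnomad_af is None or gnomad_af == "":
        return False
    return float(gnomad_af) > threshold


# Toy values covering a common variant, the boundary, a string and a missing value.
toy_flags = [gnomad_af_common(value) for value in (0.01, 0.001, "0.2", None)]
print(toy_flags)
```

Sites flagged as common would be candidates for the SNP check requested earlier in the review.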

Note: the counts per gene in plotVaf() will be different from what is found in results/snv_recurrence.tsv because of the unique VAFs found in each Tumor_Sample_Barcode, which are not used in the recurrence counts.

jharenza · 2021-02-22T18:25:14Z

analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

+## Read database 
+```{r}
+
+db_file <- file.path(root_dir, params$db_file)
+
+# Start up connection
+con <- DBI::dbConnect(RSQLite::SQLite(), db_file,path = ":memory:")
+
+```
+
+## Designate caller tables from SQL file
+```{r}
+strelka <- dplyr::tbl(con, "strelka")
+mutect <- dplyr::tbl(con, "mutect")
+vardict <- dplyr::tbl(con, "vardict")
+lancet <- dplyr::tbl(con, "lancet") 	
+
+```
+
+## Subset maf per caller
+```{r}
+source ("utils/prepMaf.R")
+strelka_subset<-prepMaf(strelka, gene_table = gene_table) %>%
+       mutate(caller="strelka")	
+mutect_subset<-prepMaf(mutect, gene_table = gene_table) %>%
+	mutate(caller="mutect")
+vardict_subset<-prepMaf(vardict, gene_table = gene_table) %>%
+	mutate(caller="vardict")
+lancet_subset<-prepMaf(lancet, gene_table = gene_table) %>%
+	mutate(caller="lancet")
+
+# combine
+combined_maf_subsets <-bind_rows(strelka_subset,
+mutect_subset,
+vardict_subset,
+lancet_subset) 
+
+
+# save for future runs
+saveRDS(combined_maf_subsets, file.path("results","combined_maf_subsets.RDS"))
+


@kgaonkar6 now that #946 is in, can we remove this?

Just wanted to clarify: are you suggesting we don't save the RDS file, or that we remove the whole code chunk? Only the db_file is created in #946; we still need to combine the callers and format them with prepMaf() for our purposes.

ah, gotcha (I should have looked at the database PR - was assuming this was a part of that) - then I think this prep work should be moved into its own PR to make this PR more manageable

sounds good! I'll divide the PR
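
For readers following the thread, the "combine the callers" step discussed here can be sketched with Python's built-in sqlite3 module. This is not the Rmd chunk: it builds a toy in-memory database in place of the real db_file, `combine_callers` is a hypothetical helper, the toy columns are a small subset chosen for illustration, and the formatting done by prepMaf() is not reproduced.

```python
import sqlite3

# Table names as used in the diff above.
CALLERS = ["strelka", "mutect", "vardict", "lancet"]


def combine_callers(con, columns):
    """Read the same columns from every caller table and tag rows with the caller."""
    col_sql = ", ".join(columns)
    combined = []
    for caller in CALLERS:
        # Table and column names come from fixed lists in this sketch,
        # so formatting them into the SQL string is acceptable here.
        for row in con.execute(f"SELECT {col_sql} FROM {caller}"):
            combined.append({**dict(zip(columns, row)), "caller": caller})
    return combined


# Toy in-memory database standing in for the real SQLite file.
con = sqlite3.connect(":memory:")
for caller in CALLERS:
    con.execute(
        f"CREATE TABLE {caller} "
        "(Tumor_Sample_Barcode TEXT, Hugo_Symbol TEXT, VAF REAL)"
    )
con.execute("INSERT INTO strelka VALUES ('S1', 'GENE_A', 0.4)")
con.execute("INSERT INTO vardict VALUES ('S1', 'GENE_A', 0.05)")
con.execute("INSERT INTO vardict VALUES ('S2', 'GENE_B', 0.03)")

toy_combined = combine_callers(con, ["Tumor_Sample_Barcode", "Hugo_Symbol", "VAF"])
con.close()
print(toy_combined)
```

Keeping the caller as a column in one combined table is what later makes per-caller summaries, such as the upset plots, straightforward.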

jharenza · 2021-02-22T18:26:43Z

analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

+
+## Get filtered numbers
+
+Upset plots to visualize caller contributions


I am not seeing these rendered in the HTML file - maybe committed too fast? Can you update?

jharenza · 2021-02-22T18:34:40Z

analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

+getUpset(uniqhits,c("INS","DEL"))
+```
+Vardict and Lancet again uniquely support a lot of novel indels.


What is NA in this upset plot? Should that be one of the variant callers?

kgaonkar6 · 2021-02-22T21:48:56Z

Closing this PR for #947

kgaonkar6 and others added 20 commits February 3, 2021 14:54

recurrence strelka

3b60681

n>-2

1a52a9a

add Protein_position

8dbe7af

combined snv

3f22239

snv-recurrence

8ab5d4b

re-run filter more than 2

904826e

removeing old folder

2c07160

removing unused functions

11d1cb0

add a readme

1b0c78b

Update README.md

4133e9a

combine types

1a86b59

Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…

723ba7a

…A-analysis into recurrence-snv

combine types

fecbe3a

uniq

cbcba83

Update README.md

9dfa128

Update README.md

deeaba2

Update README.md

b3ebf55

comment edits

8f8bf74

Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…

84133b3

…A-analysis into recurrence-snv

Merge branch 'master' into recurrence-snv

4591e99

kgaonkar6 added the "in progress" label Feb 9, 2021

kgaonkar6 added 2 commits February 9, 2021 16:31

update brain-goi

090d690

Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…

3dcae0b

…A-analysis into recurrence-snv

kgaonkar6 requested a review from jharenza February 9, 2021 21:57

kgaonkar6 added 2 commits February 9, 2021 17:21

Merge branch 'master' into recurrence-snv

405b26a

Merge branch 'master' into recurrence-snv

c55484c

kgaonkar6 added the "ready for review" label and removed the "in progress" label Feb 11, 2021

jharenza suggested changes Feb 11, 2021

updating cols to use

c38cedb

kgaonkar6 and others added 3 commits February 12, 2021 13:05

Merge branch 'master' into recurrence-snv

ec3fa43

add uniq hits plots

4a0dfed

Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…

a41e20c

…A-analysis into recurrence-snv

added dbSNP_RS

6e4b373

Merge branch 'master' into recurrence-snv

9c3bb4b

kgaonkar6 mentioned this pull request Feb 18, 2021

Update setup db #946

Merged

5 tasks

kgaonkar6 added 8 commits February 18, 2021 18:30

adding upset plots for each type of calls

8781fc5

pull updates

e63b85e

update to snv-caller path

ba803f8

adding comments for maf creation

c50ce1b

remove swp files

b795ca0

adding upset function

fd6e75d

independent samples in recurrence

79c66dd

adding Ref_Allele Tumor_Allele to recurrence

5d99b39

jharenza reviewed Feb 22, 2021

Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

9aa4c63

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

jharenza reviewed Feb 22, 2021

jharenza reviewed Feb 22, 2021

kgaonkar6 and others added 2 commits February 22, 2021 14:32

Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

cefac9c

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd

a727316

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

kgaonkar6 closed this Feb 22, 2021

kgaonkar6 deleted the recurrence-snv branch May 13, 2021 17:17
