This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

#932 Part1 : Recurrence snv/indels in OpenPBTA using all calls from strelka2,mutect2,vardict and lancet #938

Closed
wants to merge 43 commits into from

@kgaonkar6 kgaonkar6 commented Feb 9, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

In general, the question in #932 is whether we can identify mutation hotspots in our data, especially sites that we don't see in the MSKCC cancer hotspot database (downloaded here).

In this PR I'm using the strelka2, mutect2, vardict, and lancet calls:

  • Filter for deleterious variants
    Filter IMPACT = 'HIGH|MODERATE'
    Filter Variant_Classification = 'Missense_Mutation|Splice_Region|In_Frame_Del|Frame_Shift_Del|Splice_Site|Nonsense_Mutation|Nonstop_Mutation|In_Frame_Ins|Frame_Shift_Ins'
  • Filter for genes in the oncogene, transcription factor, kinase, and TSG gene lists, plus the brain-goi list from @jharenza

  • Find recurrence
    Filter Tumor_Sample_Barcode to the samples listed in independent-specimens.wgswxs.primary-plus.tsv, which includes primary tumors when available and otherwise other tumor_descriptor types. Any genomic site that is found in 2 or more independent samples and annotated as brain-goi is then checked for recurrence.
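
As a rough illustration of the three steps above (this is not the module's actual R code), the filtering and counting can be sketched in plain Python. The column names follow the MAF fields mentioned in this description. The toy rows, the sample IDs `S1` to `S3`, the gene names, and the assumption that `type` holds a single gene-list label per row are all hypothetical.

```python
from collections import defaultdict

# Assumption: IMPACT values treated as deleterious, as listed in the description.
DELETERIOUS_IMPACT = {"HIGH", "MODERATE"}


def recurrent_sites(rows, independent_samples, gene_type="brain-goi", min_samples=2):
    """Count independent samples per genomic site; keep sites seen in >= min_samples."""
    samples_per_site = defaultdict(set)
    for row in rows:
        if row["IMPACT"] not in DELETERIOUS_IMPACT:
            continue  # not predicted deleterious
        if row["type"] != gene_type:
            continue  # not a gene of interest
        if row["Tumor_Sample_Barcode"] not in independent_samples:
            continue  # not in the independent-specimens list
        site = (row["Chromosome"], row["Start_Position"],
                row["End_Position"], row["Hugo_Symbol"])
        # Using a set means a sample reported by several callers is counted once.
        samples_per_site[site].add(row["Tumor_Sample_Barcode"])
    return {site: len(samples)
            for site, samples in samples_per_site.items()
            if len(samples) >= min_samples}


def _call(sample, gene, chrom, pos, impact, caller):
    """Build one hypothetical MAF-like row for the toy example."""
    return {"Tumor_Sample_Barcode": sample, "Hugo_Symbol": gene,
            "Chromosome": chrom, "Start_Position": pos, "End_Position": pos,
            "IMPACT": impact, "type": "brain-goi", "caller": caller}


toy_calls = [
    _call("S1", "GENE_A", "chr1", 100, "HIGH", "strelka"),
    _call("S1", "GENE_A", "chr1", 100, "HIGH", "mutect"),    # same sample, second caller
    _call("S2", "GENE_A", "chr1", 100, "HIGH", "vardict"),
    _call("S3", "GENE_A", "chr1", 100, "HIGH", "lancet"),    # S3 is not independent
    _call("S1", "GENE_B", "chr2", 200, "MODERATE", "strelka"),  # only one sample
    _call("S1", "GENE_C", "chr3", 300, "LOW", "strelka"),    # filtered by IMPACT
    _call("S2", "GENE_C", "chr3", 300, "LOW", "mutect"),     # filtered by IMPACT
]

toy_recurrent = recurrent_sites(toy_calls, independent_samples={"S1", "S2"})
```

In this toy input only the `GENE_A` site survives, with a count of 2, because the two callers for `S1` collapse into one sample and `S3` is excluded.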

What was your approach?

I used 01-setup_db.py from snv-callers by @jashapiro and @cansavvy to set up a SQL database of the calls from all callers. I only updated needed_cols to include IMPACT, HGVSp_Short, VAF, and Protein_position for all mafs, since these are needed for filtering and hotspot detection.

01-reccurence-hotspot-overlap.Rmd filters and combines calls from all callers, summarises the data to save recurrently mutated hotspots, and additionally annotates whether each site is seen in the MSKCC cancer hotspot database (file in the input folder).
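
The annotation step behaves like a left join: every recurrent site is kept, and sites with no match in the hotspot database simply get an empty annotation. A minimal Python sketch of that behaviour, assuming the join keys are `Hugo_Symbol` and `Amino_Acid_Position` as in the Rmd's `left_join()` call; the label text, the `n_samples` field, and the toy rows are hypothetical.

```python
def annotate_with_hotspots(recurrent_rows, hotspot_keys, label="MSKCC cancer hotspot"):
    """Left-join style annotation: keep every row, label only the matching ones."""
    annotated = []
    for row in recurrent_rows:
        key = (row["Hugo_Symbol"], row["Amino_Acid_Position"])
        # Unmatched rows are kept with None, mirroring NA after a left join.
        annotated.append({**row, "hotspot_database": label if key in hotspot_keys else None})
    return annotated


# Hypothetical recurrent sites and hotspot database keys.
toy_recurrent_rows = [
    {"Hugo_Symbol": "GENE_A", "Amino_Acid_Position": "34", "n_samples": 3},
    {"Hugo_Symbol": "GENE_B", "Amino_Acid_Position": "120", "n_samples": 2},
]
toy_hotspot_keys = {("GENE_A", "34")}

toy_annotated = annotate_with_hotspots(toy_recurrent_rows, toy_hotspot_keys)
```

Rows that come back with an empty `hotspot_database` are the candidates for sites not seen in the MSKCC database.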

What GitHub issue does your pull request address?

#932

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The recurrence counts were obtained per unique combination of Chromosome, Start_Position, End_Position, Amino_Acid_Position, and Hugo_Symbol. I kept the chromosome start/end mostly because Amino_Acid_Position is derived from the Protein_position column in the maf, which might change depending on the transcript used for annotation. Also, as we discussed in previous Slack conversations, multiple different nucleotide changes could lead to the same amino acid change. Let me know if this column selection should be updated.
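
To make the trade-off in the column selection concrete, here is a small Python sketch (not the module's code) that counts samples under two alternative grouping keys. The two toy calls are hypothetical: different nucleotide positions in the same gene that map to the same amino acid position.

```python
from collections import defaultdict


def count_samples_by(rows, key_columns):
    """Count distinct Tumor_Sample_Barcode values per combination of key_columns."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        groups[key].add(row["Tumor_Sample_Barcode"])
    return {key: len(samples) for key, samples in groups.items()}


# Hypothetical calls: two samples, adjacent nucleotides, same amino acid position.
toy_calls = [
    {"Tumor_Sample_Barcode": "S1", "Hugo_Symbol": "GENE_A", "Chromosome": "chr1",
     "Start_Position": 100, "End_Position": 100, "Amino_Acid_Position": "34"},
    {"Tumor_Sample_Barcode": "S2", "Hugo_Symbol": "GENE_A", "Chromosome": "chr1",
     "Start_Position": 101, "End_Position": 101, "Amino_Acid_Position": "34"},
]

# Key used in this PR: genomic coordinates plus amino acid position and gene.
GENOMIC_KEY = ("Chromosome", "Start_Position", "End_Position",
               "Amino_Acid_Position", "Hugo_Symbol")
# Alternative key: gene and amino acid position only.
PROTEIN_KEY = ("Hugo_Symbol", "Amino_Acid_Position")

by_genomic = count_samples_by(toy_calls, GENOMIC_KEY)
by_protein = count_samples_by(toy_calls, PROTEIN_KEY)
```

With the genomic key the two calls stay separate (one sample each, so neither is recurrent), while the protein-level key merges them into one site seen in two samples.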

Is there anything that you want to discuss further?

The column type was added to annotate gene types and to filter for genes of interest (here we only keep genes annotated as "brain-goi"). The column hotspot_database was added only as an annotation, to show whether sites that are recurrently mutated in genes of interest in OpenPBTA are found in the MSKCC cancer hotspot database or the COSMIC Cancer Gene Census list.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

309 sites are recurrent with the current filtering; see hotspots-detection/results/snv_recurrence.tsv.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@kgaonkar6 kgaonkar6 added the "in progress" label Feb 9, 2021
@kgaonkar6 kgaonkar6 requested a review from jharenza February 9, 2021 21:57
@kgaonkar6 kgaonkar6 added the "ready for review" label and removed the "in progress" label Feb 11, 2021
Comment on lines 143 to 145
calls_recurrence <- calls_recurrence %>%
  left_join(hotspot_database, by = c("Amino_Acid_Position", "Hugo_Symbol")) %>%
  write_tsv(file.path(results_dir, "snv_recurrence.tsv"))
Collaborator

Hi @kgaonkar6! Nice work on this. I have a few more suggestions.

After this step, I think it would be good to add some summary information. For example, when I looked at the mutations not in the MSK list, I saw 270 mutations and 61 genes. It would also be good to look at the distribution of VAFs.

For genes that have multiple mutations, I think we should plot some lollipop diagrams.

Another QC step we should do is make sure that none of these are SNPs.

One thing I noticed when inspecting AKT1, for example, was that many amino acid positions are NA and one is "-". Looking at pedcbio, it seems that many of these are splice variants. All of them are at low VAF as well, so maybe we should discuss with David on our call tomorrow.

jharenza commented Feb 16, 2021

Thanks for this @kgaonkar6 ! Per our call with St. Jude last week, will you also:

  1. Add an upset plot or table for just the splice variants to visualize which callers they came from? That is, are they all from VarDict and possibly false positives (FP)?
  2. For any of the sites that we are seeing multiple times, are they always low VAF or do we see some instances of high VAF?

From this, I think that the samples I really need to inspect are the 26 in brain-goi only. To the snv_recurrence.tsv file, will you add the rsIDs of the samples so I can inspect?

@kgaonkar6
Collaborator Author

I separated the plots into SNP and INDEL, and it does look like the majority of the indel counts come from vardict and/or lancet.
For the next question about the sites and VAF, I've added VAF plots along with some comments that summarize what we see in terms of VAFs in our dataset. Is there any other dataset I should look at?

The rsIDs are now added to the output.

@kgaonkar6 kgaonkar6 mentioned this pull request Feb 18, 2021
5 tasks
@jharenza
Collaborator

Thanks for adding this, but I was asking more specifically about the 26 mutations that were novel (brain-goi only). Do we see them, per patient, in only 1 caller? If yes, I think these could be FPs, and I would suggest that for those novel mutations we require the mutation to be present in two callers before we call it a novel hotspot. Then, of these, I will review dbSNP and manually inspect.
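
One possible reading of this check, sketched in Python rather than the module's R: collect the set of callers supporting each sample/site pair, then list the pairs that fall below a caller threshold. The `caller` column matches the one added with `mutate(caller = ...)` in the Rmd. The function names, the threshold default, and the toy rows are hypothetical.

```python
from collections import defaultdict


def callers_by_sample_and_site(rows):
    """Map (sample, site) to the set of callers that reported the site in that sample."""
    support = defaultdict(set)
    for row in rows:
        site = (row["Chromosome"], row["Start_Position"], row["End_Position"])
        support[(row["Tumor_Sample_Barcode"], site)].add(row["caller"])
    return dict(support)


def weakly_supported_calls(support, min_callers=2):
    """Return the (sample, site) pairs reported by fewer than min_callers callers."""
    return sorted(key for key, callers in support.items() if len(callers) < min_callers)


# Hypothetical calls at one site: S1 has two callers, S2 has only one.
toy_calls = [
    {"Tumor_Sample_Barcode": "S1", "Chromosome": "chr1",
     "Start_Position": 100, "End_Position": 100, "caller": "strelka"},
    {"Tumor_Sample_Barcode": "S1", "Chromosome": "chr1",
     "Start_Position": 100, "End_Position": 100, "caller": "mutect"},
    {"Tumor_Sample_Barcode": "S2", "Chromosome": "chr1",
     "Start_Position": 100, "End_Position": 100, "caller": "vardict"},
]

toy_support = callers_by_sample_and_site(toy_calls)
toy_single_caller = weakly_supported_calls(toy_support)
```

Here `S2` is flagged because only vardict supports the site in that sample, so under a two-caller requirement the site would be left with a single supporting sample.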

@kgaonkar6
Collaborator Author

Just want to add that there is an update to the results:

The set of recurrent sites has changed (now 192) because I'm now using an independent sample set (the primary tumor; if no primary is found, the recurrence is used instead). I'm also using the following columns so that the output is equivalent to the mafs used in the plotVaf() and plottiTv() functions.

          Chromosome = Chromosome with the mutation
          Start_Position = Genomic start position of the mutation
          End_Position = Genomic end position of the mutation
          Amino_Acid_Position = Amino acid position extracted from Protein_position maf column
          Hugo_Symbol = gene symbol
          type = gene annotation type
          gnomad_AF_common = gnomad_AF > 0.001 
          hotspot_database = site found in MSKCC cancer hotspot database or gene in Cancer Census gene list
          Variant_Classification = VEP variant classification
          dbSNP_RS = dbSNP ids
          Reference_Allele = Reference allele at site
          Tumor_Seq_Allele2 = Tumor allele detected at the site
          HGVSp_Short = short protein change annotation 
          Variant_Type = variant type description (SNP, Indel)

Note: the per-gene counts in plotVaf() will differ from those in results/snv_recurrence.tsv because of the unique VAFs found in each Tumor_Sample_Barcode, which are not used in the recurrence counts.

Comment on lines +91 to +131
## Read database
```{r}

db_file <- file.path(root_dir, params$db_file)

# Start up connection to the SQLite database file
con <- DBI::dbConnect(RSQLite::SQLite(), db_file)

```

## Designate caller tables from SQL file
```{r}
strelka <- dplyr::tbl(con, "strelka")
mutect <- dplyr::tbl(con, "mutect")
vardict <- dplyr::tbl(con, "vardict")
lancet <- dplyr::tbl(con, "lancet")

```

## Subset maf per caller
```{r}
source("utils/prepMaf.R")
strelka_subset <- prepMaf(strelka, gene_table = gene_table) %>%
  mutate(caller = "strelka")
mutect_subset <- prepMaf(mutect, gene_table = gene_table) %>%
  mutate(caller = "mutect")
vardict_subset <- prepMaf(vardict, gene_table = gene_table) %>%
  mutate(caller = "vardict")
lancet_subset <- prepMaf(lancet, gene_table = gene_table) %>%
  mutate(caller = "lancet")

# combine
combined_maf_subsets <- bind_rows(
  strelka_subset,
  mutect_subset,
  vardict_subset,
  lancet_subset
)

# save for future runs
saveRDS(combined_maf_subsets, file.path("results", "combined_maf_subsets.RDS"))

Collaborator

@kgaonkar6 now that #946 is in, can we remove this?

Collaborator Author

@kgaonkar6 kgaonkar6 Feb 22, 2021

Just wanted to clarify: are you suggesting we drop only the step that saves the RDS file, or the whole code chunk? Only the db_file is created in #946; we still need to combine the callers and format them with prepMaf() for our purposes.

Collaborator

ah, gotcha (I should have looked at the database PR - was assuming this was a part of that) - then I think this prep work should be moved into its own PR to make this PR more manageable

Collaborator Author

sounds good! I'll divide the PR


## Get filtered numbers

Upset plots to visualize caller contributions
Collaborator

I am not seeing these rendered in the HTML file - maybe committed too fast? Can you update?

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Comment on lines +276 to +278
getUpset(uniqhits,c("INS","DEL"))
```
Vardict and Lancet again uniquely support a lot of novel indels.
Collaborator

What is NA in this upset plot? Should that be one of the variant callers?

kgaonkar6 and others added 2 commits February 22, 2021 14:32
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
@kgaonkar6
Collaborator Author

Closing this PR in favor of #947.

@kgaonkar6 kgaonkar6 closed this Feb 22, 2021
@kgaonkar6 kgaonkar6 deleted the recurrence-snv branch May 13, 2021 17:17