This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 83
#932 Part1 : Recurrence snv/indels in OpenPBTA using all calls from strelka2,mutect2,vardict and lancet #938
Closed
Closed
Changes from 40 commits
Commits
Show all changes
43 commits
Select commit
Hold shift + click to select a range
3b60681
recurrence strelka
kgaonkar6 1a52a9a
n>-2
kgaonkar6 8dbe7af
add Protein_position
kgaonkar6 3f22239
combined snv
kgaonkar6 8ab5d4b
snv-recurrence
kgaonkar6 904826e
re-run filter more than 2
kgaonkar6 2c07160
removeing old folder
kgaonkar6 11d1cb0
removing unused functions
kgaonkar6 1b0c78b
add a readme
kgaonkar6 4133e9a
Update README.md
kgaonkar6 1a86b59
combine types
kgaonkar6 723ba7a
Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…
kgaonkar6 fecbe3a
combine types
kgaonkar6 cbcba83
uniq
kgaonkar6 9dfa128
Update README.md
kgaonkar6 deeaba2
Update README.md
kgaonkar6 b3ebf55
Update README.md
kgaonkar6 8f8bf74
comment edits
kgaonkar6 84133b3
Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…
kgaonkar6 4591e99
Merge branch 'master' into recurrence-snv
kgaonkar6 090d690
update brain-goi
kgaonkar6 3dcae0b
Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…
kgaonkar6 405b26a
Merge branch 'master' into recurrence-snv
kgaonkar6 c55484c
Merge branch 'master' into recurrence-snv
kgaonkar6 c38cedb
updating cols to use
kgaonkar6 5de2ddf
Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…
kgaonkar6 6cb8067
adding plots
kgaonkar6 ec3fa43
Merge branch 'master' into recurrence-snv
kgaonkar6 4a0dfed
add uniq hits plots
kgaonkar6 a41e20c
Merge branch 'recurrence-snv' of https://github.com/kgaonkar6/OpenPBT…
kgaonkar6 6e4b373
added dbSNP_RS
kgaonkar6 9c3bb4b
Merge branch 'master' into recurrence-snv
kgaonkar6 8781fc5
adding upset plots for each type of calls
kgaonkar6 e63b85e
pull updates
kgaonkar6 ba803f8
update to snv-caller path
kgaonkar6 c50ce1b
adding comments for maf creation
kgaonkar6 b795ca0
remove swp files
kgaonkar6 fd6e75d
adding upset function
kgaonkar6 79c66dd
independent samples in recurrence
kgaonkar6 5d99b39
adding Ref_Allele Tumor_Allele to recurrence
kgaonkar6 9aa4c63
Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd
kgaonkar6 cefac9c
Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd
kgaonkar6 a727316
Update analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd
kgaonkar6 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
383 changes: 383 additions & 0 deletions
383
analyses/hotspots-detection/01-reccurence-hotspot-overlap.Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,383 @@ | ||
--- | ||
title: "Recurrent gene hotspots" | ||
output: html_notebook | ||
params: | ||
db_file: | ||
label: "all callers db" | ||
value: scratch/snv_db.sqlite | ||
input: file | ||
cancer_hotspot_file: | ||
label: "MSKCC cancer hotspot file" | ||
value: input/hotspots_v2.xls | ||
input: file | ||
gene_list : | ||
label: gene table with gene type annotation | ||
value: ../fusion_filtering/references/genelistreference.txt | ||
input: file | ||
--- | ||
|
||
In this notebook we will use look at recurrence and hotspots in OpenPBTA snv | ||
|
||
Recurrence: | ||
|
||
- Filter Variant_Classification ='Missense_Mutation|Splice_Region|In_Frame_Del|Frame_Shift_Del|Splice_Site|Splice_Region|Nonsense_Mutation|Nonstop_Mutation|In_Frame_Ins|Frameshift_Ins', | ||
- Filter IMPACT ='HIGH|MODERATE' | ||
- Filter given gene list from fusion_filtering/reference/genelistreference.txt | ||
- Count reoccurence per Hugo_Symbol+Protein_position+DOMAINS+type | ||
|
||
Hotspots: | ||
-Protein_position overlaps MSKCC cancer hotspot database | ||
|
||
|
||
|
||
## Setup | ||
```{r} | ||
library("tidyverse") | ||
library("maftools") | ||
|
||
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git")) | ||
data_dir <- file.path(root_dir, "data") | ||
mskcc_cancer_hotspot_file <- params$cancer_hotspot_file | ||
internal_hotspot_file <- params$internal_hotspot_file | ||
|
||
results_dir <- file.path(root_dir, | ||
"analyses", | ||
"hotspots-detection", | ||
"results") | ||
|
||
if (!dir.exists(results_dir)) { | ||
dir.create(results_dir) | ||
} | ||
|
||
# Select independent sample set | ||
# Include primary tumors when available, if not select other tumor_descriptor types | ||
independentsamples <- read_tsv(file.path( | ||
root_dir, | ||
"analyses", | ||
"independent-samples", | ||
"results", | ||
"independent-specimens.wgswxs.primary-plus.tsv")) %>% | ||
pull(Kids_First_Biospecimen_ID) | ||
``` | ||
|
||
## Cancer hotspot files | ||
```{r} | ||
|
||
# Cancer database | ||
hotspot_database_2017_snv <- readxl::read_xls(mskcc_cancer_hotspot_file,sheet = 1) | ||
hotspot_database_2017_indel <- readxl::read_xls(mskcc_cancer_hotspot_file,sheet = 2) | ||
|
||
hotspot_database <- bind_rows( hotspot_database_2017_snv, hotspot_database_2017_indel) %>% | ||
dplyr::select("Amino_Acid_Position","Hugo_Symbol") %>% | ||
dplyr::mutate(hotspot_database="MSKCC") %>% | ||
unique() | ||
|
||
``` | ||
|
||
## Gene table for reccurence checks | ||
```{r} | ||
gene_table <- read_tsv("../fusion_filtering/references/genelistreference.txt")%>% | ||
dplyr::rename(Hugo_Symbol=Gene_Symbol) %>% | ||
dplyr::select(Hugo_Symbol,type) | ||
|
||
brain_goi <- read_tsv("input/brain-goi-list-new.txt",col_names = "Hugo_Symbol") %>% | ||
dplyr::mutate(type="brain-goi") | ||
|
||
gene_table <- bind_rows(gene_table, | ||
brain_goi) | ||
|
||
``` | ||
|
||
## Read database | ||
```{r} | ||
|
||
db_file <- file.path(root_dir, params$db_file) | ||
|
||
# Start up connection | ||
con <- DBI::dbConnect(RSQLite::SQLite(), db_file,path = ":memory:") | ||
|
||
``` | ||
|
||
## Designate caller tables from SQL file | ||
```{r} | ||
strelka <- dplyr::tbl(con, "strelka") | ||
mutect <- dplyr::tbl(con, "mutect") | ||
vardict <- dplyr::tbl(con, "vardict") | ||
lancet <- dplyr::tbl(con, "lancet") | ||
|
||
``` | ||
|
||
## Subset maf per caller | ||
```{r} | ||
source ("utils/prepMaf.R") | ||
strelka_subset<-prepMaf(strelka, gene_table = gene_table) %>% | ||
mutate(caller="strelka") | ||
mutect_subset<-prepMaf(mutect, gene_table = gene_table) %>% | ||
mutate(caller="mutect") | ||
vardict_subset<-prepMaf(vardict, gene_table = gene_table) %>% | ||
mutate(caller="vardict") | ||
lancet_subset<-prepMaf(lancet, gene_table = gene_table) %>% | ||
mutate(caller="lancet") | ||
|
||
# combine | ||
combined_maf_subsets <-bind_rows(strelka_subset, | ||
mutect_subset, | ||
vardict_subset, | ||
lancet_subset) | ||
|
||
|
||
# save for future runs | ||
saveRDS(combined_maf_subsets, file.path("results","combined_maf_subsets.RDS")) | ||
|
||
``` | ||
|
||
|
||
## Get filtered numbers | ||
|
||
Upset plots to visualize caller contributions | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not seeing these rendered in the HTML file - maybe committed too fast? Can you update? |
||
|
||
```{r} | ||
source(file.path("utils","getUpset.R")) | ||
|
||
``` | ||
|
||
## Upset graphs | ||
|
||
SNP: Single nucleotide polymorphism -- a substitution in one nucleotide | ||
DNP: Double nucleotide polymorphism -- a substitution in two consecutive nucleotides | ||
TNP: Triple nucleotide polymorphism -- a substitution in three consecutive nucleotides | ||
ONP: Oligo-nucleotide polymorphism -- a substitution in more than three consecutive nucleotides | ||
|
||
|
||
```{r} | ||
getUpset(combined_maf_subsets,c("SNP","DNP","TNP","ONP")) | ||
``` | ||
Vardict uniquely support a lot of the filtered sites but interestingly majority of the subtitution mutations are supported by all callers | ||
|
||
INS: Insertion -- the addition of nucleotides | ||
DEL: Deletion -- the removal of nucleotides | ||
```{r} | ||
getUpset(combined_maf_subsets,c("INS","DEL")) | ||
``` | ||
Vardict and Lancet uniquely support a lot of filtred indels | ||
kgaonkar6 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
||
VAF plot | ||
```{r} | ||
|
||
ggplot(combined_maf_subsets)+geom_violin(aes(x=caller,y=VAF,color=caller))+theme_bw() | ||
``` | ||
|
||
## Summarise calls per site in Tumor_Sample_Barcode for brain GOI only | ||
```{r} | ||
summarise_sites <- combined_maf_subsets %>% | ||
left_join(gene_table,by="Hugo_Symbol") %>% | ||
mutate(gnomad_AF_common = gnomAD_AF>0.001 & !is.na(gnomAD_AF)) %>% | ||
group_by(Chromosome,Start_Position,End_Position, | ||
Amino_Acid_Position,Hugo_Symbol, | ||
Tumor_Sample_Barcode) %>% | ||
summarise(type=toString(unique(type)), | ||
gnomad_AF_common = any(gnomad_AF_common), | ||
VAF = mean(VAF), | ||
dbSNP_RS = toString(unique(dbSNP_RS)), | ||
Variant_Classification=toString(unique(Variant_Classification)), | ||
Reference_Allele=toString(unique(Reference_Allele)), | ||
Tumor_Seq_Allele2=toString(unique(Tumor_Seq_Allele2)), | ||
HGVSp_Short=toString(unique(HGVSp_Short)), | ||
Variant_Type=toString(unique(Variant_Type))) %>% | ||
left_join(hotspot_database,by=c("Amino_Acid_Position","Hugo_Symbol")) %>% | ||
mutate(hotspot_database = case_when( | ||
grepl("Cosmic",type) & !is.na(hotspot_database) ~ paste0(hotspot_database,",Cosmic"), | ||
grepl("Cosmic",type) & is.na(hotspot_database) ~ "Cosmic", | ||
TRUE ~ hotspot_database | ||
)) %>% | ||
dplyr::filter( grepl("brain-goi",type)) %>% | ||
dplyr::filter(Tumor_Sample_Barcode %in% independentsamples) | ||
|
||
|
||
``` | ||
|
||
## Recurrence in genes part of brain-goi from filtered snv calls | ||
```{r} | ||
calls_recurrence <- summarise_sites %>% | ||
count(Chromosome, | ||
Start_Position, | ||
End_Position, | ||
Amino_Acid_Position, | ||
Hugo_Symbol,type, | ||
gnomad_AF_common, | ||
hotspot_database, | ||
Variant_Classification, | ||
dbSNP_RS, | ||
Reference_Allele, | ||
Tumor_Seq_Allele2, | ||
HGVSp_Short, | ||
Variant_Type, | ||
sort = TRUE) %>% | ||
dplyr::filter(n>2) | ||
|
||
``` | ||
|
||
## Plot VAF separating by present or absent in MSKCC database | ||
```{r} | ||
calls_recurrence %>% | ||
left_join(summarise_sites) %>% | ||
ggplot(aes(x=hotspot_database,y=VAF))+ geom_violin()+theme_bw() | ||
``` | ||
|
||
## Sites not part of MSKCC cancer hotspots/Cosmic | ||
```{r} | ||
# maftools needs a maf file to be read via read.maf | ||
# so we are using the exact sites that don't overlap | ||
# MSKCC cancer database nor are Cosmic census gene lists | ||
uniqhits <- calls_recurrence %>% | ||
left_join(combined_maf_subsets) %>% | ||
dplyr::filter(is.na(hotspot_database) ) | ||
|
||
uniqhitsGenes<- uniqhits %>% | ||
pull(Hugo_Symbol) %>% | ||
unique() | ||
|
||
# save as maf for maftools functions | ||
uniqhits %>% | ||
dplyr::select(-caller) %>% | ||
group_by(Chromosome,Start_Position,End_Position, | ||
Amino_Acid_Position,Hugo_Symbol, | ||
Tumor_Sample_Barcode, | ||
Variant_Type, | ||
Reference_Allele, | ||
Tumor_Seq_Allele2, | ||
HGVSp_Short, | ||
Variant_Classification) %>% | ||
# doing a mean so that caller specific VAF | ||
# calculations doesn't unique rows | ||
summarise(VAF = mean(VAF)) %>% | ||
write_tsv("results/uniqhits.maf") | ||
``` | ||
|
||
|
||
### Upset plots for the uniq hits | ||
kgaonkar6 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
SNP: Single nucleotide polymorphism -- a substitution in one nucleotide | ||
DNP: Double nucleotide polymorphism -- a substitution in two consecutive nucleotides | ||
TNP: Triple nucleotide polymorphism -- a substitution in three consecutive nucleotides | ||
ONP: Oligo-nucleotide polymorphism -- a substitution in more than three consecutive nucleotides | ||
|
||
|
||
```{r} | ||
getUpset(uniqhits,c("SNP","DNP","TNP","ONP")) | ||
``` | ||
Interestingly a lot of the uniq hits are supported by strelka only, followed by vardict. | ||
|
||
|
||
INS: Insertion -- the addition of nucleotides | ||
DEL: Deletion -- the removal of nucleotides | ||
```{r} | ||
getUpset(uniqhits,c("INS","DEL")) | ||
``` | ||
Vardict and Lancet again uniquely support a lot of novel indels. | ||
Comment on lines
+276
to
+278
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is |
||
|
||
## Summary plots | ||
```{r} | ||
uniqhitsmaf <- maftools::read.maf("results/uniqhits.maf") | ||
uniq.titv = titv(maf = uniqhitsmaf, plot = FALSE, useSyn = TRUE) | ||
#plot titv summary | ||
plotTiTv(res = uniq.titv) | ||
|
||
# Here the VAF column is unique to each Tumor_Sample_Barcode | ||
# which we didn't considere as a column to count recurrence on. | ||
# So the values here maybe difference compared to what we find | ||
# in calls_recurrence | ||
plotVaf(uniqhitsmaf,vafCol = "VAF") | ||
|
||
|
||
``` | ||
Except KMT5B, all calls above which are in genes not in Cosmic Census and don't overlap sites of cancer hotspot database (MSKCC) have low VAF. | ||
|
||
|
||
## Genes with multiple hits in recurrence | ||
```{r} | ||
multihitsGenes <- calls_recurrence %>% | ||
group_by(Hugo_Symbol) %>% | ||
tally() %>% | ||
filter(n>5) %>% | ||
pull(Hugo_Symbol) | ||
|
||
# maftools needs a maf file to be read via read.maf | ||
multihits <- combined_maf_subsets %>% | ||
# selecting sites in multiHits genes from filtered maf | ||
dplyr::filter(Hugo_Symbol %in% multihitsGenes) %>% | ||
# doing an inner join with recurrent sites | ||
# to visualize just the VAFs of these sites | ||
inner_join(calls_recurrence) | ||
|
||
multihits %>% | ||
dplyr::select(-caller) %>% | ||
group_by(Chromosome,Start_Position,End_Position, | ||
Amino_Acid_Position,Hugo_Symbol, | ||
Tumor_Sample_Barcode, | ||
Variant_Type, | ||
Reference_Allele, | ||
Tumor_Seq_Allele2, | ||
HGVSp_Short, | ||
Variant_Classification) %>% | ||
# doing a mean so that caller specific VAF | ||
# calculations doesn't unique rows | ||
summarise(VAF = mean(VAF)) %>% | ||
write_tsv("results/multihits.maf") | ||
|
||
``` | ||
|
||
### Upset plots for the uniq hits | ||
|
||
SNP: Single nucleotide polymorphism -- a substitution in one nucleotide | ||
DNP: Double nucleotide polymorphism -- a substitution in two consecutive nucleotides | ||
TNP: Triple nucleotide polymorphism -- a substitution in three consecutive nucleotides | ||
ONP: Oligo-nucleotide polymorphism -- a substitution in more than three consecutive nucleotides | ||
|
||
|
||
```{r} | ||
getUpset(multihits,c("SNP","DNP","TNP","ONP")) | ||
``` | ||
Vardict uniquely support a lot of the multihit sites but interestingly majority of the subtitution mutations are supported by all callers | ||
|
||
INS: Insertion -- the addition of nucleotides | ||
DEL: Deletion -- the removal of nucleotides | ||
```{r} | ||
getUpset(multihits,c("INS","DEL")) | ||
``` | ||
Vardict and Lancet uniquely support a lot of the multihit sites | ||
|
||
|
||
### Plot lollipops | ||
```{r} | ||
multihitsmaf <- maftools::read.maf("results/multihits.maf") | ||
pdf("results/multihits.pdf",width = 15,height = 8) | ||
lapply(multihitsGenes,function(x){ | ||
maftools::lollipopPlot(multihitsmaf,gene = x ,AACol = "HGVSp_Short")}) | ||
dev.off() | ||
``` | ||
```{r} | ||
multi.titv = titv(maf = multihitsmaf, plot = FALSE, useSyn = TRUE) | ||
#plot titv summary | ||
plotTiTv(res = multi.titv) | ||
|
||
# Here the VAF column is unique to each Tumor_Sample_Barcode | ||
# which we didn't considere as a column to count recurrence on. | ||
kgaonkar6 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# So the values here maybe difference compared to what we find | ||
# in calls_recurrence | ||
plotVaf(multihitsmaf,vafCol = "VAF") | ||
``` | ||
|
||
|
||
## Save | ||
```{r} | ||
calls_recurrence %>% write_tsv(file.path(results_dir,"snv_recurrence.tsv")) | ||
``` | ||
|
||
|
||
|
||
```{r} | ||
DBI::dbDisconnect(con) | ||
``` | ||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgaonkar6 now that #946 is in, can we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to clarify, are you suggesting we don't save the RDS file or the whole code chunk? Only the db_file is created in #946 we need to combine the callers and format it with prepMaf() for our purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, gotcha (I should have looked at the database PR - was assuming this was a part of that) - then I think this prep work should be moved into its own PR to make this PR more manageable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! I'll divide the PR