Proposed Analysis: add scavenging of cancer hotspots to consensus SNV calls #819

jharenza · 2020-10-20T14:01:48Z

What analysis are you proposing and why?

Create a new MAF that contains the consensus SNV calls plus any cancer hotspot calls missed by the consensus, as described below.

We previously noticed that by taking a 3/3 approach for consensus calls, we inevitably miss some cancer hotspot mutations. We got around that for one specific cancer (DMGs) because we have clinical reports containing histone variant calls that we can add into the molecular subtyping pathology module (#735 and #751). However, we are likely still missing some cancer hotspot mutations, so I propose that we add a final step in which we scavenge back cancer hotspot mutations using a well-curated, downloadable list.

What changes need to be made? Please provide enough detail for another participant to make the update.

The next step would be to assess whether any of these hotspot mutations are being missed by the 3/3 method, and then to determine a set of rules for adding these mutations back to the consensus SNV file. For example:

If the hotspot is present in 2/3, then retain
If the hotspot is present in 1/3, and has X reads supporting the tumor allele (5?), then retain
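
As a rough illustration of the example rules above, here is a minimal Python sketch. The function name `retain_hotspot`, its parameters, and the default threshold of 5 reads are placeholders for discussion (X is still undecided, and "has X reads" is read here as "at least X"); this is not a settled implementation.

```python
def retain_hotspot(n_callers: int, tumor_alt_reads: int, min_alt_reads: int = 5) -> bool:
    """Decide whether a hotspot call is retained under the example rules.

    n_callers: number of the three callers that reported the variant
        (hypothetical input; how it is counted is an assumption).
    tumor_alt_reads: number of reads supporting the tumor allele.
    min_alt_reads: placeholder for the undecided threshold X.
    """
    if n_callers >= 2:
        # Present in 2/3 (3/3 calls are already in the consensus): retain.
        return True
    if n_callers == 1:
        # Present in 1/3: retain only with enough tumor-allele support.
        return tumor_alt_reads >= min_alt_reads
    return False
```

For example, `retain_hotspot(1, 4)` would drop a single-caller hotspot with only four supporting reads, while `retain_hotspot(2, 0)` would keep a two-caller hotspot regardless of read support.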

Perhaps the new file can be called pbta-consensus-snvs-plus-hotspot.maf.gz

What input data should be used? Which data were used in the version being updated?

Cancer hotspots table, downloadable here: https://www.cancerhotspots.org/#/download plus TERT promoter mutations, noted from this paper.

In 2013, two hotspot point mutations were found in the TERT promoter in 71% of melanomas (32,33). The mutations were located 124bp and 146bp upstream of the translation start site and referred to as C228T and C250T, respectively, based on their hg19 genomic coordinates.

pbta-snv-consensus-mutation.maf.tsv.gz
pbta-snv-lancet.vep.maf.gz
pbta-snv-mutect2.vep.maf.gz
pbta-snv-strelka2.vep.maf.gz
pbta-snv-vardict.vep.maf.gz (maybe? TCGA was not run with VarDict, so perhaps we should add a separate step to assess what hotspots VarDict only detects)

pbta-tcga-snv-lancet.vep.maf.gz
pbta-tcga-snv-mutect2.vep.maf.gz
pbta-tcga-snv-strelka2.vep.maf.gz
tcga-snv-consensus-snv.maf.tsv.gz

When do you expect the analysis will be completed?

not sure

Who will complete the updated analysis?

~~@migbro~~ @kgaonkar6


jashapiro · 2020-10-20T14:10:44Z

I think this is a good idea, but I would keep it as a separate analysis from the general consensus. In other words, I would not "scavenge" back mutations into the consensus, but rather include an entirely separate analysis that evaluates known mutations. This would keep the standards clear and separate de novo analysis from analysis with outside influence.

jharenza · 2020-10-20T14:28:14Z

> I think this is a good idea, but I would keep it as a separate analysis from the general consensus. In other words, I would not "scavenge" back mutations into the consensus, but rather include an entirely separate analysis that evaluates known mutations. This would keep the standards clear and separate de novo analysis from analysis with outside influence.

Ok - yeah I went back and forth on that. Thanks!

jharenza · 2021-01-18T19:09:05Z

@kgaonkar6, after our internal discussion, I think we need to first determine our hotspot list:

Look into using both versions of the hotspot table linked above. There are 470 hotspots in V1 and 1110 in V2, and 221 of the V1 hotspots are not in V2. I was initially thinking we would want a union, but I don't really recognize the V1-only entries, so maybe they were removed because they were false positives. So, we may go with V2. I was thinking we could first assess how many of those V1-only hotspots are being missed in our dataset and whether it makes sense to keep them.
Add the TERT promoter mutations above.
Download the latest version of COSMIC mutations and determine whether we are missing any of these from V2 - these could also possibly be added.
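
The version comparison described in the first item could be sketched with plain set operations. This is only an illustration: keying each hotspot by a `(gene, amino-acid position)` tuple, the function names, and the toy entries are all assumptions, and the real tables would need to be parsed into that form first.

```python
def compare_hotspot_versions(v1: set, v2: set) -> dict:
    """Split two hotspot sets into V1-only, V2-only, shared, and union."""
    return {
        "v1_only": v1 - v2,
        "v2_only": v2 - v1,
        "shared": v1 & v2,
        "union": v1 | v2,
    }


def v1_only_seen_in_dataset(v1: set, v2: set, dataset_sites: set) -> set:
    """Return the V1-only hotspots that actually occur in our calls."""
    return (v1 - v2) & dataset_sites


# Toy entries only; these are not real hotspots.
v1 = {("GENE_A", 10), ("GENE_B", 20), ("GENE_C", 30)}
v2 = {("GENE_B", 20), ("GENE_C", 30), ("GENE_D", 40)}
dataset_sites = {("GENE_A", 10), ("GENE_D", 40)}

summary = compare_hotspot_versions(v1, v2)
seen = v1_only_seen_in_dataset(v1, v2, dataset_sites)
```

If the counts quoted above are computed on the same key, a union would hold 1110 + 221 = 1331 entries, and the size of `seen` on the real data would indicate whether the V1-only hotspots are worth keeping.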

If that makes sense, I think that can be the first PR for this series. Thanks!

jharenza · 2021-01-24T18:50:22Z

We are having a call on Thursday Jan 28 with David Wheeler (St Jude, formerly BCM), who has done this sort of thing while leading the BCM Genomics Lab. We might also want to add pediatric-specific genes such as those from Ma, 2018 and Grobner, 2018.

kgaonkar6 · 2021-03-01T20:19:07Z

I don't think there are annotations in the MAF format that would let us filter using the information in the paper describing the TERT promoter variants. Should I instead filter on the exact genomic sites to capture them?

I believe chr5 | 1295113 | 1295113, which is annotated with Existing_variation rs1242535815,COSM1716563,COSM1716558 and is 66 bp away from the TSS, is the site we are looking for, corresponding to C228T.

And chr5 | 1295135 | 1295135, which is 88 bp away from the TSS, is the COSM1716559 variant, which corresponds to the C250T promoter variant.

From my google searches :D
https://www.slideshare.net/ThermoFisher/taqman-dpcr-liquid-biopsy-assays-targeting-the-tert-promoter-region
https://assets.thermofisher.com/TFS-Assets/LSG/posters/taqman-dpcr-tert-promoter-poster.pdf

As a check, I looked at the Strelka2 calls for upstream variants, and we have both of these sites (along with others):

	Chromosome	Start_Position	End_Position	Reference_Allele	Tumor_Seq_Allele2	Hugo_Symbol	Variant_Classification	IMPACT	Tumor_Sample_Barcode	Protein_position	Existing_variation	DISTANCE
1	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_KAZENYZE	NA	rs1242535815,COSM1716563,COSM1716558	66
2	chr5	1299855	1299855	A	C	TERT	5'Flank	MODIFIER	BS_1RF75MK2	NA	NA	4808
3	chr5	1299236	1299237	-	C	TERT	5'Flank	MODIFIER	BS_S0T3CQ97	NA	NA	4189
4	chr5	1295088	1295088	A	C	TERT	5'Flank	MODIFIER	BS_8Q8CAY84	NA	NA	41
5	chr5	1299748	1299748	T	C	TERT	5'Flank	MODIFIER	BS_4QFSH7C4	NA	NA	4701
6	chr5	1295677	1295677	T	G	TERT	5'Flank	MODIFIER	BS_4ZKN0WGS	NA	NA	630
7	chr5	1295442	1295442	C	T	TERT	5'Flank	MODIFIER	BS_02YBZSBY	NA	NA	395
8	chr5	1295135	1295135	G	A	TERT	5'Flank	MODIFIER	BS_F8K4VQMF	NA	COSM1716559	88
9	chr5	1295997	1295997	A	C	TERT	5'Flank	MODIFIER	BS_WH8KWW5J	NA	NA	950
10	chr5	1298925	1298925	A	C	TERT	5'Flank	MODIFIER	BS_VW4XN9Y7	NA	NA	3878
11	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_1S2BHJ8K	NA	rs1242535815,COSM1716563,COSM1716558	66
12	chr5	1295146	1295146	A	C	TERT	5'Flank	MODIFIER	BS_BM95DGCQ	NA	NA	99
13	chr5	1295407	1295407	A	C	TERT	5'Flank	MODIFIER	BS_9ZFXXJPK	NA	NA	360
14	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_BFDEZK1C	NA	rs1242535815,COSM1716563,COSM1716558	66
15	chr5	1297846	1297846	G	T	TERT	5'Flank	MODIFIER	BS_VF099E8S	NA	NA	2799
16	chr5	1296053	1296053	C	A	TERT	5'Flank	MODIFIER	BS_0FQKT8EY	NA	NA	1006
17	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_JSNJZERZ	NA	rs1242535815,COSM1716563,COSM1716558	66
18	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_T7WMJ08W	NA	rs1242535815,COSM1716563,COSM1716558	66
19	chr5	1295136	1295136	A	C	TERT	5'Flank	MODIFIER	BS_QX754ADQ	NA	NA	89
20	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_MJJZJMTK	NA	rs1242535815,COSM1716563,COSM1716558	66
21	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_SK4H5MJQ	NA	rs1242535815,COSM1716563,COSM1716558	66
22	chr5	1295112	1295112	A	C	TERT	5'Flank	MODIFIER	BS_K3PPH522	NA	NA	65
23	chr5	1295113	1295113	G	A	TERT	5'Flank	MODIFIER	BS_KAD49R68	NA	rs1242535815,COSM1716563,COSM1716558	66
24	chr5	1298190	1298190	G	A	TERT	5'Flank	MODIFIER	BS_AF5D41PD	NA	rs929384767	3143

kgaonkar6 · 2021-03-02T17:20:00Z

We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations ( like silent mutations) in the given amino acid position in hotspot database, right?

jharenza · 2021-03-02T18:39:19Z

> We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations ( like silent mutations) in the given amino acid position in hotspot database, right?

Are you saying there are low impact mutations on the MSK list? I would assume they would not be low.

jharenza · 2021-03-02T18:43:58Z

> I believe chr5 | 1295113 | 1295113 which is also annotated as existing_variant rs1242535815,COSM1716563,COSM1716558 which is 66bp away from TSS is what we are looking for corresponding to C228T.
>
> and chr5 | 1295135 | 1295135 | is 88 bp away from TSS is the COSM1716559 variant which corresponds to C250T promoter variant.

This looks right to me. The nucleotides appear as G>A rather than C>T because TERT is on the reverse strand, so the MAF reports the complementary bases. So, I think we should use the genomic coordinates here plus the nucleotides.
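
A coordinate-plus-allele lookup along these lines might look like the following Python sketch. The positions and G>A alleles are the ones quoted in this thread; the dictionary name, the function name, and the use of plain `dict` rows keyed by MAF column names are assumptions made for illustration.

```python
# Sites and forward-strand alleles as quoted in the discussion above.
TERT_PROMOTER_SITES = {
    ("chr5", 1295113, "G", "A"): "C228T",
    ("chr5", 1295135, "G", "A"): "C250T",
}


def tert_promoter_label(row: dict):
    """Return 'C228T' or 'C250T' if the MAF row matches a known site, else None."""
    key = (
        row["Chromosome"],
        int(row["Start_Position"]),
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
    )
    return TERT_PROMOTER_SITES.get(key)


# Rows 1, 8, and 4 from the Strelka2 table above, reduced to the key columns.
rows = [
    {"Chromosome": "chr5", "Start_Position": "1295113",
     "Reference_Allele": "G", "Tumor_Seq_Allele2": "A"},
    {"Chromosome": "chr5", "Start_Position": "1295135",
     "Reference_Allele": "G", "Tumor_Seq_Allele2": "A"},
    {"Chromosome": "chr5", "Start_Position": "1295088",
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "C"},
]
labels = [tert_promoter_label(r) for r in rows]
```

Matching on both position and alleles means a different substitution at the same coordinate is not labelled as a TERT promoter hotspot.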

kgaonkar6 · 2021-03-02T19:34:23Z

> > We still want to filter by IMPACT == 'HIGH|MODERATE|MODIFIER' to remove any LOW impact mutations ( like silent mutations) in the given amino acid position in hotspot database, right?
>
> Are you saying there are low impact mutations on the MSK list? I would assume they would not be low.

There were a few instances where the hotspot amino acid site had a silent mutation in our data. For example, position 644 in SDHA is a hotspot in the MSKCC list, but if we have p.V644= in our dataset we should remove it, right? We would keep a call at that site only if it is actually a high-impact mutation, like p.V644M.
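
One way to express this filter is sketched below in Python. It reads `IMPACT == 'HIGH|MODERATE|MODIFIER'` as shorthand for a membership test; the function name, the `(gene, position)` key, and the plain-string `Protein_position` values are assumptions, and the rows are toy examples rather than real calls.

```python
KEEP_IMPACT = {"HIGH", "MODERATE", "MODIFIER"}


def keep_hotspot_call(row: dict, hotspot_positions: set) -> bool:
    """Keep a call only if it sits on a hotspot position and is not LOW impact."""
    site = (row["Hugo_Symbol"], row["Protein_position"])
    return site in hotspot_positions and row["IMPACT"] in KEEP_IMPACT


# Toy hotspot list with the SDHA example from the comment above.
hotspot_positions = {("SDHA", "644")}

calls = [
    # Silent change at the hotspot position: should be dropped.
    {"Hugo_Symbol": "SDHA", "Protein_position": "644",
     "HGVSp_Short": "p.V644=", "IMPACT": "LOW"},
    # Missense change at the hotspot position: should be kept.
    {"Hugo_Symbol": "SDHA", "Protein_position": "644",
     "HGVSp_Short": "p.V644M", "IMPACT": "MODERATE"},
]
kept = [c["HGVSp_Short"] for c in calls if keep_hotspot_call(c, hotspot_positions)]
```

With these toy rows only `p.V644M` survives, which is the behaviour described in the comment.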


jharenza · 2021-05-13T12:46:25Z

Closed with #819

jharenza added the updated analysis label Oct 20, 2020

jashapiro added proposed analysis snv Related to or requires SNV data and removed updated analysis labels Oct 20, 2020

jharenza mentioned this issue Nov 25, 2020

update molecular subtyping pathology #854

Merged

5 tasks

jharenza assigned jharenza and kgaonkar6 Jan 15, 2021

jharenza mentioned this issue Jan 22, 2021

Question: Do we expect many participant IDs to be associated with more than one harmonized_diagnosis? #913

Open

jharenza mentioned this issue Feb 3, 2021

Proposed Analysis: Generate hotspot and hot region lists #932

Closed

kgaonkar6 mentioned this issue Mar 1, 2021

Part 1 #819 Combine snv per caller and filter to scavenge hotspots #947

Closed

5 tasks

This was referenced Mar 11, 2021

Snv database re-run to include all maf columns #954

Closed

#819 Part 1 V2 : Per caller filter hotspots #956

Merged

jharenza mentioned this issue Mar 16, 2021

Updated analysis: Rerun all molecular subtyping modules with consensus+hotspot SNV MAF #959

Closed

kgaonkar6 mentioned this issue Mar 18, 2021

#819 Part2 : Scavenge back hotspots to add to consensus calls #961

Merged

5 tasks

jharenza mentioned this issue Mar 26, 2021

Planned release: V19 #867

Closed

21 tasks

jharenza mentioned this issue Apr 13, 2021

Updated analysis: interaction plots - assess oncogene/TSG co-occurrence/mutual exclusivity #1001

Closed

jharenza added the blocking release label Apr 13, 2021

jashapiro added a commit that referenced this issue May 10, 2021

#819 Part1 update: Script to add maf col types (#1050)

9d036f9

* update col types * script to add col types Co-authored-by: jashapiro <josh.shapiro@ccdatalab.org>

jharenza mentioned this issue May 10, 2021

Update oncoprints to use histology-specific goi lists #1046

Merged

5 tasks

jharenza pushed a commit that referenced this issue May 13, 2021

#819 Part2 : Scavenge back hotspots to add to consensus calls (#961)

e760c46

jharenza closed this as completed May 13, 2021

kgaonkar6 mentioned this issue May 24, 2021

#959 HGG update hotspots #1077

Merged

5 tasks
