Proposed Analysis: Run brain TCGA SNV data through PBTA SNV caller pipeline #257

cansavvy · 2019-11-11T16:29:39Z

What are the scientific goals of the analysis?

Related to #3 Tumor Mutation Burden Analysis.

To make the pediatric brain tumor SNV comparison to TCGA adult data more interpretable, we should run those data through the same mutation callers and parameters as were used for the PBTA data.

What methods do you plan to use to accomplish the scientific goals?

This would include running the brain-related TCGA data through Strelka2, Lancet, and Mutect2, each with whatever parameters are used for the PBTA data. Upon receiving the output from these callers for the TCGA data, I would make an SNV consensus file for the TCGA data in the same way as I did for the PBTA data.

I think VarDict can be excluded from this analysis, since it has been oversensitive in its calls with the PBTA data.

What input data are required for this analysis?

These brain-related TCGA data:
TCGA-LGG
TCGA-GBM
TCGA-PCPG

These data can be obtained from https://portal.gdc.cancer.gov/repository but require controlled access.

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

This is something that I am unsure of, but I expect it would be about the same amount of time it took to run the PBTA dataset through the mutation callers.

Who will complete the analysis (please add a GitHub handle here if relevant)?

@jharenza who do you suggest?

What relevant scientific literature relates to this analysis?

GBM : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3910500/
LGG : https://www.nejm.org/doi/full/10.1056/NEJMoa1402121
PCPG: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5643159/pdf/nihms849815.pdf

jharenza · 2019-11-22T22:13:28Z

@yuankunzhu can you share here the figures you recently generated comparing GDC pipelines with our pipelines? We did a pilot comparison between TMB from their pipeline and ours and do not see much of a difference in TMB, so in an effort to save money and time on computation, we likely will not be performing this analysis.

cgreene · 2019-11-23T14:03:24Z

@jharenza : the processing has changed quite a bit since the initial dataset (adding new callers, etc). It might be wise to process at least some random subset of the TCGA tumors with at least some subset of callers and then compare over regions that have adequate coverage for that caller with our samples.

As a peer reviewer, I'd probably see this as a fatal flaw in a TMB comparison if something like this wasn't done. It wouldn't surprise me if our reviewers would raise the same issue.

I don't think this analysis needs to be comprehensive across the TCGA samples if the subset is randomly selected (stratified by tumor type).

Perhaps this is what @yuankunzhu has already done.

jharenza · 2019-11-23T16:41:14Z

@cgreene yes, this is what was done on a small subset and what I am asking @yuankunzhu to share. The results are very comparable.

jharenza · 2020-01-08T20:44:48Z

Hi @cansavvy and @cgreene! For the TCGA MAFs used in #3, which variant caller was used when calculating TMB? @tkoganti and I had an idea since this came up for another project, and our results look as expected. We calculated TMB using Mutect2 MAFs from TCGA and Mutect2 MAFs from PBTA and found that the TMBs for pediatric tumors are definitely lower than the adult tumors, as expected. However, we were not able to find Mutect2 MAFs for the brain tumors as listed above, only the other 3 variant caller MAFs. So, we can try two things:

1. Obtain the BAM files for these samples and run our Mutect2 pipeline and assess TMB using Mutect2 for all callers and/or, as you suggested (but more time consuming):
2. Obtain BAM files for these samples and run all 4 pipelines to assess TMB from consensus calls.

When we did the simple comparison, we used the same BED region file, selected only non-synonymous and nonsense variants from those regions, and divided by the BED size in Mb to get TMB, so it was an apples-to-apples comparison.

I suggest we start with 1 for now, so we can try to get that done quickly for the first paper submission and followup longer term with 2 - does that sound like a plan?

tkoganti · 2020-01-08T20:52:01Z

Hello, please see the plots we have from our analysis -

This is from TCGA mutect MAF files -

This is from PBTA mutect2 MAF files -

cansavvy · 2020-01-08T21:08:59Z

> For the TCGA MAFs used in #3, which variant caller was used when calculating TMB?

For the TCGA TMB calculations, I used the MAF data from MC3, which uses Mutect1 and some other callers to make its variant calls. It definitely would be better to have the same variant caller data for comparing TMB.

> Obtain the BAM files for these samples and run our Mutect2 pipeline and assess TMB using Mutect2 for all callers and/or, as you suggested (but more time consuming):

To save time, you can skip running VarDict (which I believe has the longest runtime). We didn't end up using VarDict for the TMB calculations. See methods summary here.

> Obtain BAM files for these samples and run all 4 pipelines to assess TMB from consensus calls.

Are you suggesting we compare all our pediatric brain tumor samples to all adult non-CNS tissues? I'm not sure this would help the interpretability/comparability problem, since we wouldn't be able to tell whether any differences are due to CNS vs non-CNS tissues or to pediatric vs adult tissues. But then again, even the CNS diseases are very different between adults and children to begin with, so maybe this comparison is just rough regardless.

> When we simply did the comparison, we used the same BED region file, selected only non-synonymous and nonsense from those regions and divided by the BED Mb for TMB, so there was an apples to apples comparison.

What do you mean by using the same BED regions file? At what stage(s)? And, would you be able to share this code?

tkoganti · 2020-01-08T21:37:48Z

Hi @cansavvy

Not using VarDict helps if we go with method 2 shown above.
I don't think @jharenza was suggesting using non-CNS tissues. We did this for a different project and just used what we had available there. This is an example.
Please see @kgaonkar6's comment here (Planned Analysis: Tumor Mutation Burden #3 (comment)) about the BED file. We took a similar approach, but since what we did was for a clinical trial, we used the BED for the WXS captured region and only considered variants within this region, for consistency across all cohorts we used.
Here is the BED file - https://cavatica.sbgenomics.com/u/kfdrc-harmonization/sd-8y99qzjj/files/5dfbe088e4b09d9aaf41d45a/
Here is code that generates the .tsv file for the figures I sent above (It takes all samples in a cohort, and implements this formula - (# of missense + # of nonsense)*1000000/BED_length)
https://github.com/d3b-center/scripts-/blob/master/TMB_calculation_from_MAFfiles
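For readers comparing notes, here is a minimal Python sketch of the formula quoted above, (# of missense + # of nonsense) * 1000000 / BED_length, restricted to variants inside the BED. It is not the linked script. The function names and the in-memory data layout are assumptions made for illustration; the column names and `Variant_Classification` values follow the standard MAF vocabulary.

```python
# Variant_Classification values counted toward TMB (standard MAF vocabulary).
COUNTED = {"Missense_Mutation", "Nonsense_Mutation"}


def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def bed_length(bed):
    """Total bases covered by a BED given as {chrom: [(start, end), ...]}."""
    return sum(e - s for ivs in bed.values() for s, e in merge_intervals(ivs))


def in_bed(bed, chrom, pos):
    """True if 1-based position `pos` lies in a 0-based half-open BED interval."""
    return any(s < pos <= e for s, e in bed.get(chrom, []))


def tmb_per_sample(maf_rows, bed):
    """Return {sample: mutations per Mb} for counted variants inside the BED.

    `maf_rows` is an iterable of dicts keyed by MAF column name
    (hypothetical layout, e.g. rows from csv.DictReader).
    """
    size = bed_length(bed)
    counts = {}
    for row in maf_rows:
        sample = row["Tumor_Sample_Barcode"]
        counts.setdefault(sample, 0)  # keep samples with zero counted variants
        if row["Variant_Classification"] in COUNTED and in_bed(
            bed, row["Chromosome"], int(row["Start_Position"])
        ):
            counts[sample] += 1
    return {sample: n * 1_000_000 / size for sample, n in counts.items()}
```

Overlapping BED intervals are merged before summing so the denominator is not inflated. A sample with no rows at all in the MAF will not appear in the result, so such samples would need to be added from a separate sample list.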

cansavvy · 2020-01-08T21:58:10Z

> I don't think @jharenza was suggesting using non-CNS tissues. We did this for a different project and just used what we had available there. This is an example

I'm a bit confused about what these TCGA samples are, then. She mentioned you were not able to find Mutect2 MAFs for the brain tumors listed above, but I see your plot includes something called GBM, which was in the list I put above.

> Please see @kgaonkar6 comment here (#3 (comment)) about BED file. We took a similar approach but since what we did was for a clinical trial, we used the BED for WXS captured region and only considered variants within this region for consistency across all cohorts we used

For coding-only TMB variants (which is what we used for the TCGA comparison), we used a similar tactic to what you describe here, except we used the CDS annotation from the gtf.gencode file included in the data release.

> Here is code that generates the .tsv file for the figures I sent above (It takes all samples in a cohort, and implements this formula - (# of missense + # of nonsense)*1000000/BED_length)
> https://github.com/d3b-center/scripts-/blob/master/TMB_calculation_from_MAFfiles

Thanks for sending this! Just want to compare notes on TMB calculations!

jharenza · 2020-01-08T22:28:32Z

@cansavvy - the MAFs may very well be Mutect1 - @tkoganti, do you know? They may have only been labeled Mutect and we made an assumption, but we will definitely seek access and reprocess!

tkoganti · 2020-01-09T15:10:15Z

All the TCGA MAFs are Mutect. I had them labeled wrong yesterday. I corrected the labels in the comment above with the figure.

@cansavvy and @cgreene, we were wondering if it is possible to get a manifest file for the TCGA BAM files with the disease types you would like to use, so we can run consensus calling on those. I see three disease types that we would like to use (GBM, PCPG and LGG), but are there subtypes within those? Also, how many samples should we run under each disease type?

cgreene · 2020-01-09T15:52:36Z

My understanding is that the goal of this analysis is to compare TCGA mutation burden with PBTA mutation burden. The ideal world would be to:

1. Identify the intersection of regions that were measured by the various kits between TCGA/PBTA.
2. Apply the same callers to identify variants in the intersecting regions for as many cancers as possible for brain tumors (could do other tumor types too, but I think the brain tumors are the most important).
3. Calculate TMB using only the TCGA/PBTA intersect sets.
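The first step above (finding the regions shared by the TCGA and PBTA capture kits) can be sketched in Python as a two-pointer sweep over sorted intervals. This is only an illustration under stated assumptions: each BED is held in memory as `{chrom: [(start, end), ...]}` with half-open intervals that do not overlap within a single BED, and the function names are hypothetical. In practice a tool such as bedtools would typically do this job.

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_beds(bed_a, bed_b):
    """Intersect two BEDs given as {chrom: [(start, end), ...]}."""
    shared = {}
    for chrom in bed_a.keys() & bed_b.keys():
        hits = intersect_intervals(sorted(bed_a[chrom]), sorted(bed_b[chrom]))
        if hits:
            shared[chrom] = hits
    return shared
```

The resulting shared regions would then serve both as the filter for variants and as the denominator when computing TMB, so that both cohorts are measured over the same territory.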

jharenza · 2020-01-09T17:25:24Z

Hi @cgreene - that is the plan. There are 3 broad histologies listed (TCGA-LGG, TCGA-GBM, TCGA-PCPG), but it looks like the files @cansavvy was able to use in her first comparison had subtypes within those, whereas the BAMs are labeled broadly. @cansavvy, can you send the link to the manifest from which you obtained these narrower histologies, so we can try to select N BAMs per histology? I could not find it readily.

@cgreene - how many samples per group do you think would be sufficient - 10? more? Trying to minimize costs and time for processing by rationally selecting the cohort to analyze.

Thanks!

cgreene · 2020-01-09T17:26:04Z

Since I don't know what effect size we'd be looking for, I don't have a way to say how many would be sufficient.

jharenza · 2020-01-09T17:51:02Z

@cansavvy just found where you got the clinical data here.

Will keep you posted about this.

jharenza · 2020-01-10T16:00:30Z

@cgreene we are starting with primary tumors only, and running 20 random samples per group from the 8 brain tumor histologies with >20 samples.
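A selection like the one described (primary tumors only, 20 random samples from each histology with more than 20 samples) could be sketched in Python as below. This is an illustration, not the code that was actually run: the tuple layout, the `"Primary Tumor"` label, the seed, and the function name are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(samples, per_group=20, min_size=20, seed=2020):
    """Pick `per_group` random primary tumors from each sufficiently large histology.

    `samples` is an iterable of (sample_id, histology, tumor_type) tuples
    (hypothetical layout; a real manifest would need parsing first).
    """
    groups = defaultdict(list)
    for sample_id, histology, tumor_type in samples:
        if tumor_type == "Primary Tumor":  # assumed label for primary tumors
            groups[histology].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        histology: sorted(rng.sample(sorted(ids), per_group))
        for histology, ids in sorted(groups.items())
        if len(ids) > min_size  # strictly more than `min_size` samples
    }
```

Sorting the IDs before sampling and fixing the seed means the same manifest always yields the same subset, which makes the selection easy to record alongside the results.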

jaclyn-taroni · 2020-01-18T12:54:23Z

I believe what is now left for the TCGA data included in v13 (#444) is

> I would make SNV consensus file for TCGA data in the same way as I did for the PBTA data.

from @cansavvy

cansavvy · 2020-03-09T18:46:50Z

I think this issue is covered except for 1) more samples that will be added to the TCGA cohorts in v16 (#601) and 2) the BED file issues noted in #568, which I think will also be updated in v16.

cansavvy added the proposed analysis label Nov 11, 2019

cansavvy changed the title ~~Proposed Analysis: Run Brain TCGA SNV data through PBTA SNV Caller pipeline~~ Proposed Analysis: Run brain TCGA SNV data through PBTA SNV caller pipeline Nov 11, 2019

This was referenced Dec 15, 2019

Planned Analysis: Tumor Mutation Burden #3

Closed

Documentation: tmb-compare-tcga README #338

Closed

jharenza added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Jan 10, 2020

This was referenced Jan 13, 2020

Planned data release: V14 #432

Closed

Planned data release: V13 #373

Closed

jaclyn-taroni assigned cansavvy Jan 18, 2020

jaclyn-taroni added the updated analysis label Jan 18, 2020

cansavvy mentioned this issue Jan 28, 2020

Updated analysis: Use TCGA SNV consensus calls to get mutational signatures #481

Closed

cansavvy mentioned this issue Feb 6, 2020

TCGA Consensus Run #521

Closed

5 tasks

kgaonkar6 mentioned this issue Feb 18, 2020

Planned release: v15 #543

Closed

5 tasks

cansavvy mentioned this issue Feb 21, 2020

Proposed Analysis: PCAWG WGS Brain samples to run through SNV caller pipeline #551

Closed

jharenza mentioned this issue Feb 25, 2020

Updated analysis: PBTA vs TCGA TMB analysis #556

Closed

cansavvy mentioned this issue Feb 26, 2020

TCGA Consensus and Comparison Revised (1 of 2) #562

Merged

5 tasks

jaclyn-taroni closed this as completed Mar 9, 2020
