This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Run brain TCGA SNV data through PBTA SNV caller pipeline #257

Closed
cansavvy opened this issue Nov 11, 2019 · 17 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis updated analysis

Comments

@cansavvy
Collaborator

What are the scientific goals of the analysis?

Related to #3 Tumor Mutation Burden Analysis.

To make the pediatric brain tumor SNV comparison to TCGA adult data more interpretable, we should run those data through the same mutation callers and parameters as were used for the PBTA data.

What methods do you plan to use to accomplish the scientific goals?

This would include running the brain-related TCGA data through Strelka2, Lancet, and Mutect2, each with whatever parameters are used for PBTA data. Upon receiving the output from these callers for the TCGA data, I would make an SNV consensus file for TCGA data in the same way as I did for the PBTA data.

I think VarDict can be excluded from this analysis, since it has been oversensitive in its calls with the PBTA data.
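
As a rough illustration of what a consensus step could look like (not necessarily how the PBTA consensus file is built), the sketch below keeps SNVs reported by every caller. The function name `consensus_snvs`, the tuple layout, and the all-callers rule are assumptions made for this example.

```python
from collections import Counter


def consensus_snvs(calls_by_caller, min_callers=None):
    """Return variants reported by at least `min_callers` callers.

    calls_by_caller maps a caller name to a set of
    (chrom, pos, ref, alt) tuples. By default (min_callers=None)
    a variant must be reported by every caller.
    """
    if min_callers is None:
        min_callers = len(calls_by_caller)
    counts = Counter()
    for variants in calls_by_caller.values():
        # set() guards against a caller listing the same variant twice
        counts.update(set(variants))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy calls, invented for illustration only
calls = {
    "strelka2": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "lancet": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "mutect2": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}
consensus = consensus_snvs(calls)
```

Lowering `min_callers` to 2 would give a majority-vote consensus instead of a strict intersection.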

What input data are required for this analysis?

These brain-related TCGA data:
TCGA-LGG
TCGA-GBM
TCGA-PCPG

These data can be obtained from https://portal.gdc.cancer.gov/repository but require controlled access.

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

This is something that I am unsure of, but I expect it would be about the same amount of time it took to run the PBTA dataset through the mutation callers.

Who will complete the analysis (please add a GitHub handle here if relevant)?

@jharenza who do you suggest?

What relevant scientific literature relates to this analysis?

GBM : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3910500/
LGG : https://www.nejm.org/doi/full/10.1056/NEJMoa1402121
PCPG: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5643159/pdf/nihms849815.pdf

@cansavvy cansavvy changed the title Proposed Analysis: Run Brain TCGA SNV data through PBTA SNV Caller pipeline Proposed Analysis: Run brain TCGA SNV data through PBTA SNV caller pipeline Nov 11, 2019
@jharenza
Collaborator

@yuankunzhu can you share here the figures you recently generated comparing GDC pipelines with our pipelines? We did a pilot comparison between TMB from their pipeline and ours and do not see much of a difference in TMB, so in an effort to save money and time on computation, we likely will not be performing this analysis.

@cgreene
Collaborator

cgreene commented Nov 23, 2019

@jharenza : the processing has changed quite a bit since the initial dataset (adding new callers, etc). It might be wise to process at least some random subset of the TCGA tumors with at least some subset of callers and then compare over regions that have adequate coverage for that caller with our samples.

As a peer reviewer, I'd probably see this as a fatal flaw in a TMB comparison if something like this wasn't done. It wouldn't surprise me if our reviewers would raise the same issue.

I don't think this analysis needs to be comprehensive across the TCGA samples if the subset is randomly selected (stratified by tumor type).

Perhaps this is what @yuankunzhu has already done.

@jharenza
Collaborator

@cgreene yes, this is what was done on a small subset and what I am asking @yuankunzhu to share. The results are very comparable.

@jharenza
Collaborator

jharenza commented Jan 8, 2020

Hi @cansavvy and @cgreene! For the TCGA MAFs used in #3, which variant caller was used when calculating TMB? @tkoganti and I had an idea since this came up for another project, and our results look as expected. We calculated TMB using Mutect2 MAFs from TCGA and Mutect2 MAFs from PBTA and found that the TMBs for pediatric tumors are definitely lower than the adult tumors, as expected. However, we were not able to find Mutect2 MAFs for the brain tumors as listed above, only the other 3 variant caller MAFs. So, we can try two things:

  1. Obtain the BAM files for these samples and run our Mutect2 pipeline and assess TMB using Mutect2 for all callers and/or, as you suggested (but more time consuming):
  2. Obtain BAM files for these samples and run all 4 pipelines to assess TMB from consensus calls.

When we simply did the comparison, we used the same BED region file, selected only non-synonymous and nonsense from those regions and divided by the BED Mb for TMB, so there was an apples to apples comparison.
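
A minimal sketch of that selection step, assuming MAF records are already loaded as dictionaries with the standard `Chromosome`, `Start_Position`, and `Variant_Classification` columns and that BED regions are `(chrom, start, end)` tuples. Treating "non-synonymous and nonsense" as `Missense_Mutation` plus `Nonsense_Mutation` is an assumption, as are the helper names.

```python
# Assumed mapping of "non-synonymous and nonsense" to MAF classifications
CODING_CLASSES = {"Missense_Mutation", "Nonsense_Mutation"}


def in_regions(chrom, pos, regions):
    """True if 1-based `pos` lies in any 0-based, half-open BED interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def select_variants(maf_rows, regions, classes=CODING_CLASSES):
    """Keep rows with a wanted classification that fall inside the BED regions."""
    return [
        row
        for row in maf_rows
        if row["Variant_Classification"] in classes
        and in_regions(row["Chromosome"], row["Start_Position"], regions)
    ]


# Toy data, invented for illustration only
regions = [("chr1", 0, 1000)]
maf_rows = [
    {"Chromosome": "chr1", "Start_Position": 500, "Variant_Classification": "Missense_Mutation"},
    {"Chromosome": "chr1", "Start_Position": 600, "Variant_Classification": "Silent"},
    {"Chromosome": "chr1", "Start_Position": 1500, "Variant_Classification": "Nonsense_Mutation"},
]
kept = select_variants(maf_rows, regions)
```

Only the first toy row survives: the second is silent and the third falls outside the region.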

I suggest we start with 1 for now, so we can try to get that done quickly for the first paper submission and followup longer term with 2 - does that sound like a plan?

@tkoganti
Collaborator

tkoganti commented Jan 8, 2020

Hello, please see the plots we have from our analysis -

This is from TCGA mutect MAF files -
image

This is from PBTA mutect2 MAF files -
image

@cansavvy
Collaborator Author

cansavvy commented Jan 8, 2020

For the TCGA MAFs used in #3, which variant caller was used when calculating TMB?

For the TCGA TMB calculations, I used the MAF data from MC3, which uses Mutect1 and some other callers to make its variant calls. It definitely would be better to have the same variant caller data for comparing TMB.

  1. Obtain the BAM files for these samples and run our Mutect2 pipeline and assess TMB using Mutect2 for all callers and/or, as you suggested (but more time consuming):

To save time, you can skip running VarDict (which I believe has the longest runtime). We didn't end up using VarDict for the TMB calculations. See methods summary here.

  2. Obtain BAM files for these samples and run all 4 pipelines to assess TMB from consensus calls.

Are you suggesting we compare all of our pediatric brain tumor samples to all adult non-CNS tissues? I'm not sure this would help the interpretability/comparability problem, since we wouldn't be able to know whether any differences are due to CNS vs non-CNS tissues or to pediatric vs adult tissues. But then again, even the CNS diseases in adults and children are very different to begin with, so maybe this comparison is just rough regardless.

When we simply did the comparison, we used the same BED region file, selected only non-synonymous and nonsense from those regions and divided by the BED Mb for TMB, so there was an apples to apples comparison.

What do you mean by using the same BED regions file? At what stage(s)? And, would you be able to share this code?

@tkoganti
Collaborator

tkoganti commented Jan 8, 2020

Hi @cansavvy

  1. Not using VarDict helps if we go with method two shown above
  2. I don't think @jharenza was suggesting using non-CNS tissues. We did this for a different project and just used what we had available there. This is an example
  3. Please see @kgaonkar6 comment here (Planned Analysis: Tumor Mutation Burden #3 (comment)) about BED file. We took a similar approach but since what we did was for a clinical trial, we used the BED for WXS captured region and only considered variants within this region for consistency across all cohorts we used
    Here is the BED file - https://cavatica.sbgenomics.com/u/kfdrc-harmonization/sd-8y99qzjj/files/5dfbe088e4b09d9aaf41d45a/
    Here is code that generates the .tsv file for the figures I sent above (It takes all samples in a cohort, and implements this formula - (# of missense + # of nonsense)*1000000/BED_length)
    https://github.com/d3b-center/scripts-/blob/master/TMB_calculation_from_MAFfiles
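
The formula above can be written as a small function. This is a sketch and not the linked script; `bed_length` and `tmb_per_mb` are hypothetical names, and the BED intervals are assumed to be non-overlapping.

```python
def bed_length(regions):
    """Total bases covered by (chrom, start, end) BED intervals.

    Assumes the intervals do not overlap; overlapping intervals
    would be counted twice.
    """
    return sum(end - start for _, start, end in regions)


def tmb_per_mb(n_missense, n_nonsense, bed_len_bp):
    """(# of missense + # of nonsense) * 1000000 / BED_length."""
    if bed_len_bp <= 0:
        raise ValueError("BED length must be positive")
    return (n_missense + n_nonsense) * 1_000_000 / bed_len_bp


# Toy numbers, invented for illustration only
regions = [("chr1", 0, 1_500_000), ("chr2", 0, 500_000)]
tmb = tmb_per_mb(n_missense=30, n_nonsense=10, bed_len_bp=bed_length(regions))
```

With these toy inputs, 40 variants over a 2 Mb BED give a TMB of 20 mutations per Mb.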

@cansavvy
Collaborator Author

cansavvy commented Jan 8, 2020

  2. I don't think @jharenza was suggesting using non-CNS tissues. We did this for a different project and just used what we had available there. This is an example

I'm a bit confused about what these TCGA samples are, then, since she mentioned you were not able to find Mutect2 MAFs for the brain tumors listed above, but I see your plot includes something called GBM, which was in the list I put above.

  3. Please see @kgaonkar6 comment here (#3 (comment)) about BED file. We took a similar approach but since what we did was for a clinical trial, we used the BED for WXS captured region and only considered variants within this region for consistency across all cohorts we used

For coding-only TMB variants (which is what we used for the TCGA comparison), we used a similar tactic to what you describe here, except we used the CDS annotation from the gtf.gencode file included in the data release.

Here is code that generates the .tsv file for the figures I sent above (It takes all samples in a cohort, and implements this formula - (# of missense + # of nonsense)*1000000/BED_length)
https://github.com/d3b-center/scripts-/blob/master/TMB_calculation_from_MAFfiles

Thanks for sending this! Just want to compare notes on TMB calculations!

@jharenza
Collaborator

jharenza commented Jan 8, 2020

@cansavvy - the MAFs may very well be Mutect1 - @tkoganti, do you know? They may have only been labeled Mutect and we made an assumption, but we will definitely seek access and reprocess!

@tkoganti
Collaborator

tkoganti commented Jan 9, 2020

All the TCGA are mutect. I had them labelled wrong yesterday. I corrected in the comment above with the figure.

@cansavvy and @cgreene, would it be possible to get a manifest file for the TCGA BAM files with the disease types you would like to use, so we can run consensus calling on those? I see three disease types that we would like to use (GBM, PCPG, and LGG), but are there subtypes within those? Also, how many samples should we run under each disease type?

@cgreene
Collaborator

cgreene commented Jan 9, 2020

My understanding is that the goal of this analysis is to compare TCGA mutation burden with PBTA mutation burden. The ideal world would be to:

  1. Identify the intersection of regions that were measured by the various kits between TCGA/PBTA.
  2. Apply the same callers to identify variants in the intersecting regions for as many cancers as possible for brain tumors (could do other tumor types too, but I think the brain tumors are the most important).
  3. Calculate TMB using only the TCGA/PBTA intersect sets.
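
Step 1 could be sketched as an interval intersection, assuming each capture kit's BED file has been read into sorted, non-overlapping `(start, end)` intervals per chromosome. The function names and the toy coordinates are made up for illustration; in practice a tool such as bedtools would likely handle this.

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_beds(bed_a, bed_b):
    """Intersect two {chrom: [(start, end), ...]} dictionaries."""
    shared_chroms = sorted(set(bed_a) & set(bed_b))
    result = {c: intersect_intervals(bed_a[c], bed_b[c]) for c in shared_chroms}
    return {c: ivs for c, ivs in result.items() if ivs}


# Toy capture regions, invented for illustration only
tcga_kit = {"chr1": [(0, 100), (200, 300)], "chr2": [(0, 50)]}
pbta_kit = {"chr1": [(50, 250)], "chr3": [(0, 10)]}
shared = intersect_beds(tcga_kit, pbta_kit)
shared_bp = sum(end - start for ivs in shared.values() for start, end in ivs)
```

The length of the shared regions (`shared_bp`) would then serve as the denominator for the TMB calculation in step 3.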

@jharenza
Collaborator

jharenza commented Jan 9, 2020

Hi @cgreene - that is the plan. There are 3 broad histologies listed (TCGA-LGG, TCGA-GBM, TCGA-PCPG), but it looks like the files @cansavvy was able to use in her first comparison had subtypes within those, whereas the BAMs are labeled broadly. @cansavvy, can you send the link to the manifest from which you obtained these narrower histologies, so we can try to select N BAMs per histology? I could not find it readily.

@cgreene - how many samples per group do you think would be sufficient - 10? more? Trying to minimize costs and time for processing by rationally selecting the cohort to analyze.

Thanks!

@cgreene
Collaborator

cgreene commented Jan 9, 2020

Since I don't know what effect size we'd be looking for, I don't have a way to say how many would be sufficient.

@jharenza
Collaborator

jharenza commented Jan 9, 2020

@cansavvy just found where you got the clinical data here.

Will keep you posted about this.

@jharenza jharenza added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Jan 10, 2020
@jharenza
Collaborator

@cgreene we are starting with primary tumors only, and running 20 random per group from the 8 brain tumor histologies with >20 samples.
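
A sketch of that selection, assuming the clinical data have been reduced to `(sample ID, histology)` pairs for primary tumors. The function name, the fixed seed, and the toy IDs are hypothetical; only the rule (20 random samples per histology, restricted to histologies with more than 20 samples) comes from the comment above.

```python
import random
from collections import defaultdict


def stratified_sample(samples, n_per_group=20, seed=2020):
    """Pick n_per_group random sample IDs from each sufficiently large group.

    samples is an iterable of (sample_id, histology) pairs. Groups with
    n_per_group or fewer samples are skipped. The seed is fixed so the
    selection can be reproduced.
    """
    by_group = defaultdict(list)
    for sample_id, histology in samples:
        by_group[histology].append(sample_id)
    rng = random.Random(seed)
    return {
        histology: sorted(rng.sample(sorted(ids), n_per_group))
        for histology, ids in sorted(by_group.items())
        if len(ids) > n_per_group
    }


# Toy cohort, invented for illustration only
samples = [(f"A-{i:02d}", "histology_A") for i in range(25)]
samples += [(f"B-{i:02d}", "histology_B") for i in range(10)]
selected = stratified_sample(samples)
```

Here `histology_B` is dropped because it has only 10 samples, and 20 of the 25 `histology_A` samples are chosen.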

This was referenced Jan 13, 2020
@jaclyn-taroni
Member

I believe what is now left for the TCGA data included in v13 (#444) is

I would make SNV consensus file for TCGA data in the same way as I did for the PBTA data.

from @cansavvy

@cansavvy
Collaborator Author

cansavvy commented Mar 9, 2020

I think this issue is covered except for 1) more samples that will be added to the TCGA cohorts in v16 (#601) and 2) the BED file issues noted in #568, which I think will also be updated in v16.

5 participants