Proposed Analysis: Run brain TCGA SNV data through PBTA SNV caller pipeline #257
Comments
@yuankunzhu can you share here the figures you recently generated comparing GDC pipelines with our pipelines? We did a pilot comparison between TMB from their pipeline and ours and do not see much of a difference in TMB, so in an effort to save money and time on computation, we likely will not be performing this analysis. |
@jharenza : the processing has changed quite a bit since the initial dataset (adding new callers, etc). It might be wise to process at least some random subset of the TCGA tumors with at least some subset of callers and then compare over regions that have adequate coverage for that caller with our samples. As a peer reviewer, I'd probably see this as a fatal flaw in a TMB comparison if something like this wasn't done. It wouldn't surprise me if our reviewers would raise the same issue. I don't think this analysis needs to be comprehensive across the TCGA samples if the subset is randomly selected (stratified by tumor type). Perhaps this is what @yuankunzhu has already done. |
@cgreene yes, this is what was done on a small subset and what I am asking @yuankunzhu to share. The results are very comparable. |
Hi @cansavvy and @cgreene! For the TCGA MAFs used in #3, which variant caller was used when calculating TMB? @tkoganti and I had an idea since this came up for another project, and our results look as expected. We calculated TMB using Mutect2 MAFs from TCGA and Mutect2 MAFs from PBTA and found that the TMBs for pediatric tumors are definitely lower than the adult tumors, as expected. However, we were not able to find Mutect2 MAFs for the brain tumors as listed above, only the other 3 variant caller MAFs. So, we can try two things:
When we did the simple comparison, we used the same BED region file for both datasets, selected only non-synonymous and nonsense variants from those regions, and divided by the BED size in Mb to get TMB, so it was an apples-to-apples comparison. I suggest we start with 1 for now, so we can try to get that done quickly for the first paper submission, and follow up longer term with 2 - does that sound like a plan? |
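A minimal sketch of the TMB calculation described above, for comparing notes. It is not the code that was actually run. It assumes BED rows are `(chrom, start, end)` with 0-based half-open coordinates, variants are `(chrom, pos, Variant_Classification)` with 1-based positions, and that "non-synonymous and nonsense" maps to the MAF classes `Missense_Mutation` and `Nonsense_Mutation`. The function names are hypothetical.

```python
from collections import defaultdict

# Assumption: these MAF Variant_Classification values stand in for
# "non-synonymous and nonsense"; adjust to match the real filter.
COUNTED_CLASSES = frozenset({"Missense_Mutation", "Nonsense_Mutation"})


def merge_intervals(bed_rows):
    """Merge overlapping BED intervals per chromosome so no base is counted twice."""
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_rows:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def tmb_per_mb(variants, bed_rows):
    """Count qualifying variants inside the BED regions and divide by the BED size in Mb."""
    merged = merge_intervals(bed_rows)
    region_bp = sum(end - start for spans in merged.values() for start, end in spans)
    if region_bp == 0:
        raise ValueError("BED regions cover zero bases")
    counted = 0
    for chrom, pos, var_class in variants:
        if var_class not in COUNTED_CLASSES:
            continue
        # 1-based position falls in a 0-based half-open interval when start < pos <= end
        if any(start < pos <= end for start, end in merged.get(chrom, [])):
            counted += 1
    return counted / (region_bp / 1_000_000)
```

Using the same `bed_rows` for the TCGA and PBTA variants keeps the denominator identical across the two cohorts, which is the point of the comparison.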
For the TCGA TMB calculations, I used the MAF data from MC3, which includes Mutect1 and some other callers to make their variant calls. It definitely would be better to have the same variant caller data for comparing TMB.
To save time, you can skip running VarDict (which I believe has the longest runtime). We didn't end up using VarDict for the TMB calculations. See methods summary here.
Are you suggesting we compare all our pediatric brain tumor samples to all adult non-CNS tissues? I'm not sure this would help the interpretability/comparability problem, since we wouldn't be able to know whether any differences are due to CNS vs non-CNS tissues or to pediatric vs adult tissues. But then again, even the CNS diseases in adults and children are very different to begin with, so maybe this comparison is just rough regardless.
What do you mean by using the same BED regions file? At what stage(s)? And, would you be able to share this code? |
Hi @cansavvy
|
I'm a bit confused on what these TCGA samples are then since she mentioned you were not able to find
For coding-only TMB variants (which is what we used for the TCGA comparison), we used a similar tactic to what you describe here, except we used the CDS annotation from the gtf.gencode file included in the data release.
Thanks for sending this! Just want to compare notes on TMB calculations! |
All the TCGA are Mutect. I had them labelled wrong yesterday. I corrected it in the comment above with the figure. @cansavvy and @cgreene We were wondering if it was possible to get a manifest file for the TCGA BAM files with the disease types you would like to use, so we can run consensus calling on those? I see three disease types that we would like to use (GBM, PCPG, and LGG), but are there subtypes within those? Also, how many samples should we run under each disease type? |
My understanding is that the goal of this analysis is to compare TCGA mutation burden with PBTA mutation burden. The ideal world would be to:
|
Hi @cgreene - that is the plan, but there are 3 broad histologies listed: TCGA-LGG, TCGA-GBM, TCGA-PCPG. It looks like the files @cansavvy was able to use in her first comparison had subtypes within those, whereas the BAMs are labeled broadly. @cansavvy can you send the link to the manifest from which you obtained these narrower histologies, so we can try to select N BAMs per histology? I could not find it readily. @cgreene - how many samples per group do you think would be sufficient - 10? More? We are trying to minimize costs and processing time by rationally selecting the cohort to analyze. Thanks! |
Since I don't know what effect size we'd be looking for, I don't have a way to say how many would be sufficient. |
@cgreene we are starting with primary tumors only, and running 20 random per group from the 8 brain tumor histologies with >20 samples. |
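For reference, a minimal sketch of the selection described above: primary tumors only, 20 random samples per histology, restricted to histologies with more than 20 samples. The manifest layout `(sample_id, histology, sample_type)`, the `"Primary Tumor"` label, and the function name are assumptions for illustration, not the real manifest schema. A fixed seed keeps the draw reproducible.

```python
import random
from collections import defaultdict


def select_stratified(samples, per_group=20, min_group_size=20, seed=0):
    """Pick `per_group` random primary tumors from each histology with > `min_group_size` of them.

    `samples` is an iterable of (sample_id, histology, sample_type) tuples.
    """
    groups = defaultdict(list)
    for sample_id, histology, sample_type in samples:
        # Assumption: primary tumors are labelled exactly "Primary Tumor"
        if sample_type == "Primary Tumor":
            groups[histology].append(sample_id)

    rng = random.Random(seed)
    selected = {}
    for histology in sorted(groups):
        ids = sorted(groups[histology])  # sort so the draw does not depend on input order
        if len(ids) > min_group_size:
            selected[histology] = rng.sample(ids, per_group)
    return selected
```

Sampling within each histology, rather than across the whole manifest, is what makes the subset stratified by tumor type.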
What are the scientific goals of the analysis?
Related to #3 Tumor Mutation Burden Analysis.
To make the pediatric brain tumor SNV comparison to TCGA adult data more interpretable, we should run those data through the same mutation callers and parameters as were used for the PBTA data.
What methods do you plan to use to accomplish the scientific goals?
This would include running the brain-related TCGA data through Strelka2, Lancet, and Mutect2, each with whatever parameters are used for the PBTA data. Upon receiving the output from these callers for the TCGA data, I would make an SNV consensus file for the TCGA data in the same way as I did for the PBTA data.
I think VarDict can be excluded from this analysis, since it has been oversensitive in its calls with the PBTA data.
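The consensus rule used for PBTA is not spelled out in this issue, so the following is only an illustrative sketch of one common approach: keep an SNV when at least `min_callers` of the callers report it. The variant key `(chrom, pos, ref, alt)`, the `min_callers` default, and the function name are all assumptions, not the PBTA implementation.

```python
from collections import Counter


def consensus_snvs(calls_by_caller, min_callers=2):
    """Return the sorted SNVs reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples.
    """
    counts = Counter()
    for calls in calls_by_caller.values():
        # De-duplicate within a caller so each caller votes once per variant
        for key in set(calls):
            counts[key] += 1
    return sorted(key for key, n in counts.items() if n >= min_callers)
```

Setting `min_callers` to the number of callers gives a strict intersection; lowering it trades specificity for sensitivity.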
What input data are required for this analysis?
These brain related TCGA data:
TCGA-LGG
TCGA-GBM
TCGA-PCPG
These data can be obtained from https://portal.gdc.cancer.gov/repository but require controlled access.
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
This is something that I am unsure of, but I expect it would be about the same amount of time it took to run the PBTA dataset through the mutation callers.
Who will complete the analysis (please add a GitHub handle here if relevant)?
@jharenza who do you suggest?
What relevant scientific literature relates to this analysis?
GBM : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3910500/
LGG : https://www.nejm.org/doi/full/10.1056/NEJMoa1402121
PCPG: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5643159/pdf/nihms849815.pdf