
TCGA Consensus Run #521

Closed · wants to merge 19 commits

Conversation

cansavvy
Collaborator

@cansavvy commented Feb 6, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

If we run the TCGA data through the same variant callers and consensus methods we used for the PBTA data, do we get a comparison that more accurately matches our expectations (i.e., that adults have a higher tumor mutation burden)?

This PR is a draft because the results are still quite raw and I don't totally trust them yet. But I want to get this posted in case someone notices something I have missed in the consensus pipeline for TCGA.

What was your approach?

  • I ran the TCGA files from v14 through the same pipeline we used for the PBTA data. Note that a few metadata cleaning steps needed to be adjusted for TCGA, so there is a new option (--tcga) that needs to be used with the 03-calculate-tmb.R script (a minimal sketch of the option handling is shown below).
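
For context, here is a minimal sketch of how a --tcga switch could be wired into an optparse-based script like 03-calculate-tmb.R. The flag name comes from this PR, but the option handling and the barcode-trimming step shown are illustrative assumptions, not the script's actual code.

```r
# Hedged sketch only: option handling for the --tcga switch described above.
library(optparse)

option_list <- list(
  optparse::make_option(
    "--tcga",
    action = "store_true",
    default = FALSE,
    help = "Apply TCGA-specific metadata cleaning before calculating TMB"
  )
)
opt <- optparse::parse_args(optparse::OptionParser(option_list = option_list))

# Toy stand-in for a MAF-style data frame (illustrative only)
maf_df <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-AB-1234-01A-11D-1234-08",
                           "TCGA-CD-5678-01A-11D-5678-08"),
  stringsAsFactors = FALSE
)

if (opt$tcga) {
  # Illustrative TCGA-specific cleaning step (assumed, not taken from the
  # actual script): trim aliquot barcodes to the participant-level ID so
  # they match the clinical metadata.
  maf_df$Tumor_Sample_Barcode <- substr(maf_df$Tumor_Sample_Barcode, 1, 12)
}
```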

What GitHub issue does your pull request address?

#257

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Is there anything where the TCGA data is not being handled properly by the snv-caller pipeline?
  • I was assuming that the same BED files should be used for the TMB calculations (a rough sketch of that arithmetic follows this list)
  • Is there anything that needs adjusting for the TCGA data that I have missed?
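
To make the BED question above concrete, here is a rough sketch of the TMB arithmetic as I understand it: consensus mutation counts divided by the megabases covered by the BED intervals. The object and column names are illustrative, not taken from 03-calculate-tmb.R.

```r
# Hedged sketch of a per-sample TMB calculation; not the module's actual code.
library(GenomicRanges)
library(dplyr)

# Toy BED intervals (illustrative); in practice these come from the BED file
bed_ranges <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1, 5001), end = c(2000, 8000))
)
region_size_mb <- sum(width(reduce(bed_ranges))) / 1e6

# Toy stand-in for a consensus SNV table restricted to the BED regions
consensus_snvs <- data.frame(
  Tumor_Sample_Barcode = c("sample_A", "sample_A", "sample_B"),
  stringsAsFactors = FALSE
)

# TMB = mutations per megabase of surveyed territory
tmb_per_sample <- consensus_snvs %>%
  dplyr::count(Tumor_Sample_Barcode, name = "mutation_count") %>%
  dplyr::mutate(tmb = mutation_count / region_size_mb)
```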

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

There are results (see notebooks below), but I don't trust them yet. They need some fine-tooth combing and investigation.

Results

What types of results are included (e.g., table, figure)?

Here are the updated rendered notebooks:
https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots-tcga.nb.html

https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb-update.nb.html

For reference, this is what this comparison notebook looked like before using the updated data:
https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb.nb.html

What is your summary of the results?

The comparison doesn't look like what we'd expect; we need to go over all the methods to determine whether anything in the analysis is amiss.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cansavvy added the "work in progress" label Feb 6, 2020
@jashapiro
Member

Looking at the agreement among callers, the TCGA Lancet results still seem strange. Here are the agreements among callers from the PBTA samples for exons:

[Figure: agreement among callers for PBTA exonic calls]
And here are the TCGA samples
[Figure: agreement among callers for TCGA exonic calls]

Aside from the lack of VarDict, the big difference is that the Lancet calls are much more likely to be Lancet-only in the TCGA set (even after adding in Lancet + VarDict in the PBTA), and we have a much smaller proportion of agreement among all three (or four) callers.

The nucleotide biases in the Lancet samples also seem very different between the two data sets. In PBTA, all callers have similar biases, but in the TCGA data, Lancet is strongly skewed toward A>C and T>G. (Mutect looks a bit strange in TCGA too, but not nearly as strange.)
[Figure: single-base substitution proportions by caller, PBTA vs. TCGA]
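
For reference, this is roughly how the per-caller substitution proportions behind a plot like the one above could be tabulated from a MAF-style table; the combined data frame and its caller column are assumptions for illustration, not the notebook's actual code.

```r
# Hedged sketch: proportion of each single-base substitution, per caller.
library(dplyr)

# Toy stand-in for a combined MAF-style table with a caller column (illustrative)
snv_calls <- data.frame(
  caller = c("lancet", "lancet", "mutect", "strelka2"),
  Variant_Type = "SNP",
  Reference_Allele = c("A", "A", "C", "G"),
  Tumor_Seq_Allele2 = c("C", "C", "T", "A"),
  stringsAsFactors = FALSE
)

substitution_props <- snv_calls %>%
  dplyr::filter(Variant_Type == "SNP") %>%
  dplyr::mutate(change = paste0(Reference_Allele, ">", Tumor_Seq_Allele2)) %>%
  dplyr::count(caller, change) %>%
  dplyr::group_by(caller) %>%
  dplyr::mutate(proportion = n / sum(n)) %>%
  dplyr::ungroup()
```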

I'm wondering whether it's possible that the Lancet data was processed with a different genome build, or whether there was some other difference in the way Lancet was run on the two sets that would result in odd patterns like this?

I am tagging in @jharenza, @tkoganti, @migbro, and @yuankunzhu for this, as they were all mentioned in #512

@jharenza
Collaborator

jharenza commented Feb 6, 2020

@jashapiro GRCh38.d1.vd1.fa was the genome version used for TCGA here. It looks like this is patch 0, while PBTA is patch 12. At the time of the initial run, I recall it was decided to not convert from BAM->FQ->CRAM because that process was going to push back getting the results for several days and it wouldn't make it into the last data release (but I guess that is moot now!).

@tkoganti also just pointed out that the BED files used were the same ones used for the PBTA WXS, but we should have used the WXS BED files for whichever kit(s) were used in TCGA. @cansavvy do I recall from another ticket that you had found these BED files or the kit information accessible somewhere?

We will try to locate the kit information and WXS BED files and put this in as an end-to-end workflow to-do for the next release.

@jashapiro
Member

I don't think patch changes should make enough of a difference to explain these discrepancies, but I don't really have other ideas for why Lancet should behave so differently with this set when it was fairly concordant in the PBTA set. We will investigate some more at this end.

@cansavvy
Collaborator Author

cansavvy commented Feb 6, 2020

@cansavvy do I recall from another ticket that you had found these BED files accessible or kit information somewhere?

Because I understood the BED files to be variant-caller specific, I used the same BED files for the TCGA data as I did for the PBTA data. Previously I had used the MC3 data and its associated WXS BED file, but I am not sure whether it makes sense to use it here. The target BED regions file can be found on this page: https://gdc.cancer.gov/about-data/publications/mc3-2017
It's the gencode.v19.basic.exome.bed file that I used.

We will try to locate the kit information and WXS BED files and put this in as an end-to-end workflow to-do for the next release.

If you can share whatever BED files you use in the release, then I can calculate TMB based on them.

@jharenza
Collaborator

jharenza commented Feb 6, 2020

Thanks @cansavvy!

So the normal workflow is to use, for variant calling, the same BED intervals that were used to create the WXS or targeted capture panel. Since we only had WXS from one lab, and they only used one specific library prep kit for it, it was easy to just use the one WXS BED (StrexomeLite_hg38_liftover_100bp_padded.bed) we had in the release. This changed when we processed the targeted panel data, as that BED had fewer regions; that one is StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed.

I somehow missed the fact that we used the first BED above when processing the TCGA data. We should have used the BED you just pointed out. That being said, you can process with what is in the release now while we redo this processing using the correct BED. I noticed that gencode v19 is in hg19, so we will have to lift those regions over to hg38, then add the 100bp padding as we did before, then process and release those BEDs as well. Does that help?
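
For reference, here is a minimal sketch of the lift-over-and-pad step described above, using rtracklayer and GenomicRanges; the chain file and output file names are placeholders, and this is not necessarily how the production workflow will be implemented.

```r
# Hedged sketch: lift the gencode v19 exome BED from hg19 to hg38, add 100bp padding.
library(rtracklayer)
library(GenomicRanges)

hg19_regions <- import("gencode.v19.basic.exome.bed", format = "BED")
chain <- import.chain("hg19ToHg38.over.chain")  # placeholder chain file name

# liftOver returns a GRangesList (one element per input range); flatten and merge
hg38_regions <- reduce(unlist(liftOver(hg19_regions, chain)))

# Add 100bp of padding on each side, mirroring the padded PBTA BEDs, then re-merge
padded_regions <- reduce(hg38_regions + 100)

export(
  padded_regions,
  "gencode.v19.basic.exome.hg38lift.100bp_padded.bed",  # placeholder output name
  format = "BED"
)
```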

@kgaonkar6 mentioned this pull request Feb 18, 2020
@cansavvy mentioned this pull request Feb 19, 2020
@cansavvy
Collaborator Author

I'm going to close this PR because I think it has served its purpose of sharing the revamped TCGA data and facilitating the conversation we had around it.

@cansavvy closed this Feb 26, 2020
@cansavvy mentioned this pull request Feb 28, 2020