
TCGA Consensus Run #521

Closed · wants to merge 19 commits

Conversation

cansavvy
Collaborator

@cansavvy commented Feb 6, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

If we run the TCGA data through the same variant callers and consensus methods we used for the PBTA data, do we get a comparison that more accurately matches our expectations (i.e., that adults have a higher tumor mutation burden)?

This PR is a draft because the results are still quite raw and I don't totally trust them yet. But I want to get this posted in case someone notices something I have missed in the consensus pipeline for TCGA.

What was your approach?

  • I ran the TCGA files from v14 through the same pipeline we used for the PBTA data. Note that a few metadata cleaning steps needed to be adjusted for TCGA, so there is a new option (--tcga) that needs to be used with the 03-calculate-tmb.R script (a minimal sketch of the option handling is shown below).
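
For context, here is a minimal sketch of how a --tcga switch could be wired into an optparse-based script like 03-calculate-tmb.R. The flag name comes from this PR, but the option handling and the barcode-trimming step shown are illustrative assumptions, not the script's actual code.

```r
# Hedged sketch only: option handling for the --tcga switch described above.
library(optparse)

option_list <- list(
  optparse::make_option(
    "--tcga",
    action = "store_true",
    default = FALSE,
    help = "Apply TCGA-specific metadata cleaning before calculating TMB"
  )
)
opt <- optparse::parse_args(optparse::OptionParser(option_list = option_list))

# Toy stand-in for a MAF-style data frame (illustrative only)
maf_df <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-AB-1234-01A-11D-1234-08",
                           "TCGA-CD-5678-01A-11D-5678-08"),
  stringsAsFactors = FALSE
)

if (opt$tcga) {
  # Illustrative TCGA-specific cleaning step (assumed, not taken from the
  # actual script): trim aliquot barcodes to the participant-level ID so
  # they match the clinical metadata.
  maf_df$Tumor_Sample_Barcode <- substr(maf_df$Tumor_Sample_Barcode, 1, 12)
}
```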

What GitHub issue does your pull request address?

#257

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Is there anything where the TCGA data is not being handled properly by the snv-caller pipeline?
  • I was assuming that the same BED files should be used for the TMB calculations (a rough sketch of that arithmetic follows this list)
  • Is there anything that needs adjusting for the TCGA data that I have missed?
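
To make the BED question above concrete, here is a rough sketch of the TMB arithmetic as I understand it: consensus mutation counts divided by the megabases covered by the BED intervals. The object and column names are illustrative, not taken from 03-calculate-tmb.R.

```r
# Hedged sketch of a per-sample TMB calculation; not the module's actual code.
library(GenomicRanges)
library(dplyr)

# Toy BED intervals (illustrative); in practice these come from the BED file
bed_ranges <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1, 5001), end = c(2000, 8000))
)
region_size_mb <- sum(width(reduce(bed_ranges))) / 1e6

# Toy stand-in for a consensus SNV table restricted to the BED regions
consensus_snvs <- data.frame(
  Tumor_Sample_Barcode = c("sample_A", "sample_A", "sample_B"),
  stringsAsFactors = FALSE
)

# TMB = mutations per megabase of surveyed territory
tmb_per_sample <- consensus_snvs %>%
  dplyr::count(Tumor_Sample_Barcode, name = "mutation_count") %>%
  dplyr::mutate(tmb = mutation_count / region_size_mb)
```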

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

There are results (see notebooks below), but I don't trust them yet. They need some fine-tooth combing and investigation.

Results

What types of results are included (e.g., table, figure)?

Here are the updated rendered notebooks:
https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots-tcga.nb.html

https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb-update.nb.html

For reference, this is what this comparison notebook looked like before using the updated data:
https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb.nb.html

What is your summary of the results?

The comparison doesn't look like what we'd expect; we need to go over all the methods to determine whether anything in the analysis is amiss.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cansavvy added the "work in progress" label Feb 6, 2020
@jashapiro
Member

Looking at the agreement among callers, the TCGA Lancet results still seem strange. Here are the agreements among callers from the PBTA samples for exons:

[Figure: agreement among callers for PBTA exonic calls]
And here are the TCGA samples
[Figure: agreement among callers for TCGA exonic calls]

Aside from the lack of VarDict, the big difference is that the Lancet calls are much more likely to be Lancet-only in the TCGA set (even after adding in Lancet + VarDict in the PBTA), and we have a much smaller proportion of agreement among all three (or four) callers.

The nucleotide biases in the Lancet samples also seem very different between the two data sets. In PBTA, all callers have similar biases, but in the TCGA data, Lancet is strongly skewed toward A>C and T>G. (Mutect looks a bit strange in TCGA too, but not nearly as strange.)
[Figure: single-base substitution proportions by caller, PBTA vs. TCGA]
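
For reference, this is roughly how the per-caller substitution proportions behind a plot like the one above could be tabulated from a MAF-style table; the combined data frame and its caller column are assumptions for illustration, not the notebook's actual code.

```r
# Hedged sketch: proportion of each single-base substitution, per caller.
library(dplyr)

# Toy stand-in for a combined MAF-style table with a caller column (illustrative)
snv_calls <- data.frame(
  caller = c("lancet", "lancet", "mutect", "strelka2"),
  Variant_Type = "SNP",
  Reference_Allele = c("A", "A", "C", "G"),
  Tumor_Seq_Allele2 = c("C", "C", "T", "A"),
  stringsAsFactors = FALSE
)

substitution_props <- snv_calls %>%
  dplyr::filter(Variant_Type == "SNP") %>%
  dplyr::mutate(change = paste0(Reference_Allele, ">", Tumor_Seq_Allele2)) %>%
  dplyr::count(caller, change) %>%
  dplyr::group_by(caller) %>%
  dplyr::mutate(proportion = n / sum(n)) %>%
  dplyr::ungroup()
```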

I'm wondering whether it's possible that the Lancet data was processed with a different genome build, or whether there was some other difference in the way Lancet was run on the two sets that would result in odd patterns like this?

I am tagging in @jharenza, @tkoganti, @migbro, and @yuankunzhu for this, as they were all mentioned in #512

@jharenza
Collaborator

jharenza commented Feb 6, 2020

@jashapiro GRCh38.d1.vd1.fa was the genome version used for TCGA here. It looks like this is patch 0, while PBTA is patch 12. At the time of the initial run, I recall it was decided to not convert from BAM->FQ->CRAM because that process was going to push back getting the results for several days and it wouldn't make it into the last data release (but I guess that is moot now!).

@tkoganti also just pointed out that the BED files used were the same ones used for the PBTA WXS, but we should have used the WXS BED files for whichever kit(s) were used in TCGA. @cansavvy do I recall from another ticket that you had found these BED files or the kit information accessible somewhere?

We will try to locate the kit information and WXS BED files and put this in as an end-to-end workflow to-do for the next release.

@jashapiro
Member

I don't think patch changes should make enough of a difference to explain these discrepancies, but I don't really have other ideas for why Lancet should behave so differently with this set when it was fairly concordant in the PBTA set. We will investigate some more at this end.

@cansavvy
Collaborator Author

cansavvy commented Feb 6, 2020

@cansavvy do I recall from another ticket that you had found these BED files accessible or kit information somewhere?

Because I understood the BED files to be variant-caller specific, I used the same BED files for the TCGA data as I did for the PBTA data. Previously I had used the MC3 data and its associated WXS BED file, but I am not sure whether it makes sense to use it here. The target BED regions file can be found on this page: https://gdc.cancer.gov/about-data/publications/mc3-2017
It's the gencode.v19.basic.exome.bed file that I used.

We will try to locate the kit information and WXS BED files and put this in as an end-to-end workflow to-do for the next release.

If you can share whatever BED files you use in the release, then I can calculate TMB based on them.

@jharenza
Collaborator

jharenza commented Feb 6, 2020

Thanks @cansavvy!

So the normal workflow is to use, for variant calling, the same BED intervals that were used to create the WXS or targeted capture panel. Since we only had WXS from one lab, and they only used one specific library prep kit for it, it was easy to just use the one WXS BED (StrexomeLite_hg38_liftover_100bp_padded.bed) we had in the release. This changed when we processed the targeted panel data, as that BED had fewer regions; that one is StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed.

I somehow missed the fact that we used the first BED above when processing the TCGA data. We should have used the BED you just pointed out. That being said, you can process with what is in the release now while we redo this processing using the correct BED. I noticed that gencode v19 is in hg19, so we will have to lift those regions over to hg38, then add the 100bp padding as we did before, then process and release those BEDs as well. Does that help?
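
For reference, here is a minimal sketch of the lift-over-and-pad step described above, using rtracklayer and GenomicRanges; the chain file and output file names are placeholders, and this is not necessarily how the production workflow will be implemented.

```r
# Hedged sketch: lift the gencode v19 exome BED from hg19 to hg38, add 100bp padding.
library(rtracklayer)
library(GenomicRanges)

hg19_regions <- import("gencode.v19.basic.exome.bed", format = "BED")
chain <- import.chain("hg19ToHg38.over.chain")  # placeholder chain file name

# liftOver returns a GRangesList (one element per input range); flatten and merge
hg38_regions <- reduce(unlist(liftOver(hg19_regions, chain)))

# Add 100bp of padding on each side, mirroring the padded PBTA BEDs, then re-merge
padded_regions <- reduce(hg38_regions + 100)

export(
  padded_regions,
  "gencode.v19.basic.exome.hg38lift.100bp_padded.bed",  # placeholder output name
  format = "BED"
)
```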

@kgaonkar6 mentioned this pull request Feb 18, 2020
@cansavvy mentioned this pull request Feb 19, 2020
@cansavvy
Collaborator Author

I'm going to close this PR because I think it has served its purpose of sharing the revamped TCGA data and facilitating the conversation we had around it.

@cansavvy closed this Feb 26, 2020
@cansavvy mentioned this pull request Feb 28, 2020