Conversation
Looking at the agreement among callers, the TCGA Lancet results still seem strange. Here are the agreements among callers from the PBTA samples for exons:

Aside from the lack of VarDict, the big difference is that the Lancet calls are much more likely to be Lancet-only in the TCGA set (even after adding in Lancet + VarDict in the PBTA), and we have a much smaller proportion of agreement among all three (or four) callers. The nucleotide biases in the Lancet samples also seem very different between the two data sets. In PBTA all callers have similar biases, but in the TCGA data the Lancet calls are very skewed.

I'm wondering if it is possible that the Lancet data was processed with a different genome build? Or is there some other difference in the way Lancet was run on the two sets that would result in odd patterns like this? I am tagging in @jharenza, @tkoganti, @migbro, and @yuankunzhu for this, as they were all mentioned in #512.
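(For reference, a minimal sketch of how this kind of per-caller agreement can be tallied, assuming each caller's calls have been reduced to chrom:pos:ref:alt keys; the caller vectors below are toy data, and UpSetR is just one convenient option, not necessarily what the notebooks use.)

```r
# Toy vectors of variant keys; in practice these would be built from each caller's MAF.
# The caller names and keys here are illustrative assumptions, not the pipeline's objects.
library(UpSetR)

strelka <- c("chr1:100:A:T", "chr1:200:G:C", "chr2:300:C:T")
mutect  <- c("chr1:100:A:T", "chr2:300:C:T", "chr3:400:T:G")
lancet  <- c("chr1:100:A:T", "chr3:400:T:G", "chr4:500:G:A")

# fromList() turns the sets into a binary membership matrix;
# upset() then plots the size of every caller combination (caller-only vs. shared calls).
upset(fromList(list(strelka = strelka, mutect = mutect, lancet = lancet)),
      order.by = "freq")
```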
@jashapiro @tkoganti also just pointed out that the BED files used were the same ones used for the PBTA WXS, but we should have used the WXS BED files for whichever kit(s) were used in TCGA. @cansavvy, do I recall correctly from another ticket that you had found these BED files or kit information accessible somewhere? We will try to locate the kit information and WXS BED files and put this in as an end-to-end workflow to-do for the next release.
I don't feel like patch changes should make enough of a difference to explain these discrepancies, but I don't really have any other ideas about why Lancet would behave so differently with this set when it was fairly concordant in the PBTA set. We will investigate some more on our end.
Because I understood the BED files you use to be variant-caller-specific, I used the same BED files for the TCGA data as I did for the PBTA data. Previously I had used the MC3 data and its associated WXS BED file, but I am not sure whether it makes sense to use it here. The target BED regions file can be found on this page: https://gdc.cancer.gov/about-data/publications/mc3-2017
If you can share whatever BED files you use in the release, then I can calculate TMB based on them.
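As a rough sketch of what calculating TMB from such a BED involves, assuming the file is read with rtracklayer (the path below is a placeholder, and this is not the module's actual code):

```r
# Sketch: total callable target size in Mb from a BED file -- the usual TMB denominator.
# "wxs_targets.bed" is a placeholder path, not a file from the release.
library(rtracklayer)
library(GenomicRanges)

regions <- import("wxs_targets.bed", format = "BED")

# Merge overlapping intervals so no base is counted twice.
merged <- reduce(regions)
target_size_mb <- sum(width(merged)) / 1e6

# TMB per sample would then be: (number of consensus coding mutations) / target_size_mb
target_size_mb
```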
Thanks @cansavvy! So the normal workflow should be to use the same BED intervals in variant calling as were used to create the WXS or targeted capture panel. Since we only had WXS from one lab, and they only used one specific library prep kit for it, it was easy to just use that one WXS BED. I somehow missed the fact that we used the first BED above when processing the TCGA data; we should have used the BED you just pointed out. That being said, you can process with what is in the release now while we redo this processing using the correct BED. I noticed that gencode v19 is in hg19, so we will have to lift those regions over to hg38, then add the 100bp padding as we did before, then process and release those BEDs as well. Does that help?
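For illustration, a minimal sketch of that liftover-and-pad step, assuming rtracklayer and a local hg19-to-hg38 chain file (the file names below are placeholders, not the files that will actually be released):

```r
# Sketch: lift hg19 target regions to hg38 and add 100bp of padding to each side.
# File names are placeholders, not the files that will be released.
library(rtracklayer)
library(GenomicRanges)

hg19_regions <- import("gencode_v19_targets_hg19.bed", format = "BED")
chain <- import.chain("hg19ToHg38.over.chain")

# liftOver() returns a GRangesList (one element per input range); flatten it to a GRanges.
hg38_regions <- unlist(liftOver(hg19_regions, chain))

# Expand each interval by 100bp on both sides, then merge any overlaps created by padding.
padded <- reduce(hg38_regions + 100)

export(padded, "gencode_v19_targets_hg38_pad100.bed", format = "BED")
```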
Going to close this PR because I think it served its purpose of sharing the revamped TCGA data and facilitating the conversation we had around it. |
Purpose/implementation Section
What scientific question is your analysis addressing?
If we run the TCGA data through the same variant callers and consensus methods we used for the PBTA data, do we get a comparison that better matches our expectations (i.e., that adults have a higher tumor mutation burden)?
This PR is a draft because the results are still quite raw and I don't totally trust them yet. But I want to get this posted in case someone notices something I have missed in the consensus pipeline for TCGA.
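For concreteness, here is a minimal sketch of the kind of cohort comparison meant above, using simulated TMB values rather than anything from this PR:

```r
# Sketch: compare per-sample TMB between the adult (TCGA) and pediatric (PBTA) cohorts.
# `tmb_df` is simulated stand-in data, not results from the release.
library(ggplot2)

set.seed(42)
tmb_df <- data.frame(
  cohort = rep(c("TCGA", "PBTA"), each = 50),
  tmb    = c(rlnorm(50, meanlog = 1.0), rlnorm(50, meanlog = 0.3))
)

# If the callers behaved comparably across cohorts, the TCGA distribution should sit higher.
wilcox.test(tmb ~ cohort, data = tmb_df)

ggplot(tmb_df, aes(x = cohort, y = tmb)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(y = "Mutations per Mb (TMB)")
```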
What was your approach?
The TCGA data were run through the same consensus calling steps used for the PBTA data, with an option (--tcga) that needs to be used for the 03-calculate-tmb.R script.
What GitHub issue does your pull request address?
#257
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
There are results (see notebooks below), but I don't trust them yet. They need some fine-tooth combing and investigation.
Results
What types of results are included (e.g., table, figure)?
Here are the updated rendered notebooks:
https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots-tcga.nb.html
https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb-update.nb.html
For reference, this is what this comparison notebook looked like before using the updated data:
https://cansavvy.github.io/openpbta-notebook-concept/tmb-compare-tcga/compare-tmb.nb.html
What is your summary of the results?
The comparison doesn't look like what we'd expect; I need to go over all the methods to determine whether anything in the analysis is amiss.
Reproducibility Checklist
Documentation Checklist
This analysis has a README and it is up to date.
This analysis is listed in analyses/README.md and the entry is up to date.