-
Notifications
You must be signed in to change notification settings - Fork 83
Conversation
@yuankunzhu @migbro A quick side question that came up in the meeting, we wanted to know if PBTA WXS data looked as odd as the TCGA WXS data did. So I isolated the WXS samples only and re-ran the upsetter plot: Although Lancet does not have as many calls it's making by itself, the VAF for the PBTA WXS samples is lower than it was for the WGS samples. I'd say it doesn't necessarily as bad as it did for the TCGA WXS samples, but its not great. |
There were a couple hyper mutated samples that contributed to almost ~24K variants out of the 40K variants in the TCGA dataset. We made some violin plots but even after removing those two samples, lancet did not show a lot of improvement |
I recall @migbro and/or @yuankunzhu mentioning that there were different recommended settings for lancet on WXS samples. I honestly do not have great confidence that would make enough of a difference to justify including it in the consensus for the TCGA/PBTA comparison. One question to @cansavvy is whether there are enough matched samples where we have both WXS and WGS to look at the differences within the same specimens for the various callers? (I hope that we are able to take advantage of the PCAWG data as discussed in #551, as that will be more compatible with the majority of our samples!) |
@jashapiro As the data are in v14 according to the metadata file, pbta-histologies.tsv, I don't see that there are biospecimens with both WXS and WGS data. Do paired WGS and WXS for a single biospecimen exist some place? |
So one thing I will mention, in general, I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture. Anything beyond those regions, especially when you're like, 15bp beyond say is probably something that would need a better justification to look at. Does that make sense? As a disclaimer, I acknowledge that the effective padding is way more that "a little," but I think the point stands. |
@cansavvy - biospecimen IDs seem to be essentially library identifiers, try looking at other identifiers (e.g., |
I looked under |
TMB's are calculated with coding sequences only, so this won't be relevant for that, but to test your idea about the padding, I've added a notebook and pasted the main graph below. It looks pretty conclusive that Lancet does just fine for WGS but does not do well for WXS. |
@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions. |
@jashapiro , I checked for this too, didn't mention it, but there are no WXS / WGS participant ID matches either (Using |
Here's the results of the within participant WGS and WXS comparison. It appears to show again that Lancet is not working well on WXS data. |
@cansavvy can you point me to the notebook you mentioned below. Try to figure out what BED/intervals had been applied. Was also looking at the data release folder, I didn't find the TCGA bed files there neither. also cc @jharenza @tkoganti.
I think what @migbro was trying to hint is that we should really use the mutations within the TCGA WXS capture regions, bc we won't catch any mutation outside of it, and I was wondering if we were using the entire coding region instead before, but maybe we were and i missed it? Also FYI, just simply count all the "coding changes" without any filtering, the TMB (although it's just counts here, but can be proportional relevant) between PBTA-WGS vs TCGA-WXS seems to align the pan cancer paper results better. i hope that would be more accurate if we only focused on the TCGA region. |
and re the LANCET called low VAF enrichment, I'm also wondering if those are artifacts caused by some old the lib prep. The GDC found a different but to me a similar issue with Mutect2 for samples based on whole-genome amplification (repli-g) WXS libs. as it tends to produce more soft-clip alignment that causes confusions for the caller to output more artificial indels. The TCGA-GBM data is def one of those lib-based project. So from that perspective, I wont be too surprise that when the denovo assembly algs come in, the softclipped bases get assembled into weird calls, but we need to take a closer look and investigate more on this. https://gdc.cancer.gov/content/mutect2-insertion-artifacts FYI |
Can you share the code you used to do this? It would be useful to know if and where we might be differing on what is counted as "coding". Thanks! |
The notebook itself is included in this PR as but the rendered version of the notebook can be seen here: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html I used the BED intervals provided for WXS for Lancet in the data release |
Looks like that link to the rendered version should be: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html |
Yeah, it was just simply a zgrep handle of searching 'Mutation|Splice|Silent', i can put it in a pr if needed, but it was something like (pseudocode) s3=s3://kf-openaccess-us-east-1-prd-pbta/data/release-v14-20200203/
zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent' |
Thanks @cansavvy, that's what i thought, that was padded on top of the PNOC WXS capture regions. |
@yuankunzhu Does this mean I should be doing something differently in this analysis? |
@yuankunzhu , can you tell me a bit more about how you picked these mutation categories? This would exclude most Indels if we are understanding you correctly. Since they are labeled as
|
yeah i was focusing on SNVs, so skipped InDels for this, just for quick check. |
@yuankunzhu I think this is asking a slightly different question then. Although part of the problem with TMB is the lack of standard definition. That being said, according to the paper on the issue just posted #556 they include indels:
This all being said, this still doesn't bypass our issues we are seeing with Lancet's processing of WXS data. |
Hi @cansavvy, yes you are right, what I posted up there is different from the one you had in your slides where you did the TMB with all coding mutations (SNVs and InDels). Sorry that if I didn't make myself clear, but the idea is not to try to re-do what you have done, but kind of poke around the data and look at different aspect of it and try to figure out why/how does this issues happen. It could be that old WES lib generates more artifact InDels, so looking at SNVs the TMB aligns with other pan-cancer study results paper. It could be that we should apply similar filtering as @jharenza mentioned in #556. so me and @tkoganti are planning to apply those filtrations to see if that helps the TMB distribution. |
By old WES lib, are you referring what regions TCGA's WXS methods were designed to target? Is so the best way to bypass this, this is using more updated data (or WGS) like PCAWG (like #551 discusses). But if PCAWG is not available, what alternatives are you thinking of with these analyses you are doing?
What filtrations are you referring to here? |
My read of this comment #548 (comment) is that the germline filtering discussed in #556 are the next steps. I suggest that we move any detailed discussion of next steps over to #556 and close this pull request. @cansavvy is splitting this particular pull request up (#557, #558 + upcoming PRs) for review. |
Purpose/implementation Section
What scientific question is your analysis addressing?
Investigating issues with TCGA vs PBTA comparison.
What was your approach?
This code evaluates:
What GitHub issue does your pull request address?
#521
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
None yet. Not ready for review.
Results
Reproducibility Checklist
These things aren't done because this is just for reference for now.
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.