Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

TCGA vs PBTA exploratory analysis #548

Closed
wants to merge 31 commits into from

Conversation

cansavvy
Copy link
Collaborator

Purpose/implementation Section

What scientific question is your analysis addressing?

Investigating issues with TCGA vs PBTA comparison.

What was your approach?

This code evaluates:

  • SNV caller's reported sequencing depth
  • TMB' s calculated by caller
  • Overlap between the WXS target regions of PBTA and TCGA

What GitHub issue does your pull request address?

#521

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

None yet. Not ready for review.

Results

Reproducibility Checklist

These things aren't done because this is just for reference for now.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cansavvy cansavvy added the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Feb 19, 2020
@cansavvy
Copy link
Collaborator Author

cansavvy commented Feb 21, 2020

@yuankunzhu @migbro A quick side question that came up in the meeting, we wanted to know if PBTA WXS data looked as odd as the TCGA WXS data did. So I isolated the WXS samples only and re-ran the upsetter plot:

PBTA WXS samples
pbta-wxs-samples-upset-novardict

Screen Shot 2020-02-21 at 10 35 03 AM

Although Lancet does not have as many calls it's making by itself, the VAF for the PBTA WXS samples is lower than it was for the WGS samples. I'd say it doesn't necessarily as bad as it did for the TCGA WXS samples, but its not great.

@tkoganti
Copy link
Collaborator

There were a couple hyper mutated samples that contributed to almost ~24K variants out of the 40K variants in the TCGA dataset. We made some violin plots but even after removing those two samples, lancet did not show a lot of improvement
violinplots_lancet_mutectandstrelka.pptx

@jashapiro
Copy link
Member

I recall @migbro and/or @yuankunzhu mentioning that there were different recommended settings for lancet on WXS samples. I honestly do not have great confidence that would make enough of a difference to justify including it in the consensus for the TCGA/PBTA comparison.

One question to @cansavvy is whether there are enough matched samples where we have both WXS and WGS to look at the differences within the same specimens for the various callers?

(I hope that we are able to take advantage of the PCAWG data as discussed in #551, as that will be more compatible with the majority of our samples!)

@cansavvy
Copy link
Collaborator Author

One question to @cansavvy is whether there are enough matched samples where we have both WXS and WGS to look at the differences within the same specimens for the various callers?

@jashapiro As the data are in v14 according to the metadata file, pbta-histologies.tsv, I don't see that there are biospecimens with both WXS and WGS data. Do paired WGS and WXS for a single biospecimen exist some place?

@migbro
Copy link
Contributor

migbro commented Feb 21, 2020

So one thing I will mention, in general, I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture. Anything beyond those regions, especially when you're like, 15bp beyond say is probably something that would need a better justification to look at. Does that make sense? As a disclaimer, I acknowledge that the effective padding is way more that "a little," but I think the point stands.

@jaclyn-taroni
Copy link
Member

I don't see that there are biospecimens with both WXS and WGS data. Do paired WGS and WXS for a single biospecimen exist some place?

@cansavvy - biospecimen IDs seem to be essentially library identifiers, try looking at other identifiers (e.g., sample_id).

@cansavvy
Copy link
Collaborator Author

@cansavvy - biospecimen IDs seem to be essentially library identifiers, try looking at other identifiers (e.g., sample_id).

I looked under sample_id, I do not see any samples that have both WXS and WGS data in this data release.

@cansavvy
Copy link
Collaborator Author

cansavvy commented Feb 21, 2020

@migbro

I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture.

TMB's are calculated with coding sequences only, so this won't be relevant for that, but to test your idea about the padding, I've added a notebook and pasted the main graph below. It looks pretty conclusive that Lancet does just fine for WGS but does not do well for WXS.

Screen Shot 2020-02-21 at 12 19 13 PM

@jashapiro
Copy link
Member

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

@cansavvy
Copy link
Collaborator Author

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

@jashapiro , I checked for this too, didn't mention it, but there are no WXS / WGS participant ID matches either (Using Kids_First_Participant_ID).

@cansavvy
Copy link
Collaborator Author

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

Here's the results of the within participant WGS and WXS comparison. It appears to show again that Lancet is not working well on WXS data.
By participant:
Screen Shot 2020-02-24 at 9 22 43 AM
Lumped all together
Screen Shot 2020-02-24 at 9 25 06 AM

@yuankunzhu
Copy link
Collaborator

yuankunzhu commented Feb 25, 2020

@cansavvy can you point me to the notebook you mentioned below. Try to figure out what BED/intervals had been applied. Was also looking at the data release folder, I didn't find the TCGA bed files there neither. also cc @jharenza @tkoganti.

@migbro

I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture.

TMB's are calculated with coding sequences only, so this won't be relevant for that, but to test your idea about the padding, I've added a notebook and pasted the main graph below

I think what @migbro was trying to hint is that we should really use the mutations within the TCGA WXS capture regions, bc we won't catch any mutation outside of it, and I was wondering if we were using the entire coding region instead before, but maybe we were and i missed it?

Also FYI, just simply count all the "coding changes" without any filtering, the TMB (although it's just counts here, but can be proportional relevant) between PBTA-WGS vs TCGA-WXS seems to align the pan cancer paper results better. i hope that would be more accurate if we only focused on the TCGA region.

Screen Shot 2020-02-25 at 6 02 28 AM

@yuankunzhu
Copy link
Collaborator

yuankunzhu commented Feb 25, 2020

and re the LANCET called low VAF enrichment, I'm also wondering if those are artifacts caused by some old the lib prep. The GDC found a different but to me a similar issue with Mutect2 for samples based on whole-genome amplification (repli-g) WXS libs. as it tends to produce more soft-clip alignment that causes confusions for the caller to output more artificial indels. The TCGA-GBM data is def one of those lib-based project. So from that perspective, I wont be too surprise that when the denovo assembly algs come in, the softclipped bases get assembled into weird calls, but we need to take a closer look and investigate more on this.

https://gdc.cancer.gov/content/mutect2-insertion-artifacts FYI

@jashapiro
Copy link
Member

@yuankunzhu

Also FYI, just simply count all the "coding changes" without any filtering,

Can you share the code you used to do this? It would be useful to know if and where we might be differing on what is counted as "coding". Thanks!

@cansavvy
Copy link
Collaborator Author

cansavvy commented Feb 25, 2020

@cansavvy can you point me to the notebook you mentioned below. Try to figure out what BED/intervals had been applied.

The notebook itself is included in this PR as but the rendered version of the notebook can be seen here: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html

I used the BED intervals provided for WXS for Lancet in the data release WXS.hg38.lancet.400bp_padded.bed.

@jaclyn-taroni
Copy link
Member

Looks like that link to the rendered version should be: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html

@yuankunzhu
Copy link
Collaborator

yuankunzhu commented Feb 25, 2020

@jashapiro

Can you share the code you used to do this? It would be useful to know if and where we might be differing on what is counted as "coding". Thanks!

Yeah, it was just simply a zgrep handle of searching 'Mutation|Splice|Silent', i can put it in a pr if needed, but it was something like (pseudocode)

s3=s3://kf-openaccess-us-east-1-prd-pbta/data/release-v14-20200203/
zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent' 

@yuankunzhu
Copy link
Collaborator

I used the BED intervals provided for WXS for Lancet in the data release WXS.hg38.lancet.400bp_padded.bed.

Thanks @cansavvy, that's what i thought, that was padded on top of the PNOC WXS capture regions.

@cansavvy
Copy link
Collaborator Author

Thanks @cansavvy, that's what i thought, that was padded on top of the PNOC WXS capture regions.

@yuankunzhu Does this mean I should be doing something differently in this analysis?

@cansavvy
Copy link
Collaborator Author

@yuankunzhu , can you tell me a bit more about how you picked these mutation categories? This would exclude most Indels if we are understanding you correctly. Since they are labeled as Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins.

zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent' 

@yuankunzhu
Copy link
Collaborator

@yuankunzhu , can you tell me a bit more about how you picked these mutation categories? This would exclude most Indels if we are understanding you correctly. Since they are labeled as Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins.

zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent' 

yeah i was focusing on SNVs, so skipped InDels for this, just for quick check.

@cansavvy
Copy link
Collaborator Author

yeah i was focusing on SNVs, so skipped InDels for this, just for quick check.

@yuankunzhu I think this is asking a slightly different question then. Although part of the problem with TMB is the lack of standard definition. That being said, according to the paper on the issue just posted #556 they include indels:

TMB was defined as the number of somatic, coding, base substitution, and indel mutations per megabase of genome examined. All base substitutions and indels in the coding region of targeted genes, including synonymous alterations, are initially counted before filtering as described below.

This all being said, this still doesn't bypass our issues we are seeing with Lancet's processing of WXS data.

@yuankunzhu
Copy link
Collaborator

I think this is asking a slightly different question then. Although part of the problem with TMB is the lack of standard definition. That being said, according to the paper on the issue just posted #556 they include indels:

Hi @cansavvy, yes you are right, what I posted up there is different from the one you had in your slides where you did the TMB with all coding mutations (SNVs and InDels).

Sorry that if I didn't make myself clear, but the idea is not to try to re-do what you have done, but kind of poke around the data and look at different aspect of it and try to figure out why/how does this issues happen.

It could be that old WES lib generates more artifact InDels, so looking at SNVs the TMB aligns with other pan-cancer study results paper. It could be that we should apply similar filtering as @jharenza mentioned in #556.

so me and @tkoganti are planning to apply those filtrations to see if that helps the TMB distribution.

@cansavvy
Copy link
Collaborator Author

It could be that old WES lib generates more artifact InDels, so looking at SNVs the TMB aligns with other pan-cancer study results paper. It could be that we should apply similar filtering as @jharenza mentioned in #556.

By old WES lib, are you referring what regions TCGA's WXS methods were designed to target? Is so the best way to bypass this, this is using more updated data (or WGS) like PCAWG (like #551 discusses). But if PCAWG is not available, what alternatives are you thinking of with these analyses you are doing?

so me and @tkoganti are planning to apply those filtrations to see if that helps the TMB distribution.

What filtrations are you referring to here?

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Feb 26, 2020

My read of this comment #548 (comment) is that the germline filtering discussed in #556 are the next steps. I suggest that we move any detailed discussion of next steps over to #556 and close this pull request. @cansavvy is splitting this particular pull request up (#557, #558 + upcoming PRs) for review.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
work in progress Used to label (non-draft) pull requests that are not yet ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants