TCGA vs PBTA exploratory analysis #548

cansavvy · 2020-02-19T21:26:37Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Investigating issues with TCGA vs PBTA comparison.

What was your approach?

This code evaluates:

SNV caller's reported sequencing depth
TMB' s calculated by caller
Overlap between the WXS target regions of PBTA and TCGA

What GitHub issue does your pull request address?

#521

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

None yet. Not ready for review.

Results

Reproducibility Checklist

These things aren't done because this is just for reference for now.

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

…plor

…-consensus-explor

cansavvy · 2020-02-21T15:36:16Z

@yuankunzhu @migbro A quick side question that came up in the meeting, we wanted to know if PBTA WXS data looked as odd as the TCGA WXS data did. So I isolated the WXS samples only and re-ran the upsetter plot:

PBTA WXS samples

Although Lancet does not have as many calls it's making by itself, the VAF for the PBTA WXS samples is lower than it was for the WGS samples. I'd say it doesn't necessarily as bad as it did for the TCGA WXS samples, but its not great.

tkoganti · 2020-02-21T16:16:23Z

There were a couple hyper mutated samples that contributed to almost ~24K variants out of the 40K variants in the TCGA dataset. We made some violin plots but even after removing those two samples, lancet did not show a lot of improvement
violinplots_lancet_mutectandstrelka.pptx

jashapiro · 2020-02-21T16:30:22Z

I recall @migbro and/or @yuankunzhu mentioning that there were different recommended settings for lancet on WXS samples. I honestly do not have great confidence that would make enough of a difference to justify including it in the consensus for the TCGA/PBTA comparison.

One question to @cansavvy is whether there are enough matched samples where we have both WXS and WGS to look at the differences within the same specimens for the various callers?

(I hope that we are able to take advantage of the PCAWG data as discussed in #551, as that will be more compatible with the majority of our samples!)

cansavvy · 2020-02-21T16:43:03Z

One question to @cansavvy is whether there are enough matched samples where we have both WXS and WGS to look at the differences within the same specimens for the various callers?

@jashapiro As the data are in v14 according to the metadata file, pbta-histologies.tsv, I don't see that there are biospecimens with both WXS and WGS data. Do paired WGS and WXS for a single biospecimen exist some place?

migbro · 2020-02-21T16:43:49Z

So one thing I will mention, in general, I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture. Anything beyond those regions, especially when you're like, 15bp beyond say is probably something that would need a better justification to look at. Does that make sense? As a disclaimer, I acknowledge that the effective padding is way more that "a little," but I think the point stands.

jaclyn-taroni · 2020-02-21T16:46:54Z

I don't see that there are biospecimens with both WXS and WGS data. Do paired WGS and WXS for a single biospecimen exist some place?

@cansavvy - biospecimen IDs seem to be essentially library identifiers, try looking at other identifiers (e.g., sample_id).

cansavvy · 2020-02-21T17:00:10Z

@cansavvy - biospecimen IDs seem to be essentially library identifiers, try looking at other identifiers (e.g., sample_id).

I looked under sample_id, I do not see any samples that have both WXS and WGS data in this data release.

cansavvy · 2020-02-21T17:17:24Z

@migbro

I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture.

TMB's are calculated with coding sequences only, so this won't be relevant for that, but to test your idea about the padding, I've added a notebook and pasted the main graph below. It looks pretty conclusive that Lancet does just fine for WGS but does not do well for WXS.

jashapiro · 2020-02-21T17:18:50Z

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

cansavvy · 2020-02-21T17:20:03Z

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

@jashapiro , I checked for this too, didn't mention it, but there are no WXS / WGS participant ID matches either (Using Kids_First_Participant_ID).

…-consensus-explor

cansavvy · 2020-02-24T14:28:12Z

@cansavvy If there are some with the same participant id, it is still probably a reasonable comparison. Not as good, but there is no particular reason to expect bias that I can think of. I wouldn’t necessarily expect the same exact mutations, but similar distributions.

Here's the results of the within participant WGS and WXS comparison. It appears to show again that Lancet is not working well on WXS data.
By participant:

Lumped all together

…-consensus-explor

yuankunzhu · 2020-02-25T11:07:54Z

@cansavvy can you point me to the notebook you mentioned below. Try to figure out what BED/intervals had been applied. Was also looking at the data release folder, I didn't find the TCGA bed files there neither. also cc @jharenza @tkoganti.

@migbro

I think when doing vaf plots and calculating TMBs, the regions being used, if they aren't already, really ought to be those in the UN-padded bed file. I normally run WXS with some padding to try and catch some splice region variants, as I feel these capture assays will leak a bit into the intron, but if you want to feel more confident about the distribution of calls, I'd limit that to only the regions that the capture assay was designed to capture.

TMB's are calculated with coding sequences only, so this won't be relevant for that, but to test your idea about the padding, I've added a notebook and pasted the main graph below

I think what @migbro was trying to hint is that we should really use the mutations within the TCGA WXS capture regions, bc we won't catch any mutation outside of it, and I was wondering if we were using the entire coding region instead before, but maybe we were and i missed it?

Also FYI, just simply count all the "coding changes" without any filtering, the TMB (although it's just counts here, but can be proportional relevant) between PBTA-WGS vs TCGA-WXS seems to align the pan cancer paper results better. i hope that would be more accurate if we only focused on the TCGA region.

yuankunzhu · 2020-02-25T11:15:29Z

and re the LANCET called low VAF enrichment, I'm also wondering if those are artifacts caused by some old the lib prep. The GDC found a different but to me a similar issue with Mutect2 for samples based on whole-genome amplification (repli-g) WXS libs. as it tends to produce more soft-clip alignment that causes confusions for the caller to output more artificial indels. The TCGA-GBM data is def one of those lib-based project. So from that perspective, I wont be too surprise that when the denovo assembly algs come in, the softclipped bases get assembled into weird calls, but we need to take a closer look and investigate more on this.

https://gdc.cancer.gov/content/mutect2-insertion-artifacts FYI

jashapiro · 2020-02-25T12:31:21Z

@yuankunzhu

Also FYI, just simply count all the "coding changes" without any filtering,

Can you share the code you used to do this? It would be useful to know if and where we might be differing on what is counted as "coding". Thanks!

cansavvy · 2020-02-25T13:26:00Z

@cansavvy can you point me to the notebook you mentioned below. Try to figure out what BED/intervals had been applied.

The notebook itself is included in this PR as but the rendered version of the notebook can be seen here: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html

I used the BED intervals provided for WXS for Lancet in the data release WXS.hg38.lancet.400bp_padded.bed.

jaclyn-taroni · 2020-02-25T13:38:31Z

Looks like that link to the rendered version should be: https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/lancet-padded-vs-unpadded.nb.html

yuankunzhu · 2020-02-25T13:44:52Z

@jashapiro

Can you share the code you used to do this? It would be useful to know if and where we might be differing on what is counted as "coding". Thanks!

Yeah, it was just simply a zgrep handle of searching 'Mutation|Splice|Silent', i can put it in a pr if needed, but it was something like (pseudocode)

s3=s3://kf-openaccess-us-east-1-prd-pbta/data/release-v14-20200203/
zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent'

yuankunzhu · 2020-02-25T13:48:47Z

I used the BED intervals provided for WXS for Lancet in the data release WXS.hg38.lancet.400bp_padded.bed.

Thanks @cansavvy, that's what i thought, that was padded on top of the PNOC WXS capture regions.

cansavvy · 2020-02-25T14:11:33Z

Thanks @cansavvy, that's what i thought, that was padded on top of the PNOC WXS capture regions.

@yuankunzhu Does this mean I should be doing something differently in this analysis?

cansavvy · 2020-02-25T14:26:43Z

@yuankunzhu , can you tell me a bit more about how you picked these mutation categories? This would exclude most Indels if we are understanding you correctly. Since they are labeled as Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins.

zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent'

yuankunzhu · 2020-02-25T14:28:25Z

@yuankunzhu , can you tell me a bit more about how you picked these mutation categories? This would exclude most Indels if we are understanding you correctly. Since they are labeled as Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins.
zcat $s3/pbat-*-maf.gz | sed 1,2d | cut -f9,16 | egrep 'Mutation|Splice|Silent' 

yeah i was focusing on SNVs, so skipped InDels for this, just for quick check.

cansavvy · 2020-02-25T15:17:38Z

yeah i was focusing on SNVs, so skipped InDels for this, just for quick check.

@yuankunzhu I think this is asking a slightly different question then. Although part of the problem with TMB is the lack of standard definition. That being said, according to the paper on the issue just posted #556 they include indels:

TMB was defined as the number of somatic, coding, base substitution, and indel mutations per megabase of genome examined. All base substitutions and indels in the coding region of targeted genes, including synonymous alterations, are initially counted before filtering as described below.

This all being said, this still doesn't bypass our issues we are seeing with Lancet's processing of WXS data.

yuankunzhu · 2020-02-26T18:09:42Z

I think this is asking a slightly different question then. Although part of the problem with TMB is the lack of standard definition. That being said, according to the paper on the issue just posted #556 they include indels:

Hi @cansavvy, yes you are right, what I posted up there is different from the one you had in your slides where you did the TMB with all coding mutations (SNVs and InDels).

Sorry that if I didn't make myself clear, but the idea is not to try to re-do what you have done, but kind of poke around the data and look at different aspect of it and try to figure out why/how does this issues happen.

It could be that old WES lib generates more artifact InDels, so looking at SNVs the TMB aligns with other pan-cancer study results paper. It could be that we should apply similar filtering as @jharenza mentioned in #556.

so me and @tkoganti are planning to apply those filtrations to see if that helps the TMB distribution.

cansavvy · 2020-02-26T18:31:33Z

It could be that old WES lib generates more artifact InDels, so looking at SNVs the TMB aligns with other pan-cancer study results paper. It could be that we should apply similar filtering as @jharenza mentioned in #556.

By old WES lib, are you referring what regions TCGA's WXS methods were designed to target? Is so the best way to bypass this, this is using more updated data (or WGS) like PCAWG (like #551 discusses). But if PCAWG is not available, what alternatives are you thinking of with these analyses you are doing?

so me and @tkoganti are planning to apply those filtrations to see if that helps the TMB distribution.

What filtrations are you referring to here?

jaclyn-taroni · 2020-02-26T18:46:31Z

My read of this comment #548 (comment) is that the germline filtering discussed in #556 are the next steps. I suggest that we move any detailed discussion of next steps over to #556 and close this pull request. @cansavvy is splitting this particular pull request up (#557, #558 + upcoming PRs) for review.

cansavvy added 13 commits February 12, 2020 09:12

Push a notebook of scratch work

9704f4b

Push n_depth exploration

0b897ba

Comparing WXS to WXS

c28fd79

Tried some things; added more notes

c3b0174

Merge remote-tracking branch 'upstream/master' into tcga-consensus-ex…

d7931d9

…plor

Re-arrange and functionalize some things

404cd19

Add in sina plotting function

5eeeb74

functionalizing things

f2c7c7c

notebooks

ebbab79

Add the VennDiagrams code and etc.

02b1095

Add reduce steps and the sample sizes to the plots

212d72e

Add plots and some more notes

58eb929

tidy up some more things. Cut out some unused functions

ec73fc7

cansavvy added the work in progress Used to label (non-draft) pull requests that are not yet ready for review label Feb 19, 2020

cansavvy added 4 commits February 19, 2020 16:26

Merge branch 'master' into tcga-consensus-explor

6eafdad

Update rendered notebook

959018f

Merge remote-tracking branch 'origin/tcga-consensus-explor' into tcga…

919f08d

…-consensus-explor

Start trying to organize the functions better

d7a7443

cansavvy mentioned this pull request Feb 20, 2020

Proposed Analysis: PCAWG WGS Brain samples to run through SNV caller pipeline #551

Closed

Push exploration notebooks and their rendered versions

2c0aadf

cansavvy added 3 commits February 24, 2020 09:07

Lancet padded vs not notebook

51c41b6

Merge remote-tracking branch 'origin/tcga-consensus-explor' into tcga…

6cc2d13

…-consensus-explor

Add the WXS vs WGS notebook

17f1469

cansavvy added 4 commits February 24, 2020 09:31

Merge branch 'master' into tcga-consensus-explor

bdd7c2f

Add documentation

721dc11

Merge remote-tracking branch 'origin/tcga-consensus-explor' into tcga…

355354c

…-consensus-explor

Run all notebooks and add them

a20f440

reoorganize notebooks

1ff0dfb

jharenza mentioned this pull request Feb 25, 2020

Updated analysis: PBTA vs TCGA TMB analysis #556

Closed

This was referenced Feb 25, 2020

Lancet exploration notebook series: Part 1 TCGA-PBTA comparison variations #557

Merged

Lancet exploration notebook series: Part 2 Lancet within participant WXS-WGS test #558

Merged

jaclyn-taroni closed this Feb 26, 2020

cansavvy mentioned this pull request Feb 26, 2020

Lancet exploration notebook series: Part 3 Lancet padding vs unpadding #561

Merged

5 tasks

cansavvy deleted the tcga-consensus-explor branch February 28, 2020 21:01

yuankunzhu mentioned this pull request Mar 4, 2020

Add Analyses for TCGA brain tumor WXS capture kit investigation #597

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TCGA vs PBTA exploratory analysis #548

TCGA vs PBTA exploratory analysis #548

cansavvy commented Feb 19, 2020

cansavvy commented Feb 21, 2020 •

edited

Loading

tkoganti commented Feb 21, 2020

jashapiro commented Feb 21, 2020

cansavvy commented Feb 21, 2020

migbro commented Feb 21, 2020

jaclyn-taroni commented Feb 21, 2020

cansavvy commented Feb 21, 2020

cansavvy commented Feb 21, 2020 •

edited

Loading

jashapiro commented Feb 21, 2020

cansavvy commented Feb 21, 2020

cansavvy commented Feb 24, 2020

yuankunzhu commented Feb 25, 2020 •

edited

Loading

yuankunzhu commented Feb 25, 2020 •

edited

Loading

jashapiro commented Feb 25, 2020

cansavvy commented Feb 25, 2020 •

edited

Loading

jaclyn-taroni commented Feb 25, 2020

yuankunzhu commented Feb 25, 2020 •

edited

Loading

yuankunzhu commented Feb 25, 2020

cansavvy commented Feb 25, 2020

cansavvy commented Feb 25, 2020

yuankunzhu commented Feb 25, 2020

cansavvy commented Feb 25, 2020

yuankunzhu commented Feb 26, 2020

cansavvy commented Feb 26, 2020

jaclyn-taroni commented Feb 26, 2020 •

edited

Loading

TCGA vs PBTA exploratory analysis #548

TCGA vs PBTA exploratory analysis #548

Conversation

cansavvy commented Feb 19, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Results

Reproducibility Checklist

Documentation Checklist

cansavvy commented Feb 21, 2020 • edited Loading

tkoganti commented Feb 21, 2020

jashapiro commented Feb 21, 2020

cansavvy commented Feb 21, 2020

migbro commented Feb 21, 2020

jaclyn-taroni commented Feb 21, 2020

cansavvy commented Feb 21, 2020

cansavvy commented Feb 21, 2020 • edited Loading

jashapiro commented Feb 21, 2020

cansavvy commented Feb 21, 2020

cansavvy commented Feb 24, 2020

yuankunzhu commented Feb 25, 2020 • edited Loading

yuankunzhu commented Feb 25, 2020 • edited Loading

jashapiro commented Feb 25, 2020

cansavvy commented Feb 25, 2020 • edited Loading

jaclyn-taroni commented Feb 25, 2020

yuankunzhu commented Feb 25, 2020 • edited Loading

yuankunzhu commented Feb 25, 2020

cansavvy commented Feb 25, 2020

cansavvy commented Feb 25, 2020

yuankunzhu commented Feb 25, 2020

cansavvy commented Feb 25, 2020

yuankunzhu commented Feb 26, 2020

cansavvy commented Feb 26, 2020

jaclyn-taroni commented Feb 26, 2020 • edited Loading

cansavvy commented Feb 21, 2020 •

edited

Loading

cansavvy commented Feb 21, 2020 •

edited

Loading

yuankunzhu commented Feb 25, 2020 •

edited

Loading

yuankunzhu commented Feb 25, 2020 •

edited

Loading

cansavvy commented Feb 25, 2020 •

edited

Loading

yuankunzhu commented Feb 25, 2020 •

edited

Loading

jaclyn-taroni commented Feb 26, 2020 •

edited

Loading