Proposed Analysis: PCAWG WGS Brain samples to run through SNV caller pipeline #551
Comments
@cansavvy @jharenza, the requested query contains data hosted on both EGA and PDC. For now, we only have access to PDC, which holds 110 BAMs from 60 donors (do we know why the total is not 120, btw?). We can start by downloading and looking at those first, and we probably need someone to submit the EGA request. Do we know those subjects' ages at diagnosis/sequencing? We might want to exclude any pediatric samples from the adult TMB calculation. Also, on the other hand, we had previously downloaded and processed
@yuankunzhu do you have the breakdown of cancer types for the 110 BAMs from 60 donors? @cansavvy @jaclyn-taroni @cgreene - I can make a request for this data, but I am currently held up with our contracts office, which is still approving an ICGC DACO for another project, and I don't have a clear idea of how long this will take. It looks like no one at CHOP has ICGC access, and the office told me they wanted to make some agreement modifications. In the meantime, should we just plan to use Mutect2/Strelka2 for these comparisons, using the TCGA data we have access to, and/or add more samples from TCGA if we do not have a good brain cohort from PCAWG?
@jharenza I can't find the detailed cancer types for those samples. The only thing I can find from the query is the projects they originated from. Looks like they have TCGA-LGG and GBM there?
As an update on this, I am still working with CHOP legal to get this access request documentation approved before I can go back to ICGC to submit the final application. I should know more Thursday. @yuankunzhu - did you mention that we lost data access to these files?
@jharenza, we still have those data in the bucket; we just need the DevOps team to renew our S3 access credentials so that we can access them on Cavatica.
@stefankies can you work with Allison on this^^
@yuankunzhu - were you able to process any of this data? In the meantime, CHOP legal was working on this agreement as of 5/19. I just sent a follow-up.
As an update, CHOP has approved this agreement and it was sent to ICGC on July 3 for final approval. They will respond within 15 business days.
Closing, as we still have not gotten access to these data.
What are the scientific goals of the analysis?
Following Grobner et al, 2019, we want to compare tumor mutation burden in our pediatric cohort with that of adult brain tumors.
This is a continuation of the goals of #257 and #481, which were originally to be addressed with TCGA data. However, upon running the TCGA data through the pipelines, we encountered problems that we believe may be due to its dated WXS target regions, short reads, or shallower read depth. These results are documented in two draft PRs: #548 and #521.
Here's a summary report:
TCGAvsPBTAconsensus.pdf
What methods do you plan to use to accomplish the scientific goals?
In our video chat meeting, we discussed switching the adult brain tumor comparison data to the recently published PCAWG data.
This data has WGS samples, and is much more recent, which we hope will minimize the liftover and target region comparison issues we've been having between PBTA and TCGA data.
What input data are required for this analysis?
I'm posting this TSV file with the list of files that I believe we will want for this analysis:
pcawg_brain_wgs_samples.tsv.zip
I believe we would want the BAM files listed in this file to be run through Lancet, Strelka2, and Mutect2 in the same manner that the PBTA data was.
How I obtained this file list:
This data is on ICGC's repositories
I searched for all WGS, PCAWG study, brain samples that have BAM files for both blood and solid primary tumor
SQL Query to get this:
This link will also get you to this list: https://icgc.org/4ov
I exported this table as TSV and then removed the `mini` bam files. These `mini` files appear to be copies of the regular-size bams. I filtered those out with:
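A minimal sketch of such a filter, not necessarily the command originally used. It assumes the exported TSV has columns named `File Name`, `ICGC Donor`, and `Specimen Type`, and that mini files carry `mini` in their file name; all of these are assumptions to check against the real export. It also flags donors that lack either a normal or a tumor BAM, which may help explain a BAM count that is not exactly two per donor.

```python
import csv
import io
from collections import defaultdict

# Column names are assumptions about the exported TSV; adjust to the real header.
NAME_COL = "File Name"
DONOR_COL = "ICGC Donor"
TYPE_COL = "Specimen Type"


def load_rows(handle):
    """Read a tab-separated file handle into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def drop_mini_bams(rows, name_col=NAME_COL):
    """Keep only rows whose file name does not contain 'mini' (assumed marker)."""
    return [r for r in rows if "mini" not in r[name_col].lower()]


def donors_missing_pair(rows, donor_col=DONOR_COL, type_col=TYPE_COL):
    """Return donors that do not have both a normal and a tumor BAM."""
    seen = defaultdict(set)
    for r in rows:
        # Assumes normal specimens include the word 'normal' in their type.
        kind = "normal" if "normal" in r[type_col].lower() else "tumor"
        seen[r[donor_col]].add(kind)
    return sorted(d for d, kinds in seen.items() if kinds != {"normal", "tumor"})


# Tiny made-up example; donor IDs and file names are hypothetical.
SAMPLE_TSV = (
    "File Name\tICGC Donor\tSpecimen Type\n"
    "a.bam\tDO1\tNormal - blood derived\n"
    "b.bam\tDO1\tPrimary tumour - solid tissue\n"
    "b.mini.bam\tDO1\tPrimary tumour - solid tissue\n"
    "c.bam\tDO2\tPrimary tumour - solid tissue\n"
)

if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE_TSV))
    kept = drop_mini_bams(rows)
    print(len(kept), "files kept")
    print("donors without a tumor/normal pair:", donors_missing_pair(kept))
```

To run this on the real list, replace `io.StringIO(SAMPLE_TSV)` with an open handle on the exported TSV, for example `open("pcawg_brain_wgs_samples.tsv", newline="")`.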
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
Whoever is going to be running the samples through the caller should probably answer this question.
Who will complete the analysis (please add a GitHub handle here if relevant)?
??
What relevant scientific literature relates to this analysis?
Grobner et al, 2019
PCAWG 2020 paper