This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned Data Release: V16 #601

Closed
5 tasks
jharenza opened this issue Mar 4, 2020 · 15 comments
Labels
data, in progress

Comments

@jharenza
Collaborator

jharenza commented Mar 4, 2020

What data file(s) does this issue pertain to?

pbta-histologies.tsv
pbta-tcga-manifest.tsv
pbta-tcga-snv-lancet.vep.maf.gz
pbta-tcga-snv-mutect2.vep.maf.gz
pbta-tcga-snv-strelka2.vep.maf.gz

What release are you using?

V15

Put your question or report your issue here.

Placeholder for V16 release to include D3b team: @baileyckelly @chris-s-friedman @yuankunzhu @allisonheath

@jaclyn-taroni
Member

Because we made changes to the pbta-fusion-putative-oncogenic.tsv with v15, we've had to update the fusion-summary files that are included in a release (fusion_summary_embryonal_foi.tsv and fusion_summary_ependymoma_foi.tsv): https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/516201f91f174c495f6d7964f4a03d6c7c6fb227/analyses/fusion-summary/results

@allisonheath
Collaborator

For the v16 release, @baileyckelly and @chris-s-friedman are moving the histology file generation over to a new database workflow to make the manual changes more trackable. We are working on replicating v15 first, which we expect to finish by the end of this week, and will then start on the new issues noted in the ticket early next week. Will update if this changes.

jaclyn-taroni added the "in progress" label Mar 11, 2020
@allisonheath
Collaborator

allisonheath commented Mar 11, 2020

We'll also do #624.

@jaclyn-taroni
Member

@tkoganti gave me a heads up that the updated TCGA files are available pre-v16 if folks would like to take a look.

Here's the link to the s3 bucket and folder https://s3.amazonaws.com/kf-openaccess-us-east-1-prd-pbta/data/TCGA_mar-12-2020/

And the file list:

README.md
intersected_whole_exome_agilent_designed_120_AND_tcga_6k_genes.Gh38.bed
intersected_whole_exome_agilent_plus_tcga_6k_AND_tcga_6k_genes.Gh38.bed
pbta-tcga-manifest_all319.txt
pbta-tcga-snv-lancet.maf.gz
pbta-tcga-snv-mutect2.maf.gz
pbta-tcga-snv-strelka2.maf.gz
tcga_6k_genes.targetIntervals.bed
whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.Gh38.bed
whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed
whole_exome_agilent_designed_120.targetIntervals.Gh38.bed
whole_exome_agilent_designed_120.targetIntervals.bed
whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed

The README explains what each of these files is.

@yuankunzhu
Collaborator

yuankunzhu commented Mar 20, 2020

[note-to-self]

1. sync data from previous release

last_release='release-v15-20200228'
new_release='release-v16-20200320'
bucket='s3://kf-openaccess-us-east-1-prd-pbta/data'
aws s3 sync $bucket/$last_release/ $bucket/$new_release/
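
Since the sync writes straight into the public bucket, it may be worth previewing it first. A minimal sketch, assuming the three variables above; `build_sync_cmd` is a hypothetical helper (not part of the release tooling) that only prints the command, and `--dryrun` is the AWS CLI flag that lists what would be copied without copying anything:

```shell
# Hypothetical helper: print (do not run) the sync command with --dryrun appended.
build_sync_cmd() {
    # $1 = bucket prefix, $2 = previous release folder, $3 = new release folder
    printf 'aws s3 sync %s/%s/ %s/%s/ --dryrun\n' "$1" "$2" "$1" "$3"
}

# Print the preview command for the v15 -> v16 sync from the note above.
build_sync_cmd 's3://kf-openaccess-us-east-1-prd-pbta/data' \
    'release-v15-20200228' 'release-v16-20200320'
```

Run the printed command, review the listed copies, then drop `--dryrun` to do the real sync.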

2. new data

2.1 TCGA MAF

get overlapped file names between v15 and TCGA_mar-12-2020/

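## list all files in the new release folder (synced from v15) and in TCGA_mar-12-2020/
all_files=$(aws s3 ls --recursive "$bucket" | awk '{print $4}' | grep -E "$new_release|TCGA_mar-12-2020/")
## quote the variable so each key stays on its own line, then print duplicated basenames with counts
echo "$all_files" | xargs -I{} basename {} | sort | uniq -dc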

This returns no output, so no overlapping names? Check all TCGA MAF files:

echo "$all_files" | grep -i tcga | grep -i maf

returns below

data/TCGA_mar-12-2020/pbta-tcga-snv-lancet.maf.gz
data/TCGA_mar-12-2020/pbta-tcga-snv-mutect2.maf.gz
data/TCGA_mar-12-2020/pbta-tcga-snv-strelka2.maf.gz
data/release-v16-20200320/pbta-tcga-snv-lancet.vep.maf.gz
data/release-v16-20200320/pbta-tcga-snv-mutect2.vep.maf.gz
data/release-v16-20200320/pbta-tcga-snv-strelka2.vep.maf.gz

The MAF files are named differently in the TCGA_mar-12-2020/ folder and in the release folder. Copy the TCGA_mar-12-2020/ MAFs over the release copies, using the .vep extension to keep the naming consistent:

for caller in 'lancet' 'mutect2' 'strelka2'
do
    aws s3 cp $bucket/TCGA_mar-12-2020/pbta-tcga-snv-$caller.maf.gz $bucket/$new_release/pbta-tcga-snv-$caller.vep.maf.gz
done
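
The name mismatch found above can also be caught up front by normalizing names before looking for duplicates. A minimal sketch, assuming one object key per line on stdin; `normalized_dups` is a hypothetical helper, and treating `.vep.maf.gz` and `.maf.gz` as the same file is an assumption based on the listing above:

```shell
# Hypothetical helper: report basenames that occur more than once after
# dropping the directory part and the ".vep" infix before ".maf.gz".
normalized_dups() {
    sed -e 's|.*/||' -e 's|\.vep\.maf\.gz$|.maf.gz|' | sort | uniq -d
}

# Example with two of the keys listed above.
printf '%s\n' \
    'data/TCGA_mar-12-2020/pbta-tcga-snv-lancet.maf.gz' \
    'data/release-v16-20200320/pbta-tcga-snv-lancet.vep.maf.gz' |
    normalized_dups
```

With the real listing this would be fed from `echo "$all_files" | normalized_dups`.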

2.2 TCGA BED

echo "$all_files" | grep -i bed | xargs -I{} basename {} | sort | uniq -dc
## no overlaps, copy all bed to release folder
aws s3 sync $bucket/TCGA_mar-12-2020/ $bucket/$new_release --exclude "*" --include "*.bed"

## remove old tcga BED
aws s3 rm $bucket/$new_release/gencode.v19.basic.exome.hg38liftover.100bp_padded.bed
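
In the sync above, `--exclude "*"` drops everything and the later `--include "*.bed"` adds the BED files back, because filters that appear later on the AWS CLI command line take precedence. A minimal local sketch of the same selection, assuming one key per line on stdin; `preview_bed_keys` is a hypothetical helper that only mimics the filter, and `aws s3 sync --dryrun` remains the authoritative preview:

```shell
# Hypothetical helper: keep only keys that end in .bed, one key per input line.
preview_bed_keys() {
    while IFS= read -r key; do
        case "$key" in
            *.bed) printf '%s\n' "$key" ;;
        esac
    done
}

# Example with three of the file names from the TCGA_mar-12-2020/ listing.
printf '%s\n' 'README.md' 'tcga_6k_genes.targetIntervals.bed' \
    'pbta-tcga-snv-lancet.maf.gz' | preview_bed_keys
```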

2.3 TCGA Manifest check

$ curl -s https://s3.amazonaws.com/kf-openaccess-us-east-1-prd-pbta/data/TCGA_mar-12-2020/pbta-tcga-manifest_all319.txt | head
Normal_BAM	Tumor_BAM	tumorID	Primary_diagnosis
C494.TCGA-S9-A6U5-10A-01D-A33W-08.1_gdc_realn.bam	C494.TCGA-S9-A6U5-01A-12D-A33T-08.1_gdc_realn.bam	TCGA-S9-A6U5	Astrocytoma-NOS
C494.TCGA-TM-A7C4-10A-01D-A329-08.1_gdc_realn.bam	C494.TCGA-TM-A7C4-01A-11D-A32B-08.1_gdc_realn.bam	TCGA-TM-A7C4	Astrocytoma-NOS
C494.TCGA-HW-7490-10A-01D-2024-08.1_gdc_realn.bam	C494.TCGA-HW-7490-01A-11D-2024-08.1_gdc_realn.bam	TCGA-HW-7490	Astrocytoma-NOS
C494.TCGA-S9-A7R3-10A-01D-A34M-08.3_gdc_realn.bam	C494.TCGA-S9-A7R3-01A-11D-A34J-08.3_gdc_realn.bam	TCGA-S9-A7R3	Astrocytoma-NOS
C494.TCGA-HT-7680-10A-01D-2253-08.1_gdc_realn.bam	C494.TCGA-HT-7680-01A-11D-2253-08.1_gdc_realn.bam	TCGA-HT-7680	Astrocytoma-NOS
C494.TCGA-CS-4944-10A-01D-1468-08.3_gdc_realn.bam	C494.TCGA-CS-4944-01A-01D-1468-08.3_gdc_realn.bam	TCGA-CS-4944	Astrocytoma-NOS
C494.TCGA-P5-A5EW-10A-01D-A27N-08.4_gdc_realn.bam	C494.TCGA-P5-A5EW-01A-11D-A27K-08.4_gdc_realn.bam	TCGA-P5-A5EW	Astrocytoma-NOS
C494.TCGA-CS-6667-10A-01D-2024-08.1_gdc_realn.bam	C494.TCGA-CS-6667-01A-12D-2024-08.1_gdc_realn.bam	TCGA-CS-6667	Astrocytoma-NOS
C494.TCGA-S9-A7R7-10A-01D-A34M-08.3_gdc_realn.bam	C494.TCGA-S9-A7R7-01A-11D-A34J-08.3_gdc_realn.bam	TCGA-S9-A7R7	Astrocytoma-NOS

It looks like the manifest is missing the BED (capture kit) information, so add it: I modified get-tcga-capture_kit.py to get the capture kit for all 319 samples and saved the result as new-tcga-capture-kit-info.tsv.

add that as Capture_Kit column for the manifest:
pbta-tcga-manifest.txt

upload this to release folder and overwrite old pbta-tcga-manifest.tsv

aws s3 cp pbta-tcga-manifest.txt $bucket/$new_release/pbta-tcga-manifest.tsv
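
Adding the Capture_Kit column could look roughly like this. A minimal sketch, not the script actually used: it assumes the capture kit file is a two-column TSV (tumorID, Capture_Kit), which is a guess at the layout of new-tcga-capture-kit-info.tsv, and that tumorID is the third manifest column as in the `head` output above. `add_capture_kit` is a hypothetical helper:

```shell
# Hypothetical helper: append a Capture_Kit column to the manifest.
# $1 = kit info TSV (assumed columns: tumorID, Capture_Kit)
# $2 = manifest TSV (tumorID in column 3)
add_capture_kit() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { kit[$1] = $2; next }               # first file: store kit per tumorID
        FNR == 1  { print $0, "Capture_Kit"; next }    # manifest header: add column name
        { print $0, (($3 in kit) ? kit[$3] : "NA") }   # manifest rows: look up by tumorID
    ' "$1" "$2"
}
```

Usage would be along the lines of `add_capture_kit new-tcga-capture-kit-info.tsv pbta-tcga-manifest_all319.txt > pbta-tcga-manifest.txt`; samples without a match get `NA` so they are easy to spot.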

2.4 histologies file
move histologies file update to V17 #656

2.5 new fusion results

fusions_path='analyses/fusion_filtering/results'
aws s3 cp $fusions_path/pbta-fusion-putative-oncogenic.tsv $bucket/$new_release/
aws s3 cp $fusions_path/pbta-fusion-recurrently-fused-genes-byhistology.tsv $bucket/$new_release/
aws s3 cp $fusions_path/pbta-fusion-recurrently-fused-genes-bysample.tsv $bucket/$new_release/


3. update release doc
- [x] md5sum
- [x] release note
- [ ] download script
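
For the md5sum item, a minimal sketch, assuming the release files sit in a local directory and that the checksum file is called md5sum.txt (an assumed name, not confirmed in this thread). `write_md5sums` is a hypothetical helper and relies on the coreutils `md5sum` command:

```shell
# Hypothetical helper: write md5sum.txt for every regular file in a directory.
# $1 = local directory holding the release files
write_md5sums() {
    out=$(mktemp) || return 1
    (
        cd "$1" || exit 1
        for f in *; do
            # Skip subdirectories and any previous checksum file.
            if [ -f "$f" ] && [ "$f" != md5sum.txt ]; then
                md5sum "$f"
            fi
        done
    ) > "$out" && mv "$out" "$1/md5sum.txt"
}
```

After downloading a release, `md5sum -c md5sum.txt` run inside that directory checks every file against the recorded hashes.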

@yuankunzhu
Collaborator

@tkoganti for the new TCGA BED files you put on TCGA_mar-12-2020/
I didn't see the Gh38 version for tcga_6k_genes.targetIntervals.bed and the original/un-liftover'd for whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed?

@yuankunzhu
Collaborator

yuankunzhu commented Mar 20, 2020

@tkoganti I think we need to also add the BED information to the manifest.
It seems the TCGA manifest is missing the capture kit information; I will add it for this release.

@tkoganti
Collaborator

tkoganti commented Mar 20, 2020

> @tkoganti for the new TCGA BED files you put on TCGA_mar-12-2020/
> I didn't see the Gh38 version for tcga_6k_genes.targetIntervals.bed and the original/un-liftover'd for whole_exome_agilent_plus_tcga_6k.targetIntervals.Gh38.bed?

Just uploaded data/TCGA_mar-12-2020/tcga_6k_genes.targetIntervals.padded.Gh38.bed

This file from March 12 is in the S3 bucket - data/TCGA_mar-12-2020/whole_exome_agilent_plus_tcga_6k.targetIntervals.bed - Is this what you are asking for?

@yuankunzhu
Collaborator

yuankunzhu commented Mar 20, 2020

> This file from March 12 is in the S3 bucket - data/TCGA_mar-12-2020/whole_exome_agilent_plus_tcga_6k.targetIntervals.bed - Is this what you are asking for?

$ aws s3 ls $bucket/TCGA_mar-12-2020/ | grep whole_exome_agilent_plus_tcga_6k.targetIntervals.bed
2020-03-20 13:22:34    7684576 whole_exome_agilent_plus_tcga_6k.targetIntervals.bed

it looks like it's just uploaded?

But anyway, thanks for checking. I think we have all the BED files in place now.
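
Side note on the lookup above: in `grep`, an unescaped `.` matches any character, and the pattern also matches longer names that contain it. A minimal sketch of an exact-name lookup on `aws s3 ls` output, assuming the usual four-column listing (date, time, size, key) and key names without spaces; `last_modified` is a hypothetical helper:

```shell
# Hypothetical helper: print the date and time of the listing row whose key
# is exactly the given name.
# stdin = `aws s3 ls` output, $1 = exact key name
last_modified() {
    awk -v name="$1" '$4 == name { print $1, $2 }'
}

# Example using the listing row shown above.
printf '%s\n' '2020-03-20 13:22:34    7684576 whole_exome_agilent_plus_tcga_6k.targetIntervals.bed' |
    last_modified 'whole_exome_agilent_plus_tcga_6k.targetIntervals.bed'
```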

@tkoganti
Collaborator

> > This file from March 12 is in the S3 bucket - data/TCGA_mar-12-2020/whole_exome_agilent_plus_tcga_6k.targetIntervals.bed - Is this what you are asking for?
>
> $ aws s3 ls $bucket/TCGA_mar-12-2020/ | grep whole_exome_agilent_plus_tcga_6k.targetIntervals.bed
> 2020-03-20 13:22:34    7684576 whole_exome_agilent_plus_tcga_6k.targetIntervals.bed
>
> it looks like it's just uploaded?

[Screenshot: Screen Shot 2020-03-20 at 1 33 56 PM]

Hmm, it said March 12 when I first looked, and then I uploaded it again. Not sure if it was overwritten.

This is the capture kit info used for the new samples. But might be easier to generate again with all 319 samples. Let me know if you want me to create that -
tcga-capture_kit-info_NEW.txt

@kgaonkar6
Collaborator

kgaonkar6 commented Mar 20, 2020

@yuankunzhu @jaclyn-taroni Adding files updated from fusion-filtering analysis https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering/results

Changes #621 :
A QC step that identifies and removes fusions found in multiple (more than 4) histologies was added to the fusion-filtering analysis.
Updated files:

  • pbta-fusion-putative-oncogenic.tsv
  • pbta-fusion-recurrently-fused-genes-byhistology.tsv
  • pbta-fusion-recurrently-fused-genes-bysample.tsv
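
To illustrate the QC rule described above (this is only a sketch of the idea, not the code from #621): assuming a TSV with a header row, the fusion name in column 1 and the histology in column 2 (both column positions are assumptions), fusions seen in more than a given number of distinct histologies can be dropped like this. `drop_multi_histology_fusions` is a hypothetical helper:

```shell
# Hypothetical helper: drop fusions seen in more than $1 distinct histologies.
# stdin = TSV with header; assumed columns: 1 = fusion name, 2 = histology
drop_multi_histology_fusions() {
    awk -F'\t' -v max="$1" '
        NR == 1 { header = $0; next }
        {
            rows[NR] = $0; name[NR] = $1
            # Count each (fusion, histology) pair only once.
            if (!(($1, $2) in seen)) { seen[$1, $2] = 1; nhist[$1]++ }
        }
        END {
            print header
            for (i = 2; i <= NR; i++)
                if (nhist[name[i]] <= max) print rows[i]
        }
    '
}
```

Calling it as `drop_multi_histology_fusions 4 < fusions.tsv` would keep only fusions found in at most 4 histologies; `fusions.tsv` is a placeholder file name.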

@baileyckelly
Collaborator

Hi there -
From reviewing tickets, can we confirm that the following tickets are likely not going to be ready for v16 for the pbta-histologies file?
#608 - blocked
#509 - Target for v17
#245 - Target for v17

We will be adding in the following ticket(s), though for the histologies file:
#637 - Data refresh for pbta-histologies file

@jaclyn-taroni
Member

> Hi there -
> From reviewing tickets, can we confirm that the following tickets are likely not going to be ready for v16 for the pbta-histologies file?
> #608 - blocked
> #509 - Target for v17
> #245 - Target for v17
>
> We will be adding in the following ticket(s), though for the histologies file:
> #637 - Data refresh for pbta-histologies file

Yes, sounds good. I will edit the original post to match this.

@yuankunzhu
Collaborator

yuankunzhu commented Mar 27, 2020

The D3b data team is still working on the clinical data updates, so we suggest moving all changes related to pbta-histologies.tsv to v17 (#656). This release will therefore have changes for:

cc @jaclyn-taroni and @baileyckelly

yuankunzhu mentioned this issue Mar 27, 2020
@jaclyn-taroni
Member

Closed via #657


7 participants