This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

update docs for v15 release (#569)
* update docs for v15 release

update:
- download-data.sh
- release-notes.md
- data-formats.md
- data-files-descriptions.md

* Update doc/release-notes.md

remove typo

Co-Authored-By: Chante Bethell  <43576623+cbethell@users.noreply.github.com>

* Update release-notes.md

remove duplicate header

* Comment out fusion-summary until #578 is resolved

* Update doc/release-notes.md

fix spacing

Co-Authored-By: Chante Bethell  <43576623+cbethell@users.noreply.github.com>

Co-authored-by: Chante Bethell  <43576623+cbethell@users.noreply.github.com>
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Co-authored-by: jashapiro <jashapiro@gmail.com>
4 people authored Mar 2, 2020
1 parent 2a86c8e commit 661f644
Showing 5 changed files with 115 additions and 13 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -156,9 +156,9 @@ jobs:
name: Gene set enrichment analysis to generate GSVA scores
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

- run:
name: Fusion Summary
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"
# - run:
# name: Fusion Summary
# command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"

- run:
name: Add Shatterseek
9 changes: 6 additions & 3 deletions doc/data-files-description.md
@@ -13,7 +13,7 @@ This document contains information about all data files associated with this pro
+ **File description**
+ A *brief* one sentence description of what the file contains (e.g., bed files contain coordinates for features XYZ).

### current release (release-v14-20200203)
### current release (release-v15-20200228)

| **File name** | **File Type** | **Origin** | **File Description** |
|---------------|----------------|------------------------|-----------------------|
@@ -29,10 +29,12 @@ This document contains information about all data files associated with this pro
|`intersect_cds_lancet_strelka_mutect_WGS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Lancet, Strelka2, Mutect2 regions
|`intersect_strelka_mutect_WGS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Strelka2 and Mutect2 regions called
|`pbta-cnv-cnvkit-gistic.zip`| PBTA data file | [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run-gistic.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-cnvkit.seg` file input (WGS samples only)
|`pbta-cnv-consensus-gistic.zip`| PBTA data file | [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run_gistic_consensus.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-consensus.seg` file input (WGS samples only)
|`pbta-cnv-consensus-gistic.zip`| Analysis file | [Workflow](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/run-gistic/scripts/run-gistic-openpbta.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-consensus.seg` file input (WGS samples only)
|`pbta-cnv-cnvkit.seg.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - CNVkit [SEG file](https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg) (WGS samples only)
|`pbta-cnv-consensus.seg.gz` | Analysis file | [CNV consensus calls](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | Somatic Copy Number Variant - CNVkit [SEG file](https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg) (WGS samples only)
|`pbta-cnv-controlfreec.tsv.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - TSV file that is a merge of [ControlFreeC `*_CNVs` files](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT) (WGS samples only)
|`consensus_seg_annotated_cn_autosomes.tsv.gz` | Analysis file | [Focal CNV consensus calls](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation) | [TSV file](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation#scripts-and-notebooks) containing genes with copy number changes per biospecimen; autosomes only
|`consensus_seg_annotated_cn_x_and_y.tsv.gz` | Analysis file | [Focal CNV consensus calls](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation) | [TSV file](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation#scripts-and-notebooks) containing genes with copy number changes per biospecimen; sex chromosomes only
|`pbta-fusion-arriba.tsv.gz` | PBTA data file | [Gene fusion detection](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-fusion-detection); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | Fusion - [Arriba TSV](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/arriba-tsv-header.md), annotated with FusionAnnotator
|`pbta-fusion-putative-oncogenic.tsv` | Analysis file | [`analyses/fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | Filtered and prioritized fusions
|`pbta-fusion-recurrently-fused-genes-byhistology.tsv`| Analysis file | [`analyses/fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | Recurrently-fused genes tabulated by broad histology
@@ -77,4 +79,5 @@ This document contains information about all data files associated with this pro
|`WGS.hg38.strelka2.unpadded.bed` | Reference Regions File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 BROAD Institute interval calling list (restricted to Chr1-22,X,Y,M) used for Strelka2 variant caller
|`WGS.hg38.vardict.100bp_padded.bed` | Reference Regions File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | `WGS.hg38.mutect2.vardict.unpadded.bed` with each region padded by 100 bp used for VarDict variant caller
|`WXS.hg38.100bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Strelka2, Mutect2, and VarDict variant callers with each region padded by 100 bp
|`WXS.hg38.lancet.400bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Lancet variant callers with each region padded by 400 bp
|`WXS.hg38.lancet.400bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Lancet variant callers with each region padded by 400 bp
|`gencode.v19.basic.exome.hg38liftover.100bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for TCGA variant calling with each region padded by 100 bp; obtained from the [GDC website](https://gdc.cancer.gov/about-data/publications/mc3-20170)
10 changes: 9 additions & 1 deletion doc/data-formats.md
@@ -192,11 +192,19 @@ Copy number consensus calls from the copy number and structural variant callers

* `pbta-cnv-consensus.seg.gz` contains consensus segments and segment means (log R ratios) from two or more callers, as described in the [analysis README](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/README.md).

##### Focal Copy Number Files

Focal copy number files map the consensus calls (genomic segments) above to genes for downstream analysis and are produced by the [`analyses/focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation) module.
Note: these files contain biospecimens and genes with copy number changes; neutral regions are excluded.

- `consensus_seg_annotated_cn_autosomes.tsv.gz` contains focal gene copy number alterations for all autosomes.
- `consensus_seg_annotated_cn_x_and_y.tsv.gz` contains focal gene copy number alterations for the sex chromosomes.
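
Since the column layout of these gzipped TSV files is not spelled out here, a quick way to check it before loading a file is to print the header row. The following is a minimal sketch, not part of the repository: `list_columns` is a hypothetical helper, and the toy file with its column names is an invented stand-in for `consensus_seg_annotated_cn_autosomes.tsv.gz`, not the real schema.

```shell
#!/usr/bin/env bash
# pipefail is deliberately left off: head closes the pipe after one line,
# which would make gzip exit non-zero on large files.
set -eu

# Hypothetical helper: print the column names of a gzipped TSV, one per line,
# without writing a decompressed copy to disk.
list_columns() {
  gzip -cd "$1" | head -n 1 | tr '\t' '\n'
}

# Toy stand-in for a focal copy number file.
# The column names and the single data row are invented for illustration.
workdir="$(mktemp -d)"
toy="${workdir}/toy_focal_cn.tsv.gz"
printf 'biospecimen_id\tgene_symbol\tstatus\nBS_TOY0001\tGENE1\tgain\n' | gzip -c > "$toy"

list_columns "$toy"
```

Pointing `list_columns` at the real file in a downloaded release directory should show the actual column names.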

#### GISTIC Output File Formats

`pbta-cnv-cnvkit-gistic.zip` is the output of running GISTIC 2.0 on the CNVkit results (`pbta-cnv-cnvkit.seg`).
`pbta-cnv-consensus-gistic.zip` is the output of running GISTIC 2.0 on the CNV consensus calls (`pbta-cnv-consensus.seg.gz`), described above.
The scripts used to run GISTIC are linked here: [CNVkit](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run-gistic.sh) and [Consensus calls](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run_gistic_consensus.sh).
The scripts used to run GISTIC are linked here: [CNVkit](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run-gistic.sh) and [Consensus calls](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/run-gistic/scripts/run-gistic-openpbta.sh).

Note that GISTIC is run on the _entire cohort_ and therefore the output reflects regions that are significantly amplified or deleted across the entire cohort.

99 changes: 95 additions & 4 deletions doc/release-notes.md
@@ -1,5 +1,100 @@
# release notes
## current release
### release-v15-20200228
- release date: 2020-02-28
- status: available
- changes:
- Update Strelka2, Mutect2, and Lancet TCGA MAF files per [#257](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/257); files updated:
- pbta-tcga-snv-mutect2.vep.maf.gz
- pbta-tcga-snv-strelka2.vep.maf.gz
- pbta-tcga-snv-lancet.vep.maf.gz
- Add TCGA WXS BED regions file lifted from hg19 to hg38 and padded by 100bp, which was used for variant calling:
- gencode.v19.basic.exome.hg38liftover.100bp_padded.bed, obtained from [the GDC website](https://gdc.cancer.gov/about-data/publications/mc3-20170); NOTE: this BED file is almost certainly wrong per [#568](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/568).
- Replace pbta-cnv-consensus-gistic.zip with the file from the [OpenPBTA analysis master repository](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/run-gistic/results) now that GISTIC has been installed (see [comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/543#issuecomment-587514075)).
- Updated `pbta-histologies.tsv` to:
- Harmonize `Embryonal Tumor` and `Embryonal tumor` [#541](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/541)
- Change `disease_type_old` to `pathology_diagnosis` and `disease_type_new` to `integrated_diagnosis` per request in [comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/543#issuecomment-589248029).
- Add consensus focal CN files from [analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/46cf6ccb119312ccae6122ac94c51710df01f6da/analyses/focal-cn-file-preparation):
- consensus_seg_annotated_cn_autosomes.tsv.gz
- consensus_seg_annotated_cn_x_and_y.tsv.gz
- Updated fusion files to add additional kinase genes, remove residual polyA stranded samples, and fix order of filtering operations per [comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/539#issuecomment-587167746), [comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/521#issuecomment-582990375), [#530](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/530), [#553](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/530), and [PR #567](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/567):
- pbta-fusion-recurrently-fused-genes-byhistology.tsv
- pbta-fusion-putative-oncogenic.tsv
- pbta-fusion-recurrently-fused-genes-bysample.tsv
- folder structure:
```
data
└── release-v15-20200228
├── release-notes.md
├── data-files-description.md
├── StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed
├── StrexomeLite_hg38_liftover_100bp_padded.bed
├── WGS.hg38.lancet.300bp_padded.bed
├── WGS.hg38.lancet.unpadded.bed
├── WGS.hg38.mutect2.vardict.unpadded.bed
├── WGS.hg38.strelka2.unpadded.bed
├── WGS.hg38.vardict.100bp_padded.bed
├── WXS.hg38.100bp_padded.bed
├── WXS.hg38.lancet.400bp_padded.bed
├── md5sum.txt
├── pbta-cnv-cnvkit.seg.gz
├── pbta-cnv-consensus.seg.gz
├── pbta-cnv-controlfreec.tsv.gz
├── pbta-cnv-cnvkit-gistic.zip
├── pbta-cnv-consensus-gistic.zip
├── pbta-fusion-arriba.tsv.gz
├── pbta-fusion-starfusion.tsv.gz
├── pbta-fusion-putative-oncogenic.tsv
├── pbta-gene-counts-rsem-expected_count.polya.rds
├── pbta-gene-counts-rsem-expected_count.stranded.rds
├── pbta-gene-expression-kallisto.polya.rds
├── pbta-gene-expression-kallisto.stranded.rds
├── pbta-gene-expression-rsem-fpkm.polya.rds
├── pbta-gene-expression-rsem-fpkm.stranded.rds
├── pbta-histologies.tsv
├── pbta-snv-lancet.vep.maf.gz
├── pbta-snv-mutect2.vep.maf.gz
├── pbta-snv-strelka2.vep.maf.gz
├── pbta-snv-vardict.vep.maf.gz
├── pbta-sv-manta.tsv.gz
├── independent-specimens.wgs.primary-plus.tsv
├── independent-specimens.wgs.primary.tsv
├── independent-specimens.wgswxs.primary-plus.tsv
├── independent-specimens.wgswxs.primary.tsv
├── pbta-gene-expression-rsem-fpkm-collapsed.polya.rds
├── pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
├── pbta-gene-expression-rsem-tpm.polya.rds
├── pbta-gene-expression-rsem-tpm.stranded.rds
├── pbta-isoform-expression-rsem-tpm.polya.rds
├── pbta-isoform-expression-rsem-tpm.stranded.rds
├── pbta-isoform-counts-rsem-expected_count.polya.rds
├── pbta-isoform-counts-rsem-expected_count.stranded.rds
├── pbta-snv-consensus-mutation.maf.tsv.gz
├── pbta-snv-consensus-mutation-tmb-all.tsv
├── pbta-snv-consensus-mutation-tmb-coding.tsv
├── pbta-fusion-recurrently-fused-genes-byhistology.tsv
├── pbta-fusion-recurrently-fused-genes-bysample.tsv
├── pbta-tcga-snv-lancet.vep.maf.gz
├── pbta-tcga-snv-strelka2.vep.maf.gz
├── pbta-tcga-snv-mutect2.vep.maf.gz
├── pbta-tcga-manifest.tsv
├── pbta-mend-qc-results.tar.gz
├── pbta-mend-qc-manifest.tsv
├── pbta-star-log-final.tar.gz
├── pbta-star-log-manifest.tsv
├── intersect_cds_lancet_strelka_mutect_WGS.bed
├── intersect_cds_lancet.bed
├── intersect_strelka_mutect_WGS.bed
├── fusion_summary_embryonal_foi.tsv
├── fusion_summary_ependymoma_foi.tsv
├── gencode.v19.basic.exome.hg38liftover.100bp_padded.bed
├── consensus_seg_annotated_cn_autosomes.tsv.gz
└── consensus_seg_annotated_cn_x_and_y.tsv.gz
```
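
The folder listing includes an `md5sum.txt`. Assuming it follows the standard `md5sum` manifest layout (one `<hash>  <filename>` pair per line), which this document does not confirm, a downloaded release could be checked roughly as sketched below. `verify_release` is a hypothetical helper, the toy directory stands in for a real release folder, and `--quiet` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: verify every file named in md5sum.txt inside a release directory.
# Assumes GNU coreutils md5sum and a manifest in the standard "<md5>  <filename>" layout.
verify_release() {
  local release_dir="$1"
  # --quiet suppresses the per-file "OK" lines; mismatches are still reported.
  ( cd "$release_dir" && md5sum --quiet -c md5sum.txt )
}

# Toy release directory so the sketch runs without downloading anything.
toy_release="$(mktemp -d)/release-toy"
mkdir -p "$toy_release"
printf 'toy contents\n' > "${toy_release}/pbta-histologies.tsv"
( cd "$toy_release" && md5sum pbta-histologies.tsv > md5sum.txt )

verify_release "$toy_release" && echo "checksums OK"
```

`md5sum -c` returns a non-zero exit status if any listed file is missing or does not match, so the helper can gate later steps in a script.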



## archived release
### release-v14-20200203
- release date: 2020-02-03
- status: available
@@ -110,8 +205,6 @@ data
└── fusion_summary_ependymoma_foi.tsv
```



## archived releases
### release-v13-20200116
- release date: 2020-01-16
@@ -700,5 +793,3 @@ data
├── strelka2.maf.gz
└── tumor-normal-pair.tsv
```


4 changes: 2 additions & 2 deletions download-data.sh
@@ -4,8 +4,8 @@ set -o pipefail

# Use the OpenPBTA bucket as the default.
URL=${OPENPBTA_URL:-https://s3.amazonaws.com/kf-openaccess-us-east-1-prd-pbta/data}
RELEASE=${OPENPBTA_RELEASE:-release-v14-20200203}
PREVIOUS=${OPENPBTA_RELEASE:-release-v13-20200116}
RELEASE=${OPENPBTA_RELEASE:-release-v15-20200228}
PREVIOUS=${OPENPBTA_RELEASE:-release-v14-20200203}

# Remove old symlinks in data
find data -type l -delete
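
The hunk above replaces the v14/v13 defaults with v15/v14, using the `${VAR:-default}` expansion so that an environment variable can override each value. The sketch below shows how the two new assignments resolve; `resolve_releases` is a hypothetical wrapper written for illustration and is not part of `download-data.sh`. Because both assignments read `OPENPBTA_RELEASE`, setting that variable gives `RELEASE` and `PREVIOUS` the same value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper mirroring the two new assignments from the hunk.
resolve_releases() {
  local release="${OPENPBTA_RELEASE:-release-v15-20200228}"
  local previous="${OPENPBTA_RELEASE:-release-v14-20200203}"
  echo "RELEASE=${release} PREVIOUS=${previous}"
}

# Variable unset: each assignment falls back to its own default.
( unset OPENPBTA_RELEASE; resolve_releases )

# Variable set: both assignments pick up the same override.
( OPENPBTA_RELEASE=release-v14-20200203 resolve_releases )
```

Each call runs in a subshell so that the override does not leak into the rest of the session.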
