diff --git a/doc/data-files-description.md b/doc/data-files-description.md index 01ab98f093..863d5670d8 100644 --- a/doc/data-files-description.md +++ b/doc/data-files-description.md @@ -13,13 +13,10 @@ This document contains information about all data files associated with this pro + **File description** + A *brief* one sentence description of what the file contains (e.g., bed files contain coordinates for features XYZ). - - -### current release (release-v13-20200116) +### current release (release-v14-20200203) | **File name** | **File Type** | **Origin** | **File Description** | |---------------|----------------|------------------------|-----------------------| -|`cnv_consensus.tsv`| Analysis file | [`analyses/copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | Consensus calls from ControlFreeC, Manta, and CNVKit |`fusion_summary_embryonal_foi.tsv`| Analysis file | [`analysis/fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary) | Summary file for presence of embryonal tumor fusions of interest |`fusion_summary_ependymoma_foi.tsv`| Analysis file | [`analysis/fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary) | Summary file for presence of ependymal tumor fusions of interest |`gencode.v27.primary_assembly.annotation.gtf.gz` | Reference file | GENCODE v27 | hg38 gene annotation on primary assembly (reference chromosomes and scaffolds) @@ -28,12 +25,14 @@ This document contains information about all data files associated with this pro |`independent-specimens.wgs.primary.tsv` | Analysis file | [`analyses/independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | Independent specimens list for WGS samples, primary only |`independent-specimens.wgswxs.primary-plus.tsv` | Analysis file | 
[`analyses/independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | Independent specimens list for WGS and WXS samples, primary + non-primary when no primary sample is available |`independent-specimens.wgswxs.primary.tsv` | Analysis file | [`analyses/independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | Independent specimens list for WGS and WXS samples, primary only -|`intersect_exon_lancet_strelka_mutect_WGS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Lancet, Strelka2, Mutect2 regions -|`intersect_exon_WXS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with WXS 100bp padded BED regions +|`intersect_cds_lancet.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with WXS 100bp padded BED regions and Lancet's WXS regions +|`intersect_cds_lancet_strelka_mutect_WGS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Lancet, Strelka2, Mutect2 regions |`intersect_strelka_mutect_WGS.bed` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Intersection of `gencode.v27.primary_assembly.annotation.gtf.gz` CDS with Strelka2 and Mutect2 regions called -|`pbta-cnv-cnvkit-gistic.zip`| PBTA data file | 
[Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/run-gistic.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-cnvkit.seg` file input (WGS samples only) -|`pbta-cnv-cnvkit.seg.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - CNVkit [SEG file](https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg) (WGS samples only) -|`pbta-cnv-controlfreec.tsv.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - TSV file that is a merge of [ControlFreeC `*_CNVs` files](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT) (WGS samples only) +|`pbta-cnv-cnvkit-gistic.zip`| PBTA data file | [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run-gistic.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-cnvkit.seg` file input (WGS samples only) +|`pbta-cnv-consensus-gistic.zip`| PBTA data file | [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run_gistic_consensus.sh) | Somatic CNV - GISTIC 2.0 output using `pbta-cnv-consensus.seg` file input (WGS samples only) +|`pbta-cnv-cnvkit.seg.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - CNVkit [SEG 
file](https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg) (WGS samples only) +|`pbta-cnv-consensus.seg.gz` | Analysis file | [CNV consensus calls](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | Somatic Copy Number Variant - CNVkit [SEG file](https://cnvkit.readthedocs.io/en/stable/fileformats.html#seg) (WGS samples only) +|`pbta-cnv-controlfreec.tsv.gz` | PBTA data file | [Copy number variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-copy-number-variant-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_combined_somatic_wgs_cnv_wf.cwl) | Somatic Copy Number Variant - TSV file that is a merge of [ControlFreeC `*_CNVs` files](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT) (WGS samples only) |`pbta-fusion-arriba.tsv.gz` | PBTA data file | [Gene fusion detection](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-fusion-detection); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | Fusion - [Arriba TSV](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/arriba-tsv-header.md), annotated with FusionAnnotator |`pbta-fusion-putative-oncogenic.tsv` | Analysis file | [`analyses/fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | Filtered and prioritized fusions |`pbta-fusion-recurrently-fused-genes-byhistology.tsv`| Analysis file | [`analysis/fusion-filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | Recurrently-fused genes tabulated by broad histology @@ -54,22 +53,22 @@ This document contains information about all data files associated with this pro |`pbta-isoform-counts-rsem-expected_count.stranded.rds` | PBTA data file | [Gene expression abundance 
estimation](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-expression-abundance-estimation); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) |Gene expression - RSEM expected counts for stranded samples (transcript-level) |`pbta-isoform-expression-rsem-tpm.polya.rds` | PBTA data file | [Gene expression abundance estimation](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-expression-abundance-estimation); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | Gene expression - RSEM TPM for poly-A samples (transcript-level) |`pbta-isoform-expression-rsem-tpm.stranded.rds` | PBTA data file | [Gene expression abundance estimation](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-expression-abundance-estimation); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | Gene expression - RSEM TPM for stranded samples (transcript-level) -|`pbta-mend-qc-manifest.tsv` | PBTA data file | [`MendQC analysis placeholder`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/341); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-mendqc-wf.cwl) | File to map MendQC output to biospecimen IDs -|`pbta-mend-qc-results.tar.gz` | PBTA data file | [`MendQC analysis placeholder`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/341); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-mendqc-wf.cwl) | MendQC output files +|`pbta-mend-qc-manifest.tsv` | PBTA data file | [`MendQC analysis placeholder`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/341); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-mendqc-wf.cwl) | File to map MendQC output to biospecimen 
IDs +|`pbta-mend-qc-results.tar.gz` | PBTA data file | [`MendQC analysis placeholder`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/341); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-mendqc-wf.cwl) | MendQC output files |`pbta-snv-consensus-mutation.maf.tsv.gz` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Consensus calls for SNVs and small indels; columns in the included file are derived from the Strelka2. |`pbta-snv-consensus-mutation-tmb-all.tsv` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Tumor mutation burden statistics calculated from Strelka2 and Mutect2 SNV consensus, and the intersection of Strelka2 and Mutect2 BED windows sizes. |`pbta-snv-consensus-mutation-tmb-coding.tsv` | Analysis file | [`analyses/snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/) | Coding only tumor mutation burden statistics calculated from the number of coding sequence Strelka2, Mutect2, and Lancet consensus SNVs and size of the intersection of all three callers' BED windows and the Gencode v27 coding sequences. 
-|`pbta-snv-lancet.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-lancet-wf.cwl) | Somatic SNV - Lancet [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) -|`pbta-snv-mutect2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic SNV - Mutect2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) -|`pbta-snv-strelka2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic SNV - Strelka2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) -|`pbta-snv-vardict.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-vardict-wf.cwl) | Somatic SNV - VarDict [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-snv-lancet.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); 
[Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-lancet-wf.cwl) | Somatic SNV - Lancet [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-snv-mutect2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic SNV - Mutect2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-snv-strelka2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic SNV - Strelka2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-snv-vardict.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-vardict-wf.cwl) | Somatic SNV - VarDict [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) |`pbta-star-log-final.tar.gz` | PBTA data file | [Gene expression abundance estimation](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-expression-abundance-estimation); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | STAR log final output files |`pbta-star-log-manifest.tsv` | PBTA data file | [Gene expression abundance 
estimation](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#gene-expression-abundance-estimation); [Workflow](https://github.com/kids-first/kf-rnaseq-workflow/blob/master/workflow/kfdrc_RNAseq_workflow.cwl) | File to map STAR output to biospecimen IDs -|`pbta-sv-manta.tsv.gz`| PBTA data file | [Structural variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-structural-variant-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only) +|`pbta-sv-manta.tsv.gz`| PBTA data file | [Structural variant calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-structural-variant-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc_strelka2_mutect2_manta_workflow.cwl) | Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only) |`pbta-tcga-manifest.tsv`| PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling) | Manifest of tumor/normal BAMs used for SNV calling, Tumor_Sample_Barcodes, and histologies -|`pbta-tcga-snv-lancet.vep.maf.gz` | PBTA/TCGA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-lancet-wf.cwl) | Somatic SNV - Lancet [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) -|`pbta-tcga-snv-mutect2.vep.maf.gz` | PBTA data file | [Somatic mutation 
calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-mutect2_strelka2-wf.cwl) | Somatic SNV - Mutect2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) -|`pbta-tcga-snv-strelka2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/kfdrc-mutect2_strelka2-wf.cwl) | Somatic SNV - Strelka2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-tcga-snv-lancet.vep.maf.gz` | PBTA/TCGA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-lancet-wf.cwl) | Somatic SNV - Lancet [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-tcga-snv-mutect2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-mutect2_strelka2-wf.cwl) | Somatic SNV - Mutect2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) +|`pbta-tcga-snv-strelka2.vep.maf.gz` | PBTA data file | [Somatic mutation calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-mutation-calling); [Workflow](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/cwl/kfdrc-mutect2_strelka2-wf.cwl) | Somatic SNV - 
Strelka2 [annotated MAF file](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/vep-maf.md) |`StrexomeLite_hg38_liftover_100bp_padded.bed`| Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 targeted panel regions used for all variant callers, each region padded by 100 bp |`StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed` | Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 targeted DNA panel bait capture regions provided by the kit manufacturer |`WGS.hg38.lancet.300bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | WGS.hg38.lancet.unpadded.bed file with each region padded by 300 bp @@ -78,4 +77,4 @@ This document contains information about all data files associated with this pro |`WGS.hg38.strelka2.unpadded.bed` | Reference Regions File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 BROAD Institute interval calling list (restricted to Chr1-22,X,Y,M) used for Strelka2 variant caller |`WGS.hg38.vardict.100bp_padded.bed` | Reference Regions File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | `WGS.hg38.mutect2.vardict.unpadded.bed` with each region padded by 100 bp used for VarDict variant caller |`WXS.hg38.100bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Strelka2, Mutect2, and VarDict variant callers 
with each region padded by 100 bp -|`WXS.hg38.lancet.400bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Lancet variant callers with each region padded by 400 bp +|`WXS.hg38.lancet.400bp_padded.bed` | Reference Target/Baits File | [SNV and INDEL calling](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling) | hg38 WXS regions provided by the kit manufacturer used for Lancet variant callers with each region padded by 400 bp \ No newline at end of file diff --git a/doc/data-formats.md b/doc/data-formats.md index db93950158..41eba04afb 100644 --- a/doc/data-formats.md +++ b/doc/data-formats.md @@ -184,16 +184,47 @@ The filtered and prioritized fusion and downstream files are a product of the [` * `fusion_summary_embryonal_foi.tsv` contains a binary matrix that denotes the presence or absence of a recurrent embryonal tumor fusions of interest per individual RNA-seq specimen. * `fusion_summary_ependymoma_foi.tsv` contains a binary matrix that denotes the presence or absence of a recurrent ependymal tumor fusions of interest per individual RNA-seq specimen. -### Copy Number Files +### Derived Copy Number Files + +#### Consensus Copy Number File + +Copy number consensus calls from the copy number and structural variant callers are a product of the [`analyses/copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) analysis module. + +* `pbta-cnv-consensus.seg.gz` contains consensus segments and segment means (log R ratios) from two or more callers, as described in the [analysis README](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/README.md). 
+
+#### GISTIC Output File Formats
 
 `pbta-cnv-cnvkit-gistic.zip` is the output of running GISTIC 2.0 on the CNVkit results (`pbta-cnv-cnvkit.seg`).
-The script used to run GISTIC can be [found here](https://github.com/d3b-center/publication_workflows/blob/master/openPBTA/run-gistic.sh).
+`pbta-cnv-consensus-gistic.zip` is the output of running GISTIC 2.0 on the CNV consensus calls (`pbta-cnv-consensus.seg.gz`), described above.
+The scripts used to run GISTIC are linked here: [CNVkit](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run-gistic.sh) and [Consensus calls](https://github.com/d3b-center/OpenPBTA-workflows/blob/master/bash/run_gistic_consensus.sh).
 
-### Consensus Copy Number File
+Note that GISTIC is run on the _entire cohort_ and therefore the output reflects regions that are significantly amplified or deleted across the entire cohort.
 
-Copy number consensus calls from the copy number and structural variant callers are a product of the [`analyses/copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) analysis module.
- * `cnv_consensus.tsv` contains consensus regions from two or more callers, with columns described in the [analysis README](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/README.md).
+The GISTIC output data files below, which are commonly leveraged for downstream analyses, are described in more detail on the Broad Institute's [GenePattern website](https://www.genepattern.org/modules/docs/GISTIC_2.0).
+
+ - `all_lesions.conf_90.txt` (90% confidence level): this file contains significant regions of amplification and deletion and samples with amplifications/deletions in each of these regions
+ - `amp_genes.conf_90.txt` (90% confidence level): table of amplification peaks and genes within them
+ - `del_genes.conf_90.txt` (90% confidence level): table of deletion peaks and genes within them
+ - `all_thresholded.by_genes.txt`: table of high- and low-level amplifications and deletions using sample-specific thresholds for high-level (output in `sample_cutoffs.txt` file) and default low-level thresholds (+/-0.1)
+
+##### Additional relevant output files are described below:
+
+ - `all_data_by_genes.txt`: This file contains a table of gene symbol, gene ID, cytoband, and Log R Ratios (LRR) for each sample (not thresholded).
+ - `broad_data_by_genes.txt`: This file contains a table of gene symbol, gene ID, cytoband, and LRR for each sample.
+ - `focal_data_by_genes.txt`: This file contains a matrix of gene LRR by sample.
+ - `sample_seg_counts.txt`: By default, samples with >2500 segments are excluded from GISTIC analyses; samples are annotated as included or excluded in this file.
+ - `broad_values_by_arm.txt`: This file contains a matrix of chromosomal arm LRR by sample.
+
+##### Use cases for these files include:
+
+ - `broad_values_by_arm.txt` for molecular subtyping in which chromosomal arms are commonly gained/amplified or deleted
+ - `all_thresholded.by_genes.txt` for gene-level copy-number analyses
 
 ## Data Caveats
 
-The clinical manifest will be updated and versioned as molecular subgroups are identified based on genomic analyses.
+The clinical manifest will be updated and versioned as molecular subgroups are identified based on genomic analyses.
+
+Analyses related to molecular subtyping are as follows:
+
+* [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG)
+* [`molecular-subtyping-embryonal`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal)
diff --git a/doc/release-notes.md b/doc/release-notes.md
index 7887bf36b2..959fcbcef9 100644
--- a/doc/release-notes.md
+++ b/doc/release-notes.md
@@ -1,5 +1,118 @@
 # release notes
 ## current release
+### release-v14-20200203
+- release date: 2020-02-03
+- status: available
+- changes:
+  - Update kallisto stranded file to remove index column per [#474](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/474). Also removed samples from last re-sequencing polyA+stranded batch that were missed with the v13 release:
+    - pbta-gene-expression-kallisto.stranded.rds
+  - Update matrices of ependymal tumor and embryonal tumor fusions of interest by biospecimen from [`analyses/fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/478) to include all RNA biospecimens in new `pbta-histologies.tsv` file without fusion calls. Files updated:
+    - fusion_summary_embryonal_foi.tsv
+    - fusion_summary_ependymoma_foi.tsv
+  - Update Strelka2, Mutect2, and Lancet TCGA MAF files per [#483](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/483) and [#512](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/512). Files updated:
+    - pbta-tcga-snv-mutect2.vep.maf.gz
+    - pbta-tcga-snv-strelka2.vep.maf.gz
+    - pbta-tcga-snv-lancet.vep.maf.gz
+  - Remove `cnv_consensus.tsv` file per [this comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/432#issuecomment-579340589).
+  - Update copy number files:
+    - Update CNVkit seg file to add missing specimen per [#472](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/472):
+      - pbta-cnv-cnvkit.seg.gz
+    - Update GISTIC results for CNVkit to include missing specimen per [#491](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/491):
+      - pbta-cnv-cnvkit-gistic.zip
+    - Add consensus SEG file per [#441](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/441):
+      - pbta-cnv-consensus.seg.gz
+    - Add GISTIC results for consensus SEG per [#453](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/453):
+      - pbta-cnv-consensus-gistic.zip
+  - Update analysis files and names from `exon` to `cds` per [#440](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/440) and [this comment](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/432#issuecomment-581462059):
+    - intersect_strelka_mutect_WGS.bed
+    - intersect_cds_lancet.bed
+    - intersect_cds_lancet_strelka_mutect_WGS.bed
+    - pbta-snv-consensus-mutation-tmb-coding.tsv
+  - Update `pbta-histologies.tsv` to add embryonal `molecular_subtypes` per [#251](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251) using results [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal/results) and high-grade glioma (HGG) `molecular_subtypes` per [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) and [this commit](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/435/commits/7e8e7917c002e11ab97312dc7f64c542fc04892b).
+    - Additional clinical data fields updated include:
+      - `glioma_brain_region` (if sample was not previously classified as glioma)
+      - `Notes` (to concatenate old notes for `disease_type_new` and `molecular_subtype` changes and current changes based on OpenPBTA subtyping)
+      - `disease_type_new`
+        - HGG: created "Diffuse midline glioma" for any subtype containing `DMG` and the rest were "High-grade glioma". (Previously, DMGs were both HGG and DIPG, but following WHO 2016 nomenclature, we stick with DMG instead of DIPG now that we have subtypes).
+        - Embryonal: if `molecular_subtype` contains `ETMR`, these became `Embryonal tumor with multilayer rosettes` while the rest became `CNS Embryonal Tumor`
+      - `short_histology`
+        - HGG: all became `HGAT`
+        - Embryonal: if `molecular_subtype` contains `ETMR`, these became `ETMR` while the rest became `Embryonal Tumor`
+      - `broad_histology`
+        - HGG: all became `Diffuse astrocytic and oligodendroglial tumor`
+        - Embryonal: all became `Embryonal Tumor`
+
+- folder structure:
+```
+data
+└── release-v14-20200203
+    ├── release-notes.md
+    ├── data-files-description.md
+    ├── StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed
+    ├── StrexomeLite_hg38_liftover_100bp_padded.bed
+    ├── WGS.hg38.lancet.300bp_padded.bed
+    ├── WGS.hg38.lancet.unpadded.bed
+    ├── WGS.hg38.mutect2.vardict.unpadded.bed
+    ├── WGS.hg38.strelka2.unpadded.bed
+    ├── WGS.hg38.vardict.100bp_padded.bed
+    ├── WXS.hg38.100bp_padded.bed
+    ├── WXS.hg38.lancet.400bp_padded.bed
+    ├── md5sum.txt
+    ├── pbta-cnv-cnvkit.seg.gz
+    ├── pbta-cnv-consensus.seg.gz
+    ├── pbta-cnv-controlfreec.tsv.gz
+    ├── pbta-cnv-cnvkit-gistic.zip
+    ├── pbta-cnv-consensus-gistic.zip
+    ├── pbta-fusion-arriba.tsv.gz
+    ├── pbta-fusion-starfusion.tsv.gz
+    ├── pbta-fusion-putative-oncogenic.tsv
+    ├── pbta-gene-counts-rsem-expected_count.polya.rds
+    ├── pbta-gene-counts-rsem-expected_count.stranded.rds
+    ├── pbta-gene-expression-kallisto.polya.rds
+    ├── pbta-gene-expression-kallisto.stranded.rds
+    ├── pbta-gene-expression-rsem-fpkm.polya.rds
+    ├── pbta-gene-expression-rsem-fpkm.stranded.rds
+    ├── pbta-histologies.tsv
+    ├── pbta-snv-lancet.vep.maf.gz
+    ├── pbta-snv-mutect2.vep.maf.gz
+    ├── pbta-snv-strelka2.vep.maf.gz
+    ├── pbta-snv-vardict.vep.maf.gz
+    ├── pbta-sv-manta.tsv.gz
+    ├── independent-specimens.wgs.primary-plus.tsv
+    ├── independent-specimens.wgs.primary.tsv
+    ├── independent-specimens.wgswxs.primary-plus.tsv
+    ├── independent-specimens.wgswxs.primary.tsv
+    ├── pbta-gene-expression-rsem-fpkm-collapsed.polya.rds
+    ├── pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
+    ├── pbta-gene-expression-rsem-tpm.polya.rds
+    ├── pbta-gene-expression-rsem-tpm.stranded.rds
+    ├── pbta-isoform-expression-rsem-tpm.polya.rds
+    ├── pbta-isoform-expression-rsem-tpm.stranded.rds
+    ├── pbta-isoform-counts-rsem-expected_count.polya.rds
+    ├── pbta-isoform-counts-rsem-expected_count.stranded.rds
+    ├── pbta-snv-consensus-mutation.maf.tsv.gz
+    ├── pbta-snv-consensus-mutation-tmb-all.tsv
+    ├── pbta-snv-consensus-mutation-tmb-coding.tsv
+    ├── pbta-fusion-recurrently-fused-genes-byhistology.tsv
+    ├── pbta-fusion-recurrently-fused-genes-bysample.tsv
+    ├── pbta-tcga-snv-lancet.vep.maf.gz
+    ├── pbta-tcga-snv-strelka2.vep.maf.gz
+    ├── pbta-tcga-snv-mutect2.vep.maf.gz
+    ├── pbta-tcga-manifest.tsv
+    ├── pbta-mend-qc-results.tar.gz
+    ├── pbta-mend-qc-manifest.tsv
+    ├── pbta-star-log-final.tar.gz
+    ├── pbta-star-log-manifest.tsv
+    ├── intersect_cds_lancet_strelka_mutect_WGS.bed
+    ├── intersect_cds_lancet.bed
+    ├── intersect_strelka_mutect_WGS.bed
+    ├── fusion_summary_embryonal_foi.tsv
+    └── fusion_summary_ependymoma_foi.tsv
+```
+
+
+
+## archived releases
 ### release-v13-20200116
 - release date: 2020-01-16
 - status: available
@@ -113,7 +226,7 @@ data
     └── fusion_summary_ependymoma_foi.tsv
 ```
 
-## archived releases
+### release-v12-20191217
 - release date: 2019-12-17
 - status: available
 - changes:
@@ -589,4 +702,3 @@ data
 ```
 
 
-