Data file descriptions

This document describes all data files associated with this project. Each file has the following associated information:

  • File type will be one of:
    • Reference file: Obtained from an external source/database. When known, the date obtained and a link to the external source are included.
    • Modified reference file: Obtained from an external source/database but modified for OpenPBTA use.
    • Processed data file: Data that are processed upstream of the analysis project, e.g., the output of a somatic single nucleotide variant method. Links to the relevant D3B Center or Kids First workflow (and version where applicable) are included in Origin.
    • Analysis file: Any file created by a script in analyses/*.
  • Origin
    • For Processed data files, a link to the relevant D3B Center or Kids First workflow (and version where applicable).
    • When applicable, a link to the specific script that produced (or modified, for Modified reference file types) the data.
  • File description
    • A brief one-sentence description of what the file contains (e.g., BED files contain coordinates for features XYZ).

Current release (v11)

File name File Type Origin File Description
histologies-base.tsv Data file Cohort-specific data files and databases Clinical and sequencing metadata for each biospecimen
histologies.tsv Modified data file molecular-subtyping-integrate histologies-base.tsv plus molecular_subtype, cancer_group, integrated_diagnosis, and harmonized_diagnosis
intersect_cds_lancet_strelka_mutect_WGS.bed Analysis file snv-callers Intersection of gencode.v27.primary_assembly.annotation.gtf.gz CDS with Lancet, Strelka2, Mutect2 regions
intersect_strelka_mutect_WGS.bed Analysis file snv-callers Intersection of gencode.v27.primary_assembly.annotation.gtf.gz CDS with Strelka2 and Mutect2 regions called
efo-mondo-map.tsv Reference mapping file Manual collation Mapping of EFO and MONDO codes to cancer groups
efo-mondo-map-prefill.tsv Modified reference mapping file Analysis file generated in molecular-subtyping-integrate Mapping of EFO and MONDO codes to cancer groups
ensg-hugo-pmtl-mapping.tsv Reference mapping file Manual curation of PMTLv1.1 by FNL; RNA-Seq pipeline GTF mapping File that maps Hugo Symbols to ENSEMBL gene IDs and each ENSG to the PMTL curated by FNL
*.bed Reference file Manual collation BED files used for variant calling and for TMB calculation
uberon-map-gtex-group.tsv Reference mapping file Manual collation Mapping of UBERON codes to tissue types in GTEx broad groups
uberon-map-gtex-subgroup.tsv Reference mapping file Manual collation Mapping of UBERON codes to tissue types in GTEx subgroups
methyl-beta-values.rds Processed data file methylation beta values Methylation beta values
methyl-m-values.rds Processed data file methylation m values Methylation m values
rna-isoform-expression-rsem-tpm.rds Processed data file RNA isoform TPM files RNA isoform TPM files
snv-dgd.maf.tsv.gz Processed data file DGD merged SNV MAF results DGD merged SNV MAF results
fusion-dgd.tsv Processed data file DGD merged fusion results DGD merged fusion results
fusion-arriba.tsv.gz Processed data file Gene fusion detection; Workflow Fusion - Arriba TSV, annotated with FusionAnnotator
fusion-starfusion.tsv.gz Processed data file Gene fusion detection; Workflow Fusion - STARFusion TSV
fusion_summary_embryonal_foi.tsv Analysis file fusion-summary Summary file for presence of embryonal tumor fusions of interest
fusion_summary_ependymoma_foi.tsv Analysis file fusion-summary Summary file for presence of ependymal tumor fusions of interest
fusion_summary_ewings_foi.tsv Analysis file fusion-summary Summary file for presence of Ewing's sarcoma fusions of interest
fusion_summary_ewings_lgat.tsv Analysis file fusion-summary Summary file for presence of LGAT fusions of interest
fusion-putative-oncogenic.tsv Analysis file fusion_filtering Filtered and prioritized fusions
gene-counts-rsem-expected_count-collapsed.rds Analysis file PBTA+GMKF+TARGET+GTEx collapse-rnaseq; GTEx v8 release Gene expression - RSEM expected_count for each sample collapsed to gene symbol (gene-level)
gene-expression-rsem-tpm-collapsed.rds Analysis file PBTA+GMKF+TARGET+GTEx collapse-rnaseq; GTEx v8 release Gene expression - RSEM TPM for each sample collapsed to gene symbol (gene-level)
tcga-gene-counts-rsem-expected_count-collapsed.rds Modified reference file TCGA samples - manually curated to include 10414 TCGA RNA samples that are in diseaseXpress and have GDC clinical information Gene expression - RSEM expected_count for each sample collapsed to gene symbol (gene-level)
tcga-gene-expression-rsem-tpm-collapsed.rds Modified reference file TCGA samples - manually curated to include 10414 TCGA RNA samples that are in diseaseXpress and have GDC clinical information Gene expression - RSEM TPM for each sample collapsed to gene symbol (gene-level)
WGS.hg38.lancet.300bp_padded.bed Reference Target/Baits File SNV and INDEL calling WGS.hg38.lancet.unpadded.bed file with each region padded by 300 bp
WGS.hg38.lancet.unpadded.bed Reference Regions File SNV and INDEL calling hg38 WGS regions created using UTR, exon, and start/stop codon features of the GENCODE 31 reference, augmented with PASS variant calls from Strelka2 and Mutect2
WGS.hg38.mutect2.vardict.unpadded.bed Reference Regions File SNV and INDEL calling hg38 Broad Institute interval calling list (restricted to Chr1-22,X,Y,M and non-N regions) used for the Mutect2 and VarDict variant callers
WGS.hg38.strelka2.unpadded.bed Reference Regions File SNV and INDEL calling hg38 Broad Institute interval calling list (restricted to Chr1-22,X,Y,M) used for the Strelka2 variant caller
WGS.hg38.vardict.100bp_padded.bed Reference Regions File SNV and INDEL calling WGS.hg38.mutect2.vardict.unpadded.bed with each region padded by 100 bp, used for the VarDict variant caller
snv-consensus-plus-hotspots.maf.tsv.gz Processed data file copy_number_consensus_call Consensus (2 of 4) MAF for PBTA + GMKF + TARGET
cnv-cnvkit.seg.gz Processed data file Copy number variant calling; Workflow Somatic Copy Number Variant - CNVkit SEG file
cnv-consensus.seg.gz Analysis file copy_number_consensus_call Somatic Copy Number Variant - WGS samples only
Analysis files copy_number_consensus_call CNVkit calls for WXS or CNV consensus calls for WGS with gain/loss status
cnv-consensus-gistic.gz Analysis file run-gistic GISTIC results - WGS samples only
cnv-controlfreec.tsv.gz Processed data file Copy number variant calling; Workflow Somatic Copy Number Variant - TSV file that is a merge of ControlFreeC *_CNVs files
consensus_wgs_plus_cnvkit_wxs_autosomes.tsv.gz Analysis file focal-cn-file-preparation TSV file containing genes with copy number changes per biospecimen; autosomes only
consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz Analysis file focal-cn-file-preparation TSV file containing genes with copy number changes per biospecimen; sex chromosomes only
consensus_wgs_plus_cnvkit_wxs.tsv.gz Analysis file focal-cn-file-preparation TSV file containing genes with copy number changes per biospecimen; both autosomes and sex chromosomes
snv-mutation-tmb-all.tsv Analysis file tmb-calculation TSV file with sample names and their tumor mutation burden counting all variants
snv-mutation-tmb-coding.tsv Analysis file tmb-calculation TSV file with sample names and their tumor mutation burden counting only variants in coding regions
sv-manta.tsv.gz Processed data file Structural variant calling; Workflow Somatic Structural Variant - Manta output, annotated with AnnotSV (WGS samples only)
independent-specimens.wgswxspanel.relapse.tsv Analysis files independent-samples Independent (non-redundant) sample list of DNA, RNA, or methylation samples of all sequencing methods, from primary, primary-plus, or relapse tumors within each or across all cohorts
independent-specimens.rnaseqpanel.relapse.pre-release.tsv Analysis files independent-samples Independent (non-redundant) sample list of RNA samples of all sequencing methods, from primary, primary-plus, or relapse tumors across all cohorts for the purposes of running fusion_filtering pre-release
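
Several of the BED files listed above are described as another file "with each region padded by" a fixed number of base pairs (300 bp for Lancet, 100 bp for VarDict). The following is a minimal sketch of what such padding means for BED-style coordinates. It is an illustration only and not necessarily how the project's files were produced. The function names `pad_regions` and `parse_bed_line` and the optional `chrom_sizes` argument are hypothetical, introduced here for the example.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# One BED region: (chromosome, start, end), 0-based and half-open as in BED.
Region = Tuple[str, int, int]


def parse_bed_line(line: str) -> Region:
    """Read the first three tab-separated BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def pad_regions(
    regions: Iterable[Region],
    pad: int,
    chrom_sizes: Optional[Dict[str, int]] = None,
) -> Iterator[Region]:
    """Extend every region by `pad` bp on both sides.

    Starts are clamped at 0. Ends are clamped to the chromosome length only
    when `chrom_sizes` (hypothetical mapping of chromosome -> length) is given
    and contains the chromosome. Overlapping padded regions are NOT merged.
    """
    for chrom, start, end in regions:
        new_start = max(0, start - pad)
        new_end = end + pad
        if chrom_sizes is not None and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])
        yield chrom, new_start, new_end


if __name__ == "__main__":
    # Made-up example regions, not taken from any project file.
    unpadded = [("chr1", 100, 500), ("chr1", 10_000, 10_250)]
    for region in pad_regions(unpadded, pad=300):
        print("\t".join(str(field) for field in region))
```

In this sketch the first example region starts less than 300 bp from the chromosome start, so its padded start is clamped to 0. Whether neighbouring padded regions should then be merged depends on the downstream caller and is left out here.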