diff --git a/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md b/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md index 3d496f430..47310a6f9 100644 --- a/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md +++ b/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md @@ -1,7 +1,7 @@ # mRNA Analysis Pipeline ## Introduction -The GDC mRNA quantification analysis pipeline measures gene level expression in [HT-Seq](http://www-huber.embl.de/HTSeq/doc/overview.html) raw read count, Fragments per Kilobase of transcript per Million mapped reads (FPKM), and FPKM-UQ (upper quartile normalization). These values are generated through this pipeline by first aligning reads to the GRCh38 [reference genome](https://gdc.cancer.gov/download-gdc-reference-files) and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses. +The GDC mRNA quantification analysis pipeline measures gene level expression with [STAR](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) as raw read counts. Subsequently the counts are augmented with several transformations including Fragments per Kilobase of transcript per Million mapped reads (FPKM), upper quartile normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These values are additionally annotated with the gene symbol and gene bio-type. These data are generated through this pipeline by first aligning reads to the GRCh38 [reference genome](https://gdc.cancer.gov/download-gdc-reference-files) and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses. ## Data Processing Steps @@ -12,7 +12,9 @@ Files that were processed after Data Release 14 have associated transcriptomic a Files that were processed after Data Release 25 will have associated [gene fusion files](#fusion-pipelines). 
-[![RNA Alignment Pipeline](images/gene-expression-quantification-pipeline-v3.png)](images/gene-expression-quantification-pipeline-v3.png "Click to see the full image.") +As of Data Release 32 the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used. + +[![RNA Alignment Pipeline](images/RNA_Expression_WF_Flowchart.png)](images/RNA_Expression_WF_Flowchart.png "Click to see the full image.") | I/O | Entity | Format | |---|---|---| @@ -157,19 +159,62 @@ STAR \ --runThreadN \ --twopassMode Basic ``` +```DR32 +# STAR Genome Index +STAR +--runMode genomeGenerate +--genomeDir +--genomeFastaFiles +--sjdbOverhang 100 +--sjdbGTFfile +--runThreadN 8 + +# STAR Alignment +# STAR v2.7.0f +STAR +--readFilesIn \ +--outSAMattrRGline \ +--genomeDir \ +--readFilesCommand \ +--runThreadN \ +--twopassMode Basic \ +--outFilterMultimapNmax 20 \ +--alignSJoverhangMin 8 \ +--alignSJDBoverhangMin 1 \ +--outFilterMismatchNmax 999 \ +--outFilterMismatchNoverLmax 0.1 \ +--alignIntronMin 20 \ +--alignIntronMax 1000000 \ +--alignMatesGapMax 1000000 \ +--outFilterType BySJout \ +--outFilterScoreMinOverLread 0.33 \ +--outFilterMatchNminOverLread 0.33 \ +--limitSjdbInsertNsj 1200000 \ +--outFileNamePrefix \ +--outSAMstrandField intronMotif \ +--outFilterIntronMotifs None \ +--alignSoftClipAtReferenceEnds Yes \ +--quantMode TranscriptomeSAM GeneCounts \ +--outSAMtype BAM Unsorted \ +--outSAMunmapped Within \ +--genomeLoad NoSharedMemory \ +--chimSegmentMin 15 \ +--chimJunctionOverhangMin 15 \ +--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \ +--chimOutJunctionFormat 1 \ +--chimMainSegmentMultNmax 1 \ +--outSAMattributes NH HI AS nM NM ch +``` \*These indices are available for download at the [GDC Website](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) and do not need to be built again. 
### mRNA Expression Workflow -Following alignment, BAM files are processed through the [RNA Expression Workflow](/Data_Dictionary/viewer/#?view=table-definition-view&id=rna_expression_workflow) to determine RNA expression levels. - -The reads mapped to each gene are enumerated using HT-Seq-Count. Expression values are provided in a tab-delimited format. [GENCODE v22](https://www.gencodegenes.org/human/release_22.html) was used for gene annotation. -Files that were processed after Data Release 14 have an additional set of read counts that were produced by STAR during the alignment step. +The primary count data are generated by STAR and include a gene ID along with unstranded and stranded counts. Following alignment, the raw count files produced by STAR are augmented with commonly used count transformations (FPKM, FPKM-UQ, and TPM) along with basic annotations as part of the [RNA Expression Workflow](/Data_Dictionary/viewer/#?view=table-definition-view&id=rna_expression_workflow). These data are provided in a tab-delimited format. [GENCODE v36](https://www.gencodegenes.org/human/release_36.html) was used for gene annotation. -Note that counting algorithms such as HTSeq and STAR will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero. +Note that STAR does not count reads that are mapped to more than one gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero. 
-* [Overlapped Genes (stranded)](/Data/Bioinformatics_Pipelines/overlap.gene.stranded.tsv) +* [Overlapped Genes (stranded)](/Data/Bioinformatics_Pipelines/overlap.gene.stranded.tsv) * [Overlapped Genes (unstranded)](/Data/Bioinformatics_Pipelines/overlap.gene.strandless.tsv) | I/O | Entity | Format | @@ -181,6 +226,9 @@ Note that counting algorithms such as HTSeq and STAR will not count reads that a HTSeq-0.6.1p1 +```Current +Counts are produced by STAR concurrent with alignment. +``` ```Original htseq-count \ -m intersection-nonempty \ @@ -189,7 +237,7 @@ htseq-count \ -s no \ - gencode.v22.annotation.gtf ``` -```DR15Plus +```DR15-31 htseq-count \ -f bam \ -r name \ @@ -202,26 +250,25 @@ htseq-count \ > ``` -## mRNA Expression HT-Seq Normalization +## mRNA Expression Transformation -RNA-Seq expression level read counts produced by HT-Seq are normalized using two similar methods: FPKM and FPKM-UQ. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated. +RNA-Seq expression level read counts produced by the workflow are normalized using three commonly used methods: FPKM, FPKM-UQ, and TPM. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated. ### FPKM -The Fragments per Kilobase of transcript per Million mapped reads (FPKM) calculation normalizes read count by dividing it by the gene length and the total number of reads mapped to protein-coding genes. +The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity. ### Upper Quartile FPKM -The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample. 
+The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the read count of the protein coding gene at the 75th percentile position is substituted for the sequencing quantity. This is thought to provide a more stable value than including the noisier genes at the extremes. -### Calculations +### TPM -[![FPKM Calculations](images/Calc_FPKM_andUQ.png)](images/fpkm.gif "Click to see the full image.") +The transcripts per million (TPM) calculation is similar to FPKM, but all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values is used as an indicator of size. -- __RCg:__ Number of reads mapped to the gene -- __RCpc:__ Number of reads mapped to all protein-coding genes -- __RCg75:__ The 75th percentile read count value for genes in the sample -- __L:__ Length of the gene in base pairs; Calculated as the sum of all exons in a gene +### Calculations + +[![FPKM Calculations](images/normalizations_calc.png)](images/normalizations_calc.png "Click to see the full image.") __Note:__ The read count is multiplied by a scalar (10^9) during normalization to account for the kilobase and 'million mapped reads' units. 
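The three transformations can be sketched in a few lines of Python. This is a minimal illustration, not the GDC implementation: the function and variable names are hypothetical, and the inputs are the numbers that appear in the Gene A formulas of the worked example in this document (gene length 3,000 bp, 50,000,000 protein-coding reads, a 75th percentile count of 2,000, 19,029 autosomal protein coding genes, and a length-normalized sum of 9,000,000).

```python
# Hypothetical helper functions; names are illustrative, not part of the GDC workflow.

def fpkm(count, length_bp, total_pc_reads):
    """FPKM: scale the count by gene length and total protein-coding reads."""
    return count * 1e9 / (length_bp * total_pc_reads)


def fpkm_uq(count, length_bp, uq_count, n_autosomal_pc_genes):
    """FPKM-UQ: replace the total read count with the 75th percentile count
    multiplied by the number of protein coding genes on autosomes."""
    return count * 1e9 / (length_bp * uq_count * n_autosomal_pc_genes)


def tpm(count, length_bp, sum_length_normalized):
    """TPM: length-normalize first, then scale by the sum of all
    length-normalized values."""
    rate = count * 1e3 / length_bp
    return rate * 1e6 / sum_length_normalized


# Inputs assumed from the Gene A example formulas.
gene_a_fpkm = fpkm(1_000, 3_000, 50_000_000)
gene_a_fpkm_uq = fpkm_uq(1_000, 3_000, 2_000, 19_029)
gene_a_tpm = tpm(1_000, 3_000, 9_000_000)

print(round(gene_a_fpkm, 2), round(gene_a_fpkm_uq, 2), round(gene_a_tpm, 2))
```

Keeping each transformation as a separate function makes it easier to re-normalize raw counts when only a subset of genes is investigated, since the denominators can then be recomputed for that subset.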
@@ -233,10 +280,14 @@ __Sample 1: Gene A__ * 1,000 reads mapped to Gene A -* 1,000,000 reads mapped to all protein-coding regions +* 50,000,000 reads mapped to all protein-coding regions * Read count in Sample 1 for 75th percentile gene: 2,000 +* Number of protein coding genes on autosomes: 19,029 +* Sum of length-normalized transcript counts: 9,000,000 + +__FPKM for Gene A__ = 1,000 \* 10^9 / (3,000 \* 50,000,000) = __6.67__ -__FPKM for Gene A__ = (1,000)\*(10^9)/[(3,000)\*(1,000,000)] = __333.33__ +__FPKM-UQ for Gene A__ = 1,000 \* 10^9 / (3,000 \* 2,000 \* 19,029) = __8.76__ -__FPKM-UQ for Gene A__ = (1,000)\*(10^9)/[(3,000)\*(2,000)] = __166,666.67__ +__TPM for Gene A__ = (1,000 \* 1,000 / 3,000) \* 1,000,000 / (9,000,000) = __37.04__ ## Fusion Pipelines @@ -256,7 +307,7 @@ The [Arriba gene fusion pipeline](https://github.com/suhrig/arriba) uses Arriba ## scRNA-Seq Pipeline (single-nuclei) -The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the [Cell Ranger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) to calculate gene expression followed by [Seurat](https://satijalab.org/seurat/) for secondary expression analysis. +The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the [Cell Ranger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) to calculate gene expression followed by [Seurat](https://satijalab.org/seurat/) for secondary expression analysis. ### scRNA Gene Expression Pipeline @@ -278,12 +329,9 @@ When the input RNA was extracted from nuclei instead of cytoplasm, a slightly mo ## File Access and Availability -To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal. 
+To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal. | Type | Description | Format | |---|---|---| -| RNA-Seq Alignment | RNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets | BAM | -| HT-Seq Read Counts | The number of reads aligned to each gene, calculated by HT-Seq | TXT | -| STAR Read Counts | The number of reads aligned to each gene, calculated by STAR | TSV | -| FPKM | A normalized expression value that takes into account each gene length and the number of reads mapped to all protein-coding genes | TXT | -| FPKM-UQ | A modified version of the FPKM formula in which the 75th percentile read count is used as the denominator in place of the total number of protein-coding reads | TXT | +| RNA-Seq Alignment | RNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets. | BAM | +| STAR Read Counts | The number of reads aligned to each gene, calculated by STAR, along with values using common normalization methods. 
| TSV | diff --git a/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF.graphml b/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF.graphml new file mode 100644 index 000000000..7b2a0347d --- /dev/null +++ b/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF.graphml @@ -0,0 +1,406 @@ +[Workflow diagram (GraphML). Node labels: BAM; FASTQ; Convert to FASTQ (Biobambam); Alignment & Splice Junction Detection (STAR 2 TwoPass); Splice Junction*; Aligned Transcriptomic BAM*; Aligned Genomic BAM; Aligned Chimeric BAM*; Gene Expression Counts; Stranded and unstranded counts, FPKM, FPKM-UQ, TPM***; Augment Gene Counts; Transcript Fusion (Arriba); Transcript Fusion (STAR-Fusion); Transcript Fusion** (two output nodes). Legend: * Data Release 14+; ** Data Release 25+; *** Data Release 32+] diff --git a/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF_Flowchart.png b/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF_Flowchart.png new file mode 100644 index 000000000..ff7277d05 Binary files /dev/null and b/docs/Data/Bioinformatics_Pipelines/images/RNA_Expression_WF_Flowchart.png differ diff --git a/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.png 
b/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.png new file mode 100644 index 000000000..73ba045eb Binary files /dev/null and b/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.png differ diff --git a/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.svg b/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.svg new file mode 100644 index 000000000..1c05e95a3 --- /dev/null +++ b/docs/Data/Bioinformatics_Pipelines/images/fpkm_fpkm-uq_tpm_formulas.svg @@ -0,0 +1,417 @@ +[Formula image (SVG). Text content:] +$N$ = number of protein coding genes; $C_g$ = count of reads aligned to gene $g$; $L_g$ = union length of exons of gene $g$; $G$ = number of protein coding genes on autosomes; $C_{qtl(0.75)}$ = count of reads aligned to gene at quantile 0.75 +$$\mathrm{FPKM} = \frac{C_g \cdot 10^9}{\left(\sum_{i=1}^{N} C_i\right) \cdot L_g}$$ +$$\text{FPKM-UQ} = \frac{C_g \cdot 10^9}{C_{qtl(0.75)} \cdot G \cdot L_g}$$ +$$\mathrm{TPM} = \frac{(C_g \cdot 10^3 / L_g) \cdot 10^6}{\sum_{i=1}^{N} (C_i \cdot 10^3 / L_i)}$$ diff --git a/docs/Data/Bioinformatics_Pipelines/images/normalizations_calc.png b/docs/Data/Bioinformatics_Pipelines/images/normalizations_calc.png new file mode 100644 index 000000000..7c8bce502 Binary files /dev/null and b/docs/Data/Bioinformatics_Pipelines/images/normalizations_calc.png differ diff --git a/docs/Data/Release_Notes/Data_Release_Notes.md b/docs/Data/Release_Notes/Data_Release_Notes.md index ad555c9c7..342eab14b 100644 --- a/docs/Data/Release_Notes/Data_Release_Notes.md +++ b/docs/Data/Release_Notes/Data_Release_Notes.md @@ -2,6 +2,7 @@ | Version | Date | |---|---| +| [v32.0](Data_Release_Notes.md#data-release-320) | March XX, 2022 | | [v31.0](Data_Release_Notes.md#data-release-310) | October 29, 2021 | | [v30.0](Data_Release_Notes.md#data-release-300) | September 23, 2021 | | [v29.0](Data_Release_Notes.md#data-release-290) | March 31, 2021 | @@ -38,6 +39,87 @@ | [v2.0](Data_Release_Notes.md#data-release-20) | August 9, 2016 | | 
[v1.0](Data_Release_Notes.md#initial-data-release-10) | June 6, 2016 | +## Data Release 32.0 + +* __GDC Product__: Data - New GENCODE v36 Release +* __Release Date__: March XX, 2022 + +### New updates + +1. The following data types have been replaced with new GENCODE v36 versions + * RNA-Seq: all files, including alignments, gene expression files, and transcript fusion files. + * WXS and Targeted Sequencing: annotated VCFs, single-caller MAFs, Ensemble MAFs. + * WGS: All variant calling files. This includes simple somatic mutations, structural variants, copy number variants. + * GENIE Targeted Sequencing files. + * FM-AD Targeted Sequencing files. +1. All WXS files for TCGA have been replaced with new versions. +1. TCGA RNA-Seq has been changed to contain three alignments, STAR-counts files, and transcript fusion files for each aliquot. +1. Files from the HT-Seq pipeline are no longer supported and will no longer appear in the portal. +1. The project-level MAFs in TCGA and FM-AD have been replaced with aliquot-level MAFs. +1. TCGA methylation data produced from the SeSAMe pipeline is now available. Files that originated from the methylation liftover pipeline are no longer supported and will no longer appear in the portal. +1. TCGA copy number variation files produced from the DNACopy pipeline are no longer supported and will no longer appear in the portal. +1. All mutations and genes in the Exploration page have been replaced with mutations and genes generated with GENCODE v36. +1. BAM files that no longer appear in the portal but previously did will be available for six months past this release. They may not be available after that. +1. Derived files (not BAM) that no longer appear in the portal will be downloadable as previous versions of v36 files. 
+ +Complete lists of files for this release in the GDC Data Portal and the GDC Legacy Archive are found below: + +* [XXXXX](XXXXX) +* [XXXXX](XXXXX) + +### Bugs Fixed Since Last Release + +* None + +### Known Issues and Workarounds + +* The slide image viewer does not display properly for 14 slides, which are identified [here](missing_tiling.txt). The full slide image can be downloaded as an SVS file. +* The Copy Number Estimate files in GENIE are labeled on the portal as TXT while the files are actually in TSV format. +* Some tumor-only annotated VCFs (not raw VCFs) could have a small proportion of variants that appear twice. Tumor-only annotated VCFs can be identified by searching for workflow "GATK4 MuTect2 Annotation" +* The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file. +* Some miRNA files with QC failed reads were not swapped in DR11.0. 361 aliquots remain to be swapped in a later release +* Mutation frequency may be underestimated when using MAF files for genes that overlap other genes. This is because MAF files only record one gene per variant. +* Most intronic mutations are removed for MAF generation. However, validated variants may rescue these in some cases. Therefore intronic mutations in MAF files are not representative of those called by mutation callers. +* Public MAF files for different variant calling pipelines but the same project may contain different numbers of samples. Samples are omitted from the public MAF files if they have no PASS variants, which can lead to this apparent discrepancy. +* BAM files produced by the GDC RNA-Seq Alignment workflow will currently fail validation using the Picard ValidateSamFile tool. 
This is caused by STAR2 not recording mate mapping information for unmapped reads, which are retained in our BAM files. Importantly, all affected BAM files are known to behave normally in downstream workflows including expression quantification. +* Portion "weight" property is incorrectly described in the Data Dictionary as the weight of the patient in kg; it should be described as the weight of the portion in mg +* TCGA Projects + * Incorrect information about treatment may be included for patients within TCGA-HNSC and TCGA-LGG. Please refer to the clinical XML for accurate information on treatment + * 74 Diagnostic TCGA slides are attached to a portion rather than a sample like the rest of the diagnostic slides. This reflects how these original samples were handled. + * Two tissue slide images are unavailable for download from GDC Data Portal + * The raw and annotated VarScan VCF files for aliquot `TCGA-VR-A8ET-01A-11D-A403-09` are not available. These VCF files will be replaced in a later release. + * Some TCGA annotations are unavailable in the Legacy Archive or Data Portal. These annotations can be found [here](tcga-annotations-unavailable-20170315.json). + * Tumor grade property is not populated + * Progression_or_recurrence property is not populated +* TARGET projects + * TARGET CGI BAMs in the Legacy Archive for the following aliquots should not be used because they were not repaired and concatenated into their original composite BAM files by CGHub. 
+ * TARGET-20-PASJGZ-04A-02D + * TARGET-30-PAPTLY-01A-01D + * TARGET-20-PAEIKD-09A-01D + * TARGET-20-PASMYS-14A-02D + * TARGET-20-PAMYAS-14A-02D + * TARGET-10-PAPZST-09A-01D + * 11 bam files for TARGET-NBL RNA-Seq are not available in the GDC Data portal + * There are 5051 TARGET files for which `experimental_strategy`, `data_format`, `platform`, and `data_subtype` are blank + * There are two cases with identical submitter_id `TARGET-10-PARUYU` + * Some TARGET cases are missing `days_to_last_follow_up` + * Some TARGET cases are missing `age_at_diagnosis` + * Some TARGET files are not connected to all related aliquots + * Samples of TARGET sample_type `Recurrent Blood Derived Cancer - Bone Marrow` are mislabeled as `Recurrent Blood Derived Cancer - Peripheral Blood`. A workaround is to look at the sample barcode, which is -04 for `Recurrent Blood Derived Cancer - Bone Marrow`. (e.g. `TARGET-20-PAMYAS-04A-03R`) + * The latest TARGET data is not yet available at the GDC. For the complete and latest data, please see the [TARGET Data Matrix](https://ocg.cancer.gov/programs/target/data-matrix). Data that is not present or is not the most up to date includes: + * All microarray data and metadata + * All sequencing analyzed data and metadata + * 1180 of 12063 sequencing runs of raw data + * Demographic information for some TARGET patients is incorrect. The correct information can be found in the associated clinical supplement file. Impacted patients are TARGET-50-PAJNUS. + * No data from TARGET-MDLS is available. +* Issues in the Legacy Archive + * The read alignment end coordinates in the x.isoform.quantification.txt files produced by the miRNA pipeline are exclusive (i.e. offset by 1) for all TCGA miRNA legacy (GRCh37/hg19) and current harmonized (GRCh38/hg38) miRNA data. 
This error has no impact on miRNA alignment or quantification - only the coordinates reported in the quantification file. + * Slide barcodes (`submitter_id` values for Slide entities in the Legacy Archive) are not available + * SDF Files are not linked to Project or Case in the Legacy Archive + * Two biotab files are not linked to Project or Case in the Legacy Archive + * SDRF files are not linked to Project or Case in the Legacy Archive + * TARGET-MDLS cases do not have disease_type or primary_site populated + + ## Data Release 31.0 * __GDC Product__: Data