Merge pull request #141 from NCI-GDC/DR32_Release

Dr32 release
NCI-GDC · Feb 28, 2022 · ec3b8bf · ec3b8bf
2 parents 2ae8379 + 1756f5f
commit ec3b8bf
Show file tree

Hide file tree

Showing 7 changed files with 981 additions and 28 deletions.
diff --git a/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md b/docs/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline.md
@@ -1,7 +1,7 @@
 # mRNA Analysis Pipeline
 
 ## Introduction
-The GDC mRNA quantification analysis pipeline measures gene level expression in [HT-Seq](http://www-huber.embl.de/HTSeq/doc/overview.html) raw read count, Fragments per Kilobase of transcript per Million mapped reads (FPKM), and FPKM-UQ (upper quartile normalization). These values are generated through this pipeline by first aligning reads to the GRCh38 [reference genome](https://gdc.cancer.gov/download-gdc-reference-files) and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses.    
+The GDC mRNA quantification analysis pipeline measures gene level expression with [STAR](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) as raw read counts. Subsequently the counts are augmented with several transformations including Fragments per Kilobase of transcript per Million mapped reads (FPKM), upper quartile normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These values are additionally annotated with the gene symbol and gene bio-type. These data are generated through this pipeline by first aligning reads to the GRCh38 [reference genome](https://gdc.cancer.gov/download-gdc-reference-files) and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses.
 
 ## Data Processing Steps
 
@@ -12,7 +12,9 @@ Files that were processed after Data Release 14 have associated transcriptomic a
 
 Files that were processed after Data Release 25 will have associated [gene fusion files](#fusion-pipelines).
 
-[![RNA Alignment Pipeline](images/gene-expression-quantification-pipeline-v3.png)](images/gene-expression-quantification-pipeline-v3.png "Click to see the full image.")
+As of Data Release 32 the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used.
+
+[![RNA Alignment Pipeline](images/RNA_Expression_WF_Flowchart.png)](images/RNA_Expression_WF_Flowchart.png "Click to see the full image.")
 
 | I/O | Entity | Format |
 |---|---|---|
@@ -157,19 +159,62 @@ STAR \
 --runThreadN <threads> \
 --twopassMode Basic
 ```
+```DR32
+# STAR Genome Index
+STAR
+--runMode genomeGenerate
+--genomeDir <star_index_path>
+--genomeFastaFiles <reference>
+--sjdbOverhang 100
+--sjdbGTFfile <gencode.v36.annotation.gtf>
+--runThreadN 8
+
+# STAR Alignment
+# STAR v2.7.0f
+STAR
+--readFilesIn <fastq_files> \
+--outSAMattrRGline <read_group_strings> \
+--genomeDir <genome_dir> \
+--readFilesCommand <cat, zcat, etc> \
+--runThreadN <threads> \
+--twopassMode Basic \
+--outFilterMultimapNmax 20 \
+--alignSJoverhangMin 8 \
+--alignSJDBoverhangMin 1 \
+--outFilterMismatchNmax 999 \
+--outFilterMismatchNoverLmax 0.1 \
+--alignIntronMin 20 \
+--alignIntronMax 1000000 \
+--alignMatesGapMax 1000000 \
+--outFilterType BySJout \
+--outFilterScoreMinOverLread 0.33 \
+--outFilterMatchNminOverLread 0.33 \
+--limitSjdbInsertNsj 1200000 \
+--outFileNamePrefix <output prefix> \
+--outSAMstrandField intronMotif \
+--outFilterIntronMotifs None \
+--alignSoftClipAtReferenceEnds Yes \
+--quantMode TranscriptomeSAM GeneCounts \
+--outSAMtype BAM Unsorted \
+--outSAMunmapped Within \
+--genomeLoad NoSharedMemory \
+--chimSegmentMin 15 \
+--chimJunctionOverhangMin 15 \
+--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
+--chimOutJunctionFormat 1 \
+--chimMainSegmentMultNmax 1 \
+--outSAMattributes NH HI AS nM NM ch
+```
 
 \*These indices are available for download at the [GDC Website](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) and do not need to be built again.
 
 ### mRNA Expression Workflow
-Following alignment, BAM files are processed through the [RNA Expression Workflow](/Data_Dictionary/viewer/#?view=table-definition-view&id=rna_expression_workflow) to determine RNA expression levels.
-
-The reads mapped to each gene are enumerated using HT-Seq-Count. Expression values are provided in a tab-delimited format. [GENCODE v22](https://www.gencodegenes.org/human/release_22.html) was used for gene annotation.
 
-Files that were processed after Data Release 14 have an additional set of read counts that were produced by STAR during the alignment step.
+The primary counting data is generated by STAR and includes a gene ID, unstranded, and stranded counts data. Following alignment, the raw counts files produced by STAR are augmented with commonly used counts transformations (FPKM, FPKM-UQ, and TPM) along with basic annotations as part of the [RNA Expression Workflow](/Data_Dictionary/viewer/#?view=table-definition-view&id=rna_expression_workflow). These data are provided in a tab-delimited format. [GENCODE v36](https://www.gencodegenes.org/human/release_36.html) was used for gene annotation.
 
-Note that counting algorithms such as HTSeq and STAR will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero.
+Note that the STAR counting results will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero.
 
-* [Overlapped Genes (stranded)](/Data/Bioinformatics_Pipelines/overlap.gene.stranded.tsv)  
+* [Overlapped Genes (stranded)](/Data/Bioinformatics_Pipelines/overlap.gene.stranded.tsv)
 * [Overlapped Genes (unstranded)](/Data/Bioinformatics_Pipelines/overlap.gene.strandless.tsv)
 
 | I/O | Entity | Format |
@@ -181,6 +226,9 @@ Note that counting algorithms such as HTSeq and STAR will not count reads that a
 
 HTSeq-0.6.1p1
 
+```Current
+Counts are produced by STAR concurrent with alignment.
+```
 ```Original
 htseq-count \
 -m intersection-nonempty \
@@ -189,7 +237,7 @@ htseq-count \
 -s no \
 - gencode.v22.annotation.gtf
 ```
-```DR15Plus
+```DR15-31
 htseq-count \
 -f bam \
 -r name \
@@ -202,26 +250,25 @@ htseq-count \
 <gtf_file> > <counts_file>
 ```
 
-## mRNA Expression HT-Seq Normalization
+## mRNA Expression Transformation
 
-RNA-Seq expression level read counts produced by HT-Seq are normalized using two similar methods: FPKM and FPKM-UQ. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated.
+RNA-Seq expression level read counts produced by the workflow are normalized using three commonly used methods: FPKM, FPKM-UQ, and TPM. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated.
 
 ### FPKM
 
-The Fragments per Kilobase of transcript per Million mapped reads (FPKM) calculation normalizes read count by dividing it by the gene length and the total number of reads mapped to protein-coding genes.
+The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity.
 
 ### Upper Quartile FPKM
 
-The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample.
+The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the protein coding gene in the 75th percentile position is substituted for the sequencing quantity. This is thought to provide a more stable value than including the noisier genes at the extremes.
 
-### Calculations
+### TPM
 
-[![FPKM Calculations](images/Calc_FPKM_andUQ.png)](images/fpkm.gif "Click to see the full image.")
+The transcripts per million calculation can is similar to FPKM, but the difference is that all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values are used as an indicator of size.
 
-- __RC<sub>g</sub>:__ Number of reads mapped to the gene
-- __RC<sub>pc</sub>:__ Number of reads mapped to all protein-coding genes
-- __RC<sub>g75</sub>:__ The 75th percentile read count value for genes in the sample
-- __L:__ Length of the gene in base pairs; Calculated as the sum of all exons in a gene
+### Calculations
+
+[![FPKM Calculations](images/normalizations_calc.png)](images/normalizations_calc.png "Click to see the full image.")
 
 __Note:__ The read count is multiplied by a scalar (10<sup>9</sup>) during normalization to account for the kilobase and 'million mapped reads' units.
 
@@ -233,10 +280,14 @@ __Sample 1: Gene A__
 * 1,000 reads mapped to Gene A
 * 1,000,000 reads mapped to all protein-coding regions
 * Read count in Sample 1 for 75th percentile gene: 2,000
+* Number of protein coding genes on autosomes: 19,029
+* Sum of length-normalized transcript counts: 9,000,000
+
+__FPKM for Gene A__ = 1,000 \* 10^9 / (3,000 \* 50,000,000) = __6.67__
 
-__FPKM for Gene A__ = (1,000)\*(10^9)/[(3,000)\*(1,000,000)] = __333.33__
+__FPKM-UQ for Gene A__ = 1,000) \* 10^9 / (3,000 \* 2,000 \* 19,029) = __8.76__
 
-__FPKM-UQ for Gene A__ = (1,000)\*(10^9)/[(3,000)\*(2,000)] = __166,666.67__
+__TPM for Gene A__ = (1,000 * 1,000 / 3,000) * 1,000,000 / (9,000,000) = __37.04__
 
 ## Fusion Pipelines
 
@@ -256,7 +307,7 @@ The [Arriba gene fusion pipeline](https://github.com/suhrig/arriba) uses Arriba
 
 ## scRNA-Seq Pipeline (single-nuclei)
 
-The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the [Cell Ranger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) to calculate gene expression followed by [Seurat](https://satijalab.org/seurat/) for secondary expression analysis.  
+The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the [Cell Ranger pipeline](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) to calculate gene expression followed by [Seurat](https://satijalab.org/seurat/) for secondary expression analysis.
 
 ### scRNA Gene Expression Pipeline
 
@@ -278,12 +329,9 @@ When the input RNA was extracted from nuclei instead of cytoplasm, a slightly mo
 
 ## File Access and Availability
 
-To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal.   
+To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal.
 
 | Type | Description | Format |
 |---|---|---|
-| RNA-Seq Alignment | RNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets  |  BAM |
-| HT-Seq Read Counts | The number of reads aligned to each gene, calculated by HT-Seq |  TXT |
-| STAR Read Counts | The number of reads aligned to each gene, calculated by STAR |  TSV |
-| FPKM | A normalized expression value that takes into account each gene length and the number of reads mapped to all protein-coding genes |  TXT |
-| FPKM-UQ | A modified version of the FPKM formula in which the 75th percentile read count is used as the denominator in place of the total number of protein-coding reads |  TXT |
+| RNA-Seq Alignment | RNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets.  |  BAM |
+| STAR Read Counts | The number of reads aligned to each gene, calculated by STAR, along with values using common normalization methods. |  TSV |