Skip to content

Fast, efficient RNA-Seq metrics for quality control and process optimization

License

Notifications You must be signed in to change notification settings

getzlab/rnaseqc

Repository files navigation

RNA-SeQC

Version CI

RNA-SeQC 2 is described in A. Graubert*, F. Aguet*, A. Ravi, K.G. Ardlie, Gad Getz, "RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts," Bioinformatics, 2021.

Installing

The latest stable build of RNA-SeQC is available on the GitHub Releases page, and contains static binaries for Linux and OSX.

RNA-SeQC is also available as a docker image: gcr.io/broad-cga-aarong-gtex/rnaseqc:latest which is automatically updated with any code change. Older versions of the docker image are tagged using the full commit SHA of any commit which introduced a code change.

To checkout the source of RNA-SeQC run git clone --recursive https://github.com/getzlab/rnaseqc.git. If you do not use the --recursive flag, you'll need to run git submodule update --init --recursive or you will be missing SeqLib.

Unit Tests

Input data for RNA-SeQC's testing suite is not stored in the repository due to size constraints. The current test data is available here, and must be unpacked within the test_data/ directory. Please note that the location of the test data is subject to change. The test resources use ~1.2 GB of space.

You can download and unpack test data with:

cd test_data
wget https://storage.googleapis.com/agraubert/broadinstitute/rnaseqc/test_inputs.tar.gz
tar xzf test_inputs.tar.gz

You can run the unit tests with make test

Usage

NOTE: This tool requires that the provided GTF be collapsed in such a way that there are no overlapping transcripts on the same strand and that each gene have a single transcript whose id matches the parent gene id. This is not a transcript-quantification method. Readcounts and coverage are made towards exons and genes only if all aligned segments of a read fully align to exons of a gene, but keep in mind that coverage may be counted towards multiple transcripts (and its exons) if these criteria are met. Beyond this, no attempt will be made to disambiguate which transcript a read belongs to. You can collapse an existing GTF using the GTEx collapse annotation script

Command Line Usage:

rnaseqc [OPTIONS] gtf bam output

Example: ./rnaseqc test_data/downsampled.gtf test_data/downsampled.bam --bed test_data/downsampled.bed --coverage .

OPTIONS:
  -h, --help                        Display this message and quit

  --version                         Display the version and quit

  gtf                               The input GTF file containing features
                                    to check the bam against

  bam                               The input SAM/BAM file containing reads
                                    to process

  output                            Output directory

  -s[sample], --sample=[sample]     The name of the current sample. Default:
                                    The bam's filename

  --bed=[BEDFILE]                   Optional input BED file containing
                                    non-overlapping exons used for fragment
                                    size calculations

  --fasta=[fasta]                   Optional input FASTA/FASTQ file
                                    containing the reference sequence used
                                    for parsing CRAM files

  --chimeric-distance=[DISTANCE]    Set the maximum accepted distance
                                    between read mates. Mates beyond this
                                    distance will be counted as chimeric
                                    pairs. Default: 2000000 [bp]

  --fragment-samples=[SAMPLES]      Set the number of samples to take when
                                    computing fragment sizes. Requires the
                                    --bed argument. Default: 1000000

  -q[QUALITY],
  --mapping-quality=[QUALITY]       Set the lower bound on read quality for
                                    exon coverage counting. Reads below this
                                    number are excluded from coverage
                                    metrics. Default: 255

  --base-mismatch=[MISMATCHES]      Set the maximum number of allowed
                                    mismatches between a read and the
                                    reference sequence. Reads with more than
                                    this number of mismatches are excluded
                                    from coverage metrics. Default: 6

  --offset=[OFFSET]                 Set the offset into the gene for the 3'
                                    and 5' windows in bias calculation. A
                                    positive value shifts the 3' and 5'
                                    windows towards eachother, while a
                                    negative value shifts them apart.
                                    Default: 150 [bp]

  --window-size=[SIZE]              Set the size of the 3' and 5' windows in
                                    bias calculation. Default: 100 [bp]

  --gene-length=[LENGTH]            Set the minimum size of a gene for bias
                                    calculation. Genes below this size are
                                    ignored in the calculation. Default: 600
                                    [bp]

  --legacy                          Use legacy counting rules. Gene and exon
                                    counts match output of RNA-SeQC 1.1.9

  --stranded=[stranded]             Use strand-specific metrics. Only
                                    features on the same strand of a read
                                    will be considered. Allowed values are
                                    'RF', 'rf', 'FR', and 'fr'

  -v, --verbose                     Give some feedback about what's going
                                    on. Supply this argument twice for
                                    progress updates while parsing the bam

  -t[TAG...], --tag=[TAG...]        Filter out reads with the specified tag.

  --chimeric-tag=[TAG]              Reads maked with the specified tag will
                                    be labeled as Chimeric. Defaults to 'ch'
                                    for STAR

  --exclude-chimeric                Exclude chimeric reads from the read
                                    counts

  -u, --unpaired                    Allow unpaired reads to be quantified.
                                    Required for single-end libraries

  --rpkm                            Output gene RPKM values instead of TPMs

  --coverage                        If this flag is provided, coverage
                                    statistics for each transcript will be
                                    written to a table. Otherwise, only
                                    summary coverage statistics are
                                    generated and added to the metrics table

  --coverage-mask=[SIZE]            Sets how many bases at both ends of a
                                    transcript are masked out when computing
                                    per-base exon coverage. Default: 500bp

  -d[threshold],
  --detection-threshold=[threshold] Number of counts on a gene to consider
                                    the gene 'detected'. Additionally, genes
                                    below this limit are excluded from 3'
                                    bias computation. Default: 5 reads

  "--" can be used to terminate flag options and force all following
  arguments to be treated as positional options

Output files:

The following output files are generated in the output directory you provide:

  • {sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.
  • {sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.
  • {sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons. This file contains the gene-level read counts used, e.g., for differential expression analyses.
  • {sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file, with expression values in transcript per million (TPM) units. Note: this file is renamed to .gene_rpkm.gct if the --rpkm flag is present.
  • {sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided
  • {sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.

Metrics reported:

See Metrics.md for a description of all metrics reported in the metrics.tsv, coverage.tsv, and fragmentSizes.txt files.

Legacy mode differences

The --legacy flag enables compatibility with RNASeQC 1.1.9. This ensures that exon and gene readcounts match exactly the counts which would have been produced by running that version. This also adds an extra condition to classify reads as chimeric (see "Chimeric Reads", above). Any metrics which existed in 1.1.9 will also match within Java's floating point precision.