1.1 Prepare mapping and QC reference data

Overview

We use the STAR algorithm for RNA-Seq mapping. For QC we use bamstats, a tool developed within the Sanger CGP group for the ICGC-TCGA-PanCancer Project (installed via this GitHub project), along with RSeQC for additional RNA-Seq-specific QC information.

For each reference-gene build, cgpRna expects reference data to be present in a particular structure. Pre-generated reference data is available for the following reference/gene builds:

GRCh38 Ensembl release 77 (full name of reference build is GRCh38_full_analysis_set_plus_decoy_hla)
GRCh37 Ensembl release 75 (full name of reference build is GRCH37d5)

If you require reference data for a different combination of reference and gene build, you will need to build it by following the instructions below.

Pre-generated data

The pre-generated reference data for cgpRna can be downloaded from ftp://ftp.sanger.ac.uk/pub/cancer/support-files/cgpRna/

To decompress, run the command: tar -zxvf <ref>.tar.gz -C /path/to/decompress/to

You will need approx. 200GB space for each reference set of data.

N.B. The /path/to/decompress/to will then become /ref/data/root or the -r parameter when running the pipeline (see usage for running the mapping and qc pipeline as an example).
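Because each reference set needs roughly 200GB, it can be worth checking free space before extracting. The following is a minimal sketch, not part of cgpRna: extract_reference and its arguments are hypothetical names, and the 200GB default simply mirrors the estimate above.

```shell
# Hypothetical helper (not part of cgpRna): extract a reference tarball
# only if the destination filesystem has enough free space.
extract_reference() {
    local tarball="$1" dest="$2" required_gb="${3:-200}"
    mkdir -p "$dest" || return 1
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    local avail_kb
    avail_kb=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
    local required_kb=$(( required_gb * 1024 * 1024 ))
    if [ "$avail_kb" -lt "$required_kb" ]; then
        echo "Not enough space in $dest: need ${required_gb}GB" >&2
        return 1
    fi
    tar -zxf "$tarball" -C "$dest"
}

# Example (paths are placeholders):
# extract_reference <ref>.tar.gz /path/to/decompress/to
```

The destination passed here is the same path later given to the pipeline as the -r parameter.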

Building reference data from scratch

The reference data needs to have the following basic structure:
<ref-data-root> / <species> / <reference-genome-build> / star / <gene-build> e.g.
../ref-data/human/hg38/star/e77

Reference genome fasta (.fa) files and their accompanying .fa.fai index files should be placed under the reference-genome-build folder. For example, in the pre-generated data set, genome.fa and genome.fa.fai are under <ref-data-root>/human/GRCh38.
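As a rough illustration, the folder layout can be created in one go. This sketch assumes you want the star folder described above together with the cgpRna and rseqc folders described in the following sections; make_ref_layout is a hypothetical helper name, not a cgpRna script.

```shell
# Hypothetical helper (not part of cgpRna): create the reference data layout
# <ref-data-root>/<species>/<reference-genome-build>/{star,cgpRna}/<gene-build>
# plus the rseqc folder, and print the reference-genome-build folder.
make_ref_layout() {
    local root="$1" species="$2" genome_build="$3" gene_build="$4"
    local base="$root/$species/$genome_build"
    mkdir -p "$base/star/$gene_build" \
             "$base/cgpRna/$gene_build" \
             "$base/rseqc" || return 1
    echo "$base"
}

# Example (the root path is a placeholder):
# make_ref_layout /path/to/ref-data human GRCh38 77
```

genome.fa then goes into the folder the function prints. If samtools is installed, running samtools faidx genome.fa in that folder is one way to produce genome.fa.fai.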

Shared reference files

Create the following directory structure: /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build> e.g. <ref-data-root>/human/GRCh38/cgpRna/77
Download the Ensembl GTF file for the gene annotation version you are using, decompress it, and place it in the following directory with the name ensembl.gtf: /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build>. For example, for e77 (which is compatible with GRCh38) the GTF file was downloaded from http://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/, then decompressed and renamed to ensembl.gtf. N.B. If the reference fasta being used contains "chr" in the chromosome names, the same prefix will need to be added to the chromosome names in the ensembl.gtf file.
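If the "chr" prefix does need adding, something along these lines may help. It is a sketch under stated assumptions: add_chr_prefix is a hypothetical name, the input is assumed to be a tab-separated Ensembl GTF whose header lines start with "#", and plain prefixing does not handle cases where names differ in other ways (for example MT versus chrM), so check the result against genome.fa.fai.

```shell
# Hypothetical helper (not part of cgpRna): prefix the chromosome column
# of a GTF file with "chr" and write the result to stdout.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }                  # keep header/comment lines unchanged
         $1 !~ /^chr/ { $1 = "chr" $1 }        # add the prefix only if it is missing
         { print }' "$1"
}

# Example (file names are placeholders):
# add_chr_prefix downloaded.gtf > ensembl.gtf
```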

Building star reference files

To generate the star reference data:

1. Create the following directory structure: /<ref-data-root>/<species>/<reference-genome-build>/star/<gene-build> e.g. <ref-data-root>/human/GRCh38/star/77
2. Run the following command to generate the index files:

<installation-directory>/bin/STAR --runMode genomeGenerate --genomeDir <ref-data-root>/human/<reference-genome-build>/star --genomeFastaFiles <ref-data-root>/human/<reference-genome-build>/genome.fa --sjdbOverhang 99

where installation-directory = path_to_install_to when the setup.sh script was run

Please note that the STAR documentation advises against including alternative and haplotype sequences in the reference. This means it may be necessary to create a cut-down genome.fa file specifically for STAR that excludes those contig types. Human GRCh38 is a prime example of where this is required.

N.B. The --sjdbOverhang parameter should be set to the read length minus 1. The RNA-seq data the pipeline was developed for contained libraries of 2x75bp and 2x100bp. Where there is a mixture of read lengths like this, it is recommended to base the parameter on the longer length, so we used 99 (100 - 1).

It's also possible to add annotated transcript information to the genome index at this stage, which greatly improves mapping. From STAR version 2.4.1a onwards this information can instead be supplied on the fly during the mapping step, which is what the cgpRna star_fusion pipeline does with the parameter --sjdbGTFfile /human/GRCh38/77/ensembl.gtf, so please ensure you have installed a version of STAR compatible with this functionality.

3. Create a soft link to the Ensembl GTF file: cd /<ref-data-root>/<species>/<reference-genome-build>/star/<gene-build>; ln -s /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build>/ensembl.gtf
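Two of the notes in step 2 can be sketched in shell: producing a cut-down fasta without alternative and haplotype contigs, and deriving --sjdbOverhang from the read lengths. Both functions are hypothetical helpers, not part of cgpRna or STAR. The exclusion pattern in the example is an assumption about how contigs are named in your fasta (names ending in _alt, or starting with HLA-), so check it against genome.fa.fai before relying on it.

```shell
# Hypothetical helper: write to stdout only those fasta records whose
# sequence name does NOT match the given extended regular expression.
filter_fasta() {
    awk -v pat="$2" '
        /^>/ { name = substr($1, 2); keep = (name !~ pat) }  # decide per record
        keep { print }                                        # print header and sequence lines
    ' "$1"
}

# Hypothetical helper: --sjdbOverhang is the longest read length minus 1.
sjdb_overhang() {
    local max=0 len
    for len in "$@"; do
        if [ "$len" -gt "$max" ]; then
            max="$len"
        fi
    done
    echo $(( max - 1 ))
}

# Example (file names are placeholders; the pattern is an assumption):
# filter_fasta genome.fa '_alt$|^HLA-' > star_genome.fa
# sjdb_overhang 75 100    # prints 99
```

The filtered file would then be the one passed to --genomeFastaFiles, and the computed value the one passed to --sjdbOverhang.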

RSeQC reference data

For the RSeQC tool we will be using three modules that require reference bed files:

split_bam.py, which is used to determine the amount of ribosomal RNA contamination. An rRNA bed file is required for this, which can either be downloaded from the UCSC Table Browser or obtained from the RSeQC developers, who provide it for human GRCh38 or hg19.
geneBody_coverage.py, which checks whether read coverage is uniform across gene bodies or shows any 5' or 3' bias. It uses a bed file called .HouseKeepingGene.bed, which can also be downloaded via the species links on the RSeQC website.
read_distribution.py, which calculates how mapped reads are distributed over genome features (such as CDS exon, 5' UTR exon, 3' UTR exon, intron and intergenic regions). It uses a bed file called .RefSeq.bed, which can also be downloaded via the RSeQC species links or the UCSC Table Browser.

N.B. It's important to ensure the chromosome name format in these files matches the reference, so if 'chr' is not present in your reference fasta, then 'chr' should be removed from the chromosome column in all of the above bed files.

Decompress the files using gunzip, create the following folder in the reference data area and place them there: <ref-data-root> / <species> / <reference-genome-build> / rseqc /
The rseqc directory should then contain: rRNA.bed, HouseKeepingGene.bed and RefSeq.bed
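The chromosome renaming and the final folder check can be sketched as follows. Both function names are hypothetical, and the sketch assumes the chromosome name is the first column of each bed line, as in standard bed files.

```shell
# Hypothetical helper: remove a leading "chr" from the chromosome column
# (first column) of a bed file and write the result to stdout.
strip_chr_prefix() {
    sed 's/^chr//' "$1"
}

# Hypothetical helper: confirm the rseqc folder holds the three expected,
# non-empty bed files. Returns non-zero if any is missing.
check_rseqc_dir() {
    local dir="$1" missing=0 f
    for f in rRNA.bed HouseKeepingGene.bed RefSeq.bed; do
        if [ ! -s "$dir/$f" ]; then
            echo "Missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (file names and paths are placeholders):
# strip_chr_prefix downloaded.rRNA.bed > <ref-data-root>/human/GRCh38/rseqc/rRNA.bed
# check_rseqc_dir <ref-data-root>/human/GRCh38/rseqc
```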
