-
Notifications
You must be signed in to change notification settings - Fork 4
1.2 Run the mapping and QC pipeline
RNA-Seq reads are mapped with the STAR aligner.
A suggested mapping and QC pipeline (using scripts and tools that should be in the bin directory of the installation path provided when setup.sh was run) is shown in the diagram below:
star_mapping.pl, which can be found in the perl/bin directory of cgpRna, is a wrapper script which implements the STAR algorithm. Inputs can be one or multiple BAM files or one or more pairs of fastq(.gz) files. If BAM, then star_mapping.pl will first split the data in to fastq.gz files using biobambam bamtofastq which is installed by PCAP-Core (a dependency for the cgpRna setup.sh install script). The script produces two BAM files; 1 - mapped against the reference genome (sample.star.Aligned.out.BAM) and 2 - aligned to the transcriptome (sample.star.AlignedtoTranscriptome.out.BAM) where the latter uses the GTF file downloaded as part of the reference data set-up.
N.B. for full help instructions, run star_maping.pl without input parameters
star_mapping.pl -s <sample-name> -o <output-directory> -r <reference-data> -sp <species> -rb <ref-build> -gb <gene-build> -g <gtf-file> [file(s)...] e.g.
star_mapping.pl -s TEST -o RESULTS -r /ref/data/root -sp human -rb GRCh38 -gb 77 -g ensembl.gtf test.1_1.fastq test.1_2.fastq
STAR requires at least 32GB of memory and star_mapping.pl can be run with multiple threads - the suggestion is at least 4.
If you want to produce mapping QC metrics, bamstats (installed by PCAP-Core) can be run over the BAM files via the following command:
bam_stats -r <ref-fasta.fai> -i <input.bam> -o <output.bas> e.g.
bam_stats -r /ref/data/root/human/GRCh38/genome.fa.fai -i <sample>.star.Aligned.out.bam -o <sample>.bas
bam_stats -r /ref/data/root/human/GRCh38/genome.fa.fai -i <sample>.star.AlignedtoTranscriptome.out.bam -o <sample>.bas
Additional RNA-Seq specific metrics can also be generated using scripts from the RSeQC tool kit installed as part of the cgpRna installation.
split_bam.py which is used to determine the amount of ribosomal RNA contamination.
split_bam.py -i <sample>.star.Aligned.out.bam -r /ref/data/root/human/ref-build/rseqc/rRNA.bed -o <sample>.rRNA > <sample>.rrna.txt
geneBody_coverage.py which checks if read coverage is consistent or whether any 5' or 3' bias across genes.
geneBody_coverage.py -i <sample>.star.Aligned.out.bam -f 'png' -r /ref/data/root/human/ref-build/rseqc/HouseKeepingGenes.bed -o <sample>
read_distribution.py this module will calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions).
read_distribution.py -i <sample>.star.Aligned.out.bam -r /ref/data/root/human/ref-build/rseqc/RefSeq.bed > <sample>.read_dist.txt
The script, process_qcstats.pl can then be run to reformat the output from the RSeQC tools above so that the stats can be appended on to the end of the bas file produced by bamstats.
process_qcstats.pl -s <sample> -i <input-dir-where-rseqc-results-are> -o <out-dir>
then a simple Unix paste command can be used to collate the RNA-Seq QC metrics:
paste <sample>.bas <sample>.insert.bas <sample>.read.dist.bas <sample>.rrna.bas <sample>.gene.cov.bas > <sample>.RNA.bas