A pipeline for the analysis of STRT2 RNA-sequencing outputs from NextSeq.
Please cite Ezer et al. 2021 when you use this pipeline.
Ezer S, Yoshihara M, Katayama S, DoGA consortium, Daub C, Lohi H, Krjutskov K, Kere J.
Generation of RNA sequencing libraries for transcriptome analysis of globin-rich tissues of the domestic dog.
STAR Protoc. 2021 Dec 12;2(4):100995.
git clone https://github.com/my0916/STRT2.git
For STRT2.sh
(the main pipeline)
For STRT-TFE.sh
(TFE-based analysis)
For fastq-fastQC.sh
The conda environment is provided as condaEnv.yml
. The environment can be created with the followings:
conda env create -f condaEnv.yml
conda activate STRT2_env
For UPPMAX, these software are available through the module
command in the scripts (STRT2-UPPMAX.sh
, STRT2-TFE-UPPMAX.sh
, and fastq-fastqc-uppmax.sh
).
- Illumina BaseCalls files (.bcl). The number of lanes is determined based on the number of directories in the basecalls directory. Here is an example of 4 lanes:
├── L001
├── L002
├── L003
└── L004
- HISAT2 index built with a reference genome, (ribosomal DNA), and ERCC spike-ins
- See also How to build HISAT2 index.
- The HISAT2 index directory should include the followings:
├── [basename].1.ht2
├── [basename].2.ht2
├── [basename].3.ht2
├── [basename].4.ht2
├── [basename].5.ht2
├── [basename].6.ht2
├── [basename].7.ht2
├── [basename].8.ht2
├── [basename].fasta
└── [basename].dict
- Source files (in
src
directory)barcode.txt
: Barcode sequence with barcode name (1–48). Please modify if you used different (number of) barcodes.ERCC.bed
: 5'-end 50 nt region of ERCC spike-ins (SRM2374) for annotation and quality check.
./STRT2.sh -o STRT2LIB -g canFam3 -a ens -b /path/to/Data/Intensities/BaseCalls/ \
-i /path/to/index/canFam3_reference -c HUDDINGE -r ABCDEFG123
For UPPMAX:
sbatch -A snic2017-7-317 -p core -n 8 -t 24:00:00 ./STRT2-UPPMAX.sh -o STRT2LIB -g canFam3 -a ens \
-b /path/to/Data/Intensities/BaseCalls/ -i /path/to/index/canFam3_reference -c HUDDINGE -r ABCDEFG123
-
Mandatory
Name Description -g, --genome
Reference genome. Choose from hg19
/hg38
/mm9
/mm10
/canFam3
.-b, --basecalls
/PATH/to/the Illumina basecalls directory. -i, --index
/PATH/to/the directory and basename of the HISAT2 index for the reference genome. -
Optional
Name Default value Description -o, --out
OUTPUT Output file name. -a, --annotation
ref Gene annotation for QC and counting.
Choose fromref
(RefSeq)/ens
(Ensembl)/kg
(UCSC KnownGenes), or directly input the Gencode annotation file name (eg.wgEncodeGencodeBasicV28lift37
) for Gencode.
Note that some annotations are unavailable in some cases. Please find the details below.-c, --center
CENTER The name of the sequencing center that produced the reads.
Required for the the Picard IlluminaBasecallsToSam program.-r, --run
RUNBARCODE The barcode of the run. Prefixed to read names.
Required for the the Picard IlluminaBasecallsToSam program.-s, --structure
8M3S74T6B Read structure.
Required for the the Picard IlluminaBasecallsToSam program.
Details are described here.-d, --dta
Add -d, --dta
(downstream-transcriptome-assembly) if you plan to perform TFE-based analysis.
Please note that this leads to fewer alignments with short-anchors.-h, --help
Show usage. -v, --version
Show version. -a, --annotation
availability as of July 2020:
RefSeq (ref) Ensembl (ens) KnownGenes (kg) Gencode hg19 (human) ✔️ ✔️ ✔️ ✔️ hg38 (human) ✔️ NA ✔️ ✔️ mm9 (mouse) ✔️ ✔️ ✔️ NA mm10 (mouse) ✔️ NA ✔️ ✔️ canFam3 (dog) ✔️ ✔️ NA NA
Outputs are provided in out
directory.
Unaligned BAM files generated with Picard IlluminaBasecallsToSam program are found in tmp/Unaligned_bam
.
-
OUTPUT
-QC.txt
Quality check report for all samples.Column Value Barcode
Sample name. OUTPUT
with numbersQualified_reads
Primary aligned read count Total_reads
Read count without redundant (duplicate) reads Redundancy
Qualified reads / Total reads Mapped_reads
Mapped read count (Total reads without unmapped reads) Mapped_rate
Mapped reads / Total reads Spikein_reads
Read count mapped to ERCC spike-ins Spikein-5end_reads
Read count mapped to the 5'-end 50 nt region of ERCC spike-ins Spikein-5end_rate
Spikein-5end reads / Spikein reads Coding_reads
Read count aligned within any exon or the 500 bp upstream of coding genes Coding-5end_reads
Read count aligned the 5′-UTR or 500 bp upstream of coding genes Coding-5end_rate
Coding-5end reads / Coding reads -
OUTPUT
-QC-plots.pdf
Quality check report by boxplots.Mapped_reads
,Mapped_rate
,Spikein_reads
,Mapped / Spikein
,Spikein-5end_rate
, andCoding-5end_rate
are shown for all samples. Barcode numbers of outlier samples are marked with red characters.
Please consider these outlier samples for the further downstream analysis. -
OUTPUT
_byGene-counts.txt
Read count table output from featureCounts. http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf -
OUTPUT
_byGene-counts.txt.summary
Filtering summary from featureCounts. http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf -
Output_bam
Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads. -
Output_bai
Index files (.bai) of the resulting BAM files in theOutput_bam
directory. -
ExtractIlluminaBarcodes_Metrics
Metrics file produced by the Picard ExtractIlluminaBarcodes program. The number of matches/mismatches between the barcode reads and the actual barcodes is shown per lane. https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_illumina_ExtractIlluminaBarcodes.php -
HISAT2_Metrics
Alignment summary of samples from each lane produced by the HISAT2 program. https://ccb.jhu.edu/software/hisat2/manual.shtml#alignment-summary -
MarkDuplicates_Metrics
Metrics file indicating the numbers of duplicates produced by the Picard MarkDuplicates program. https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.4.0/picard_sam_markduplicates_MarkDuplicates.php
After running the pipeline above, you can generate fastq files for each sample from the output BAM files in the fastq
directory. These fastq files (without duplicated reads) can be submitted to public sequence databases.
FastQC files are also generated for each fastq file in the fastqc
directory.
Based on the FastQC results, MultiQC report (MultiQC_report.html) is generated.
Here is the case for the dog genome (canFam3).
You may add the ribosomal DNA repetitive unit for human (U13369) and mouse (BK000964).
wget http://hgdownload.cse.ucsc.edu/goldenPath/canFam3/bigZips/canFam3.fa.gz
unpigz -c canFam3.fa.gz | ruby -ne '$ok = $_ !~ /^>chrUn_/ if $_ =~ /^>/; puts $_ if $ok' > canFam3_reference.fasta
wget https://www-s.nist.gov/srmors/certificates/documents/SRM2374_putative_T7_products_NoPolyA_v2.FASTA
cat SRM2374_putative_T7_products_NoPolyA_v2.FASTA >> canFam3_reference.fasta
Here Ensembl transcript map (canFam3.transMapEnsemblV4.gtf.gz) was downloaded from the UCSC Table Browser.
unpigz -c canFam3.transMapEnsemblV4.gtf.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > canFam3.ss
unpigz -c canFam3.transMapEnsemblV4.gtf.gz | hisat2_extract_exons.py - | grep -v ^chrUn > canFam3.exon
You may additionally perform hisat2_extract_snps_haplotypes_UCSC.py
to extract SNPs and haplotypes from a dbSNP file for human and mouse.
hisat2-build canFam3_reference.fasta --ss canFam3.ss --exon canFam3.exon canFam3_reference
This outputs a set of files with suffixes. Here, canFam3_reference.1.ht2
, canFam3_reference.2.ht2
, ..., canFam3_reference.8.ht2
are generated.
In this case, canFam3_reference
is the basename used for -i, --index
.
java -jar picard.jar CreateSequenceDictionary R=canFam3_reference.fasta O=canFam3_reference.dict
This is required for the Picard MergeBamAlignment program. Note that the original FASTA file (canFam3_reference.fasta
here) is also required.