-
Notifications
You must be signed in to change notification settings - Fork 4
2. Use the command line interface (CLI) in the Docker container
From version 2.4.0
we added run-cgprna
command-line interface (CLI) which simplifies the use of cgpRna, however, only starting from version 2.5.0
the CLI has functoinalities of detecting gene fusion events (the cgpRna Infuse pipeline). This page covers some general information on how to use the CLI within the container.
NOTE: As we changed the file structure within the mapping reference bundle files from 2.5.0
, please use cgpRna version 2.5.0
and above to follow this document. If using version 2.4.X
, you'll need to use cgpRna_v2.4.x-star.*ensemblxx.tar.gz
files on our FTP site as the mapping reference bundle, and you won't be able to use any gene fusion tool through the CLI.
Example commands on this page use Singularity (version 3.1) to illustrate the CLI usage. You'll need to make sure that 'singularity' executable is available in your environment (i.e. You'll need to install Singularity or load the module in a modules environment, like farms within Sanger Institue). If you're using Docker, Docker commands should follow the same principles.
Make sure to pull the image of version 2.5.0
or above from Quay.io.
Examples to pull image of version 2.5.0
:
singularity pull docker://quay.io/wtsicgp/cgprna:2.5.0
# cgprna_2.4.0.sif will be created in your current working directory
Once you have the image:
-
to find out available sub-commands:
singularity exec /path/to/cgprna_x.x.x.sif run-cgprna -h
-
to check usages of a particular sub-command:
singularity exec /path/to/cgprna_x.x.x.sif run-cgprna A_SUB_COMMAND -h
References files are available on our FTP site for run-cgprna
command line to use. Currently, there're only files for GRCh38
and GRCh37
.
With run-cgprna
, most of the steps in an RNAseq analysis workflow can be started with a single command.
With the command below, you'll be able to map RAW_FILE_1
and RAW_FILE_2
to the reference genome. run-cgprna map
uses STAR to map the reads and output mapped BAM files to OUTDIR
.
You'll need a star_RefBuildXX_ensemblXX.tar.gz
file (cgpRna-mapRefBundle.*.tar.gz
file if using cgpRna 2.4.X
) downloaded from our FTP site as the reference file.
run-cgprna map
uses STAR to map a BAM file or a pair of FastQ files to a reference genome.
The example command below uses a pair of FastQ files as input. Please note that the names of the pair must follow the convention that they have the suffix of _1
and _2
immediately prior to .f[ast]q
.
If your raw data is in an unmapped BAM file instead of a pair of FastQ files, you can set RAW_FILE_1
to the path to the raw BAM file, and remove lines with RAW_FILE_2
from the command, but keep -i $RAW_FILE_1
part in the last line.
By default, when input raw file is a BAM file, BAM header of the output file will have the same read group ID, library, description, platform and platform unit as the input BAM file in @RG
line. However, these can be changed with command-line options. Please run run-cgprna map -h
to see available options for manipulating output BAM headers. You'll want to set them if your input is a pair of FastQ files.
By default, SAMPLE_NAME
is used as the prefix of output files , however, if your sample is sequenced on multiple lanes or batches you'll want to use -op
(--output-file-prefix
) option for run-cgprna map
command to change the prefix, so that multiple mapping jobs won't overwrite each other when using the same output folder. You'll need to use the same SAMPLE_NAME
for all lane/batch mapped BAM if you intend to merge them later with bammarkduplicates2
in this container, as SAMPLE_NAME
will be written into the output BAM header as the sample name tag value.
SAMPLE_NAME='test_sample'
RAW_FILE_1=/path/to/test_sample_1_1.fq.gz
RAW_FILE_2=/path/to/test_sample_1_2.fq.gz
OUTDIR=/path/to/a/output/folder
MAP_REF_BUNDLE=/path/to/cgpRna-mapRefBundle.tar.gz
SINGULARITY_IMAGE=/path/to/cgprna.sif
THREADS=4
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $MAP_REF_BUNDLE:$MAP_REF_BUNDLE:ro \
--bind $RAW_FILE_1:$RAW_FILE_1:ro \
--bind $RAW_FILE_2:$RAW_FILE_2:ro \
$SINGULARITY_IMAGE \
run-cgprna map \
-t $THREADS \
-r $MAP_REF_BUNDLE \
-s $SAMPLE_NAME \
-od $OUTDIR \
-i $RAW_FILE_1 $RAW_FILE_2
Within cgpRna container, run-cgprna stats
can be used to generate mapping stats of a sample (or a lane/batch of a sample) from outputs of the mapping step.
run-cgprna stats
takes two BAM files as input. One file is the alignment of reads mapped to the genome, the other one is the transcriptome alignments, which is optional but can be used to calculate stats such as transcriptome insert size.
Reference bundle can be downloaded from our FTP site, you'll need to make sure it matches the genome build of the mapping reference.
GENOME_BAM=/path/to/map_output/test_sample.bam
TRANSCRIPTOME_BAM=/path/to/map_output/test_sample.mapToTranscriptome.bam
OUTDIR=/path/to/a/output/folder
STATS_REF_BUNDLE=/path/to/GRCh37d5_reseqc_ref.tar.gz
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $STATS_REF_BUNDLE:$STATS_REF_BUNDLE:ro \
--bind $GENOME_BAM:$GENOME_BAM:ro \
--bind $TRANSCRIPTOME_BAM:$TRANSCRIPTOME_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna stats \
-r $STATS_REF_BUNDLE \
-od $OUTDIR \
-i $GENOME_BAM \
-tb $TRANSCRIPTOME_BAM
This is not implemented in the CLI as it's not required for sample sequenced on one lane (which most of RNAseq samples are) and it can be done easily with the tool already available inside the container. An example command is below:
# Remove lines with 'BAM_LANE_3' in if the sample is only sequenced on two lanes. Add more of those lines if it has more than 3 lanes.
OUTDIR=/path/to/output/folder
OUTBAM_NAME=test_sample_merged.bam # just a file name, the file will be created in the OUTDIR
OUTBAM_INDEX_NAME=test_sample_merged.bam.bai # just a file name, the file will be created in the OUTDIR
OUTBAM_MD5_NAME=test_sample_merged.bam.md5 # just a file name, the file will be created in the OUTDIR
OUTBAM_DUP_MET_NAME=test_sample_merged.bam.met # just a file name, the file will be created in the OUTDIR
BAM_LANE_1=/path/to/map_output/test_sample_lane_1.bam
BAM_LANE_2=/path/to/map_output/test_sample_lane_2.bam
BAM_LANE_3=/path/to/map_output/test_sample_lane_3.bam
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $BAM_LANE_1:$BAM_LANE_1:ro \
--bind $BAM_LANE_2:$BAM_LANE_2:ro \
--bind $BAM_LANE_3:$BAM_LANE_3:ro \
$SINGULARITY_IMAGE \
bammarkduplicates2 md5=1 index=1 \
O=$OUTDIR/$OUTBAM_NAME \
indexfilename=$OUTDIR/$OUTBAM_INDEX_NAME \
md5filename=$OUTDIR/$OUTBAM_MD5_NAME \
M=$OUTDIR/$OUTBAM_DUP_MET_NAME \
I=$BAM_LANE_3 \
I=$BAM_LANE_2 \
I=$BAM_LANE_1
We provide a GTF each for GRCh38
and GRCh37
on our FTP site. These GTFs are the same as in star_RefBuildXX_ensemblXX.tar.gz
file (cgpRna-mapRefBundle.*.tar.gz
file if using cgpRna 2.4.X
), which STAR uses as gene annotations. You can use these GTFs for counting or other GTFs of you like, as long as they are for the same genome build, to which reads are mapped in the input BAM file.
INPUT_BAM=/path/to/test_sample.bam
OUTDIR=/path/to/a/output/folder
REF_GTF=/path/to/ensembl.gtf
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $REF_GTF:$REF_GTF:ro \
--bind $INPUT_BAM:$INPUT_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna count \
-r $REF_GTF \
-od $OUTDIR \
-i $INPUT_BAM
The reference file (and its .fai
file) for this step is available on our FTP site for each reference genome build on the server. Make sure you downloaded the .fai
file with the genome.fa
file and place them in the same folder.
Note that you need to use the same FastA file as what the input BAM is mapped to, and these do not have to be from our FTP site.
INPUT_BAM=/path/to/test_sample.bam
OUTDIR=/path/to/a/output/folder
REF_FASTA=/path/to/ensembl.gtf
THREADS=4
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $REF_FASTA:$REF_FASTA:ro \
--bind ${REF_FASTA}.fai:${REF_FASTA}.fai:ro \
--bind $INPUT_BAM:$INPUT_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna bigwig \
-t $THREADS \
-r $REF_FASTA \
-od $OUTDIR \
-i $INPUT_BAM
Starting from cgpRna version 2.5.0
, Infuse pipeline can be run through the CLI to search for gene fusion events. It uses Tophat fusion, STAR-Fusion and deFuse to detect gene fusion events and then compare detected events of the three tools and uses VAGrENG to annotate them.
In the CLI, three sub-commands were added tophat-fusion
, star-fusion
and defuse
to run the tools.
We packed reference files into three bundled tar files for the three tools respectively. They are all available on our FTP site. These are every large files so please compare the md5 checksums after you download them.
Reference bundles for each CLI sub-command:
-
tophat-fusion
- you'll needtophat_RefBuildXX_ensemblXX.tar.gz
from the FTP site. -
star-fusion
- you'll needstar_RefBuildXX_ensemblXX.tar.gz
from the FTP site, which is the same as for themapping
sub-command. -
defuse
- you'll needdefuse_RefBuildXX_ensemblXX.tar.gz
from the FTP site.
In the CLI, three sub-commands were added tophat-fusion
, star-fusion
and defuse
. Inputs of these commands can be BAM file, pair of FastQ files (optionally gzipped) or a mixture of both. When there more than one input file you just need to use a white space to separate the file paths. The three commands share the same input format.
All of the three gene-fusion tools outputs multiple files. In our in-house pipeline, we only care about the files listed below:
-
sample_name.tophat-fusion.normals.filtered.strand.txt
oftophat-fusion
, -
sample_name.star-fusion.normals.filtered.txt
ofstar-fusion
, -
sample_name.defuse-fusion.normals.ext.filtered.txt
ofdefuse
.
NOTE: Make sure you're using image of cgpRna version 2.5.0 or above.
-
Tophat-fusion
SAMPLE_NAME=test_sample # Remove/add `INPUT_X` as you need. INPUT_1=/path/to/test_sample.bam INPUT_2=/path/to/test_sample.1_1.fq.gz INPUT_3=/path/to/test_sample.1_2.fq.gz OUTDIR=/path/to/a/output/folder REFERENCE_BUNDLE=/path/to/bundle/tophat_a-ref-build_ref_bundle.tar.gz THREADS=4 SINGULARITY_IMAGE=/path/to/cgprna.sif singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \ --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \ --bind $INPUT_1:$INPUT_1:ro \ --bind $INPUT_2:$INPUT_2:ro \ --bind $INPUT_3:$INPUT_3:ro \ $SINGULARITY_IMAGE \ run-cgprna tophat-fusion \ -t $THREADS \ -r $REFERENCE_BUNDLE \ -od $OUTDIR \ -s $SAMPLE_NAME \ -i $INPUT_1 $INPUT_2 $INPUT_3
-
STAR-fusion
SAMPLE_NAME=test_sample # Remove/add `INPUT_X` as you need. INPUT_1=/path/to/test_sample.bam INPUT_2=/path/to/test_sample.1_1.fq.gz INPUT_3=/path/to/test_sample.1_2.fq.gz OUTDIR=/path/to/a/output/folder REFERENCE_BUNDLE=/path/to/bundle/star_a-ref-build_ref_bundle.tar.gz THREADS=4 SINGULARITY_IMAGE=/path/to/cgprna.sif singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \ --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \ --bind $INPUT_1:$INPUT_1:ro \ --bind $INPUT_2:$INPUT_2:ro \ --bind $INPUT_3:$INPUT_3:ro \ $SINGULARITY_IMAGE \ run-cgprna tophat-fusion \ -t $THREADS \ -r $REFERENCE_BUNDLE \ -od $OUTDIR \ -s $SAMPLE_NAME \ -i $INPUT_1 $INPUT_2 $INPUT_3
-
deFuse
SAMPLE_NAME=test_sample # Remove/add `INPUT_X` as you need. INPUT_1=/path/to/test_sample.bam INPUT_2=/path/to/test_sample.1_1.fq.gz INPUT_3=/path/to/test_sample.1_2.fq.gz OUTDIR=/path/to/a/output/folder REFERENCE_BUNDLE=/path/to/bundle/defuse_a-ref-build_ref_bundle.tar.gz THREADS=4 SINGULARITY_IMAGE=/path/to/cgprna.sif singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \ --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \ --bind $INPUT_1:$INPUT_1:ro \ --bind $INPUT_2:$INPUT_2:ro \ --bind $INPUT_3:$INPUT_3:ro \ $SINGULARITY_IMAGE \ run-cgprna tophat-fusion \ -t $THREADS \ -r $REFERENCE_BUNDLE \ -od $OUTDIR \ -s $SAMPLE_NAME \ -i $INPUT_1 $INPUT_2 $INPUT_3
-
Merge and annotate fusion events
This functoinality is not implemented in the CLI, as it only needs to run a single perl script in the container:
compare_overlapping_fusions.pl
. The Perl script requires following inputs:-
A string as the sample name
-
Output files of the three fusion tools:
-
sample_name.tophat-fusion.normals.filtered.strand.txt
oftophat-fusion
, -
sample_name.star-fusion.normals.filtered.txt
ofstar-fusion
, and -
sample_name.defuse-fusion.normals.ext.filtered.txt
ofdefuse
.
-
-
A GTF file, which is used to annotate break points,
-
VAGrENT cache and ref files that should be the same reference and gene build as the GTF file,
GTF file and VAGrENT files for human can be found on our FTP site.
Example command:
SAMPLE_NAME=test_sample TOPHAT_FUSION_OUT=/path/to/tophat_fusion_output/test_sample.tophat-fusion.normals.filtered.strand.txt STAR_FUSION_OUT=/path/to/star_fusion_output/test_sample.star-fusion.normals.filtered.txt DEFUSE_OUT=/path/to/defuse_output/test_sample.defuse-fusion.normals.ext.filtered.txt GTF=/path/to/the_gtf.gtf VAGRENT_CACHE=/path/to/vagrent_ref_ensemblxx.cache.gz VAGRENT_FA=/path/to/vagrent_ref_ensemblxx.fa OUTDIR=/path/to/a/output/folder SINGULARITY_IMAGE=/path/to/cgprna.sif singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \ --bind $GTF:$GTF:ro \ --bind $VAGRENT_CACHE:$VAGRENT_CACHE:ro \ --bind ${VAGRENT_CACHE}.tbi:${VAGRENT_CACHE}.tbi:ro \ --bind $VAGRENT_FA:$VAGRENT_FA:ro \ --bind ${VAGRENT_FA}.fai:${VAGRENT_FA}.fai:ro \ --bind $TOPHAT_FUSION_OUT:$TOPHAT_FUSION_OUT:ro \ --bind $STAR_FUSION_OUT:$STAR_FUSION_OUT:ro \ --bind $DEFUSE_OUT:$DEFUSE_OUT:ro \ $SINGULARITY_IMAGE \ compare_overlapping_fusions.pl \ -s $SAMPLE_NAME \ -g $GTF \ -c $VAGRENT_CACHE \ -od $OUTDIR \ $TOPHAT_FUSION_OUT $STAR_FUSION_OUT $DEFUSE_OUT
-