Skip to content

2. Use the command line interface (CLI) in the Docker container

Yaobo Xu edited this page Nov 11, 2019 · 4 revisions

From version 2.4.0 we added run-cgprna command-line interface (CLI) which simplifies the use of cgpRna, however, only starting from version 2.5.0 the CLI has functoinalities of detecting gene fusion events (the cgpRna Infuse pipeline). This page covers some general information on how to use the CLI within the container.

NOTE: As we changed the file structure within the mapping reference bundle files from 2.5.0, please use cgpRna version 2.5.0 and above to follow this document. If using version 2.4.X, you'll need to use cgpRna_v2.4.x-star.*ensemblxx.tar.gz files on our FTP site as the mapping reference bundle, and you won't be able to use any gene fusion tool through the CLI.

Example commands on this page use Singularity (version 3.1) to illustrate the CLI usage. You'll need to make sure that 'singularity' executable is available in your environment (i.e. You'll need to install Singularity or load the module in a modules environment, like farms within Sanger Institue). If you're using Docker, Docker commands should follow the same principles.

Page contents:

1. Pull the image

Make sure to pull the image of version 2.5.0 or above from Quay.io.

Examples to pull image of version 2.5.0:

singularity pull docker://quay.io/wtsicgp/cgprna:2.5.0
# cgprna_2.4.0.sif will be created in your current working directory

2. CLI command-line usages

Once you have the image:

  • to find out available sub-commands:

    singularity exec /path/to/cgprna_x.x.x.sif run-cgprna -h
  • to check usages of a particular sub-command:

    singularity exec /path/to/cgprna_x.x.x.sif run-cgprna A_SUB_COMMAND -h

3. Download reference files

References files are available on our FTP site for run-cgprna command line to use. Currently, there're only files for GRCh38 and GRCh37.

4. Run commands

With run-cgprna, most of the steps in an RNAseq analysis workflow can be started with a single command.

4.1 Map reads to the reference

With the command below, you'll be able to map RAW_FILE_1 and RAW_FILE_2 to the reference genome. run-cgprna map uses STAR to map the reads and output mapped BAM files to OUTDIR.

4.1.1 Reference file

You'll need a star_RefBuildXX_ensemblXX.tar.gz file (cgpRna-mapRefBundle.*.tar.gz file if using cgpRna 2.4.X) downloaded from our FTP site as the reference file.

4.1.2 Input

run-cgprna map uses STAR to map a BAM file or a pair of FastQ files to a reference genome.

The example command below uses a pair of FastQ files as input. Please note that the names of the pair must follow the convention that they have the suffix of _1 and _2 immediately prior to .f[ast]q.

If your raw data is in an unmapped BAM file instead of a pair of FastQ files, you can set RAW_FILE_1 to the path to the raw BAM file, and remove lines with RAW_FILE_2 from the command, but keep -i $RAW_FILE_1 part in the last line.

4.1.3 Output BAM header

By default, when input raw file is a BAM file, BAM header of the output file will have the same read group ID, library, description, platform and platform unit as the input BAM file in @RG line. However, these can be changed with command-line options. Please run run-cgprna map -h to see available options for manipulating output BAM headers. You'll want to set them if your input is a pair of FastQ files.

By default, SAMPLE_NAME is used as the prefix of output files , however, if your sample is sequenced on multiple lanes or batches you'll want to use -op(--output-file-prefix) option for run-cgprna map command to change the prefix, so that multiple mapping jobs won't overwrite each other when using the same output folder. You'll need to use the same SAMPLE_NAME for all lane/batch mapped BAM if you intend to merge them later with bammarkduplicates2 in this container, as SAMPLE_NAME will be written into the output BAM header as the sample name tag value.

4.1.4 Example command for mapping
SAMPLE_NAME='test_sample'
RAW_FILE_1=/path/to/test_sample_1_1.fq.gz
RAW_FILE_2=/path/to/test_sample_1_2.fq.gz
OUTDIR=/path/to/a/output/folder
MAP_REF_BUNDLE=/path/to/cgpRna-mapRefBundle.tar.gz
SINGULARITY_IMAGE=/path/to/cgprna.sif
THREADS=4
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $MAP_REF_BUNDLE:$MAP_REF_BUNDLE:ro \
--bind $RAW_FILE_1:$RAW_FILE_1:ro \
--bind $RAW_FILE_2:$RAW_FILE_2:ro \
$SINGULARITY_IMAGE \
run-cgprna map \
-t $THREADS \
-r $MAP_REF_BUNDLE \
-s $SAMPLE_NAME \
-od $OUTDIR \
-i $RAW_FILE_1 $RAW_FILE_2

4.2 Generate quality metrics of mapping

Within cgpRna container, run-cgprna stats can be used to generate mapping stats of a sample (or a lane/batch of a sample) from outputs of the mapping step.

4.2.1 Input Alignments

run-cgprna stats takes two BAM files as input. One file is the alignment of reads mapped to the genome, the other one is the transcriptome alignments, which is optional but can be used to calculate stats such as transcriptome insert size.

4.2.2 Reference

Reference bundle can be downloaded from our FTP site, you'll need to make sure it matches the genome build of the mapping reference.

4.2.3 Example command for generating mapping stats
GENOME_BAM=/path/to/map_output/test_sample.bam
TRANSCRIPTOME_BAM=/path/to/map_output/test_sample.mapToTranscriptome.bam
OUTDIR=/path/to/a/output/folder
STATS_REF_BUNDLE=/path/to/GRCh37d5_reseqc_ref.tar.gz
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $STATS_REF_BUNDLE:$STATS_REF_BUNDLE:ro \
--bind $GENOME_BAM:$GENOME_BAM:ro \
--bind $TRANSCRIPTOME_BAM:$TRANSCRIPTOME_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna stats \
-r $STATS_REF_BUNDLE \
-od $OUTDIR \
-i $GENOME_BAM \
-tb $TRANSCRIPTOME_BAM

4.3 Merge BAM files (if a sample is sequenced on multiple lanes or in batches)

This is not implemented in the CLI as it's not required for sample sequenced on one lane (which most of RNAseq samples are) and it can be done easily with the tool already available inside the container. An example command is below:

# Remove lines with 'BAM_LANE_3' in if the sample is only sequenced on two lanes. Add more of those lines if it has more than 3 lanes.
OUTDIR=/path/to/output/folder
OUTBAM_NAME=test_sample_merged.bam # just a file name, the file will be created in the OUTDIR
OUTBAM_INDEX_NAME=test_sample_merged.bam.bai # just a file name, the file will be created in the OUTDIR
OUTBAM_MD5_NAME=test_sample_merged.bam.md5 # just a file name, the file will be created in the OUTDIR
OUTBAM_DUP_MET_NAME=test_sample_merged.bam.met # just a file name, the file will be created in the OUTDIR
BAM_LANE_1=/path/to/map_output/test_sample_lane_1.bam
BAM_LANE_2=/path/to/map_output/test_sample_lane_2.bam
BAM_LANE_3=/path/to/map_output/test_sample_lane_3.bam
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $BAM_LANE_1:$BAM_LANE_1:ro \
--bind $BAM_LANE_2:$BAM_LANE_2:ro \
--bind $BAM_LANE_3:$BAM_LANE_3:ro \
$SINGULARITY_IMAGE \
bammarkduplicates2 md5=1 index=1 \
O=$OUTDIR/$OUTBAM_NAME \
indexfilename=$OUTDIR/$OUTBAM_INDEX_NAME \
md5filename=$OUTDIR/$OUTBAM_MD5_NAME \
M=$OUTDIR/$OUTBAM_DUP_MET_NAME \
I=$BAM_LANE_3 \
I=$BAM_LANE_2 \
I=$BAM_LANE_1

4.4 Count reads mapped to genes

4.4.1 Reference

We provide a GTF each for GRCh38 and GRCh37 on our FTP site. These GTFs are the same as in star_RefBuildXX_ensemblXX.tar.gz file (cgpRna-mapRefBundle.*.tar.gz file if using cgpRna 2.4.X), which STAR uses as gene annotations. You can use these GTFs for counting or other GTFs of you like, as long as they are for the same genome build, to which reads are mapped in the input BAM file.

4.4.2 Example command for counting
INPUT_BAM=/path/to/test_sample.bam
OUTDIR=/path/to/a/output/folder
REF_GTF=/path/to/ensembl.gtf
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $REF_GTF:$REF_GTF:ro \
--bind $INPUT_BAM:$INPUT_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna count \
-r $REF_GTF \
-od $OUTDIR \
-i $INPUT_BAM

4.5 Generate a genomic coverage track

4.5.1 Reference

The reference file (and its .fai file) for this step is available on our FTP site for each reference genome build on the server. Make sure you downloaded the .fai file with the genome.fa file and place them in the same folder.

Note that you need to use the same FastA file as what the input BAM is mapped to, and these do not have to be from our FTP site.

4.5.2 Example command for generating a coverage track
INPUT_BAM=/path/to/test_sample.bam
OUTDIR=/path/to/a/output/folder
REF_FASTA=/path/to/ensembl.gtf
THREADS=4
SINGULARITY_IMAGE=/path/to/cgprna.sif
singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
--bind $REF_FASTA:$REF_FASTA:ro \
--bind ${REF_FASTA}.fai:${REF_FASTA}.fai:ro \
--bind $INPUT_BAM:$INPUT_BAM:ro \
$SINGULARITY_IMAGE \
run-cgprna bigwig \
-t $THREADS \
-r $REF_FASTA \
-od $OUTDIR \
-i $INPUT_BAM

4.6 Detect gene fusions with cgpRna Infuse pipeline

Starting from cgpRna version 2.5.0, Infuse pipeline can be run through the CLI to search for gene fusion events. It uses Tophat fusion, STAR-Fusion and deFuse to detect gene fusion events and then compare detected events of the three tools and uses VAGrENG to annotate them.

In the CLI, three sub-commands were added tophat-fusion, star-fusion and defuse to run the tools.

4.6.1 Fusion reference bundles

We packed reference files into three bundled tar files for the three tools respectively. They are all available on our FTP site. These are every large files so please compare the md5 checksums after you download them.

Reference bundles for each CLI sub-command:

  • tophat-fusion - you'll need tophat_RefBuildXX_ensemblXX.tar.gz from the FTP site.
  • star-fusion - you'll need star_RefBuildXX_ensemblXX.tar.gz from the FTP site, which is the same as for the mapping sub-command.
  • defuse - you'll need defuse_RefBuildXX_ensemblXX.tar.gz from the FTP site.
4.6.2 Inputs and outputs

In the CLI, three sub-commands were added tophat-fusion, star-fusion and defuse. Inputs of these commands can be BAM file, pair of FastQ files (optionally gzipped) or a mixture of both. When there more than one input file you just need to use a white space to separate the file paths. The three commands share the same input format.

All of the three gene-fusion tools outputs multiple files. In our in-house pipeline, we only care about the files listed below:

  • sample_name.tophat-fusion.normals.filtered.strand.txt of tophat-fusion,
  • sample_name.star-fusion.normals.filtered.txt of star-fusion,
  • sample_name.defuse-fusion.normals.ext.filtered.txt of defuse.
4.6.3 Example commands

NOTE: Make sure you're using image of cgpRna version 2.5.0 or above.

  • Tophat-fusion

    SAMPLE_NAME=test_sample
    # Remove/add `INPUT_X` as you need.
    INPUT_1=/path/to/test_sample.bam
    INPUT_2=/path/to/test_sample.1_1.fq.gz
    INPUT_3=/path/to/test_sample.1_2.fq.gz
    OUTDIR=/path/to/a/output/folder
    REFERENCE_BUNDLE=/path/to/bundle/tophat_a-ref-build_ref_bundle.tar.gz
    THREADS=4
    SINGULARITY_IMAGE=/path/to/cgprna.sif
    singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
    --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \
    --bind $INPUT_1:$INPUT_1:ro \
    --bind $INPUT_2:$INPUT_2:ro \
    --bind $INPUT_3:$INPUT_3:ro \
    $SINGULARITY_IMAGE \
    run-cgprna tophat-fusion \
    -t $THREADS \
    -r $REFERENCE_BUNDLE \
    -od $OUTDIR \
    -s $SAMPLE_NAME \
    -i $INPUT_1 $INPUT_2 $INPUT_3
  • STAR-fusion

    SAMPLE_NAME=test_sample
    # Remove/add `INPUT_X` as you need.
    INPUT_1=/path/to/test_sample.bam
    INPUT_2=/path/to/test_sample.1_1.fq.gz
    INPUT_3=/path/to/test_sample.1_2.fq.gz
    OUTDIR=/path/to/a/output/folder
    REFERENCE_BUNDLE=/path/to/bundle/star_a-ref-build_ref_bundle.tar.gz
    THREADS=4
    SINGULARITY_IMAGE=/path/to/cgprna.sif
    singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
    --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \
    --bind $INPUT_1:$INPUT_1:ro \
    --bind $INPUT_2:$INPUT_2:ro \
    --bind $INPUT_3:$INPUT_3:ro \
    $SINGULARITY_IMAGE \
    run-cgprna tophat-fusion \
    -t $THREADS \
    -r $REFERENCE_BUNDLE \
    -od $OUTDIR \
    -s $SAMPLE_NAME \
    -i $INPUT_1 $INPUT_2 $INPUT_3
  • deFuse

    SAMPLE_NAME=test_sample
    # Remove/add `INPUT_X` as you need.
    INPUT_1=/path/to/test_sample.bam
    INPUT_2=/path/to/test_sample.1_1.fq.gz
    INPUT_3=/path/to/test_sample.1_2.fq.gz
    OUTDIR=/path/to/a/output/folder
    REFERENCE_BUNDLE=/path/to/bundle/defuse_a-ref-build_ref_bundle.tar.gz
    THREADS=4
    SINGULARITY_IMAGE=/path/to/cgprna.sif
    singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
    --bind $REFERENCE_BUNDLE:$REFERENCE_BUNDLE:ro \
    --bind $INPUT_1:$INPUT_1:ro \
    --bind $INPUT_2:$INPUT_2:ro \
    --bind $INPUT_3:$INPUT_3:ro \
    $SINGULARITY_IMAGE \
    run-cgprna tophat-fusion \
    -t $THREADS \
    -r $REFERENCE_BUNDLE \
    -od $OUTDIR \
    -s $SAMPLE_NAME \
    -i $INPUT_1 $INPUT_2 $INPUT_3
  • Merge and annotate fusion events

    This functoinality is not implemented in the CLI, as it only needs to run a single perl script in the container: compare_overlapping_fusions.pl. The Perl script requires following inputs:

    • A string as the sample name

    • Output files of the three fusion tools:

      • sample_name.tophat-fusion.normals.filtered.strand.txt of tophat-fusion,
      • sample_name.star-fusion.normals.filtered.txt of star-fusion, and
      • sample_name.defuse-fusion.normals.ext.filtered.txt of defuse.
    • A GTF file, which is used to annotate break points,

    • VAGrENT cache and ref files that should be the same reference and gene build as the GTF file,

    GTF file and VAGrENT files for human can be found on our FTP site.

    Example command:

    SAMPLE_NAME=test_sample
    TOPHAT_FUSION_OUT=/path/to/tophat_fusion_output/test_sample.tophat-fusion.normals.filtered.strand.txt
    STAR_FUSION_OUT=/path/to/star_fusion_output/test_sample.star-fusion.normals.filtered.txt
    DEFUSE_OUT=/path/to/defuse_output/test_sample.defuse-fusion.normals.ext.filtered.txt
    GTF=/path/to/the_gtf.gtf
    VAGRENT_CACHE=/path/to/vagrent_ref_ensemblxx.cache.gz
    VAGRENT_FA=/path/to/vagrent_ref_ensemblxx.fa
    OUTDIR=/path/to/a/output/folder
    SINGULARITY_IMAGE=/path/to/cgprna.sif
    singularity exec --cleanenv --workdir $OUTDIR --pwd $OUTDIR --home $OUTDIR \
    --bind $GTF:$GTF:ro \
    --bind $VAGRENT_CACHE:$VAGRENT_CACHE:ro \
    --bind ${VAGRENT_CACHE}.tbi:${VAGRENT_CACHE}.tbi:ro \
    --bind $VAGRENT_FA:$VAGRENT_FA:ro \
    --bind ${VAGRENT_FA}.fai:${VAGRENT_FA}.fai:ro \
    --bind $TOPHAT_FUSION_OUT:$TOPHAT_FUSION_OUT:ro \
    --bind $STAR_FUSION_OUT:$STAR_FUSION_OUT:ro \
    --bind $DEFUSE_OUT:$DEFUSE_OUT:ro \
    $SINGULARITY_IMAGE \
    compare_overlapping_fusions.pl \
    -s $SAMPLE_NAME \
    -g $GTF \
    -c $VAGRENT_CACHE \
    -od $OUTDIR \
    $TOPHAT_FUSION_OUT $STAR_FUSION_OUT $DEFUSE_OUT