Skip to content

Custom Pipeline Steps

Stephen Kelly edited this page Dec 6, 2016 · 8 revisions

Overview

An overview of creating a custom pipeline step can be found here:

https://github.com/NYU-BFX/hic-bench_documentation/blob/master/HiC-Bench_manual.pdf

The following basic steps should be taken to create a custom pipeline step:

  • Copy an existing step as a template
  • Set the input directories (’inpdirs’)
  • Edit the run file and add needed parameter files
  • Set pipeline parameters
  • Add a script in the code directory containing the commands needed to run the programs used in the pipeline step (make sure to run chmod +x on it too).
  • Update the new pipeline step name and add it to the entries in the index.txt and as a symlink in the parent level of the analysis directory

Example

In this example, we will create a pipeline step within a currently existing analysis project. This new pipeline step will generate a plot displaying the number of peaks per sample.

Copying existing step

To start with, we will copy the align-stats step, which generates a plot from alignments, and is a similar task to what we intend to do.

SmithLab-ChIPSeq_2016-12-02/pipeline$ ll
total 486K
drwxr-xr-x 20 kellys04  608 Dec  5 20:48 .
drwxr-xr-x  7 kellys04  734 Dec  5 09:01 ..
drwxr-xr-x  5 kellys04  168 Dec  2 18:53 align
drwxr-xr-x  6 kellys04  193 Dec  2 18:53 align-stats
lrwxrwxrwx  1 kellys04    7 May 20  2016 code -> ../code
lrwxrwxrwx  1 kellys04   12 Mar 12  2016 code.main -> ../code.main
drwxr-xr-x  6 kellys04  193 Dec  2 20:41 diffbind
drwxr-xr-x  5 kellys04  168 Dec  2 19:13 fastqc

SmithLab-ChIPSeq_2016-12-02/pipeline$ cp -vr align-stats count_peaks

SmithLab-ChIPSeq_2016-12-02/pipeline$ ll
total 486K
drwxr-xr-x 20  608 Dec  5 20:48 .
drwxr-xr-x  7  734 Dec  5 09:01 ..
drwxr-xr-x  5  168 Dec  2 18:53 align
drwxr-xr-x  6  193 Dec  2 18:53 align-stats
lrwxrwxrwx  1    7 May 20  2016 code -> ../code
lrwxrwxrwx  1   12 Mar 12  2016 code.main -> ../code.main
drwxr-x---  6  193 Dec  5 21:19 count_peaks
drwxr-xr-x  6  193 Dec  2 20:41 diffbind
drwxr-xr-x  5  168 Dec  2 19:13 fastqc

Set input directories

We need to make a symlink in the inpdirs directory to the directory from which to gather input files, and remove links to directories that will not be used.

SmithLab-ChIPSeq_2016-12-02/pipeline$ cd count_peaks/

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll
total 106K
drwxr-x---  6 kellys04   193 Dec  5 21:19 .
drwxr-xr-x 20 kellys04   608 Dec  5 20:48 ..
lrwxrwxrwx  1 kellys04    15 Dec  5 20:48 clean.tcsh -> code/clean.tcsh
lrwxrwxrwx  1 kellys04     7 Dec  5 20:48 code -> ../code
drwxr-x---  2 kellys04    27 Dec  5 20:48 errors
drwxr-x---  2 kellys04    23 Dec  5 20:49 inpdirs
lrwxrwxrwx  1 kellys04     9 Dec  5 20:48 inputs -> ../inputs
drwxr-x---  2 kellys04    38 Dec  5 20:48 params
drwxrwx---  4 kellys04    59 Dec  5 21:19 results
-rwxr-x---  1 kellys04  1.2K Dec  5 21:20 run

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll inpdirs/
total 38K
drwxr-x--- 2 kellys04   23 Dec  5 20:49 .
drwxr-x--- 6 kellys04  193 Dec  5 21:19 ..
lrwxrwxrwx 1 kellys04   11 Dec  5 20:49 align -> ../../align

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ cd inpdirs/

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ ln -s ../../peaks

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ rm -f align

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ ll
total 38K
drwxr-x--- 2 kellys04   23 Dec  5 20:49 .
drwxr-x--- 6 kellys04  193 Dec  5 21:19 ..
lrwxrwxrwx 1 kellys04   11 Dec  5 20:49 peaks -> ../../peaks

Edit run Script

The run script, located in the parent pipeline step directory, acts like a wrapper to bring together all the components needed to run the individual pipeline step. Next we will show what the edited run script looks like for this step, with additional markup for explanation.

The run script location:

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll
total 106K
...
-rwxr-x---  1 kellys04  1.2K Dec  5 21:20 run

run Script Contents

The script header:

#!/bin/tcsh
source ./code/code.main/custom-tcshrc      # customize shell environment

##
## USAGE: run [--dry-run]
##

Special descriptive tags for use in analysis report generation.

  • TITLE : the title of the pipeline step, to be displayed in the report sidebar and section header
  • DESCRIPTION : pipeline step description, to be displayed in the first page of the section in the report
  • FIGURE : a figure to be included in the report, this should correspond to the filename of either a PDF or PNG file that is output in the results directory after pipeline step completion
  • SHEET : a flat table file to be displayed in the pipeline step section of the report
#TITLE: Count Peaks
#DESCRIPTION: Counts the number of peaks per sample
#FIGURE: peaks_count_barplots.pdf
#SHEET:

Script command line argument parsing; the run script never takes arguments.

# process command-line inputs
if ($#argv > 1) then
  grep '^##' $0 | scripts-send2err
  exit 1
endif

set opt = "$1"

The 'operation' name e.g. pipeline step name ID. This must be the same name as the directory name of the pipeline step.

set op = count_peaks

The input directories to use. If you have set multiple input directories, you can specify which to use here.

set inpdirs = "inpdirs/*"

Filter to use on the branches for inclusion in the step. In this case, only branches from the results in the input directory that match the naming pattern given will be used in the pipeline step.

set filter = "*peaks.by_sample*"

The name of the results directory to be created. Its best not to change this

set results = results

This part of the script sets up the results directory.

scripts-create-path $results/
scripts-send2err "=== Operation = $op ============="

Threads and memory requirements to pass to qsub. Alternate notations include 8-16 for a range of threads, and 8,20G to specify both thread and memory requirements.

set resources = 1

This sets the command to be run for the pipeline step. Note that this script references the code script to be set later, in this case ./code/chipseq-count_peaks.tcsh

set cmd = "./code/code.main/scripts-qsub-wrapper $resources ./code/chipseq-$op.tcsh"

This runs the pipeline step, using the pipeline branch explorer script, and applies the criteria we have set so far.

Rscript ./code/code.main/pipeline-master-explorer.r -v -F "$filter" "$cmd" $results/$op "params/params.*.tcsh" "$inpdirs" "" "*" 1

# run and wait until done!
if ("$opt" != "--dry-run") scripts-submit-jobs ./$results/.db/run

Set Parameters

The params files are used to hold program-specific parameters to be run. Multiple params files can be used to run a pipeline step repeatedly with different sets of parameters. This allows for parameter exploration during analysis. The default params are set like this:

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll params/
total 62K
drwxr-x--- 2 kellys04   38 Dec  5 20:48 .
drwxr-x--- 7 kellys04  214 Dec  6 10:13 ..
-rw-r----- 1 kellys04   50 Dec  5 20:48 params.standard.tcsh

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ cat params/params.standard.tcsh
#!/bin/tcsh

source ./inputs/params/params.tcsh

In this basic example, no program specific parameters are set, and only the pipeline-specific parameters are used. All environment variables set in the params.*.tcsh files will be made available by subsequent tcsh scripts used in the pipeline step. For examples on how to set program specific parameters, see the default MACS peak calling pipeline step:

SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ ll params/
total 113K
drwxr-xr-x 2 kellys04  113 Jul 19 10:53 .
drwxr-xr-x 6 kellys04  193 Dec  2 19:57 ..
-rwxr-x--- 1 kellys04  528 Jul 19 10:53 params.macs_broad.tcsh
-rwxr-x--- 1 kellys04  520 Jul 19 10:53 params.macs_narrow.tcsh

SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ cat params/params.macs_broad.tcsh
#!/bin/tcsh

source ./inputs/params/params.tcsh

module unload gcc
module unload python
# macs now part of python module
# module load macs/2.0.10.20131216
module load python/2.7.3

set extsize = `./code/read-sample-sheet.tcsh $sheet "$objects" fragmentation-size`
set extsize = `echo $extsize | tools-vectors m -n 0`
set macs_params = "--broad --nomodel --extsize=$extsize"
set use_input = 'true'
set annot_params = "annotate2 -i --upstream-dist 100000 --downstream-dist 100000 --proximal-dist 1000 $genome_dir/gene-name.bed"
SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ cat params/params.macs_narrow.tcsh
#!/bin/tcsh

source ./inputs/params/params.tcsh

module unload gcc
module unload python
# macs now part of python module
# module load macs/2.0.10.20131216
module load python/2.7.3

set extsize = `./code/read-sample-sheet.tcsh $sheet "$objects" fragmentation-size`
set extsize = `echo $extsize | tools-vectors m -n 0`
set macs_params = "--nomodel --extsize=$extsize"
set use_input = 'true'
set annot_params = "annotate2 -i --upstream-dist 100000 --downstream-dist 100000 --proximal-dist 1000 $genome_dir/gene-name.bed"

Pipeline Step code Script

A tcsh script should be placed in the code directory, to be run by the run script. It should have the same naming scheme as other scripts in the directory, using the pipeline step op operation ID (pipeline step directory name) as the script name:

SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll code/
total 891K
drwxr-xr-x 3 kellys04  1.4K Dec  5 20:51 .
drwxr-xr-x 6 kellys04   125 Jun  3  2016 ..
-rwxr-x--- 1 kellys04  3.3K Dec  5 21:19 chipseq-count_peaks.sh
-rwxr-x--- 1 kellys04  1.4K Dec  5 21:10 chipseq-count_peaks.tcsh

If you prefer to write your scripts in another shell (e.g. bash) or another language (Python, R), you can use the tcsh script as a wrapper for these scripts instead of performing your analysis directly within the tcsh script. In this case, the tcsh script chipseq-count_peaks.tcsh is being used as a wrapper for chipseq-count_peaks.sh.

code Script Contents

The code script receives arguments passed to it from the run script. A basic tcsh code script will look like this:

Script header

#!/bin/tcsh
source ./code/code.main/custom-tcshrc     # shell settings

##
## USAGE: chipseq-count_peaks.tcsh OUTPUT-DIR PARAM-SCRIPT BRANCH [OBJECTS]
##
  • OUTPUT-DIR : path to the output directory, generally set as the results directory followed by the complete pipeline analysis branch. Example: results/count_peaks.standard/peaks.by_sample.macs_broad/align.by_sample.bowtie2/all-samples

  • PARAM-SCRIPT : Path to the params script to be used in the pipeline step. Example: params/params.standard.tcsh

  • BRANCH : The pipeline analysis input branch being used in the given iteration of the script. Example: inpdirs/peaks/results/peaks.by_sample.macs_broad/align.by_sample.bowtie2

  • [OBJECTS] : The ID's of the samples to be processed. These correspond to the final directories in the input branch path. Example: CELLLINE_H3K4me3_1 CELLLINE_H3K4me3_2 CELLLINE_H3K27AC_1 CELLLINE_H3K27AC_2 CELLLINE_CONTROL_1 CELLLINE_CONTROL_2

The full path to an input file would look like this:

<BRANCH>/<OBJECT>/input_file.txt

The full path to an output file would look like this:

<OUTPUT-DIR>/output_file.txt

This step processes the script arguments and sets up the output directories.

# process command-line inputs and sets parameters.
if (($#argv < 3) || ($#argv > 4)) then
  grep '^##' $0 | scripts-send2err
  exit 1
endif

# inputs
set outdir = $1
set params = $2
set branch = $3
set objects = ($4)

# if samples is empty, use all objects in the branch
if ("$objects" == "") set objects = `cd $branch; ls -1d *`

# read variables from input branch
source ./code/code.main/scripts-read-job-vars $branch "$objects" "genome genome_dir"

# run parameter script
scripts-send2err "Setting parameters..."
source $params

# create path
scripts-create-path $outdir/


# -------------------------------------
# -----  MAIN CODE BELOW --------------
# -------------------------------------

This is where you can place your pipeline step code. It is also a good place to put some verbose argument checking while you debug your pipeline step. In this case, we are just echoing out the input arguments, and then passing them to our bash script.

echo "outdir is $outdir"
echo "params is $params"
echo "branch is $branch" # input directory
echo "objects is $objects" # samples to be processed


./code/chipseq-count_peaks.sh "$outdir" "$branch"

This step performs some clean-up operations.

# -------------------------------------
# -----  MAIN CODE ABOVE --------------
# -------------------------------------

# save variables
source ./code/code.main/scripts-save-job-vars

# done
scripts-send2err "Done."

Test Your Step

You can test your pipeline step by running ./run from the pipeline step directory. Be sure that you have made all code scripts executable by running chmod +x on them.

Add Step to the Pipeline

Once your pipeline step is complete, you can run it manually at any time by entering its directory and running ./run. However, simply placing a new pipeline step in the pipeline directory will not cause it to be run automatically with the rest of the pipeline. You must take two further steps to ensure that that the new step will be run automatically.

First, you must add the pipeline step to the index.txt file:

SmithLab-ChIPSeq_2016-12-02/pipeline$ cat index.txt
align
align-stats

fastqc
qc

peaks
peaktable
count_peaks

matrices

pca

diffbind

Next, you must add a symlink to the pipeline step at the parent directory level. The symlink must follow the same format shown in other symlinks, and must place the step in the correct order in relation to its required inputs.

SmithLab-ChIPSeq_2016-12-02$ ln -s pipeline/count_peaks __03c-count_peaks

SmithLab-ChIPSeq_2016-12-02$ ll
total 515K
drwxr-xr-x  7 kellys04   734 Dec  5 09:01 .
drwxrwsr-x 10 at570      304 Dec  2 16:03 ..
-rw-r--r--  1 kellys04   55K Dec  2 20:41 SmithLab-ChIPSeq_2016-12-02.e9109411
-rw-r--r--  1 kellys04    84 Dec  2 16:33 SmithLab-ChIPSeq_2016-12-02.o9109411
-rw-r-----  1 kellys04  3.3K Nov 30 11:18 README.md
lrwxrwxrwx  1 kellys04    14 Mar 12  2016 __01a-align -> pipeline/align
lrwxrwxrwx  1 kellys04    20 Mar 12  2016 __01b-align-stats -> pipeline/align-stats
lrwxrwxrwx  1 kellys04    15 Apr  7  2016 __02a-fastqc -> pipeline/fastqc
lrwxrwxrwx  1 kellys04    11 Apr  7  2016 __02b-qc -> pipeline/qc
lrwxrwxrwx  1 kellys04    14 Mar 12  2016 __03a-peaks -> pipeline/peaks
lrwxrwxrwx  1 kellys04    18 Mar 12  2016 __03b-peaktable -> pipeline/peaktable
lrwxrwxrwx  1 kellys04    17 Mar 12  2016 __04a-matrices -> pipeline/matrices
lrwxrwxrwx  1 kellys04    12 Mar 12  2016 __05a-pca -> pipeline/pca
lrwxrwxrwx  1 kellys04    17 Jun  3  2016 __06a-diffbind -> pipeline/diffbind
lrwxrwxrwx  1 kellys04    31 Dec  2 16:04 code -> code.repo/code.chipseq-standard
lrwxrwxrwx  1 kellys04    14 Mar 12  2016 code.main -> code/code.main
lrwxrwxrwx  1 kellys04    46 Dec  2 16:03 code.origin -> /ifs/home/kellys04/hic-bench/code/code.main/..
drwxr-xr-x  6 kellys04   125 Jun  3  2016 code.repo
lrwxrwxrwx  1 kellys04    81 Dec  2 16:04 data -> /ifs/home/kellys04/hic-bench/code/code.main/../../pipelines/chipseq-standard/data
drwxr-xr-x  5 kellys04   250 Dec  2 16:33 inputs
drwxr-xr-x 20 kellys04   608 Dec  5 20:48 pipeline
-rw-r--r--  1 kellys04   295 Dec  5 09:01 project_info.txt
drwxr-xr-x  2 kellys04    75 Jul 27 20:45 project_notes
drwxr-xr-x  5 kellys04   637 Dec  5 21:27 report
-rwxr-xr-x  1 kellys04   867 Apr  7  2016 run
-rwxr-xr-x  1 kellys04   632 Apr  7  2016 run.dry