-
Notifications
You must be signed in to change notification settings - Fork 20
Custom Pipeline Steps
An overview of creating a custom pipeline step can be found here:
https://github.com/NYU-BFX/hic-bench_documentation/blob/master/HiC-Bench_manual.pdf
The following basic steps should be taken to create a custom pipeline step:
- Copy an existing step as a template
- Set the input directories (’inpdirs’)
- Edit the
run
file and add needed parameter files - Set pipeline parameters
- Add a script in the
code
directory containing the commands needed to run the programs used in the pipeline step (make sure to runchmod +x
on it too). - Update the new pipeline step name and add it to the entries in the
index.txt
and as a symlink in the parent level of the analysis directory
In this example, we will create a pipeline step within a currently existing analysis project. This new pipeline step will generate a plot displaying the number of peaks per sample.
To start with, we will copy the align-stats
step, which generates a plot from alignments, and is a similar task to what we intend to do.
SmithLab-ChIPSeq_2016-12-02/pipeline$ ll
total 486K
drwxr-xr-x 20 kellys04 608 Dec 5 20:48 .
drwxr-xr-x 7 kellys04 734 Dec 5 09:01 ..
drwxr-xr-x 5 kellys04 168 Dec 2 18:53 align
drwxr-xr-x 6 kellys04 193 Dec 2 18:53 align-stats
lrwxrwxrwx 1 kellys04 7 May 20 2016 code -> ../code
lrwxrwxrwx 1 kellys04 12 Mar 12 2016 code.main -> ../code.main
drwxr-xr-x 6 kellys04 193 Dec 2 20:41 diffbind
drwxr-xr-x 5 kellys04 168 Dec 2 19:13 fastqc
SmithLab-ChIPSeq_2016-12-02/pipeline$ cp -vr align-stats count_peaks
SmithLab-ChIPSeq_2016-12-02/pipeline$ ll
total 486K
drwxr-xr-x 20 608 Dec 5 20:48 .
drwxr-xr-x 7 734 Dec 5 09:01 ..
drwxr-xr-x 5 168 Dec 2 18:53 align
drwxr-xr-x 6 193 Dec 2 18:53 align-stats
lrwxrwxrwx 1 7 May 20 2016 code -> ../code
lrwxrwxrwx 1 12 Mar 12 2016 code.main -> ../code.main
drwxr-x--- 6 193 Dec 5 21:19 count_peaks
drwxr-xr-x 6 193 Dec 2 20:41 diffbind
drwxr-xr-x 5 168 Dec 2 19:13 fastqc
We need to make a symlink in the inpdirs
directory to the directory from which to gather input files, and remove links to directories that will not be used.
SmithLab-ChIPSeq_2016-12-02/pipeline$ cd count_peaks/
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll
total 106K
drwxr-x--- 6 kellys04 193 Dec 5 21:19 .
drwxr-xr-x 20 kellys04 608 Dec 5 20:48 ..
lrwxrwxrwx 1 kellys04 15 Dec 5 20:48 clean.tcsh -> code/clean.tcsh
lrwxrwxrwx 1 kellys04 7 Dec 5 20:48 code -> ../code
drwxr-x--- 2 kellys04 27 Dec 5 20:48 errors
drwxr-x--- 2 kellys04 23 Dec 5 20:49 inpdirs
lrwxrwxrwx 1 kellys04 9 Dec 5 20:48 inputs -> ../inputs
drwxr-x--- 2 kellys04 38 Dec 5 20:48 params
drwxrwx--- 4 kellys04 59 Dec 5 21:19 results
-rwxr-x--- 1 kellys04 1.2K Dec 5 21:20 run
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll inpdirs/
total 38K
drwxr-x--- 2 kellys04 23 Dec 5 20:49 .
drwxr-x--- 6 kellys04 193 Dec 5 21:19 ..
lrwxrwxrwx 1 kellys04 11 Dec 5 20:49 align -> ../../align
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ cd inpdirs/
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ ln -s ../../peaks
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ rm -f align
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks/inpdirs$ ll
total 38K
drwxr-x--- 2 kellys04 23 Dec 5 20:49 .
drwxr-x--- 6 kellys04 193 Dec 5 21:19 ..
lrwxrwxrwx 1 kellys04 11 Dec 5 20:49 peaks -> ../../peaks
The run
script, located in the parent pipeline step directory, acts like a wrapper to bring together all the components needed to run the individual pipeline step. Next we will show what the edited run
script looks like for this step, with additional markup for explanation.
The run
script location:
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll
total 106K
...
-rwxr-x--- 1 kellys04 1.2K Dec 5 21:20 run
The script header:
#!/bin/tcsh
source ./code/code.main/custom-tcshrc # customize shell environment
##
## USAGE: run [--dry-run]
##
Special descriptive tags for use in analysis report generation.
-
TITLE
: the title of the pipeline step, to be displayed in the report sidebar and section header -
DESCRIPTION
: pipeline step description, to be displayed in the first page of the section in the report -
FIGURE
: a figure to be included in the report, this should correspond to the filename of either a PDF or PNG file that is output in theresults
directory after pipeline step completion -
SHEET
: a flat table file to be displayed in the pipeline step section of the report
#TITLE: Count Peaks
#DESCRIPTION: Counts the number of peaks per sample
#FIGURE: peaks_count_barplots.pdf
#SHEET:
Script command line argument parsing; the run
script never takes arguments.
# process command-line inputs
if ($#argv > 1) then
grep '^##' $0 | scripts-send2err
exit 1
endif
set opt = "$1"
The 'operation' name e.g. pipeline step name ID. This must be the same name as the directory name of the pipeline step.
set op = count_peaks
The input directories to use. If you have set multiple input directories, you can specify which to use here.
set inpdirs = "inpdirs/*"
Filter to use on the branches for inclusion in the step. In this case, only branches from the results in the input directory that match the naming pattern given will be used in the pipeline step.
set filter = "*peaks.by_sample*"
The name of the results directory to be created. Its best not to change this
set results = results
This part of the script sets up the results directory.
scripts-create-path $results/
scripts-send2err "=== Operation = $op ============="
Threads and memory requirements to pass to qsub
. Alternate notations include 8-16
for a range of threads, and 8,20G
to specify both thread and memory requirements.
set resources = 1
This sets the command to be run for the pipeline step. Note that this script references the code
script to be set later, in this case ./code/chipseq-count_peaks.tcsh
set cmd = "./code/code.main/scripts-qsub-wrapper $resources ./code/chipseq-$op.tcsh"
This runs the pipeline step, using the pipeline branch explorer script, and applies the criteria we have set so far.
Rscript ./code/code.main/pipeline-master-explorer.r -v -F "$filter" "$cmd" $results/$op "params/params.*.tcsh" "$inpdirs" "" "*" 1
# run and wait until done!
if ("$opt" != "--dry-run") scripts-submit-jobs ./$results/.db/run
The params
files are used to hold program-specific parameters to be run. Multiple params
files can be used to run a pipeline step repeatedly with different sets of parameters. This allows for parameter exploration during analysis. The default params
are set like this:
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll params/
total 62K
drwxr-x--- 2 kellys04 38 Dec 5 20:48 .
drwxr-x--- 7 kellys04 214 Dec 6 10:13 ..
-rw-r----- 1 kellys04 50 Dec 5 20:48 params.standard.tcsh
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ cat params/params.standard.tcsh
#!/bin/tcsh
source ./inputs/params/params.tcsh
In this basic example, no program specific parameters are set, and only the pipeline-specific parameters are used. All environment variables set in the params.*.tcsh
files will be made available by subsequent tcsh scripts used in the pipeline step. For examples on how to set program specific parameters, see the default MACS peak calling pipeline step:
SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ ll params/
total 113K
drwxr-xr-x 2 kellys04 113 Jul 19 10:53 .
drwxr-xr-x 6 kellys04 193 Dec 2 19:57 ..
-rwxr-x--- 1 kellys04 528 Jul 19 10:53 params.macs_broad.tcsh
-rwxr-x--- 1 kellys04 520 Jul 19 10:53 params.macs_narrow.tcsh
SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ cat params/params.macs_broad.tcsh
#!/bin/tcsh
source ./inputs/params/params.tcsh
module unload gcc
module unload python
# macs now part of python module
# module load macs/2.0.10.20131216
module load python/2.7.3
set extsize = `./code/read-sample-sheet.tcsh $sheet "$objects" fragmentation-size`
set extsize = `echo $extsize | tools-vectors m -n 0`
set macs_params = "--broad --nomodel --extsize=$extsize"
set use_input = 'true'
set annot_params = "annotate2 -i --upstream-dist 100000 --downstream-dist 100000 --proximal-dist 1000 $genome_dir/gene-name.bed"
SmithLab-ChIPSeq_2016-12-02/pipeline/peaks$ cat params/params.macs_narrow.tcsh
#!/bin/tcsh
source ./inputs/params/params.tcsh
module unload gcc
module unload python
# macs now part of python module
# module load macs/2.0.10.20131216
module load python/2.7.3
set extsize = `./code/read-sample-sheet.tcsh $sheet "$objects" fragmentation-size`
set extsize = `echo $extsize | tools-vectors m -n 0`
set macs_params = "--nomodel --extsize=$extsize"
set use_input = 'true'
set annot_params = "annotate2 -i --upstream-dist 100000 --downstream-dist 100000 --proximal-dist 1000 $genome_dir/gene-name.bed"
A tcsh script should be placed in the code
directory, to be run by the run
script. It should have the same naming scheme as other scripts in the directory, using the pipeline step op
operation ID (pipeline step directory name) as the script name:
SmithLab-ChIPSeq_2016-12-02/pipeline/count_peaks$ ll code/
total 891K
drwxr-xr-x 3 kellys04 1.4K Dec 5 20:51 .
drwxr-xr-x 6 kellys04 125 Jun 3 2016 ..
-rwxr-x--- 1 kellys04 3.3K Dec 5 21:19 chipseq-count_peaks.sh
-rwxr-x--- 1 kellys04 1.4K Dec 5 21:10 chipseq-count_peaks.tcsh
If you prefer to write your scripts in another shell (e.g. bash) or another language (Python, R), you can use the tcsh script as a wrapper for these scripts instead of performing your analysis directly within the tcsh script. In this case, the tcsh script chipseq-count_peaks.tcsh
is being used as a wrapper for chipseq-count_peaks.sh
.
The code
script receives arguments passed to it from the run
script. A basic tcsh code
script will look like this:
Script header
#!/bin/tcsh
source ./code/code.main/custom-tcshrc # shell settings
##
## USAGE: chipseq-count_peaks.tcsh OUTPUT-DIR PARAM-SCRIPT BRANCH [OBJECTS]
##
-
OUTPUT-DIR
: path to the output directory, generally set as theresults
directory followed by the complete pipeline analysis branch. Example:results/count_peaks.standard/peaks.by_sample.macs_broad/align.by_sample.bowtie2/all-samples
-
PARAM-SCRIPT
: Path to the params script to be used in the pipeline step. Example:params/params.standard.tcsh
-
BRANCH
: The pipeline analysis input branch being used in the given iteration of the script. Example:inpdirs/peaks/results/peaks.by_sample.macs_broad/align.by_sample.bowtie2
-
[OBJECTS] : The ID's of the samples to be processed. These correspond to the final directories in the input branch path. Example:
CELLLINE_H3K4me3_1 CELLLINE_H3K4me3_2 CELLLINE_H3K27AC_1 CELLLINE_H3K27AC_2 CELLLINE_CONTROL_1 CELLLINE_CONTROL_2
The full path to an input file would look like this:
<BRANCH>/<OBJECT>/input_file.txt
The full path to an output file would look like this:
<OUTPUT-DIR>/output_file.txt
This step processes the script arguments and sets up the output directories.
# process command-line inputs and sets parameters.
if (($#argv < 3) || ($#argv > 4)) then
grep '^##' $0 | scripts-send2err
exit 1
endif
# inputs
set outdir = $1
set params = $2
set branch = $3
set objects = ($4)
# if samples is empty, use all objects in the branch
if ("$objects" == "") set objects = `cd $branch; ls -1d *`
# read variables from input branch
source ./code/code.main/scripts-read-job-vars $branch "$objects" "genome genome_dir"
# run parameter script
scripts-send2err "Setting parameters..."
source $params
# create path
scripts-create-path $outdir/
# -------------------------------------
# ----- MAIN CODE BELOW --------------
# -------------------------------------
This is where you can place your pipeline step code. It is also a good place to put some verbose argument checking while you debug your pipeline step. In this case, we are just echo
ing out the input arguments, and then passing them to our bash script.
echo "outdir is $outdir"
echo "params is $params"
echo "branch is $branch" # input directory
echo "objects is $objects" # samples to be processed
./code/chipseq-count_peaks.sh "$outdir" "$branch"
This step performs some clean-up operations.
# -------------------------------------
# ----- MAIN CODE ABOVE --------------
# -------------------------------------
# save variables
source ./code/code.main/scripts-save-job-vars
# done
scripts-send2err "Done."
You can test your pipeline step by running ./run
from the pipeline step directory. Be sure that you have made all code
scripts executable by running chmod +x
on them.
Once your pipeline step is complete, you can run it manually at any time by entering its directory and running ./run
. However, simply placing a new pipeline step in the pipeline
directory will not cause it to be run automatically with the rest of the pipeline. You must take two further steps to ensure that that the new step will be run automatically.
First, you must add the pipeline step to the index.txt
file:
SmithLab-ChIPSeq_2016-12-02/pipeline$ cat index.txt
align
align-stats
fastqc
qc
peaks
peaktable
count_peaks
matrices
pca
diffbind
Next, you must add a symlink to the pipeline step at the parent directory level. The symlink must follow the same format shown in other symlinks, and must place the step in the correct order in relation to its required inputs.
SmithLab-ChIPSeq_2016-12-02$ ln -s pipeline/count_peaks __03c-count_peaks
SmithLab-ChIPSeq_2016-12-02$ ll
total 515K
drwxr-xr-x 7 kellys04 734 Dec 5 09:01 .
drwxrwsr-x 10 at570 304 Dec 2 16:03 ..
-rw-r--r-- 1 kellys04 55K Dec 2 20:41 SmithLab-ChIPSeq_2016-12-02.e9109411
-rw-r--r-- 1 kellys04 84 Dec 2 16:33 SmithLab-ChIPSeq_2016-12-02.o9109411
-rw-r----- 1 kellys04 3.3K Nov 30 11:18 README.md
lrwxrwxrwx 1 kellys04 14 Mar 12 2016 __01a-align -> pipeline/align
lrwxrwxrwx 1 kellys04 20 Mar 12 2016 __01b-align-stats -> pipeline/align-stats
lrwxrwxrwx 1 kellys04 15 Apr 7 2016 __02a-fastqc -> pipeline/fastqc
lrwxrwxrwx 1 kellys04 11 Apr 7 2016 __02b-qc -> pipeline/qc
lrwxrwxrwx 1 kellys04 14 Mar 12 2016 __03a-peaks -> pipeline/peaks
lrwxrwxrwx 1 kellys04 18 Mar 12 2016 __03b-peaktable -> pipeline/peaktable
lrwxrwxrwx 1 kellys04 17 Mar 12 2016 __04a-matrices -> pipeline/matrices
lrwxrwxrwx 1 kellys04 12 Mar 12 2016 __05a-pca -> pipeline/pca
lrwxrwxrwx 1 kellys04 17 Jun 3 2016 __06a-diffbind -> pipeline/diffbind
lrwxrwxrwx 1 kellys04 31 Dec 2 16:04 code -> code.repo/code.chipseq-standard
lrwxrwxrwx 1 kellys04 14 Mar 12 2016 code.main -> code/code.main
lrwxrwxrwx 1 kellys04 46 Dec 2 16:03 code.origin -> /ifs/home/kellys04/hic-bench/code/code.main/..
drwxr-xr-x 6 kellys04 125 Jun 3 2016 code.repo
lrwxrwxrwx 1 kellys04 81 Dec 2 16:04 data -> /ifs/home/kellys04/hic-bench/code/code.main/../../pipelines/chipseq-standard/data
drwxr-xr-x 5 kellys04 250 Dec 2 16:33 inputs
drwxr-xr-x 20 kellys04 608 Dec 5 20:48 pipeline
-rw-r--r-- 1 kellys04 295 Dec 5 09:01 project_info.txt
drwxr-xr-x 2 kellys04 75 Jul 27 20:45 project_notes
drwxr-xr-x 5 kellys04 637 Dec 5 21:27 report
-rwxr-xr-x 1 kellys04 867 Apr 7 2016 run
-rwxr-xr-x 1 kellys04 632 Apr 7 2016 run.dry