For a version supporting bcl2fastq V1 from Illumina, see the "bcl2fastqV1" branch.
This is our bcl to fastq pipeline. Features include:
- Runs as a background process, rather than a cron job that needs to check if another instance is already running
- Handles undetermined indices in a more coherent manner
- Adds quality metrics to the report email
- Performs a contamination screen
- Automatically delivers data to most groups and posts external data to the F*EX server
- Internal groups additionally have their data linked into Galaxy, if possible
- Compiles an automated project-report (pdf), to go along with each submission. This uses ReportLab.
- Functions divided into more meaningful subfiles in a module, rather than being scattered across levels of shell scripts
- Written explicitly for python3, just to future proof things a bit.
The general workflow of this pipeline is as follows:
- Read in `bcl2fastq.ini` from `~/`. Note that this file can be changed while the program is running.
- Look for new flow cells.
  - Iterate through directories listed under `[Paths]`->`baseDir`, looking for the special `RTAComplete.txt` file.
    - This program is currently hardcoded to look only in flow cells generated by machines SN7001180 and NB501361.
  - Check for the number of sample sheets (`SampleSheet*.csv`), since a separate folder will be output per sample sheet.
  - For each sample sheet (or directory, if there are no sample sheets), check to see if it has already been processed.
    - A flow cell is marked as being processed if:
      - There is an equivalent directory under `[Paths]`->`outputDir`, with a possible `_lanes_1_2` suffix (or something equivalent, for different lanes).
      - That directory contains a file named `casava.finished` (our old pipeline) or `fastq.made` (the current program).
    - This sets the `[Options]`->`runID` field in the configuration file.
- If there are no new flow cells to process, the program sleeps (the duration is set in `[Options]`->`sleepTime`) and starts again at step 1.
- If there are new flow cells to process, ensure that there is sufficient space in `[Paths]`->`outputDir`. The required amount is set in `[Options]`->`minSpace`.
  - Note that having insufficient space will lead to an email being sent to the addresses set in `[Email]`->`errorTo`. The program will then sleep (see step 3) and loop (i.e., go back to step 1).
- Assuming there is at least one new flow cell and there's sufficient space, the program will generate fastq files.
  - The sample sheet is first rewritten to strip out illegal characters (e.g., anything with an umlaut). The rewritten sample sheet is placed in `/tmp` and not removed after running.
  - The barcode masking strategy is inferred from `RunInfo.xml`, unless it's already specified in the config file.
  - The program specified via `[bcl2fastq]`->`bcl2fastq` is run with the options specified in `[bcl2fastq]`->`bcl2fastq_options`. In addition to these options, the following are hard-coded:
    - `-o outputDir/runID`: The output directory is set to `[Paths]`->`outputDir/runID`. This directory is created if it doesn't already exist.
    - `-R runDir/runID`: This is the directory that's being processed (`[Paths]`->`runDir`/`runID`).
      - This directory may be read only!
    - `--interop-dir seqFacDir/runID/InterOp`: This prevents `bcl2fastq` from attempting to write to the running directory, which could be dangerous. See `[Paths]`->`seqFacDir` for where this is.
      - The sequencing facility has requested this directory. Note that the path will be created if it doesn't already exist.
  - The file named `bcl.done` in the output directory is touched. If the pipeline experiences an error and restarts, it will skip the already completed demultiplexing step.
- Files and directories are renamed for consistency with previous data produced at the institute. `files.renamed` is then touched (if it already exists then this step will be skipped).
- A number of "post make" steps are run. This terminology is a hold-over from the previous bcl2fastq pipeline, which used `make` to generate the fastq files.
  - If a flow cell was run on the HiSeq 3000, optical duplicates are removed and placed in a separate file with clumpify.sh from bbmap.
    - This has multiple workers, each of which is multithreaded. This is due to the program not nicely respecting thread settings and occasionally requesting gobs of memory.
    - See `[Options]`->`deduplicateInstances` for the number of simultaneous instances.
    - See `[bbmap]` for other related options.
    - This step creates a ".duplicate.txt" file for each sample. If the pipeline later experiences an error and sees such a file then this step will be skipped for the given sample (the step is resource intensive).
  - FastQC is run on each output fastq file.
    - This is run in a multithreaded manner, see `[Options]`->`postMakeThreads` for the number of workers.
    - See options under `[FastQC]` for executable paths and options.
    - The output is placed in `[Paths]`->`outputDir`/`runID`/FASTQC_project_name.
  - An md5sum is made of the fastq files in each project (see the file named "md5sums.txt").
    - As with FastQC, this is multithreaded, with the number of worker threads set via `[Options]`->`postMakeThreads`.
  - A contamination screen is run with fastq_screen after downsampling read #1 of each sample.
  - MultiQC is run on the output of FastQC.
  - Additional steps can be added to `afterFastq.py`, though note that the package will need to be reinstalled and the process restarted.
- Xml files and FastQC outputs are copied to a location readable by the sequencing facility.
  - This location is set via `[Paths]`->`seqFacDir`, and things are placed under a `runID` subdirectory, as was the case with `InterOp` above.
  - Currently, the xml files are `RunInfo.xml` and `runParameters.xml`.
- A summary PDF file is created for each of the projects. All of the metrics from this are gathered from `Stats/ConversionStats.xml`.
  - Everything about these PDFs is hard-coded. In an ideal world, this would have some sort of plugin interface.
- FastQC and fastq files are copied to the group directories, under `sequencing_data/`.
  - If the directories already exist then an error is produced. This is to ensure that nothing is inadvertently overwritten!
  - Projects with a matching local group (a subdirectory in `[Paths]`->`outputDir`) are transferred and linked into Galaxy.
  - Projects with no matching group are uploaded to the F*EX server and an email with the link is sent to the address in `[Uni]`->`default`.
  - For projects with a matching local group, output directories and files for projects starting with "A" have their permissions changed to ensure that groups do not have write access.
- Parkour is updated with a variety of flowcell metrics.
- A summary email is produced (largely by parsing `Stats/DemultiplexingStats.xml`) and sent to the email addresses specified via `[Email]`->`finishedTo`.
  - Note the other options under `[Email]`, which specify the host name of the outgoing email server and the outgoing email address.
- A file named `fastq.made` is produced in `[Paths]`->`outputDir`/`runID`/.
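
The two checks made before any demultiplexing starts (has this flow cell already been processed, and is there enough free space) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names `is_processed` and `has_enough_space` are hypothetical, the lane suffix is passed in explicitly, and a gigabyte is assumed to mean 10^9 bytes.

```python
import shutil
from pathlib import Path

# Marker files named in the workflow above.
FINISHED_MARKERS = ("casava.finished", "fastq.made")


def is_processed(output_dir, run_id, lane_suffix=""):
    """Return True if outputDir/runID[suffix] contains a finished marker.

    Hypothetical helper; lane_suffix would be something like "_lanes_1_2".
    """
    run_dir = Path(output_dir) / (run_id + lane_suffix)
    return any((run_dir / marker).is_file() for marker in FINISHED_MARKERS)


def has_enough_space(output_dir, min_space_gb):
    """Return True if output_dir has at least min_space_gb gigabytes free.

    Assumes 1 gigabyte = 10**9 bytes, which may differ from the pipeline.
    """
    free_gb = shutil.disk_usage(output_dir).free / 1e9
    return free_gb >= min_space_gb
```

A polling loop would then skip any flow cell for which `is_processed(...)` is true, and send the error email and sleep whenever `has_enough_space(...)` is false.
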
The following files have special meanings if found in an output directory:
- `bcl.done`: Demultiplexing has finished (touch this if you run it manually).
- `files.renamed`: The fastq files and directories have been renamed to have things like `Project_` and `Sample_` prepended and "_001" stripped.
- `*.duplicate.txt`: Produced by clumpify. If it exists then clumpify won't be run.
- `fastq.made`: The flow cell is finished.
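
The way these marker files let a restarted pipeline skip completed work can be sketched as below. The helper `run_once` is hypothetical and only illustrates the idea; the real package may organise this differently.

```python
from pathlib import Path


def run_once(output_dir, marker, step):
    """Run step() unless the marker file already exists in output_dir.

    The marker is touched only after step() returns, so a step that raises
    an exception is retried on the next pass. Returns True if the step ran.
    """
    marker_path = Path(output_dir) / marker
    if marker_path.exists():
        return False
    step()
    marker_path.touch()
    return True
```

For example, `run_once(out_dir, "bcl.done", demultiplex)` would call a (hypothetical) `demultiplex` function only if `bcl.done` is absent, which is also why touching `bcl.done` by hand after a manual run causes the step to be skipped.
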
To wake a sleeping `bfq.py`, one can simply `kill -HUP pid`, where `pid` is its process ID. This will wake the process immediately.
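
One way a Python process can support this kind of wake-up is to wait on an event that a SIGHUP handler sets. The sketch below shows that general technique and is not necessarily how `bfq.py` implements it; the names `wake`, `handle_hup` and `sleep_until_woken` are made up for the example.

```python
import signal
import threading

# Set by the SIGHUP handler to cut a sleep short.
wake = threading.Event()


def handle_hup(signum, frame):
    """Signal handler: request an immediate wake-up."""
    wake.set()


def sleep_until_woken(seconds):
    """Sleep for up to `seconds`; return True if woken early by SIGHUP."""
    woken = wake.wait(timeout=seconds)
    wake.clear()  # re-arm for the next sleep
    return woken


# Handlers must be installed from the main thread; SIGHUP is Unix-only.
signal.signal(signal.SIGHUP, handle_hup)
```

An event is used rather than a bare `time.sleep()` because, since Python 3.5, `time.sleep()` resumes sleeping after a signal handler returns normally, so the signal alone would not shorten the sleep.
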
The configuration file is a human readable text file named `bcl2fastq.ini` and must be placed in the home directory (`~/`) of the user running this package. Currently, the file has the following sections:
- `[Paths]` - Holds all path information
  - `baseDir` - The base directory where the HiSeq writes its output
  - `outputDir` - The base directory where the demultiplexed fastq files and fastQC/md5sum output should be written.
  - `seqFacDir` - The base directory readable by the sequencing facility, for the files they're interested in.
  - `groupDir` - The base directory holding all groups' datasets (currently, this should be `/data` for us).
  - `logDir` - The demultiplexing log for each run is written here.
- `[FastQC]`
  - `fastqc_command` - Either just `fastqc` or possibly the full path, as appropriate.
  - `fastqc_options` - Options for fastqc
- `[MultiQC]`
  - `multiqc_command` - Path to the multiqc command
  - `multiqc_options` - Options for multiqc
- `[fastq_screen]`
  - `fastq_screen_command` - The command to run `fastq_screen`
  - `fastq_screen_options` - Options to be passed to `fastq_screen`
  - `seqtk_command` - The path to SeqTK, which is used for downsampling
  - `seqtk_options` - Options given to SeqTK, typically the seed (e.g., `-s 123456`)
  - `seqtk_size` - The target number to downsample to (e.g., `1000000`)
- `[bcl2fastq]`
  - `bcl2fastq` - Either just `bcl2fastq` or possibly the full path, as appropriate.
  - `bcl2fastq_options` - The options for `bcl2fastq`. Something like `--use-bases-mask Y*,I6n,Y* -l WARNING --barcode-mismatches 0 --no-lane-splitting` is recommended.
- `[Options]` - These are more generic options that don't fit elsewhere.
  - `index_mask` - The index mask (`--use-bases-mask`) given to `bcl2fastq`. This often needs to be changed every few runs, since most of the time it's `I6n`, but not always.
  - `postMakeThreads` - After the fastq files are made, things like fastqc are run on each of them. This value sets the total number of worker threads that are used to do that.
  - `minSpace` - The minimum free space (in gigabytes) that must be available in the `outputDir`. Having less free space than this results in an error email message.
  - `sleepTime` - The amount of time the program sleeps before restarting (in hours). Importantly, if something is broken and error emails begin to be sent then this also specifies how frequently they'll be produced.
  - `runID` - This should be left blank.
  - `sampleSheet` - This should be left blank.
- `[parkour]`
  - `URL` - URL for the Parkour API. Currently, this should end with "/api/run_statistics/upload"
  - `user` - Username/email address for logging into Parkour
  - `password` - Password for Parkour
- `[Email]`
  - `errorTo` - A comma-separated list of email addresses to which error reports should be sent.
  - `finishedTo` - A comma-separated list of email addresses to which reports of finishing a flow cell should be sent.
  - `fromAddress` - The email address from which emails are sent.
  - `host` - The outgoing email server.
- `[Uni]`
  - `default` - The email address that F*EX should send an email to when a "C" project is uploaded (except for the Scheule group).
  - `Scheule` - The email address that F*EX should send an email to when a "C" project from the Scheule group is uploaded.
- `[Version]` - Everything under here is included in the PDF files generated for each project.
  - `pipeline` - A version number for this package
  - `bcl2fastq` - The version number of bcl2fastq from Illumina
  - `fastQC` - The version number of fastQC.
- `[Galaxy]`
  - `API key` - The API key to use when contacting the Galaxy server. DO NOT SHARE THIS!
  - `URL` - The Galaxy server's URL (e.g., https://usegalaxy.org, though that'd obviously not work)
  - `verify` - Whether SSL certificates should be verified
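
As a rough illustration of how such a file can be read, here is a sketch using Python's standard `configparser` module (listed among the dependencies). The helper names `load_config`, `sleep_seconds` and `email_list` are invented for this example and the package's own code may look different; only the section and option names come from the list above.

```python
import configparser
import os


def load_config(path=os.path.expanduser("~/bcl2fastq.ini")):
    """Parse the ini file; interpolation is disabled so '%' stays literal."""
    config = configparser.ConfigParser(interpolation=None)
    # read_file() raises if the file is missing, unlike read(), which
    # silently skips files it cannot open.
    with open(path) as handle:
        config.read_file(handle)
    return config


def sleep_seconds(config):
    """[Options]->sleepTime is given in hours; convert it to seconds."""
    return config.getfloat("Options", "sleepTime") * 3600


def email_list(config, option):
    """Split a comma-separated [Email] entry into a list of addresses."""
    raw = config.get("Email", option)
    return [addr.strip() for addr in raw.split(",") if addr.strip()]
```

Because the file may change while the program is running, a long-lived process would presumably call something like `load_config()` again at the start of each pass rather than caching the result.
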
A few general notes are in order:
- Blank lines may be added pretty much anywhere.
- Comments need to be on separate lines and can be preceded by either `#` or `;`.
- The order of things doesn't matter.
- Quotes should not be used! `fastqc=/usr/bin/fastqc` is not the same as `fastqc="/usr/bin/fastqc"`!
- All mentioned settings must be present! There's currently no method to support skipping steps if a line is blank or absent!
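
The notes on quotes and on required settings can be demonstrated with `configparser` directly. In the sketch below, the sample text, the `REQUIRED` table (a small subset chosen for illustration) and the `missing_settings` helper are all assumptions made for the example; the package itself does not necessarily perform such a check.

```python
import configparser

SAMPLE = """\
# Comments go on their own lines
; and may also start with a semicolon

[FastQC]
fastqc_command = "/usr/bin/fastqc"
fastqc_options =
"""

# Hypothetical subset of the required settings, for illustration only.
REQUIRED = {
    "FastQC": ["fastqc_command", "fastqc_options"],
    "MultiQC": ["multiqc_command"],
}


def missing_settings(config, required=REQUIRED):
    """List "section->option" names that are absent or blank."""
    missing = []
    for section, options in required.items():
        for option in options:
            value = config.get(section, option, fallback="")
            if not value.strip():
                missing.append(f"{section}->{option}")
    return missing


config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE)
# The quotes are kept as part of the value, so this is not a usable path.
print(repr(config["FastQC"]["fastqc_command"]))
print(missing_settings(config))
```

The first `print` shows the value with its surrounding double quotes intact, which is why quoted paths fail, and the second lists the blank `fastqc_options` and the absent `multiqc_command`.
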
This package has the following dependencies:
- Python3 (python2 will explicitly not work, since some package and function names differ).
- The configparser module
- The reportlab module
- bioblend
- numpy and matplotlib
- bcl2fastq version 2+
- fastq_screen
- seqtk
- FastQC must be present
- MultiQC must be present
- md5sum must be present
- The Pillow python module must be relatively up to date and functional (can't install in Ubuntu and have it work in CentOS).
- There must be an available sendmail server somewhere. This package currently does not support authentication, but that could presumably be added.
- pigz
- splitFastq, which comes in this repository but must be compiled manually
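
A quick way to see which of these dependencies are missing on a given machine is sketched below. The executable names are assumptions (the configuration file allows full paths instead, in which case those should be checked), and `PIL` is the import name of the Pillow module.

```python
import importlib.util
import shutil

# Import names of the Python modules listed above ("PIL" is Pillow).
PYTHON_MODULES = ["configparser", "reportlab", "bioblend", "numpy",
                  "matplotlib", "PIL"]
# Assumed command names; adjust if full paths are set in bcl2fastq.ini.
EXECUTABLES = ["bcl2fastq", "fastq_screen", "seqtk", "fastqc", "multiqc",
               "md5sum", "pigz", "splitFastq"]


def missing_modules(names):
    """Return the top-level modules that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_executables(names):
    """Return the commands that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing modules:", missing_modules(PYTHON_MODULES))
    print("Missing executables:", missing_executables(EXECUTABLES))
```

This only checks that things can be found; it does not verify versions (e.g., bcl2fastq 2+) or that Pillow actually works on the current system.
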