- Accepts arguments in standard java.util.Properties format, i.e., one per line:
<argumentName>=<argumentValue>
- List values (such as overhangs) are delimited by commas
Required values:
- minQuality: the minimum quality score that will prevent fuzzy matching bases that were called >= this quality
- barcodeFile: path to the file containing barcode information (tab separated,
barcode<tab>sample_id
) - sourceFileForward: the path to the file containing the forward reads (fq.gz), or a comma-separated ordered list of paths that should be treated as concatenated input
- sourceFileReverse: the path to the file containing the reverse reads (fq.gz), or a comma-separated ordered list of paths that should be treated as concatenated input
- overhang: a comma-separated list of overhangs for this data (can accept multiple overhangs for double digests)
Optional values:
- population: required for demultiplexing, the population to use in the output file names
- align: true|false, whether the file should be output with .F|R.fq.gz, or R1|R2.fq.gz (default false, which outputs R1|R2)
- append: if the output files already exist, should we append to it (default is false, we overwrite instead)
- fuzzyMatch: should the program attempt to fuzzy match barcodes (default true)
- debugOut: should the program generate a debug output file with all the reads that failed to be parsed
An example can be found in default.config
Note: these options can be specified on the command line instead, ex CopyBarcodes minQuality= barcodeFile= etc. You cannot mix and match (specifying some in a config file and others on the command line), however.
The CopyBarcodes class performs pre-alignment processing of FASTQ files containing paired-end GBS reads (e.g., those generated on Illumina machines). The two goals are to identify, and optionally correct, sequencing errors in the barcode/overhang region of the forward reads, and subsequently add correct(ed) barcodes to the beginning of the respective reverse reads. This allows passing non-demultiplexed files directly into a program like Tassel for alignment and variant calling.
In all cases, an initial verification step is performed, using a prefix tree to check that the beginning of each read in the forward-read FASTQ file contains one of the barcodes in the supplied barcode file, and that this barcode is immediately followed by one of the two possible overhang sequences that would be genereated by the utilized restriction enzyme (specified in the .config file; by default, the overhangs associated with ApeK1 are used). Once verified, the x and y coordinates of the corresponding reverse read is checked against the forward read, and if they match, the barcode is copied to the beginning of ther reverse read.
If only forward- and reverse- read FASTQ files, plus barcode and configuration files, are passed, mismatches are deleted from the output FASTQ files, which is written to disc as an interleaved FASTQ.gz file.
If the "fuzzy" argument is enabled (as it is by default), fuzzy-matching is performed in cases of mismatches, and if a unique barcode can be recovered by changing no more than one low-quality, it is written to the output file as well (the specific threshold is defined in the .config file). This option can also be run in "debug mode" by calling: all skipped reads will then be writted to debugOut.txt, and some summary statistics will also be output describing what was skipped.
Example usage:
export config=/PATH/TO/CONFIG//FILE
java -cp gbsTools.jar CopyBarcodes $config
Benchmarking was performed on a MacBook Pro with a 3.3 GHz Intel Core i7 CPU and 16 GB of 2133 MHz LPDDR3 RAM. The forward-read file was 28.64 GB, and the reverse-read file was 29.74 GB (each containing roughly 400,000,000 reads). Because writing to the output file is speed limiting, calling the .jar file with or without fuzzy matching took approximately 450 minutes.
This class performs similar quality control steps as described above (checking the x-y coordinates of reverse reads against forward reads, and optionally fuzzy matching barcodes). However, instead of a single interleaved FASTQ being output, pairs of forward- and reverse-read FASTQ files are written for each unique barcode, with the barcodes removed from all reads, and the enzyme cut-site left intact. The barcode file itself should include a tab-separated sample ID, which will act as the sample name following the naming conventions described here: http://www.ddocent.com/UserGuide/#raw-sequences. The flag -alignFile
controls whether use the "F/R" versus "R1/R2" nomenclature.
In addition a population name (a string) should therefore be passsed as the first argument following the class path:
java -cp gbsTools.jar Demultiplexer $config
Benchmarking was performed on a CentOS Linux distribution, running on a server with 6 Intel Xeon 2.67 GHz CPUs and 40 GB of RAM. The same FASTQ files described above were de-multiplexed into their 192 component FASTQ files (each ~200 MB when gzipped) in 160 minutes. For comparison, Stacks' process_radtags
completed in just over 11 hours, while GBSX's --Demultiplexer
option took 5 hours to finish, with -t 6
(15 hours with -t 1
).
This class downsamples FQ files. It can either be passed a path to sourceFileForward
and sourceFileReverse
, in which case it will downsample both equally, or it can take sourceFileInterleaved
instead, to use the output of CopyBarcodes
.
percentToRetain
determines how much to downsample by
retainByTruncating
turns on or off random elimination.
Progress tracking is attempted, but may not be very accurate.
TruncateReads will trim read and quality score length. Either pass it a config file or all arguments on the command line.
Usage: file= directory=<path to directory, where all files should be truncated> max_read_length= barcode_length=
All argument must be specific, except for file and directory - exactly one of those must be specified
This assumes that each read is in the format
The output is stored in .truncated.gz