- Unix-like system
- C++ compiler with C++17 support
- CMake (>= 3.17)
- Processor supported by streamvbyte
- Ubuntu 20.04, GCC 10, CMake 3.19, and Intel Cascade Lake
- CentOS 7, GCC 10, CMake 3.17, and Intel Skylake
- macOS Big Sur, Clang 12, CMake 3.18, and Intel Ice Lake
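The CMake version requirement can be checked before building. The following is a minimal sketch, not part of this repository: the helper name version_ge is hypothetical, and it assumes a sort that supports -V (version sort), such as the one in GNU coreutils.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed CMake version (if any) against the 3.17 minimum.
if command -v cmake >/dev/null 2>&1; then
  installed=$(cmake --version | head -n 1 | awk '{print $3}')
  if version_ge "$installed" 3.17; then
    echo "CMake $installed is new enough"
  else
    echo "CMake $installed is older than 3.17" >&2
  fi
fi
```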
The following script installs Abseil, Boost, GoogleTest, Google Benchmark, mimalloc, streamvbyte, and spdlog in the extern directory.
cd extern
PREFIX=$(pwd) ./install.sh
The following commands build the programs in the build directory.
mkdir build
cd build
cmake .. -DCMAKE_PREFIX_PATH=$(pwd)/../extern -DCMAKE_BUILD_TYPE=Release
make -j
The following executables are built in the build directory.
$ ./kmerset-build --help
kmerset-build: Reads a FASTA file and constructs a set of k-mers. Usage: ./kmerset-build [options] <path to file>
Flags:
--canonical (set this flag when handling canonical k-mers); default: true;
--check (does compression & decompression to see if it is working
correctly); default: false;
--compressor (a program to compress output files; e.g., "bzip2" for bzip2,
"gzip" for gzip, and "" for no compression); default: "";
--cutoff (ignore k-mers that appear less often than this value); default: 1;
--debug (enable debugging messages); default: false;
--decompressor (a program to decompress input files; e.g., "bzip2 -d" for
bzip2, "gzip -d" for gzip, and "" for no decompression); default: "";
--k (the length of k-mers); default: 15;
--out (output file name); default: "";
--workers (number of threads to use); default: 1;
Try --helpfull to get a list of all flags.
The following command reads foo.fasta.gz, counts canonical k-mers, removes those that appear fewer than 4 times, and
saves the resulting k-mer set to foo.kmerset.bz2. k is set to 23, and 8 threads are used.
./kmerset-build --canonical --compressor='bzip2' --cutoff=4 --decompressor='gzip -d' --k=23 --out=foo.kmerset.bz2 --workers=8 foo.fasta.gz
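To illustrate what counting canonical k-mers with a cutoff means, here is a toy sketch using standard Unix tools. It is not how kmerset-build is implemented. It assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement, that the input contains only uppercase A, C, G, and T, and that the rev utility is available. All function names are hypothetical.

```shell
# Reverse complement of a DNA string (uppercase A/C/G/T only).
revcomp() { printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'; }

# Canonical form (assumed convention): the lexicographically smaller
# of a k-mer and its reverse complement.
canonical() {
  printf '%s\n%s\n' "$1" "$(revcomp "$1")" | LC_ALL=C sort | head -n 1
}

# Print the canonical k-mers of sequence $1 (with k = $2)
# that occur at least $3 times.
kmers_with_cutoff() {
  dna=$1 k=$2 cutoff=$3
  i=1
  last=$(( ${#dna} - k + 1 ))
  while [ "$i" -le "$last" ]; do
    canonical "$(printf '%s' "$dna" | cut -c "$i-$((i + k - 1))")"
    i=$((i + 1))
  done | LC_ALL=C sort | uniq -c | awk -v c="$cutoff" '$1 >= c { print $2 }'
}

# ACGTACGT has the 3-mers ACG, CGT, GTA, TAC, ACG, CGT.
# Their canonical forms are ACG (4 times) and GTA (2 times).
kmers_with_cutoff ACGTACGT 3 3   # prints ACG
```

With a cutoff of 3, GTA is dropped because it appears only twice, which mirrors the role of --cutoff above.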
$ ./kmerset-stat --help
kmerset-stat: Prints the metadata of a k-mer set. Usage: ./kmerset-stat [options] <path to file>
Flags:
--canonical (set this flag when handling canonical k-mers); default: true;
--debug (enable debugging messages); default: false;
--decompressor (a program to decompress input files; e.g., "bzip2 -d" for
bzip2, "gzip -d" for gzip, and "" for no decompression); default: "";
--k (the length of k-mers); default: 15;
--workers (number of threads to use); default: 1;
Try --helpfull to get a list of all flags.
The following command shows the metadata of the k-mer set represented by foo.kmerset.bz2. It is assumed that the file
represents a set of canonical k-mers where k is 23. 8 threads are used.
./kmerset-stat --canonical --decompressor='bzip2 -d' --k=23 --workers=8 foo.kmerset.bz2
$ ./kmerset-multiple-compress --help
kmerset-multiple-compress: Compresses multiple k-mer sets. Usage: ./kmerset-multiple-compress [options] <paths to file> <path to file> ...
Flags:
--canonical (set this flag when handling canonical k-mers); default: true;
--compressor (a program to compress output files; e.g., "bzip2" for bzip2,
"gzip" for gzip, and "" for no compression); default: "";
--debug (enable debugging messages); default: false;
--decompressor (a program to decompress input files; e.g., "bzip2 -d" for
bzip2, "gzip -d" for gzip, and "" for no decompression); default: "";
--extension (extension for output files); default: "txt";
--k (the length of k-mers); default: 15;
--out (directory path to save dumped files); default: "";
--out_graph (path to save dumped DOT file); default: "";
--workers (number of threads to use); default: 1;
Try --helpfull to get a list of all flags.
The following command reads ./data/*.kmerset.gz and compresses the resulting k-mer sets. The output is compressed with
bzip2 and saved to ./compressed/*.bz2. The DOT file representing the graph data is saved to ./graph.gv. The command
handles canonical k-mers where k is 23. 8 threads are used.
./kmerset-multiple-compress --canonical --compressor='bzip2' --decompressor='gzip -d' --extension='bz2' --k=23 --out=./compressed --out_graph=./graph.gv --workers=8 ./data/*.kmerset.gz
$ ./kmerset-multiple-decompress --help
kmerset-multiple-decompress: Decompresses the output of "kmerset-multiple-compress". Usage: ./kmerset-multiple-decompress [options] <path to directory>
Flags:
--canonical (set this flag when handling canonical k-mers); default: true;
--debug (enable debugging messages); default: false;
--decompressor (a program to decompress input files; e.g., "bzip2 -d" for
bzip2, "gzip -d" for gzip, and "" for no decompression); default: "";
--extension (extension of files in folder); default: "txt";
--k (the length of k-mers); default: 15;
--workers (number of threads to use); default: 1;
Try --helpfull to get a list of all flags.
The following command reads the output of kmerset-multiple-compress saved to ./compressed/*.bz2, decompresses the
compressed data, and prints the metadata of each of the original k-mer sets. It handles canonical k-mers where k is 23.
8 threads are used.
./kmerset-multiple-decompress --canonical --decompressor='bzip2 -d' --extension='bz2' --k=23 --workers=8 ./compressed
$ ./spss-benchmark --help
spss-benchmark: Runs a benchmark for SPSS construction using a single k-mer set. Usage: ./spss-benchmark [options] <path to file>
Flags:
--buckets (number of buckets for SPSS calculation); default: 1;
--debug (enable debugging messages); default: false;
--decompressor (a program to decompress input files; e.g., "bzip2 -d" for
bzip2, "gzip -d" for gzip, and "" for no decompression); default: "";
--k (the length of k-mers); default: 15;
--repeats (number of repeats); default: 1;
--workers (number of threads to use); default: 1;
Try --helpfull to get a list of all flags.
The following command loads the k-mer set represented by foo.kmerset.bz2 and runs a benchmark that compares our proposed
SPSS construction algorithm with the UST algorithm. k is set to 23. 1024 buckets (a parameter of the
proposed algorithm) and 8 threads are used, and the benchmark is repeated 10 times.
./spss-benchmark --buckets=1024 --decompressor='bzip2 -d' --k=23 --repeats=10 --workers=8 foo.kmerset.bz2
The input file should contain canonical k-mers.
The following command, when executed in the build directory, invokes all the tests.
ctest
It is also possible to configure test execution by providing arguments to ctest. Refer
to the ctest documentation for details.
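As an example of such arguments, the sketch below wraps ctest with two of its standard options. The wrapper name run_tests and the pattern in the usage comment are hypothetical. The wrapper is meant to be called from the build directory.

```shell
# Hypothetical wrapper around ctest; call it from the build directory.
run_tests() {
  # --output-on-failure prints the output of failing tests;
  # -j 4 runs up to 4 tests in parallel.
  # Extra arguments are passed through to ctest unchanged.
  ctest --output-on-failure -j 4 "$@"
}

# Usage (the regular expression 'kmer' is only an illustration):
#   run_tests            # run all tests
#   run_tests -R kmer    # run only tests whose names match 'kmer'
```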
- lib contains most of the source code. The code in lib/core provides core functionality, and the code outside that directory provides helper functions. The files in lib/core do not depend on files outside the lib/core directory.
- src contains the source code for the executables. Each .cc file corresponds to one executable with the same name.
- test contains test code for the functions and classes defined in lib/core.
- benchmark contains source code for benchmarks of critical functions and classes.
- Currently, the value of k can be 15, 19, or 23.
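Because k is currently limited to 15, 19, or 23, a wrapper script can validate the value before passing it to --k. A minimal sketch follows; the helper name valid_k is hypothetical and not part of this repository.

```shell
# Hypothetical helper: succeeds only for the currently supported values of k.
valid_k() {
  case "$1" in
    15|19|23) return 0 ;;
    *) return 1 ;;
  esac
}

k=23
if valid_k "$k"; then
  echo "k=$k is supported"
else
  echo "k=$k is not supported; use 15, 19, or 23" >&2
fi
```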