JARVIS3: an improved encoder for genomic sequences
Manually:
git clone https://github.com/cobilab/jarvis3.git
cd jarvis3/src/
make
Using Conda:
conda install -c bioconda jarvis3
Example of running JARVIS3 using level 7:
./JARVIS3 -v -l 7 File.seq
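A compress/decompress round trip is a quick way to check that the encoding is lossless on your own data. The sketch below assumes that compression writes the result next to the input with the `.jc` extension (as in the SAMPLE section of the help menu) and uses `-o` to name the decompressed file explicitly. The function name `jarvis3_roundtrip`, the `JARVIS3_BIN` variable, and the `.restored` suffix are illustrative and not part of JARVIS3.

```shell
# Path to the JARVIS3 binary (hypothetical variable; adjust to your setup).
JARVIS3_BIN="${JARVIS3_BIN:-./JARVIS3}"

# jarvis3_roundtrip FILE [LEVEL]
# Compress FILE, decompress the result into FILE.restored, and compare both.
# Assumption: compression writes FILE.jc, as in the help menu's SAMPLE section.
jarvis3_roundtrip() {
  input="$1"
  level="${2:-7}"                      # 7 is the documented default level
  "$JARVIS3_BIN" -f -l "$level" "$input" || return 1
  "$JARVIS3_BIN" -f -d -o "$input.restored" "$input.jc" || return 1
  if cmp -s "$input" "$input.restored"; then
    echo "round trip OK: $input"
  else
    echo "round trip FAILED: $input" >&2
    return 1
  fi
}

# Example (requires a built JARVIS3 binary):
#   jarvis3_roundtrip File.seq 7
```

The `-f` flag is used so that repeated runs overwrite earlier outputs instead of stopping.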
To see the possible options, type:
./JARVIS3 -h
This will print the following options:
██ ███████ ███████ ██ ██ ██ ███████ ███████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ███████ ██████ ██ ██ ██ ███████ ███████
██ ██ ██ ██ ██ ███ ██ ██ ██ ██ ██
███████ ██ ██ ██ ███ ████ ██ ███████ ███████
NAME
JARVIS3 v3.7,
Efficient lossless encoding of genomic sequences
SYNOPSIS
./JARVIS3 [OPTION]... [FILE]
SAMPLE
Run Compression -> ./JARVIS3 -v -l 14 sequence.txt
Run Decompression -> ./JARVIS3 -v -d sequence.txt.jc
DESCRIPTION
Lossless compression and decompression of genomic
sequences for miniaml storage and analysis purposes.
Measure an upper bound of the sequence complexity.
-h, --help
Usage guide (help menu).
-a, --version
Display program and version information.
-x, --explanation
Explanation of the context and repeat models.
-f, --force
Force mode. Overwrites old files.
-v, --verbose
Verbose mode (more information).
-p, --progress
Show progress bar.
-d, --decompress
Decompression mode.
-e, --estimate
It creates a file with the extension ".iae" with the
respective information content. If the file is FASTA or
FASTQ it will only use the "ACGT" (genomic) sequence.
-s, --show-levels
Show pre-computed compression levels (configured).
-l [NUMBER], --level [NUMBER]
Compression level (integer).
Default level: 7.
It defines compressibility in balance with computational
resources (RAM & time). Use -s for levels perception.
-sd [NUMBER], --seed [NUMBER]
Pseudo-random seed.
Default value: 0.
-hs [NUMBER], --hidden-size [NUMBER]
Hidden size of the neural network (integer).
Default value: 40.
-lr [DOUBLE], --learning-rate [DOUBLE]
Neural Network leaning rate (double).
The 0 value turns the Neural Network off.
Default value: 0.03.
-o [FILENAME], --output [FILENAME]
Compressed/decompressed output filename.
[FILENAME]
Input sequence filename (to compress) -- MANDATORY.
File to compress is the last argument.
COPYRIGHT
Copyright (C) 2014-2024.
This is a Free software, under GPLv3. You may redistribute
copies of it under the terms of the GNU - General Public
License v3 <http://www.gnu.org/licenses/gpl.html>.
To see the possible levels (automatically chosen compression parameters), type:
./JARVIS3 -s
This will output the following pre-set models for each level:
Level 1: -rm 1:12:0.90:4:0.72:0:0.1:1
Level 2: -rm 1:12:0.90:4:0.72:1:0.1:1
Level 3: -rm 1:13:0.90:4:0.72:1:0.1:1
Level 4: -rm 1:14:0.90:4:0.72:1:0.1:1
Level 5: -rm 2:12:0.90:5:0.60:1:0.1:1
Level 6: -rm 4:12:0.94:7:0.70:1:0.05:3
Level 7: -rm 3:13:0.90:5:0.72:1:0.1:1
Level 8: -rm 3:14:0.90:5:0.72:1:0.1:1
Level 9: -rm 5:14:0.90:5:0.72:1:0.1:1
Level 10: -rm 6:12:0.90:6:0.78:1:0.03:1
Level 11: -rm 8:13:0.90:6:0.78:1:0.03:2
Level 12: -rm 10:12:0.91:7:0.80:1:0.02:3
Level 13: -rm 12:12:0.90:7:0.81:1:0.02:3
Level 14: -lr 0 -cm 1:1:0:0.9/0:0:0:0 -rm 2:12:0.92:7:0.80:1:0.05:2
Level 15: -lr 0 -cm 3:1:0:0.9/0:0:0:0 -rm 3:12:0.93:7:0.81:1:0.05:3
Level 16: -lr 0 -cm 3:1:0:0.9/0:0:0:0 -rm 4:12:0.92:7:0.81:1:0.03:2
Level 17: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 4:13:0.94:7:0.81:1:0.04:3
Level 18: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 4:13:0.94:7:0.81:1:0.04:3
Level 19: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 8:12:0.93:7:0.81:1:0.02:3
Level 20: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 20:12:0.9:7:0.85:1:0.01:4
Level 21: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 50:12:0.9:7:0.85:1:0.01:5
Level 22: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 100:12:0.9:7:0.85:1:0.01:5
Level 23: -lr 0 -cm 4:1:0:0.9/0:0:0:0 -rm 200:12:0.9:7:0.85:1:0.01:6
Level 24: -lr 0 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.93:6:0.81:1:0.02:1
Level 25: -lr 0.03 -hs 24 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.92:6:0.81:1:0.02:1
Level 26: -lr 0.03 -hs 32 -cm 4:1:0:0.9/0:0:0:0 -rm 20:15:0.90:7:0.82:1:0.02:1
Level 27: -lr 0.03 -hs 24 -cm 6:1:0:0.9/0:0:0:0 -rm 15:13:0.92:7:0.85:0:0.02:4 -rm 13:12:0.92:7:0.84:2:0.01:3
Level 28: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 6:15:0.93:6:0.81:1:0.02:1
Level 29: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 10:15:0.93:6:0.81:1:0.02:1
Level 30: -lr 0.03 -hs 42 -cm 6:1:0:0.9/0:0:0:0 -rm 10:15:0.93:6:0.81:0:0.02:1 -rm 10:15:0.93:6:0.81:2:0.02:1
Level 31: -lr 0.03 -hs 48 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 300:12:0.9:7:0.85:0:0.01:10 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 32: -lr 0.04 -hs 64 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 500:12:0.9:7:0.85:0:0.01:12 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 33: -lr 0.04 -hs 86 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 500:12:0.9:7:0.85:0:0.01:12 -rm 200:12:0.9:7:0.8:2:0.01:4
Level 34: -lr 0.04 -hs 256 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.9/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 -rm 1500:12:0.9:7:0.85:0:0.01:10 -rm 500:12:0.9:7:0.82:2:0.01:3
Level 35: -lr 0.04 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:1:0:0.9/0:0:0:0 -cm 9:1:1:0.9/0:0:0:0 -cm 11:10:0:0.9/0:0:0:0 -rm 100:14:0.9:7:0.85:1:0.01:3 -rm 200:12:0.88:7:0.85:0:0.01:3 -rm 300:12:0.87:7:0.85:2:0.01:3
Level 36: -lr 0.04 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:1:0:0.9/0:0:0:0 -cm 9:1:1:0.9/0:0:0:0 -cm 11:10:0:0.9/0:0:0:0 -cm 13:200:1:0.9/1:10:1:0.9 -rm 100:14:0.9:7:0.85:1:0.01:3 -rm 200:12:0.88:7:0.85:0:0.01:8 -rm 300:12:0.87:7:0.85:2:0.01:3
Level 37: -lr 0.01 -hs 248 -cm 1:1:0:0.9/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 6:1:0:0.9/0:0:0:0 -cm 9:1:0:0.9/0:0:0:0 -cm 11:10:1:0.9/0:0:0:0 -cm 14:200:1:0.9/1:10:1:0.9 -rm 300:14:0.88:7:0.85:0:0.01:8 -rm 300:14:0.88:7:0.85:2:0.01:8 -rm 500:12:0.88:7:0.85:0:0.01:15
Level 38: -lr 0 -cm 12:1:0:0.7/0:0:0:0 -rm 2:14:0.95:1:0.9:1:0.1:1
Level 39: -lr 0 -cm 12:1:0:0.7/0:0:0:0 -rm 3:14:0.95:1:0.9:1:0.1:1
Level 40: -lr 0.03 -lr 32 -cm 12:1:0:0.7/0:0:0:0 -rm 4:14:0.95:1:0.9:1:0.1:1
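Because the level trades compression against RAM and time, it can help to look up a level's parameters and to compare a few levels on a sample of your own data. The following is a minimal sketch, assuming that `./JARVIS3 -s` prints one `Level N: ...` entry per line and that `-o` can be combined with `-l` when compressing. The helper names `level_params` and `compare_levels`, the `JARVIS3_BIN` variable, and the `.lN.jc` output naming are hypothetical.

```shell
# Path to the JARVIS3 binary (hypothetical variable; adjust to your setup).
JARVIS3_BIN="${JARVIS3_BIN:-./JARVIS3}"

# level_params N
# Read "Level N: PARAMS" lines on stdin and print PARAMS for level N.
# Assumption: one level per line in the `-s` output.
level_params() {
  sed -n "s/^ *Level $1: *//p"
}

# compare_levels FILE LEVEL...
# Compress FILE once per LEVEL and print "level<TAB>compressed bytes".
compare_levels() {
  input="$1"; shift
  for lvl in "$@"; do
    out="$input.l$lvl.jc"              # illustrative per-level output name
    "$JARVIS3_BIN" -f -l "$lvl" -o "$out" "$input" >/dev/null || return 1
    printf '%s\t%s\n' "$lvl" "$(wc -c < "$out" | tr -d ' ')"
  done
}

# Examples (require a built JARVIS3 binary):
#   ./JARVIS3 -s | level_params 7
#   compare_levels File.seq 1 7 14
```

Timing each run (for example with `time`) alongside the sizes gives a rough picture of the cost of each level on a given machine.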
To see the meaning of the model parameters, type:
./JARVIS3 -x
This will output the following content:
-cm [NB_C]:[NB_D]:[NB_I]:[NB_G]/[NB_S]:[NB_E]:[NB_R]:[NB_A]
Template of a context model.
Parameters:
[NB_C]: (integer [1;14]) order size of the regular context model. Higher values use more RAM but, usually, are related to a better compression score.
[NB_D]: (integer [1;5000]) denominator to build alpha, which is a parameter estimator. Alpha is given by 1/[NB_D]. Higher values are usually used with higher [NB_C], and related to confident bets. When [NB_D] is one, the probabilities assume a Laplacian distribution.
[NB_I]: (integer {0,1,2}) number to define if a sub-program which addresses the specific properties of DNA sequences (Inverted repeats) is used or not. The number 1 turns ON the sub-program using at the same time the regular context model. The number 2 does only contemple the inversions only (NO regular). The number 0 does not contemple its use (Inverted repeats OFF). The use of this sub-program increases the necessary time to compress but it does not affect the RAM.
[NB_G]: (real [0;1)) real number to define gamma. This value represents the decayment forgetting factor of the regular context model in definition.
[NB_S]: (integer [0;20]) maximum number of editions allowed to use a substitutional tolerant model with the same memory model of the regular context model with order size equal to [NB_C]. The value 0 stands for turning the tolerant context model off. When the model is on, it pauses when the number of editions is higher that [NB_C], while it is turned on when a complete match of size [NB_C] is seen again. This is probabilistic-algorithmic model very useful to handle the high substitutional nature of genomic sequences. When [NB_S] > 0, the compressor used more processing time, but uses the same RAM and, usually, achieves a substantial higher compression ratio. The impact of this model is usually only noticed for higher [NB_C].
[NB_R]: (integer {0,1}) number to define if a sub-program which addresses the specific properties of DNA sequences (Inverted repeats) is used or not. It is similar to the [NR_I] but for tolerant models.
[NB_E]: (integer [1;5000]) denominator to build alpha for substitutional tolerant context model. It is analogous to [NB_D], however to be only used in the probabilistic model for computing the statistics of the substitutional tolerant context model.
[NB_A]: (real [0;1)) real number to define gamma. This value represents the decayment forgetting factor of the substitutional tolerant context model in definition. Its definition and use is analogus to [NB_G].
... (you may use several context models)

-rm [NB_R]:[NB_C]:[NB_B]:[NB_L]:[NB_G]:[NB_I]:[NB_W]:[NB_Y]
Template of a repeat model.
Parameters:
[NB_R]: (integer [1;10000] maximum number of repeat models for the class. On very repetive sequences the RAM increases along with this value, however it also improves the compression capability.
[NB_C]: (integer [1;14]) order size of the repeat context model. Higher values use more RAM but, usually, are related to a better compression score.
[NB_B]: (real (0;1]) beta is a real value, which is a parameter for discarding or maintaining a certain repeat model.
[NB_L]: (integer (1;20]) a limit threshold to play with [NB_B]. It accepts or not a certain repeat model.
[NB_G]: (real [0;1)) real number to define gamma. This value represents the decayment forgetting factor of the regular context model in definition.
[NB_I]: (integer {0,1,2}) number to define if a sub-program which addresses the specific properties of DNA sequences (Inverted repeats) is used or not. The number 1 turns ON the sub-program using at the same time the regular context model. The number 0 does not contemple its use (Inverted repeats OFF). The number 2 uses exclusively Inverted repeats. The use of this sub-program increases the necessary time to compress but it does not affect the RAM.
[NB_W]: (real (0;1)) initial weight for the repeat class.
[NB_Y]: (integer {0}, [1;50]) maximum cache size. This will use a table cache with the specified size. The size must be in balance with the k-mer size [NB_C].
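As a reading aid for the colon-separated model strings, the sketch below splits a `-rm` spec into the eight fields named by `./JARVIS3 -x` and computes alpha = 1/[NB_D] for the regular part of a `-cm` spec. The function names `describe_rm` and `cm_alpha` are hypothetical helpers, not part of JARVIS3, and no range checking is done beyond counting fields.

```shell
# describe_rm SPEC
# Print each field of a repeat-model spec as NAME=VALUE, using the
# field order of the -rm template shown by `./JARVIS3 -x`.
describe_rm() {
  echo "$1" | awk -F: '{
    n = split("NB_R NB_C NB_B NB_L NB_G NB_I NB_W NB_Y", name, " ")
    if (NF != n) {
      print "expected " n " fields, got " NF > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) printf "%s=%s\n", name[i], $i
  }'
}

# cm_alpha SPEC
# Print alpha = 1/[NB_D] for the regular context model of a -cm spec.
# [NB_D] is the second field; fields are separated by ":" and "/".
cm_alpha() {
  echo "$1" | awk -F'[:/]' '{ printf "%g\n", 1 / $2 }'
}

# Examples:
#   describe_rm 3:13:0.90:5:0.72:1:0.1:1     # the level 7 repeat model
#   cm_alpha 12:20:1:0.97/0:0:0:0            # prints 0.05
```

For the level 7 spec, `describe_rm` reports `NB_R=3`, `NB_C=13`, `NB_B=0.90`, `NB_L=5`, `NB_G=0.72`, `NB_I=1`, `NB_W=0.1`, and `NB_Y=1`, one per line.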
First, make sure the script is executable by running the following in the src/ folder:
chmod +x JARVIS3.sh
The extension for compressing FASTA and FASTQ data has its own menu describing the available parameters, which can be displayed with:
./JARVIS3.sh --help
This will output the following menu:
-------------------------------------------------------
JARVIS3, v3.7. High reference-free compression of DNA
sequences, FASTA data, and FASTQ data.

Program options ---------------------------------------
-h,  --help         Show this,
-a,  --about        Show program information,
-c,  --install      Install/compile programs,
-s,  --show         Show compression levels,
-l , --level        JARVIS3 compression level,
-b , --block        Block size to be splitted,
-t , --threads      Number of JARVIS3 threads,
-dn, --dna          Assume DNA sequence type,
-fa, --fasta        Assume FASTA data type,
-fq, --fastq        Assume FASTQ data type,
-au, --automatic    Detect data type (def),
-d,  --decompress   Decompression mode,

Input options -----------------------------------------
-i , --input        Input DNA filename.

Example -----------------------------------------------
./JARVIS3.sh --block 16MB -t 8 -i sample.seq
./JARVIS3.sh --decompress -t 4 -i sample.seq.tar
-------------------------------------------------------
Preparing JARVIS3 for FASTA and FASTQ:
./JARVIS3.sh --install
Compression of FASTA data:
./JARVIS3.sh --threads 8 --fasta --block 10MB --input sample.fa
Decompression of FASTA data:
./JARVIS3.sh --decompress --fasta --threads 4 --input sample.fa.tar
Compression of FASTQ data:
./JARVIS3.sh --threads 8 --fastq --block 40MB --input sample.fq
Decompression of FASTQ data:
./JARVIS3.sh --decompress --fastq --threads 4 --input sample.fq.tar
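When scripting over mixed inputs, the data-type flag can be chosen from the file name instead of being typed by hand. The sketch below maps a few common extensions to the `--fasta`, `--fastq`, and `--dna` flags from the menu and falls back to `--automatic`, the script's default detection. The helper name `jarvis3_type_flag` and the extension list are assumptions made for illustration.

```shell
# jarvis3_type_flag FILE
# Print the JARVIS3.sh data-type flag that matches FILE's extension.
# The extension list is an assumption; unknown names fall back to
# --automatic, which is the script's own default.
jarvis3_type_flag() {
  case "$1" in
    *.fa|*.fasta|*.fna) echo "--fasta" ;;
    *.fq|*.fastq)       echo "--fastq" ;;
    *.seq)              echo "--dna" ;;
    *)                  echo "--automatic" ;;
  esac
}

# Example (requires JARVIS3.sh to be installed with --install):
#   f=sample.fa
#   ./JARVIS3.sh --threads 8 "$(jarvis3_type_flag "$f")" --block 10MB --input "$f"
```

Passing the same flag on decompression, as in the commands above, keeps both directions consistent.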
Sousa, Maria JP, Armando J. Pinho, and Diogo Pratas. "JARVIS3: an efficient encoder for genomic data." Bioinformatics 40.12 (2024): btae725.
For any issue, let us know through the issues page of the repository.
For more information:
http://www.gnu.org/licenses/gpl-3.0.html