Advanced usage
apytram allows almost full customization by the user. Thus, there are many parameters that can be set and the section below is a bit long!
The RNA-seq data (a fastq or fasta file), given by the `-fq` or `-fa` option, is formatted into a BLAST database, whose name is given by the `-d` option.
The data type, paired or single end, must be given by the `-dt` option. If the data is paired-end, all reads (1 and 2) must be concatenated in a single file. WARNING: Paired read names must end with 1 or 2.
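The naming rule above can be checked before launching a job. The following is a minimal sketch, not part of apytram: the function name `find_unpaired_reads` is hypothetical, and it assumes that the two mates of a pair have identical names apart from the final `1` or `2`.

```python
from collections import defaultdict


def find_unpaired_reads(read_names):
    """Return the read names whose mate is missing.

    Assumption: the two reads of a pair share the same name except for
    the last character, which is "1" or "2".
    """
    mates = defaultdict(set)
    for name in read_names:
        suffix = name[-1:]
        if suffix not in ("1", "2"):
            raise ValueError(f"read name does not end with 1 or 2: {name!r}")
        # Group reads by their name without the final 1/2.
        mates[name[:-1]].add(suffix)

    unpaired = []
    for stem, suffixes in mates.items():
        if suffixes != {"1", "2"}:
            unpaired.extend(stem + suffix for suffix in sorted(suffixes))
    return unpaired
```

An empty result means that every read has a mate under this naming assumption; a `ValueError` points at a name that breaks the rule.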
Reads contained in this file will all be used, so the file must have already been cleaned up.
Note that if your data is in fastq format, it will be converted into a fasta file. This conversion can take a lot of time (between XX and YY hours on a laptop).
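To avoid paying the conversion cost on every run, the fasta file can be prepared once beforehand and passed with `-fa`. The sketch below is one simple way to do the conversion, not apytram's own code: the function name `fastq_to_fasta` is hypothetical, and it assumes strict four-line fastq records (no wrapped sequences).

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a four-line-per-record fastq file to fasta.

    Returns the number of reads written. Quality lines are discarded.
    """
    count = 0
    with open(fastq_path) as fastq, open(fasta_path, "w") as fasta:
        while True:
            header = fastq.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            sequence = fastq.readline()
            fastq.readline()  # "+" separator line, ignored
            fastq.readline()  # quality line, ignored
            if not header.startswith("@"):
                raise ValueError(f"unexpected fastq header: {header!r}")
            fasta.write(">" + header[1:].rstrip("\n") + "\n")
            fasta.write(sequence.rstrip("\n") + "\n")
            count += 1
    return count
```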
Because building the database is time-consuming, this step is skipped if a BLAST database is already present.
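How apytram detects an existing database is not described here. As an illustration only, a check of this kind could look for the index files that `makeblastdb` writes for a nucleotide database (`.nhr`, `.nin`, `.nsq`, or a `.nal` alias file for multi-volume databases). The function name `blast_database_exists` is hypothetical.

```python
import os


def blast_database_exists(db_name):
    """Heuristic check for a nucleotide BLAST database named db_name.

    Assumption: the database was built by makeblastdb, which writes
    db_name.nhr, db_name.nin and db_name.nsq, or a db_name.nal alias
    file when the database is split into several volumes.
    """
    single_volume = all(
        os.path.isfile(db_name + extension)
        for extension in (".nhr", ".nin", ".nsq")
    )
    multi_volume = os.path.isfile(db_name + ".nal")
    return single_volume or multi_volume
```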
Sequences in the query file (`-q` option) serve as the first reference sequences. The possibility of using a multi-reference query file is still being tested.
- Read recruitment (BLAST):
  Bait sequences are used to recruit homologous reads by BLAST. The `-e` option sets the e-value threshold (default: 0.001). If the data is paired-end, both reads of each recruited pair are added to the read list. The sequences of all these reads are written to a temporary file, which is accessible if the `-tmp` option is used.
- Read assembly (Trinity):
  All reads found by BLAST (present in the temporary file) are assembled de novo by Trinity with default parameters. Note that if the data is paired-end, Trinity treats the reads as paired reads.
- Quality and filtering of reconstructed contigs (Exonerate):
  Exonerate is used to align each assembled contig to the reference. The length of the contig, the length of the alignment with the reference, the percentage of identity and the alignment score are collected. Reconstructed sequences are filtered according to the options `-id`, `-mal` and `-len`. Contigs that pass all filters are used as references for the next iteration.
- Comparison with the previous iteration (Exonerate):
  Exonerate is used to compare the reconstructed sequences of iteration n+1 with those of iteration n. If the quality of the assembled sequences has not improved, the process stops.
- Coverage calculation (Mafft):
  To estimate the overall quality of the contig assembly, apytram calculates two coverage values:
  - a Strictcoverage, the percentage of query sites that have a homologous site in the reconstructed sequence(s);
  - a Largecoverage, the number of alignment sites with at least one representative in the reconstructed sequence(s), divided by the length of the reference and expressed as a percentage. By definition, it can exceed 100.

  Note that if there are several reference sequences, the first sequence in the reference file is taken as the reference for the coverage calculation, but all references are aligned.
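The two coverage values can be illustrated on a toy alignment. This is a sketch of one reading of the definitions above, not apytram's implementation: the function name `coverage_values` is hypothetical, the alignment is given as equal-length gapped strings, and `-` is assumed to be the gap character.

```python
def coverage_values(aligned_reference, aligned_contigs, gap="-"):
    """Return (strict_coverage, large_coverage) as percentages.

    aligned_reference: gapped reference sequence taken from the alignment.
    aligned_contigs: gapped reconstructed sequences from the same alignment.
    """
    if any(len(contig) != len(aligned_reference) for contig in aligned_contigs):
        raise ValueError("all aligned sequences must have the same length")
    reference_length = sum(1 for site in aligned_reference if site != gap)
    if reference_length == 0:
        raise ValueError("the reference contains only gaps")

    strict_sites = 0  # reference sites covered by a reconstructed sequence
    large_sites = 0   # any alignment column covered by a reconstructed sequence
    for column, reference_site in enumerate(aligned_reference):
        covered = any(contig[column] != gap for contig in aligned_contigs)
        if covered:
            large_sites += 1
            if reference_site != gap:
                strict_sites += 1

    strict_coverage = 100.0 * strict_sites / reference_length
    large_coverage = 100.0 * large_sites / reference_length
    return strict_coverage, large_coverage
```

For example, `coverage_values("--ACGT--", ["TTACG-AA"])` gives a strict coverage of 75.0 (3 of the 4 reference sites are covered) and a large coverage of 175.0 (7 covered columns for a reference of length 4), which shows why the second value can exceed 100 when the reconstructed sequence extends beyond the reference.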
At the end of an iteration, reconstructed sequence(s) become the reference sequences for the next iteration.
The iterative process stops as soon as one of the following criteria is met during an iteration. These criteria are applied by default:
- The maximum number of iterations (`-i` option) is reached.
- The recruited reads are the same as in the previous iteration.
- The reconstructed sequences at the end of an iteration are almost the same as after the previous iteration. "Almost" means that each sequence of an iteration has a corresponding sequence in the previous iteration with at least 99% identity and 98% of its length.
- The number of reconstructed sequences has not changed AND the total length, score and Largecoverage of all reconstructed sequences have not improved. Using the Largecoverage value in this step allows the iterations to continue if the 5' and 3' UTRs are getting longer, even if the coding sequence is not.
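The "almost the same" criterion can be sketched as follows. This is only an interpretation of the 99% identity and 98% length thresholds quoted above: the function name `sequences_almost_identical`, its parameters and the input layout (one best match per sequence, as an identity percentage and an aligned-length percentage) are all assumptions made for the example.

```python
def sequences_almost_identical(best_matches, min_identity=99.0, min_length_percent=98.0):
    """Return True if every sequence has a close match in the previous iteration.

    best_matches: one entry per sequence of the current iteration, either
    None (no corresponding sequence found) or a tuple
    (identity_percent, aligned_length_percent) for its best match.
    """
    if not best_matches:
        return False  # nothing reconstructed, so nothing to compare
    return all(
        match is not None
        and match[0] >= min_identity
        and match[1] >= min_length_percent
        for match in best_matches
    )
```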
If the `--required_coverage` option is used, the iterative process stops once the Strictcoverage exceeds the required coverage.
N.B.: None of these criteria is applied if the `--finish_all_iter` option is used.
The last criterion that can stop the iterative process is a time limit given by the `-time_max` option (7200 seconds by default).
If this time limit is reached, no new iteration will begin, even with the `--finish_all_iter` option. Note that a job can therefore run for longer than `-time_max` if the database building and the last iteration together take longer than this limit.
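The way these stopping rules combine can be summarised in a small decision function. This is a sketch under stated assumptions, not apytram's code: the function name `start_next_iteration` and its parameters are hypothetical, and it assumes that the maximum number of iterations still bounds the loop when `--finish_all_iter` is used.

```python
def start_next_iteration(iteration, max_iterations, stop_criterion_met,
                         finish_all_iter, elapsed_seconds, time_max=7200):
    """Decide whether a new iteration should begin.

    iteration: number of iterations already completed.
    stop_criterion_met: True if any default stopping criterion
        (identical reads, almost identical sequences, no improvement,
        required coverage reached) was met in the last iteration.
    """
    # The time limit is checked first: it applies even with finish_all_iter.
    if elapsed_seconds >= time_max:
        return False
    # Assumption: the iteration cap always applies.
    if iteration >= max_iterations:
        return False
    # With finish_all_iter, the other stopping criteria are ignored.
    if finish_all_iter:
        return True
    return not stop_criterion_met
```

Because the check only happens between iterations, an iteration that is already running is never interrupted, which is why the total run time can exceed the limit.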
The number of threads can be set with the `-threads $nb_threads` option.
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -threads $nb_threads
If you use the `-tmp` option, temporary files will not be deleted at the end of the job.
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -tmp $tmp_directory_name
If you want to restart a job (for example after a failure), you can use the `-i_start` option, but only if the `-tmp` option was set for the job you want to restart.
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -tmp $tmp_directory_name -i_start X
If you only want to build a BLAST database from your data, you can do so within apytram by omitting the `-q` option.
./apytram.py -d $db_name -dt $db_type -fq $fastq_name
A final filter (using the `-fmal`, `-fid` and `-flen` options) can be applied to the reconstructed sequences in order to be more stringent than the thresholds used during the iterative process.
The reference sequence of a reconstructed sequence is the sequence from the query file that is the most similar to it (according to the Exonerate score, see "How does apytram work?").
- If you want to keep only reconstructed sequences whose length is greater than X percent of the reference sequence length:
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X
- If you want to keep only reconstructed sequences whose identity with the reference sequence is greater than Y percent over the whole alignment length:
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fid Y
- If you want to keep only reconstructed sequences that align with the reference over a length greater than Z percent of the reference sequence length:
```
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fmal Z
```
- These options can also be combined:
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X -fid Y -fmal Z
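The three final-filter thresholds can be read as the following test on each reconstructed sequence. This is an illustrative sketch, not apytram's code: the function name `passes_final_filter`, its parameters and the defaults of 0 are hypothetical, and thresholds are treated as inclusive (`>=`), which is an assumption.

```python
def passes_final_filter(contig_length, alignment_length, identity_percent,
                        reference_length, flen=0.0, fid=0.0, fmal=0.0):
    """Return True if a reconstructed sequence passes all three thresholds.

    flen: minimum contig length, as a percentage of the reference length.
    fid:  minimum identity percentage over the alignment.
    fmal: minimum alignment length, as a percentage of the reference length.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    length_percent = 100.0 * contig_length / reference_length
    aligned_percent = 100.0 * alignment_length / reference_length
    return (
        length_percent >= flen
        and identity_percent >= fid
        and aligned_percent >= fmal
    )
```

For instance, a 900-base contig aligned over 800 bases with 95% identity to a 1000-base reference passes with `flen=80, fid=90, fmal=70`, but fails if `fmal` is raised to 85, because the alignment covers only 80% of the reference.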