tlorin edited this page Jan 4, 2016 · 26 revisions

apytram can be customized almost entirely by the user, so there are many parameters to set and the section below is a bit long!

Database building

The RNA-seq data (a fastq or fasta file), given by the -fq or -fa option, is formatted into a BLAST database, whose name is given by the -d option.

The data type (paired-end or single-end) must be given with the -dt option. If the data is paired-end, all reads (1 and 2) must be concatenated in a single file. WARNING: Paired read names must end with 1 or 2. All reads contained in this file will be used, so the file must already have been cleaned up.
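The naming convention for paired reads can be checked before building the database. The sketch below is not part of apytram: `mate_name` and `unpaired_reads` are hypothetical helpers that only assume the rule stated above (a paired read name ends with 1 or 2, and its mate carries the other digit).

```python
def mate_name(read_name):
    """Return the name of the mate, assuming read names end with 1 or 2."""
    if read_name.endswith("1"):
        return read_name[:-1] + "2"
    if read_name.endswith("2"):
        return read_name[:-1] + "1"
    raise ValueError("paired read name must end with 1 or 2: %r" % read_name)


def unpaired_reads(read_names):
    """Return the sorted names whose mate is missing from the collection."""
    names = set(read_names)
    return sorted(name for name in names if mate_name(name) not in names)


# Hypothetical read names, for illustration only.
print(unpaired_reads(["readA/1", "readA/2", "readB/1"]))
```

An empty result suggests that every read in the file has its mate; any name listed would be worth inspecting before running apytram.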

Note that if your data is in fastq format, it will be converted into a fasta file. This conversion can take a lot of time (between XX and YY hours on a laptop).
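To give an idea of what this conversion involves, here is a minimal sketch. It is not apytram's own code. It assumes plain four-line FASTQ records (header, sequence, `+` line, quality line) with no wrapped sequences, and `fastq_to_fasta` is a hypothetical name.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy four-line FASTQ records to FASTA and return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("unexpected FASTQ header: %r" % header)
        sequence = fastq_handle.readline()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        count += 1
    return count


# Small in-memory example with two hypothetical reads.
source = io.StringIO("@readA/1\nACGT\n+\nIIII\n@readA/2\nTTGA\n+\nIIII\n")
target = io.StringIO()
fastq_to_fasta(source, target)
print(target.getvalue())
```

Quality values are dropped in the process, which is why the reads should be cleaned up before this step.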

Because building the database is time-consuming, this step is skipped if a BLAST database is already present.

Iterative process

Sequences in the query file (-q option) serve as the first reference sequences. Support for multi-reference query files is still being tested.

A classical iteration (see Figure XX)

  1. Read recruitment (BLAST):

    Bait sequences are used to recruit homologous reads by BLAST. The -e option sets the E-value threshold (default: 0.001). If the data is paired-end, the paired reads of all recruited reads are also added to the read list. The sequences of all these reads are written to a temporary file, which is accessible if the -tmp option is used.

  2. Read assembly (Trinity):

    All reads found by BLAST (present in the temporary file) are assembled de novo by Trinity with default parameters. Note that if the data contains paired-end reads, Trinity treats them as paired reads.

  3. Quality and filtering of reconstructed contigs (Exonerate):

    Exonerate is used to align each assembled contig to the reference. The length of the contig, the length of the alignment with the reference, the percentage of identity and the alignment score are collected. Reconstructed sequences are filtered according to the options -id, -mal and -len. Contigs that pass all filters are used as reference for the next iteration.

  4. Comparison with previous iteration (Exonerate):

    Exonerate is used to compare the reconstructed sequences of iteration n+1 with those of iteration n. If the quality of the assembled sequences has not improved, the process is stopped.

  5. Coverage calculation (Mafft):

    To estimate the overall quality of the contig assembly, apytram calculates two coverage values:

    • a Strictcoverage: the percentage of query sites that have a homologous site in the reconstructed sequence(s)
    • a Largecoverage: the number of alignment sites with at least one representative in the reconstructed sequence(s), divided by the length of the reference and expressed as a percentage. By definition, it can exceed 100.

    Note that if there is more than one reference sequence, the first sequence in the reference file is taken as the reference for the coverage calculation, but all references are aligned.

    Figure to explain.
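The two coverage values of step 5 can be illustrated with a small sketch. This is one reading of the definitions, not apytram's actual code: it assumes the Mafft alignment is already available as gapped strings of equal length, with `-` as the gap character, and `coverage` is a hypothetical function name.

```python
def coverage(aligned_reference, aligned_contigs, gap="-"):
    """Return (strict, large) coverage percentages from gapped sequences."""
    for contig in aligned_contigs:
        if len(contig) != len(aligned_reference):
            raise ValueError("all aligned sequences must have the same length")
    reference_length = sum(1 for char in aligned_reference if char != gap)
    if reference_length == 0:
        raise ValueError("the reference contains no site")

    strict_sites = 0  # reference sites covered by at least one contig
    large_sites = 0   # alignment columns covered by at least one contig
    for column, ref_char in enumerate(aligned_reference):
        if any(contig[column] != gap for contig in aligned_contigs):
            large_sites += 1
            if ref_char != gap:
                strict_sites += 1

    strict = 100.0 * strict_sites / reference_length
    large = 100.0 * large_sites / reference_length
    return strict, large


# The contig extends beyond both ends of a 4-site reference.
print(coverage("--ACGT--", ["TTACGTAA"]))  # (100.0, 200.0)
```

In this toy alignment the contig covers every reference site (strict coverage of 100) and also four extra columns outside the reference, which is how the large coverage ends up above 100.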

At the end of an iteration, reconstructed sequence(s) become the reference sequences for the next iteration.

Criteria to stop the iterative process

If any of the following criteria is met during an iteration, the iterative process stops. These criteria are enabled by default:

  • The maximum number of iterations (-i option) is reached.
  • The recruited reads are the same as in the previous iteration.
  • The reconstructed sequences at the end of an iteration are almost the same as after the previous iteration. "Almost the same" means that each sequence of an iteration has a corresponding sequence in the previous iteration with at least 99% identity and 98% of its length.
  • The number of reconstructed sequences has not changed AND the total length, score and Largecoverage of all reconstructed sequences have not improved. Using the Largecoverage value in this step allows iteration to continue while the 5' and 3' UTRs are getting longer, even if the coding sequence is not.

If the --required_coverage option is used, the iterative process will stop when the Strictcoverage is greater than the required coverage.

N.B.: None of these criteria is applied if the --finish_all_iter option is used.

The last criterion that can stop the iterative process is a time limit, given by the -time_max option (default: 7200 seconds). Once this limit is reached, no new iteration will begin, even with the --finish_all_iter option. Note that a job can therefore run longer than -time_max if the database building and the last iteration together take longer than this limit.
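How these criteria combine can be sketched as follows. This is a simplified illustration, not apytram's implementation: `stop_reason` and its parameters are hypothetical, the two criteria based on sequence comparison are left out, and it assumes that the -i limit still applies when --finish_all_iter is used, which the text above does not state explicitly.

```python
def stop_reason(iteration, max_iterations, elapsed, time_max,
                reads, previous_reads,
                strict_coverage=None, required_coverage=None,
                finish_all_iter=False):
    """Return why no new iteration should start, or None to keep going."""
    # The time limit applies even with --finish_all_iter.
    if elapsed >= time_max:
        return "time limit reached"
    # Assumption: the -i limit is always enforced.
    if iteration >= max_iterations:
        return "maximum number of iterations reached"
    # With --finish_all_iter, the remaining criteria are ignored.
    if finish_all_iter:
        return None
    if set(reads) == set(previous_reads):
        return "same reads as in the previous iteration"
    if (required_coverage is not None and strict_coverage is not None
            and strict_coverage > required_coverage):
        return "required coverage reached"
    return None


# Hypothetical values: iteration 3 of 5, 100 seconds elapsed.
print(stop_reason(3, 5, 100, 7200, {"readA/1"}, {"readA/1"}))
```

Because the check happens between iterations, an iteration that is already running is not interrupted, which matches the remark that a job can exceed the -time_max limit.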

Change the number of threads

This can be set with the -threads $nb_threads option.

./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -threads $nb_threads


Access to temporary files

If you use the -tmp option, temporary files will not be deleted at the end of the job.

./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -tmp $tmp_directory_name

Restart a job

If you want to restart a job (for example after a failure), you can use the -i_start option, but only if the -tmp option was set for the job you want to restart.

./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -tmp $tmp_directory_name -i_start X

Database building only

If you only want to build a BLAST database from your data, you can do this with apytram by omitting the -q option.

./apytram.py -d $db_name -dt $db_type -fq $fastq_name

Final filter

A final filter (using the -fmal, -fid and -flen options) can be applied to the reconstructed sequences to be more stringent than the thresholds used during the iterative process.

The reference sequence of a reconstructed sequence is the sequence from the query file that is the most homologous to it (according to the Exonerate score, see "How does apytram work?").

  • If you want to keep only reconstructed sequences whose length is greater than X percent of the reference sequence length:
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X

  • If you want to keep only reconstructed sequences whose identity percentage with the reference sequence, over the whole alignment length, is greater than Y percent:
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fid Y

  • If you want to keep only reconstructed sequences that align with the reference over a length greater than Z percent of the reference sequence length:
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fmal Z

  • If you want to combine all these options, it's possible:
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X -fid Y -fmal Z
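The three thresholds can be read as three independent tests that a reconstructed sequence must all pass. The sketch below illustrates that reading. It is not taken from apytram: `passes_final_filter` and its arguments are hypothetical, and the strict comparison (`>`) is an assumption based on the wording above.

```python
def passes_final_filter(contig_length, alignment_length, identity,
                        reference_length, flen=0.0, fid=0.0, fmal=0.0):
    """Return True if a reconstructed sequence passes all three thresholds.

    identity, flen, fid and fmal are percentages; lengths are in bases.
    """
    length_pct = 100.0 * contig_length / reference_length      # -flen
    aligned_pct = 100.0 * alignment_length / reference_length  # -fmal
    return length_pct > flen and identity > fid and aligned_pct > fmal


# A 900-base contig aligned over 800 bases of a 1000-base reference
# with 95% identity, filtered with X=80, Y=90 and Z=70.
print(passes_final_filter(900, 800, 95.0, 1000, flen=80, fid=90, fmal=70))
```

Raising Z to 85 in this example would reject the contig, since only 80 percent of the reference length is covered by the alignment.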