Skip to content

How does apytram work?

CarineRey edited this page Dec 21, 2015 · 16 revisions

How does apytram work?

Database building

The RNA-seq data (a fastq or fasta file), given by the -fq or -fa option, is formatted into a BLAST database, whose name is given by the -d option.

The data type, paired or single end, must be given by the -dt option. If data are paired-end, all reads (1 and 2) must be concatenated in a uniq file. WARNING: Paired read names must end with 1 or 2. Reads contained in this file will all be used, so the file must have already been clean up.

If this file is a fastq file, it will be converted into a fasta file. This conversion can take a lot of time (between XX and YY hours on a laptop).

As the database building step is time consuming, if a BLAST database is already present, the database building will be skipped.

Iterative process

Sequences in the query file (-q option) will serve as the first bait sequences. The possibility to use a multi-reference query file is still in testing.

A classical iteration (see Figure XX)

  • Reads recruting (Blast):

    Bait sequences are used to recrute homologous reads by BLAST. The -e option allows fixing the evalue threshold. If the data is paired-end (-dt option), all paired reads are added to the read list.

    Sequences of all these reads are put in a temporary file which is accessible if the -tmp option is used.

  • Read assembly (Trinity):

    All reads present in this temporary file are assembled (de novo) by Trinity with default parameters. If the data contains paired-end reads, Trinity takes reads as paired reads.

  • Reconstructed sequences quality (Exonerate):

    Exonerate is used to align each sequence to the reference. The length of the sequence, the length of the alignment, the percentage of identity, the alignment score are collected.

  • Reconstructed sequence filtering:

    Reconstructed sequences are filtered according to the option -id -mal -len. Sequences which pass all filters are used as bait sequences for the next iteration.

  • Comparison with previous iteration (Exonerate):

    Exonerate is used to find the "parent" reconstructed sequences from the previous iteration for each reconstructed sequences and check they are really different.

  • Coverage calculation (Mafft):

apytram calculates 2 coverage values, a Strictcoverage that means the percentage of sites of the query with a homologous site in the reconstructed sequences, and a Largecoverage that means the percentage of sites in the alignment with at least one representative in the reconstructed sequences divided by the length of the reference so by definition it can be superior to 100.

If there is not only one reference sequence, the first sequence in the reference file is took as reference for the coverage counter but all references are aligned.

![Figure to explain.](https://github.com/CarineRey/apytram/blob/master/Documentation/CoverageExplanation.png?raw=true "Coverage explain")

At the end of an iteration, reconstructed sequences become the bait sequences for the next iteration.

Criteria to stop the iterative process

If one of these criteria is completed during an iteration, the iterative process will stop. The following criteria are implemented as default settings :

  • The number of max iteration (-i option) is attained.
  • Recruted reads are the same as during the previous iteration.
  • Reconstructed sequences at the end of an iteration are almost the same as after the previous iteration. That means each sequence of an iteration has a corresponding sequence in the previous iteration with at least 99% identity and 98% of its length.
  • The number of reconstructed sequences has not changed AND the total length, score and LargeCoverage of all reconstructed sequences have not been improved.

If the --required_coverage option is used, the iterating process will stop if the Strict coverage is superior to the Required_coverage.

All these criteria are not applied if the --finish_all_iteration is used.

The last criteria which can stop the iterative process is a time limit given by -time_max (by default 7200 seconds). If this limit is attained, no new iteration begin even with the --finish_all_iteration option. So a job can spent more than this limit if the first iteration and the database building is longer.

Final Filter

A final filter (-fmal -fid -flen option) can be applied on the reconstructed sequences to be more stringent than the threshold used during the iterative process. See the "basic use" rubric to more information.

Writing outputs file

Finally, output files are writen:

  • $OUTPUT_PREFIX.fasta:

    A fasta file containing all reconstructed sequences of the last iteration which pass the final filter.

  • $OUTPUT_PREFIX.best.fasta:

    A fasta file containing the reconstructed sequences of the last iteration with the best scores, and which pass the final filter.

  • $OUTPUT_PREFIX.stats.pdf:

available if --plot option. A pdf containing 2 figures to know global information at each iteration.
  • $OUTPUT_PREFIX.stats.csv:

    available if --plot or --stats options. A csv file containing the raw data needed to draw the figures present in the .stats.pdf

  • $OUTPUT_PREFIX.ali.png:

    A figure representing the alignment of all reconstructed sequences which pass the final filter on the query. White represents a gap, blue a base of the reference, green an identical base of the reference in a reconstructed sequence, red a different base of the reference and yellow a base corresponding to a gap in the reference.