Skip to content

How does apytram work?

CarineRey edited this page Jan 8, 2016 · 16 revisions

How does apytram work?

Principle

Summary

apytram allows the assembly of sequences from RNA-seq data (paired-end or single-end) using one or more reference homologous sequences. The reference sequences can come from the species of interest or from a different species.

For this, the program will need:

  • a database: usually your RNA-Seq reads, formatted as a BLAST database (see Database building section)
  • a query file in fasta format, with the homologous sequence(s) of your gene(s) of interest

A basic run consists in (see Iterative process section for more detail):

  1. aligning the query sequence(s) on the database

  2. keeping the reads that align with the query sequence(s)

  3. assembling those reads into contigs

  4. using the contigs as the new reference for the next iteration (perform steps 1 to 3 with this new reference: obtention of longer contigs)

  5. comparing contigs from iteration i and iteration i-1

  6. deciding if the assembly is better and if it's worth starting a new iteration (see the Criteria to stop the iterative process section)

Database building

The RNA-seq data (a fastq or fasta file), given by the -fq or -fa option, is formatted into a BLAST database, whose name is given by the -d option.

The data type, paired or single end, must be given by the -dt option. If data is paired-end, all reads (1 and 2) must be concatenated in a single file. WARNING: Paired read names must end with 1 or 2. Reads contained in this file will all be used, so the file must have already been cleaned up.

Note that if your data is in fastq format, it will be converted into a fasta file. This conversion can take some minutes.

As the database building step is time consuming, if a BLAST database is already present, the database building will be skipped.

Iterative process

Sequences in the query file (-q option) will serve as the first reference sequences. The possibility to use a multi-reference query file is still in testing.

A classical iteration

  1. Reads recruting (BLAST):

    Bait sequences are used to recrute homologous reads by BLAST. The -e option allows fixing the evalue threshold (default: 0.001). If the data is paired-end, all paired reads are added to the read list.

    Names (.txt format) and sequences (.fasta format) of all these reads are accessible in temporary files if the -tmp option is used.

  2. Reads assembly (Trinity):

    All reads found by BLAST (present in the temporary file) are assembled de novo by Trinity with default parameters. Note that If the data contains paired-end reads, Trinity takes reads as paired reads.

  3. Quality and filtering of reconstructed contigs (Exonerate):

    Exonerate is used to align each assembled contig to the reference. The length of the contig, the length of the alignment with the reference, the percentage of identity and the alignment score are collected. Reconstructed sequences are filtered according to the options -id, -mal and -len. Contigs that pass all filters are used as reference for the next iteration.

  4. Comparison with previous iteration (Exonerate):

    Exonerate is used to compare the reconstructed sequences of iteration n+1 with iteration n. If the quality of the assembled seqeunce is not improved, the process is stopped.

  5. Coverage calculation (Mafft):

    So as to estimate the overall quality of the contig assembly, apytram calculates 2 coverage values:

    • a Strictcoverage that represents the percentage of the query sites with a homologous site in the reconstructed sequence(s)
    • a Largecoverage that represents the percentage of sites in the alignment with at least one representative in the reconstructed sequence(s) divided by the length of the reference so by definition it can be superior to 100. Note that if there is not only one reference sequence, the first sequence in the reference file is took as reference for the coverage counter but all references are aligned.

    Figure to explain.

At the end of an iteration, reconstructed sequence(s) become the reference sequences for the next iteration.

Criteria to stop the iterative process

If one of these criteria is completed during an iteration, the iterative process will stop. The following criteria are implemented as default settings :

  • The number of max iteration (-i option) is reached.
  • Recruted reads are the same as in the previous iteration.
  • Reconstructed sequences at the end of an iteration are almost the same as after the previous iteration. Almost means each sequence of an iteration has a corresponding sequence in the previous iteration with at least 99% identity and 98% of its length. These threshold are by default and can not be changed. This criteria allows stopping iterations when Trinity assemblies differ of few bases.
  • The number of reconstructed sequences has not changed AND the total length, score and LargeCoverage of all reconstructed sequences have not been improved. The use of the Largecoverage value in this step allows to keep iterating if the UTR in 5' and 3' are getting longer, even if the coding sequence does not.

If the --required_coverage option is used, the iterating process will stop if the Strictcoverage is superior to the Required_coverage.

N.B.:None of these criteria is applied if the --finish_all_iter option is used.

The last criteria which can stop the iterative process is a time limit given by the -time_max option (by default 7200 seconds, but this can be changed). If this time limit is reached, no new iteration will begin even with the --finish_all_iteration option. Please note that this means that a job can thus spend more than the -time_max limit if the database building and the last iteration last more than -time_max setting.

Final Filter

A final filter (-fmal, -fid and -flen options) can be applied on the reconstructed sequences to be more stringent than the threshold used during the iterative process. See the Basic usage page for more information.

Writing outputs file

Finally, several output files are written in your working directory:

  • $OUTPUT_PREFIX.fasta:

    A fasta file containing all reconstructed sequences of the last iteration that pass the final filter.

  • $OUTPUT_PREFIX.best.fasta:

    A fasta file containing the best reconstructed sequences of the last iteration for each references of the query file. Thes sequences must have pass the final filter. The best sequence is determined by the best homology score calculated by Exonerate.

  • $OUTPUT_PREFIX.stats.pdf:

Available if the ```--plot``` option is set. This produces a *.pdf* file containing 2 figures containing global information at each iteration.
  • $OUTPUT_PREFIX.stats.csv:
Available with the ```--plot``` or ```--stats``` options. This produces a *.csv* file containing the raw data used to draw the figures present in the *.stats.pdf* file
  • $OUTPUT_PREFIX.ali.png:

    A figure representing the alignment of all reconstructed sequences with the query (only contigs that pass the final filter are shown). White represents a gap, blue a base of the reference, green an identical base of the reference in a reconstructed sequence, red a different base compared with the reference and yellow a base corresponding to a gap in the reference.

  • $OUTPUT_PREFIX.ali.fasta:

    Alignment that is used to generate the $OUTPUT_PREFIX.ali.png file.