Creating the correct database format for Transposome

Evan Staton edited this page Aug 14, 2013 · 13 revisions

The sequence identifiers in the custom database should be in RepBase format. Here is an example:

>GYPSY68-LTR_AG	Gypsy	Anopheles gambiae
tgtggtatgtgagagtagcagtgtgggtgtgcgggcagtaggaggttggacggaataaagagcagacgtg
tgttactgtagttccggtgttttcgatggaatatcaca

That is, the header line is ">" followed by the repeat name, then a tab, the superfamily name, another tab, and finally the source organism written as "Genus species" with a single space between the two words.
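
The header layout described above can be checked before the database is handed to Transposome. The following is only a minimal sketch: `check_headers` is a hypothetical helper name, not part of Transposome, and it assumes the source organism is always exactly two words.

```shell
# check_headers FILE
# Print every FASTA header that is not of the form
#   >name<TAB>superfamily<TAB>Genus species
# and return a non-zero status if any such header was found.
# Assumption: the organism field is exactly two space-separated words.
check_headers() {
    awk -F'\t' '
        /^>/ {
            if (NF != 3 || $1 == ">" || $2 == "" || $3 !~ /^[^ ]+ [^ ]+$/) {
                printf "line %d: %s\n", NR, $0
                bad++
            }
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Running `check_headers your_repeats.fasta` prints nothing when every header has the three tab-separated fields, and otherwise lists the offending lines with their line numbers.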

The repeat names should follow the convention for naming transposable elements (TEs). For example, here is a Copia element from sunflower:

>RLC-amov-1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...

In the repeat name above, you can see the three-letter superfamily code (RLC), followed by the family (amov), the specific element in this family (1), and the source location (Contig40_HLAB-P347K0_25829_34256). This format varies somewhat, and you might also see something like:

>RLC_amov_1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...

That is, the repeat name uses underscores instead of hyphens. The difference is not important; either form is fine.
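
To make the parts of a repeat name concrete, here is a small sketch that splits a name into its superfamily code, family, element number, and source location, accepting either hyphens or underscores as separators. `split_repeat_name` is a hypothetical helper, and the sketch assumes the code is three uppercase letters and the element number is purely numeric, as in the examples above.

```shell
# split_repeat_name NAME
# Print "code family element source_location" for a repeat name such as
#   RLC-amov-1_Contig40_HLAB-P347K0_25829_34256
# A leading ">" is ignored. Returns non-zero if the name does not fit
# the assumed pattern (three-letter code, family, numeric element, location).
split_repeat_name() {
    printf '%s\n' "$1" | awk '{
        name = $0
        sub(/^>/, "", name)
        n = split(name, part, /[-_]/)
        if (n < 4 || part[1] !~ /^[A-Z][A-Z][A-Z]$/ ||
            part[2] == "" || part[3] !~ /^[0-9]+$/) exit 1
        # The source location is everything after the third separator;
        # it may itself contain hyphens and underscores.
        offset = length(part[1]) + length(part[2]) + length(part[3]) + 4
        printf "%s %s %s %s\n", part[1], part[2], part[3], substr(name, offset)
    }'
}
```

For example, `split_repeat_name 'RLC_amov_1_Contig40_HLAB-P347K0_25829_34256'` prints `RLC amov 1 Contig40_HLAB-P347K0_25829_34256`, while a name such as `GYPSY68-LTR_AG` that does not follow this convention makes the function return a non-zero status.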

If, for example, you have a library of custom LTR retrotransposons for your species that you would like to use for annotation, you can format the sequences with the following script (replace "Helianthus annuus" with your own species):

#!/usr/bin/env perl

use 5.012;
use strict;
use warnings;
use Transposome::SeqIO;

my $usage = "\n$0 infile > outfile\n";
my $infile = shift or die $usage;

my $seqio = Transposome::SeqIO->new( file => $infile );

# Read each record from the open filehandle.
my $seqfh = $seqio->get_fh;
while (my $seq = $seqio->next_seq($seqfh)) {
    if ($seq->get_id =~ /^RLG/) {
        say join "\n", ">".$seq->get_id."\t"."Gypsy"."\t"."Helianthus annuus", $seq->get_seq;
    }
    elsif ($seq->get_id =~ /^RLC/) {
        say join "\n", ">".$seq->get_id."\t"."Copia"."\t"."Helianthus annuus", $seq->get_seq;
    }
    else {
        # should never get here, but the data may be malformed;
        # report on STDERR so the message does not end up in the FASTA output
        say STDERR "\n[ERROR]: ",$seq->get_id," does not seem to match RLG or RLC";
    }
}

The script's output (your_repeats.fasta in this example) can now be combined with the RepBase database as follows:

cat your_repeats.fasta RepBase.fasta | sed 's/-/_/g;s/\*/_/g' > custom_repeat_library.fasta

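The resulting custom_repeat_library.fasta can then be given to Transposome as your repeat database. The sed command is important because RepBase identifiers contain characters, such as the hyphens and asterisks it replaces with underscores, that cause problems with BLAST+.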
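
As a final sanity check on the combined library, the sketch below looks for identifiers that still contain a hyphen or asterisk and for identifiers that occur more than once. The duplicate check is an added assumption rather than something Transposome documents: replacing characters could in principle make two previously distinct names identical. `check_library_ids` is a hypothetical helper name.

```shell
# check_library_ids FILE
# Report repeat identifiers (the first tab-separated field of each header)
# that still contain "-" or "*", or that appear more than once.
# Returns non-zero if anything was reported.
check_library_ids() {
    awk -F'\t' '
        /^>/ {
            if ($1 ~ /[-*]/) { printf "unsafe character: %s\n", $1; bad++ }
            if (seen[$1]++)  { printf "duplicate identifier: %s\n", $1; bad++ }
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Running `check_library_ids custom_repeat_library.fasta` should print nothing if the sed step was applied and all identifiers are unique.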