Creating the correct database format for Transposome

sestaton edited this page Nov 28, 2014 · 13 revisions

The sequence identifiers in the custom database should be in RepBase format. Here is an example:

>GYPSY68-LTR_AG	Gypsy	Anopheles gambiae
tgtggtatgtgagagtagcagtgtgggtgtgcgggcagtaggaggttggacggaataaagagcagacgtg
tgttactgtagttccggtgttttcgatggaatatcaca

That is, ">" followed by the repeat name, a tab, the superfamily name, another tab, and finally the source organism given as "genus species" separated by a single space.
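The header layout described above can be checked with a few lines of code. The following is a minimal sketch in Python, not part of Transposome; the function name `parse_header` is hypothetical and is only meant to show the three tab-separated fields.

```python
def parse_header(line):
    """Split a RepBase-style FASTA header into (name, superfamily, species).

    Hypothetical helper for illustration; it only checks the layout
    described above: '>' + name + TAB + superfamily + TAB + 'genus species'.
    """
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    fields = line[1:].rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    name, superfamily, species = fields
    if " " not in species:
        raise ValueError("species should be 'genus species' separated by a space")
    return name, superfamily, species


name, superfamily, species = parse_header(">GYPSY68-LTR_AG\tGypsy\tAnopheles gambiae")
print(name, superfamily, species, sep=" | ")
```

A header that uses spaces where tabs are expected raises a `ValueError`, which is a quick way to spot a misformatted database before running an analysis.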

The repeat names should follow the convention for naming TEs. For example, here is a Copia element from sunflower:

>RLC-amov-1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...

In the repeat name above, you can see the superfamily (RLC), followed by the family (amov), the specific element in this family (1), and the source location (Contig40_HLAB-P347K0_25829_34256). There is some variation in this format, and you might also see something like:

>RLC_amov_1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...

That is, underscores instead of hyphens in the repeat name. The difference is not important; either form is fine.
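To illustrate how the parts of a repeat name line up, here is a small Python sketch that accepts either separator style. The function `split_repeat_name` and its regular expression are assumptions made for this example (they assume a three-letter code and no hyphen or underscore inside the family name); they are not taken from Transposome.

```python
import re

# Hypothetical pattern: three-letter code, family, element number, then the
# source location. The first two separators may be '-' or '_'.
NAME_RE = re.compile(r"^([A-Za-z]{3})[_-]([^_-]+)[_-]([^_-]+)_(.+)$")


def split_repeat_name(name):
    """Return the components of a TE name as a dict, or None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    code, family, element, location = match.groups()
    return {
        "superfamily_code": code,
        "family": family,
        "element": element,
        "location": location,
    }


hyphens = split_repeat_name("RLC-amov-1_Contig40_HLAB-P347K0_25829_34256")
underscores = split_repeat_name("RLC_amov_1_Contig40_HLAB-P347K0_25829_34256")
print(hyphens == underscores)
```

Both spellings yield the same components, which matches the point that the choice of separator does not matter.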

If, for example, you have a library of custom repeats for your species that you would like to use for annotation, you may format this data using the format_database.pl script.

Formatting the ID to remove certain characters (for example, "*" and "-") is important because RepBase identifiers by default will cause problems with BLAST+ (these changes are handled automatically by the format_database.pl script).
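For a sense of what this kind of cleanup looks like, here is a minimal Python sketch. It is not the logic of format_database.pl, whose exact behavior is not described on this page; the function name `sanitize_id` and the choice to substitute an underscore are assumptions for illustration only.

```python
import re


def sanitize_id(name, replacement="_"):
    """Replace '*' and '-' in an identifier.

    Hypothetical helper: the underscore replacement is an assumption,
    not the documented behavior of format_database.pl.
    """
    return re.sub(r"[*-]", replacement, name)


print(sanitize_id("GYPSY68-LTR_AG"))
```

In practice, the format_database.pl script should be used to prepare the database; the sketch only shows the idea of normalizing identifiers before they reach BLAST+.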