Creating the correct database format for Transposome

sestaton edited this page Nov 28, 2014 · 13 revisions

The sequence identifiers in the custom database should be in RepBase format. Here is an example:

>GYPSY68-LTR_AG	Gypsy	Anopheles gambiae

That is, ">" followed by the repeat name, a tab, the superfamily name, another tab, and finally the source as "genus species" separated by a space.
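The layout described above can be checked with a few lines of code. The following is a minimal Python sketch, not part of Transposome; the names `RepeatHeader` and `parse_header` are hypothetical, and the check on the source field assumes at least a genus and a species word.

```python
from typing import NamedTuple


class RepeatHeader(NamedTuple):
    """The three tab-separated fields of a header line (hypothetical helper type)."""
    name: str
    superfamily: str
    species: str


def parse_header(line: str) -> RepeatHeader:
    """Split a header into repeat name, superfamily, and source species."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    # Drop the leading '>' and any trailing newline, then split on tabs.
    fields = line[1:].rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, found {len(fields)}")
    name, superfamily, species = fields
    if not name or not superfamily:
        raise ValueError("repeat name and superfamily must not be empty")
    # Assumption: the source is written as 'genus species', so at least two words.
    if len(species.split()) < 2:
        raise ValueError("source should be 'genus species' separated by a space")
    return RepeatHeader(name, superfamily, species)


header = parse_header(">GYPSY68-LTR_AG\tGypsy\tAnopheles gambiae")
print(header.name, header.superfamily, header.species, sep=" | ")
```

A header that uses spaces instead of tabs between the three fields raises a `ValueError`, which is the most common formatting mistake this kind of check is meant to catch.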

The repeat names should follow the convention for naming TEs. For example, here is a Copia element from sunflower:

>RLC-amov-1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus

In the repeat name above, you can see the superfamily (RLC), followed by the family (amov), the specific element in this family (1), and the source location (Contig40_HLAB-P347K0_25829_34256). This format varies somewhat, and you might also see something like:

>RLC_amov_1_Contig40_HLAB-P347K0_25829_34256	Copia	Helianthus annuus

That is, underscores instead of hyphens in the repeat name. The difference is not important; either form is fine.
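To make the parts of a repeat name concrete, here is a small Python sketch that splits a name into its components and accepts either separator. It is an illustration only: `NAME_PATTERN` and `split_repeat_name` are hypothetical names, and the pattern assumes a three-letter superfamily code, a family name that itself contains no hyphen or underscore, and a numeric element identifier.

```python
import re

# Assumed layout: CODE, family, element number, source location,
# with '-' or '_' accepted between the first four parts.
NAME_PATTERN = re.compile(
    r"^(?P<code>[A-Za-z]{3})[-_]"
    r"(?P<family>[^-_]+)[-_]"
    r"(?P<element>\d+)[-_]"
    r"(?P<source>.+)$"
)


def split_repeat_name(name: str) -> dict:
    """Return the superfamily code, family, element, and source location."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed convention: {name!r}")
    return match.groupdict()


for name in (
    "RLC-amov-1_Contig40_HLAB-P347K0_25829_34256",
    "RLC_amov_1_Contig40_HLAB-P347K0_25829_34256",
):
    print(split_repeat_name(name))
```

Both spellings yield the same four components, which matches the point above that hyphens and underscores are interchangeable in the repeat name.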

If, for example, you have a library of custom repeats for your species that you would like to use for annotation, you can format it using the script.

Formatting the ID to remove certain characters (for example, "*" and "-") is important because RepBase identifiers will, by default, cause problems with BLAST+ (these changes are handled automatically by the script).
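For readers who want to see what such a cleanup might look like, here is a rough Python sketch. It is not the script referred to above and may not match what that script does: the function names are hypothetical, and replacing each problem character with an underscore (rather than deleting it) is an assumption chosen because underscores are acceptable in repeat names.

```python
def sanitize_name(name: str, bad_chars: str = "*-", replacement: str = "_") -> str:
    """Replace each character in bad_chars with the replacement string."""
    # Assumption: '*' and '-' are the characters to clean; adjust as needed.
    table = str.maketrans({ch: replacement for ch in bad_chars})
    return name.translate(table)


def sanitize_header(line: str, bad_chars: str = "*-", replacement: str = "_") -> str:
    """Clean only the repeat name, leaving the superfamily and species fields alone."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    # Everything before the first tab is the repeat name.
    name, sep, rest = line[1:].partition("\t")
    return ">" + sanitize_name(name, bad_chars, replacement) + sep + rest


print(sanitize_header(">GYPSY68-LTR_AG\tGypsy\tAnopheles gambiae"))
```

Passing `replacement=""` deletes the characters instead of substituting them. Whichever choice is made, it is worth checking that the cleaned names remain unique within the database.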