Creating the correct database format for Transposome
The sequence identifiers in the custom database should be in RepBase format. Here is an example:
>GYPSY68-LTR_AG Gypsy Anopheles gambiae
tgtggtatgtgagagtagcagtgtgggtgtgcgggcagtaggaggttggacggaataaagagcagacgtg
tgttactgtagttccggtgttttcgatggaatatcaca
That is, ">" followed by the repeat name, then a tab, then the superfamily name, then another tab, and finally the source organism written as "genus species" separated by a single space.
The repeat names should follow the convention for naming TEs. For example, here is a Copia element from sunflower:
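To make the layout concrete, here is a minimal Python sketch that splits a header line into its three tab-separated fields and checks the "genus species" part. The function name `parse_header` is hypothetical and is not part of Transposome. The strict checks (exactly three fields, exactly two words in the source) are assumptions drawn from the description above.

```python
def parse_header(line):
    """Split a RepBase-style header into (repeat_name, superfamily, genus, species).

    Hypothetical helper for illustration; not part of Transposome.
    """
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    # Drop the leading '>' and any trailing newline, then split on tabs.
    fields = line[1:].rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, found {len(fields)}")
    name, superfamily, source = fields
    # Assumption: the source is exactly "genus species" with one space.
    parts = source.split(" ")
    if len(parts) != 2 or not all(parts):
        raise ValueError("source must be 'genus species' separated by one space")
    return name, superfamily, parts[0], parts[1]


header = ">GYPSY68-LTR_AG\tGypsy\tAnopheles gambiae"
print(parse_header(header))
# ('GYPSY68-LTR_AG', 'Gypsy', 'Anopheles', 'gambiae')
```

A header whose fields are separated by spaces instead of tabs raises a ValueError here, which is a quick way to catch files where the tabs were lost during editing.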
>RLC-amov-1_Contig40_HLAB-P347K0_25829_34256 Copia Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...
In the repeat name above, you can see the superfamily (RLC), followed by the family (amov), the specific element in this family (1), and the source location (Contig40_HLAB-P347K0_25829_34256). This format varies somewhat, so you might also see something like:
>RLC_amov_1_Contig40_HLAB-P347K0_25829_34256 Copia Helianthus annuus
cccttcgatggaagtctgatctttcgatcagtattcctgatccttcgacaggttcaacatcgatagatgat
...
That is, underscores instead of hyphens in the repeat name. The difference is not important; either form is fine.
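As a rough illustration of how the parts of a repeat name fit together, the Python sketch below splits a name into superfamily code, family, element number, and source location, accepting either hyphens or underscores between the first three parts. The names `RepeatName` and `parse_repeat_name` are hypothetical. The sketch assumes the first three components contain no hyphens or underscores themselves.

```python
import re
from typing import NamedTuple


class RepeatName(NamedTuple):
    superfamily_code: str  # e.g. "RLC"
    family: str            # e.g. "amov"
    element: str           # e.g. "1"
    location: str          # e.g. "Contig40_HLAB-P347K0_25829_34256"


def parse_repeat_name(name):
    """Split a repeat name on the first three '-' or '_' separators.

    Hypothetical helper; assumes the first three components contain
    neither '-' nor '_'. Everything after the third separator is kept
    intact as the source location.
    """
    parts = re.split(r"[-_]", name, maxsplit=3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"cannot parse repeat name: {name!r}")
    return RepeatName(*parts)


hyphens = parse_repeat_name("RLC-amov-1_Contig40_HLAB-P347K0_25829_34256")
underscores = parse_repeat_name("RLC_amov_1_Contig40_HLAB-P347K0_25829_34256")
print(hyphens.family, hyphens.location)
# amov Contig40_HLAB-P347K0_25829_34256
print(hyphens == underscores)
# True
```

Both spellings parse to the same components, which matches the point above that either separator is acceptable.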
If, for example, you have a library of custom repeats for your species that you would like to use for annotation, you may format this data using the format_database.pl script.
Formatting the ID to remove certain characters (for example, "*" and "-") is important because default RepBase identifiers will cause problems with BLAST+ (these changes are handled automatically by the format_database.pl script).
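For readers who want to see what this kind of cleanup looks like, here is a small Python sketch. It is not the logic of format_database.pl, whose exact rules are not described here. The function name `sanitize_id`, the default character set, and the optional replacement character are all assumptions made for illustration.

```python
def sanitize_id(identifier, bad_chars="*-", replacement=""):
    """Remove (or replace) characters that may be troublesome in sequence IDs.

    Hypothetical helper; the character set and behavior are assumptions,
    not a description of what format_database.pl does.
    """
    # Build a translation table mapping each unwanted character to the replacement.
    table = str.maketrans({char: replacement for char in bad_chars})
    return identifier.translate(table)


print(sanitize_id("GYPSY68-LTR_AG"))
# GYPSY68LTR_AG
print(sanitize_id("GYPSY68-LTR_AG", replacement="_"))
# GYPSY68_LTR_AG
```

For real data, prefer the format_database.pl script, since it applies whatever rules Transposome actually expects.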