TSEBRA-Experiments

This repository contains all data and scripts used for the experiments of the ToDo: [TSEBRA Paper].

Both experiments require that TSEBRA has been downloaded and that its bin folder is added to $PATH, e.g.:

git clone https://github.com/LarsGab/TSEBRA
export PATH="$(pwd)/TSEBRA/bin:$PATH"

Experiment 1 (Comparison to BRAKER1 and BRAKER2)

The first experiment was carried out with all species listed in species.tab. It demonstrates the increase of accuarcies of TSEBRA compared to BRAKER1 and BRAKER2.

Before you start

Choose a species and replace "Enter species" with a species name from species.tab, e.g. "Bombus_terrestris".

species="Enter_species"
species_dir=$(pwd)"/$species"

You have to choose a level of exclusion for the protein database of the BRAKER2 run. We used the levels 'species_excluded', 'family_excluded', 'order_excluded' for the species listed in model_species.tab. For all other species we only used 'order_excluded'.

### Choose a test level and remove the '#'
### Only choose species_excluded or family_excluded if your species is listed in model_species.tab
#level="species_excluded"
#level="family_excluded"
#level="order_excluded"

braker1_dir="${species_dir}/braker1/"
braker2_dir="${species_dir}/braker2/${level}"

BRAKER1¹ and BRAKER2²

You can use the BRAKER1 and BRAKER2 results prepared at ToDo: [Link to Webserver]:

wget link/to/webserver/$species.tar.gz
tar -xzvf $species.tar.gz

Alternatively, you can create your own BRAKER runs, if you want to do the experiment from scratch.

Then you have to

install BRAKER v2.1.5,
prepare genome and annotation as described in EukSpecies-BRAKER2,
prepare RNA-seq hints and protein data as described in BRAKER2-exp into $species_dir.

You have to run BRAKER1 only once per species!

Run BRAKER1:

braker.pl --genome=${species_dir}/data/genome.fasta.masked --hints=${species_dir}/varus/varus.gff --softmasking --species=${species}_braker1 --workingdir=$braker1_dir

Run BRAKER2:

mkdir -p $braker2_dir
braker.pl --genome=${species_dir}/data/genome.fasta.masked --prot_seq=${species_dir}/data/${level}.fa --softmasking --species=${species}_${level} --epmode --prg=ph --nocleanup --AUGUSTUS_ab_initio --workingdir=$braker2_dir

TSEBRA

Ensure that the transcript and gene IDs of the BRAKER prediction files are in order:

fix_gtf_ids.py --gtf ${braker1_dir}braker.gtf --out ${braker1_dir}/braker_fixed.gtf
fix_gtf_ids.py --gtf ${braker2_dir}/braker.gtf --out ${braker2_dir}/braker_fixed.gtf

Run TSEBRA:

tsebra_default=${species_dir}/tsebra_default/${level}/
mkdir -p $tsebra_default
tsebra.py -g ${braker1_dir}/braker_fixed.gtf,${braker2_dir}/braker_fixed.gtf -c ${species_dir}/../config/default.cfg -e ${braker1_dir}/hintsfile.gff,${braker2_dir}/hintsfile.gff -o ${tsebra_default}/tsebra_default.gtf

Evaluation

Compute F1-score, Sensitivity, Specificity:

eval_exp1.py --species_dir $species_dir --test_level $level

The TSEBRA and evaluation results are located at $species_dir/tsebra_default/$level/.

Experiment 2 (Comparison to EVidenceModeler³)

The second experiment was carried out with all model species listed in model_species.tab and the following manual only works for these species. In this experiment we compared TSEBRA with EVidenceModeler (EVM) [cite].

Before you start

For this experiment you need to

perform Experiment 1,
install EVidenceModeler.

If you haven't done it for the first Experiment:

prepare genome and annotation as described in EukSpecies-BRAKER2
install TSEBRA

Choose a species and replace "Enter species" with a species name from model_species.tab, e.g. "Drosophila_melanogaster".

species="Enter_species"
species_dir=$(pwd)"/$species"

Choose a level of exclusion for the protein database of the BRAKER2 run and remove the corresponding '#'.

### Choose a test level and remove the '#':
#level="species_excluded"
#level="family_excluded"
#level="order_excluded"

Download the prepared files from ToDo[Enter link to webserver], if you didn't download them for the first experiment.

wget link/to/webserver/$species.tar.gz
tar -xzvf $species.tar.gz ${species}/EVM/ -C ${species_dir}

PASA⁴

You can use the PASA results we have prepared and copy them to your $species_dir. They are already in your $species_dir and you do not need to run following commands, if you used the prepared files in the experiment 1.

Otherwise:

tar -xzvf $species.tar.gz $species/pasa/ -C $species_dir

Alternatively, you can create your own PASA run. Then, you have to make your own VARUS run from scratch as decribed in BRAKER2-exp into $species_dir/varus/ and you have to install:

Add them to your $PATH variable.

Extract RNA-read sequences from VARUS.bam

samtools fasta ${species_dir}/varus/VARUS.bam > ${species_dir}/varus/VARUS.fasta

Run Trinity⁵ (you can change the max_memory and CPU to suit your system):

Trinity --seqType fa --max_memory 4G --CPU 4 --single ${species_dir}/varus/VARUS.fasta --run_as_paired --output ${species_dir}/trinity/

Run PASA:

mkdir ${species_dir}/pasa/
cp ${species_dir}/../config/alignAssembly.config ${species_dir}/pasa/
sed -i "s,pathtoyourpasadir,$species_dir/pasa,g" ${species_dir}/pasa/alignAssembly.config
Launch_PASA_pipeline.pl -c ${species_dir}/pasa/alignAssembly.config -C -R -g ${species_dir}/genome/genome.fasta.masked --ALIGNERS blat,gmap -t ${species_dir}/trinity/Trinity.fasta

Partition

Enter here the absolute path to the directory where EVidenceModeler is installed

evm_path="ENTER EVM PATH"

Partiton and prepare all data for EVM, TSEBRA and their evaluation:

partition.py --species_dir $species_dir --test_level $level --evm_path $evm_path --out ${species_dir}/EVM/${level}/

If you want to reconstruct the results from ToDo: [PAPER] then use the provided partition test set:

sed -i "s,pathtoyoutpartitions,$species_dir/EVM/$level/partitions/,g" ${species_dir}/EVM/${level}/partitions/part_test.lst
sed -i "s,pathtoyoutpartitions,$species_dir/EVM/$level/partitions/,g" ${species_dir}/EVM/${level}/partitions/part_train.lst

Or you can sample 90% of the partions as the test set:

sample_partitions.py --partition_dir ${species_dir}/EVM/${level}/partitions/ --seed ${species_dir}/EVM/${level}/seed_value.out

EVM

Run EVM for all test partitions. (adjust the number of threads to fit your system)

runEVM.py --species_dir $species_dir --test_level $level --evm_path $evm_path --threads 4

TSEBRA

Run TSEBRA for all test partitions. (adjust the number of threads to fit your system)

runTSEBRA.py --species_dir $species_dir --test_level $level --threads 4

Evaluation

Evaluate the test partitions. (adjust the number of threads to fit your system)

eval_exp2.py --species_dir $species_dir --test_level $level --threads 4

The results are located at $species_dir/EVM/$level/evaluation/.

Summary of all Results

Enter here the path to the directory containing the folders for the species for which you have performed the experiments.

parent_dir="Enter path to parent dir"

Create table with all available results.

eval_summary.py --parent_dir $parent_dir

Each row contains the result for a species and test level. A row contains the result for the second experiment if results for both experiments are present. You can find the table in $parent_dir/evaluation/.

You can plot the average Sensitivity and Specificity for all species in $parent_dir for gene, transcript and CDS level from the second experiment. This requires that matplotlib is installed, e.g. with :

pip install matplotlib

Create plot:

eval_summary.py --parent_dir $parent_dir

You can find the plot at $parent_dir/evaluation/plot_exp2.png.

Licence

All source code, i.e. bin/*.py and bin/*.pl are under the Artistic License (see https://opensource.org/licenses/Artistic-2.0).

References

[1] Hoff, Katharina J, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised Rna-Seq-Based Genome Annotation with Genemark-et and Augustus.” Bioinformatics 32 (5). Oxford University Press: 767--69.↑

[2] Tomas Bruna, Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke and Mark Borodvsky. 2021. “BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics 3(1):lqaa108.↑

[3] Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., ... & Wortman, J. R. 2008. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology, 9(1), 1-22.↑

[4] Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. 2003 Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.↑

[5] Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., ... & Regev, A. 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), 644.↑

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TSEBRA-Experiments

Experiment 1 (Comparison to BRAKER1 and BRAKER2)

Before you start

BRAKER1¹ and BRAKER2²

TSEBRA

Evaluation

Experiment 2 (Comparison to EVidenceModeler³)

Before you start

PASA⁴

Partition

EVM

TSEBRA

Evaluation

Summary of all Results

Licence

References

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
bin		bin
config		config
README.md		README.md
model_species.tab		model_species.tab
species.tab		species.tab

LarsGab/TSEBRA-experiments

Folders and files

Latest commit

History

Repository files navigation

TSEBRA-Experiments

Experiment 1 (Comparison to BRAKER1 and BRAKER2)

Before you start

BRAKER11 and BRAKER22

TSEBRA

Evaluation

Experiment 2 (Comparison to EVidenceModeler3)

Before you start

PASA4

Partition

EVM

TSEBRA

Evaluation

Summary of all Results

Licence

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

BRAKER1¹ and BRAKER2²

Experiment 2 (Comparison to EVidenceModeler³)

PASA⁴

Packages