EVOSeq is an evolutionary algorithm designed to search for common nucleotide motifs in given FASTA sequences. It provides a flexible, modular approach to motif discovery, allowing users to customize key parameters such as population size, mutation probability, and reproduction operators.
An evolutionary algorithm is a type of optimization algorithm inspired by the principles of natural selection. It iteratively improves a population of candidate solutions (motifs) through selection, reproduction, and mutation, favoring solutions that better match the given dataset.
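The loop described above can be sketched in a few lines. This is a minimal illustration, not EVOSeq's actual implementation: the function names (`fitness`, `tournament`, `crossover`, `mutate`, `evolve`) are hypothetical, motifs have a fixed length for simplicity, and the toy fitness just counts how many sequences contain the motif.

```python
import random

ALPHABET = "ACGT"

def fitness(motif, sequences):
    """Toy fitness (assumption): number of sequences containing the motif."""
    return sum(motif in seq for seq in sequences)

def tournament(population, sequences, k, rng):
    """Selection: draw k motifs at random and keep the fittest one."""
    contenders = rng.sample(population, k)
    return max(contenders, key=lambda m: fitness(m, sequences))

def crossover(parent_a, parent_b, point):
    """Reproduction: head of one parent joined to the tail of the other."""
    return parent_a[:point] + parent_b[point:]

def mutate(motif, prob, rng):
    """Mutation: with probability prob, overwrite one position with a random base."""
    if rng.random() >= prob:
        return motif
    i = rng.randrange(len(motif))
    return motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]

def evolve(sequences, population_size=20, motif_len=5, generations=30,
           mutation_prob=0.1, crossover_point=2, tournament_size=3, seed=0):
    """Run the select -> reproduce -> mutate loop and return the best motif."""
    rng = random.Random(seed)
    # Random initial population of fixed-length motifs.
    population = ["".join(rng.choice(ALPHABET) for _ in range(motif_len))
                  for _ in range(population_size)]
    for _ in range(generations):
        next_gen = []
        for _ in range(population_size):
            a = tournament(population, sequences, tournament_size, rng)
            b = tournament(population, sequences, tournament_size, rng)
            child = mutate(crossover(a, b, crossover_point), mutation_prob, rng)
            next_gen.append(child)
        population = next_gen
    return max(population, key=lambda m: fitness(m, sequences))
```

Calling `evolve(["TTACGTAGG", "CCACGTATT", "GACGTAAAC"])` returns one five-letter motif. Which motif it finds depends on the seed and the parameters.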
- Easy to read, flexible, and relatively fast
- User-customizable parameters, including population size, individual motif length, mutation rate, and reproduction operators
- Modular implementation allows easy modifications and extensions
- Supports optional comparative analysis, distinguishing motifs between different organisms
- FASTA file of sequences (positive set) – the algorithm searches for common motifs in these sequences.
- (Optional) FASTA file for comparison (negative set) – used to check how discovered motifs differentiate between datasets.
- (Optional) FASTA file for initial population generation – if not provided, motifs are generated randomly.
- SVG Report – visualizes the distribution of discovered motifs across positive and (optionally) negative sets.
- CSV File – discovered motifs with their scores.
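As a rough illustration of how these inputs and outputs relate, the sketch below parses FASTA text, measures how often each motif occurs in the positive and negative sets, and writes the result as CSV. The helper names and the column names (`motif`, `positive_fraction`, `negative_fraction`) are assumptions made for this example. The layout of the CSV file that EVOSeq actually writes is not specified here.

```python
import csv
import io

def parse_fasta(text):
    """Return the sequences found in FASTA-formatted text (headers are dropped)."""
    sequences, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if chunks:
                sequences.append("".join(chunks))
                chunks = []
        else:
            chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))
    return sequences

def hit_fraction(motif, sequences):
    """Fraction of sequences that contain the motif at least once."""
    if not sequences:
        return 0.0
    return sum(motif in seq for seq in sequences) / len(sequences)

def motif_table(motifs, positive, negative=None):
    """Build CSV text comparing motif occurrence in the two sets."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    header = ["motif", "positive_fraction"]
    if negative is not None:
        header.append("negative_fraction")
    writer.writerow(header)
    for motif in motifs:
        row = [motif, f"{hit_fraction(motif, positive):.2f}"]
        if negative is not None:
            row.append(f"{hit_fraction(motif, negative):.2f}")
        writer.writerow(row)
    return buffer.getvalue()
```

A motif that is frequent in the positive set and rare in the negative set is the kind that distinguishes the two datasets.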
- `population_size` – Number of motifs per generation. Smaller values (<50) speed up execution but may lose diversity, while larger values (≥100) improve diversity but increase computation time.
- `len_range` – Tuple `(min_length, max_length)` defining the motif size. Smaller values may overfit data, while larger values may struggle to generalize. Recommended starting value: `(5,8)`.
- `match_reward` – Score assigned when a motif appears in the dataset (default: `10`). Higher values increase the likelihood of motif survival in future generations.
- `len_reward` – Rewards longer motifs to counterbalance the natural prevalence of shorter motifs. Options: `quadratic` (default), `linear`, `linear_additive`, or custom implementations.
- `mutation_prob` – Probability of single-point mutation per generation (`0-1`, default: `0.1`). Prevents premature convergence by introducing random variation.
- `crossover_point` – Integer value defining the split point for crossover reproduction. Must be smaller than the minimum motif length.
- `tournament_size` – Number of motifs participating in selection tournaments. Smaller values allow weaker motifs to survive; larger values accelerate convergence but may reduce diversity.
- `evaluation_size` – Number of randomly selected sequences used for motif evaluation. Larger values improve accuracy but slow down computation. Default: `100`.
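To show how `match_reward`, `len_reward`, and `evaluation_size` might interact, here is a hedged scoring sketch. The formulas are assumptions chosen for illustration: `quadratic` multiplies the score by the squared motif length and `linear` by the length itself. `linear_additive` is left out because its definition is not given here. The function `score_motif` and the `LEN_REWARDS` table are hypothetical names, not part of EVOSeq's API.

```python
import random

# Assumed length-reward multipliers; the project's real formulas may differ.
LEN_REWARDS = {
    "quadratic": lambda length: length ** 2,
    "linear": lambda length: length,
}

def score_motif(motif, sequences, match_reward=10, len_reward="quadratic",
                evaluation_size=100, rng=None):
    """Score a motif on a random sample of at most evaluation_size sequences."""
    rng = rng or random.Random()
    sample_size = min(evaluation_size, len(sequences))
    sample = rng.sample(sequences, sample_size)
    # Each sampled sequence containing the motif earns match_reward.
    hits = sum(motif in seq for seq in sample)
    # len_reward is either a known option name or a custom callable.
    reward_fn = LEN_REWARDS[len_reward] if isinstance(len_reward, str) else len_reward
    return hits * match_reward * reward_fn(len(motif))

sequences = ["ACGTAC", "TTACG", "GGGG"]
print(score_motif("ACG", sequences))                       # 2 hits * 10 * 3**2 = 180
print(score_motif("ACG", sequences, len_reward="linear"))  # 2 hits * 10 * 3 = 60
```

Under these assumptions, a three-letter motif found in two sequences scores 180 with the quadratic reward and 60 with the linear one, so the quadratic option favors longer motifs more strongly. With fewer sequences than `evaluation_size`, every sequence is evaluated and the score is deterministic.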
- Python 3.12.2
- Required Python libraries (install using `pip`): `pip install -r requirements.txt`
- Clone the repository: `git clone https://github.com/yourusername/EVOSeq.git`, then `cd EVOSeq`
- Modify parameters and provide input files in `main.py`.
- Run the algorithm: `python3 main.py`