Evolutionary algorithm for motif discovery in nucleotide sequences

karatedava/EVOSeq

EVOSeq

Overview

EVOSeq is an evolutionary algorithm designed to search for common nucleotide motifs in given FASTA sequences. It provides a flexible, modular approach to motif discovery, allowing users to customize key parameters such as population size, mutation probability, and reproduction operators.

What is an Evolutionary Algorithm?

An evolutionary algorithm is a type of optimization algorithm inspired by the principles of natural selection. It iteratively improves a population of candidate solutions (motifs) through selection, reproduction, and mutation, favoring solutions that better match the given dataset.
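The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not EVOSeq's actual implementation: the function names (`fitness`, `tournament`, `crossover`, `mutate`, `evolve`), the fixed motif length, and the toy fitness (number of sequences containing the motif) are all assumptions made for the example.

```python
import random

ALPHABET = "ACGT"

def fitness(motif, sequences):
    # Toy fitness (assumption): number of sequences that contain the motif.
    return sum(motif in seq for seq in sequences)

def tournament(population, scores, k, rng):
    # Selection: pick k random candidates and keep the best-scoring one.
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]

def crossover(a, b, point):
    # Reproduction: head of one parent joined to the tail of the other.
    return a[:point] + b[point:]

def mutate(motif, prob, rng):
    # Mutation: with probability prob, overwrite one random position.
    if rng.random() < prob:
        i = rng.randrange(len(motif))
        motif = motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]
    return motif

def evolve(sequences, pop_size=30, motif_len=5, generations=40,
           mutation_prob=0.1, tournament_size=3, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(motif_len))
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(m, sequences) for m in population]
        next_generation = []
        for _ in range(pop_size):
            parent_a = tournament(population, scores, tournament_size, rng)
            parent_b = tournament(population, scores, tournament_size, rng)
            child = crossover(parent_a, parent_b, motif_len // 2)
            next_generation.append(mutate(child, mutation_prob, rng))
        population = next_generation
    # Return the best motif of the final generation.
    return max(population, key=lambda m: fitness(m, sequences))

if __name__ == "__main__":
    print(evolve(["AAGATTACA", "GATTACCC", "TTGATTAC"]))
```

Because the search is stochastic, the returned motif depends on the seed and is not guaranteed to be the globally best one.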

Features

  • Easy to read, flexible, and relatively fast
  • User-customizable parameters, including population size, individual motif length, mutation rate, and reproduction operators
  • Modular implementation allows easy modifications and extensions
  • Supports optional comparative analysis, distinguishing motifs between different organisms

Input & Output

Input

  1. FASTA file of sequences (positive set) – the algorithm searches for common motifs in these sequences.
  2. (Optional) FASTA file for comparison (negative set) – used to check how discovered motifs differentiate between datasets.
  3. (Optional) FASTA file for initial population generation – if not provided, motifs are generated randomly.
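To make the three inputs concrete, here is a small sketch of reading a FASTA file and building an initial population. The helper names `read_fasta` and `initial_population` are hypothetical, and drawing seed motifs as random substrings of the provided sequences is an assumption about how a seeding file might be used, not a description of what EVOSeq does.

```python
import random

def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def initial_population(size, len_range, seed_sequences=None, rng=random):
    """Draw motifs from seed sequences if given, otherwise at random."""
    min_len, max_len = len_range
    population = []
    for _ in range(size):
        length = rng.randint(min_len, max_len)
        usable = [s for s in (seed_sequences or []) if len(s) >= length]
        if usable:
            # Assumption: a seeded motif is a random substring of a seed sequence.
            seq = rng.choice(usable)
            start = rng.randrange(len(seq) - length + 1)
            population.append(seq[start:start + length])
        else:
            population.append("".join(rng.choice("ACGT") for _ in range(length)))
    return population
```

The positive and negative sets would each be loaded with a call such as `read_fasta("positive.fasta")`, where the file name is only a placeholder.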

Output

  • SVG Report – visualizes the distribution of discovered motifs across positive and (optionally) negative sets.
  • CSV File – discovered motifs with their scores.
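As an illustration of the CSV output, the sketch below writes motifs and scores with Python's standard `csv` module. The function name `write_motif_csv` and the column names `motif` and `score` are assumptions; the actual file produced by EVOSeq may use a different layout.

```python
import csv

def write_motif_csv(path, scored_motifs):
    """Write a {motif: score} mapping to CSV, best score first."""
    rows = sorted(scored_motifs.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["motif", "score"])  # hypothetical column names
        writer.writerows(rows)
```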

Parameter Description

Key Algorithm Parameters

  • population_size – Number of motifs per generation. Smaller values (<50) speed up execution but may lose diversity, while larger values (≥100) improve diversity but increase computation time.
  • len_range – Tuple (min_length, max_length) defining the allowed range of motif lengths. Shorter motifs may overfit the data, while longer motifs may struggle to generalize. Recommended starting value: (5,8).
  • match_reward – Score assigned when a motif appears in the dataset (default: 10). Higher values increase the likelihood of motif survival in future generations.
  • len_reward – Rewards longer motifs to counterbalance the natural prevalence of shorter motifs. Options: quadratic (default), linear, linear_additive, or custom implementations.
  • mutation_prob – Probability of single-point mutation per generation (0-1, default: 0.1). Prevents premature convergence by introducing random variation.
  • crossover_point – Integer value defining the split point for crossover reproduction. Must be smaller than the minimum motif length.
  • tournament_size – Number of motifs participating in selection tournaments. Smaller values allow weaker motifs to survive; larger values accelerate convergence but may reduce diversity.
  • evaluation_size – Number of randomly selected sequences used for motif evaluation. Larger values improve accuracy but slow down computation. Default: 100.
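The sketch below shows one way the scoring parameters could interact, plus a basic sanity check on the settings. The formulas behind `quadratic`, `linear`, and `linear_additive` are guesses based only on the option names, and `score_motif`, `check_params`, and the `tournament_size <= population_size` rule are hypothetical. Only the constraints that `crossover_point` is smaller than the minimum motif length and that `mutation_prob` lies in 0-1 come from the parameter list above.

```python
import random

# Assumed length-reward variants; the real formulas may differ.
LEN_REWARDS = {
    "quadratic": lambda base, n: base * n ** 2,
    "linear": lambda base, n: base * n,
    "linear_additive": lambda base, n: base + n,
}

def score_motif(motif, sequences, match_reward=10, len_reward="quadratic",
                evaluation_size=100, rng=random):
    """Score a motif on at most evaluation_size randomly chosen sequences."""
    if len(sequences) > evaluation_size:
        sample = rng.sample(sequences, evaluation_size)
    else:
        sample = sequences
    hits = sum(motif in seq for seq in sample)
    if hits == 0:
        return 0
    # Each matching sequence earns match_reward; length then scales the total.
    return LEN_REWARDS[len_reward](hits * match_reward, len(motif))

def check_params(len_range, crossover_point, population_size,
                 tournament_size, mutation_prob):
    """Raise ValueError if the parameter combination is inconsistent."""
    min_len, max_len = len_range
    if not 0 < min_len <= max_len:
        raise ValueError("len_range must satisfy 0 < min_length <= max_length")
    if not 0 < crossover_point < min_len:
        raise ValueError("crossover_point must be smaller than the minimum motif length")
    if not 0 <= mutation_prob <= 1:
        raise ValueError("mutation_prob must lie between 0 and 1")
    if not 0 < tournament_size <= population_size:  # assumed constraint
        raise ValueError("tournament_size must not exceed population_size")
```

For example, with `len_range=(5, 8)` a `crossover_point` of 3 passes the check, while 5 is rejected because a split at or beyond the shortest motif's length would leave one parent contributing nothing.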

Installation

Prerequisites

  • Python 3.12.2
  • Clone the repository:
    git clone https://github.com/yourusername/EVOSeq.git
    cd EVOSeq
  • Required Python libraries (install using pip from the repository root):
    pip install -r requirements.txt

Execution

  1. Modify parameters and provide input files in main.py.
  2. Run the algorithm:
    python3 main.py
