Skip to content

Protein engineering with large language models

License

Notifications You must be signed in to change notification settings

soldatmat/PELLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Protein Engineering with Large Language Models

This repository contains source code developed as part of my diploma thesis available at https://dspace.cvut.cz/handle/10467/115759.

  • The three proposed MLDE methods are implemented in code/de.

  • The two tradintional DE benchmarks are implemented in code/single_mutation_walk and code/recombine_mutation.

  • Folder code/llm contains code from initial exploratory experiments, which is not essential to any of the methods.

  • Folder code/plot and code/dimred contain code used to generate included graphics.

Installation

To start using the proposed MLDE methods implemented in code/de, please follow these steps:

  1. Install Julia.
  2. Install Conda.
  3. In code/de/setup_pycall.jl set CONDA_PATH to path to the Conda root folder.
  4. In bash run:
cd code/de
source setup.sh

Acknowledgements

All of the proposed MLDE methods require a pre-trained protein language model as an embedding extractor.
If you use the ESM-1b model, cite the original paper.

  • ESM-1b: Alexander Rives et al. “Biological structure and function emerge from scaling unsu- pervised learning to 250 million protein sequences”. In: Proceedings of the National Academy of Sciences 118.15 (2021), e2016239118.

The used datasets are included in data. If you use them, don't forget to cite the original papers.
Correct citations are included with each dataset in a CITE_AS.txt file with corresponding BibTeX template.

  • GB1: Nicholas C Wu et al. “Adaptation in protein fitness landscapes is facilitated by indirect paths”. In: Elife 5 (2016), e16965.

  • PhoQ: Anna I Podgornaia and Michael T Laub. “Pervasive degeneracy and epistasis in a protein-protein interface”. In: Science 347.6222 (2015), pp. 673–677.