Spliceformer: Transformer-Based Splice Site Prediction

Spliceformer is a machine learning tool that leverages transformers to detect splice sites in pre-mRNA sequences and estimate the impact of genetic variants on splicing by analyzing long nucleotide sequences (up to 45,000 context length).

This repository provides:

Jupyter Notebooks: Includes code for dataset construction, model training, and evaluation.
Pre-trained Weights: Ready-to-use PyTorch models for splice site prediction and estimating variant effects on splicing.

For more details, please refer to the paper.

Pre-trained Model Overview

The pre-trained weights for Spliceformer-45k and SpliceAI-10k provided in this repository were trained on:

ENSEMBL v87 Annotations: Protein-coding transcripts with support level 1.
RNA-Seq Data:
- Icelandic Cohort: 17,848 whole blood RNA-Seq samples sequenced by deCODE Genetics.
- GTEx V8: 15,201 RNA-Seq samples across 49 human tissues.

Setup

# Step 1: Clone the repository
git clone https://github.com/benniatli/Spliceformer.git
cd Spliceformer

# Step 2: Install required dependencies
# Install dependencies listed in requirements.txt
pip install -r requirements.txt

Colab

Further examples for making predictions with pre-trained weights and predicting the effect of genetic variants are shown in the following colab notebook:

`spliceformer-usage.ipynb` .

Jupyter notebooks

To reproduce the results in the Jupyter notebooks it is neccesary to run them roughly in the following order:

Construct dataset
- construct_Ensemble_datasets.ipynb
Model training and evaluation
- train_transformer_model.ipynb
- train_spliceai_10k_model.ipynb
- train_spliceai_model_gencode_keras.ipynb
- test_spliceai_model_pretrained.ipynb
Fine-tuning on RNA-Seq annotations (only necessary for some of the analysis notebooks)
- train_transformer_model-fine_tuning-GTEX-ICE-Combined-all.ipynb
Analysis
- get_splice_site_prediction_scores.ipynb
- spliceAI_vs_transformer_score_comparison.ipynb
- get_attention_plots.ipynb
- get_umap_plots.ipynb
- plot_transformer_training_loss.ipynb
- novel_splice_transformer.ipynb
- get_sqtl_delta_for_transformer.ipynb

To run step 4 it is only necessary to run train_transformer_model.ipynb and train_spliceai_10k_model.ipynb from step 2, but some of the notebooks require models fine-tuned on RNA-Seq data.

It is also possible to skip step 2 and 3 and use the pre-trained models instead (/Results/PyTorch_Models).

Citing

If you find this work useful in your research, please consider citing:

@article{jonsson2024transformers,
  title={Transformers significantly improve splice site prediction},
  author={J{\'o}nsson, Benedikt A and Halld{\'o}rsson, G{\'\i}sli H and {\'A}rdal, Stein{\th}{\'o}r and R{\"o}gnvaldsson, S{\"o}lvi and Einarsson, Ey{\th}{\'o}r and Sulem, Patrick and Gu{\dh}bjartsson, Dan{\'\i}el F and Melsted, P{\'a}ll and Stef{\'a}nsson, K{\'a}ri and {\'U}lfarsson, Magn{\'u}s {\"O}},
  journal={Communications Biology},
  volume={7},
  number={1},
  pages={1--9},
  year={2024},
  publisher={Nature Publishing Group}
}

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
Code		Code
Results/PyTorch_Models		Results/PyTorch_Models
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spliceformer: Transformer-Based Splice Site Prediction

Pre-trained Model Overview

Setup

Colab

`spliceformer-usage.ipynb` .

Jupyter notebooks

Citing

About

Releases 1

Packages

Languages

License

benniatli/Spliceformer

Folders and files

Latest commit

History

Repository files navigation

Spliceformer: Transformer-Based Splice Site Prediction

Pre-trained Model Overview

Setup

Colab

spliceformer-usage.ipynb .

Jupyter notebooks

Citing

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

`spliceformer-usage.ipynb` .

Packages