--> Migration to the newer version of OpenNMT is in progress. See the "reborn" branch.
This is the code for the paper Reagent Prediction with a Molecular Transformer Improves Reaction Data Quality.
The repository is effectively a fork of the Molecular Transformer.
Idea:
- Train a transformer to predict reagents for organic reactions in the way of SMILES-to-SMILES translation.
- Infer missing reagents for some reactions in the training set.
- Train a transformer for reaction product prediction on the dataset with improved reagents.
onmt
is a directory with the OpenNMT code.
preprocess.py
, train.py
, translate.py
, score_predictions.py
are files used in the Molecular Transformer code.
src
folder contains preprocessing for reagent prediction.
prepare_data.py
is the main script that preprocesses USPTO for reagents prediction with MT.
environment.yml
is the conda environment specification.
The entire USPTO data used to assemble the training set for reagent prediction can be downloaded here.
The training set for reagents prediction was obtained from it using the prepare_data.py
script. It does not overlap with the USPTO MIT test set. The tokenized data can be downloaded here.
The tokenized data for product prediction is stored here. For the description of these data, please refer to the README of the original Molecular Transformer.
The data for product prediction with altered reagents can be downloaded here.
-
Create a conda environment from the specification file:
conda env create -f environment.yml conda activate reagents_pred
Install pytorch separately:
wget https://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl pip install torch-0.4.1-cp36-cp36m-linux_x86_64.whl rm torch-0.4.1-cp36-cp36m-linux_x86_64.whl pip install torchtext==0.3.1
-
Download the datasets and put them in the
data/tokenized
directory. -
Train a reagents prediction model: First, preprocess the data for an OpenNMT model:
python3 preprocess.py -train_src data/tokenized/${DATASET_NAME}/src-train.txt \ -train_tgt data/tokenized/${DATASET_NAME}/tgt-train.txt \ -valid_src data/tokenized/${DATASET_NAME}/src-val.txt \ -valid_tgt data/tokenized/${DATASET_NAME}/tgt-val.txt \ -save_data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \ -src_seq_length 1000 -tgt_seq_length 1000 \ -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab
Then, run a model:
python3 train.py -data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \ -save_model experiments/checkpoints/${DATASET_NAME}/${DATASET_NAME}_model \ -seed 42 -gpu_ranks 0 -save_checkpoint_steps 10000 -keep_checkpoint 20 \ -train_steps 500000 -param_init 0 -param_init_glorot -max_generator_batches 32 \ -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0 -accum_count 4 \ -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \ -learning_rate 2 -label_smoothing 0.0 -report_every 10 \ -layers 4 -rnn_size 256 -word_vec_size 256 -encoder_type transformer -decoder_type transformer \ -dropout 0.1 -position_encoding -share_embeddings \ -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \ -heads 8 -transformer_ff 2048 -tensorboard
-
Train a basline product prediction model:
Train a Molecular Transformer on, say,MIT_separated
data. For this, run thepreprocess.py
script and thetrain.py
script
as shown above but withDATASET_NAME=MIT_separated
. -
Use a trained reagent model to improve reagents in a dataset for product prediction.
The scriptreagent_substitution.py
uses a reagents prediction model to change reagents in data which is the input
to a product prediction model. To change the reagents in, say,MIT_separated
, run the script as follows:python3 reagent_substitution.py --data_dir data/tokenized/MIT_separated \ --reagent_model <MODEL_NAME> \ --reagent_model_vocab <MODEL_SRC_VOCAB> \ --beam_size 5 --gpu 0
MODEL_NAME
may be stored, inexperiments/checkpoints/
.MODEL_SRC_VOCAB
(a .json file )may be stored indata/vocabs/
.
Or download the final data here. -
Train product prediction models on datasets cleaned by a reagents prediction model like in step 5.
The trained reagent and product models in the forms of .pt
files are stored here.
To make predictions for reactions supplied in a .txt file as SMILES, use the following script:
```bash
python3 translate.py -model <PATH TO THE MODEL WEIGHTS> \
-src <PATH TO THE TEST REACTIONS WITHOUT REAGENTS> \
-output <PATH TO THE .TXT FILE WHERE THE PREDICTIONS WILL BE STORED> \
-batch_size 64 -replace_unk -max_length 200 -fast -beam_size 5 -n_best 5 -gpu <GPU ID>
```
The supplied reactions should be tokenized with tokens separated by spaces, like in files produced by prepare_data.py
.
Paper:
@article{andronov_voinarovska_andronova_wand_clevert_schmidhuber_2022,
author ="Andronov, Mikhail and Voinarovska, Varvara and Andronova, Natalia and Wand, Michael and Clevert, Djork-Arné and Schmidhuber, Jürgen",
title ="Reagent prediction with a molecular transformer improves reaction data quality",
journal ="Chem. Sci.",
year ="2023",
volume ="14",
issue ="12",
pages ="3235-3246",
publisher ="The Royal Society of Chemistry",
doi ="10.1039/D2SC06798F",
url ="http://dx.doi.org/10.1039/D2SC06798F"
The underlying framework:
@inproceedings{opennmt,
author = {Guillaume Klein and
Yoon Kim and
Yuntian Deng and
Jean Senellart and
Alexander M. Rush},
title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
booktitle = {Proc. ACL},
year = {2017},
url = {https://doi.org/10.18653/v1/P17-4012},
doi = {10.18653/v1/P17-4012}
}