Code to accompany our paper, "Challenging reaction prediction models to generalize to novel chemistry" (see also the rxn-splits repository for the dataset-splitting code). This repo contains a sequence-to-sequence [1,2,3] model for reaction prediction built on Hugging Face [4]. Note that this repo makes use of Git LFS for storing large files.
We recommend installing the required Python packages using the environment files in the `envs` directory. Note that we provide an environment for macOS, but this is only to aid with debugging/editing the code and is not suitable for training/evaluating models.
- Install prerequisites if not already on your system: Git LFS and Conda/Mamba (or their lighter-weight equivalents, Miniconda/Micromamba).
- Clone the repository: `git clone git@github.com:john-bradshaw/rxn-lm.git`
- Create the environment from the provided YAML file, e.g. for Conda inside the cloned directory: `conda env create -f envs/conda-linux-env.yml`
- Activate the environment: `conda activate rxn-lm`
- Set environment variables: `source setup.sh` (this needs to be run from the repo root; you can set the environment variables in this file as you see fit).
- Create a wandb (W&B) key file at `wandb_key.txt` in the repo root and add your wandb key.
To run our unit tests: `pytest testing/`
The data preparation scripts are in `scripts/data_prep`. Usually the process of going from a dataset in SMILES line format (i.e., a file where each line is a SMILES string representing a reaction) to a dataset in JSONL format is as follows:
- Clean up the reactions (e.g., remove atom maps) using `01_clean_data.py`.
- Create a vocabulary using `02_create_vocab.py`.
- Convert the dataset to JSONL format using `03_convert_to_jsonl.py`.
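As a rough illustration of the target format, the stdlib-only sketch below converts SMILES reaction lines into one JSON record per line. The `reactants`/`agents`/`products` field names are our own for illustration; the actual schema is defined by `03_convert_to_jsonl.py`.

```python
import json

def reaction_lines_to_jsonl(lines):
    """Turn 'reactants>agents>products' SMILES lines into JSONL records."""
    records = []
    for line in lines:
        reactants, agents, products = line.strip().split(">")
        records.append(json.dumps(
            {"reactants": reactants, "agents": agents, "products": products}
        ))
    return "\n".join(records)  # one JSON object per line

jsonl = reaction_lines_to_jsonl(["CC(=O)Cl.OCC>>CC(=O)OCC"])
```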
Note that if you use our rxn-splits code to produce OOD splits, the files will already be in JSONL format. While we cannot provide the datasets we used in our paper (as they are derived from the proprietary Pistachio dataset), we provide an already processed dataset (derived from USPTO) in `data/processed/USPTO_mixed_augm` as an example.
The training and hyperparameter optimization scripts are in `scripts/training`. The scripts should be executed from within this directory.

To train a single model, run:

`python single_run.py --group_name <group_name> --run_name <run_name> --local_dir_name results --config_pth <config_name>`

The config defines the data location and hyperparameters for the model. For instance, to run on the provided USPTO dataset, you could use `configs/uspto_augm/uspto_stereo_mixed_deftune.json`.
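Configs are plain JSON, so they are easy to inspect or tweak programmatically before a run. A minimal sketch (the keys shown here are hypothetical; consult the files under `configs/` for the real ones):

```python
import json

# Hypothetical config contents -- the real keys are defined by the files in configs/.
cfg = json.loads(
    '{"train_data": "data/processed/USPTO_mixed_augm", "lr": 0.0005, "num_layers": 6}'
)
cfg["num_layers"] = 8  # override a hyperparameter before writing the config back out
```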
Hyperparameter optimization is done using Ray Tune. To use this, first start Ray in one shell:

`ray start --head`

Then run the hyperparameter optimization script:

`python hyp_opt_via_ray_tune.py --experiment_name <experiment name> --num_samples 100 --config_pth <config path> --search_space_config_pth <search space config> --address auto`

For instance, to run on the provided USPTO dataset, you could use `configs/uspto_augm/uspto_stereo_mixed_deftune.json` and `search_spaces/default_search_space_uspto.json` for the `<config path>` and `<search space config>` respectively. You can modify the address passed to Ray as required.
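Conceptually, the search space config defines a set of axes from which Ray Tune draws trial configurations. A stdlib-only sketch of that sampling step (the axes and values below are hypothetical; the real definitions live in `search_spaces/default_search_space_uspto.json`):

```python
import random

# Hypothetical search space -- see the search_spaces/ directory for the real definitions.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.3],
    "num_layers": [4, 6, 8],
}

def sample_trial(space, rng=random):
    """Draw one hyperparameter configuration uniformly from each axis."""
    return {name: rng.choice(values) for name, values in space.items()}

trials = [sample_trial(SEARCH_SPACE) for _ in range(5)]  # cf. --num_samples
```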
Details of the hyperparameter optimization will be shown on Weights and Biases (W&B) (see below), and one can also use the `loadhyp_opt_via_ray_tune.py` script to query the best hyperparameters found.
We use Weights and Biases (W&B) for tracking experiments. If you edit our `setup.sh` file to run this in offline mode, then you can sync the wandb logs later by running:

`wandb sync --include-offline ./*/wandb/offline*`

W&B runs are reported using the run name and are grouped together by experiment/group names.
Once you have trained a model, you can evaluate it using the scripts in `scripts/evaluating`.

To evaluate a model with a single set of weights (possibly derived via averaging), one can use the script:

`python single_eval.py --config_pth <config path> --run_name <eval run name> --save_preds`

where `<config path>` is the path to the evaluation config. This allows you to run the same model over different evaluation datasets.
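Reaction prediction models are typically scored by top-k exact-match accuracy on the predicted product SMILES. A minimal sketch of that metric (our own illustration, not necessarily the exact metric code used by `single_eval.py`):

```python
def top_k_accuracy(predictions, targets, k=1):
    """Fraction of reactions whose true product appears in the top-k predictions.

    `predictions` is a list of ranked candidate lists (canonical SMILES strings);
    `targets` is the list of ground-truth product SMILES.
    """
    hits = sum(target in ranked[:k] for ranked, target in zip(predictions, targets))
    return hits / len(targets)

# Two toy reactions: the first is predicted correctly at rank 1, the second at rank 2.
preds = [["CCO", "CCN"], ["CC(=O)O", "CCO"]]
targets = ["CCO", "CCO"]
```

Note that for exact-match scoring, both predictions and targets should first be put into a canonical SMILES form.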
To enable the evaluation of multiple sets of weights over the same test sets, we provide the script `parallel_eval.py`. This is useful, for example, in evaluating the time-based splits, where models trained on different time cutoffs need to be evaluated on the same held-out 1970s–present day test sets. This script creates the configs used for each of the single evals automatically. `parallel_eval.py` is used as follows (to run 3 evaluations in parallel):

`python parallel_eval.py --num_runs 3 --config_pth <config path> --parallel_run_name <parallel run name> --torch_num_threads_per_run 20 --save_preds`
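The fan-out that `parallel_eval.py` performs can be sketched with the standard library's `concurrent.futures` (the checkpoint names and dummy scoring function below are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def run_single_eval(checkpoint):
    """Stand-in for one single_eval.py invocation; returns (checkpoint, score)."""
    return checkpoint, len(checkpoint)  # dummy "score" for illustration

# Hypothetical checkpoints from models trained with different time cutoffs.
checkpoints = ["cutoff_1990.pt", "cutoff_2000.pt", "cutoff_2010.pt"]
with ThreadPoolExecutor(max_workers=3) as pool:  # cf. --num_runs 3
    results = dict(pool.map(run_single_eval, checkpoints))
```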
If you're looking for a language-based chemical reaction prediction model, you might also be interested in the following repositories:
- Molecular Transformer. Note that the model we use here is very similar, also using an encoder-decoder style transformer language model.
- Graph2SMILES: a graph neural network encoder, SMILES decoder reaction prediction model.
- Chemformer: a BART-style approach (including pretraining on a denoising task, which is not done here) for molecular tasks, including reaction prediction.
[1] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. and Zettlemoyer, L. (2019)
‘BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension’,
arXiv [cs.CL]. Available at: http://arxiv.org/abs/1910.13461.
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017)
‘Attention Is All You Need’, in I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett (eds) Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 5998–6008.
[3] Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C.A., Bekas, C. and Lee, A.A. (2019)
‘Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction’, ACS Central Science, 5(9), pp. 1572–1583.
[4] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M.,
Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M.,
Lhoest, Q. and Rush, A.M. (2019) ‘HuggingFace’s Transformers: State-of-the-art Natural Language Processing’,
arXiv [cs.CL]. Available at: https://doi.org/10.48550/arXiv.1910.03771.