Code to accompany our paper, "Challenging reaction prediction models to generalize to novel chemistry" (see also our separate repository for the model code). This repo contains the code to create OOD splits of the Pistachio chemical reaction dataset.
Note that due to its proprietary nature, we do not provide the Pistachio dataset in this repo — just our code for splitting it in different ways.
- Install prerequisites if not already on your system: Conda/Mamba or their lighter-weight equivalents, Miniconda/Micromamba. (We show the commands below for Conda, but you can swap in `mamba` for `conda` etc. as appropriate.)
- Clone the repository:
  ```bash
  git clone git@github.com:john-bradshaw/rxn-splits.git
  ```
- Install the required Python environment using the provided YAML file, e.g. for Conda, inside the cloned directory:
  ```bash
  conda env create -f envs/rxn-splits-env.yml
  ```
- Activate the environment:
  ```bash
  conda activate rxn-splits
  ```
- Add the directory of this repo to your Python path (run from inside the directory this README lives in):
  ```bash
  export PYTHONPATH=${PYTHONPATH}:$(pwd)
  ```
- Configure `config.ini` to point to the correct paths on your machine.
To run our unit tests:
```bash
pytest testing/
```
Note that these scripts are memory hungry: we make no particular effort to be memory-efficient, instead taking advantage of being able to keep the whole reaction dataset in memory on the machines we used!
Scripts should be run from the folder they live in, e.g., before running the data-cleaning scripts, move to that directory:
```bash
cd scripts/dataclean
```
Script to extract, clean, and deduplicate the Pistachio dataset, used to create a data source for forming the subsequent splits.
`01_clean_data.py`: goes through the Pistachio folders, parses the files to extract the key information about each reaction (e.g., year, SMILES, etc.), filters out reactions not meeting certain (user-defined) criteria, deduplicates (at the reagent level), and saves the resulting reactions to a pickle file. Follows the process laid out in §A.1 of our paper.
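To make the reagent-level deduplication step concrete, here is a minimal sketch (not the actual implementation in `01_clean_data.py`; the record format with a `"smiles"` key is an assumption for illustration): two reactions count as duplicates only if their full reactant>reagent>product SMILES strings match, so the same transformation run with different reagents is kept.

```python
# Hypothetical sketch of reagent-level deduplication. Reactions whose full
# reactant>reagent>product SMILES match exactly are duplicates; keep the first.

def deduplicate(reactions):
    """Keep one entry per unique reaction SMILES (reagent level)."""
    seen = set()
    unique = []
    for rxn in reactions:  # each rxn is a dict with a "smiles" key (assumed)
        key = rxn["smiles"]
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique

reactions = [
    {"smiles": "CCO.CC(=O)Cl>N>CCOC(C)=O", "year": 2005},
    {"smiles": "CCO.CC(=O)Cl>N>CCOC(C)=O", "year": 2011},  # duplicate at reagent level
    {"smiles": "CCO.CC(=O)Cl>>CCOC(C)=O", "year": 2011},   # different reagents: kept
]
print(len(deduplicate(reactions)))  # 2
```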
Scripts to create the different OOD splits. These scripts take in config files defining their parameters. We have included example config files in the respective directories, but the path for the cleaned data pickle file (from §2.1 above) needs to be added to these files before they can be used.
Contains scripts to create the author- and document-based splits (§A.2 of our paper).
- `01_create_author_document_split.py`: creates the splits. Takes in a config as argument that defines the splits (i.e., the full command is `python 01_create_author_document_split.py --split_def_path author_document_based_split_defs.json`).
- `02_test_author_document_split.py`: (optional) tests aspects of the created split, e.g., that the splits come from different authors.
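The disjointness property that the test script checks can be sketched as follows (a simplified illustration, not the repo's code; the `"authors"` field and names are made up):

```python
# Illustrative check in the spirit of 02_test_author_document_split.py:
# the train and OOD test splits should share no authors.

def authors_disjoint(train, test):
    train_authors = {a for rxn in train for a in rxn["authors"]}
    test_authors = {a for rxn in test for a in rxn["authors"]}
    return train_authors.isdisjoint(test_authors)

train = [{"authors": ["Smith", "Jones"]}, {"authors": ["Lee"]}]
test = [{"authors": ["Garcia"]}]
print(authors_disjoint(train, test))  # True
```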
Scripts to create the NameRxn splits (§A.4 of our paper).
- `01_create_namerxn_split.py`: creates the splits. Takes in as arguments the path to a config (e.g., `split_def.json`) and the path to a folder defining each NameRxn split to make.
- `02_test_namerxn_split.py`: (optional) tests aspects of the created split, e.g., goes back through and checks that the NameRxn codes used are separate between the train and OOD test sets.
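Conceptually, a NameRxn split holds out whole reaction classes: reactions carrying a held-out class code form the OOD test set and are removed from training. A rough sketch (the record format and codes are placeholders, not real NameRxn identifiers or the repo's implementation):

```python
# Simplified NameRxn-style split: hold out whole reaction classes.

def split_by_namerxn(reactions, held_out_codes):
    train = [r for r in reactions if r["namerxn"] not in held_out_codes]
    ood_test = [r for r in reactions if r["namerxn"] in held_out_codes]
    return train, ood_test

reactions = [
    {"smiles": "A>>B", "namerxn": "x.y.1"},
    {"smiles": "C>>D", "namerxn": "z.w.2"},
    {"smiles": "E>>F", "namerxn": "x.y.1"},
]
train, ood = split_by_namerxn(reactions, held_out_codes={"x.y.1"})
print(len(train), len(ood))  # 1 2
```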
Scripts to create the time-based splits, following the procedure laid out in §A.3 of our paper. Specifically, the script `01_create_tb_split.py` creates the time-based training and test sets, taking in a config file that defines the split parameters, i.e., the dataset sizes and the cutoff years to split on (the full command is `python 01_create_tb_split.py --params_path time_split_def_fixed_training.json`).
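The core idea of a time-based split can be sketched in a few lines (an illustration only, assuming each cleaned record carries a `"year"` field as extracted in the data-cleaning step; the repo's script additionally handles dataset sizes):

```python
# Minimal time-based split: reactions up to the cutoff year go to training,
# later ones to the test set.

def time_split(reactions, cutoff_year):
    train = [r for r in reactions if r["year"] <= cutoff_year]
    test = [r for r in reactions if r["year"] > cutoff_year]
    return train, test

reactions = [{"year": y} for y in (1998, 2005, 2014, 2019)]
train, test = time_split(reactions, cutoff_year=2010)
print(len(train), len(test))  # 2 2
```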
Contains the script `a_obtain_namerxn_split.py` for creating the Buchwald–Hartwig test set from the already-created time-based splits. Follows the process laid out in §A.3 of our paper.
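In spirit, this step filters an existing time-based test set down to reactions of one class. A sketch under assumed data (the codes below are placeholders, not real NameRxn identifiers, and this is not the repo's implementation):

```python
# Carve a single-reaction-class test set out of an existing time-based test set.

def filter_by_codes(test_set, wanted_codes):
    return [r for r in test_set if r["namerxn"] in wanted_codes]

time_based_test = [
    {"smiles": "A>>B", "namerxn": "x.y.1"},
    {"smiles": "C>>D", "namerxn": "z.w.2"},
]
bh_test = filter_by_codes(time_based_test, wanted_codes={"x.y.1"})
print(len(bh_test))  # 1
```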