ChemFold provides several methods for computing train-validation-test splits, designed for both ordinary ML and federated ML tasks involving small molecules. Details can be found in the following publication: https://doi.org/10.1186/s13321-021-00576-2. Following methods are included:
- Random split
- Sphere exclusion clustering based split
- Locality sensitive hashing (LSH) based split
- Scaffold trees
ChemFold can be installed by cloning its repository. First we need two dependencies rdkit (at least 2021.03.3) and cython:
conda install -c conda-forge rdkit ## at least 2021.03.3
pip install cython
pip install -e .
python -m chemfold.scaffold_network --infolder <root folder of melloddy_tuner_output> \
--out <folder to write output to> \
--params_file <location of parameters.json>
For the parameters.json
must contain values for ["key"]["key"]
and ["lsh"]["nfolds"]
entries.
This method also works in federated setting and guarantees consistent folding results across multiple parties.
Here is an example executing random split:
python -m chemfold.random_split --infolder <root folder of melloddy_tuner_output> \
--out <folder to write output to> \
--params_file <location of parameters.json>
This method also works in federated setting and guarantees consistent folding results across multiple parties.
python -m chemfold.sphere_exclusion --x chembl_23mini_x.npy \
--dists 0.2 0.4 0.6 \
--out chembl_23mini_clusters.npy
where chembl_23mini_x.npy
is the sparse matrix (either scipy.sparse saved in .npy or matrix market, ending with .mtx
) of descriptor values such as ECFPs. This command does sequence of sphere exclusion clusterings for distances 0.2, 0.4 and 0.6.
Note this implementation of sphere exclusion can be used only for non-federated setting and does not create consistent results if used by different parties.
MELLODDY TUNER (see next section) directly offers a LSH-based fold splitting. The respective fold-file can be found in files_4_ml/T11_fold_vector.npy
.
After calculating the fold using one of the above functions analyses can be run. The performance difference (needs SparseChem (https://github.com/melloddy/SparseChem) runs before), label and data imbalance as well as similarity of chemical substances within one fold can be analyzed.
python -m chemfold.analyze --inp <ChemFold folder including the output files from the folding methods, e.g., sn_scaff_folds.npy> \
--out <folder to write output to> \
--psc <location of SparseChem folder inclusing the models folder>
--baseline_prefix <prefix used in SparseChem that corresponds to the baseline folding methods>
--analysis <which analyses to run, choose from 'all', 'performance', 'imbalance' and 'similarity'>
ChemFold uses as input processed molecular data, for which can be obtained from MELLODDY-TUNER.
Here we outline basic steps to get MELLODDY-TUNER installed and data preparation executed, for in-depth guide you can check MELLODDY-TUNER's manual.
First, clone the git repository from the MELLODDY gitlab repository:
git clone -b release/1.0 https://github.com/melloddy/MELLODDY-TUNER.git
Create your own enviroment from the given yml file with:
cd MELLODDY-TUNER
conda env create -f melloddy_pipeline_env.yml
Activate the environment:
conda activate melloddy_pipeline
You have to install the melloddy-tuner package with pip:
pip install -e .
See specifications in MELLODDY-TUNER README.md under section Prepare Input Files.
python bin/prepare_4_melloddy.py \
--structure_file {path/to/your/structure_file_T2.csv} \
--activity_file {/path/to/your/activity_data_file_T4.csv} \
--weight_table {/path/to/your/weight_table_T3.csv} \
--config_file {/path/to/the/distributed/parameters.json} \
--output_dir {path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {path/to/the/provided/ref_hash.json} \
Example for ChEMBL:
python bin/prepare_4_melloddy.py \
--structure_file {path/to/your/chembl25_T2.csv} \
--activity_file {/path/to/your/chembl25_T4.csv} \
--weight_table {/path/to/your/chembl25_T3.csv} \
--config_file {/tests/structure_preparation_test/example_parameters.json} \
--output_dir {path/to/the/output_directory} \
--run_name {name of your current run} \
--number_cpu {number of CPUs to use} \
--ref_hash {/tests/structure_preparation_test/ref_hash.json}