
Installation

SparseChem depends on PyTorch, which you have to install first; the other dependencies will be installed together with the package:

pip install -e .
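
If PyTorch is not installed yet, a plain pip install is usually sufficient (pick the wheel that matches your CUDA setup from pytorch.org if you need GPU support):

pip install torch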

ChEMBL Example

First, the data has to be downloaded into examples/chembl:

https://www.esat.kuleuven.be/~aarany/chembl_23_x.mtx
https://www.esat.kuleuven.be/~aarany/chembl_23_y.mtx
https://www.esat.kuleuven.be/~aarany/folding_hier_0.6.npy
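
For example, the files can be fetched from the command line (assuming wget is available), run from the repository root:

wget -P examples/chembl https://www.esat.kuleuven.be/~aarany/chembl_23_x.mtx
wget -P examples/chembl https://www.esat.kuleuven.be/~aarany/chembl_23_y.mtx
wget -P examples/chembl https://www.esat.kuleuven.be/~aarany/folding_hier_0.6.npy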

Then execute training:

cd ./examples/chembl
python train.py

Specifying parameters of the network

Single-layer network with 400 hidden units:

python train.py \
  --x ./chembl_23_x.mtx \
  --y ./chembl_23_y.mtx \
  --folding ./folding_hier_0.6.npy \
  --fold_va 0 \
  --batch_ratio    0.02 \
  --hidden_sizes   400 \
  --last_dropout   0.2 \
  --middle_dropout 0.2 \
  --weight_decay   0.0 \
  --epochs         20 \
  --lr             1e-3 \
  --lr_steps       10 \
  --lr_alpha       0.3

We use 0.2 dropout and no weight decay (regularization). The total number of epochs is 20 and the learning rate is 1e-3. We also add --lr_steps 10, which means that after 10 epochs the learning rate is multiplied by 0.3 (the lr_alpha value).
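
As a quick sanity check, here is a small sketch of the resulting schedule, assuming --lr_steps behaves like PyTorch's MultiStepLR milestones:

lr, lr_alpha, lr_steps = 1e-3, 0.3, [10]
for epoch in range(20):
    decays = sum(epoch >= step for step in lr_steps)  # milestones already passed
    print(epoch, lr * lr_alpha ** decays)             # 1e-3 for epochs 0-9, 3e-4 afterwards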

This should get us to 0.83 average AUC for tasks with 25 positives and 25 negatives.

Two layer network

To get a two-layer network we just pass two values to --hidden_sizes (one per layer).

python train.py \
  --x ./chembl_23_x.mtx \
  --y_class ./chembl_23_y.mtx \
  --folding ./folding_hier_0.6.npy \
  --fold_va 0 \
  --batch_ratio    0.02 \
  --hidden_sizes   400 400 \
  --weight_decay   1e-4 \
  --last_dropout   0.2 \
  --middle_dropout 0.2 \
  --epochs         20 \
  --lr             1e-3 \
  --lr_steps       10 \
  --lr_alpha       0.3

We also modified the weight decay to 1e-4.

AUC calculation

The script uses all data for training but AUCs are calculated only on tasks with enough positive and negative examples, default is 25 each. To instead require at least 50 positives and 50 negatives, add --min_samples_auc 50.

There are a few options to reduce the time spent on AUC calculations (see the example below):

  • --eval_train 0 will turn off AUC calculation for the training set.
  • --eval_frequency 2 will specify that AUCs should be calculated only every 2 epochs (default is 1). If set to -1 the evaluation is only done once at the end of the run.
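
For example, to skip training-set AUCs and evaluate only every second epoch:

python train.py \
  --x ./chembl_23_x.mtx \
  --y_class ./chembl_23_y.mtx \
  --folding ./folding_hier_0.6.npy \
  --hidden_sizes 400 \
  --eval_train 0 \
  --eval_frequency 2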

Input folding

The pipeline also provides an option to fold inputs to a smaller size. For example, adding --fold_inputs 20000 folds the inputs to 20,000 dimensions. This is useful for reducing the model size without hurting performance too much.
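
A minimal sketch of the idea, assuming folding works like feature hashing where column indices are mapped modulo the folded size (this is an illustration, not SparseChem's internal code):

import scipy.sparse

x = scipy.sparse.random(1000, 100000, density=0.001, format="csr")  # toy input
fold_size = 20000

coo = x.tocoo()
x_folded = scipy.sparse.csr_matrix(
    (coo.data, (coo.row, coo.col % fold_size)),  # colliding columns are summed together
    shape=(x.shape[0], fold_size),
)
print(x_folded.shape)  # (1000, 20000)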

Task weighting

SparseChem also supports task weighting. This can be enabled by adding a --weights_class weights_c.csv option for classification and a --weights_regr weights_r.csv option for regression. The CSV file can contain the following columns:

  • task_id integer from 0 to the number of classification tasks minus 1,
  • training_weight real value between 0.0 and 1.0 (inclusive) for each task,
  • censored_weight (optional) real value between 0.0 and 1.0 (inclusive). It allows per-task down-weighting of censored values in regression.
  • aggregation_weight (optional) real value between 0.0 and 1.0 (inclusive). If specified, it determines the task's weight when aggregating metrics (AUC-PR etc.).
  • task_type (optional) string specifying the task type.

The number of tasks in the CSV file for --weights_class (--weights_regr) must be equal to the number of tasks in the --y_class (--y_regr) matrix. A small example is given below.
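
A hypothetical weights_c.csv with the columns described above, written with pandas (the number of rows must match the number of classification tasks):

import pandas as pd

weights = pd.DataFrame({
    "task_id":            [0, 1, 2],
    "training_weight":    [1.0, 0.5, 0.0],  # task 2 does not contribute to training
    "aggregation_weight": [1.0, 1.0, 0.0],  # and is excluded from aggregated metrics
})
weights.to_csv("weights_c.csv", index=False)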

Regression

SparseChem also supports regression, as well as joint regression and classification. Here is an example using regression:

python train.py \
  --x ./chembl_23_x.mtx \
  --y_regr ./chembl_23_y.mtx \
  --folding ./folding_hier_0.6.npy \
  --fold_va 0 \
  --batch_ratio    0.02 \
  --hidden_sizes   400 400 \
  --last_non_linearity tanh \
  --weight_decay   1e-4 \
  --last_dropout   0.2 \
  --middle_dropout 0.2 \
  --epochs         20 \
  --lr             1e-3 \
  --lr_steps       10 \
  --lr_alpha       0.3

The matrix for --y_regr is a sparse matrix (similar to classification), for which SparseChem minimizes the mean squared error (MSE) loss. Note that we have also switched the last-layer non-linearity to tanh (--last_non_linearity tanh).
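
For reference, a sparse regression label matrix can be built and saved in matrix-market format like this (a sketch with placeholder names; the stored values are the real-valued measurements):

import numpy as np
import scipy.sparse
import scipy.io

rows = np.array([0, 0, 3])        # compound indices
cols = np.array([1, 4, 2])        # regression task indices
vals = np.array([6.5, 7.1, 5.2])  # measured values, e.g. pIC50

y_regr = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(10, 5))
scipy.io.mmwrite("my_y_regr.mtx", y_regr)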

Censored regression

It is possible to use censored regression by passing an extra sparse matrix with --y_censor chembl_censor.npy that has the same sparsity pattern as --y_regr. The censor matrix should store the censoring mask for each regression value:

  • -1 for lower censoring
  • 0 for no censoring (usual regression)
  • +1 for upper censoring
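
A small sketch of constructing such a censor mask: copy the regression matrix so the sparsity pattern is identical, then replace the stored values with -1/0/+1 codes (file names are placeholders):

import numpy as np
import scipy.io

y_regr = scipy.io.mmread("my_y_regr.mtx").tocsr()

y_censor = y_regr.copy()
y_censor.data = np.zeros_like(y_censor.data)  # 0 = not censored
y_censor.data[0] = -1                         # mark the first stored value as lower-censored

np.save("chembl_censor.npy", y_censor)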

Running on CPU or other GPUs

The default device is cuda:0. To train the model on CPU just add --dev cpu to the arguments. Similarly, to choose another GPU, we can specify --dev cuda:1.

Predicting on new compounds

After the run is complete, the model's parameters and configuration are saved under the models/ folder. Note that you can change the output directory by providing --output_dir some_other_dir.

We can then use predict.py to make predictions for new compounds as follows:

python predict.py \
    --x new_compounds.mtx \
    --outprefix y_hat \
    --conf models/sc_chembl_h400.400_ldo0.2_wd1e-05.json \
    --model models/sc_chembl_h400.400_ldo0.2_wd1e-05.pt \
    --dev cuda:0

where new_compounds.mtx is the sparse feature matrix of the new compounds and --outprefix y_hat specifies the prefix for the file(s) where the predictions are saved. In this example, the output file for classification tasks will be y_hat-class.npy and for regression tasks y_hat-regr.npy, assuming that the model has both types of tasks. The --conf and --model options should point to the configuration and model files that were saved during training.

The format for the predictions is a NumPy file that can be loaded as follows:

import numpy as np
y_hat = np.load("y_hat-class.npy")  # predictions for the classification tasks

The predictions themselves are class probabilities (values between 0.0 and 1.0).

There is an option --dropout 1 to switch on dropout during prediction, which yields stochastic predictions, e.g., for MC-dropout.
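
A sketch of MC-dropout averaging, assuming predict.py was run five times with --dropout 1, each run using a different (hypothetical) prefix y_hat_mc1 ... y_hat_mc5:

import numpy as np

samples = np.stack([np.load(f"y_hat_mc{i}-class.npy") for i in range(1, 6)])
y_mean = samples.mean(axis=0)  # averaged class probabilities
y_std  = samples.std(axis=0)   # disagreement across dropout samples, a simple uncertainty estimate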

Sparse predictions

It is also possible to predict only selected elements instead of the whole output matrix (for classification, regression, or both):

  • add --y_class y_class_to_predict.npy (.npy, .npz or .mtx) for classification,
  • add --y_regr y_regr_to_predict.npy (.npy, .npz or .mtx) for regression. If provided, these matrices specify the locations at which predictions are made (see the sketch after this list).
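
A sketch of building such a matrix of prediction locations: a sparse matrix whose non-zero entries mark the (compound, task) pairs of interest (shapes and file names are placeholders):

import numpy as np
import scipy.sparse

rows = np.array([0, 0, 2])   # compound indices in new_compounds.mtx
cols = np.array([5, 17, 5])  # classification task indices
ones = np.ones(len(rows))

y_to_predict = scipy.sparse.csr_matrix((ones, (rows, cols)), shape=(100, 3000))
np.save("y_to_predict.npy", y_to_predict)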

Here is an example for classification:

python predict.py \
    --x new_compounds.mtx \
    --y_class y_to_predict.npy \
    --outprefix y_hat \
    --conf models/sc_chembl_h400.400_ldo0.2_wd1e-05.json \
    --model models/sc_chembl_h400.400_ldo0.2_wd1e-05.pt \
    --dev cuda:0

Then predict.py will create a file y_hat-class.npy, which now contains a scipy.sparse matrix with the same shape and sparsity pattern as y_to_predict.npy. The resulting file can be loaded with yhat = np.load('y_hat-class.npy', allow_pickle=True).item() or simply by:

import sparsechem as sc
yhat = sc.load_sparse('y_hat-class.npy')
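
Continuing from the snippet above, individual predicted probabilities can be inspected by converting the sparse result to COO format:

yhat_coo = yhat.tocoo()
for compound, task, prob in zip(yhat_coo.row, yhat_coo.col, yhat_coo.data):
    print(f"compound {compound}, task {task}: p = {prob:.3f}")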

Sparse predictions for specific folds

Additionally, if only a specific fold of Y is of interest, we can add the parameters --folding my_folding.npy and --predict_fold 0 to return predictions only for rows that are in fold 0 (where the folding of the samples is specified by my_folding.npy). It is possible to predict several folds by specifying them as --predict_fold 0 2 3. Note that the rows from the other (unspecified) folds will simply be empty in the y_hat-class.npy matrix.

Retrieving last hidden layers

Instead of outputting the predictions we can use predict.py to output the activations of the last layer. This can be done by adding the option --last_hidden 1.

Then the output file will contain the numpy matrix of the hidden vectors, which can be loaded the same way as predictions.

Full list of all command line arguments

  • --x: Activity file (matrix market, .npy or .npz) (str)
  • --y_class | --y | --y_classification: Activity file (matrix market, .npy or .npz) (str)
  • --y_regr | --y_regression: Activity file (matrix market, .npy or .npz) (str)
  • --y_censor: Censor mask for regression (matrix market, .npy or .npz) (str)
  • --weights_class | --task_weights | --weights_classification: CSV file with columns task_id, training_weight, aggregation_weight, task_type (for classification tasks) (str)
  • --weights_regr | --weights_regression: CSV file with columns task_id, training_weight, censored_weight, aggregation_weight, task_type (for regression tasks) (str)
  • --censored_loss: Set this to 0 if censored loss should not be used for training (int, default=1)
  • --folding: Folding file (npy) (str)
  • --fold_va: Validation fold number (int, default=0)
  • --fold_te: Test fold number (removed from dataset) (int)
  • --batch_ratio: Batch ratio (float, default=0.02)
  • --internal_batch_max: Maximum size of the internal batch (int)
  • --normalize_loss: Normalization constant to divide the loss (int, default=None, uses batch size)
  • --normalize_regression: Set this to 1 if the regression tasks should be normalized
  • --hidden_sizes: Hidden sizes (int) (default: [])
  • --last_hidden_sizes: Hidden sizes in the head (int) (default: [])
  • --middle_dropout: Dropout for layers before the last (float, default=0.0)
  • --last_dropout: Last dropout (float, default=0.2)
  • --weight_decay: Weight decay (float, default=0.0)
  • --last_non_linearity: Last layer non-linearity (str, default="relu", choices=["relu", "tanh"])
  • --non_linearity: Before last layer non-linearity (str, default="relu", choices=["relu", "tanh"])
  • --input_transform: Transformation to apply to inputs (str, default="binarize", choices=["binarize", "none", "tanh"])
  • --lr: Learning rate (float, default=1e-3)
  • --lr_alpha: Learning rate decay multiplier (float, default=0.3)
  • --lr_steps: Learning rate decay steps (int, default=[10])
  • --input_size_freq: Number of high importance features (int)
  • --fold_inputs: Fold input to a fixed set (default no folding) (int)
  • --epochs: Number of epochs (type=int, default=20)
  • --min_samples_auc: Minimum number of samples (in each class) for AUC calculation (int, default=25)
  • --min_samples_class: Minimum number of samples in each class and in each fold for AUC calculation (only used if aggregation_weight is not provided in --weights_class) (int, default=5)
  • --min_samples_regr: Minimum number of uncensored samples in each fold for regression metric calculation (only used if aggregation_weight is not provided in --weights_regr) (int, default=10)
  • --dev: Compute device to use (str, default="cuda:0", possible ["cpu","cuda:X"])
  • --run_name: Run name for results (str)
  • --output_dir: Output directory, including boards (str, default="models")
  • --prefix: Prefix for run name (str, default='run')
  • --verbose: Verbosity level: 2 = full; 1 = no progress; 0 = no output (int, default=2, choices=[0, 1, 2])
  • --save_model: Set this to 0 if the model should not be saved (int, default=1)
  • --save_board: Set this to 0 if the TensorBoard should not be saved
  • --eval_train: Set this to 1 to calculate AUCs for train data (int, default=0)
  • --eval_frequency: The gap between AUC eval (in epochs), -1 means to do an eval at the end. (int, default=1)
  • --optimizer: Choose the optimizer (Adam or SGD, default: Adam)
  • --optimizer_params: Set optimizer specific parameters (Adam(beta1, beta2, epsilon), SGD(momentum), default: Pytorch default)