VECT-GAN: A Variationally Encoded Generative Model for Overcoming Data Scarcity in Pharmaceutical Science
Authors: Youssef Abdalla, Marissa Taub, Priya Akkaru, Eleanor Hilton, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborti, David Shorthouse
Institutions:
- Department of Pharmaceutics, UCL School of Pharmacy, University College London, London, UK
- The Alan Turing Institute, London, UK
- UCL Department of Medical Physics and Biomedical Engineering, University College London, Malet Place Engineering Building, 2 Malet Place, London WC1E 7JE, UK
- UCL Cancer Institute, University College London, Paul O'Gorman Building, 72 Huntley Street, London WC1E 6DD, UK
Abstract: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While Machine Learning offers a solution, existing datasets are often small and/or noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules – an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package at: https://pypi.org/project/vect-gan/
Preprint: https://arxiv.org/abs/2501.08995
VECT-GAN requires Python 3.11 or later. This is the officially tested range; the package may run on earlier versions, but they are untested.
Install via pip:
pip install vect-gan
Alternatively, if you have the source code:
git clone https://github.com/y-babdalla/vect_gan.git
cd vect_gan
pip install .
Once installed, you can import the package in Python:
import vect_gan
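As a quick sanity check before going further, the sketch below verifies the Python 3.11 requirement stated above and confirms that the package can be found by the interpreter. The helper names `python_is_supported` and `package_is_installed` are hypothetical and are not part of VECT-GAN; only the standard library is used.

```python
import importlib.util
import sys

# Minimum version taken from the requirement stated in this README.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_is_installed(name: str = "vect_gan") -> bool:
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(name) is not None


print("Python version supported:", python_is_supported())
print("vect_gan installed:", package_is_installed())
```

If the second line prints `False`, the install step most likely ran in a different environment from the one running the script.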
Before fine-tuning or using a pre-trained VECT-GAN model, you can train a new model from scratch using the .fit method. This step is recommended if you have a novel dataset and wish to initialise the generative model from the ground up.
import pandas as pd
from vect_gan.synthesizers.vectgan import VectGan
# Create an instance of VectGan
vect_gan = VectGan()
# Load your training data into a pandas DataFrame
training_data = pd.read_csv("path/to/your/training_data.csv")
# Train the model from scratch
trained_model = vect_gan.fit(
    data=training_data,
    epochs=200,
    batch_size=64,
    pac=8,
    verbose=True,
)
- data: The training dataset (pandas DataFrame).
- epochs: The number of training epochs.
- batch_size: Training batch size.
- pac: Pack size for training (grouping samples for improved stability). Please note that the batch size must be divisible by the pac size, as in the example above (64 is divisible by 8).
- verbose: Controls the verbosity of training logs.
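The relationship between batch_size and pac can be checked before training starts. The sketch below assumes the constraint is that batch_size is a whole multiple of pac, which is what the example values (64 and 8) satisfy. Both helper functions are hypothetical and are not part of the VECT-GAN API.

```python
def validate_batch_and_pac(batch_size: int, pac: int) -> None:
    """Raise ValueError unless batch_size is a positive multiple of pac.

    Assumption: the library expects batch_size % pac == 0.
    """
    if batch_size <= 0 or pac <= 0:
        raise ValueError("batch_size and pac must both be positive")
    if batch_size % pac != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of pac={pac}"
        )


def round_down_to_multiple(batch_size: int, pac: int) -> int:
    """Return the largest multiple of pac that is <= batch_size.

    Never returns less than pac itself, so the result is always usable.
    """
    return max(pac, (batch_size // pac) * pac)


validate_batch_and_pac(64, 8)          # passes silently
print(round_down_to_multiple(50, 8))   # 48
```

For a small dataset, rounding the intended batch size down to a multiple of pac is one simple way to obtain a valid pair of values.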
Once you have a trained or pre-trained model, you may wish to fine-tune it on additional data to adapt the model’s learned representations to new or more specialised tasks. Fine-tuning generally uses fewer epochs than a full training routine.
Note: Because VECT-GAN’s underlying DataTransformer and DataSampler are not serialisable, they are not included in the pre-trained model checkpoint. If you download or receive a pre-trained VECT-GAN model, you still need to run a brief fine-tuning step on a small amount of data. We provide some synthetic data examples in the data/ folder that you can use for this purpose.
import pandas as pd
from vect_gan.synthesizers.vectgan import VectGan
# Create an instance of VectGan
vect_gan = VectGan()
# Load data for fine-tuning
new_data = pd.read_csv("path/to/your/data.csv")
# Fine-tune the model
model = vect_gan.fine_tune(
    new_data=new_data,
    epochs=10,
    batch_size=32,
    pac=8,
    verbose=True,
)
- new_data: The dataset on which to perform fine-tuning (pandas DataFrame).
- epochs: The number of training epochs.
- batch_size: Training batch size.
- pac: Pack size for training (grouping samples for improved stability). Please note that the batch size must be divisible by the pac size, as in the example above (32 is divisible by 8).
- verbose: Controls the verbosity of logs.
Once you have trained or fine-tuned a model, you can generate synthetic data:
# Sample 10 rows of synthetic data
sampled_data = model.sample(10)
print("Sampled Data:")
print(sampled_data)
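A common next step is to combine the sampled rows with the original data before fitting a downstream model, in line with the augmentation pipeline described in the abstract. The sketch below uses pandas only and assumes that the sampled data is a DataFrame with the same columns as the training data. The helper `augment_with_synthetic`, the `is_synthetic` column, and the column names and values are all made up for illustration.

```python
import pandas as pd


def augment_with_synthetic(
    real: pd.DataFrame, synthetic: pd.DataFrame
) -> pd.DataFrame:
    """Stack real and synthetic rows, tagging each row's origin."""
    missing = set(real.columns) - set(synthetic.columns)
    if missing:
        raise ValueError(f"Synthetic data is missing columns: {sorted(missing)}")
    real_tagged = real.assign(is_synthetic=False)
    # Reorder synthetic columns to match the real data before tagging.
    synthetic_tagged = synthetic[real.columns].assign(is_synthetic=True)
    return pd.concat([real_tagged, synthetic_tagged], ignore_index=True)


# Stand-in frames with invented values; in practice `synthetic` would be
# the output of the sampling call shown above.
real = pd.DataFrame({"dose_mg": [10.0, 20.0], "release_pct": [35.0, 60.0]})
synthetic = pd.DataFrame({"release_pct": [48.0], "dose_mg": [15.0]})

augmented = augment_with_synthetic(real, synthetic)
print(augmented)
```

Keeping the origin flag makes it easy to evaluate a downstream model on real rows only, so that synthetic rows influence training but not the reported metrics.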
VECT-GAN uses torch.save and torch.load under the hood to handle model checkpoints:
# Save the fine-tuned model
model.save("fine_tuned_model.pt")
# Load the model from disk
vect_gan = vect_gan.load("fine_tuned_model.pt")
The above commands store or retrieve the learned model weights. However, as noted, the data transformation objects are not serialised. This means the loaded model will require an external transformer and data sampler to operate fully. If you want to run the model immediately upon loading (e.g. to generate new samples), you will need to either:
- Fine-tune the loaded model on a small dataset (as described above), or
- Re-initialise the same type of DataTransformer and DataSampler that were used originally.
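The first option (load, then briefly fine-tune, then sample) can be wrapped in one function. The sketch below relies only on the calls shown in this README (load, fine_tune and sample) and assumes, as the snippets above suggest, that load and fine_tune each return a usable model object. The function `restore_and_sample` and its default values are hypothetical and are not part of the package.

```python
from typing import Any


def restore_and_sample(
    model: Any,
    checkpoint_path: str,
    small_data: Any,
    n_rows: int = 10,
    epochs: int = 10,
    batch_size: int = 32,
    pac: int = 8,
) -> Any:
    """Load a checkpoint, fine-tune briefly, and draw synthetic rows.

    `model` is expected to expose load/fine_tune/sample with the same
    call shapes used elsewhere in this README (an assumption).
    """
    # Restore the learned weights; transformer and sampler are not included.
    loaded = model.load(checkpoint_path)
    # A short fine-tuning pass rebuilds the data transformation objects.
    tuned = loaded.fine_tune(
        new_data=small_data,
        epochs=epochs,
        batch_size=batch_size,
        pac=pac,
        verbose=False,
    )
    return tuned.sample(n_rows)
```

With the package installed, this might be called as restore_and_sample(VectGan(), "fine_tuned_model.pt", small_df), where small_df is a pandas DataFrame such as one of the examples in the data/ folder.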
We welcome contributions in the form of bug reports, feature requests, or pull requests. If you wish to contribute code, kindly follow these steps:
- Fork this repository.
- Create a new branch from main.
- Implement or fix the feature/bug.
- Submit a pull request, describing your changes in detail.
Please ensure your code follows Python 3 best practices, PEP8 style conventions, and includes comprehensive docstrings.
VECT-GAN is distributed under the GNU GENERAL PUBLIC LICENSE. Please refer to the LICENSE file in the repository for detailed information.
For questions, comments, or feedback regarding VECT-GAN, please contact:
- Lead Author: Youssef Abdalla (youssef.abdalla.16@ucl.ac.uk)
- GitHub Issues: https://github.com/y-babdalla/vect_gan/issues
Please note that the data used to train the models is private. For access to the training and evaluation data, please contact the authors.
If you use VECT-GAN in academic work, kindly cite our research paper. This project is continually evolving, and we appreciate any input that could help improve it.