VECT-GAN: A Variationally Encoded Generative Model for Overcoming Data Scarcity in Pharmaceutical Science

Authors: Youssef Abdalla, Marissa Taub, Priya Akkaru, Eleanor Hilton, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborti, David Shorthouse
Institutions:

Department of Pharmaceutics, UCL School of Pharmacy, University College London, London, UK

The Alan Turing Institute, London, UK

UCL Department of Medical Physics and Biomedical Engineering, University College London, Malet Place Engineering Building, 2 Malet Place, London WC1E 7JE, UK

UCL Cancer Institute, University College London, Paul O'Gorman Building, 72 Huntley Street, London WC1E 6DD, UK

Abstract: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While machine learning offers a solution, existing datasets are often small and/or noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules – an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package at: https://pypi.org/project/vect-gan/

Preprint: https://arxiv.org/abs/2501.08995

Installation

VECT-GAN targets Python 3.11 or later, the versions on which it is officially tested. Earlier versions may work but are untested.

Install via pip:

pip install vect-gan

Alternatively, if you have the source code:

git clone https://github.com/y-babdalla/vect_gan.git
cd vect_gan
pip install .

Once installed, you can import the package in Python:

import vect_gan
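
As a quick post-install check, the sketch below confirms that the interpreter meets the Python 3.11 minimum stated above and looks up the installed package version. It uses only the standard library. The helper names are illustrative and not part of VECT-GAN, and the distribution name vect-gan is taken from the pip command above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 11)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is at least Python 3.11."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name="vect-gan"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("vect-gan version:", installed_version())
```

If installed_version() returns None, the package is not visible to the interpreter you are running, which usually means pip installed it into a different environment.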

Usage

Training the Model from Scratch

To train a new model from scratch, use the .fit method. This is recommended when you have a novel dataset and want to initialise the generative model from the ground up rather than start from a pre-trained checkpoint.

import pandas as pd
from vect_gan.synthesizers.vectgan import VectGan

# Create an instance of VectGan
vect_gan = VectGan()

# Load your training data into a pandas DataFrame
training_data = pd.read_csv("path/to/your/training_data.csv")

# Train the model from scratch
trained_model = vect_gan.fit(
    data=training_data,
    epochs=200,
    batch_size=64,
    pac=8,
    verbose=True
)

data: The training dataset (pandas DataFrame).
epochs: The number of training epochs.
batch_size: Training batch size.
pac: Pack size for training (grouping samples for improved stability). Please note that the batch size must be divisible by the pac size (for example, batch_size=64 with pac=8).
verbose: Controls the verbosity of training logs.
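
In the examples in this README, batch_size is always a whole multiple of pac (64 with 8, and 32 with 8). The sketch below checks that relationship before training starts, so a mismatch surfaces as a clear error rather than partway through a run. Both helpers are hypothetical and not part of the VECT-GAN API, and the exact check performed inside the library is assumed to be batch_size % pac == 0.

```python
def check_batch_and_pac(batch_size: int, pac: int) -> None:
    """Raise ValueError unless batch_size is a positive whole multiple of pac."""
    if batch_size <= 0 or pac <= 0:
        raise ValueError("batch_size and pac must both be positive")
    if batch_size % pac != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of pac={pac}"
        )


def largest_valid_batch_size(max_batch_size: int, pac: int) -> int:
    """Return the largest multiple of pac that does not exceed max_batch_size."""
    if pac <= 0 or max_batch_size < pac:
        raise ValueError("need pac > 0 and max_batch_size >= pac")
    return (max_batch_size // pac) * pac


check_batch_and_pac(64, 8)                   # passes silently
adjusted = largest_valid_batch_size(60, 8)   # rounds 60 down to 56
```

Rounding down rather than up keeps the batch within whatever memory budget motivated the original upper limit.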

Fine-Tuning the Molecular Descriptor Model

Once you have a trained or pre-trained model, you may wish to fine-tune it on additional data to adapt the model’s learned representations to new or more specialised tasks. Fine-tuning generally uses fewer epochs than a full training routine.

Note: Because VECT-GAN’s underlying DataTransformer and DataSampler are not serialisable, they are not included in the pre-trained model checkpoint. If you download or receive a pre-trained VECT-GAN model, you still need to run a brief fine-tuning step on a small amount of data. We provide some synthetic data examples in the data/ folder that you can use for this purpose.

import pandas as pd
from vect_gan.synthesizers.vectgan import VectGan

# Create an instance of VectGan
vect_gan = VectGan()

# Load data for fine-tuning
new_data = pd.read_csv("path/to/your/data.csv")

# Fine-tune the model
model = vect_gan.fine_tune(
    new_data=new_data,
    epochs=10,
    batch_size=32,
    pac=8,
    verbose=True
)

new_data: The dataset on which to perform fine-tuning (pandas DataFrame).
epochs: The number of training epochs.
batch_size: Training batch size.
pac: Pack size for training (grouping samples for improved stability). Please note that the batch size must be divisible by the pac size (for example, batch_size=32 with pac=8).
verbose: Controls the verbosity of logs.

Sampling Synthetic Data

Once you have trained or fine-tuned a model, you can generate synthetic data:

# Sample 10 rows of synthetic data
sampled_data = model.sample(10)

print("Sampled Data:")
print(sampled_data)
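
The abstract describes augmenting a dataset with synthetic rows before fitting a regression model. One way that step might look is sketched below, assuming model.sample returns a pandas DataFrame with the same columns as the training data. The augment helper, the is_synthetic column, and the column names and values are made up for illustration. The synthetic frame here is a hand-written stand-in so the snippet runs without a trained model.

```python
import pandas as pd


def augment(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Stack real and synthetic rows, tagging each row with its origin."""
    missing = set(real.columns) - set(synthetic.columns)
    if missing:
        raise ValueError(f"synthetic data lacks columns: {sorted(missing)}")
    real_tagged = real.assign(is_synthetic=False)
    # Reorder synthetic columns to match the real data before stacking
    synthetic_tagged = synthetic[list(real.columns)].assign(is_synthetic=True)
    return pd.concat([real_tagged, synthetic_tagged], ignore_index=True)


# Stand-in frames; in practice `synthetic` would come from model.sample(n)
real = pd.DataFrame({"dose_mg": [10.0, 20.0], "release_pct": [35.0, 60.0]})
synthetic = pd.DataFrame({"release_pct": [48.0], "dose_mg": [15.0]})

augmented = augment(real, synthetic)
print(augmented)
```

The is_synthetic flag makes it easy to keep evaluation on real rows only. Drop it before passing the augmented frame to a regressor.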

Saving and Loading Models

VECT-GAN uses torch.save and torch.load under the hood to handle model checkpoints:

# Save the fine-tuned model
model.save("fine_tuned_model.pt")

# Load the model from disk
vect_gan = vect_gan.load("fine_tuned_model.pt")

The above commands store or retrieve the learned model weights. However, as noted, the data transformation objects are not serialised. This means the loaded model will require an external transformer and data sampler to operate fully. If you want to run the model immediately upon loading (e.g. to generate new samples), you will need to either:

Fine-tune the loaded model on a small dataset (as described above), or
Re-initialise the same type of DataTransformer and DataSampler that were used originally.

Contributing

We welcome contributions in the form of bug reports, feature requests, or pull requests. If you wish to contribute code, kindly follow these steps:

Fork this repository.
Create a new branch from main.
Implement or fix the feature/bug.
Submit a pull request, describing your changes in detail.

Please ensure your code follows Python 3 best practices and PEP 8 style conventions, and includes comprehensive docstrings.

Licence

VECT-GAN is distributed under the GNU General Public License. Please refer to the LICENSE file in the repository for detailed information.

Contact

For questions, comments, or feedback regarding VECT-GAN, please contact:

Lead Author: Youssef Abdalla (youssef.abdalla.16@ucl.ac.uk)
GitHub Issues: https://github.com/y-babdalla/vect_gan/issues

Please note that the data used to train the models is private. For access to the training and evaluation data, please contact the authors.

If you use VECT-GAN in academic work, kindly cite our research paper. This project is continually evolving, and we appreciate any input that could help improve it.
