When using advanced NLP methodologies to solve chemical problems, two fundamental questions arise: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'?
This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES) to address the second question. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
For more details, please refer to the following papers:
- TSSA, TSDY, TSID: https://www.nature.com/articles/s41467-024-49388-6
- TSIS (TSIS, TSISD, TSISO, TSISR): https://arxiv.org/abs/2402.02164
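As a small, purely illustrative sketch of the fragmentation step (producing the 'chemical words'), the snippet below applies RDKit's BRICS decomposition, one of the fragmentation schemes evaluated here, to Celecoxib, the example molecule used later in this README. It is not the repository's own encoding pipeline.

```python
# Illustrative sketch: obtain fragment "chemical words" with RDKit's BRICS
# decomposition. This is not the repository's t-SMILES encoder.
from rdkit import Chem
from rdkit.Chem import BRICS

celecoxib = "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1"
mol = Chem.MolFromSmiles(celecoxib)

# BRICSDecompose returns fragment SMILES in which the broken bonds are
# marked with numbered dummy atoms such as [1*], [16*], ...
for fragment in sorted(BRICS.BRICSDecompose(mol)):
    print(fragment)
```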
Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show that:
- It can build a multi-code molecular description system in which the various descriptions complement one another and enhance overall performance. Under this framework, classical SMILES can be unified as a special case of t-SMILES, achieving better-balanced performance when hybrid decomposition algorithms are used.
- It exhibits impressive performance on the low-resource datasets JNK3 and AID1706, whether the model is trained from scratch, with data augmentation, or with pre-training and fine-tuning.
- It significantly outperforms classical SMILES, DeepSMILES, SELFIES, and baseline models in goal-directed tasks.
- It outperforms previous fragment-based models and is competitive with classical SMILES and graph-based methods on Zinc, QM9, and ChEMBL.
To support the t-SMILES algorithm, we introduce a new character, '&', which acts as a tree-node placeholder when a node in the full binary tree (FBT) is not a real fragment. We also introduce a second character, '^', which separates two adjacent substructure segments in a t-SMILES string, much like the blank space that separates two words in an English sentence.
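As rough intuition for how a fragment tree is flattened with these placeholder characters, here is a minimal breadth-first serialization sketch over a toy Node class. It is not the repository's encoder; the exact rules for where '&' and '^' appear in real t-SMILES strings follow the grammar described in the papers.

```python
# Minimal sketch of breadth-first serialization of a binary fragment tree.
# NOT the repository's encoder: real t-SMILES strings place '&' (virtual
# node) and '^' (segment separator) according to the published grammar.
from collections import deque

class Node:
    def __init__(self, smiles=None, left=None, right=None):
        self.smiles = smiles   # fragment SMILES, or None for a virtual node
        self.left = left
        self.right = right

def bfs_tokens(root):
    """Return the breadth-first token sequence, using '&' for empty nodes."""
    tokens, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None or node.smiles is None:
            tokens.append('&')       # virtual / missing node
            continue
        tokens.append(node.smiles)   # real fragment
        queue.append(node.left)
        queue.append(node.right)
    return tokens

# Toy tree: a phenyl fragment whose left child is a fluorine fragment.
root = Node("*C1=CC=CC=C1", left=Node("*F"))
print(bfs_tokens(root))   # ['*C1=CC=CC=C1', '*F', '&', '&', '&']
```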
Four coding algorithms are presented in these studies:
- TSSA: t-SMILES with shared atom.
- TSDY: t-SMILES with dummy atom but without ID.
- TSID: t-SMILES with ID and dummy atom.
- TSIS: simplified TSID, including the TSIS, TSISD, TSISO, and TSISR variants.
For example, the six t-SMILES codes of Celecoxib are:
TSID_M:
- [1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&
TSDY_M (replace [n*] with *):
- *C&*C1=CC=C(*)C=C1&*C1=CC(*)=NN1*&*C(*)(F)F&*F^*C1=CC=C(*)C=C1&&*S(N)(=O)=O&&&
TSSA_M:
- CC&C1=CC=CC=C1&CC&C1=C[NH]N=C1&CN&C1=CC=CC=C1^CC^CS&C&N[SH]=O&CF&&&&FCF&&
TSIS_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O
TSISD_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[4*]F^[5*]C1=CC=C([6*])C=C1^[6*]S(N)(=O)=O
TSISO_M:
- [2*]C1=CC([3*])=NN1[5*]^[1*]C1=CC=C([2*])C=C1^[5*]C1=CC=C([6*])C=C1^[3*]C([4*])(F)F^[6*]S(N)(=O)=O^[1*]C^[4*]F
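Each '^'-separated segment of these TSIS strings is itself a valid fragment SMILES whose numbered dummy atoms [n*] mark the attachment points. The illustrative snippet below (not part of the repository) checks this with RDKit for the TSIS_M string above.

```python
# Illustrative check: every '^'-separated segment of the TSIS_M string for
# Celecoxib parses as a valid fragment SMILES in RDKit.
from rdkit import Chem

tsis_m = ("[1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^"
          "[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O")

for segment in tsis_m.split('^'):
    mol = Chem.MolFromSmiles(segment)
    print(segment, "->", "valid" if mol is not None else "invalid")
```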
Table: Results for the distribution-learning benchmarks on ChEMBL using diffusion models.
Here we provide the source code of our method.
We recommend using Anaconda to manage the Python version and installed packages.
Please make sure the following packages are installed:
- Python (version >= 3.7)
- PyTorch (version == 1.7)
- RDKit (version >= 2020.03)
- NetworkX (version >= 2.4)
- NumPy (version >= 1.19)
- Pandas (version >= 1.2.2)
- Matplotlib (version >= 2.0)
- SciPy (version >= 1.4.1)
For Datamol and r-BRICS, please download them from https://github.com/datamol-io/datamol and https://github.com/BiomedSciAI/r-BRICS, then copy them into the MolUtils folder.
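As an optional, purely illustrative sanity check (not part of the repository), the snippet below confirms that the required packages are importable and prints their installed versions.

```python
# Optional sanity check: verify that the required packages import cleanly
# and report their installed versions.
import torch, rdkit, networkx, numpy, pandas, matplotlib, scipy

for name, module in [("PyTorch", torch), ("RDKit", rdkit),
                     ("NetworkX", networkx), ("NumPy", numpy),
                     ("Pandas", pandas), ("Matplotlib", matplotlib),
                     ("SciPy", scipy)]:
    print(f"{name}: {module.__version__}")
```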
- DataSet/Graph/CNJTMol.py
  - encode_single(): a preprocessing function that generates t-SMILES from a data set.
- DataSet/Graph/CNJMolAssembler.py
  - decode_single(): reconstructs molecules from t-SMILES to generate classical SMILES.
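A hypothetical round-trip sketch is shown below. The import paths are inferred from the file layout above, and the argument and return types of encode_single() and decode_single() are assumptions, so please consult the two files for the actual interfaces.

```python
# Hypothetical round-trip sketch: import paths and function signatures are
# assumed from the file layout above; see CNJTMol.py and CNJMolAssembler.py
# for the actual interfaces.
from DataSet.Graph.CNJTMol import encode_single
from DataSet.Graph.CNJMolAssembler import decode_single

celecoxib = "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1"

t_smiles = encode_single(celecoxib)   # classical SMILES -> t-SMILES (assumed)
rebuilt = decode_single(t_smiles)     # t-SMILES -> classical SMILES (assumed)
print(t_smiles)
print(rebuilt)
```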
In this study, GPT and RNN generative models are used for evaluation.
We thank the following Git repositories, which provided a great deal of inspiration:
- hgraph2graph: https://github.com/wengong-jin/hgraph2graph
- DeepSMILES: https://github.com/baoilleach/deepsmiles
- SELFIES: https://github.com/aspuru-guzik-group/selfies
- FragDGM: https://github.com/marcopodda/fragment-based-dgm
- AttentiveFP: https://github.com/OpenDrugAI/AttentiveFP
- Guacamol: https://github.com/BenevolentAI/guacamol_baselines
- GPT2: https://github.com/samwisegamjeee/pytorch-transformers