
t-SMILES: A Scalable Fragment-based Molecular Representation Framework

When using advanced NLP methodologies to solve chemical problems, two fundamental questions arise: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'?

This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES) to address the second question. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.

For more details, please refer to the papers:

TSSA, TSDY, TSID: https://www.nature.com/articles/s41467-024-49388-6

TSIS (TSIS, TSISD, TSISO, TSISR): https://arxiv.org/abs/2402.02164

Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show that:

  1. It can build a multi-code molecular description system in which the various descriptions complement each other, enhancing overall performance. Under this framework, classical SMILES can be unified as a special case of t-SMILES, achieving better-balanced performance with hybrid decomposition algorithms.

  2. It exhibits impressive performance on the low-resource datasets JNK3 and AID1706, whether the model is trained from scratch, trained with data augmentation, or pre-trained and fine-tuned.

  3. It significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks.

  4. It outperforms previous fragment-based models and is competitive with classical SMILES and graph-based methods on ZINC, QM9, and ChEMBL.

To support the t-SMILES algorithm, we introduce a new character, '&', to act as a tree node when the node is not a real fragment in the full binary tree (FBT). Additionally, we introduce another new character, '^', to separate two adjacent substructure segments in a t-SMILES string, similar to the blank space that separates two words in an English sentence.

Four coding algorithms are presented in these studies:

  1. TSSA: t-SMILES with shared atom.

  2. TSDY: t-SMILES with dummy atom but without ID.

  3. TSID: t-SMILES with ID and dummy atom.

  4. TSIS: simplified TSID, with variants TSIS, TSISD, TSISO, and TSISR.

For example, the six t-SMILES codes of Celecoxib are:

TSID_M:

  • [1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&

TSDY_M (replace [n*] with *):

  • *C&*C1=CC=C(*)C=C1&*C1=CC(*)=NN1*&*C(*)(F)F&*F^*C1=CC=C(*)C=C1&&*S(N)(=O)=O&&&

TSSA_M:

  • CC&C1=CC=CC=C1&CC&C1=C[NH]N=C1&CN&C1=CC=CC=C1^CC^CS&C&N[SH]=O&CF&&&&FCF&&

TSIS_M:

  • [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O

TSISD_M:

  • [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[4*]F^[5*]C1=CC=C([6*])C=C1^[6*]S(N)(=O)=O

TSISO_M:

  • [2*]C1=CC([3*])=NN1[5*]^[1*]C1=CC=C([2*])C=C1^[5*]C1=CC=C([6*])C=C1^[3*]C([4*])(F)F^[6*]S(N)(=O)=O^[1*]C^[4*]F
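For illustration, the fragment sub-SMILES of any of these codes can be recovered by splitting the string on the two special characters. The short sketch below is written only for this README (it is not part of the released code); it splits the TSID code of Celecoxib above and checks each fragment with RDKit.

    import re
    from rdkit import Chem

    # TSID_M code of Celecoxib from above.
    t_smiles = ("[1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&"
                "[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&")

    # '&' and '^' are the only characters added by t-SMILES, so splitting on them
    # (and dropping the empty tokens produced by '&' placeholder nodes) recovers
    # the fragment sub-SMILES in the order they appear in the string.
    fragments = [tok for tok in re.split(r"[&^]", t_smiles) if tok]

    for frag in fragments:
        mol = Chem.MolFromSmiles(frag)  # dummy atoms such as [1*] are valid in RDKit
        print(frag, "->", "ok" if mol is not None else "invalid")

For the TSIS family of codes shown above, only '^' appears, so the same split applies.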

Table: Results for the distribution-learning benchmarks on ChEMBL using diffusion models (table not reproduced here).

Here we provide the source code of our method.

Dependencies

We recommend using Anaconda to manage the Python version and installed packages.

Please make sure the following packages are installed:

  1. Python (version >= 3.7)

  2. PyTorch (version == 1.7)

  3. RDKit (version >= 2020.03)

  4. NetworkX (version >= 2.4)

  5. NumPy (version >= 1.19)

  6. Pandas (version >= 1.2.2)

  7. Matplotlib (version >= 2.0)

  8. SciPy (version >= 1.4.1)

For Datamol and rBRICS, please download them from https://github.com/datamol-io/datamol and https://github.com/BiomedSciAI/r-BRICS and copy them into the MolUtils folder.
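As a quick sanity check that the dependencies above are importable (a small convenience snippet, not part of the repository):

    # Quick sanity check for the dependencies listed above (not part of the repository).
    import sys

    import matplotlib
    import networkx
    import numpy
    import pandas
    import rdkit
    import scipy
    import torch

    print("Python    :", sys.version.split()[0])
    print("PyTorch   :", torch.__version__)
    print("RDKit     :", rdkit.__version__)
    print("NetworkX  :", networkx.__version__)
    print("NumPy     :", numpy.__version__)
    print("Pandas    :", pandas.__version__)
    print("Matplotlib:", matplotlib.__version__)
    print("SciPy     :", scipy.__version__)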

Usage

  1. DataSet/Graph/CNJTMol.py

encode_single()

It contains a preprocessing function that generates t-SMILES from a dataset.

  2. DataSet/Graph/CNJMolAssembler.py

decode_single()

It reconstructs molecules from t-SMILES to generate classical SMILES.
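The snippet below is a hypothetical round-trip sketch; the actual signatures of encode_single() and decode_single() are defined in the two scripts above and may differ from what is assumed here, so please check each script for the canonical usage.

    # Hypothetical round-trip sketch; the real signatures may differ (see the scripts above).
    # Assumes the repository root is on PYTHONPATH so the two modules can be imported.
    from DataSet.Graph.CNJTMol import encode_single          # SMILES  -> t-SMILES (assumed)
    from DataSet.Graph.CNJMolAssembler import decode_single  # t-SMILES -> SMILES  (assumed)

    smiles = "CC1=CC=C(C=C1)C1=CC(=NN1C1=CC=C(C=C1)S(N)(=O)=O)C(F)(F)F"  # Celecoxib

    t_smiles = encode_single(smiles)     # assumed to return a t-SMILES string
    recovered = decode_single(t_smiles)  # assumed to reassemble a classical SMILES
    print(t_smiles)
    print(recovered)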

In this study, GPT and RNN generative models are used for evaluation.

Acknowledgement

We thank the following Git repositories, which provided a lot of inspiration:

  1. Datamol: https://github.com/datamol-io/datamol

  2. rBRICS: https://github.com/BiomedSciAI/r-BRICS

  3. MolGPT: https://github.com/devalab/molgpt

  4. MGM: https://github.com/nyu-dl/dl4chem-mgm

  5. JTVAE: https://github.com/wengon-jin/icml18-jtnn

  6. hgraph2graph: https://github.com/wengong-jin/hgraph2graph

  7. DeepSMILES: https://github.com/baoilleach/deepsmiles

  8. SELFIES: https://github.com/aspuru-guzik-group/selfies

  9. FragDGM: https://github.com/marcopodda/fragment-based-dgm

  10. CReM: https://github.com/DrrDom/crem

  11. AttentiveFP: https://github.com/OpenDrugAI/AttentiveFP

  12. Guacamol: https://github.com/BenevolentAI/guacamol_baselines

  13. MOSES: https://github.com/molecularsets/moses

  14. GPT2: https://github.com/samwisegamjeee/pytorch-transformers

  15. DIFFUMOL: https://github.com/ComputeSuda/DIFFUMOL
