When using advanced NLP methodologies to solve chemical problems, two fundamental questions arise: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'?
This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES) to address the second question. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
For more details, please refer to the following papers:
- TSSA, TSDY, TSID: https://www.nature.com/articles/s41467-024-49388-6
- TSIS (TSIS, TSISD, TSISO, TSISR): https://arxiv.org/abs/2402.02164
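As a small, purely illustrative sketch of the fragmentation step (producing the 'chemical words'), the snippet below applies RDKit's BRICS decomposition, one of the fragmentation schemes evaluated here, to Celecoxib, the example molecule used later in this README. It is not the repository's own encoding pipeline.

```python
# Illustrative sketch: obtain fragment "chemical words" with RDKit's BRICS
# decomposition. This is not the repository's t-SMILES encoder.
from rdkit import Chem
from rdkit.Chem import BRICS

celecoxib = "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1"
mol = Chem.MolFromSmiles(celecoxib)

# BRICSDecompose returns fragment SMILES in which the broken bonds are
# marked with numbered dummy atoms such as [1*], [16*], ...
for fragment in sorted(BRICS.BRICSDecompose(mol)):
    print(fragment)
```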
Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show that:
- It can build a multi-code molecular description system in which the various descriptions complement one another and enhance overall performance. Under this framework, classical SMILES can be unified as a special case of t-SMILES, achieving better-balanced performance when hybrid decomposition algorithms are used.
- It exhibits impressive performance on the low-resource datasets JNK3 and AID1706, whether the model is trained from scratch, with data augmentation, or with pre-training and fine-tuning.
- It significantly outperforms classical SMILES, DeepSMILES, SELFIES, and baseline models in goal-directed tasks.
- It outperforms previous fragment-based models and is competitive with classical SMILES and graph-based methods on Zinc, QM9, and ChEMBL.
To support the t-SMILES algorithm, we introduce a new character, '&', which acts as a tree-node placeholder when a node in the full binary tree (FBT) is not a real fragment. We also introduce a second character, '^', which separates two adjacent substructure segments in a t-SMILES string, much like the blank space that separates two words in an English sentence.
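As rough intuition for how a fragment tree is flattened with these placeholder characters, here is a minimal breadth-first serialization sketch over a toy Node class. It is not the repository's encoder; the exact rules for where '&' and '^' appear in real t-SMILES strings follow the grammar described in the papers.

```python
# Minimal sketch of breadth-first serialization of a binary fragment tree.
# NOT the repository's encoder: real t-SMILES strings place '&' (virtual
# node) and '^' (segment separator) according to the published grammar.
from collections import deque

class Node:
    def __init__(self, smiles=None, left=None, right=None):
        self.smiles = smiles   # fragment SMILES, or None for a virtual node
        self.left = left
        self.right = right

def bfs_tokens(root):
    """Return the breadth-first token sequence, using '&' for empty nodes."""
    tokens, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None or node.smiles is None:
            tokens.append('&')       # virtual / missing node
            continue
        tokens.append(node.smiles)   # real fragment
        queue.append(node.left)
        queue.append(node.right)
    return tokens

# Toy tree: a phenyl fragment whose left child is a fluorine fragment.
root = Node("*C1=CC=CC=C1", left=Node("*F"))
print(bfs_tokens(root))   # ['*C1=CC=CC=C1', '*F', '&', '&', '&']
```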
Four coding algorithms are presented in these studies:
- TSSA: t-SMILES with shared atom.
- TSDY: t-SMILES with dummy atom but without ID.
- TSID: t-SMILES with ID and dummy atom.
- TSIS: simplified TSID, including the TSIS, TSISD, TSISO, and TSISR variants.
For example, the six t-SMILES codes of Celecoxib are:
TSID_M:
- [1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&
TSDY_M (replace [n*] with *):
- *C&*C1=CC=C(*)C=C1&*C1=CC(*)=NN1*&*C(*)(F)F&*F^*C1=CC=C(*)C=C1&&*S(N)(=O)=O&&&
TSSA_M:
- CC&C1=CC=CC=C1&CC&C1=C[NH]N=C1&CN&C1=CC=CC=C1^CC^CS&C&N[SH]=O&CF&&&&FCF&&
TSIS_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O
TSISD_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[4*]F^[5*]C1=CC=C([6*])C=C1^[6*]S(N)(=O)=O
TSISO_M:
- [2*]C1=CC([3*])=NN1[5*]^[1*]C1=CC=C([2*])C=C1^[5*]C1=CC=C([6*])C=C1^[3*]C([4*])(F)F^[6*]S(N)(=O)=O^[1*]C^[4*]F
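Each '^'-separated segment of these TSIS strings is itself a valid fragment SMILES whose numbered dummy atoms [n*] mark the attachment points. The illustrative snippet below (not part of the repository) checks this with RDKit for the TSIS_M string above.

```python
# Illustrative check: every '^'-separated segment of the TSIS_M string for
# Celecoxib parses as a valid fragment SMILES in RDKit.
from rdkit import Chem

tsis_m = ("[1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^"
          "[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O")

for segment in tsis_m.split('^'):
    mol = Chem.MolFromSmiles(segment)
    print(segment, "->", "valid" if mol is not None else "invalid")
```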
Table: Results for the distribution-learning benchmarks on ChEMBL using diffusion models.
Here we provide the source code of our method.
We recommend using Anaconda to manage the Python version and installed packages.
Please make sure the following packages are installed:
- Python (version >= 3.7)
- PyTorch (version == 1.7)
- RDKit (version >= 2020.03)
- NetworkX (version >= 2.4)
- NumPy (version >= 1.19)
- Pandas (version >= 1.2.2)
- Matplotlib (version >= 2.0)
- SciPy (version >= 1.4.1)
For Datamol and r-BRICS, please download them from https://github.com/datamol-io/datamol and https://github.com/BiomedSciAI/r-BRICS, then copy them into the MolUtils folder.
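As an optional, purely illustrative sanity check (not part of the repository), the snippet below confirms that the required packages are importable and prints their installed versions.

```python
# Optional sanity check: verify that the required packages import cleanly
# and report their installed versions.
import torch, rdkit, networkx, numpy, pandas, matplotlib, scipy

for name, module in [("PyTorch", torch), ("RDKit", rdkit),
                     ("NetworkX", networkx), ("NumPy", numpy),
                     ("Pandas", pandas), ("Matplotlib", matplotlib),
                     ("SciPy", scipy)]:
    print(f"{name}: {module.__version__}")
```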
- DataSet/Graph/CNJTMol.py
  - encode_single(): a preprocessing function that generates t-SMILES from a data set.
- DataSet/Graph/CNJMolAssembler.py
  - decode_single(): reconstructs molecules from t-SMILES to generate classical SMILES.
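A hypothetical round-trip sketch is shown below. The import paths are inferred from the file layout above, and the argument and return types of encode_single() and decode_single() are assumptions, so please consult the two files for the actual interfaces.

```python
# Hypothetical round-trip sketch: import paths and function signatures are
# assumed from the file layout above; see CNJTMol.py and CNJMolAssembler.py
# for the actual interfaces.
from DataSet.Graph.CNJTMol import encode_single
from DataSet.Graph.CNJMolAssembler import decode_single

celecoxib = "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1"

t_smiles = encode_single(celecoxib)   # classical SMILES -> t-SMILES (assumed)
rebuilt = decode_single(t_smiles)     # t-SMILES -> classical SMILES (assumed)
print(t_smiles)
print(rebuilt)
```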
In this study, GPT and RNN generative models are used for evaluation.
We thank the following Git repositories, which provided a great deal of inspiration:
- hgraph2graph: https://github.com/wengong-jin/hgraph2graph
- DeepSMILES: https://github.com/baoilleach/deepsmiles
- SELFIES: https://github.com/aspuru-guzik-group/selfies
- FragDGM: https://github.com/marcopodda/fragment-based-dgm
- AttentiveFP: https://github.com/OpenDrugAI/AttentiveFP
- Guacamol: https://github.com/BenevolentAI/guacamol_baselines
- GPT2: https://github.com/samwisegamjeee/pytorch-transformers