Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug in kekulization #55

Closed
adamoyoung opened this issue Jul 30, 2021 · 4 comments
Closed

Potential bug in kekulization #55

adamoyoung opened this issue Jul 30, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@adamoyoung
Copy link

adamoyoung commented Jul 30, 2021

I was using selfies.encoder with a non-kekulized smiles string NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O, and I got the error `Encoding error 'NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O': kekulization algorithm failed'.

However, I am able to kekulize the string with rdkit, using rdkit.Chem.Kekulize(mol); rdkit.Chem.MolToSmiles(mol,kekuleSmiles=True). The resulting smiles string is NC(=O)C1=CC=CC2=C1C1=CC=C(C=C1)NC2=O which can then be encoded as the selfies string [N][C][Branch1_2][C][=O][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][C][=C][Branch1_1][Branch1_1][C][=C][Ring1][Branch1_2][N][C][Ring1][Branch2_3][=O] without error. I am just wondering if this is expected behaviour or a possibly a bug, I understand that kekulization algorithms sometimes can produce different results.

I am using python 3.7.10, selfies 1.0.3, and rdkit 2018.09.3

@MarioKrenn6240
Copy link
Collaborator

Thank you, i can confirm this bug and we will look into it asap.

@MarioKrenn6240 MarioKrenn6240 added the bug Something isn't working label Jul 31, 2021
@robpollice
Copy link
Contributor

Dear adamoyoung,
We identified the problem and after extensive discussion also have a solution. The problem with this SMILES string is that it is misusing aromatic characters. The point is that aromatic characters do not make sense when you connect it on both ends with explicit bonds. The solution will be to convert the aromatic characters that have explicit bonds on both ends into non-aromatic characters. That also is how rdkit handles this SMILES for instance. We will add the solution to the repo after the upcoming SELFIES Workshop next week as we do not want to make any changes before. You are cordially invited to attend. You will find more details about the workshop here: https://accelerationconsortium.substack.com/p/selfies-workshop-aug-13

@adamoyoung
Copy link
Author

Interesting! Thanks for the fix and the invite!

The string in question was one I found from PubChem, original_str == C1=CC2=C(C3=CC=C(C=C3)NC2=O)C(=C1)C(=O)N. When you convert this string into an RDKit mol object with mol = MolFromSmiles(original_str), then convert it back to a string with new_str = MolToSmiles(mol,canonical=True,isomericSmiles=False,kekuleSmiles=False) you end up with new_str == NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O, aka the problematic string.

@alstonlo
Copy link
Collaborator

Hi @adamoyoung,

We have implemented a more flexible SMILES parser in selfies v2.0.0, such that the problem SMILES string (and others like it) are now accepted. For example,

original = "NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O"
decoded = sf.decoder(sf.encoder(original))
print(Chem.CanonSmiles(original) == Chem.CanonSmiles(decoded))  # now True!

Thanks for the bug report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants