Potential bug in kekulization #55

adamoyoung · 2021-07-30T17:58:16Z

I was using `selfies.encoder` with a non-kekulized SMILES string, NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O, and I got the error `Encoding error 'NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O': kekulization algorithm failed`.

However, I am able to kekulize the string with RDKit, using rdkit.Chem.Kekulize(mol); rdkit.Chem.MolToSmiles(mol,kekuleSmiles=True). The resulting SMILES string is NC(=O)C1=CC=CC2=C1C1=CC=C(C=C1)NC2=O, which can then be encoded as the SELFIES string [N][C][Branch1_2][C][=O][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][C][=C][Branch1_1][Branch1_1][C][=C][Ring1][Branch1_2][N][C][Ring1][Branch2_3][=O] without error. I am wondering whether this is expected behaviour or possibly a bug. I understand that kekulization algorithms can sometimes produce different results.

I am using Python 3.7.10, selfies 1.0.3, and RDKit 2018.09.3.

MarioKrenn6240 · 2021-07-31T19:01:28Z

Thank you, I can confirm this bug and we will look into it ASAP.

robpollice · 2021-08-06T21:25:33Z

Dear adamoyoung,
We identified the problem and, after extensive discussion, also have a solution. The problem with this SMILES string is that it misuses aromatic characters: an aromatic character does not make sense when it is connected on both ends with explicit bonds. The solution will be to convert aromatic characters that have explicit bonds on both ends into non-aromatic characters. This is also how RDKit handles this SMILES string, for instance. We will add the solution to the repo after the upcoming SELFIES Workshop next week, as we do not want to make any changes before then. You are cordially invited to attend. You will find more details about the workshop here: https://accelerationconsortium.substack.com/p/selfies-workshop-aug-13
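
As a rough illustration of the idea described above (not the actual selfies implementation), the conversion could be sketched with a regular expression. The helper name `dearomatize_bonded_atoms` is hypothetical, and the sketch assumes that "explicit bonds on both ends" means a `-` written immediately before and immediately after an unbracketed, single-letter aromatic symbol. It ignores bracket atoms, two-letter aromatic symbols, and bonds separated from the atom by ring-closure digits or branches.

```python
import re

# Single-letter aromatic symbols from the SMILES organic subset.
AROMATIC = "bcnops"

# Naive pattern (an assumption, not the selfies parser): an aromatic symbol
# with an explicit single bond "-" directly before and directly after it.
_PATTERN = re.compile(r"(?<=-)[" + AROMATIC + r"](?=-)")


def dearomatize_bonded_atoms(smiles: str) -> str:
    """Uppercase aromatic symbols flanked by explicit '-' bonds (hypothetical helper)."""
    return _PATTERN.sub(lambda m: m.group(0).upper(), smiles)


problem = "NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O"
print(dearomatize_bonded_atoms(problem))
```

On the string from this issue, only the `n` and `c` in the `-n-c-2=O` tail are rewritten; the atoms of the two benzene rings are left aromatic because none of them has a `-` written on both sides.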

adamoyoung · 2021-08-09T16:43:44Z

Interesting! Thanks for the fix and the invite!

The string in question was one I found on PubChem, original_str == C1=CC2=C(C3=CC=C(C=C3)NC2=O)C(=C1)C(=O)N. When you convert this string into an RDKit mol object with mol = MolFromSmiles(original_str), then convert it back to a string with new_str = MolToSmiles(mol,canonical=True,isomericSmiles=False,kekuleSmiles=False), you end up with new_str == NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O, i.e. the problematic string.

alstonlo · 2021-10-23T20:43:27Z

Hi @adamoyoung,

We have implemented a more flexible SMILES parser in selfies v2.0.0, such that the problem SMILES string (and others like it) are now accepted. For example,

import selfies as sf
from rdkit import Chem

original = "NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O"
decoded = sf.decoder(sf.encoder(original))  # round trip: SMILES -> SELFIES -> SMILES
print(Chem.CanonSmiles(original) == Chem.CanonSmiles(decoded))  # now True!

Thanks for the bug report!

MarioKrenn6240 added the bug Something isn't working label Jul 31, 2021

MarioKrenn6240 closed this as completed Oct 28, 2021
