strict PDB parsing #1966

kain88-de · 2018-07-02T14:16:52Z

Expected behaviour

We have a PDB with residue IDs that are out of order. MDAnalysis should read this PDB like any other valid PDB file. The standard doesn't say anything about ordering. It would be nice to have a strict flag that only allows standard-conforming PDBs without any of the unofficial extensions that have been added over the decades.

Actual behaviour

Because we support large PDBs with more than 9999 residues, we wrap the counter: if a new residue number is smaller than the previous one, we assume the numbering has reached 10000 or above.

Current version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)")
dev


richardjgowers · 2018-07-02T14:20:36Z

@kain88-de in what way don't we support resids out of order? Is this if you have more than 10k residues and they're also randomly arranged?

kain88-de · 2018-07-02T14:41:55Z

CRYST1  150.000  150.000  150.000  90.00  90.00  90.00 P 1           1
MODEL 1
ATOM      1 CA   MET B   1     125.516  77.887  77.186  0.00  0.00
ATOM      2 CA   GLN B   2     122.688  77.679  79.724  0.00  0.00
ATOM      3 CA   ILE B   3     119.011  78.655  79.616  0.00  0.00
ATOM      4 CA   PHE B   4     116.099  78.523  82.065  0.00  0.00
ATOM      5 CA   VAL B   5     112.851  76.834  81.073  0.00  0.00
ATOM      6 CA   LYS B   6     109.785  78.432  82.647  0.00  0.00
ATOM      7 CA   THR B   7     106.648  76.317  82.924  0.00  0.00
ATOM      8 CA   LEU B   8     103.279  78.003  82.391  0.00  0.00
ATOM      9 CA   THR B   9     102.383  76.888  85.920  0.00  0.00
ATOM     10 CA   GLY B  10     105.349  78.757  87.379  0.00  0.00
ATOM     11 CA   LYS B6999     108.642  76.867  87.561  0.00  0.00
ATOM     12 CA   THR B  12     112.048  77.801  86.140  0.00  0.00
TER
ENDMDL

In this example I do want atom 12 to have resid 12.

In [1]: u = mda.Universe('test.pdb')
In [2]: u.residues.resids
Out[2]: 
array([    1,     2,     3,     4,     5,     6,     7,     8,     9,
          10,  6999, 10012])

I tested a little bit. This doesn't happen for all resid values of atom 11, only when it reaches a value larger than 5013.

richardjgowers · 2018-07-02T15:32:57Z

Yeah, the code checks for a downward jump of more than 5,000 to guess when a resid has looped; this allows small fluctuations. We could easily add a dont-fix-resids kwarg to the parser.
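For readers following along, the heuristic described here can be sketched roughly as below. This is an illustration only, not the actual parser code: the function name unwrap_resids and the parameters max_drop and period are made up for this example, and it assumes the check compares each resid against the previous (already corrected) one.

```python
def unwrap_resids(raw_resids, max_drop=5000, period=10000):
    """Guess where 4-digit PDB resids wrapped around and undo the wrap.

    Hypothetical sketch: a downward jump of more than ``max_drop`` relative
    to the previous (corrected) resid is treated as a wrap, and ``period``
    is added until the jump is no longer that large.
    """
    fixed = []
    previous = 0
    for resid in raw_resids:
        while resid - previous < -max_drop:
            resid += period
        fixed.append(resid)
        previous = resid
    return fixed


# Resids from the PDB example in this thread: 1..10, then 6999, then 12.
print(unwrap_resids(list(range(1, 11)) + [6999, 12]))
```

With the resids from the example file, the drop from 6999 to 12 is 6987, which exceeds 5,000, so 12 becomes 10012. A drop of exactly 5,000 or less would be left alone under these assumptions.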

kain88-de · 2018-07-02T15:38:56Z

I would prefer a strict flag. I don't really like ending up with lots of tiny switches to tune the PDB reader.


orbeckst · 2018-07-03T17:28:14Z

We used to have two PDB readers... I think we disliked maintaining both, but I can see the appeal of having at least one that actually follows the standard and just fails if the input file does not follow it.

richardjgowers · 2018-07-03T17:43:52Z

We could add a strict keyword, then around all the silly tricks we can just check whether we're allowing them, i.e. if not self.strict and resid whatever..., so it stays a single Reader/Parser.
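A minimal sketch of what gating one such trick behind a strict keyword might look like. Everything here is hypothetical: read_resids is not an MDAnalysis function, and the only "trick" shown is the resid wrap guess discussed above. The resid is taken from PDB columns 23-26.

```python
def read_resids(pdb_lines, strict=False):
    """Collect resids from ATOM/HETATM records (hypothetical helper).

    With strict=True the value in columns 23-26 is taken as-is; otherwise
    a downward jump of more than 5000 is treated as a wrapped counter.
    """
    resids = []
    previous = 0
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resid = int(line[22:26])
        if not strict:
            # the permissive-only guess, isolated behind the flag
            while resid - previous < -5000:
                resid += 10000
        resids.append(resid)
        previous = resid
    return resids


example = [
    "ATOM     11 CA   LYS B6999     108.642  76.867  87.561  0.00  0.00",
    "ATOM     12 CA   THR B  12     112.048  77.801  86.140  0.00  0.00",
    "TER",
]
print(read_resids(example))               # permissive guess
print(read_resids(example, strict=True))  # resids exactly as written
```

In strict mode atom 12 keeps resid 12, which is the behaviour asked for in this issue; the default path is unchanged.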

orbeckst · 2018-07-03T20:20:46Z

If the hacks can be easily isolated in such a fashion then that's a possibility.

I admit that the purist in me wants to see clean code that does one thing, but the pragmatist realizes that code in the wild has to be useful, too, and sometimes very successful evolution isn't pretty.

kain88-de · 2018-07-04T13:25:46Z

The strict keyword has two advantages for me. It is one simple switch and its meaning is easy to guess, even for a new user. We can write a fast Cython implementation to read a standard-compliant PDB in the future (something pandas is doing with CSV). This is good if having a faster PDB reader is desirable.

Instead of a strict flag, a flavor flag would also be OK. This could, for now, have the values permissive and strict. In the future others can be added, like hybrid-36. This way we can remove a lot of the guesswork code we have right now to read a single frame. Instead it's all done on initialization.
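One way the flavor idea could look, as a rough sketch: the flavor is resolved once at initialization and selects a resid handler, so no guessing happens while reading. The class ResidReader, the handler functions and the RESID_HANDLERS table are all invented for illustration; hybrid-36 is deliberately left unregistered here.

```python
def _strict(resid, previous):
    # standard-conforming: take the resid exactly as written
    return resid


def _permissive(resid, previous):
    # guess a wrapped counter from a downward jump of more than 5000
    while resid - previous < -5000:
        resid += 10000
    return resid


# Hypothetical registry; further flavors could be added as new entries.
RESID_HANDLERS = {"strict": _strict, "permissive": _permissive}


class ResidReader:
    def __init__(self, flavor="permissive"):
        if flavor not in RESID_HANDLERS:
            raise ValueError(
                f"unknown flavor {flavor!r}; choose from {sorted(RESID_HANDLERS)}"
            )
        self.flavor = flavor
        self._handle = RESID_HANDLERS[flavor]  # decided once, at init

    def read(self, raw_resids):
        fixed, previous = [], 0
        for resid in raw_resids:
            resid = self._handle(resid, previous)
            fixed.append(resid)
            previous = resid
        return fixed


print(ResidReader("strict").read([10, 6999, 12]))
print(ResidReader("permissive").read([10, 6999, 12]))
```

A lookup table keeps this to a single user-facing switch while leaving room for more flavors later, without extra keyword arguments on the reader.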

orbeckst · 2018-07-04T23:36:26Z

Flavor sounds good. Btw, hybrid-36 is issue #1897.

orbeckst added the Format-PDB label Jul 3, 2018

Luthaf mentioned this issue Mar 28, 2019

Add Chemfiles as a coordinate reader/writer #1862

Merged

6 tasks

Luthaf mentioned this issue Jun 10, 2020

Future of chemfiles reader in MDA #2731

Open
