Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strict PDB parsing #1966

Open
kain88-de opened this issue Jul 2, 2018 · 9 comments
Open

strict PDB parsing #1966

kain88-de opened this issue Jul 2, 2018 · 9 comments

Comments

@kain88-de
Copy link
Member

Expected behaviour

We have a PDB with residues IDS that are out of order. MDAnalysis should read this PDB like any other valid PDB file. The standard doesn't say anything about ordering. It would be nice to have a strict flag that only allows standard conform PDBs without any of the inofficial extensions that have been added over the decades.

Actual behaviour

Because we support large PDBs with more then 9999 residues we reset the counter and assume that if a new residue number is smaller then the last that we reached a number equal or above 10000.

Currently version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)")
dev

@richardjgowers
Copy link
Member

@kain88-de in what way don't we support resids out of order? Is this if you have more than 10k residues and they're also randomly arranged?

@kain88-de
Copy link
Member Author

CRYST1  150.000  150.000  150.000  90.00  90.00  90.00 P 1           1
MODEL 1
ATOM      1 CA   MET B   1     125.516  77.887  77.186  0.00  0.00
ATOM      2 CA   GLN B   2     122.688  77.679  79.724  0.00  0.00
ATOM      3 CA   ILE B   3     119.011  78.655  79.616  0.00  0.00
ATOM      4 CA   PHE B   4     116.099  78.523  82.065  0.00  0.00
ATOM      5 CA   VAL B   5     112.851  76.834  81.073  0.00  0.00
ATOM      6 CA   LYS B   6     109.785  78.432  82.647  0.00  0.00
ATOM      7 CA   THR B   7     106.648  76.317  82.924  0.00  0.00
ATOM      8 CA   LEU B   8     103.279  78.003  82.391  0.00  0.00
ATOM      9 CA   THR B   9     102.383  76.888  85.920  0.00  0.00
ATOM     10 CA   GLY B  10     105.349  78.757  87.379  0.00  0.00
ATOM     11 CA   LYS B6999     108.642  76.867  87.561  0.00  0.00
ATOM     12 CA   THR B  12     112.048  77.801  86.140  0.00  0.00
TER
ENDMDL

In this example I do want atom 12 to have resid 12.

ln [1]: u = mda.Universe(test.pdb)
In [2]: u.residues.resids
Out[2]: 
array([    1,     2,     3,     4,     5,     6,     7,     8,     9,
          10,  6999, 10012])

I tested a little bit. This doesn't happen for all resid values of atom 11. Only when reaches a value larger than 5013.

@richardjgowers
Copy link
Member

Yeah the code checks for a downwards jump of greater than 5,000 to guess when a resid has looped, this allows small fluctuations. Could easily add a dont-fix-resids kwarg to the parser

@kain88-de
Copy link
Member Author

kain88-de commented Jul 2, 2018 via email

@orbeckst
Copy link
Member

orbeckst commented Jul 3, 2018

We used to have two PDB readers... I think we disliked maintaining both but I can see the appeal of having at least one that actually follows the standard and just fails if the input file does not follow.

@richardjgowers
Copy link
Member

we could add a strict keyword, then around all the silly tricks we can just check if we're allowing them, ie if not self.strict and resid whatever..., then it's just a single Reader/Parser

@orbeckst
Copy link
Member

orbeckst commented Jul 3, 2018

If the hacks can be easily isolated in such a fashion then that's a possibility.

I admit that the purist in me wants to see clean code that does one thing but the pragmatist realizes that code in the wild has to be useful, too, and sometimes very successful evolution isn't pretty:

Nacktmull (Heterocephalus glaber)

@kain88-de
Copy link
Member Author

The strict keyword has two advantages for me. It is one simple switch and it's meaning is easy to guess also for a new user. We can write a fast cython implementation to read a standard compliant PDB in the future (something pandas is doing with csv). This is good if having a faster PDB reader is desirable.

Instead of a strict flag I a flavor flag would also be OK. This could, for now, have the values permissive and strict. But in the future others can be added like hybrid-36. This way we can remove a lot of the guesswork code we have right now to read a single frame. Instead it's all done on initialization.

@orbeckst
Copy link
Member

orbeckst commented Jul 4, 2018

Flavor sounds good, btw hybrid-36 is issue #1897 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants