Allow read in of VERY large pdb files #1978

arm61 · 2018-07-09T10:35:40Z

Fixes #1897

Changes made in this Pull Request:

Set the base for int() to 36. This appears to be the correct way to read in files with more than 100 000 atoms (https://stackoverflow.com/questions/1181919/python-base-36-encoding) (http://cci.lbl.gov/hybrid_36/)

PR Checklist

Tests?
Docs? (N/A)
CHANGELOG updated?
Issue raised/referenced?

arm61 · 2018-07-09T14:19:23Z

It is not clear why the Travis build failed. I will investigate later.

richardjgowers · 2018-07-09T21:29:01Z

Hello @arm61, welcome to MDAnalysis!

I think this is failing because not all indices are base 36: the first 99,999 or so are plain decimal. I think you can add another middle layer to the existing try/except block:

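try:
    # plain decimal serial
    idx = int(thing)
except ValueError:
    try:
        # serial written in base 36
        idx = int(thing, 36)
    except ValueError:
        # wrapped serials case: fall through to the existing handling
        pass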

richardjgowers · 2018-07-09T21:29:31Z

package/CHANGELOG

+
+Fixes
+
+  * Introduced compatibility for packmol (and hopefully generally) for pbd files with


Insert this inside the existing chunk below. Also add yourself to the AUTHORS file, as it's your first contribution.

orbeckst · 2018-07-10T00:39:27Z

FYI, PROPKA contains an implementation, hybrid36.py, and we can use the code because it is published under the LGPL.

richardjgowers · 2018-07-10T01:00:29Z

Looks a little slow for calling on every single line of a PDB file

codecov · 2018-07-23T10:56:30Z

Codecov Report

Merging #1978 into develop will increase coverage by 0.01%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##           develop   #1978      +/-   ##
==========================================
+ Coverage    88.59%   88.6%   +0.01%     
==========================================
  Files          143     143              
  Lines        17361   17386      +25     
  Branches      2658    2665       +7     
==========================================
+ Hits         15381   15405      +24     
  Misses        1379    1379              
- Partials       601     602       +1

Impacted Files	Coverage Δ
package/MDAnalysis/topology/PDBParser.py	`99.41% <96.55%> (-0.59%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10fb62d...6277126. Read the comment docs.

richardjgowers · 2018-07-23T13:25:28Z

testsuite/MDAnalysisTests/topology/test_pdb.py

+
+def test_PDB_hex():
+    u = mda.Universe(StringIO(PDB_hex), format='PDB')
+    assert len(u.atoms) == 5


Looks good. Can you add a test that checks what the atom.id is, to make sure we're doing the base 36 conversion correctly?

richardjgowers · 2018-07-23T13:25:54Z

And add yourself to the AUTHORS file

kain88-de · 2018-07-23T14:23:49Z

package/CHANGELOG

@@ -65,6 +65,8 @@ Fixes
    pack_into_box() (Issue #1911)
  * Fixed format of MODEL number in PDB file writing (Issue #1950)
  * PDBWriter now properly sets start value
+  * Introduced compatibility for packmol (and hopefully generally) for pbd files with
+    greater than 100 000 atoms (Issue #1897)


We can already read those files. What you added is support for the specific hybrid36 format, which wasn't supported before. It would be nice if you could name it in the changelog, to be precise.

arm61 · 2018-07-23T14:41:34Z

It turns out that the int(n, 36) does not decode the hybrid36 format correctly. I have used the implementation found in PHENIX instead (see links in the commit message)
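To make the mismatch concrete, here is a minimal sketch for a five-character serial field. The offset arithmetic follows the hybrid-36 description linked in the commit message (http://cci.lbl.gov/hybrid_36/); the variable names are illustrative only and are not taken from the PR.

```python
# Assumption: a 5-character serial field, as used for PDB atom serials.
WIDTH = 5

serial = "A0000"  # first serial after 99999 in hybrid-36

# Plain base-36 decoding gives the raw positional value, not the atom number.
naive = int(serial, 36)

# Hybrid-36 shifts the upper-case range so that "A0000" follows 99999.
corrected = naive - 10 * 36 ** (WIDTH - 1) + 10 ** WIDTH

# int() with base 36 is also case-insensitive, so it cannot tell the
# upper-case range from the lower-case range that hybrid-36 uses next.
same_as_lower = int("a0000", 36) == naive

print(naive, corrected, same_as_lower)
```

The plain conversion is off by a fixed offset for upper-case serials, and it collapses upper and lower case into the same value, which is why a dedicated decoder is needed.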

arm61 · 2018-07-27T13:24:08Z

It is not super clear why only one of the CI instances is failing. Any input?

kain88-de · 2018-07-27T14:01:39Z

You can have a look at the log on travis.

package/MDAnalysis/topology/PDBParser.py:92: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating
package/MDAnalysis/topology/PDBParser.py:93: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating

That means you are using the built-in range. On Python 2 this returns a list, while on Python 3 it returns a lazy range object. The solution is to add from six.moves import range below the __future__ import.
The linter helps us keep the codebase Python 2/3 compatible.

richardjgowers · 2018-07-27T14:03:15Z

TBH the linter is wrong there, the range call is inside a zip, so it is iterating it. But yeah if you change to use six.moves.range it will stop complaining
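A minimal sketch of the pattern under discussion: building a digit lookup table by zipping the digit string with range(36). The import fallback is an assumption made so the snippet also runs where six is not installed; the PR itself simply adds the six.moves import.

```python
try:
    # Lazy range on both Python 2 and Python 3 (requires the 'six' package).
    from six.moves import range
except ImportError:
    # Assumption: without six we are on Python 3, where range is already lazy.
    pass

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# zip() consumes range(36) by iterating it, so no intermediate list is needed.
values = dict(zip(DIGITS, range(36)))

print(values["A"], values["Z"])
```

Either way the resulting dictionary is the same; the import only changes whether Python 2 builds a temporary list of 36 integers first.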

richardjgowers · 2018-07-27T19:05:23Z

package/MDAnalysis/topology/PDBParser.py

@@ -87,6 +87,44 @@ def float_or_default(val, default):
    except ValueError:
        return default

+digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+digits_lower = digits_upper.lower()
+digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])


Rather than having two dicts and two code paths below, could you not create a single dict with both upper and lower case in it? I.e. e and E both map to the same value?

I am not sure this would work the same way, as the upper-case characters are treated differently from the lower-case ones (line 118 vs 124), unless I am missing something that you are seeing.

arm61 · 2018-08-03T10:39:41Z

Sorry about the delay. Other things got in the way.

richardjgowers · 2018-08-08T21:35:31Z

WRT upper/lower case, if this is base 36 surely it doesn’t matter which case and we can mangle the input into either?

arm61 · 2018-08-09T09:00:42Z

From a bit of reading, I don't think this is real base 36; it is referred to as hybrid-36.

It is traditional base 36 (using upper case) until that is exhausted, then it uses lower case to make more numbers available. It is a weird monstrosity (as with all PDB formatting) that is almost a pseudo-base-62.

I am going to add some tests for the decode_pure() and hy36decode() functions today.
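The scheme described above can be sketched as follows. This is a simplified illustration based on the public hybrid-36 description (http://cci.lbl.gov/hybrid_36/), not the code merged in this PR; the names decode_base36 and hybrid36_decode are hypothetical.

```python
DIGITS_UPPER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS_LOWER = DIGITS_UPPER.lower()
UPPER_VALUES = {char: i for i, char in enumerate(DIGITS_UPPER)}
LOWER_VALUES = {char: i for i, char in enumerate(DIGITS_LOWER)}


def decode_base36(values, text):
    """Read text as a base-36 number using one digit table (hypothetical helper)."""
    result = 0
    for char in text:
        if char not in values:
            raise ValueError("invalid hybrid-36 digit: %r" % char)
        result = result * 36 + values[char]
    return result


def hybrid36_decode(width, text):
    """Decode a fixed-width hybrid-36 field into an integer (sketch only)."""
    if len(text) != width:
        raise ValueError("expected %d characters, got %r" % (width, text))
    first = text[0]
    if first in " -0123456789":
        # Decimal range: the field is an ordinary integer.
        return int(text)
    if first in UPPER_VALUES:
        # Upper-case range: continues directly after the last decimal value.
        return decode_base36(UPPER_VALUES, text) - 10 * 36 ** (width - 1) + 10 ** width
    if first in LOWER_VALUES:
        # Lower-case range: continues after the last upper-case value.
        return decode_base36(LOWER_VALUES, text) + 16 * 36 ** (width - 1) + 10 ** width
    raise ValueError("invalid hybrid-36 field: %r" % text)


for field in ("99999", "A0000", "ZZZZZ", "a0000", "zzzzz"):
    print(field, hybrid36_decode(5, field))
```

The boundary values printed by the loop are the natural candidates for test cases, since each one sits at the edge of the decimal, upper-case, or lower-case range.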

richardjgowers · 2018-08-09T13:28:20Z

@arm61 ewww ok. But yeah, if you can add some tests for values that hit all the different possibilities. You can use a pytest.mark.parametrize which lets you write a list of values and it turns into lots of individual tests, eg here it loops over different residue name parsing tests: https://github.com/MDAnalysis/mdanalysis/blob/develop/testsuite/MDAnalysisTests/lib/test_util.py#L88

arm61 · 2018-08-10T10:01:19Z

I think those tests are pretty comprehensive. Also I agree, the pdb format is gross.

jbarnoud · 2018-08-10T11:57:45Z

package/MDAnalysis/topology/PDBParser.py

+digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+digits_lower = digits_upper.lower()
+digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
+digits_lower_values = dict([pair for pair in zip(digits_lower, range(36))])


These are constants in the global name space, they should be CAPITALIZED_WITH_UNDERSCORES.

Resolved in 02bf546

richardjgowers · 2018-08-10T13:38:24Z

Looking through the coverage diff it looks like we can't reach the exceptions in the decode function (probably because we're handling them before the function). I'd just remove them

richardjgowers · 2018-08-10T16:59:29Z

Awesome, thanks @arm61 !

arm61 added 2 commits July 9, 2018 11:25

A possible resolution to issue #1897

352de1e

updated changelog

ac78fc3

richardjgowers reviewed Jul 9, 2018

richardjgowers self-assigned this Jul 12, 2018

Second attempt at fix

3541f92

arm61 closed this Jul 23, 2018

arm61 reopened this Jul 23, 2018

arm61 added 2 commits July 23, 2018 10:56

rearrangement

53ada5f

update changelog

53873e1

richardjgowers requested changes Jul 23, 2018

kain88-de reviewed Jul 23, 2018

Test atom[i].id is now present. Designing this test showed that the int(n, 36) method was not sufficient; instead, the more involved process used in PHENIX (http://cci.lbl.gov/hybrid_36/) (http://cci.lbl.gov/cctbx_sources/iotbx/pdb/hybrid_36.py) has been used

d8e0f48

arm61 closed this Jul 24, 2018

arm61 reopened this Jul 24, 2018

richardjgowers reviewed Jul 27, 2018

arm61 added 2 commits August 3, 2018 11:32

add six.moves import range to enable py2 compatibility

b7ec253

Merge branch 'develop' into develop

dac596e

Andrew McCluskey added 2 commits August 7, 2018 08:23

Merge branch 'develop' into develop

f693ed3

Merge branch 'develop' into develop

4dc8cca

arm61 added 2 commits August 10, 2018 10:59

Added tests for hy36decode, which also cover decode_pure

7d5ae61

Merge branch 'develop' of github.com:arm61/mdanalysis into develop

d47546c

jbarnoud reviewed Aug 10, 2018

Capitalisation of constants in global name space

02bf546

Remove exceptions

6277126

richardjgowers approved these changes Aug 10, 2018

richardjgowers merged commit dbad72c into MDAnalysis:develop Aug 10, 2018
