Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow read in of VERY large pdb files #1978

Merged
merged 14 commits into from
Aug 10, 2018
Merged

Allow read in of VERY large pdb files #1978

merged 14 commits into from
Aug 10, 2018

Conversation

arm61
Copy link
Contributor

@arm61 arm61 commented Jul 9, 2018

Fixes #1897

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs? (N/A)
  • CHANGELOG updated?
  • Issue raised/referenced?

@arm61
Copy link
Contributor Author

arm61 commented Jul 9, 2018

Not clear why the travis failed. Will investigate later.

@richardjgowers
Copy link
Member

Hello @arm61 , welcome to MDAnalysis!

I think this is failing because not all indices are base 36, just the first 99,999 or so. I think you can add another middle layer to the try/except block that exists...

try:
    idx = int(thing)
except:
    try:
        idx = int(thing, 36)
    except:
        # wrapped serials case  


Fixes

* Introduced compatibility for packmol (and hopefully generally) for pbd files with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

insert this inside the existing chunk below, also add yourself to the AUTHORS file as it's your first contribution

@orbeckst
Copy link
Member

FYI, PROPKA contains an implementation hybrid36.py and we can use the code because it is published under LGPL.

@richardjgowers
Copy link
Member

Looks a little slow for calling on every single line of a PDB file

@richardjgowers richardjgowers self-assigned this Jul 12, 2018
@arm61 arm61 closed this Jul 23, 2018
@arm61 arm61 reopened this Jul 23, 2018
@codecov
Copy link

codecov bot commented Jul 23, 2018

Codecov Report

Merging #1978 into develop will increase coverage by 0.01%.
The diff coverage is 96.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop   #1978      +/-   ##
==========================================
+ Coverage    88.59%   88.6%   +0.01%     
==========================================
  Files          143     143              
  Lines        17361   17386      +25     
  Branches      2658    2665       +7     
==========================================
+ Hits         15381   15405      +24     
  Misses        1379    1379              
- Partials       601     602       +1
Impacted Files Coverage Δ
package/MDAnalysis/topology/PDBParser.py 99.41% <96.55%> (-0.59%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10fb62d...6277126. Read the comment docs.


def test_PDB_hex():
u = mda.Universe(StringIO(PDB_hex), format='PDB')
assert len(u.atoms) == 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can you add a test that checks what the atom.id is to make sure we're correctly doing the base 36 conversion

@richardjgowers
Copy link
Member

And add yourself to the AUTHORS file

@@ -65,6 +65,8 @@ Fixes
pack_into_box() (Issue #1911)
* Fixed format of MODEL number in PDB file writing (Issue #1950)
* PDBWriter now properly sets start value
* Introduced compatibility for packmol (and hopefully generally) for pbd files with
greater than 100 000 atoms (Issue #1897)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do read those files already now. You added the specific hybrid36 format that wasn't supported. It would be nice if you can name it to be precise in the changelog.

…n, 36) method was not sufficient instead the more involved process used in PHENIX (http://cci.lbl.gov/hybrid_36/) (http://cci.lbl.gov/cctbx_sources/iotbx/pdb/hybrid_36.py) has been used
@arm61
Copy link
Contributor Author

arm61 commented Jul 23, 2018

It turns out that the int(n, 36) does not decode the hybrid36 format correctly. I have used the implementation found in PHENIX instead (see links in the commit message)

@arm61 arm61 closed this Jul 24, 2018
@arm61 arm61 reopened this Jul 24, 2018
@arm61
Copy link
Contributor Author

arm61 commented Jul 27, 2018

Not super clear why only one of the ci instances is failing, any input?

@kain88-de
Copy link
Member

You can have a look at the log on travis.

package/MDAnalysis/topology/PDBParser.py:92: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating
package/MDAnalysis/topology/PDBParser.py:93: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating

That means you are using a normal range. On Python2 this will return a list and a generator on python3. The solution is to add from six.moves import range below the future import.
The linter helps us to keep the codebase python2/3 compatible.

@richardjgowers
Copy link
Member

TBH the linter is wrong there, the range call is inside a zip, so it is iterating it. But yeah if you change to use six.moves.range it will stop complaining

@@ -87,6 +87,44 @@ def float_or_default(val, default):
except ValueError:
return default

digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than having two dicts, and two code paths below, could you not create a dict with both upper and lower case in it? Ie e and E both map to whatever value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this would necessarily work the same as the upper case are treated differently from the lower case (line 118 vs 124). Unless I am not seeing something you are.

@arm61
Copy link
Contributor Author

arm61 commented Aug 3, 2018

Sorry about the delay. other things got in the way.

@richardjgowers
Copy link
Member

WRT upper/lower case, if this is base 36 surely it doesn’t matter which case and we can mangle the input into either?

@arm61
Copy link
Contributor Author

arm61 commented Aug 9, 2018

From a bit of reading, I don't think this is real base36, it is referred to as hybrid-36.

It is traditional base 36 (using upper case) until that is exhausted, then it uses the lower case to basically make more numbers available. It is a weird monstrosity (as with all pdb formatting) that is a pseudo-base62 almost.

Going to put some tests for the decode_pure() and hy36decode() functions today.

@richardjgowers
Copy link
Member

@arm61 ewww ok. But yeah, if you can add some tests for values that hit all the different possibilities. You can use a pytest.mark.parametrize which lets you write a list of values and it turns into lots of individual tests, eg here it loops over different residue name parsing tests: https://github.com/MDAnalysis/mdanalysis/blob/develop/testsuite/MDAnalysisTests/lib/test_util.py#L88

@arm61
Copy link
Contributor Author

arm61 commented Aug 10, 2018

I think those tests are pretty comprehensive. Also I agree, the pdb format is gross.

digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
digits_lower_values = dict([pair for pair in zip(digits_lower, range(36))])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are constants in the global name space, they should be CAPITALIZED_WITH_UNDERSCORES.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 02bf546

@richardjgowers
Copy link
Member

Looking through the coverage diff it looks like we can't reach the exceptions in the decode function (probably because we're handling them before the function). I'd just remove them

@richardjgowers richardjgowers merged commit dbad72c into MDAnalysis:develop Aug 10, 2018
@richardjgowers
Copy link
Member

Awesome, thanks @arm61 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants