Allow read in of VERY large pdb files #1978
Not clear why Travis failed. Will investigate later.
Hello @arm61, welcome to MDAnalysis! I think this is failing because not all indices are base 36; the first 99,999 or so are plain decimal. I think you can add another middle layer to the try/except block that exists...

    try:
        idx = int(thing)
    except:
        try:
            idx = int(thing, 36)
        except:
            # wrapped serials case
package/CHANGELOG
Outdated
Fixes
  * Introduced compatibility for packmol (and hopefully generally) for pbd files with
Insert this inside the existing chunk below. Also add yourself to the AUTHORS file, as it's your first contribution.
FYI, PROPKA contains a hybrid-36 implementation, hybrid36.py, and we can use the code because it is published under the LGPL.
Looks a little slow for calling on every single line of a PDB file.
Codecov Report
@@ Coverage Diff @@
## develop #1978 +/- ##
==========================================
+ Coverage 88.59% 88.6% +0.01%
==========================================
Files 143 143
Lines 17361 17386 +25
Branches 2658 2665 +7
==========================================
+ Hits 15381 15405 +24
Misses 1379 1379
- Partials 601 602 +1
def test_PDB_hex():
    u = mda.Universe(StringIO(PDB_hex), format='PDB')
    assert len(u.atoms) == 5
Looks good. Can you add a test that checks what the atom.id is, to make sure we're correctly doing the base 36 conversion? And add yourself to the AUTHORS file.
package/CHANGELOG
Outdated
@@ -65,6 +65,8 @@ Fixes
    pack_into_box() (Issue #1911)
  * Fixed format of MODEL number in PDB file writing (Issue #1950)
  * PDBWriter now properly sets start value
  * Introduced compatibility for packmol (and hopefully generally) for pbd files with
    greater than 100 000 atoms (Issue #1897)
We do read those files already now. You added the specific hybrid36 format that wasn't supported. It would be nice if you can name it to be precise in the changelog.
…n, 36) method was not sufficient instead the more involved process used in PHENIX (http://cci.lbl.gov/hybrid_36/) (http://cci.lbl.gov/cctbx_sources/iotbx/pdb/hybrid_36.py) has been used
It turns out that int(n, 36) does not decode the hybrid36 format correctly. I have used the implementation found in PHENIX instead (see links in the commit message).
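A minimal sketch of the mismatch described above, assuming a 5-character atom serial field. The offset arithmetic follows the published hybrid-36 scheme as I understand it; it is an illustration, not the code merged in this PR.

```python
# Plain base-36 parsing of the first serial past 99999 in a hybrid-36 file.
plain = int("A0000", 36)              # equals 10 * 36**4

# Hybrid-36 (width 5) re-bases the upper-case range so that it continues
# directly after the decimal range, i.e. "A0000" means 100000.
hybrid = plain - 10 * 36**4 + 10**5

print(plain, hybrid)                  # 16796160 100000
```

A bare `int(n, 36)` fallback would therefore parse the field without raising, but return a serial that is off by more than sixteen million.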
Not super clear why only one of the CI instances is failing. Any input?
You can have a look at the log on Travis.
That means you are using a normal
TBH the linter is wrong there: the range call is inside a zip, so it is iterating it. But yeah, if you change to use six.moves.range it will stop complaining.
@@ -87,6 +87,44 @@ def float_or_default(val, default):
    except ValueError:
        return default


digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
Rather than having two dicts, and two code paths below, could you not create a dict with both upper and lower case in it? I.e. e and E both map to whatever value?
I am not sure this would necessarily work the same as the upper case are treated differently from the lower case (line 118 vs 124). Unless I am not seeing something you are.
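To make the case question concrete, here is a small sketch assuming a 5-character field and the hybrid-36 offsets from the public reference description. The variable names are illustrative only.

```python
# Built-in base-36 parsing folds case: both spellings give the same integer.
upper = int("A0000", 36)
lower = int("a0000", 36)
print(upper == lower)                 # True

# Under hybrid-36 (width 5) the two fields denote different serials: the
# lower-case range only begins once the upper-case range is exhausted.
hy_upper = upper - 10 * 36**4 + 10**5
hy_lower = lower + 16 * 36**4 + 10**5
print(hy_upper, hy_lower)             # 100000 43770016
```

A single lookup table that maps `e` and `E` to the same digit would still need to know which range the field belongs to, which is why the two code paths exist.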
Sorry about the delay. Other things got in the way.
WRT upper/lower case, if this is base 36 surely it doesn’t matter which case and we can mangle the input into either?
From a bit of reading, I don't think this is real base 36; it is referred to as hybrid-36. It is traditional base 36 (using upper case) until that is exhausted, then it uses the lower case to basically make more numbers available. It is a weird monstrosity (as with all pdb formatting) that is almost a pseudo-base62. Going to put some tests for the
@arm61 ewww ok. But yeah, if you can add some tests for values that hit all the different possibilities. You can use a
I think those tests are pretty comprehensive. Also I agree, the pdb format is gross.
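The three ranges discussed in this thread (decimal, then upper case, then lower case) can be sketched as a small decoder. This is an illustrative reimplementation under the assumption that the offsets match the public hybrid-36 description; `hy36decode_sketch` and `_pure` are hypothetical names, not the functions added in this PR, and blank fields are not handled.

```python
UPPER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = UPPER.lower()


def _pure(s, digits):
    """Decode s as base 36 using exactly one digit table (no case folding)."""
    value = 0
    for ch in s:
        # str.index raises ValueError if ch is not in this table,
        # so mixed-case fields are rejected.
        value = value * 36 + digits.index(ch)
    return value


def hy36decode_sketch(width, s):
    """Decode a hybrid-36 field of the given width (sketch only)."""
    if len(s) != width:
        raise ValueError("field has the wrong width")
    first = s[0]
    if first.isdigit() or first in "- ":
        return int(s)                               # plain decimal range
    if first in UPPER:
        return _pure(s, UPPER) - 10 * 36 ** (width - 1) + 10 ** width
    if first in LOWER:
        return _pure(s, LOWER) + 16 * 36 ** (width - 1) + 10 ** width
    raise ValueError("invalid hybrid-36 literal")


# One value from each range, plus the boundaries between them.
for field in ("99999", "A0000", "ZZZZZ", "a0000"):
    print(field, hy36decode_sketch(5, field))
```

Tests along these lines, with one value per range and the two boundaries, are the kind that "hit all the different possibilities".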
digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
digits_lower_values = dict([pair for pair in zip(digits_lower, range(36))])
These are constants in the global namespace, so they should be CAPITALIZED_WITH_UNDERSCORES.
Resolved in 02bf546
Looking through the coverage diff, it looks like we can't reach the exceptions in the decode function (probably because we're handling them before the function). I'd just remove them.
Awesome, thanks @arm61!
Fixes #1897
Changes made in this Pull Request:
PR Checklist