Faster name selections #2755
Conversation
This is a different way of storing atom names in the topology: each atom holds a small integer index into an array of unique names. This is then beneficial for quickly matching strings (i.e. names), as you match a pattern against the set of unique names to yield integers, then quickly match integers. There's an assumption that the number of unique atom names is very small compared to the number of atoms in the system, i.e. atom names are repeated often. I did some timing with this sequence of actions:

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import GRO

u = mda.Universe(GRO)
names = u.atoms.names
ag = u.select_atoms('name O*')
ag.names = 'X'
```

I think I can probably make Universe creation faster by rewriting how the Parsers are building the AtomName attribute (i.e. build the lookup array as I go). I think atoms are selected more often than they are set, so I'm not too concerned about the performance drop in attribute setting. In a very hand-wavy way, I think this will also help serialisation efforts, as this is more compact to store/send.
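To make the scheme concrete, here is a minimal sketch of the idea with made-up data (an illustration only, not the actual implementation):

```python
import fnmatch
import numpy as np

# the topology stores one small array of unique names...
name_lookup = np.array(['N', 'CA', 'C', 'O'], dtype=object)
# ...and one integer per atom indexing into it
nmidx = np.array([0, 1, 2, 3, 1, 3])

# a selection like 'name O*' first matches the pattern against the few
# unique names to get the wanted indices...
wanted = [i for i, n in enumerate(name_lookup)
          if fnmatch.fnmatchcase(n, 'O*')]
# ...then picks atoms by fast integer comparison instead of per-atom
# string matching
mask = np.isin(nmidx, wanted)
print(np.where(mask)[0])  # atom indices whose names match 'O*': [3 5]
```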
So if @MDAnalysis/coredevs could try this on their typical workloads and see if it works for them.
Codecov Report

```
@@           Coverage Diff            @@
##           develop    #2755   +/-   ##
===========================================
+ Coverage    92.97%   93.01%   +0.03%
===========================================
  Files          187      187
  Lines        24787    24937     +150
  Branches      3231     3258      +27
===========================================
+ Hits         23046    23195     +149
- Misses        1693     1694       +1
  Partials        48       48
```
@IAlibay as a way of slowing you down from making more PRs, can you review this? Is it understandable/maintainable and does it show performance improvements on your typical systems?
Looks really good to me, my main worry was that selection might not work if atoms were renamed but it all works :) Would be good to add a test to check that post-rename selection still holds up (there might already be one but I couldn't find it).
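For what it's worth, a sketch of the kind of test being suggested (illustrative only, not an existing MDAnalysis test; it assumes the GRO test file contains CA atoms):

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import GRO

def test_select_after_rename():
    u = mda.Universe(GRO)
    ag = u.select_atoms('name CA')
    ag.names = 'XYZ'  # rename; the name lookup table must pick this up
    # selections after renaming should reflect the new names
    assert len(u.select_atoms('name XYZ')) == len(ag)
    assert len(u.select_atoms('name CA')) == 0
```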
See below for benchmark data: on my million-atom prmtop system we get a ~337x speedup for `u.select_atoms("name CA")`.
Universe creation is pretty much within error of what's on develop, so the gain is in my opinion worth it.
It might be good to try to expand this to resnames: as can be seen from the "select all water oxygens" case, there seems to be a high cost to matching residue names too.
Benchmark data (million-atom system):

Universe creation:

- This code: 11.1 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Current develop: 10.9 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a basic "select all alpha carbons", code used:

```python
def select_ca(u):
    ca = u.select_atoms('name CA')
```

- This code: 2.83 ms ± 66.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- Current develop: 956 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a basic "select all the water oxygens", code used:

```python
def select_water_oxygens(u):
    w_oh = u.select_atoms('resname WAT and name O')
```

- This code: 1.09 s ± 8.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Current develop: 1.98 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Rename waters loop, code used:

```python
def rename_waters(u):
    u_new = mda.Universe('large.prmtop')
    water_o = u_new.select_atoms('name O and resname WAT')
    water_h = u_new.select_atoms('name H* and resname WAT')
    water_o.atoms.names = "OW"
    water_h.atoms.names = "HW"
```

- This code: 14.2 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Current develop: 16 s ± 281 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On these lines of the diff:

```python
self.nmidx[i] = nextidx
...
self.name_lookup = np.array(name_lookup, dtype=object)
```
Rather than doing the individual append operations, it might be cheaper/cleaner to just do `np.array([i for i in namedict.keys()], dtype=object)` here?
Yeah there's also possibly a way to exploit the guaranteed ordering of the keys that I'm not exploiting here.
@richardjgowers is this optimization something you still want to look into?
I realise in practice this isn't going to happen over many iterations, but `append` can be quite slow, especially if you're just doing a numpy conversion at the end anyway. If there's any way to avoid it (and make the code look cleaner at the same time), I would encourage it (same thing for the similar appends below).
I don't think that with this code you know in advance what size `name_lookup` will be, and finding out in advance will be relatively costly. I benchmarked a typical system:

```python
import numpy as np
import MDAnalysis as mda
from MDAnalysisTests import datafiles as data

u = mda.Universe(data.GRO_MEMPROT)
%timeit build_name_index_numpy(u.atoms.names)
```

with `len(set(u.atoms.names)) == 219` different atom names. Just finding out how many different names there are:

```python
%timeit len(set(u.atoms.names))
```

costs a sizable 2.3 ms (on my ageing laptop):

2.29 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
How does this compare to what we have?
Current code
The current code (@richardjgowers's version):
```python
def build_name_index_append(vals):
    namedict = dict()  # maps str to nmidx
    name_lookup = []   # maps idx to str
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'
    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            nextidx = len(namedict)
            namedict[val] = nextidx
            name_lookup.append(val)
            nmidx[i] = nextidx
    name_lookup = np.array(name_lookup, dtype=object)
    values = name_lookup[nmidx]
    return namedict, nmidx, name_lookup, values
```
is already pretty fast:

```python
%timeit build_name_index_append(u.atoms.names)
```

gives the speed to beat:

9.91 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note how `len(set(vals))` alone would already account for 1/5 of the total run time.
Replace append with pre-allocated array

Following @IAlibay's suggestion and replacing append with a pre-allocated array:
```python
def build_name_index_insert(vals):
    namedict = dict()  # maps str to nmidx
    name_lookup = np.empty(len(set(vals)), dtype=object)  # maps idx to str
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'
    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
    nextidx = 0
    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            namedict[val] = nextidx
            name_lookup[nextidx] = val
            nmidx[i] = nextidx
            nextidx += 1
    values = name_lookup[nmidx]
    return namedict, nmidx, name_lookup, values
```
gives

```python
%timeit build_name_index_insert(u.atoms.names)
```

11.1 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is a tad slower than the version with append.
So I would say unless we expect O(10^3) different string values, the append version is almost certainly ok.
NumPy
I tried out a "numpy" version:

```python
def build_name_index_numpy(vals):
    name_lookup = np.unique(vals)
    namedict = {name: idx for idx, name in enumerate(name_lookup)}
    nmidx = np.zeros_like(vals, dtype=int)
    for name in name_lookup:
        nmidx[vals == name] = namedict[name]
    values = name_lookup[nmidx]
    return namedict, nmidx, name_lookup, values
```
but that gave an abysmal

240 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is 24 times slower than the try/except/append version. You can shave off 60 ms by replacing `np.unique` with `name_lookup = np.array(list(set(vals)))`, but the >200 repeated assignments `nmidx[vals == name] = namedict[name]` kill performance:

```python
name = "H1"
vals = u.atoms.names
%timeit nmidx[vals == name] = namedict[name]
```

752 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Maybe if there was a way to do all assignments in one go, one would have a hope of doing better. But even just building the dict itself:

```python
%timeit namedict = {name: idx for idx, name in enumerate(name_lookup)}
```

19.5 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

is slower than the existing code, so I think it's going to be tough to do much better; maybe someone with better numpy skills than I can see it.
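One possible way to do all the assignments in one go, for what it's worth (a sketch under the same assumptions, not benchmarked here): `np.unique` can return the per-element inverse indices directly.

```python
import numpy as np

def build_name_index_unique(vals):
    # np.unique computes the sorted unique names AND, with
    # return_inverse=True, each atom's index into that array in one call,
    # avoiding the per-name boolean-mask assignments entirely
    name_lookup, nmidx = np.unique(vals, return_inverse=True)
    namedict = {name: idx for idx, name in enumerate(name_lookup)}
    values = name_lookup[nmidx]
    return namedict, nmidx, name_lookup, values
```

Note that the indices follow the sorted order of the names rather than first-seen order, so the labelling differs from the append version, but the outputs are internally consistent.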
So my suggestion was actually to do this:
```python
def build_name_index_append(vals):
    namedict = dict()  # maps str to nmidx
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'
    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            nextidx = len(namedict)
            namedict[val] = nextidx
            nmidx[i] = nextidx
    name_lookup = np.array([i for i in namedict.keys()], dtype=object)
    values = name_lookup[nmidx]
    return namedict, nmidx, name_lookup, values
```
The idea here was that if you end up having to do lots of irregular append operations in the original construct, you could avoid some of the memory-access overheads (plus it looks a bit nicer imho).
However, in practice that idea doesn't hold up, in part because the KeyError is triggered so infrequently: you end up with code whose performance is within error (13.5 ms vs 13.6 ms on the above GRO_MEMPROT system).
Even if all 43480 elements in the name array are unique, you only save ~2 ms [44 ms -> 42 ms], although this is mainly an academic point, as passing an array of unique names is out of scope for this function (i.e. triggering the KeyError probably becomes the biggest cost).
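Note that this variant relies on dicts preserving insertion order (an implementation detail in CPython 3.6, guaranteed from Python 3.7), so `namedict.keys()` comes out in exactly the order the indices were assigned. A quick demonstration of the idiom (illustrative snippet, not from the PR):

```python
# dict preserves insertion order, so each key's position matches the
# index it was assigned when first seen
d = {}
for name in ['O', 'CA', 'O', 'N']:
    d.setdefault(name, len(d))  # len(d) is evaluated before insertion
print(d)        # {'O': 0, 'CA': 1, 'N': 2}
print(list(d))  # ['O', 'CA', 'N'] -- same order as the stored indices
```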
Ok I've made all string topology attrs use this approach. I think I can improve selections like …
So the code adds a numpy array plus a smaller (unless everything is unique) list and dictionary per attribute. At what point do we need to consider the extra memory cost of Universe creation?
The memory cost is likely almost always smaller. Yes, there are more Python objects, but each resname is stored as an integer (2 bytes for uint16) rather than a Python string (likely 3+ bytes?). I'll double-check these assumptions though :)
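A rough back-of-envelope check of that claim (illustrative CPython numbers with hypothetical data, not measurements from this PR):

```python
import sys
import numpy as np

names = ['HW1'] * 100_000                  # highly repetitive atom names
as_objects = np.array(names, dtype=object) # stores one pointer per atom
as_indices = np.zeros(len(names), dtype=np.uint16)  # 2 bytes per atom

print(as_objects.nbytes)     # 800000 -- 8-byte pointers, one per atom
print(as_indices.nbytes)     # 200000 -- 2-byte indices, one per atom
print(sys.getsizeof('HW1'))  # ~52 -- the single shared str object itself
```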
Update on my million-atom benchmark for this new patch.
It's probably worth noting that I suspect the TOPParser is particularly inefficient, so the difference in Universe creation will likely look quite different if, say, a GRO file is read.
@IAlibay thanks for benchmarking. I'm not shocked that this is a little slower to construct: we're constructing an index of possible values, so there's a Python loop over all atom values, for each attribute. I'll see if I can take the edge off that... Killing a 2 s selection time is nice though :)
Is the only thing left here the refactoring of the duplicated code? It would be nice to get this regression fixed sooner rather than later (also for PR #2798).
CHANGELOG, too.
```diff
@@ -828,14 +845,20 @@ class ProteinSelection(Selection):
         'CLEU', 'CILE', 'CVAL', 'CASF', 'CASN', 'CGLN', 'CARG', 'CHID', 'CHIE',
         'CHIP', 'CTRP', 'CPHE', 'CTYR', 'CGLU', 'CASP', 'CLYS', 'CPRO', 'CCYS',
         'CCYX', 'CMET', 'CME', 'ASF',
-    ])
+    }

     def __init__(self, parser, tokens):
         pass

     def apply(self, group):
```
`select_atoms('protein')` went from 50 ms to 0.5 ms on GRO.
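To illustrate where a speedup like that comes from (a simplified sketch with made-up data, not the actual selection code): the membership test against the residue-name set runs once per unique resname, and atoms are then picked out with a fast boolean lookup.

```python
import numpy as np

prot_res = {'ALA', 'ARG', 'GLY', 'LYS'}  # truncated stand-in for the real set
resname_lookup = np.array(['SOL', 'ALA', 'NA+', 'GLY'], dtype=object)
nmidx = np.array([0, 0, 1, 2, 3, 3, 3])  # per-residue index into the lookup

# test each *unique* resname once against the set...
matching = np.array([name in prot_res for name in resname_lookup])
# ...then expand to all residues by integer indexing
protein_mask = matching[nmidx]
print(protein_mask)  # [False False  True False  True  True  True]
```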
Something is messing up the Azure Pipelines stuff here, but it's not immediately obvious to me what it is.
Is there anything in there that would not work in the backport PR #2798, i.e., anything that requires Python ≥ 3.6?
(It would be ok, but it would be helpful to have a note on how to backport any Py 3.6+ constructs. Fixing the performance regression on selections is the main reason why we haven't publicly announced 1.0!)
AppVeyor fails on TestICodes.test_set_residues_singular – is this related to this PR?
@MDAnalysis/coredevs does someone fancy giving this a once-over?
@MDAnalysis/coredevs please |
package/MDAnalysis/core/selection.py (outdated):

```python
    .. versionchanged:: 2.0.0
       sug_atoms changed to set (from numpy array)
       performance improved by ~100x on larger systems
    """
    token = 'nucleicsugar'
    sug_atoms = np.array(["C1'", "C2'", "C3'", "C4'", "O4'"])
```
Shouldn't this be a set?
That's what the versionchanged says....
On these lines of the diff:

```python
name_lookup = []  # maps idx to str
# eg namedict['O'] = 5 & name_lookup[5] = 'O'

self.nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
```
I would not mind a more explicit name than `nmidx`. `name_index` perhaps? But there may be something even better I do not think of.
Does `nmidx` need to be public, or could it just be hidden behind an underscore?
Note that there are duplicates of this code in `_ResidueStringAttr` and `_SegmentStringAttr` that would also need to be changed for consistency.
On balance I can live with `nmidx` in what's clearly an internal class, especially as there are comments that explain what it is.
Ping @IAlibay for approval; ping @richardjgowers to address @jbarnoud's comments.
Generally looking good to me. The only real issue is the `array` -> `set` fix that @jbarnoud found.
I don't know how to reduce the duplicated code in a simple manner, so I'll just be content with it.
There was one comment by @IAlibay on a potential performance improvement. I'll leave it to @richardjgowers and @IAlibay to figure that one out.
I am not in a position to review in depth, though from what I read on my phone it looks good to me. I am uneasy with the amount of code duplication, but it can be a separate pull request.
Overall lgtm, mainly:

- docstring changes needed (we probably need a better way to capture bad `versionchanged` placements, since it doesn't error out but just doesn't render the right way)
- test for the uncovered `else`
- can we clean up the append operations (if not, never mind, maybe some other time)?
On these lines of the diff:

```python
newidx = len(self.namedict)
self.namedict[values] = newidx
newnames.append(values)
else:
```
This part of the code (i.e. the `else` part of the `if` construct) isn't being tested (at least according to codecov). Could we add at least a quick test for this? Since it's core, it's probably a bad idea to leave stuff untested.
hopefully covered by new test on setting multiple segids
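For illustration, roughly the shape such a test could take (a hypothetical sketch, not the test actually added; the GRO test file provides a single segment):

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import GRO

def test_set_segids_twice():
    u = mda.Universe(GRO)
    # first assignment adds 'A' to the lookup table (the KeyError path)...
    u.segments.segids = ['A'] * len(u.segments)
    assert set(u.atoms.segids) == {'A'}
    # ...the second assignment hits the existing-entry (else) branch
    u.segments.segids = ['A'] * len(u.segments)
    assert set(u.atoms.segids) == {'A'}
```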
@richardjgowers could you please have a look at @IAlibay's comments? Then this should be finally good to go!
Re: avoiding the append loop. Not sure it's possible, as we don't know the length beforehand, and we definitely want it as a numpy array later, as we fancy-index it to death. There's some code duplication; this is because the different versions work on atoms/residues/segments or have different name arrays (nucl_res / sug_atoms). You could make it all more generic by looking up the attribute names dynamically, but I think it gets less readable as you try and play code golf with it.
codecov seems reasonable, lgtm!
Backport commit message (cherry picked from commit 45e56e8):

* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.string_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr; tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching was done case-insensitively)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Squashed commit message for this PR:

* modified AtomNames topologyattr to include lookup table index
* cheeky little optimisation
* rework atom name selection to use lookup tables
* Update topologyattrs.py
* fixed test supplying integer as atom name (really, topologyattrs need to be statically typed and protective about this)
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERE
* removed some code duplication
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr; tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive
* Update package/MDAnalysis/core/selection.py (@jbarnoud's fix)
* apply suggestions from code review
* added test for setting multiple segids at once

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Fixes #2751