-
Notifications
You must be signed in to change notification settings - Fork 674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simple selection by name is slow due to fnmatch #2751
Comments
Btw, if you want to see recent regressions, go to https://www.mdanalysis.org/benchmarks/#regressions?sort=3&dir=desc and look for dates in 2020/2019 near the top of the list. |
I think using something like |
Maybe it’s not fnmatch itself but whatever we did in the PR that introduced it killed performance. I didn’t have time to look at the old PR but when I briefly looked at your new PR I saw that there used to be a list comprehension over all atom names to run fnmatch at each single one of them and that alone is probably already problematic when all one really wants to do is |
We used to use something closer to what you've said there, ie letting numpy do the loop rather than a list comprehension. You could rewrite the atom name matching to only use In [1]: import numpy as np
In [2]: names = np.array(['Ca'] * 100000, dtype=object)
In [3]: %timeit names == 'Ca'
1.07 ms ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: ids = np.array([1] * 100000)
In [5]: %timeit ids == 1
37.9 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: ids = np.array([1] * 100000, dtype=np.uint16)
In [8]: ids
Out[8]: array([1, 1, 1, ..., 1, 1, 1], dtype=uint16)
In [9]: %timeit ids == 1
6.8 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) It's much faster to match integers, if we "limit" ourselves to 16 bits/65,000 names, we can scan the array ~500x faster, you just have to have the str->int conversion with a dict of names. |
Thanks for the explanation. This starts looking like building a relational database. 16 bits for the int keys should be more than plenty. |
PR #2755 fixed the regression: see https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='resname%20LYS'&commits=eb18a338&x-axis=commit&python=3.6 : note the drop down to 207.9 µs for the last commit, which is faster than the ~1.3 ms before the regression. 🍾 |
* fix #2751 * Python 2.7/3 backport of PR #2755 (use six.str_types, update CHANGELOG and versionchanged as 1.0.1) * modified AtomNames topologyattr to include lookup table index * rework atom name selection to use lookup tables * fixed test supplying integer as atom name * Update test_topologyattrs.py * use dict-lookup string attrs EVERYWHERERE * made protein selection faster, 48ms -> 0.5ms on GRO testfile * improved nucleic/backbone selections * Added explicit tests for Resnames topologyattr tests now provide str types for resnames/icodes * use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching was done case-insensitive) Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com> Co-authored-by: Oliver Beckstein <orbeckst@gmail.com> (cherry picked from commit 45e56e8)
* fix #2751 * Python 2.7/3 backport of PR #2755 (use six.str_types, update CHANGELOG and versionchanged as 1.0.1) * modified AtomNames topologyattr to include lookup table index * rework atom name selection to use lookup tables * fixed test supplying integer as atom name * Update test_topologyattrs.py * use dict-lookup string attrs EVERYWHERERE * made protein selection faster, 48ms -> 0.5ms on GRO testfile * improved nucleic/backbone selections * Added explicit tests for Resnames topologyattr tests now provide str types for resnames/icodes * use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching was done case-insensitive) Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com> Co-authored-by: Oliver Beckstein <orbeckst@gmail.com> (cherry picked from commit 45e56e8)
* fix #2751 * Python 2.7/3 backport of PR #2755 (use six.string_types, update CHANGELOG and versionchanged as 1.0.1) * modified AtomNames topologyattr to include lookup table index * rework atom name selection to use lookup tables * fixed test supplying integer as atom name * Update test_topologyattrs.py * use dict-lookup string attrs EVERYWHERERE * made protein selection faster, 48ms -> 0.5ms on GRO testfile * improved nucleic/backbone selections * Added explicit tests for Resnames topologyattr tests now provide str types for resnames/icodes * use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching was done case-insensitive) Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com> Co-authored-by: Oliver Beckstein <orbeckst@gmail.com> (cherry picked from commit 45e56e8)
Expected behavior
Simple atom selections such as
u.select_atoms("protein")
oru.select_atoms("name CA")
or "resname LYS" are not slowed down by advanced selection capabilities.Actual behavior
The benchmarks https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='name%20CA'&six=%5Bnone%5D&ram=16GB clearly show that performance got worse by 53.58x (from 997.988μs to 53.474ms) when PR #2551 , which introduced fnmatch, was merged. Specifically, the commit eb18a338 is the point in the benchmark when performance gets worse.
Note
This slowdown might have contributed to #2671 (tests timing out) and the need for PR #2706.
The text was updated successfully, but these errors were encountered: