
Faster name selections #2755

Merged: 18 commits, Aug 25, 2020

Conversation

@richardjgowers (Member) commented Jun 14, 2020

Fixes #2751

Changes made in this Pull Request:

  • AtomNames now use a lookup table in their implementation

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@richardjgowers (Member Author):

This is a different way of storing atom names in the Topology object. Instead of a single array of names like ['N', 'C', ...], there are now two arrays: one holding the set of unique atom names, e.g. ['C', 'N', ...], and one storing each atom's name as an index into that array.

This is beneficial for quickly matching strings (i.e. names): you match the selection pattern against the small set of unique names to yield integers, then quickly match those integers across all atoms. The assumption is that the number of unique atom names is very small compared to the number of atoms in the system, i.e. atom names are repeated often.
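A minimal, self-contained sketch of this scheme (illustrative only; names and shapes here are assumptions, not the PR's actual implementation):

```python
import numpy as np
from fnmatch import fnmatchcase

# Per-atom names as they would arrive from a topology parser
names = ['N', 'CA', 'C', 'O', 'N', 'CA', 'C', 'O', 'OW']

# Build the two arrays: unique names, plus per-atom indices into them
namedict = {}                              # str -> index into name_lookup
nmidx = np.empty(len(names), dtype=int)    # one small int per atom
for i, n in enumerate(names):
    if n not in namedict:
        namedict[n] = len(namedict)
    nmidx[i] = namedict[n]
name_lookup = np.array(list(namedict), dtype=object)

# 'name O*': glob-match only the handful of unique names...
matching = [idx for name, idx in namedict.items() if fnmatchcase(name, 'O*')]
# ...then select atoms by cheap integer membership
selected = np.where(np.isin(nmidx, matching))[0]
print(selected.tolist())  # [3, 7, 8]
```

The expensive string matching runs once per unique name instead of once per atom, which is where the large selection speed-ups reported in this thread come from.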

I did some timing with this sequence of actions:

import MDAnalysis as mda
from MDAnalysisTests.datafiles import GRO

u = mda.Universe(GRO)

names = u.atoms.names

ag = u.select_atoms('name O*')

ag.names = 'X'

  • Universe creation is a bit slower for me (600 ms -> 625 ms).
  • Attribute access is identical.
  • select_atoms drops from 69 ms to 0.69 ms.
  • Setting atom names gets slower, from 0.6 ms to 1.1 ms.

I think I can probably make Universe creation faster by rewriting how the Parsers build the AtomName attribute (i.e. build the lookup array as I go). I think atoms are selected more often than they are set, so I'm not too concerned about the performance drop in attribute setting.

Then, in a very hand-wavy way, I think this will also help serialisation efforts, as this representation is more compact to store/send.
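To illustrate the compactness point with rough numbers (the sizes below are back-of-envelope assumptions, not measured MDAnalysis internals):

```python
import numpy as np

n_atoms = 1_000_000
# Index form: one small integer per atom plus a tiny table of unique names
nmidx = np.tile(np.arange(4), n_atoms // 4).astype(np.uint32)
unique_names = ['N', 'CA', 'C', 'O']

per_atom_strings = n_atoms * 8   # object array: ~one 8-byte pointer per atom
index_form = nmidx.nbytes + sum(len(s) for s in unique_names)
print(per_atom_strings, index_form)  # 8000000 4000005
```

Serialising the integer index plus the small table also avoids shipping one string object per atom.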

@richardjgowers (Member Author):

@MDAnalysis/coredevs, could you try this on your typical workloads and see if it works for you?

codecov bot commented Jun 16, 2020

Codecov Report

Merging #2755 into develop will increase coverage by 0.03%.
The diff coverage is 99.50%.


@@             Coverage Diff             @@
##           develop    #2755      +/-   ##
===========================================
+ Coverage    92.97%   93.01%   +0.03%     
===========================================
  Files          187      187              
  Lines        24787    24937     +150     
  Branches      3231     3258      +27     
===========================================
+ Hits         23046    23195     +149     
- Misses        1693     1694       +1     
  Partials        48       48              
Impacted Files Coverage Δ
package/MDAnalysis/core/topologyattrs.py 96.49% <99.26%> (+0.31%) ⬆️
package/MDAnalysis/core/selection.py 99.52% <100.00%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca2cbe4...ee64c32. Read the comment docs.

@richardjgowers (Member Author):

@IAlibay as a way of slowing you down from making more PRs, can you review this? Is it understandable/maintainable, and does it show performance improvements on your typical systems?

@richardjgowers richardjgowers requested a review from IAlibay June 20, 2020 14:03
@IAlibay (Member) left a comment:

Looks really good to me. My main worry was that selection might not work if atoms were renamed, but it all works :) It would be good to add a test to check that post-rename selection still holds up (there might already be one, but I couldn't find it).

See below for benchmark data; on my million-atom prmtop system we get a ~337x speedup for u.select_atoms('name CA').
Universe creation is pretty much within error of what's on develop, so the gain is in my opinion worth it.

It might be good to try to expand this to resnames, as can be seen from the "select all water oxygens" case, there seems to be a high cost to matching residue names too.

Benchmark data:

On a million atom system, universe creation:

This code:
11.1 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Current develop:
10.9 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a basic "select all alpha carbons"

code used:

def select_ca(u):
    ca = u.select_atoms('name CA')

This code:
2.83 ms ± 66.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Current develop:
956 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a basic "select all the water oxygens"

code used:

def select_water_oxygens(u):
    w_oh = u.select_atoms('resname WAT and name O')

This code:
1.09 s ± 8.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Current develop:
1.98 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Rename waters loop

code used:

def rename_waters(u):
    u_new = mda.Universe('large.prmtop')
    water_o = u_new.select_atoms('name O and resname WAT')
    water_h = u_new.select_atoms('name H* and resname WAT')
    water_o.atoms.names = "OW"
    water_h.atoms.names = "HW"

This code:
14.2 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Current develop:
16 s ± 281 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


self.nmidx[i] = nextidx

self.name_lookup = np.array(name_lookup, dtype=object)
Member:

rather than doing the individual append operations, it might be cheaper/cleaner to just do np.array([i for i in namedict.keys()], dtype=object) here?
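As a side observation (not part of the PR), the comprehension over `.keys()` can be shortened further, since iterating a dict yields its keys in insertion order:

```python
import numpy as np

namedict = {'N': 0, 'CA': 1, 'C': 2, 'O': 3}

a = np.array([i for i in namedict.keys()], dtype=object)  # suggested form
b = np.array(list(namedict), dtype=object)                # equivalent, shorter
print(np.array_equal(a, b))  # True
```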

Member Author:

Yeah there's also possibly a way to exploit the guaranteed ordering of the keys that I'm not exploiting here.

Member:

@richardjgowers is this optimization something you still want to look into?

Member:

I realise in practice this isn't going to happen over many iterations, but append can be quite slow, especially if you're just doing a numpy conversion at the end anyways. If there's any way to avoid it (and make the code look cleaner at the same time), I would encourage it (same thing for the similar appends below).

Member:

With this code you don't know in advance what size name_lookup will be, and finding out in advance is relatively costly.

I benchmarked a typical system

u = mda.Universe(data.GRO_MEMPROT)
%timeit build_name_index_numpy(u.atoms.names)

with len(set(u.atoms.names)) == 219 different atom names.

Just finding out how many different names there are

%timeit len(set(u.atoms.names))

costs a sizable 2.3 ms (on my ageing laptop):

2.29 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How does this compare to what we have?

Current code

The current code (@richardjgowers's version)

def build_name_index_append(vals):
    namedict = dict()  # maps str to nmidx
    name_lookup = []  # maps idx to str
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'

    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'

    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            nextidx = len(namedict)
            namedict[val] = nextidx
            name_lookup.append(val)

            nmidx[i] = nextidx

    name_lookup = np.array(name_lookup, dtype=object)
    values = name_lookup[nmidx]

    return namedict, nmidx, name_lookup, values

is already pretty fast

%timeit build_name_index_append(u.atoms.names)

gives the speed to beat:

9.91 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note how len(set(vals)) alone would already account for 1/5 of the total run time.

Replace append with pre-allocated array

Following @IAlibay's suggestion and replacing append with a pre-allocated array

def build_name_index_insert(vals):
    namedict = dict()  # maps str to nmidx
    name_lookup = np.empty(len(set(vals)), dtype=object)  # maps idx to str
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'

    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'

    nextidx = 0
    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            namedict[val] = nextidx
            name_lookup[nextidx] = val
            nmidx[i] = nextidx
            nextidx += 1

    values = name_lookup[nmidx]

    return namedict, nmidx, name_lookup, values

gives

%timeit build_name_index_insert(u.atoms.names)
11.1 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is a tad slower than the version with append.

So I would say unless we expect O(10^3) different string values, the append version is almost certainly ok.

NumPy

I tried out a "numpy" version

def build_name_index_numpy(vals):
    name_lookup = np.unique(vals)
    namedict = {name: idx for idx, name in enumerate(name_lookup)}
    nmidx = np.zeros_like(vals, dtype=int)
    
    for name in name_lookup:
        nmidx[vals == name] = namedict[name]
        
    values = name_lookup[nmidx]
    
    return namedict, nmidx, name_lookup, values

but that gave an abysmal

240 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is 24 times slower than the try/except/append version. You can shave off 60 ms by replacing np.unique with name_lookup = np.array(list(set(vals))), but the >200 repeated assignments nmidx[vals == name] = namedict[name] kill performance:

name = "H1"
vals = u.atoms.names
%timeit nmidx[vals == name] = namedict[name]
752 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Maybe if there was a way to do all assignments in one go, one would have a hope of doing better.

But even just building the dict itself

%timeit namedict = {name: idx for idx, name in enumerate(name_lookup)}
19.5 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

is slower than the existing code, so I think it's going to be tough to do much better – maybe someone with better numpy skills than mine can see a way.
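For completeness, there is a way to do all assignments in one go: `np.unique(..., return_inverse=True)` returns the lookup table and the per-element indices from a single call. Whether it beats the dict loop would need benchmarking (it sorts internally, and it was not timed in this thread):

```python
import numpy as np

vals = np.array(['N', 'CA', 'C', 'O', 'N', 'CA'], dtype=object)

# name_lookup holds the sorted unique names; nmidx maps each atom into it
name_lookup, nmidx = np.unique(vals, return_inverse=True)
print(name_lookup.tolist())  # ['C', 'CA', 'N', 'O']
print(nmidx.tolist())        # [2, 1, 0, 3, 2, 1]
# Round-trip check: expanding the table reproduces the original values
print(np.array_equal(name_lookup[nmidx], vals))  # True
```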

Member:

So my suggestion was actually to do this:

def build_name_index_append(vals):
    namedict = dict()  # maps str to nmidx
    # eg namedict['O'] = 5 & name_lookup[5] = 'O'

    nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
    # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'

    for i, val in enumerate(vals):
        try:
            nmidx[i] = namedict[val]
        except KeyError:
            nextidx = len(namedict)
            namedict[val] = nextidx
            nmidx[i] = nextidx

    name_lookup = np.array([i for i in namedict.keys()], dtype=object)
    values = name_lookup[nmidx]

    return namedict, nmidx, name_lookup, values

The idea here was that if you end up having to do lots of irregular append operations in the original construct, you could avoid some of the memory access overheads (plus it looks a bit nicer imho).

However, in practice that idea doesn't hold up: since the KeyError is triggered so infrequently, you end up with code whose performance is within error of the original (13.5 ms vs 13.6 ms on the above GRO_MEMPROT system).

Even if all 43480 elements in the name array are unique, you only save ~ 2 ms [44 ms -> 42 ms], although this is mainly an academic point as passing an array of unique names is out of scope of this function (i.e. triggering the KeyError probably becomes the biggest cost).
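A third micro-variant (an untested sketch, not proposed in the PR) sidesteps the try/except entirely with `dict.setdefault`, which performs the lookup and the conditional insert in one call:

```python
import numpy as np

def build_name_index_setdefault(vals):
    namedict = {}
    nmidx = np.zeros(len(vals), dtype=int)
    for i, val in enumerate(vals):
        # setdefault returns the existing index, or inserts len(namedict)
        # as the next index when val is new (the default is evaluated
        # eagerly, but len() on a dict is O(1))
        nmidx[i] = namedict.setdefault(val, len(namedict))
    name_lookup = np.array(list(namedict), dtype=object)
    return namedict, nmidx, name_lookup, name_lookup[nmidx]

namedict, nmidx, name_lookup, values = build_name_index_setdefault(
    ['N', 'CA', 'C', 'O', 'N'])
print(nmidx.tolist())  # [0, 1, 2, 3, 0]
```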

@richardjgowers (Member Author):

OK, I've made all string topology attrs use this approach. I think I can improve selections like protein to leverage the new machinery too.

@IAlibay (Member) commented Jun 21, 2020

So the code adds a numpy array plus a (smaller, unless every value is unique) list and dictionary per attribute. At what point do we need to consider the extra memory cost at Universe creation?

@richardjgowers (Member Author) commented Jun 21, 2020 via email

@IAlibay (Member) commented Jun 22, 2020

Update on my million atom benchmark for this new patch.

  • Universe creation has further increased to 12.4 s from 10.9 s (develop).
  • Selecting waters has decreased to 54.2 ms from 1.98 s (develop).
  • Water renaming hasn't substantially changed (slight reduction, but within error).

It's probably worth noting that I suspect the TOPParser is particularly inefficient, so the difference in universe creation is likely to look quite different if, say, a GRO file were read.

@richardjgowers (Member Author):

@IAlibay thanks for benchmarking. I'm not shocked that this is a little slower to construct: we're constructing an index of possible types, so there's a Python loop over all atom values for each attribute. I'll see if I can take the edge off that... Killing a 2 s selection time is nice though :)

@orbeckst (Member):

Is the only thing left here the refactoring of the duplicated code? It would be nice to get this regression fixed sooner rather than later (also for PR #2798).

@orbeckst (Member):

CHANGELOG, too.

@richardjgowers force-pushed the faster_name_selections branch from 81a0fb5 to 00ba0ee on July 5, 2020 10:07
pep8speaks commented Jul 5, 2020

Hello @richardjgowers! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 577:10: E225 missing whitespace around operator

Line 479:1: W293 blank line contains whitespace
Line 527:80: E501 line too long (83 > 79 characters)
Line 1778:1: W293 blank line contains whitespace
Line 1797:51: W291 trailing whitespace
Line 1826:80: E501 line too long (83 > 79 characters)
Line 1829:51: W291 trailing whitespace
Line 2014:1: W293 blank line contains whitespace
Line 2033:51: W291 trailing whitespace
Line 2062:80: E501 line too long (83 > 79 characters)
Line 2065:51: W291 trailing whitespace
Line 2067:1: W293 blank line contains whitespace

Line 77:80: E501 line too long (100 > 79 characters)

Line 93:80: E501 line too long (84 > 79 characters)
Line 93:81: E231 missing whitespace after ','

Comment last updated at 2020-08-25 10:30:30 UTC

@@ -828,14 +845,20 @@ class ProteinSelection(Selection):
'CLEU', 'CILE', 'CVAL', 'CASF', 'CASN', 'CGLN', 'CARG', 'CHID', 'CHIE',
'CHIP', 'CTRP', 'CPHE', 'CTYR', 'CGLU', 'CASP', 'CLYS', 'CPRO', 'CCYS',
'CCYX', 'CMET', 'CME', 'ASF',
])
}

def __init__(self, parser, tokens):
pass

def apply(self, group):
Member Author:

select_atoms('protein') went from 50ms to 0.5ms on GRO
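The mechanism behind that speed-up, sketched with hypothetical names (the real ProteinSelection matches against a much larger set of residue names): membership is tested once per unique name, then broadcast to all atoms via the index array.

```python
import numpy as np

prot_resnames = {'ALA', 'GLY', 'LYS'}   # tiny stand-in for the real set
name_lookup = np.array(['ALA', 'GLY', 'SOL', 'LYS'], dtype=object)
nmidx = np.array([0, 0, 1, 2, 2, 3])    # per-atom index into name_lookup

# One set-membership test per unique name (~10^2 of them)...
is_prot = np.array([n in prot_resnames for n in name_lookup])
# ...then a cheap boolean gather over all atoms
mask = is_prot[nmidx]
print(mask.tolist())  # [True, True, True, False, False, True]
```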

@IAlibay (Member) commented Jul 5, 2020

Something is messing up the Azure Pipelines run here, but it's not immediately obvious to me what it is.

@orbeckst (Member) left a comment:

Is there anything in here that would not work in the backport PR #2798, i.e., anything that requires Python ≥ 3.6?

(It would be ok, but it would be helpful to have a note on how to backport any Py 3.6+ constructs. Fixing the performance regression on selections is the main reason why we haven't publicly announced 1.0!)

@richardjgowers force-pushed the faster_name_selections branch from 00ba0ee to 55f9a5b on July 11, 2020 15:39
@orbeckst (Member):

AppVeyor fails on TestICodes.test_set_residues_singular – is this related to this PR?

@richardjgowers (Member Author):

@MDAnalysis/coredevs does someone fancy giving this a once over?

@richardjgowers (Member Author):

@MDAnalysis/coredevs please


.. versionchanged:: 2.0.0
sug_atoms changed to set (from numpy array)
performance improved by ~100x on larger systems
"""
token = 'nucleicsugar'
sug_atoms = np.array(["C1'", "C2'", "C3'", "C4'", "O4'"])
Contributor:

Shouldn't this be a set?

Member:

That's what the versionchanged says....

name_lookup = [] # maps idx to str
# eg namedict['O'] = 5 & name_lookup[5] = 'O'

self.nmidx = np.zeros_like(vals, dtype=int) # the lookup for each atom
Contributor:

I would not mind a more explicit name than nmidx; name_index perhaps? But there may be something even better I am not thinking of.

Member:

Does nmidx need to be public, or could it just be hidden behind an underscore?

Member:

Note that there are duplicates of this code in _ResidueStringAtrr and _SegmentStringAttr that would also need to be changed for consistency.

On balance I can live with nmidx in what's clearly an internal class, especially as there are comments that explain what it is.

@orbeckst (Member) commented Aug 7, 2020

ping @IAlibay for approval

ping @richardjgowers to address @jbarnoud 's comments

@orbeckst (Member) left a comment:

Generally looking good to me. The only real issue is the array -> set fix that @jbarnoud found.

I don't know how to reduce the duplicated code in a simple manner so I'll just be content with it.

There was one comment by @IAlibay on a potential performance improvement. I'll leave it to @richardjgowers and @IAlibay to figure that one out.




@orbeckst (Member):

@IAlibay @jbarnoud do you see any blocking issues here?

I am keen to see if it looks better in our benchmarks.

@orbeckst orbeckst added this to the 1.0.x milestone Aug 16, 2020
@jbarnoud (Contributor):

I am not in a position to review in depth, though from what I read on my phone it looks good to me. I am uneasy about the amount of code duplication, but that can be a separate pull request.

@IAlibay (Member) left a comment:

Overall LGTM; mainly:

  • docstring changes needed (we probably need a better way to catch bad versionchanged placements, since it doesn't error out but just doesn't render the right way)
  • a test for the uncovered else branch
  • can we clean up the append operations (if not, never mind, maybe some other time)?



newidx = len(self.namedict)
self.namedict[values] = newidx
newnames.append(values)
else:
Member:

This part of the code (i.e. the else branch of the if construct) isn't being tested (at least according to codecov). Could we add at least a quick test for it? Since it's core code, it's probably a bad idea to leave it untested.

Member Author:

Hopefully covered by the new test on setting multiple segids.

orbeckst and others added 2 commits August 18, 2020 23:37
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
@orbeckst (Member):

@richardjgowers could you please have a look at @IAlibay 's comments? Then this should be finally good to go!

@richardjgowers (Member Author):

Re: avoiding the append loop: I'm not sure it's possible, as we don't know the length beforehand. And we definitely want it as a numpy array later, since we fancy-index it to death.

There's some code duplication; this is because the different versions work on atoms/residues/segments or have different name arrays (nucl_res / sug_atoms). You could make it all more generic by looking attributes up via variable names, but I think it gets less readable as you try to play code golf with it.

@IAlibay (Member) left a comment:

codecov seems reasonable, lgtm!

@richardjgowers richardjgowers merged commit 45e56e8 into develop Aug 25, 2020
@richardjgowers richardjgowers deleted the faster_name_selections branch August 25, 2020 13:35
@orbeckst orbeckst mentioned this pull request Aug 25, 2020
orbeckst pushed a commit that referenced this pull request Aug 26, 2020
* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.str_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr
  tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching
  was done case-insensitive)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

(cherry picked from commit 45e56e8)
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
* modified AtomNames topologyattr to include lookup table index

* cheeky little optimisation

* rework atom name selection to use lookup tables

* Update topologyattrs.py

* fixed test supplying integer as atom name

really topologyattrs need to be statically typed and protective about this

* Update test_topologyattrs.py

* use dict-lookup string attrs EVERYWHERERE

* removed some code duplication

made protein selection faster, 48ms -> 0.5ms on GRO testfile

* improved nucleic/backbone selections

* Added explicit tests for Resnames topologyattr

tests now provide str types for resnames/icodes

* use fnmatchcase to be case sensitive

* Update package/MDAnalysis/core/selection.py

@jbarnoud's fix

* apply suggestions from code review

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>

* added test for setting multiple segids at once

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Successfully merging this pull request may close these issues.

simple selection by name is slow due to fnmatch