PERF: improved performance of small multiindexes #16324
Conversation
@jreback Did you post the wrong timings? As you show them, it is actually slower for the big one, and the same for the smaller one…

Or are the timings of the PR compared to 0.19.2 timings? (I assumed master or 0.20.1 timings.) In that case, with this PR we are still considerably slower compared to 0.19.2 (after the first indexing operation and hashtable creation). When I was testing it yesterday, I also assumed there was some trade-off on the size of the MI between the new hashtable-based MultiIndexEngine and the ObjectEngine. So I tried to find a good cut-off by varying the MI size, but I actually got systematically faster results with the Object engine, for MIs up to 1 million elements.

So the engine creation gets a lot slower (and that is important as well, of course), but after that the actual indexing operation is still consistently faster when using the ObjectEngine, even for…

Regarding that last remark about the asv benchmarks: the reason is that they do the indexing timings on a 'cold' index, so the result seems to be dominated by the engine creation rather than the actual…
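To make that cold-vs-warm distinction concrete, here is a minimal sketch of the kind of measurement I mean (the sizes, key, and helper name are illustrative, not the actual asv fixtures):

import time
import numpy as np
import pandas as pd

def time_first_and_rest(n, n_lookups=100):
    # build a fresh MultiIndex so the engine cache is cold
    mi = pd.MultiIndex.from_product([np.arange(n), list("AB")])
    key = (n // 2, "A")

    t0 = time.perf_counter()
    mi.get_loc(key)              # first call pays engine/hashtable creation
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n_lookups):
        mi.get_loc(key)          # later calls hit the cached engine
    rest = time.perf_counter() - t0
    return first, rest

A 'cold' benchmark effectively measures only the first number; amortized over many lookups, the second one matters too.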
Codecov Report
@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.37%   90.34%     -0.04%
==========================================
  Files         161      161
  Lines       50863    50866         +3
==========================================
- Hits        45966    45953        -13
- Misses       4897     4913        +16
Continue to review full report at Codecov.
Codecov Report
@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.39%   90.34%     -0.06%
==========================================
  Files         161      161
  Lines       50863    50866         +3
==========================================
- Hits        45978    45954        -24
- Misses       4885     4912        +27
Continue to review full report at Codecov.
@jorisvandenbossche pls read my impl notes. The timings are correct. Your measurement is incorrect.
This is by far the most important part of the cost.
I had read your comments in the classes, but I don't see an explanation for my comments above, so it would be good if you could clarify. To be clear, I am not saying the ObjectEngine (the old implementation) is necessarily better (and thanks for this PR!). It is just that the trade-off between the performance of index creation and of the actual indexing operation is a difficult one, because it depends on how many times you do such an indexing operation. So it will always be difficult to determine a good cut-off size for choosing between the two, because it depends on the use case (apart from that, we should just choose a sensible one). If you do many indexing operations (which is most of the time not good usage of course, but not always avoidable), the ObjectEngine is, even for large arrays, still a lot faster than the MultiIndexEngine (because in such a case the faster…
@jorisvandenbossche your comments indicate you did not read my update at the top.
prove it
sure there is a tradeoff. But around > 10000 elements it is dramatically better in the hash-based impl (sure there is a breakeven point, and it is probably somewhat larger, maybe 50k or 100k elements), BUT this does reflect the ballooning memory of the original impl. Simply compare [6] above. The hashtable creation cost vastly outweighs everything else. timeit does NOT take this into account (well, it does say that things are cached, which is exactly what is happening here).
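The memory side of this is easy to check; a rough sketch (the size is arbitrary, and sys.getsizeof on a list of tuples counts only the list and the tuple shells, so it actually understates the tuple-based footprint):

import sys
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(100000), list("AB")])

# what the original object-based engine materializes: one Python tuple per row
tuples = list(mi)
tuple_bytes = sys.getsizeof(tuples) + sum(sys.getsizeof(t) for t in tuples)

# the MultiIndex itself stores per-level values plus integer codes
mi_bytes = mi.memory_usage(deep=True)

print("list of tuples: ~%.1f MB" % (tuple_bytes / 1e6))
print("MultiIndex:     ~%.1f MB" % (mi_bytes / 1e6))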
OK, I thought you meant the comments in the code. GitHub doesn't always refresh itself if you keep the page open, so I only saw it now.

But what I want to make clear is that this trade-off size of 10k can be very detrimental when you do a lot of single indexing operations (when above that size of 10k, and so using the hash-based one). And you easily end up with many indexing operations, even if you do not do that manually in a loop. Maybe the combination of a large MI size and many indexing operations is not that common, so maybe we shouldn't care, but if you are in this situation, it can make a big difference.

Visual example (code to reproduce in http://nbviewer.jupyter.org/gist/jorisvandenbossche/d39063411b480e1fa825b0cb5c1d56fd): the full line is the first get_loc (so with the overhead of hashtable creation), the dotted lines are subsequent get_loc calls (added to the first line). So with 10 get_loc calls the cutoff is around 10^5, but if you would make the same plot with 1000 get_loc calls, the ObjectEngine is even slightly faster with a 10^7-sized MI (but probably with blown-up memory, as you indicated).

But, as I don't think we want to expose this as an option, we have to choose a cutoff size, so I think the current 10k you used is probably fine.
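For reference, a compressed version of what the notebook measures (illustrative, not the exact gist code): the total cost of one engine-building get_loc plus k cached lookups, swept over MI sizes.

import time
import numpy as np
import pandas as pd

def total_cost(n, k):
    # the first get_loc includes engine/hashtable creation; the next k are cached
    mi = pd.MultiIndex.from_product([np.arange(n), list("AB")])
    key = (n // 2, "A")
    t0 = time.perf_counter()
    for _ in range(k + 1):
        mi.get_loc(key)
    return time.perf_counter() - t0

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, total_cost(n, k=10), total_cost(n, k=1000))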
Some other comments apart from our discussion :-)
asv_bench/benchmarks/indexing.py
def time_multiindex_string_get_loc(self):
    self.mistring.get_loc((999, 19, 'Z'))
    self.mi_small.get_loc((99, 'A'))
the index actually has 3 levels (did that change? in the beginning I think it had only 2 levels).
But then the result is not a full key, so I suppose it takes a different code path? (maybe also useful to benchmark, but not what we want in this PR?)
fixed
pandas/_libs/index.pyx
        return super(MultiIndexObjectEngine, self).get_loc(val)


cdef class MultiIndexEngine(MultiIndexObjectEngine):
not sure it is needed to subclass from MultiIndexObjectEngine, as you actually override all the methods specified in that one. So for code clarity, I would just subclass IndexEngine as before.
fixed
# choose our engine based on our size
# the hashing based MultiIndex for larger
# sizes, and the MultiIndexObject for smaller
maybe refer to the GitHub issue number of the PR for the discussion about it?
done
yeah, the calls after ht construction are basically immaterial; you pay the cost up front. Sure you can always overwhelm it, but that is just user error (and to be honest, not sure how much we could do about that). a…
The gist/notebook for the plot above: http://nbviewer.jupyter.org/gist/jorisvandenbossche/d39063411b480e1fa825b0cb5c1d56fd
thanks for the review @jorisvandenbossche
closes pandas-dev#16319 (cherry picked from commit 94ef7b6)
closes #16319
When I did #15245 the goal was twofold:
original impl
These are interrelated: the original impl of a MI constructed a list of tuples, one per value (the length of each tuple is the number of levels), and then built a PyObject-based hashtable from it. This is pretty slow and has the side effect of blowing up memory, as a large list of tuples is pretty heavyweight.
Indexing is simple: you hash the passed tuple and do a lookup in the hashtable. This is quite fast.
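In pure-Python terms the original scheme is roughly the following (a sketch only; the real ObjectEngine is Cython on top of a khash PyObject hashtable, not a dict):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(3), list("AB")])

# build: materialize one tuple per row, then map tuple -> position;
# this is the slow, memory-heavy step for a large MI
table = {t: i for i, t in enumerate(mi)}

# lookup: hash the passed tuple, one table probe
assert table[(1, "B")] == 3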
hashtable impl
The new implementation instead hashes the level values themselves and applies transforms to preserve ordering. This is much faster than the original implementation because the levels have common dtypes, so these hashes can be constructed in a vectorized way. So this is fast and memory efficient.
Looking up a single value involves constructing the hash for a tuple. This has some overhead: for implementation simplicity it is partly Python code and takes a few steps, which could be improved. So the actual construction of the hash of the value to be looked up is slower than in the original impl.
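The shape of the idea can be sketched like this (a toy transform only; the real hashing in pandas is careful about collisions and ordering, which this toy is not, and mi.codes is the modern name for what older versions called labels):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(1000), list("ABCD")])

# hash each level's integer codes as whole arrays (vectorized),
# then mix the per-level hashes into one uint64 per row
combined = np.zeros(len(mi), dtype=np.uint64)
for k, codes in enumerate(mi.codes):
    h = np.asarray(codes).astype(np.uint64) * np.uint64(0x9E3779B97F4A7C15)
    combined ^= h + np.uint64(k + 1)

# map hash -> position (toy: pretends collisions cannot happen)
table = {h: i for i, h in enumerate(combined)}

# a single get_loc must re-derive the same hash for one tuple in Python,
# which is exactly the per-lookup overhead described above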
Naive timing of an indexing on a new DataFrame is very misleading.
%timeit reports the minimum over its runs: the first iteration includes the hashtable construction cost, while subsequent iterations hit the cached table, so the reported minimum excludes construction entirely. We actually care about the sum of access times: we always have to construct the hash table in order to index, so its cost is more than relevant; in fact it can dominate everything else.
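Concretely, the gap between the first run and the min is easy to show with timeit (a sketch; exact numbers will vary by machine):

import timeit
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(100000), list("AB")])

# number=1 keeps the individual runs visible instead of folding them together
runs = timeit.repeat("mi.get_loc((50000, 'A'))",
                     globals={"mi": mi}, repeat=5, number=1)

print("first run (pays hashtable construction):", runs[0])
print("min of runs (the cached 'best of' time): ", min(runs))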
This PR allows the original implementation for smaller MultiIndexes. I chose 10000 elements (times are still slightly slower for the hash-based construction, by about 1.5x, but this balances the memory bloat). The hash-based implementation is vastly better above that size. (If some enterprising soul wants to draw the graph of amortized construction cost, that would be great!)
So we choose the indexing engine at run time. We then get good performance for smallish indexes and nice scaling for larger hash tables.
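In outline the dispatch amounts to something like the following (a sketch; MultiIndexEngine and MultiIndexObjectEngine are the Cython classes in pandas/_libs/index.pyx, passed in here as parameters since they are not plain-Python classes):

SIZE_CUTOFF = 10000  # the cutoff chosen in this PR

def make_engine(mi, hash_engine_cls, object_engine_cls):
    # large MI: hash-based engine, much cheaper construction and memory,
    # at the price of slightly costlier single lookups
    if len(mi) > SIZE_CUTOFF:
        return hash_engine_cls(mi)
    # small MI: tuple/object-based engine is cheap enough at this scale
    return object_engine_cls(mi)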
setup
These timings show the index construction cost (first execution) and the marginal cost after that (second execution), for small, medium, and large indexes.
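The setup code itself is not reproduced in this excerpt; a hypothetical setup consistent with the description would look something like this (the sizes are guesses, not the ones actually used):

import numpy as np
import pandas as pd

# hypothetical small / medium / large MultiIndexes
mi_small = pd.MultiIndex.from_product([np.arange(100), list("AB")])
mi_med = pd.MultiIndex.from_product([np.arange(10000), list("AB")])
mi_large = pd.MultiIndex.from_product([np.arange(1000000), list("AB")])

# timing the first get_loc on each measures construction cost;
# timing a second get_loc measures the marginal lookup cost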
v0.19.2
PR