perf: some more python overhead reduction #3363
Conversation
shapes = [x.shape for x in all_arrays]
shape = self.broadcast_shapes(*shapes)
I'd be tempted not to make these kinds of changes unless the numbers really support them (with the bias that I find list-comprehensions more readable/Pythonic than for loops).
I'm ambivalent about this distinction; a "for loop to get everything" is a common pattern, and it makes it a little more obvious that the lists (`all_arrays` and `all_shapes`) have the same lengths, because they're `append`ed in lock-step. Of course, you also know that with the list comprehension, since it doesn't have an `if` clause.

If anything, I'm more concerned about the extra variable names floating around. As a list comprehension, it could have been defined inside of the call to `broadcast_shapes`.

But maybe more globally, @pfackeldey, you probably made a lot of these changes while watching the performance numbers, and stopped when you found a significant optimization. Afterward, did you roll back any changes that didn't contribute to the optimization? Many of these look like they would have a small effect: a single `for` loop versus a list comprehension can't be much for short lists (the number of arguments to `broadcast_arrays`).
I just tested locally and it doesn't add to the speedup. I'll roll back this change, but add the list comprehension directly into the call to `broadcast_shapes`. 👍
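For illustration, here is a minimal, self-contained sketch of the change agreed on above. It uses NumPy's `broadcast_shapes` as a stand-in for the `TypeTracer` method, so the names and inputs are placeholders, not awkward's actual code.

```python
import numpy as np

# stand-in inputs; in the real code these are typetracer arrays
all_arrays = [np.empty((3, 1)), np.empty((1, 4)), np.empty((3, 4))]

# before: build an intermediate, named list of shapes first
shapes = []
for x in all_arrays:
    shapes.append(x.shape)
shape_loop = np.broadcast_shapes(*shapes)

# after: define the comprehension directly inside the call,
# so no extra variable name is floating around
shape_inline = np.broadcast_shapes(*[x.shape for x in all_arrays])

assert shape_loop == shape_inline == (3, 4)
```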
@@ -427,6 +427,8 @@ def __getitem__(
    if not isinstance(key, tuple):
        key = (key,)

    ndim = self.ndim
Is the cost of the `__getattr__` really too much? (Is `ndim` an expensive property?)

I see the value of precomputing a formula like `n_missing_dims`, but if `self.ndim` is just an attribute, skipping that is a microoptimization, and the code is more readable if there are fewer variables floating around.
It doesn't cost much; it's calculating the `len` of a tuple. It would only become relevant if that happens millions (or rather billions) of times. I'm confident that rolling this back doesn't change anything.
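To make the trade-off concrete, here is a hypothetical sketch (the class and method names are made up, not the real typetracer code): caching a cheap property like `ndim`, which is just the `len` of the shape tuple, in a local variable only saves an attribute lookup per access.

```python
class FakeTracer:
    """Made-up stand-in for the array class discussed above."""

    def __init__(self, shape):
        self._shape = tuple(shape)

    @property
    def ndim(self):
        # cheap: just the len of a tuple
        return len(self._shape)

    def missing_dims_precomputed(self, key):
        ndim = self.ndim  # looked up once, reused below
        return ndim - len(key)

    def missing_dims_direct(self, key):
        return self.ndim - len(key)  # looked up at the point of use


tracer = FakeTracer((2, 3, 4))
assert tracer.missing_dims_precomputed((0,)) == tracer.missing_dims_direct((0,)) == 2
```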
Nothing looks bad to me, but if you found this optimization by refactoring until you got an improvement, please check to see what can be changed back without affecting that improvement. (I think of it like regularization in ML: first do too much, then scale back unused parameters.)
I would guess that the biggest contribution comes from refactoring `__getitem__`. That seems like the most likely suspect to me, and we never did an optimization campaign in that code. (It's a pretty big change, so we're relying on the tests to ensure that the behavior hasn't changed.)
I'm closing this PR because: these changes are conceptually beneficial for the runtime; however, they are (likely all) micro-optimizations. They are of the same size as (or even less relevant than) external factors, e.g. the heat of my Mac. Running this benchmark on the current main, but in a cooler room than my office, reduced the runtime to 21 ms. Also, my macOS might do some additional throttling (e.g. based on battery), as @jpivarski pointed out. These systematic unknowns overshadow the micro-optimizations of this PR. (Just to be sure, I went back to #3359 and confirmed that those improvements are larger than these external systematic error sources.)

The good news: more than half of the trijet mass calculation is spent in a single line:

# typetracer backend
In [1]: %timeit trijet.j1 + trijet.j2 + trijet.j3
12.1 ms ± 71.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This seems to be related to Python overhead coming from https://github.com/scikit-hep/vector. I'll have a look there, as that looks like a more promising place for improvement.
This PR reduces some of the Python overhead (it's a continued investigation of #3359).
With this PR, the runtimes can be reduced further by ~10% (based on the same example as in #3359) to:
The main improvements are:

- `broadcast_any_list()`
- `TypeTracer.broadcast_arrays()`
- avoiding `n_missing_dims` recalculations in the `TypeTracer.__getitem__` loop
- passing `nplike`s to `Index` instance calls where possible, to avoid runtime type checks and nplike inference (with `nplike_of_obj`); a generic sketch of this idea follows below
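As a rough illustration of that last bullet (all names here are hypothetical; this is not awkward's actual `Index`/`nplike_of_obj` API): when the caller already knows which array backend the data belongs to, passing it explicitly skips a per-call isinstance check and inference step.

```python
import numpy as np


def infer_nplike(data):
    """Stand-in for nplike inference: guess the backend from the data's type."""
    if isinstance(data, np.ndarray):
        return np
    raise TypeError(f"unknown array type: {type(data)!r}")


class MyIndex:
    """Hypothetical index wrapper, not awkward's real Index class."""

    def __init__(self, data, nplike=None):
        # inference only runs when the caller did not pass the nplike explicitly
        self._nplike = infer_nplike(data) if nplike is None else nplike
        self._data = data


data = np.arange(5, dtype=np.int64)

inferred = MyIndex(data)             # pays for inference on every construction
explicit = MyIndex(data, nplike=np)  # caller forwards the already-known nplike

assert inferred._nplike is explicit._nplike is np
```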