
perf: some more python overhead reduction #3363

Closed

Conversation

Collaborator

@pfackeldey pfackeldey commented Jan 8, 2025

This PR reduces some more of the Python overhead (a continuation of the investigation in #3359).

With this PR, the runtimes are reduced by a further ~10% (based on the same example as #3359):

  • CPU backend: 25.1 ms ± 126 μs -> 23.4 ms ± 269 μs
  • TypeTracer backend: 24.4 ms ± 195 μs -> 22.3 ms ± 130 μs

The main improvements are:

  • Merging two for loops in broadcast_any_list()
  • Merging two for loops in TypeTracer.broadcast_arrays()
  • Avoiding unnecessary recalculations of n_missing_dims in the TypeTracer.__getitem__ loop
  • Skipping named_axis propagation in slicing when no named axes are present
  • Passing nplike to Index instantiations where possible, avoiding runtime type checks and nplike inference via nplike_of_obj
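The first two bullets are the same kind of change; a minimal sketch (hypothetical names, not the actual awkward-array code) of what merging two loops looks like — the input sequence is traversed once and both derived lists grow in lock-step:

```python
# Hypothetical sketch of the loop-merging change described above.

def two_passes(arrays):
    # before: one traversal per derived list
    shapes = [a["shape"] for a in arrays]
    ndims = [len(a["shape"]) for a in arrays]
    return shapes, ndims

def one_pass(arrays):
    # after: a single loop computes both results per element
    shapes, ndims = [], []
    for a in arrays:
        shape = a["shape"]
        shapes.append(shape)
        ndims.append(len(shape))
    return shapes, ndims

arrays = [{"shape": (3, 1)}, {"shape": (1, 4, 5)}]
assert one_pass(arrays) == two_passes(arrays)
```

The payoff per element is tiny (one loop's worth of iterator overhead), which is why the review below asks whether each such change actually moved the benchmark numbers.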

Comment on lines -1169 to -1170
shapes = [x.shape for x in all_arrays]
shape = self.broadcast_shapes(*shapes)
Collaborator

I'd be tempted not to make these kinds of changes unless the numbers really support them (with the bias that I find list-comprehensions more readable/Pythonic than for loops).

Member

I'm ambivalent about this distinction; a "for loop to get everything" is a common pattern and it makes it a little more obvious that the lists (all_arrays and all_shapes) have the same lengths, because they're appended in lock-step. Of course, you also know that with the list comprehension, since it doesn't have an if clause.

If anything, I'm more concerned about the extra variable names floating around. As a list comprehension, it could have been defined inside of the call to broadcast_shapes.

But maybe more globally, @pfackeldey: you probably made a lot of these changes while watching the performance numbers and stopped when you found a significant optimization. Afterward, did you roll back any changes that didn't contribute to the optimization? Many of these look like they would have a small effect: a single for loop versus a list comprehension can't make much difference for short lists (the number of arguments to broadcast_arrays).

Collaborator Author

I just tested locally and it doesn't add to the speedup. I'll roll back this change, but add the list comprehension directly into the call to broadcast_shapes. 👍
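A sketch of the agreed-upon style, with a minimal NumPy-style stand-in for broadcast_shapes (the real TypeTracer method is more involved; FakeArray and this simplified broadcasting are made up for illustration):

```python
# Hypothetical stand-ins to illustrate inlining the comprehension.

class FakeArray:
    def __init__(self, shape):
        self.shape = shape

def broadcast_shapes(*shapes):
    # simplified: pad shorter shapes with leading 1s, take max per dimension
    ndim = max(len(s) for s in shapes)
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    return tuple(max(dims) for dims in zip(*padded))

all_arrays = [FakeArray((3, 1)), FakeArray((4,))]

# before:
#   shapes = [x.shape for x in all_arrays]
#   shape = broadcast_shapes(*shapes)
# after (no extra variable name in scope):
shape = broadcast_shapes(*(x.shape for x in all_arrays))
```

The comprehension inside the call keeps the readability benefit without binding an intermediate name that outlives its single use.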

@pfackeldey pfackeldey requested a review from jpivarski January 8, 2025 15:21
@@ -427,6 +427,8 @@ def __getitem__(
if not isinstance(key, tuple):
key = (key,)

ndim = self.ndim
Member

Is the cost of the __getattr__ really too much? (Is ndim an expensive property?)

I see the value of precomputing a formula like n_missing_dims, but if self.ndim is just an attribute, skipping that is a microoptimization, and the code is more readable if there are fewer variables floating around.

Collaborator Author

It doesn't cost much; it's just calculating the len of a tuple. It would only become relevant if that happened millions (or rather billions) of times. I'm confident that rolling this back doesn't change anything.
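A rough illustration of the point, with a made-up Tracer class standing in for the real object: when ndim is just the len of an existing tuple, caching it in a local variable only matters at enormous call counts.

```python
import timeit

# Hypothetical class for the sketch, not the real TypeTracerArray.
class Tracer:
    def __init__(self, shape):
        self._shape = shape

    @property
    def ndim(self):
        # cheap: length of a tuple that already exists
        return len(self._shape)

t = Tracer((2, 3, 4))
assert t.ndim == 3

# repeated property lookups vs. one cached local
per_lookup = timeit.timeit(lambda: t.ndim, number=100_000)
ndim = t.ndim  # the precomputation in question
per_local = timeit.timeit(lambda: ndim, number=100_000)
# both are fractions of a microsecond per access; the gap is invisible
# unless the surrounding loop runs millions of times
```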

Member

@jpivarski left a comment

Nothing looks bad to me, but if you found this optimization by refactoring until you got an improvement, please check to see what can be changed back without affecting that improvement. (I think of it like regularization in ML: first do too much, then scale back unused parameters.)

I would guess that the biggest contribution comes from refactoring __getitem__. That seems like the most likely suspect to me, and we never did an optimization campaign in that code. (It's a pretty big change, so we're relying on the tests to ensure that the behavior hasn't changed.)

@pfackeldey
Collaborator Author

I'm closing this PR because:

These changes are conceptually beneficial for the runtime, but they are (most likely all) micro-optimizations. They are of the same size as (or even smaller than) external factors, e.g. the thermal state of my Mac: running this benchmark on the current main, but in a cooler room than my office, reduced the runtime to 21 ms. macOS might also do additional throttling (e.g. based on battery), as @jpivarski pointed out. These systematic unknowns overshadow the micro-optimizations of this PR. (Just to be sure, I went back to #3359 and confirmed that those improvements are larger than these external systematic error sources.)

The good news:

More than half of the trijet mass calculation is spent in a single line:

# typetracer backend

In [1]: %timeit trijet.j1 + trijet.j2 + trijet.j3
12.1 ms ± 71.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This seems to be related to Python overhead coming from https://github.com/scikit-hep/vector. I'll have a look there, as that looks like a more promising place for improvement.

@pfackeldey pfackeldey closed this Jan 8, 2025