perf: some more python overhead reduction #3363
Conversation
shapes = [x.shape for x in all_arrays]
shape = self.broadcast_shapes(*shapes)
I'd be tempted not to make these kinds of changes unless the numbers really support them (with the bias that I find list-comprehensions more readable/Pythonic than for loops).
I'm ambivalent about this distinction; a "for loop to get everything" is a common pattern, and it makes it a little more obvious that the lists (`all_arrays` and `all_shapes`) have the same lengths, because they're `append`ed in lock-step. Of course, you also know that with the list comprehension, since it doesn't have an `if` clause.

If anything, I'm more concerned about the extra variable names floating around. As a list comprehension, it could have been defined inside of the call to `broadcast_shapes`.

But maybe more globally, @pfackeldey, you probably made a lot of these changes while watching the performance numbers, and stopped when you found a significant optimization. Afterward, did you roll back any changes that didn't contribute to the optimization? Many of these look like they would have a small effect: a single `for` loop versus a list comprehension can't be much for short lists (the number of arguments to `broadcast_arrays`).
I just tested locally and it doesn't add to the speedup. I'll roll back this change, but add the list comprehension directly into the call to `broadcast_shapes`. 👍
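For illustration, here is a minimal, self-contained sketch of the change agreed on above. It uses NumPy's `broadcast_shapes` as a stand-in for the `TypeTracer` method, so the names and inputs are placeholders, not awkward's actual code.

```python
import numpy as np

# stand-in inputs; in the real code these are typetracer arrays
all_arrays = [np.empty((3, 1)), np.empty((1, 4)), np.empty((3, 4))]

# before: build an intermediate, named list of shapes first
shapes = []
for x in all_arrays:
    shapes.append(x.shape)
shape_loop = np.broadcast_shapes(*shapes)

# after: define the comprehension directly inside the call,
# so no extra variable name is floating around
shape_inline = np.broadcast_shapes(*[x.shape for x in all_arrays])

assert shape_loop == shape_inline == (3, 4)
```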
@@ -427,6 +427,8 @@ def __getitem__(
    if not isinstance(key, tuple):
        key = (key,)

    ndim = self.ndim
Is the cost of the `__getattr__` really too much? (Is `ndim` an expensive property?)

I see the value of precomputing a formula like `n_missing_dims`, but if `self.ndim` is just an attribute, skipping that is a microoptimization, and the code is more readable if there are fewer variables floating around.
It doesn't cost much; it's calculating the `len` of a tuple. It would only become relevant if that happens millions (or rather billions) of times. I'm confident that rolling this back doesn't change anything.
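To make the trade-off concrete, here is a hypothetical sketch (the class and method names are made up, not the real typetracer code): caching a cheap property like `ndim`, which is just the `len` of the shape tuple, in a local variable only saves an attribute lookup per access.

```python
class FakeTracer:
    """Made-up stand-in for the array class discussed above."""

    def __init__(self, shape):
        self._shape = tuple(shape)

    @property
    def ndim(self):
        # cheap: just the len of a tuple
        return len(self._shape)

    def missing_dims_precomputed(self, key):
        ndim = self.ndim  # looked up once, reused below
        return ndim - len(key)

    def missing_dims_direct(self, key):
        return self.ndim - len(key)  # looked up at the point of use


tracer = FakeTracer((2, 3, 4))
assert tracer.missing_dims_precomputed((0,)) == tracer.missing_dims_direct((0,)) == 2
```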
Nothing looks bad to me, but if you found this optimization by refactoring until you got an improvement, please check to see what can be changed back without affecting that improvement. (I think of it like regularization in ML: first do too much, then scale back unused parameters.)
I would guess that the biggest contribution comes from refactoring `__getitem__`. That seems like the most likely suspect to me, and we never did an optimization campaign in that code. (It's a pretty big change, so we're relying on the tests to ensure that the behavior hasn't changed.)
I'm closing this PR because: these changes are conceptually beneficial for the runtime; however, they are (likely all) micro-optimizations. They are of the same size as (or even less relevant than) external factors, e.g. the heat of my Mac. Running this benchmark on the current main, but in a cooler room than my office, reduced the runtime to 21 ms. Also, my macOS might do some additional throttling (e.g. based on battery), as @jpivarski pointed out. These systematic unknowns overshadow the micro-optimizations of this PR. (Just to be sure, I went back to #3359 and confirmed that those improvements are larger than these external systematic error sources.)

The good news: more than half of the trijet mass calculation is spent in a single line:

# typetracer backend
In [1]: %timeit trijet.j1 + trijet.j2 + trijet.j3
12.1 ms ± 71.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This seems to be related to Python overhead coming from https://github.com/scikit-hep/vector. I'll have a look there, as that looks like a more promising place for improvement.
This PR reduces some of the Python overhead (it's a continued investigation of #3359).
With this PR, the runtimes can be reduced further by ~10% (based on the same example as in #3359) to:
The main improvements are:

- `broadcast_any_list()`
- `TypeTracer.broadcast_arrays()`
- avoiding `n_missing_dims` recalculations in the `TypeTracer.__getitem__` loop
- passing `nplike`s to `Index` instance calls where possible, to avoid runtime type checks and nplike inference (with `nplike_of_obj`); a generic sketch of this idea follows below
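As a rough illustration of that last bullet (all names here are hypothetical; this is not awkward's actual `Index`/`nplike_of_obj` API): when the caller already knows which array backend the data belongs to, passing it explicitly skips a per-call isinstance check and inference step.

```python
import numpy as np


def infer_nplike(data):
    """Stand-in for nplike inference: guess the backend from the data's type."""
    if isinstance(data, np.ndarray):
        return np
    raise TypeError(f"unknown array type: {type(data)!r}")


class MyIndex:
    """Hypothetical index wrapper, not awkward's real Index class."""

    def __init__(self, data, nplike=None):
        # inference only runs when the caller did not pass the nplike explicitly
        self._nplike = infer_nplike(data) if nplike is None else nplike
        self._data = data


data = np.arange(5, dtype=np.int64)

inferred = MyIndex(data)             # pays for inference on every construction
explicit = MyIndex(data, nplike=np)  # caller forwards the already-known nplike

assert inferred._nplike is explicit._nplike is np
```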