Speed up Reduce operators for consecutive reduced axes #7206
Conversation
Nice boost. (The DML EP also does this inside DirectML.dll, flattening any adjacent axes that are stride contiguous.)
I did not know. It seemed like a good idea to reduce the cost of moving to the next element to sum. I made a modification to parallelize the case KR (reduce the last dimension). It is faster. I'm working on the last one, RK (reduce the first dimension). Still parallelizing.
About the binary size, should I exclude this from the minimal build? For the rest, I'll think of a better design.
How much growth is there from the change? The easiest way to compare is https://osgwiki.com/wiki/SizeBench with before/after RelWithDebInfo builds. In reply to: 823128901
I started to refactor this optimization and the previous code to reduce the binary size. I'll measure it.
Here is the current status:
What's the before/after? Is it master vs latest changes providing an overall reduction of ~20KB despite adding the new fast reduce logic? In reply to: 827559533
Yes, it means the gain would be even bigger without the new logic. I did not exclude it from the minimal build, but I can if you think it is worth doing.
Description:
This change only improves the Reduce operators when the reduced axes are consecutive: with len(shape) == 4, any single axis is covered, and axes=(0, 1), (1, 2) or (2, 3) are covered, but axes=(0, 2) is not covered by this change and the former implementation prevails. In that case, the shape can be compressed into three configurations (K = axis not reduced, R = reduced axis): KR, RK, KRK. For these three configurations, the reduction can be optimized with vector operations; a sketch of the compression follows.
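A minimal sketch of that compression, assuming a hypothetical helper name `compress_reduced_axes`; the actual kernel is written in C++ with Eigen, this is only an illustrative Python/numpy version:

```python
import numpy as np

def compress_reduced_axes(shape, axes):
    """Hypothetical helper: compress `shape` into one of the patterns
    KR, RK, KRK (K = kept, R = reduced) when the reduced axes are consecutive.
    Returns (pattern, compressed_shape), or None when the axes are not
    consecutive and the former implementation must be used."""
    axes = sorted(a % len(shape) for a in axes)
    if axes != list(range(axes[0], axes[-1] + 1)):
        return None  # e.g. axes=(0, 2): not covered by this change
    k_left = int(np.prod(shape[:axes[0]], dtype=np.int64))           # kept dims before R
    r = int(np.prod(shape[axes[0]:axes[-1] + 1], dtype=np.int64))    # reduced dims
    k_right = int(np.prod(shape[axes[-1] + 1:], dtype=np.int64))     # kept dims after R
    if k_left > 1 and k_right > 1:
        return "KRK", (k_left, r, k_right)
    if k_left > 1:
        return "KR", (k_left, r)
    if k_right > 1:
        return "RK", (r, k_right)
    return "R", (r,)

# Example: axes (1, 2) of a 4-D tensor collapse into the KRK pattern.
print(compress_reduced_axes((8, 16, 32, 4), (1, 2)))  # ('KRK', (8, 512, 4))
```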
One example with configuration KRK. The graph shows the ratio between an implementation and the numpy implementation (numpy time / time), i.e. the speed-up compared to numpy.
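A rough sketch, assuming onnx and onnxruntime are installed, of how such a ratio (numpy time / time) can be measured for a KRK-shaped ReduceSum; the model construction and shapes below are only illustrative, not the actual benchmark behind the graph:

```python
import time

import numpy as np
import onnx
import onnxruntime as ort

# Tiny model that reduces axis 1 of a KRK-shaped input (opset 11, where the
# reduced axes are an attribute of ReduceSum).
node = onnx.helper.make_node("ReduceSum", ["X"], ["Y"], axes=[1], keepdims=0)
graph = onnx.helper.make_graph(
    [node], "reduce_krk",
    [onnx.helper.make_tensor_value_info("X", onnx.TensorProto.FLOAT, [8, 512, 4])],
    [onnx.helper.make_tensor_value_info("Y", onnx.TensorProto.FLOAT, [8, 4])])
model = onnx.helper.make_model(graph, opset_imports=[onnx.helper.make_opsetid("", 11)])
sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])

x = np.random.rand(8, 512, 4).astype(np.float32)

def bench(fn, n=200):
    begin = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - begin) / n

t_np = bench(lambda: x.sum(axis=1))
t_ort = bench(lambda: sess.run(None, {"X": x}))
print("speed-up vs numpy:", t_np / t_ort)  # the ratio plotted in the graph
```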
Motivation and Context
ReduceSum is much slower than TensorFlow on CPU for some configurations. This change makes it almost as fast, or faster.
Current status on Speed Up
Case KRK is faster on 2 and 4 cores; cases KR and RK are not faster, or are slower, with 4 cores. The implementation relies on Eigen. The current implementation is parallelized; the new one may not be. (A small illustration of the KR and RK cases follows below.)
On Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz, 2 cores.
On Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, 4 cores.
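As a mental model only (not the PR's C++/Eigen code), the KR and RK cases collapse to plain matrix reductions that a dense library such as Eigen can vectorize; a small numpy illustration:

```python
import numpy as np

x = np.random.rand(1024, 512).astype(np.float32)

# KR: keep the first (compressed) axis, reduce the last one -> one row sum per row.
kr = x.sum(axis=1)   # shape (1024,)

# RK: reduce the first (compressed) axis, keep the last one -> a column sum,
# i.e. an accumulation over contiguous rows, which vectorizes well.
rk = x.sum(axis=0)   # shape (512,)

# Equivalent formulations as matrix-vector products, the kind of kernel
# dense libraries already optimize and parallelize.
assert np.allclose(kr, x @ np.ones(x.shape[1], dtype=np.float32), atol=1e-3)
assert np.allclose(rk, np.ones(x.shape[0], dtype=np.float32) @ x, atol=1e-3)
```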