Force inline par_for_inner #967
I approve, although I expect you may get some pushback from others in the collab 😉
LGTM. Amazed that this change provides a 40% speedup.
@pgrete any objections?
This is one of the wackier failures of the compiler I've seen.
This seems like a very trivial change in keeping with the original intent, so I'm pressing the button
IIRC the original motivation why we went away from force inline (what we had originally) to just inline was that the (now legacy) Intel compiler was not able to compile the code any more.
@jdolence did you run your tests of this performance with legacy intel?
PR Summary
This PR just changes `KOKKOS_INLINE_FUNCTION` to `KOKKOS_FORCEINLINE_FUNCTION` for all of our `par_for_inner` overloads.

In a downstream code, we found that a particular loop was failing to vectorize when using the `par_for_inner` that corresponds to a single simd for loop, whereas just using the raw simd for loop directly resulted in vectorization (and a 40% speedup of the whole code!). Changing `INLINE` to `FORCEINLINE` on the `par_for_inner` resolves this, suggesting that the compiler was making the obnoxious choice not to inline and then (presumably) failing to vectorize what I can only guess it thought was a function call.
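For readers without the code in front of them, here is a minimal sketch of the pattern being discussed. It is not the actual `par_for_inner` and does not use Kokkos: `DEMO_FORCEINLINE`, `demo_par_for_inner`, `axpy_wrapped`, and `axpy_raw` are hypothetical names, and the macro is only a rough stand-in for the host-side effect of `KOKKOS_FORCEINLINE_FUNCTION`. It assumes a GCC, Clang, or MSVC toolchain for the forced-inline attribute and falls back to plain `inline` otherwise.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a force-inline macro (NOT the Kokkos definition).
// Swapping this for plain `inline` gives the hint-only behavior, where the
// compiler is free to leave the wrapper as an out-of-line call.
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_FORCEINLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define DEMO_FORCEINLINE __forceinline
#else
#define DEMO_FORCEINLINE inline
#endif

// Hypothetical wrapper around a single inner loop over the inclusive
// range [il, iu]; the loop body is passed in as a callable.
template <typename Function>
DEMO_FORCEINLINE void demo_par_for_inner(const int il, const int iu,
                                         const Function &function) {
  for (int i = il; i <= iu; ++i) {
    function(i);
  }
}

// y[i] += a * x[i], written through the wrapper.
inline void axpy_wrapped(const double a, const std::vector<double> &x,
                         std::vector<double> &y) {
  assert(x.size() == y.size());
  const int n = static_cast<int>(x.size());
  const double *xp = x.data();
  double *yp = y.data();
  demo_par_for_inner(0, n - 1, [=](const int i) { yp[i] += a * xp[i]; });
}

// The same operation written as a raw loop, for comparison.
inline void axpy_raw(const double a, const std::vector<double> &x,
                     std::vector<double> &y) {
  assert(x.size() == y.size());
  const int n = static_cast<int>(x.size());
  const double *xp = x.data();
  double *yp = y.data();
  for (int i = 0; i < n; ++i) {
    yp[i] += a * xp[i];
  }
}
```

Both functions compute the same result; the only question is whether the compiler sees the wrapped version as a plain loop it can vectorize. One way to check is to read the compiler's vectorization reports, for example `-fopt-info-vec-missed` with GCC or `-Rpass-missed=loop-vectorize` with Clang.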