Force inline par_for_inner #967

Merged: 3 commits merged into develop on Oct 24, 2023
Conversation

Collaborator

@jdolence jdolence commented Oct 24, 2023

PR Summary

This PR just changes KOKKOS_INLINE_FUNCTION to KOKKOS_FORCEINLINE_FUNCTION for all of our par_for_inner overloads.

In a downstream code, we found that a particular loop was failing to vectorize when using the par_for_inner that corresponds to a single simd for loop, whereas just using the raw simd for loop directly resulted in vectorization (and a 40% speedup of the whole code!). Changing INLINE to FORCEINLINE on the par_for_inner resolves this, suggesting that the compiler was making the obnoxious choice not to inline and then (presumably) failing to vectorize what I can only guess it thought was a function call.
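To illustrate the pattern being described, here is a minimal host-only sketch, not the project's actual implementation. `SKETCH_FORCEINLINE`, `inner_for`, and `axpy` are hypothetical names; the macro is an assumed stand-in for what `KOKKOS_FORCEINLINE_FUNCTION` provides on a host compiler, and the `#pragma omp simd` loop is an assumption about what "a single simd for loop" looks like. The idea is that the loop only vectorizes well if the lambda call inside it is inlined into the caller.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a force-inline macro (assumption: host-only build).
#if defined(_MSC_VER)
#define SKETCH_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_FORCEINLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCEINLINE inline  // fallback: only a hint to the compiler
#endif

// Thin wrapper around one innermost loop with inclusive bounds [il, iu].
// If the compiler declines to inline this wrapper, the loop body becomes an
// opaque call from the caller's point of view and may not be vectorized.
template <typename Function>
SKETCH_FORCEINLINE void inner_for(const int il, const int iu,
                                  const Function &function) {
#pragma omp simd  // ignored (possibly with a warning) if OpenMP is disabled
  for (int i = il; i <= iu; ++i) {
    function(i);
  }
}

// Example caller: y = a * x + y, written through the wrapper.
inline void axpy(const double a, const std::vector<double> &x,
                 std::vector<double> &y) {
  assert(x.size() == y.size());
  const double *xp = x.data();
  double *yp = y.data();
  const int n = static_cast<int>(x.size());
  inner_for(0, n - 1, [=](const int i) { yp[i] += a * xp[i]; });
}
```

Whether a given compiler actually vectorizes the loop can be checked with its optimization report (for example `-fopt-info-vec` on GCC or `-Rpass=loop-vectorize` on Clang); the sketch itself only demonstrates the structure, not the speedup reported above.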

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@jdolence jdolence changed the title Force inline par_for_inner WIP: Force inline par_for_inner Oct 24, 2023
@jdolence jdolence changed the title WIP: Force inline par_for_inner Force inline par_for_inner Oct 24, 2023
Collaborator

@pdmullen pdmullen left a comment

I approve, although I expect you may get some pushback from others in the collab 😉

Collaborator

@lroberts36 lroberts36 left a comment

LGTM. Amazed that this change provides a 40% speedup.

@jdolence
Collaborator Author

> I approve, although I expect you may get some pushback from others in the collab 😉

@pgrete any objections?

Collaborator

@Yurlungur Yurlungur left a comment

This is one of the wackier failures of the compiler I've seen.

@Yurlungur Yurlungur enabled auto-merge October 24, 2023 19:38
@Yurlungur
Collaborator

This seems like a very trivial change in keeping with the original intent, so I'm pressing the button

@Yurlungur Yurlungur merged commit abfae20 into develop Oct 24, 2023
49 checks passed
@pgrete
Collaborator

pgrete commented Oct 27, 2023

IIRC, the original reason we moved away from force inline (what we had originally) to plain inline was that the (now legacy) Intel compiler could no longer compile the code.
Might be worth double-checking.
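If the legacy Intel compiler does turn out to be a problem, one possible way to reconcile the two concerns is a compiler-specific guard. The sketch below is an assumption for illustration only, not something proposed in this PR: `SKETCH_INNER_INLINE`, `guarded_inner_for`, and `sum_range` are hypothetical names. It relies on the classic Intel compiler predefining `__INTEL_COMPILER`, which has to be tested before `__GNUC__` because that compiler also defines the GNU macro.

```cpp
#include <vector>

// Hypothetical guard: fall back to a plain inline hint on the classic
// (legacy) Intel compiler, force inlining everywhere else.
#if defined(__INTEL_COMPILER)  // classic icc/icpc; checked before __GNUC__
#define SKETCH_INNER_INLINE inline
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_INNER_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define SKETCH_INNER_INLINE __forceinline
#else
#define SKETCH_INNER_INLINE inline
#endif

// Innermost-loop wrapper with inclusive bounds [il, iu].
template <typename Function>
SKETCH_INNER_INLINE void guarded_inner_for(const int il, const int iu,
                                           const Function &function) {
  for (int i = il; i <= iu; ++i) {
    function(i);
  }
}

// Example caller: sum the entries of a vector through the wrapper.
inline double sum_range(const std::vector<double> &v) {
  double total = 0.0;
  const int n = static_cast<int>(v.size());
  guarded_inner_for(0, n - 1, [&](const int i) { total += v[i]; });
  return total;
}
```

The trade-off is that builds with the legacy compiler would keep the old behavior, including any missed vectorization, so this only makes sense if that compiler still needs to be supported.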

@Yurlungur
Collaborator

> IIRC, the original reason we moved away from force inline (what we had originally) to plain inline was that the (now legacy) Intel compiler could no longer compile the code. Might be worth double-checking.

@jdolence did you run your performance tests of this change with the legacy Intel compiler?

5 participants