Drop volatile from reduce/scan join() routines #4931
@dalg24 Are you that confident in the tests, or do you really think the CUDA reduce/scan code is solid with the added synchronization?
This is the direction we want to go. I trust the tooling and the tests until proven wrong. I'd rather merge early in the release cycle than wait.
I'm on board with giving it a try (without extra volatile preloads).
…es, and many implementations
The need for `volatile` on join() operations of reducers and reduction subject types was an accommodation for a quirk of CUDA's non-standard memory model. After fixes for data races in our CUDA reductions reported by Nvidia's compute-sanitizer racecheck tool (#4855), tests seem to pass without maintaining the `volatile` qualifiers.
Nvidia only shipped compute-sanitizer with CUDA 11+. It's possible the compiler has been correspondingly tweaked to be stricter about micro-optimizations that may interact with the memory model. So adopting this only if and when we require CUDA 11 may be a precaution worth considering.
Let's just see what happens. @dalg24, if you are OK with it, merge it.
Well, that seems like unanimous support. If you're all happy with it, I'm not going to stand in the way.
Folks, this has been breaking KK for 5 days. Can you fix or revert, please? High-priority PRs are all blocked because testing is broken.
Following the fixes for data races described in #4855, it appears that the `volatile` qualifiers may no longer be necessary at all.

There may be other fixes for lurking data races in the CUDA reduction implementation that will still need to be applied:
This is a squash of everything in #4901, without the actual `volatile_preload` trick that I thought was necessary.

Other uses of `volatile` that may be cleanable following this change are listed here: https://gist.github.com/PhilMiller/575baac87d1965a7bdcb98f812a23dd9
Fixes #4077, #1554
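
For readers unfamiliar with what dropping `volatile` from join() looks like in user code, here is a minimal sketch in plain C++. It does not use the library itself: `MinMax`, `MinMaxFunctor`, and `combine` are hypothetical names made up for illustration, and `combine` is only a serial stand-in for whatever a backend does when it merges partial results. The commented-out signature shows the `volatile`-qualified style under discussion; the live one shows the plain-reference style this change is aiming for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction value type, for illustration only.
struct MinMax {
  double min_val;
  double max_val;
};

// Hypothetical functor written in the style of a custom reduction.
struct MinMaxFunctor {
  using value_type = MinMax;

  // volatile-qualified style that this change is meant to make unnecessary:
  //   void join(volatile value_type& dst,
  //             const volatile value_type& src) const;

  // Plain-reference style: merge one partial result into another.
  void join(value_type& dst, const value_type& src) const {
    if (src.min_val < dst.min_val) dst.min_val = src.min_val;
    if (src.max_val > dst.max_val) dst.max_val = src.max_val;
  }
};

// Serial stand-in (an assumption, not the real backend) for combining
// per-thread partial results by repeated calls to join().
// Requires a non-empty vector of partials.
MinMax combine(const MinMaxFunctor& f, const std::vector<MinMax>& partials) {
  MinMax result = partials.front();
  for (std::size_t i = 1; i < partials.size(); ++i) {
    f.join(result, partials[i]);
  }
  return result;
}
```

One practical reason the plain-reference form is attractive: with `volatile`-qualified arguments, a class-type `value_type` cannot use its implicitly generated copy assignment, so user types tended to need extra `volatile`-qualified overloads just to be usable in a reduction.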