cudaMallocAsync not supported by UCX, may cause failure in OpenMPI+Kokkos+cuda applications #4228
Comments
#4026 is the PR that does this (@matt-stack). We should make the use of cudaMallocAsync opt-in with a CMake flag, at least until UCX fully supports that allocator.
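To make the opt-in idea concrete, here is a minimal sketch of a compile-time switch between the two allocators. It is not the Kokkos implementation and not the change from any PR in this thread. The macro `SKETCH_USE_CUDA_MALLOC_ASYNC` and the helpers `sketch_device_alloc`, `sketch_device_free`, and `selected_allocator` are hypothetical names; a CMake option could define the macro. The CUDA calls (`cudaMalloc`, `cudaMallocAsync`, `cudaFreeAsync`, `cudaStreamSynchronize`) are only compiled under `nvcc`; with a plain host compiler the sketch falls back to `malloc`/`free` stand-ins so the selection logic can still be built.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical opt-in macro (an assumption for this sketch). A build system
// could define it from a CMake option; it is off unless explicitly requested.
// #define SKETCH_USE_CUDA_MALLOC_ASYNC

#if defined(__CUDACC__)
#include <cuda_runtime.h>
#endif

enum class Allocator { Sync, StreamOrdered };

// Reports which allocator this translation unit was built to use.
constexpr Allocator selected_allocator() {
#if defined(SKETCH_USE_CUDA_MALLOC_ASYNC)
  return Allocator::StreamOrdered;
#else
  return Allocator::Sync;
#endif
}

// Allocates `bytes` of device memory; returns nullptr on failure.
// Without nvcc this is only a host-memory stand-in.
void* sketch_device_alloc(std::size_t bytes) {
#if defined(__CUDACC__)
  void* ptr = nullptr;
#if defined(SKETCH_USE_CUDA_MALLOC_ASYNC)
  // Stream-ordered allocation on the default stream (CUDA 11.2 or newer).
  if (cudaMallocAsync(&ptr, bytes, 0) != cudaSuccess) return nullptr;
  // Wait so the pointer is usable as soon as this function returns.
  if (cudaStreamSynchronize(0) != cudaSuccess) return nullptr;
#else
  // Default path: the classic synchronous allocator.
  if (cudaMalloc(&ptr, bytes) != cudaSuccess) return nullptr;
#endif
  return ptr;
#else
  return std::malloc(bytes);  // host stand-in, no CUDA involved
#endif
}

// Releases memory obtained from sketch_device_alloc.
void sketch_device_free(void* ptr) {
#if defined(__CUDACC__)
#if defined(SKETCH_USE_CUDA_MALLOC_ASYNC)
  cudaFreeAsync(ptr, 0);
  cudaStreamSynchronize(0);
#else
  cudaFree(ptr);
#endif
#else
  std::free(ptr);  // host stand-in
#endif
}
```

With the macro left undefined, every allocation goes through `cudaMalloc`, so buffers handed to a CUDA-aware MPI never come from the stream-ordered allocator unless the user asks for it at build time.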
This affects our code as well. It looks like UCX 1.14rc1 may fix this issue: https://github.com/openucx/ucx/blob/d83ef403646473d76be42cbef03fb38652169f78/NEWS#L49. Do you have a reproducer for the issue that can be tested with the latest UCX?
Geez, I found that bug ages ago and forgot all about it. So Cray seems to have the same problem. There's also an easy Kokkos patch.
I can test it when Perlmutter is back. Thanks!
On Jan 25, 2023, at 9:12 AM, Ben Wibking wrote:
This affects my code as well. It looks like UCX 1.14rc1 may fix this issue: https://github.com/openucx/ucx/blob/d83ef403646473d76be42cbef03fb38652169f78/NEWS#L49.
Do you have a reproducer for the issue that can be tested with the latest UCX?
Fixed in #4233.
See UCX issue openucx/ucx#7194
cudaMallocAsync was added to Kokkos/Cuda sometime in the last few months and results in an application crash in LAMMPS+Kokkos+CUDA using OpenMPI+UCX.
The UCX issue comments mention 3 PRs to UCX that should resolve the problem.
In the meantime the following patch to Kokkos will fix it: