Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cudaMallocAsync not supported by UCX, may cause failure in OpenMPI+Kokkos+cuda applications #4228

Closed
lfmeadow opened this issue Aug 11, 2021 · 4 comments
Labels
InDevelop Enhancement, fix, etc. has been merged into the develop branch;

Comments

@lfmeadow
Copy link

lfmeadow commented Aug 11, 2021

See UCX issue openucx/ucx#7194
cudaMallocAsync was added to Kokkos/Cuda sometime in the last few months and results in a application crash in lammps+kokkos+cuda using OpenMPI+UCX.
The UCX issue comments mention 3 PRs to UCX that should resolve the problem.
In the meantime the following patch to Kokkos will fix it:

diff --git a/core/src/Cuda/Kokkos_CudaSpace.cpp b/core/src/Cuda/Kokkos_CudaSpace.cpp
index 39da8de6b..13c35eb87 100644
--- a/core/src/Cuda/Kokkos_CudaSpace.cpp
+++ b/core/src/Cuda/Kokkos_CudaSpace.cpp
@@ -229,7 +229,7 @@ void *CudaSpace::impl_allocate(
 
 #ifndef CUDART_VERSION
 #error CUDART_VERSION undefined!
-#elif (CUDART_VERSION >= 11020)
+#elif 0 // (CUDART_VERSION >= 11020) UCX issue
   cudaError_t error_code;
   if (arg_alloc_size >= memory_threshold_g) {
     error_code = cudaMallocAsync(&ptr, arg_alloc_size, 0);
@@ -358,7 +358,7 @@ void CudaSpace::impl_deallocate(
   try {
 #ifndef CUDART_VERSION
 #error CUDART_VERSION undefined!
-#elif (CUDART_VERSION >= 11020)
+#elif 0 // (CUDART_VERSION >= 11020) UCX issue
     if (arg_alloc_size >= memory_threshold_g) {
       CUDA_SAFE_CALL(cudaDeviceSynchronize());
       CUDA_SAFE_CALL(cudaFreeAsync(arg_alloc_ptr, 0));
@maxpkatz
Copy link

#4026 is the PR that does this (@matt-stack). We should make the use of cudaMallocAsync opt-in with a CMake flag at least until UCX fully supports that allocator.

@ajpowelsnl ajpowelsnl added the InDevelop Enhancement, fix, etc. has been merged into the develop branch; label Aug 12, 2021
@crtrott crtrott added InDevelop Enhancement, fix, etc. has been merged into the develop branch; and removed InDevelop Enhancement, fix, etc. has been merged into the develop branch; labels Aug 17, 2021
@BenWibking
Copy link

BenWibking commented Jan 25, 2023

This affects our code as well. It looks like UCX 1.14rc1 may fix this issue: https://github.com/openucx/ucx/blob/d83ef403646473d76be42cbef03fb38652169f78/NEWS#L49.

Do you have a reproducer for the issue that can be tested with the latest UCX?

@lfmeadow
Copy link
Author

lfmeadow commented Jan 25, 2023 via email

@masterleinad
Copy link
Contributor

Fixed in #4233.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
InDevelop Enhancement, fix, etc. has been merged into the develop branch;
Projects
None yet
Development

No branches or pull requests

6 participants