Add rapids_test allowing projects to run gpu tests in parallel (#328)

Introduces `rapids_test` functionality that lets tests executed via `ctest -j` share GPUs correctly. Each test states how many GPU allocations it requires, and CTest's internal job scheduler uses those requirements to load balance the tests.

Authors:
- Robert Maynard (https://github.com/robertmaynard)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #328
rapidsai · Mar 7, 2023 · commit f7876e6 · 1 parent e6a4d70
Showing 70 changed files with 2,492 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -108,6 +108,16 @@ The most commonly used function are:
 - `rapids_find_package(<project_name> BUILD_EXPORT_SET <name> INSTALL_EXPORT_SET <name> )` Combines `find_package` and support to track dependencies for easy package exporting
 - `rapids_generate_module(<PackageName> HEADER_NAMES <paths...> LIBRARY_NAMES <names...> )` Generate a FindModule for the given package. Allows association to export sets so the generated FindModule can be shipped with the project
 
+### test
+
+The `rapids_test` functions simplify CTest resource allocation, allowing for
+tests to run in parallel without overallocating GPU resources.
+
+The most commonly used functions are:
+- `rapids_test_add(NAME <test_name> GPUS <N> PERCENT <N>)`: State how many GPU resources a single
+  test requires
+
+
 ## Overriding RAPIDS.cmake
 
 At times projects or developers will need to verify ``rapids-cmake`` branches. To do this you can set variables that control which repository ``RAPIDS.cmake`` downloads, which should be done like this:

diff --git a/cmake-format-rapids-cmake.json b/cmake-format-rapids-cmake.json
@@ -310,8 +310,49 @@
           "TARGET": "1",
           "ROOT_DIRECTORY": "1"
         }
+      },
+      "rapids_test_init": {
+        "pargs": {
+          "nargs": "0"
+        }
+      },
+      "rapids_test_add": {
+        "pargs": {
+          "nargs": "0"
+        },
+        "kwargs": {
+          "NAME": "1",
+          "COMMAND": "*",
+          "INSTALL_COMPONENT_SET": "1",
+          "GPUS": "1",
+          "PERCENT": "1",
+          "WORKING_DIRECTORY": "1"
+        }
+      },
+      "rapids_test_gpu_requirements": {
+        "pargs": {
+          "nargs": "1"
+        },
+        "kwargs": {
+          "GPUS": "1",
+          "PERCENT": "1"
+        }
+      },
+      "rapids_test_generate_resource_spec": {
+        "pargs": {
+          "nargs": "2"
+        }
+      },
+      "rapids_test_install_relocatable": {
+        "pargs": {
+          "nargs": "0",
+          "flags": ["EXCLUDE_FROM_ALL"]
+        },
+        "kwargs": {
+          "INSTALL_COMPONENT_SET": "1",
+          "DESTINATION": "1"
+        }
       }
-
     }
   }
 }
diff --git a/dependencies.yaml b/dependencies.yaml
@@ -41,23 +41,28 @@ dependencies:
             packages:
               - cudatoolkit=11.2
               - gcc<11.0.0
+              - sysroot_linux-64==2.17
           - matrix:
               cuda: "11.4"
             packages:
               - cudatoolkit=11.4
               - gcc<11.0.0
+              - sysroot_linux-64==2.17
           - matrix:
               cuda: "11.5"
             packages:
               - cudatoolkit=11.5
+              - sysroot_linux-64==2.17
           - matrix:
               cuda: "11.6"
             packages:
               - cudatoolkit=11.6
+              - sysroot_linux-64==2.17
           - matrix:
               cuda: "11.8"
             packages:
               - cudatoolkit=11.8
+              - sysroot_linux-64==2.17
   docs:
     common:
       - output_types: [conda]

diff --git a/docs/api.rst b/docs/api.rst
@@ -133,3 +133,18 @@ correct export generation. These should only be used when :cmake:command:`rapids
    rapids_export_find_package_file [Advanced] </command/rapids_export_find_package_file>
    rapids_export_find_package_root [Advanced] </command/rapids_export_find_package_root>
    rapids_export_package [Advanced] </command/rapids_export_package>
+
+Testing
+*******
+
+The `rapids_test` functions simplify CTest resource allocation, allowing for tests to run in parallel without over-allocating GPU resources.
+More information on resource allocation can be found in the rapids-cmake :ref:`Hardware Resources and Testing documentation <rapids_resource_allocation>`.
+
+.. toctree::
+   :titlesonly:
+
+   /command/rapids_test_init
+   /command/rapids_test_add
+   /command/rapids_test_generate_resource_spec
+   /command/rapids_test_gpu_requirements
+   /command/rapids_test_install_relocatable
diff --git a/docs/command/rapids_test_add.rst b/docs/command/rapids_test_add.rst
@@ -0,0 +1 @@
+.. cmake-module:: ../../rapids-cmake/test/add.cmake
diff --git a/docs/command/rapids_test_generate_resource_spec.rst b/docs/command/rapids_test_generate_resource_spec.rst
@@ -0,0 +1 @@
+.. cmake-module:: ../../rapids-cmake/test/generate_resource_spec.cmake
diff --git a/docs/command/rapids_test_gpu_requirements.rst b/docs/command/rapids_test_gpu_requirements.rst
@@ -0,0 +1 @@
+.. cmake-module:: ../../rapids-cmake/test/gpu_requirements.cmake
diff --git a/docs/command/rapids_test_init.rst b/docs/command/rapids_test_init.rst
@@ -0,0 +1 @@
+.. cmake-module:: ../../rapids-cmake/test/init.cmake
diff --git a/docs/command/rapids_test_install_relocatable.rst b/docs/command/rapids_test_install_relocatable.rst
@@ -0,0 +1 @@
+.. cmake-module:: ../../rapids-cmake/test/install_relocatable.cmake
diff --git a/docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp b/docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp
@@ -0,0 +1,96 @@
+/*
+ * Copyright (c) 2022-2023, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <rapids_cmake_ctest_allocation.hpp>
+
+#include <cuda_runtime_api.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <numeric>
+#include <string>
+#include <string_view>
+
+namespace rapids_cmake {
+
+namespace {
+GPUAllocation noGPUAllocation() { return GPUAllocation{-1, -1}; }
+
+GPUAllocation parseCTestAllocation(std::string_view env_variable)
+{
+  std::string gpu_resources{std::getenv(env_variable.data())};
+  // need to handle parseCTestAllocation variable being empty
+
+  // need to handle parseCTestAllocation variable not having some
+  // of the requested components
+
+  // The string looks like "id:<number>,slots:<number>"
+  auto id_start   = gpu_resources.find("id:") + 3;
+  auto id_end     = gpu_resources.find(",");
+  auto slot_start = gpu_resources.find("slots:") + 6;
+
+  auto id    = gpu_resources.substr(id_start, id_end - id_start);
+  auto slots = gpu_resources.substr(slot_start);
+
+  return GPUAllocation{std::stoi(id), std::stoi(slots)};
+}
+
+std::vector<GPUAllocation> determineGPUAllocations()
+{
+  std::vector<GPUAllocation> allocations;
+  const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
+  if (!resource_count) {
+    allocations.emplace_back();
+    return allocations;
+  }
+
+  const auto resource_max = std::stoi(resource_count);
+  for (int index = 0; index < resource_max; ++index) {
+    std::string group_env = "CTEST_RESOURCE_GROUP_" + std::to_string(index);
+    std::string resource_group{std::getenv(group_env.c_str())};
+    std::transform(resource_group.begin(), resource_group.end(), resource_group.begin(), ::toupper);
+
+    if (resource_group == "GPUS") {
+      auto resource_env = group_env + "_" + resource_group;
+      auto&& allocation = parseCTestAllocation(resource_env);
+      allocations.emplace_back(allocation);
+    }
+  }
+
+  return allocations;
+}
+}  // namespace
+
+bool using_resources()
+{
+  const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
+  return resource_count != nullptr;
+}
+
+std::vector<GPUAllocation> full_allocation() { return determineGPUAllocations(); }
+
+cudaError_t bind_to_gpu(GPUAllocation const& alloc) { return cudaSetDevice(alloc.device_id); }
+
+bool bind_to_first_gpu()
+{
+  if (using_resources()) {
+    std::vector<GPUAllocation> allocs = determineGPUAllocations();
+    return (bind_to_gpu(allocs[0]) == cudaSuccess);
+  }
+  return false;
+}
+
+}  // namespace rapids_cmake
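
The two "need to handle" comments in `parseCTestAllocation` leave the empty and partial cases open. One possible way to cover them is sketched below. This is not part of the commit: `Allocation`, `parse_field`, `parse_allocation`, and `allocation_from_env` are hypothetical names, and the only thing taken from the diff is the `id:<number>,slots:<number>` string layout.

```cpp
#include <cassert>
#include <charconv>
#include <cstdlib>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical stand-in for rapids_cmake::GPUAllocation so the sketch compiles alone.
struct Allocation {
  int device_id;
  int slots;
};

// Parse the integer that follows `key` (for example "id:") inside `text`.
// Returns std::nullopt when the key is absent or no number follows it.
std::optional<int> parse_field(std::string_view text, std::string_view key)
{
  const auto pos = text.find(key);
  if (pos == std::string_view::npos) { return std::nullopt; }
  const char* first = text.data() + pos + key.size();
  const char* last  = text.data() + text.size();
  int value         = 0;
  const auto result = std::from_chars(first, last, value);
  if (result.ec != std::errc{}) { return std::nullopt; }
  return value;
}

// Parse "id:<number>,slots:<number>"; both components must be present.
std::optional<Allocation> parse_allocation(std::string_view text)
{
  const auto id    = parse_field(text, "id:");
  const auto slots = parse_field(text, "slots:");
  if (!id || !slots) { return std::nullopt; }
  return Allocation{*id, *slots};
}

// Read the named environment variable, guarding against it being unset.
std::optional<Allocation> allocation_from_env(const char* name)
{
  const char* raw = std::getenv(name);
  if (raw == nullptr) { return std::nullopt; }
  return parse_allocation(raw);
}
```

Returning `std::optional` lets the caller decide what an absent or malformed allocation means, instead of constructing a `std::string` from a null pointer or calling `std::stoi` on an unexpected substring.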
diff --git a/docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp b/docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2022-2023, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <cuda_runtime_api.h>
+#include <vector>
+
+namespace rapids_cmake {
+
+/*
+ * Represents a GPU Allocation provided by a CTest resource specification.
+ *
+ * The `device_id` maps to the CUDA gpu id required by `cudaSetDevice`.
+ * The slots represent the percentage of the GPU that this test will use.
+ * Primarily used by CTest to ensure proper load balancing of tests.
+ */
+struct GPUAllocation {
+  int device_id;
+  int slots;
+};
+
+/*
+ * Returns true when a CTest resource specification has been specified.
+ *
+ * Since the vast majority of tests should execute without a CTest resource
+ * spec (e.g. when executed manually by a developer), callers of `rapids_cmake`
+ * should first ensure that a CTest resource spec file has been provided before
+ * trying to query/bind to the allocation.
+ *
+ * ```cxx
+ *   if (rapids_cmake::using_resources()) {
+ *     rapids_cmake::bind_to_first_gpu();
+ *   }
+ * ```
+ */
+bool using_resources();
+
+/*
+ * Returns all GPUAllocations allocated for a test
+ *
+ * To support multi-GPU tests the CTest resource specification allows a
+ * test to request multiple GPUs. As CUDA only allows binding to a
+ * single GPU at any time, this API allows tests to know what CUDA
+ * devices they should bind to.
+ *
+ * Note: The `device_id` of each allocation might not be unique.
+ * If a test says it needs 50% of two GPUs, it could be allocated
+ * the same physical GPU. If a test needs distinct / unique devices
+ * it must request 51%+ of a device.
+ *
+ * Note: rapids_cmake does no caching, so this query should be cached
+ * instead of called multiple times.
+ */
+std::vector<GPUAllocation> full_allocation();
+
+/*
+ * Have CUDA bind to a given GPUAllocation
+ *
+ * Have CUDA bind to the `device_id` specified in the CTest
+ * GPU allocation
+ *
+ * Note: Return value is the cudaError_t of `cudaSetDevice`
+ */
+cudaError_t bind_to_gpu(GPUAllocation const& alloc);
+
+/*
+ * Convenience method to bind to the first GPU that CTest has allocated
+ * Provided as most RAPIDS tests only require a single GPU
+ *
+ * Will return `false` if no GPUs have been allocated, or if setting
+ * the CUDA device failed for any reason.
+ */
+bool bind_to_first_gpu();
+
+}  // namespace rapids_cmake
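
The note on `full_allocation()` says the `device_id` values it returns might not be unique. A multi-GPU test that wants to know which distinct devices it was handed could collapse the list first. The sketch below is an illustration, not code from the commit: `unique_device_ids` is a hypothetical helper, and the `GPUAllocation` struct is a local copy of the one in the header so the example builds without CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Local copy of the struct declared in rapids_cmake_ctest_allocation.hpp,
// repeated here only so the sketch is self-contained.
struct GPUAllocation {
  int device_id;
  int slots;
};

// Hypothetical helper: return the distinct device ids, in ascending order,
// found in a list of allocations.
std::vector<int> unique_device_ids(std::vector<GPUAllocation> const& allocations)
{
  std::vector<int> ids;
  ids.reserve(allocations.size());
  for (auto const& alloc : allocations) { ids.push_back(alloc.device_id); }
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
  return ids;
}
```

A test could call this once on the result of `full_allocation()` and keep the vector, which also follows the header's advice to cache the query instead of repeating it.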