Add rapids_test allowing projects to run gpu tests in parallel (#328)
Introduces `rapids_test` functionality that allows tests executed via `ctest -j` to share GPU resources correctly.

This is done by having each test state how many GPU allocations it requires; CTest's internal job scheduler then uses those requirements to load balance the tests.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #328
robertmaynard authored Mar 7, 2023
1 parent e6a4d70 commit f7876e6
Showing 70 changed files with 2,492 additions and 4 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -108,6 +108,16 @@ The most commonly used function are:
- `rapids_find_package(<project_name> BUILD_EXPORT_SET <name> INSTALL_EXPORT_SET <name> )` Combines `find_package` and support to track dependencies for easy package exporting
- `rapids_generate_module(<PackageName> HEADER_NAMES <paths...> LIBRARY_NAMES <names...> )` Generate a FindModule for the given package. Allows association to export sets so the generated FindModule can be shipped with the project

### test

The `rapids_test` functions simplify CTest resource allocation, allowing for
tests to run in parallel without overallocating GPU resources.

The most commonly used functions are:
- `rapids_test_add(NAME <test_name> GPUS <N> PERCENT <N>)`: State how many GPU resources a single
test requires
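
As a rough illustration of what the `GPUS` and `PERCENT` values mean for parallelism, the sketch below estimates an upper bound on how many copies of one test could run at the same time. It assumes every GPU exposes 100 slots and that `PERCENT` maps directly to slots; `max_concurrent_tests` is a hypothetical helper and not part of rapids-cmake, and the real limit also depends on the `ctest -j` value and on the other tests being scheduled.

```cpp
#include <stdexcept>

// Hypothetical helper, not part of rapids-cmake.
// Assumption: each GPU exposes 100 slots, and a test that asks for
// PERCENT p consumes p slots for each of the GPUS it requests.
int max_concurrent_tests(int gpu_count, int gpus_per_test, int percent)
{
  if (gpu_count < 0 || gpus_per_test <= 0 || percent <= 0 || percent > 100) {
    throw std::invalid_argument("invalid GPU requirement");
  }
  // Number of allocations of this size that fit on a single GPU.
  const int allocations_per_gpu = 100 / percent;
  // All allocations on the machine, divided among tests that each need
  // `gpus_per_test` of them.
  return (gpu_count * allocations_per_gpu) / gpus_per_test;
}
```

Under these assumptions, a machine with two GPUs could host up to eight concurrent tests that each request `GPUS 1 PERCENT 25`, but only two that each request `GPUS 1 PERCENT 100`.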


## Overriding RAPIDS.cmake

At times projects or developers will need to verify ``rapids-cmake`` branches. To do this you can set variables that control which repository ``RAPIDS.cmake`` downloads, which should be done like this:
43 changes: 42 additions & 1 deletion cmake-format-rapids-cmake.json
@@ -310,8 +310,49 @@
"TARGET": "1",
"ROOT_DIRECTORY": "1"
}
},
"rapids_test_init": {
"pargs": {
"nargs": "0"
}
},
"rapids_test_add": {
"pargs": {
"nargs": "0"
},
"kwargs": {
"NAME": "1",
"COMMAND": "*",
"INSTALL_COMPONENT_SET": "1",
"GPUS": "1",
"PERCENT": "1",
"WORKING_DIRECTORY": "1"
}
},
"rapids_test_gpu_requirements": {
"pargs": {
"nargs": "1"
},
"kwargs": {
"GPUS": "1",
"PERCENT": "1"
}
},
"rapids_test_generate_resource_spec": {
"pargs": {
"nargs": "2"
}
},
"rapids_test_install_relocatable": {
"pargs": {
"nargs": "0",
"flags": ["EXCLUDE_FROM_ALL"]
},
"kwargs": {
"INSTALL_COMPONENT_SET": "1",
"DESTINATION": "1"
}
}

}
}
}
5 changes: 5 additions & 0 deletions dependencies.yaml
@@ -41,23 +41,28 @@ dependencies:
packages:
- cudatoolkit=11.2
- gcc<11.0.0
- sysroot_linux-64==2.17
- matrix:
cuda: "11.4"
packages:
- cudatoolkit=11.4
- gcc<11.0.0
- sysroot_linux-64==2.17
- matrix:
cuda: "11.5"
packages:
- cudatoolkit=11.5
- sysroot_linux-64==2.17
- matrix:
cuda: "11.6"
packages:
- cudatoolkit=11.6
- sysroot_linux-64==2.17
- matrix:
cuda: "11.8"
packages:
- cudatoolkit=11.8
- sysroot_linux-64==2.17
docs:
common:
- output_types: [conda]
15 changes: 15 additions & 0 deletions docs/api.rst
@@ -133,3 +133,18 @@ correct export generation. These should only be used when :cmake:command:`rapids
rapids_export_find_package_file [Advanced] </command/rapids_export_find_package_file>
rapids_export_find_package_root [Advanced] </command/rapids_export_find_package_root>
rapids_export_package [Advanced] </command/rapids_export_package>

Testing
*******

The `rapids_test` functions simplify CTest resource allocation, allowing for tests to run in parallel without over-allocating GPU resources.
More information on resource allocation can be found in the rapids-cmake :ref:`Hardware Resources and Testing documentation <rapids_resource_allocation>`.

.. toctree::
:titlesonly:

/command/rapids_test_init
/command/rapids_test_add
/command/rapids_test_generate_resource_spec
/command/rapids_test_gpu_requirements
/command/rapids_test_install_relocatable
1 change: 1 addition & 0 deletions docs/command/rapids_test_add.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/add.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_generate_resource_spec.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/generate_resource_spec.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_gpu_requirements.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/gpu_requirements.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_init.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/init.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_install_relocatable.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/install_relocatable.cmake
96 changes: 96 additions & 0 deletions docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp
@@ -0,0 +1,96 @@
/*
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <rapids_cmake_ctest_allocation.hpp>

#include <cuda_runtime_api.h>

#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <numeric>
#include <string>
#include <string_view>

namespace rapids_cmake {

namespace {
GPUAllocation noGPUAllocation() { return GPUAllocation{-1, -1}; }

GPUAllocation parseCTestAllocation(std::string_view env_variable)
{
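// std::getenv needs a null-terminated name, which a std::string_view
// does not guarantee, so copy the name into a std::string first.
const std::string env_name{env_variable};
const char* env_value = std::getenv(env_name.c_str());
// An unset variable yields nullptr, which must not be passed to std::string.
if (env_value == nullptr) { return noGPUAllocation(); }
const std::string gpu_resources{env_value};

// The string looks like "id:<number>,slots:<number>"
const auto id_pos = gpu_resources.find("id:");
const auto id_end = gpu_resources.find(",");
const auto slot_pos = gpu_resources.find("slots:");
// Report "no allocation" when any of the expected components is missing.
if (id_pos == std::string::npos) { return noGPUAllocation(); }
if (id_end == std::string::npos) { return noGPUAllocation(); }
if (slot_pos == std::string::npos) { return noGPUAllocation(); }

const auto id_start = id_pos + 3;
const auto slot_start = slot_pos + 6;
auto id = gpu_resources.substr(id_start, id_end - id_start);
auto slots = gpu_resources.substr(slot_start);

return GPUAllocation{std::stoi(id), std::stoi(slots)};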
}

std::vector<GPUAllocation> determineGPUAllocations()
{
std::vector<GPUAllocation> allocations;
const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
if (!resource_count) {
allocations.emplace_back();
return allocations;
}

const auto resource_max = std::stoi(resource_count);
for (int index = 0; index < resource_max; ++index) {
std::string group_env = "CTEST_RESOURCE_GROUP_" + std::to_string(index);
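// std::getenv returns nullptr for an unset variable; skip such groups
// instead of constructing a std::string from nullptr.
const char* group_value = std::getenv(group_env.c_str());
if (group_value == nullptr) { continue; }
std::string resource_group{group_value};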
std::transform(resource_group.begin(), resource_group.end(), resource_group.begin(), ::toupper);

if (resource_group == "GPUS") {
auto resource_env = group_env + "_" + resource_group;
auto&& allocation = parseCTestAllocation(resource_env);
allocations.emplace_back(allocation);
}
}

return allocations;
}
} // namespace

bool using_resources()
{
const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
return resource_count != nullptr;
}

std::vector<GPUAllocation> full_allocation() { return determineGPUAllocations(); }

cudaError_t bind_to_gpu(GPUAllocation const& alloc) { return cudaSetDevice(alloc.device_id); }

bool bind_to_first_gpu()
{
if (using_resources()) {
std::vector<GPUAllocation> allocs = determineGPUAllocations();
// Guard against an empty allocation list before reading the first entry.
return !allocs.empty() && (bind_to_gpu(allocs.front()) == cudaSuccess);
}
return false;
}

} // namespace rapids_cmake
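
The `id:<number>,slots:<number>` format can also be parsed without exceptions. The following is a minimal sketch, not code from this commit: `ParsedAllocation`, `read_field`, and `parse_allocation` are hypothetical names, and the struct is a CUDA-free stand-in for `GPUAllocation`. It assumes C++17 for `std::from_chars` and `std::optional`, and reports any missing or non-numeric field as `std::nullopt`.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical stand-in for GPUAllocation so the sketch builds without CUDA.
struct ParsedAllocation {
  int device_id;
  int slots;
};

// Reads the integer that directly follows `key` in `text`.
// Returns std::nullopt when the key is absent or not followed by a number.
std::optional<int> read_field(std::string_view text, std::string_view key)
{
  const auto pos = text.find(key);
  if (pos == std::string_view::npos) { return std::nullopt; }
  const std::string_view rest = text.substr(pos + key.size());
  int value = 0;
  const auto result = std::from_chars(rest.data(), rest.data() + rest.size(), value);
  if (result.ec != std::errc{}) { return std::nullopt; }
  return value;
}

// Parses "id:<number>,slots:<number>"; a missing field yields std::nullopt.
std::optional<ParsedAllocation> parse_allocation(std::string_view text)
{
  const auto id    = read_field(text, "id:");
  const auto slots = read_field(text, "slots:");
  if (!id || !slots) { return std::nullopt; }
  return ParsedAllocation{*id, *slots};
}
```

Returning an optional lets the caller decide how to treat a malformed environment variable, instead of relying on a sentinel value or an exception from `std::stoi`.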
89 changes: 89 additions & 0 deletions docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp
@@ -0,0 +1,89 @@
/*
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cuda_runtime_api.h>
#include <vector>

namespace rapids_cmake {

/*
* Represents a GPU Allocation provided by a CTest resource specification.
*
* The `device_id` maps to the CUDA gpu id required by `cudaSetDevice`.
* The slots represent the percentage of the GPU that this test will use.
* Primarily used by CTest to ensure proper load balancing of tests.
*/
struct GPUAllocation {
int device_id;
int slots;
};

/*
* Returns true when a CTest resource specification has been specified.
*
* Since the vast majority of tests should execute without a CTest resource
* spec (e.g. when executed manually by a developer), callers of `rapids_cmake`
* should first ensure that a CTest resource spec file has been provided before
* trying to query/bind to the allocation.
*
* ```cxx
* if (rapids_cmake::using_resources()) {
* rapids_cmake::bind_to_first_gpu();
* }
* ```
*/
bool using_resources();

/*
* Returns all GPUAllocations allocated for a test
*
* To support multi-GPU tests the CTest resource specification allows a
* test to request multiple GPUs. As CUDA only allows binding to a
* single GPU at any time, this API allows tests to know what CUDA
* devices they should bind to.
*
* Note: The `device_id` of each allocation might not be unique.
* If a test says it needs 50% of two GPUs, it could be allocated
* the same physical GPU. If a test needs distinct / unique devices
* it must request 51%+ of a device.
*
* Note: rapids_cmake does no caching, so this query should be cached
* instead of called multiple times.
*/
std::vector<GPUAllocation> full_allocation();

/*
* Have CUDA bind to a given GPUAllocation
*
* Have CUDA bind to the `device_id` specified in the CTest
* GPU allocation
*
* Note: Return value is the cudaError_t of `cudaSetDevice`
*/
cudaError_t bind_to_gpu(GPUAllocation const& alloc);

/*
* Convenience method to bind to the first GPU that CTest has allocated
* Provided as most RAPIDS tests only require a single GPU
*
* Will return `false` if no GPUs have been allocated, or if setting
* the CUDA device failed for any reason.
*/
bool bind_to_first_gpu();

} // namespace rapids_cmake
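
The note on `full_allocation()` says that the `device_id` of each allocation might not be unique. A multi-GPU test that wants to know which physical devices it actually received could collapse the list first. The sketch below is one way to do that; `unique_device_ids` is a hypothetical helper that is not part of this commit, and the local `GPUAllocation` struct only mirrors the fields of `rapids_cmake::GPUAllocation` so that the example builds without CUDA.

```cpp
#include <algorithm>
#include <vector>

// Stand-in with the same fields as rapids_cmake::GPUAllocation (assumption:
// used here only so the sketch compiles without the CUDA headers).
struct GPUAllocation {
  int device_id;
  int slots;
};

// Hypothetical helper: returns the distinct device ids in ascending order.
std::vector<int> unique_device_ids(std::vector<GPUAllocation> const& allocations)
{
  std::vector<int> ids;
  ids.reserve(allocations.size());
  for (auto const& alloc : allocations) { ids.push_back(alloc.device_id); }
  // Sort so that duplicates become adjacent, then drop the repeats.
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
  return ids;
}
```

If the returned vector is shorter than the number of GPUs the test requested, at least two allocations landed on the same physical device.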