Add rapids_test allowing projects to run gpu tests in parallel #328

Merged: 15 commits, merged Mar 7, 2023
10 changes: 10 additions & 0 deletions README.md
@@ -108,6 +108,16 @@ The most commonly used function are:
- `rapids_find_package(<project_name> BUILD_EXPORT_SET <name> INSTALL_EXPORT_SET <name> )` Combines `find_package` and support to track dependencies for easy package exporting
- `rapids_generate_module(<PackageName> HEADER_NAMES <paths...> LIBRARY_NAMES <names...> )` Generate a FindModule for the given package. Allows association to export sets so the generated FindModule can be shipped with the project

### test

The `rapids_test` functions simplify CTest resource allocation, allowing
tests to run in parallel without over-allocating GPU resources.

The most commonly used functions are:
- `rapids_test_add(NAME <test_name> COMMAND <command> GPUS <N> PERCENT <N>)`: Adds a test and states how many GPUs, and what percentage of each, a single run of that test requires
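A minimal sketch of how the `GPUS` and `PERCENT` values could map onto CTest's `RESOURCE_GROUPS` test property. It assumes each requested GPU becomes one resource group asking for `PERCENT` slots of a `gpus` resource type, with every GPU exposing 100 slots; the string syntax follows CTest's documented `[<number>,]<resource-type>:<units>` form, but whether rapids-cmake builds exactly this string is an assumption, and `make_resource_groups` is a hypothetical name.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: builds a CTest RESOURCE_GROUPS value such as "2,gpus:50".
// "2,gpus:50" asks CTest for two resource groups, each needing 50 slots of the
// "gpus" resource type.
// Assumption: one GPU is modeled as 100 slots, so PERCENT maps directly to slots.
std::string make_resource_groups(int gpus, int percent)
{
  if (gpus < 1) { throw std::invalid_argument("GPUS must be at least 1"); }
  if (percent < 1 || percent > 100) {
    throw std::invalid_argument("PERCENT must be in the range [1, 100]");
  }
  return std::to_string(gpus) + ",gpus:" + std::to_string(percent);
}
```

Under this mapping, a test declared with `GPUS 1 PERCENT 25` would let CTest place up to four such tests on one GPU when run as `ctest -j <N> --resource-spec-file <file>`.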


## Overriding RAPIDS.cmake

At times projects or developers will need to verify ``rapids-cmake`` branches. To do this you can set variables that control which repository ``RAPIDS.cmake`` downloads, which should be done like this:
43 changes: 42 additions & 1 deletion cmake-format-rapids-cmake.json
@@ -310,8 +310,49 @@
"TARGET": "1",
"ROOT_DIRECTORY": "1"
}
},
"rapids_test_init": {
"pargs": {
"nargs": "0"
}
},
"rapids_test_add": {
"pargs": {
"nargs": "0"
},
"kwargs": {
"NAME": "1",
"COMMAND": "*",
"INSTALL_COMPONENT_SET": "1",
"GPUS": "1",
"PERCENT": "1",
"WORKING_DIRECTORY": "1"
}
},
"rapids_test_gpu_requirements": {
"pargs": {
"nargs": "1"
},
"kwargs": {
"GPUS": "1",
"PERCENT": "1"
}
},
"rapids_test_generate_resource_spec": {
"pargs": {
"nargs": "2"
}
},
"rapids_test_install_relocatable": {
"pargs": {
"nargs": "0",
"flags": ["EXCLUDE_FROM_ALL"]
},
"kwargs": {
"INSTALL_COMPONENT_SET": "1",
"DESTINATION": "1"
}
}

}
}
}
5 changes: 5 additions & 0 deletions dependencies.yaml
@@ -41,23 +41,28 @@ dependencies:
packages:
- cudatoolkit=11.2
- gcc<11.0.0
- sysroot_linux-64==2.17
- matrix:
cuda: "11.4"
packages:
- cudatoolkit=11.4
- gcc<11.0.0
- sysroot_linux-64==2.17
- matrix:
cuda: "11.5"
packages:
- cudatoolkit=11.5
- sysroot_linux-64==2.17
- matrix:
cuda: "11.6"
packages:
- cudatoolkit=11.6
- sysroot_linux-64==2.17
- matrix:
cuda: "11.8"
packages:
- cudatoolkit=11.8
- sysroot_linux-64==2.17
docs:
common:
- output_types: [conda]
15 changes: 15 additions & 0 deletions docs/api.rst
@@ -133,3 +133,18 @@ correct export generation. These should only be used when :cmake:command:`rapids
rapids_export_find_package_file [Advanced] </command/rapids_export_find_package_file>
rapids_export_find_package_root [Advanced] </command/rapids_export_find_package_root>
rapids_export_package [Advanced] </command/rapids_export_package>

Testing
*******

The ``rapids_test`` functions simplify CTest resource allocation, allowing tests to run in parallel without over-allocating GPU resources.
More information on resource allocation can be found in the rapids-cmake :ref:`Hardware Resources and Testing documentation <rapids_resource_allocation>`.

.. toctree::
   :titlesonly:

   /command/rapids_test_init
   /command/rapids_test_add
   /command/rapids_test_generate_resource_spec
   /command/rapids_test_gpu_requirements
   /command/rapids_test_install_relocatable
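For reference, a sketch of the kind of JSON resource specification file (format version 1.0) that CTest reads with `--resource-spec-file`. This is not the output of `rapids_test_generate_resource_spec`; the helper name `make_resource_spec` and the 100-slots-per-GPU convention are assumptions, chosen so that a `PERCENT` value maps one-to-one onto slots.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: emits a CTest resource specification exposing
// `gpu_count` devices of the "gpus" resource type.
// Assumption: every GPU advertises `slots_per_gpu` (default 100) slots so
// tests can express their needs as a percentage of one device.
std::string make_resource_spec(int gpu_count, int slots_per_gpu = 100)
{
  std::ostringstream out;
  out << "{\n"
      << "  \"version\": {\"major\": 1, \"minor\": 0},\n"
      << "  \"local\": [{\n"
      << "    \"gpus\": [";
  for (int id = 0; id < gpu_count; ++id) {
    if (id > 0) { out << ","; }
    // Resource ids are JSON strings; CTest reports the chosen one back to the
    // test as "id:<id>,slots:<n>" in CTEST_RESOURCE_GROUP_<num>_GPUS.
    out << "\n      {\"id\": \"" << id << "\", \"slots\": " << slots_per_gpu << "}";
  }
  out << "\n    ]\n"
      << "  }]\n"
      << "}\n";
  return out.str();
}
```

Writing `make_resource_spec(2)` to a file and passing it to `ctest -j <N> --resource-spec-file <file>` would describe a machine with two GPUs of 100 slots each.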
1 change: 1 addition & 0 deletions docs/command/rapids_test_add.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/add.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_generate_resource_spec.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/generate_resource_spec.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_gpu_requirements.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/gpu_requirements.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_init.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/init.cmake
1 change: 1 addition & 0 deletions docs/command/rapids_test_install_relocatable.rst
@@ -0,0 +1 @@
.. cmake-module:: ../../rapids-cmake/test/install_relocatable.cmake
96 changes: 96 additions & 0 deletions docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp
@@ -0,0 +1,96 @@
/*
 * Copyright (c) 2022-2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <rapids_cmake_ctest_allocation.hpp>

#include <cuda_runtime_api.h>

#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>
#include <string_view>
#include <vector>

namespace rapids_cmake {

namespace {
GPUAllocation noGPUAllocation() { return GPUAllocation{-1, -1}; }

GPUAllocation parseCTestAllocation(std::string_view env_variable)
{
  // std::getenv needs a null-terminated name, which std::string_view does not guarantee
  std::string const name{env_variable};
  char const* raw_value = std::getenv(name.c_str());
  if (raw_value == nullptr) { return noGPUAllocation(); }
  std::string const gpu_resources{raw_value};

  // The string looks like "id:<number>,slots:<number>"; only the first pair is read
  auto const id_pos   = gpu_resources.find("id:");
  auto const id_end   = gpu_resources.find(',', id_pos);
  auto const slot_pos = gpu_resources.find("slots:");
  if (id_pos == std::string::npos || id_end == std::string::npos ||
      slot_pos == std::string::npos) {
    return noGPUAllocation();
  }

  auto const id_start   = id_pos + 3;
  auto const slot_start = slot_pos + 6;
  auto id    = gpu_resources.substr(id_start, id_end - id_start);
  auto slots = gpu_resources.substr(slot_start);

  return GPUAllocation{std::stoi(id), std::stoi(slots)};
}

std::vector<GPUAllocation> determineGPUAllocations()
{
  std::vector<GPUAllocation> allocations;
  const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
  if (!resource_count) {
    // No CTest resource spec: return one value-initialized allocation (device 0, 0 slots)
    allocations.emplace_back();
    return allocations;
  }

  const auto resource_max = std::stoi(resource_count);
  for (int index = 0; index < resource_max; ++index) {
    std::string group_env = "CTEST_RESOURCE_GROUP_" + std::to_string(index);
    const auto* group_value = std::getenv(group_env.c_str());
    if (!group_value) { continue; }

    // CTest names the per-type variable with the resource type in upper case,
    // e.g. CTEST_RESOURCE_GROUP_0_GPUS
    std::string resource_group{group_value};
    std::transform(resource_group.begin(),
                   resource_group.end(),
                   resource_group.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    if (resource_group == "GPUS") {
      auto resource_env = group_env + "_" + resource_group;
      allocations.emplace_back(parseCTestAllocation(resource_env));
    }
  }

  return allocations;
}
} // namespace

bool using_resources()
{
  const auto* resource_count = std::getenv("CTEST_RESOURCE_GROUP_COUNT");
  return resource_count != nullptr;
}

std::vector<GPUAllocation> full_allocation() { return determineGPUAllocations(); }

cudaError_t bind_to_gpu(GPUAllocation const& alloc) { return cudaSetDevice(alloc.device_id); }

bool bind_to_first_gpu()
{
  if (using_resources()) {
    std::vector<GPUAllocation> allocs = determineGPUAllocations();
    // A test may have been given resource groups that contain no GPUs
    return !allocs.empty() && (bind_to_gpu(allocs[0]) == cudaSuccess);
  }
  return false;
}

} // namespace rapids_cmake
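The `parseCTestAllocation` helper in the file above reads only the first `id:<number>,slots:<number>` pair. CTest documents `CTEST_RESOURCE_GROUP_<num>_<RESOURCE_TYPE>` as a semicolon-separated list of such pairs, so a more defensive parser might look like the sketch below. It is a standalone illustration, not part of this PR: `parse_allocation_list` is a hypothetical name, and the struct is re-declared so the snippet compiles without CUDA.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Mirrors rapids_cmake::GPUAllocation so the sketch compiles on its own.
struct GPUAllocation {
  int device_id;
  int slots;
};

// Parses a value like "id:0,slots:50;id:1,slots:50" into one GPUAllocation per
// ';'-separated item. Items that do not match "id:<int>,slots:<int>" are skipped.
std::vector<GPUAllocation> parse_allocation_list(std::string const& value)
{
  std::vector<GPUAllocation> result;
  std::istringstream items{value};
  std::string item;
  while (std::getline(items, item, ';')) {
    auto const id_pos    = item.find("id:");
    auto const comma_pos = item.find(',');
    auto const slots_pos = item.find("slots:");
    if (id_pos != 0 || comma_pos == std::string::npos || slots_pos != comma_pos + 1) {
      continue;
    }
    try {
      int const id    = std::stoi(item.substr(3, comma_pos - 3));
      int const slots = std::stoi(item.substr(slots_pos + 6));
      result.push_back(GPUAllocation{id, slots});
    } catch (std::exception const&) {
      // std::stoi throws on non-numeric or out-of-range input; skip the item
    }
  }
  return result;
}
```

Skipping malformed items rather than throwing keeps a test runnable when the environment is only partially set, at the cost of silently ignoring bad input.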
89 changes: 89 additions & 0 deletions docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp
@@ -0,0 +1,89 @@
/*
 * Copyright (c) 2022-2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#pragma once

#include <cuda_runtime_api.h>
#include <vector>

namespace rapids_cmake {

/*
 * Represents a GPU allocation provided by a CTest resource specification.
 *
 * The `device_id` maps to the CUDA device id required by `cudaSetDevice`.
 * The `slots` value represents the percentage of the GPU that this test will use.
 * It is primarily used by CTest to ensure proper load balancing of tests.
 */
struct GPUAllocation {
  int device_id;
  int slots;
};

/*
 * Returns true when a CTest resource specification has been provided.
 *
 * Since the vast majority of tests should execute without a CTest resource
 * spec (e.g. when executed manually by a developer), callers of `rapids_cmake`
 * should first ensure that a CTest resource spec file has been provided before
 * trying to query/bind to the allocation.
 *
 * ```cxx
 * if (rapids_cmake::using_resources()) {
 *   rapids_cmake::bind_to_first_gpu();
 * }
 * ```
 */
bool using_resources();

/*
 * Returns all GPUAllocations allocated for a test.
 *
 * To support multi-GPU tests, the CTest resource specification allows a
 * test to request multiple GPUs. As CUDA binds each host thread to a
 * single current device at any time, this API lets tests know which CUDA
 * devices they should bind to.
 *
 * Note: The `device_id` of each allocation might not be unique.
 * If a test says it needs 50% of two GPUs, it could be allocated
 * the same physical GPU twice. If a test needs distinct/unique devices,
 * it must request more than 50% of each device.
 *
 * Note: rapids_cmake does no caching, so callers should cache the result
 * of this query instead of calling it multiple times.
 */
std::vector<GPUAllocation> full_allocation();

/*
 * Binds CUDA to the `device_id` specified in the given CTest GPU allocation.
 *
 * Note: The return value is the cudaError_t returned by `cudaSetDevice`.
 */
cudaError_t bind_to_gpu(GPUAllocation const& alloc);

/*
 * Convenience method to bind to the first GPU that CTest has allocated.
 * Provided because most RAPIDS tests only require a single GPU.
 *
 * Returns `false` if no GPUs have been allocated, or if setting
 * the CUDA device failed for any reason.
 */
bool bind_to_first_gpu();

} // namespace rapids_cmake
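The header notes that allocations may share a `device_id`. A small sketch of how a multi-GPU test might collapse the result of `full_allocation()` to distinct devices before assuming it owns several GPUs; `distinct_devices` is a hypothetical helper, and the struct is re-declared only so the snippet compiles without the CUDA headers.

```cpp
#include <algorithm>
#include <vector>

// Mirrors rapids_cmake::GPUAllocation so the sketch compiles on its own.
struct GPUAllocation {
  int device_id;
  int slots;
};

// Returns the distinct CUDA device ids in an allocation list, in ascending order.
// Two allocations may share a device_id (e.g. 50% of "two" GPUs can land on the
// same physical GPU), so the size of this list is the number of real devices.
std::vector<int> distinct_devices(std::vector<GPUAllocation> const& allocs)
{
  std::vector<int> ids;
  ids.reserve(allocs.size());
  for (auto const& alloc : allocs) { ids.push_back(alloc.device_id); }
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
  return ids;
}
```

A test that truly needs two separate GPUs could, for instance, skip itself when `distinct_devices(rapids_cmake::full_allocation()).size() < 2`.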