Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rapids_test allowing projects to run gpu tests in parallel #328

Merged
merged 15 commits into from
Mar 7, 2023

Conversation

robertmaynard
Copy link
Contributor

@robertmaynard robertmaynard commented Dec 12, 2022

Description

Introduces rapids_test functionality to allow tests executed via ctest -j to properly resource share GPUs.

This is done by having tests state how many GPUs allocations they require, and uses CTest internal job scheduler to properly load balance.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The cmake-format.json is up to date with these changes.
  • I have added new files under rapids-cmake/
    • I have added include guards (include_guard(GLOBAL))
    • I have added the associated docs/ rst file and update the api.rst

@robertmaynard robertmaynard added feature request New feature or request non-breaking Introduces a non-breaking change labels Dec 12, 2022
@robertmaynard robertmaynard requested a review from a team as a code owner December 12, 2022 16:55
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool feature. I am guessing we want this to parallelize RAPIDS tests in CI? One concern is that percentages may not be the most useful metric. Some tests might allocate 8 GB of memory regardless of the GPU size, meaning that the percentage is different on a 32 GB GPU than a 48 GB GPU.

docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.cpp Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp Outdated Show resolved Hide resolved
testing/test/generate_resource_spec-simple.cmake Outdated Show resolved Hide resolved
testing/test/init-activate_allocation_simple/main.cu Outdated Show resolved Hide resolved
testing/test/init-activate_multi_jobs_single_gpu/main.cpp Outdated Show resolved Hide resolved
@robertmaynard
Copy link
Contributor Author

robertmaynard commented Dec 12, 2022

I am guessing we want this to parallelize RAPIDS tests in CI?

Yes, and other NVIDIA projects. This first step should allow some projects like Thrust to have correct parallel support now.
But we are missing a component for RAPIDS ( see below ) CI, but this will alllow local developers using ctest.

detect_number_of_gpus.cmake why not hard errors?

So this comes back to the missing part needed for RAPIDS. There is no hard requirement when configuring and building RAPIDS projects that you have an actual GPU on the machine. Therefore we shouldn't expect that the test harness code to start hard erroring when it executes in these enviornments. RAPIDS CI runs the configure && build steps without any GPU devices so we want to gracefully fallback to presuming 1 GPU in those cases.

This comes back to the missing component. What rapids-cmake needs to offer is some executable / process for CI structures like RAPIDS to re-detect the number of GPUs on the machine before ctest is run. This way CI can build the binaries on machine A and ship to B and properly use the hardware.

I am currently trying to figure out an approach for this, but didn't want to hold up the initial PR

@robertmaynard
Copy link
Contributor Author

Some tests might allocate 8 GB of memory regardless of the GPU size, meaning that the percentage is different on a 32 GB GPU than a 48 GB GPU.

I wanted to keep the usage as simple as possible. I am worried that we use HIGH_WATER_MEMORY instead of PERCENT it will make people think they need to be super accurate.

Another issue with using MEMORY is how we describe tests that use all GPU resources. We can't say each test requires 1TB of memory as that request can't be fulfilled. Likewise we can't use a magic placeholder like ALL since the GPUs that exist at CMake configure time might not match the GPUs that exist when we run the tests / update tool.

@robertmaynard robertmaynard force-pushed the fea/rapids-cmake-test branch 4 times, most recently from d65f91c to aa41d6e Compare December 21, 2022 16:40
@robertmaynard robertmaynard requested a review from bdice January 3, 2023 16:12
@robertmaynard robertmaynard force-pushed the fea/rapids-cmake-test branch from a75c984 to a892495 Compare January 3, 2023 21:50
@robertmaynard robertmaynard changed the title Fea/rapids cmake test Add rapids_test allowing projects to run gpu tests in parallel Jan 10, 2023
@robertmaynard robertmaynard changed the base branch from branch-23.02 to branch-23.04 January 24, 2023 16:02
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertmaynard I started a review a while ago (probably a month ago) and didn't click submit. I am submitting those comments now -- forgive me if the comments are outdated.

README.md Outdated Show resolved Hide resolved
cmake-format-rapids-cmake.json Outdated Show resolved Hide resolved
docs/api.rst Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp Outdated Show resolved Hide resolved
docs/cpp_code_snippets/rapids_cmake_ctest_allocation.hpp Outdated Show resolved Hide resolved
rapids-cmake/test/detail/generate_resource_spec.cpp Outdated Show resolved Hide resolved
rapids-cmake/test/detail/run_gpu_test.cmake Outdated Show resolved Hide resolved
rapids-cmake/test/generate_resource_spec.cmake Outdated Show resolved Hide resolved
endif()

# verify that percent is inside the allowed bounds
if(percent GREATER 100 OR percent LESS 1 OR (NOT percent MATCHES "^[0-9]+$"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'm fine with whatever way you want to resolve this.

testing/test/add-single-job-multi-gpu/main.cu Outdated Show resolved Hide resolved
@robertmaynard robertmaynard requested a review from a team as a code owner February 21, 2023 19:53
Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving ops-codeowner file changes

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is pretty hefty. I skimmed it again (most code was familiar) and had only a few minor suggestions. I'll approve this and you can handle my suggestions as you wish.

README.md Outdated Show resolved Hide resolved
rapids-cmake/test/install_relocatable.cmake Outdated Show resolved Hide resolved
rapids-cmake/test/install_relocatable.cmake Outdated Show resolved Hide resolved
.clang-format Outdated Show resolved Hide resolved
Introduces rapids_test functionality to allow tests executed via ctest -j to properly resource share GPUs.
This is done by having tests state how many GPUs allocations they require, and uses CTest internal job scheduler to properly load balance.
@robertmaynard
Copy link
Contributor Author

Squashed everything down to a single commit and removed all the unneeded changes thanks to #378 being merged.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool!

I have a general question about properties that came up a few times in this review: can you set_property with any arbitrary property string, even if it doesn't already exist? Do you not need to define_property first in those cases?

I haven't reviewed the tests yet, will do that in a subsequent pass.

README.md Outdated Show resolved Hide resolved
dependencies.yaml Show resolved Hide resolved
rapids-cmake/rapids-test.cmake Show resolved Hide resolved
rapids-cmake/test/detail/generate_resource_spec.cpp Outdated Show resolved Hide resolved
rapids-cmake/test/detail/generate_resource_spec.cpp Outdated Show resolved Hide resolved
rapids-cmake/test/install_relocatable.cmake Outdated Show resolved Hide resolved
rapids-cmake/test/gpu_requirements.cmake Show resolved Hide resolved
rapids-cmake/test/add.cmake Show resolved Hide resolved
rapids-cmake/test/add.cmake Outdated Show resolved Hide resolved
rapids-cmake/test/install_relocatable.cmake Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! It would be nice if we could somehow template the testing files with preprocessor macros or something since there's so much duplication (50% of this PR seems like it's just the license headers) but other than that everything looks good.

rapids-cmake/test/add.cmake Outdated Show resolved Hide resolved
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@robertmaynard robertmaynard force-pushed the fea/rapids-cmake-test branch from e53dcc0 to e6a8403 Compare March 7, 2023 15:19
@robertmaynard
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f7876e6 into rapidsai:branch-23.04 Mar 7, 2023
@robertmaynard robertmaynard deleted the fea/rapids-cmake-test branch March 7, 2023 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Introduces a non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants