Very Slow MPI CI for GPU backends in some settings #1174

Closed
tcojean opened this issue Nov 2, 2022 · 1 comment · Fixed by #1178
Labels
mod:cuda: This is related to the CUDA module.
mod:hip: This is related to the HIP module.
mod:mpi: This is related to the MPI module.
reg:ci-cd: This is related to the continuous integration system.

Comments

@tcojean
Member

tcojean commented Nov 2, 2022

On some machines and systems, the CI jobs for MPI-enabled solvers seem to become increasingly slow. Interestingly, this does not happen everywhere, so it may simply be a setup issue. What follows is a non-exhaustive list of what I could see:

Slow jobs:

Not slow:

Questions:

  • Since this is slow only for GPU backends, is there a problem with the non-GPU-direct communication logic that makes all of this terribly slow?
  • Another possibility: are we somehow oversubscribing the GPUs in some settings (multiple processes using the same GPU)? Oversubscribing the GPUs could go very badly and make everything very slow.
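To make the oversubscription question concrete, here is a minimal sketch of how one might check whether several node-local ranks end up on the same GPU. It assumes a simple round-robin mapping from local rank to device index, which is only an illustration and not necessarily what the CI or the library does. The helper names `assign_devices` and `oversubscribed_devices` are hypothetical, and in a real MPI run the node-local rank would have to come from the launcher or from something like `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`.

```python
from collections import Counter


def assign_devices(num_local_ranks, num_devices):
    """Map each node-local rank to a device index (assumed round-robin)."""
    if num_devices <= 0:
        raise ValueError("need at least one device")
    return [rank % num_devices for rank in range(num_local_ranks)]


def oversubscribed_devices(assignment):
    """Return {device: rank_count} for devices used by more than one rank."""
    counts = Counter(assignment)
    return {dev: n for dev, n in counts.items() if n > 1}


# 4 ranks on a node with 2 GPUs: every GPU is shared by two processes.
print(oversubscribed_devices(assign_devices(4, 2)))  # {0: 2, 1: 2}
# 2 ranks on a node with 2 GPUs: one process per GPU, nothing is shared.
print(oversubscribed_devices(assign_devices(2, 2)))  # {}
```

A non-empty result would suggest that the job launches more ranks per node than there are GPUs, which is the situation the second question describes.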
@tcojean changed the title from "Very Slow MPI CI within (some) containers" to "Very Slow MPI CI for GPU backends in some settings" Nov 2, 2022
@tcojean added the reg:ci-cd, mod:cuda, mod:hip, and mod:mpi labels Nov 2, 2022
@tcojean
Member Author

tcojean commented Nov 4, 2022

For reference,

The problem was indeed GPU oversubscription. This can be seen in the pipelines of #1178 where:
