Very Slow MPI CI for GPU backends in some settings #1174

tcojean · 2022-11-02T16:58:02Z

There seems to be an issue on some machines and systems where the CI jobs with MPI-enabled solvers become increasingly slow. Interestingly, that doesn't seem to be the case everywhere, so it may simply be a setup issue. What follows is a non-exhaustive list of what I could see:

Slow jobs:

AMDCI system with ROCm 4.0 only for the HIP executor and a docker backend: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3260809640
HoreKa with CUDA only for the CUDA executor and enroot+slurm backend (this job was later moved to the nla-gpu machine): https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3243280182

Not slow:

nla-gpu machine with ROCm and docker backend:
https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3259622866

Questions:

~~Since this is slow only for GPU backends, is there a problem with the non-GPU-direct communication logic that makes all of this terribly slow?~~
Another possibility: are we somehow oversubscribing the GPUs in some settings (multiple processes using the same GPU)? Oversubscribing the GPUs could degrade performance badly and make the jobs very slow.
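To make the oversubscription question concrete, here is a minimal sketch of how several MPI processes can end up on the same GPU. It assumes a round-robin mapping from node-local rank to device ID, which is a common convention but is not confirmed to be what the Ginkgo tests do; the function names are hypothetical and only illustrate the idea.

```python
from collections import Counter


def assign_devices(num_local_ranks, num_devices):
    """Map each node-local rank to a device ID round-robin (assumed policy)."""
    if num_devices < 1:
        raise ValueError("at least one device is required")
    return [rank % num_devices for rank in range(num_local_ranks)]


def ranks_per_device(assignment):
    """Count how many ranks share each device ID."""
    return Counter(assignment)


def is_oversubscribed(num_local_ranks, num_devices):
    """True if any device is shared by more than one rank."""
    counts = ranks_per_device(assign_devices(num_local_ranks, num_devices))
    return max(counts.values(), default=0) > 1
```

With four ranks and a single visible GPU, every rank maps to device 0, so `is_oversubscribed(4, 1)` is true; with four visible GPUs each rank gets its own device.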


tcojean · 2022-11-04T09:14:54Z

For reference, the problem was indeed GPU oversubscription. This can be seen in the pipelines of #1178, where:

The AMDCI job with ROCm 4.0 was moved from amdci to nla-gpu, with much better performance: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3270785055
A new HoreKa job was added with SLURM_GRES=gpu:4, with much better performance as well: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3270785071
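As a rough illustration of why requesting `gpu:4` helps, the sketch below parses a GRES-style string and checks whether each MPI rank on a node can get its own GPU. It is an assumption-laden example: only the simplified forms `gpu`, `gpu:N`, and `gpu:TYPE:N` are handled, the helper names are hypothetical, and the number of ranks per node used by the CI jobs is not stated in this issue.

```python
def gpus_in_gres(gres):
    """Return the GPU count in a simplified GRES string such as 'gpu:4'.

    Handles 'gpu', 'gpu:N' and 'gpu:TYPE:N' only; anything that does not
    start with 'gpu' counts as zero GPUs.
    """
    parts = gres.strip().split(":")
    if parts[0] != "gpu":
        return 0
    last = parts[-1]
    # A missing count (plain 'gpu' or 'gpu:TYPE') is treated as one GPU.
    return int(last) if last.isdigit() else 1


def one_gpu_per_rank(ranks_per_node, gres):
    """True if the requested GPUs cover every rank on the node."""
    return gpus_in_gres(gres) >= ranks_per_node
```

For example, `one_gpu_per_rank(4, "gpu:4")` holds, while `one_gpu_per_rank(4, "gpu:1")` does not, which corresponds to the oversubscribed case.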

tcojean changed the title ~~Very Slow MPI CI within (some) containers~~ Very Slow MPI CI for GPU backends in some settings Nov 2, 2022

tcojean added the labels reg:ci-cd (related to the continuous integration system), mod:cuda (related to the CUDA module), mod:hip (related to the HIP module), and mod:mpi (related to the MPI module) Nov 2, 2022

pratikvn mentioned this issue Nov 3, 2022

Pipeline updates to improve MPI CI jobs. #1178

Merged

pratikvn closed this as completed in #1178 Nov 4, 2022
