tcojean opened this issue Nov 2, 2022 · 1 comment · Fixed by #1178
Labels
mod:cuda (This is related to the CUDA module.)
mod:hip (This is related to the HIP module.)
mod:mpi (This is related to the MPI module)
reg:ci-cd (This is related to the continuous integration system.)
tcojean changed the title from "Very Slow MPI CI within (some) containers" to "Very Slow MPI CI for GPU backends in some settings" on Nov 2, 2022

There seems to be an issue on some machines and systems where the CI with MPI-enabled solvers becomes increasingly slow. Interestingly, that doesn't seem to be the case everywhere, so it may simply be a setup issue. What follows is a non-exhaustive list of what I could see:

Slow jobs:
HIP executor and a docker backend: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3260809640
CUDA executor and enroot+slurm backend (this job was later moved to the nla-gpu machine): https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3243280182

Not slow:
https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/3259622866

Questions:
Since this is slow only for GPU backends, is there a problem with the non-GPU-direct communication logic which makes all of this terribly slow?
Another possibility: are we somehow oversubscribing the GPUs in some settings (multiple processes using the same GPU)? Oversubscription of the GPUs could go very badly and make the jobs very slow.
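
To make the oversubscription question (multiple processes using the same GPU) a bit more concrete, here is a minimal sketch of how one could check for it. It assumes a simple round-robin mapping from node-local MPI rank to device ID, which may or may not match what the CI jobs actually do. The function names `map_ranks_to_devices` and `oversubscribed_devices` are hypothetical helpers made up for this illustration and are not part of Ginkgo or any MPI library.

```python
from __future__ import annotations

from collections import Counter


def map_ranks_to_devices(ranks_per_node: int, devices_per_node: int) -> list[int]:
    """Assumed round-robin mapping: node-local rank r uses device r % devices_per_node."""
    if ranks_per_node < 0 or devices_per_node <= 0:
        raise ValueError("need ranks_per_node >= 0 and devices_per_node > 0")
    return [rank % devices_per_node for rank in range(ranks_per_node)]


def oversubscribed_devices(assignment: list[int]) -> dict[int, int]:
    """Return {device_id: process_count} for every device shared by more than one process."""
    counts = Counter(assignment)
    return {device: n for device, n in sorted(counts.items()) if n > 1}


if __name__ == "__main__":
    # Hypothetical setting: 4 MPI processes on a node that exposes 2 GPUs.
    assignment = map_ranks_to_devices(ranks_per_node=4, devices_per_node=2)
    print(assignment)                          # [0, 1, 0, 1]
    print(oversubscribed_devices(assignment))  # {0: 2, 1: 2}
```

In a real job, the node-local rank would have to come from the launcher, for example the `OMPI_COMM_WORLD_LOCAL_RANK` environment variable under Open MPI or `SLURM_LOCALID` under Slurm. If a container only exposes one GPU while the test launches several ranks, every rank lands on device 0 under this mapping, which is the kind of setting the question is about.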