Issues Building Sedov with GPU on Expanse #2234

Closed

joehellmers opened this issue Jun 21, 2022 · 16 comments
@joehellmers
Contributor

Hello,

I'm loading the following modules on Expanse-SDSC.

    module load gpu/0.15.4
    module load nvhpc/22.2
    module load openmpi

Then, when building, the linker gives this message:

nvcc fatal   : Don't know what to do with '/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib'
make: *** [../../../external/amrex/Tools/GNUMake/Make.rules:56: Castro3d.gnu.MPI.CUDA.ex] Error 1

My makefile is:

PRECISION  = DOUBLE
PROFILE    = FALSE

DEBUG      = FALSE

DIM        = 3

COMP       = gnu

USE_MPI    = TRUE
USE_OMP    = FALSE
USE_CUDA   = TRUE
USE_MHD    = FALSE

USE_FORT_MICROPHYSICS := FALSE
BL_NO_FORT := TRUE

# define the location of the CASTRO top directory
CASTRO_HOME  := ../../..

# This sets the EOS directory in $(MICROPHYSICS_HOME)/EOS
EOS_DIR     := gamma_law

# This sets the network directory in $(MICROPHYSICS_HOME)/Networks
NETWORK_DIR := general_null
NETWORK_INPUTS = gammalaw.net

Bpack   := ./Make.package
Blocs   := .

include $(CASTRO_HOME)/Exec/Make.Castro

Does anybody have any recommendations?

@maximumcats
Member

When using OpenMPI, the AMReX GNU Make build system integrates it with CUDA by doing, effectively, nvcc -ccbin=mpicxx. mpicxx then evaluates to whatever the real host compiler is (e.g. nvc++ or g++). There is some subtlety in making sure that all of the options set by mpicxx get passed correctly to the host compiler, and various site setups can sometimes interfere with the way AMReX does this.
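For illustration (file names and paths here are hypothetical, not the exact AMReX rules), the pattern looks roughly like this, with host-only flags wrapped in -Xcompiler/-Xlinker so that nvcc does not try to interpret them itself:

    # hypothetical sketch of the compile/link pattern AMReX generates
    nvcc -ccbin=mpicxx -x cu -c Castro.cpp -o Castro.o
    nvcc -ccbin=mpicxx Castro.o -o Castro3d.gnu.MPI.CUDA.ex \
        -Xcompiler=-pthread \
        -Xlinker=-rpath -Xlinker=/path/to/openmpi/lib \
        -L/path/to/openmpi/lib -lmpi

Anything that reaches nvcc outside of those wrappers has to be an option nvcc itself understands, which is where site-specific MPI wrapper configurations can cause trouble.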

Can you share the output of mpicxx -showme:compile and mpicxx -showme:link with this module set loaded?

@joehellmers
Contributor Author

Thanks for the help, @maximumcats

[jhellmer@login01 ~]$ module purge
[jhellmer@login01 ~]$ module list
No modules loaded
[jhellmer@login01 ~]$ module load gpu/0.15.4
[jhellmer@login01 ~]$ module load nvhpc/22.2
[jhellmer@login01 ~]$ module load openmpi
[jhellmer@login01 ~]$ mpicxx -showme:compile
-I/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/include
[jhellmer@login01 ~]$ mpicxx -showme:link
-L/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -Wl,-rpath -Wl,/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -Wl,-rpath -Wl,/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -lmpi_cxx -lmpi

@maximumcats
Member

Alright, thanks. Can you share the whole build log? I would like to see a few full compile and link lines to better understand the context of the error message.

@joehellmers
Contributor Author

One message I'm seeing that is troubling is
/bin/sh: /usr/local/cuda/extras/demo_suite/deviceQuery: No such file or directory

@maximumcats
Member

One message I'm seeing that is troubling is /bin/sh: /usr/local/cuda/extras/demo_suite/deviceQuery: No such file or directory

You can ignore that, it's not fatal to the build process.

@joehellmers
Contributor Author

Here is the build.log

build.log

@maximumcats
Member

OK, thanks. I think this is running up against a limitation in nvcc. nvcc doesn't know what to do with options like -rpath; those are intended for the host compiler/linker. By default, nvcc throws an error if it sees an argument it doesn't recognize (and that isn't explicitly passed through to the host compiler with -Xcompiler). Since this can be annoying to deal with, NVIDIA added the --forward-unknown-to-host-compiler option in CUDA 11, which passes all non-nvcc options to the host compiler (g++ by default).

However, a limitation in nvcc's option parsing is that it only knows how to forward options of the form "-foo=bar", not "-foo bar", which is how "-rpath /path/to/lib" is being injected here. So nvcc forwards "-rpath" to the host compiler and leaves the argument to -rpath for itself to parse; that argument is now just a bare path to a directory, which isn't a valid compiler option, so the link fails. (I am not sure what to do about this yet, just wanted to provide an update.)
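As a minimal illustration of that parsing limitation (paths are hypothetical; the second command reproduces the kind of failure in your build log):

    # single-token form: forwarded to the host compiler as one unknown option
    nvcc --forward-unknown-to-host-compiler -Wl,-rpath,/opt/openmpi/lib main.o -o a.out

    # two-token form: "-rpath" is forwarded, but the bare directory path that
    # follows is left for nvcc, which reports "Don't know what to do with ..."
    nvcc --forward-unknown-to-host-compiler -rpath /opt/openmpi/lib main.o -o a.out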

@maximumcats
Member

It probably also doesn't help that the AMReX build system defaults to using g++ as the host compiler even when that's inconsistent with your intent (which, in this case, it is). So you could try building with NVCC_HOST_COMP=nvc++ and see if that makes any difference. Alternatively, you could check whether SDSC provides the standalone CUDA toolkit outside the context of NVHPC, in which case you could use that in conjunction with gcc as the host compiler.
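If you try that, it's just an extra variable on the usual make invocation (a sketch; keep whatever flags you normally pass):

    # hypothetical invocation from the Sedov problem directory
    make -j 4 NVCC_HOST_COMP=nvc++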

@WeiqunZhang
Member

What's the result of mpif90 -showme:link? It looks like it does not have -Wl in front of -rpath.

@joehellmers
Contributor Author

[jhellmer@login01 ~]$ mpif90 -showme:link
-I/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -rpath /proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -rpath /cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi

@WeiqunZhang
Member

I think mpif90 -showme:link is wrong. -rpath is an argument for ld, not for nvcc, gcc, or gfortran. It should be something like -Wl,-rpath -Wl,/cm/..., not -rpath /cm/....

I think a workaround for this is

make ...the_usual_arguments... MPI_OTHER_COMP=mpicxx

This will use mpicxx instead of mpif90 to determine the link options. Note that mpicxx -showme:link gives a correct link line. Since this test does not need Fortran, we don't need to link to the MPI Fortran library. But if you do need Fortran in other runs, you could create your own file in amrex/Tools/GNUMake/sites/ that provides the arguments for the linker.
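A rough sketch of what such a site file could contain, assuming it follows the conventions of the existing files in amrex/Tools/GNUMake/sites/ (variable names and paths below are illustrative, so compare against an existing Make.* file there, e.g. Make.unknown):

    # hypothetical Make.expanse fragment for amrex/Tools/GNUMake/sites/
    ifeq ($(USE_MPI),TRUE)
      MPI_HOME ?= /path/to/openmpi-4.0.5      # placeholder for the module's install prefix
      INCLUDE_LOCATIONS += $(MPI_HOME)/include
      LIBRARY_LOCATIONS += $(MPI_HOME)/lib
      # single-token -Wl form so nvcc can forward it to the host compiler
      LIBRARIES += -Wl,-rpath,$(MPI_HOME)/lib -lmpi_mpifh -lmpi
    endif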

It's not clear whether this is a bug in spack or openmpi.

@WeiqunZhang
Member

I have a Spack-installed OpenMPI on my computer. The link flags look right.

$ ~/mygitrepo/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openmpi-4.1.2-flxgubtilm7mmh35rivaon3nxz4nj3ai/bin/mpif90 -showme:link
-pthread ... -Wl,-rpath -Wl,/home/wqzhang/...

@maximumcats
Member

I have a Spack-installed OpenMPI on my computer. The link flags look right.

$ ~/mygitrepo/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openmpi-4.1.2-flxgubtilm7mmh35rivaon3nxz4nj3ai/bin/mpif90 -showme:link
-pthread ... -Wl,-rpath -Wl,/home/wqzhang/...

Right, I also see valid link flags from the OpenMPI 3.1.5 that comes with NVHPC 22.3. So it may be specific to how the SDSC OpenMPI module that @joehellmers is using was configured.

@WeiqunZhang
Member

AMReX-Codes/amrex#2852

@maximumcats
Member

@joehellmers if you could rebuild with the above PR (or latest AMReX development if it's merged before you try it) that will hopefully work around the issue in your case.
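In case it helps, one way to pick up that change (assuming the bundled AMReX is the git checkout under external/amrex, as the error path suggests, and that its origin points at AMReX-Codes/amrex) is to check out either the PR branch or the latest development branch there and rebuild:

    # hypothetical commands, run from the Castro top directory
    cd external/amrex
    git fetch origin pull/2852/head:pr-2852   # or: git checkout development && git pull
    git checkout pr-2852
    cd ../..
    make clean
    make -j 4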

@joehellmers
Contributor Author

I was able to build after making the identical change to the Make.unknown file.
Thanks!
