Issues Building Sedov with GPU on Expanse #2234

Closed

joehellmers opened this issue Jun 21, 2022 · 16 comments
@joehellmers
Contributor

Hello,

I'm loading the following modules on Expanse-SDSC.

    module load gpu/0.15.4
    module load nvhpc/22.2
    module load openmpi

Then, when building, the linker gives this message:

nvcc fatal   : Don't know what to do with '/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib'
make: *** [../../../external/amrex/Tools/GNUMake/Make.rules:56: Castro3d.gnu.MPI.CUDA.ex] Error 1

My makefile is:

PRECISION  = DOUBLE
PROFILE    = FALSE

DEBUG      = FALSE

DIM        = 3

COMP       = gnu

USE_MPI    = TRUE
USE_OMP    = FALSE
USE_CUDA   = TRUE
USE_MHD    = FALSE

USE_FORT_MICROPHYSICS := FALSE
BL_NO_FORT := TRUE

# define the location of the CASTRO top directory
CASTRO_HOME  := ../../..

# This sets the EOS directory in $(MICROPHYSICS_HOME)/EOS
EOS_DIR     := gamma_law

# This sets the network directory in $(MICROPHYSICS_HOME)/Networks
NETWORK_DIR := general_null
NETWORK_INPUTS = gammalaw.net

Bpack   := ./Make.package
Blocs   := .

include $(CASTRO_HOME)/Exec/Make.Castro

Does anybody have any recommendations?

@maximumcats
Member

When using OpenMPI, the AMReX GNU Make build system integrates it with CUDA by doing, effectively, nvcc -ccbin=mpicxx. mpicxx then evaluates to whatever the real host compiler is (e.g. nvc++ or g++). There is some subtlety in making sure that all of the options set by mpicxx get passed correctly to the host compiler, and various site setups can sometimes interfere with the way AMReX does this.
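For illustration (file names and paths here are hypothetical, not the exact AMReX rules), the pattern looks roughly like this, with host-only flags wrapped in -Xcompiler/-Xlinker so that nvcc does not try to interpret them itself:

    # hypothetical sketch of the compile/link pattern AMReX generates
    nvcc -ccbin=mpicxx -x cu -c Castro.cpp -o Castro.o
    nvcc -ccbin=mpicxx Castro.o -o Castro3d.gnu.MPI.CUDA.ex \
        -Xcompiler=-pthread \
        -Xlinker=-rpath -Xlinker=/path/to/openmpi/lib \
        -L/path/to/openmpi/lib -lmpi

Anything that reaches nvcc outside of those wrappers has to be an option nvcc itself understands, which is where site-specific MPI wrapper configurations can cause trouble.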

Can you share the output of mpicxx -showme:compile and mpicxx -showme:link with this module set loaded?

@joehellmers
Contributor Author

Thanks for the help, @maximumcats

[jhellmer@login01 ~]$ module purge
[jhellmer@login01 ~]$ module list
No modules loaded
[jhellmer@login01 ~]$ module load gpu/0.15.4
[jhellmer@login01 ~]$ module load nvhpc/22.2
[jhellmer@login01 ~]$ module load openmpi
[jhellmer@login01 ~]$ mpicxx -showme:compile
-I/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/include
[jhellmer@login01 ~]$ mpicxx -showme:link
-L/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -Wl,-rpath -Wl,/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -Wl,-rpath -Wl,/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -lmpi_cxx -lmpi

@maximumcats
Member

Alright, thanks. Can you share the whole build log? I would like to see a few full compile and link lines to better understand the context of the error message.

@joehellmers
Contributor Author

One message I'm seeing that is troubling is
/bin/sh: /usr/local/cuda/extras/demo_suite/deviceQuery: No such file or directory

@maximumcats
Member

One message I'm seeing that is troubling is /bin/sh: /usr/local/cuda/extras/demo_suite/deviceQuery: No such file or directory

You can ignore that, it's not fatal to the build process.

@joehellmers
Contributor Author

Here is the build.log

build.log

@maximumcats
Member

OK, thanks. I think this is running up against a limitation in nvcc. nvcc doesn't know what to do with options like -rpath; those are intended for the host compiler/linker. By default, nvcc throws an error if it sees an argument it doesn't recognize (and that isn't explicitly passed through to the host compiler with -Xcompiler). Since this can be annoying to deal with, NVIDIA added the --forward-unknown-to-host-compiler option in CUDA 11, which passes all non-nvcc options to the host compiler (g++ by default).

However, a limitation in nvcc's option parsing is that it only knows how to forward options of the form "-foo=bar", not "-foo bar", which is how "-rpath /path/to/lib" is being injected here. So nvcc forwards "-rpath" to the host compiler and leaves the argument to -rpath for itself to parse; that argument is now just a bare path to a directory, which isn't a valid compiler option, so the link fails. (I am not sure what to do about this yet, just wanted to provide an update.)
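As a minimal illustration of that parsing limitation (paths are hypothetical; the second command reproduces the kind of failure in your build log):

    # single-token form: forwarded to the host compiler as one unknown option
    nvcc --forward-unknown-to-host-compiler -Wl,-rpath,/opt/openmpi/lib main.o -o a.out

    # two-token form: "-rpath" is forwarded, but the bare directory path that
    # follows is left for nvcc, which reports "Don't know what to do with ..."
    nvcc --forward-unknown-to-host-compiler -rpath /opt/openmpi/lib main.o -o a.out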

@maximumcats
Member

It probably also doesn't help that the AMReX build system defaults to using g++ as the host compiler even when that's inconsistent with your intent (which, in this case, it is). So you could try building with NVCC_HOST_COMP=nvc++ and see if that makes any difference. Alternatively, you could check whether SDSC provides the standalone CUDA toolkit outside the context of NVHPC, in which case you could use that in conjunction with gcc as the host compiler.
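If you try that, it's just an extra variable on the usual make invocation (a sketch; keep whatever flags you normally pass):

    # hypothetical invocation from the Sedov problem directory
    make -j 4 NVCC_HOST_COMP=nvc++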

@WeiqunZhang
Member

What's the result of mpif90 -showme:link? It looks like it does not have -Wl in front of -rpath.

@joehellmers
Contributor Author

[jhellmer@login01 ~]$ mpif90 -showme:link
-I/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -rpath /proj/nv/libraries/Linux_x86_64/22.2/openmpi4/209566-rel-1/lib -rpath /cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -L/cm/shared/apps/spack/gpu/opt/spack/linux-centos8-skylake_avx512/gcc-8.3.1/nvhpc-22.2/Linux_x86_64/22.2/comm_libs/openmpi4/openmpi-4.0.5/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi

@WeiqunZhang
Member

I think mpif90 -showme:link is wrong. -rpath is an argument for ld, not for nvcc, gcc, or gfortran. It should be something like -Wl,-rpath -Wl,/cm/..., not -rpath /cm/....

I think a workaround for this is

make ...the_usual_arguments... MPI_OTHER_COMP=mpicxx

This will use mpicxx instead of mpif90 to determine the link options. Note that mpicxx -showme:link gives a correct link line. Since this test does not need Fortran, we don't need to link to the MPI Fortran library. But if you do need Fortran in other runs, you could create your own file in amrex/Tools/GNUMake/sites/ that provides the arguments for the linker.
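A rough sketch of what such a site file could contain, assuming it follows the conventions of the existing files in amrex/Tools/GNUMake/sites/ (variable names and paths below are illustrative, so compare against an existing Make.* file there, e.g. Make.unknown):

    # hypothetical Make.expanse fragment for amrex/Tools/GNUMake/sites/
    ifeq ($(USE_MPI),TRUE)
      MPI_HOME ?= /path/to/openmpi-4.0.5      # placeholder for the module's install prefix
      INCLUDE_LOCATIONS += $(MPI_HOME)/include
      LIBRARY_LOCATIONS += $(MPI_HOME)/lib
      # single-token -Wl form so nvcc can forward it to the host compiler
      LIBRARIES += -Wl,-rpath,$(MPI_HOME)/lib -lmpi_mpifh -lmpi
    endif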

It's not clear whether this is a bug in spack or openmpi.

@WeiqunZhang
Member

I have a Spack-installed OpenMPI on my computer. The link flags look right.

$ ~/mygitrepo/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openmpi-4.1.2-flxgubtilm7mmh35rivaon3nxz4nj3ai/bin/mpif90 -showme:link
-pthread ... -Wl,-rpath -Wl,/home/wqzhang/...

@maximumcats
Member

I have a Spack-installed OpenMPI on my computer. The link flags look right.

$ ~/mygitrepo/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.3.0/openmpi-4.1.2-flxgubtilm7mmh35rivaon3nxz4nj3ai/bin/mpif90 -showme:link
-pthread ... -Wl,-rpath -Wl,/home/wqzhang/...

Right, I also see valid link flags from the OpenMPI 3.1.5 that comes with NVHPC 22.3. So it may be specific to how the SDSC OpenMPI module that @joehellmers is using was configured.

@WeiqunZhang
Member

AMReX-Codes/amrex#2852

@maximumcats
Member

@joehellmers if you could rebuild with the above PR (or latest AMReX development if it's merged before you try it) that will hopefully work around the issue in your case.
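In case it helps, one way to pick up that change (assuming the bundled AMReX is the git checkout under external/amrex, as the error path suggests, and that its origin points at AMReX-Codes/amrex) is to check out either the PR branch or the latest development branch there and rebuild:

    # hypothetical commands, run from the Castro top directory
    cd external/amrex
    git fetch origin pull/2852/head:pr-2852   # or: git checkout development && git pull
    git checkout pr-2852
    cd ../..
    make clean
    make -j 4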

@joehellmers
Contributor Author

I was able to build after making the identical change to the Make.unknown file.
Thanks!
