
smilei v5.0 problems with compiling on GPU on new HPC #674

Closed
spadova-a opened this issue Nov 27, 2023 · 58 comments

@spadova-a

Hello there,

I would like to use Smilei with GPU on the Karolina cluster at IT4I in Ostrava and I am not sure how to compile it. So, I asked the administrator to help me with it, but he encountered the following problem - GPU compilation for A100 fails with:

src/Diagnostic/DiagnosticScalar.cpp(802): error: expected a ";"
                                        maxval = fieldval;
                                        ^
src/Diagnostic/DiagnosticScalar.cpp(803): error #547: nonstandard form for taking the address of a member function
                              ATOMIC(write)
                                     ^
src/Diagnostic/DiagnosticScalar.cpp(804): error: expected a ";"
                              i_max=i;
                              ^
src/Diagnostic/DiagnosticScalar.cpp(805): error #547: nonstandard form for taking the address of a member function
                              ATOMIC(write)
                                     ^
src/Diagnostic/DiagnosticScalar.cpp(806): error: expected a ";"
                                        j_max=j;
                                        ^
src/Diagnostic/DiagnosticScalar.cpp(807): error #547: nonstandard form for taking the address of a member function
                              ATOMIC(write)
                                     ^
src/Diagnostic/DiagnosticScalar.cpp(808): error: expected a ";"
                                        k_max=k;

I will share this issue with the administrator, since I don't know any details of his procedure. Could you please help us find the problem? Note that there were no problems with the CPU compilation.

@spadova-a spadova-a added the installation compilation, installation label Nov 27, 2023
@charlesprouveur
Contributor

charlesprouveur commented Nov 27, 2023

Hello,
We will need a bit more information to help you, i.e. what make command did you use? Did you try to use a machine file? The documentation will be updated in the near future to better guide Smilei users through compilation targeting GPU acceleration. In the meantime, there has been a discussion for V100 on the Element channel, and you can find on this GitHub an issue detailing the compilation process targeting AMD GPUs, which may nonetheless serve as inspiration.

Typically for an A100 you can look at the machine file:

smilei/scripts/compile_tools/machine/jean_zay_gpu_A100  

Your make command would look something like:

make -j 12 machine="jean_zay_gpu_A100" config="gpu_nvidia noopenmp verbose"

An example of a working environment we can recommend would be:

module purge
module load anaconda-py3/2020.11
module load nvidia-compilers/23.1
module load cuda/11.2
module load openmpi/4.1.1-cuda
module load hdf5/1.12.0-mpi-cuda
# For HDF5, note that module show can give you the right path
export HDF5_ROOT_DIR=/DIRECTORY_NAME/hdf5/1.12.0/pgi-20.4-HASH/

Regarding your specific error, it looks like you did not compile with nvc++, likely because no machine file for GPU was specified in the make command.
Edit: actually this is because you did not use a machine file: the compilation flags are missing "-DSMILEI_OPENACC_MODE"

@spadova-a
Author

Hi, sorry for the late answer, I will try to summarise what we tried.

Loaded modules:

Python/3.10.8-GCCcore-12.2.0 
HDF5/1.14.0-gompi-2022b 
NVHPC/23.7
CUDA/11.7

Created new machine file containing:

SMILEICXX.DEPS = nvcc
THRUSTCXX = nvcc

ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE

LDFLAGS += -ta=tesla:cc70 -std=c++14 -Mcudalib=curand -lcudart -lcurand -lacccuda -L${EBROOTCUDA}lib64/
CXXFLAGS +=  -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1
LDFLAGS = $LDFLAGS:$LD_LIBRARY_PATH
HDF5_ROOT_DIR = ${EBROOTHDF5}

and used the make command:
make -j 12 machine="karolina_IT4I" config="gpu_nvidia noopenmp verbose"

But we still had no luck with the compilation; the error we are getting is:

src/Params/Params.h: In static member function ‘static constexpr int Params::getGPUClusterWidth(int)’:
src/Params/Params.h:421:5: error: body of ‘constexpr’ function ‘static constexpr int Params::getGPUClusterWidth(int)’ not a return-statement
  421 |     }
      |     ^

Do you have an idea what the problem could be? Wrong compilation flags, a missing or incorrect module?

@charlesprouveur
Contributor

charlesprouveur commented Dec 4, 2023

Hi,
Looking at the modules, I do see a couple of issues:

HDF5/1.14.0-gompi-2022b 

means your HDF5 module was not compiled with your NVHPC module / the NVIDIA compiler nvc++.

I can also already predict some issues with your NVHPC module, which is quite recent:

NVHPC/23.7

it will require using '-gpu=cc70', as -ta=tesla:cc70 is deprecated after nvhpc 23.5; -Mcudalib=curand should also be removed as it is deprecated. We also know that we have an issue with the newest curand library, so you will need a fix for the header file gpuRandom.h in src/Tools/ ...
As for the CUDA version, I would recommend either 11.2 or 11.8. We currently have an issue with CUDA > 12.0.

Finally, for your specific error, could you print the command make is trying to execute (which you should be seeing thanks to the "verbose" configuration) to be sure there is nothing else?

To sum things up, the quickest way for you to use Smilei on GPU would be to:

  1. use cuda <= 11.8 and nvhpc <= 23.1 (23.2 and 23.3 may work with no changes, I just have not tested this specific configuration before)
  2. compile an HDF5 module with the NVIDIA compiler

@spadova-a
Author

So, concerning the HDF5 - there is no module on the cluster compiled with the NVIDIA compiler and with the parallel option enabled; this means I have to download and compile it myself, right?

About the CUDA - there is no CUDA 11.8 nor 11.2; will 11.3, 11.4 or 11.7 do?

Concerning the error, I am not sure where I can find this. But these are the last lines, and the error occurred, in fact, multiple times:
make: *** [build/src/Diagnostic/DiagnosticFieldsAM.o] Error 1
make: *** [build/src/Diagnostic/DiagnosticFields2D.o] Error 1
make: *** [build/src/Diagnostic/DiagnosticTrack.o] Error 1
make: *** [build/src/Diagnostic/DiagnosticParticleList.o] Error 1
make: *** [build/src/Checkpoint/Checkpoint.o] Error 1
make: *** [build/src/Collisions/BinaryProcesses.o] Error 1

@beck-llr
Contributor

beck-llr commented Dec 4, 2023

The way to go is normally to ask the administrator to make it available to you. It will benefit other potential users too.

@charlesprouveur
Contributor

charlesprouveur commented Dec 4, 2023

Regarding HDF5, as beck-llr said, that should be the job of your support team/admins (they would do something like in this comment: https://forums.developer.nvidia.com/t/how-to-build-parallel-hdf5-with-nvhpc/181361/4).

For the CUDA versions you mention, it should not be a problem. You do have to watch out for the NVHPC module though: do you have anything <= 23.1?

Finally, the errors you mentioned are just make terminating because of the error you partially showed previously.
Can you show what

 make -j 1 machine="karolina_IT4I" config="gpu_nvidia noopenmp verbose"

returns you?

@spadova-a
Author

spadova-a commented Dec 5, 2023

Hi,
ok, I will ask the support team to compile HDF5.

Yes, there is NVHPC/23.1, 22.7 and 22.2.

Still not sure what exactly you want me to show... this is everything written to the terminal:

Compiling src/Checkpoint/Checkpoint.cpp
mpicxx -Wno-reorder -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D__VERSION=\"5.0-13-g4f145b3-master\" -DOMPI_SKIP_MPICXX -std=c++11 -Wall -Wextra -I/apps/all/HDF5/1.14.0-gompi-2022b/include -Isrc -Isrc/Profiles -Isrc/Params -Isrc/Projector -Isrc/Checkpoint -Isrc/picsar_interface -Isrc/ElectroMagnBC -Isrc/ElectroMagn -Isrc/Tools -Isrc/Patch -Isrc/Diagnostic -Isrc/PartCompTime -Isrc/ParticleBC -Isrc/Radiation -Isrc/Merging -Isrc/Interpolator -Isrc/DomainDecomposition -Isrc/Collisions -Isrc/MultiphotonBreitWheeler -Isrc/Pusher -Isrc/MovWindow -Isrc/Field -Isrc/Particles -Isrc/SmileiMPI -Isrc/ElectroMagnSolver -Isrc/Python -Isrc/Ionization -Isrc/Species -Isrc/ParticleInjector -Ibuild/src/Python -I/apps/all/Anaconda3/2023.09-0/include/python3.11 -I/apps/all/Anaconda3/2023.09-0/include/python3.11 -I/apps/all/Anaconda3/2023.09-0/lib/python3.11/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -DSMILEI_OPENACC_MODE -DSMILEI_ACCELERATOR_MODE -c src/Checkpoint/Checkpoint.cpp -o build/src/Checkpoint/Checkpoint.o

Edit: only the relevant part of the terminal message was kept, so the post is not too long.

@charlesprouveur
Contributor

charlesprouveur commented Dec 5, 2023

For future reference, this is what I meant:

mpicxx -Wno-reorder -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -D__VERSION=\"5.0-13-g4f145b3-master\" -DOMPI_SKIP_MPICXX -std=c++11 -Wall -Wextra -I/apps/all/HDF5/1.14.0-gompi-2022b/include -Isrc -Isrc/Profiles -Isrc/Params -Isrc/Projector -Isrc/Checkpoint -Isrc/picsar_interface -Isrc/ElectroMagnBC -Isrc/ElectroMagn -Isrc/Tools -Isrc/Patch -Isrc/Diagnostic -Isrc/PartCompTime -Isrc/ParticleBC -Isrc/Radiation -Isrc/Merging -Isrc/Interpolator -Isrc/DomainDecomposition -Isrc/Collisions -Isrc/MultiphotonBreitWheeler -Isrc/Pusher -Isrc/MovWindow -Isrc/Field -Isrc/Particles -Isrc/SmileiMPI -Isrc/ElectroMagnSolver -Isrc/Python -Isrc/Ionization -Isrc/Species -Isrc/ParticleInjector -Ibuild/src/Python -I/apps/all/Anaconda3/2023.09-0/include/python3.11 -I/apps/all/Anaconda3/2023.09-0/include/python3.11 -I/apps/all/Anaconda3/2023.09-0/lib/python3.11/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -DSMILEI_OPENACC_MODE -DSMILEI_ACCELERATOR_MODE -c src/Checkpoint/Checkpoint.cpp -o build/src/Checkpoint/Checkpoint.o

Because you were using a recent module (nvhpc 23.7), you were missing flags such as -gpu=cc70,cc80, -acc, etc.
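(For context, one more detail visible in that command is -std=c++11 with a GNU-based mpicxx: the earlier "body of 'constexpr' function ... not a return-statement" error is exactly what g++ reports for a multi-statement constexpr body in C++11 mode, while the GPU machine files compile with -std=c++14. A minimal illustration, not actual Smilei code:

// Rejected by g++ -std=c++11 ("body of 'constexpr' function ... not a
// return-statement"); accepted from -std=c++14 on ("relaxed constexpr").
constexpr int getClusterWidth( int dimension )
{
    if( dimension == 2 ) {  // body with more than a single return statement
        return 4;
    }
    return -1;
}

int main() { return getClusterWidth( 2 ); }
)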

To simplify your first compilation, please use nvhpc 23.1 and cuda 11.3 as you have these, and have support compile HDF5 with the compiler that comes with it.

Finally, your machine file should look like this (I saw that the Karolina cluster uses AMD CPUs + A100s):

SMILEICXX.DEPS = nvcc
THRUSTCXX = nvcc

ACCELERATOR_GPU_FLAGS += -w
ACCELERATOR_GPU_FLAGS += -tp=zen3 -ta=tesla:cc80 -std=c++14  -lcurand

ACCELERATOR_GPU_KERNEL_FLAGS += -O3 --std c++14 $(DIRS:%=-I%)
ACCELERATOR_GPU_KERNEL_FLAGS += --expt-relaxed-constexpr
ACCELERATOR_GPU_KERNEL_FLAGS += $(shell $(PYTHONCONFIG) --includes)
ACCELERATOR_GPU_KERNEL_FLAGS += -arch=sm_80
ACCELERATOR_GPU_FLAGS        += -Minfo=accel # what is offloaded/copied 
ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE

LDFLAGS += -ta=tesla:cc80 -std=c++14  -lcudart -lcurand -lacccuda -L${EBROOTCUDA}lib64/
CXXFLAGS +=  -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1
LDFLAGS = $LDFLAGS:$LD_LIBRARY_PATH
HDF5_ROOT_DIR = ${EBROOTHDF5}

If you want to try nvhpc 23.7, it should look like this:

SMILEICXX.DEPS = nvcc
THRUSTCXX = nvcc

ACCELERATOR_GPU_FLAGS += -w
ACCELERATOR_GPU_FLAGS += -tp=zen3 -gpu=cc80 -acc  -std=c++14  -lcurand

ACCELERATOR_GPU_KERNEL_FLAGS += -O3 --std c++14 $(DIRS:%=-I%)
ACCELERATOR_GPU_KERNEL_FLAGS += --expt-relaxed-constexpr
ACCELERATOR_GPU_KERNEL_FLAGS += $(shell $(PYTHONCONFIG) --includes)
ACCELERATOR_GPU_KERNEL_FLAGS += -arch=sm_80
ACCELERATOR_GPU_FLAGS        += -Minfo=accel # what is offloaded/copied 
ACCELERATOR_GPU_FLAGS += -DSMILEI_OPENACC_MODE
ACCELERATOR_GPU_KERNEL_FLAGS += -DSMILEI_OPENACC_MODE

LDFLAGS += -gpu=cc80 -std=c++14 -acc -cuda  -lcudart -lcurand -lacccuda -L${EBROOTCUDA}lib64/
CXXFLAGS +=  -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1
LDFLAGS = $LDFLAGS:$LD_LIBRARY_PATH
HDF5_ROOT_DIR = ${EBROOTHDF5}

@spadova-a
Author

Ok, thank you. I will let you know once the proper HDF5 module is ready and I try the compilation again.

@spadova-a
Author

Hi,
so I finally got the right HDF5 module available. Nonetheless, I still wasn't successful with the compilation. The latest error is this:

Linking smilei . . . -L/apps/all/HDF5/1.14.0-nvompi-2022.07/lib DFLAGS:D_LIBRARY_PATH -lhdf5 -L/apps/all/Python/3.10.4-GCCcore-11.3.0/lib -lpython3.10 -lcrypt -ldl -lm -lpthread -lutil -lm -lm -Xlinker -export-dynamic
/apps/all/binutils/2.38-GCCcore-11.3.0/bin/ld: cannot find DFLAGS:D_LIBRARY_PATH: No such file or directory
make: *** [smilei] Error 2

I guess the problem is that it is looking for DFLAGS:D_LIBRARY_PATH instead of LDFLAGS:LD_LIBRARY_PATH. However, I don't know why or how to fix it. Any ideas?

@mccoys
Contributor

mccoys commented Jan 17, 2024

Something is very wrong in your setup. Can you show the result of make env?

@iltommi
Contributor

iltommi commented Jan 17, 2024

Also, a make config=verbose can help.

@spadova-a
Author

make env:

VERSION : 5.0-57-gc23dd35-master
SMILEICXX : mpicxx
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR :
FFTW3_LIB_DIR :
SITEDIR : /home/spadoalz/.local/lib/python3.10/site-packages
PYTHONEXE : python
PY_CXXFLAGS : -I/apps/all/Python/3.10.4-GCCcore-11.3.0/include/python3.10 -I/apps/all/Python/3.10.4-GCCcore-11.3.0/include/python3.10 -I/apps/all/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -L/apps/all/Python/3.10.4-GCCcore-11.3.0/lib -lpython3.10 -lcrypt -ldl -lm -lpthread -lutil -lm -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"5.0-57-gc23dd35-master\" -DOMPI_SKIP_MPICXX -std=c++14 -Isrc -Isrc/Profiles -Isrc/Params -Isrc/Projector -Isrc/Checkpoint -Isrc/picsar_interface -Isrc/ElectroMagnBC -Isrc/ElectroMagn -Isrc/Tools -Isrc/Patch -Isrc/Diagnostic -Isrc/PartCompTime -Isrc/ParticleBC -Isrc/Radiation -Isrc/Merging -Isrc/Interpolator -Isrc/DomainDecomposition -Isrc/Collisions -Isrc/MultiphotonBreitWheeler -Isrc/Pusher -Isrc/MovWindow -Isrc/Field -Isrc/Particles -Isrc/SmileiMPI -Isrc/ElectroMagnSolver -Isrc/Python -Isrc/Ionization -Isrc/Species -Isrc/ParticleInjector -Ibuild/src/Python -I/apps/all/Python/3.10.4-GCCcore-11.3.0/include/python3.10 -I/apps/all/Python/3.10.4-GCCcore-11.3.0/include/python3.10 -I/apps/all/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g -fopenmp -D_OMP
LDFLAGS : -lhdf5 -L/apps/all/Python/3.10.4-GCCcore-11.3.0/lib -lpython3.10 -lcrypt -ldl -lm -lpthread -lutil -lm -lm -Xlinker -export-dynamic -lm -fopenmp -D_OMP
COMPILER_INFO : pgc++

and I used the machine file that was recommended a few comments above.

@charlesprouveur
Contributor

charlesprouveur commented Jan 17, 2024

remove the line:

LDFLAGS = $LDFLAGS:$LD_LIBRARY_PATH

make clean, and try again.
To add some details: LDFLAGS is supposed to contain only the flags added at the linking step, but your script redefined it in an attempt to append the LD_LIBRARY_PATH environment variable to it. Make does not expand $LDFLAGS the way a shell would: it reads the single-character variable $L (empty) followed by the literal text "DFLAGS", and likewise for $LD_LIBRARY_PATH, which is what produced the "DFLAGS:D_LIBRARY_PATH" in your error message.

I should have removed this line from your script when I adapted it.
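A minimal standalone sketch of the expansion pitfall (GNU make; not part of the Smilei makefile):

# make parses $LDFLAGS as $(L)DFLAGS: the single-character variable L
# (typically undefined, hence empty) followed by the literal text "DFLAGS".
# $LD_LIBRARY_PATH degrades to "D_LIBRARY_PATH" the same way.
LDFLAGS = $LDFLAGS:$LD_LIBRARY_PATH
$(info LDFLAGS is now "$(LDFLAGS)")   # prints: LDFLAGS is now "DFLAGS:D_LIBRARY_PATH"

all: ;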

@spadova-a
Author

spadova-a commented Jan 18, 2024

Hello,
so I was able to compile Smilei, but a test run failed:

error.txt

Do you think there is a problem with some of the modules I used to compile the code? These are the modules I used:

  1) GCCcore/11.3.0
  2) zlib/1.2.12-GCCcore-11.3.0
  3) binutils/2.38-GCCcore-11.3.0
  4) numactl/2.0.14-GCCcore-11.3.0
  5) CUDA/11.7.0
  6) NVHPC/22.7-CUDA-11.7.0
  7) XZ/5.2.5-GCCcore-11.3.0
  8) libxml2/2.9.13-GCCcore-11.3.0
  9) libpciaccess/0.16-GCCcore-11.3.0
 10) hwloc/2.7.1-GCCcore-11.3.0
 12) libevent/2.1.12-GCCcore-11.3.0
 13) UCX/1.12.1-GCCcore-11.3.0
 14) GDRCopy/2.3-GCCcore-11.3.0
 15) UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0
 16) libfabric/1.15.1-GCCcore-11.3.0
 17) PMIx/4.1.2-GCCcore-11.3.0
 18) UCC/1.0.0-GCCcore-11.3.0
 19) NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
 20) UCC-CUDA/1.0.0-GCCcore-11.3.0-CUDA-11.7.0
 21) OpenMPI/4.1.4-NVHPC-22.7-CUDA-11.7.0
 23) Szip/2.1.1-GCCcore-11.3.0
 24) HDF5/1.14.0-nvompi-2022.07
 25) bzip2/1.0.8-GCCcore-11.3.0
 26) ncurses/6.3-GCCcore-11.3.0
 27) libreadline/8.1.2-GCCcore-11.3.0
 28) Tcl/8.6.12-GCCcore-11.3.0
 29) SQLite/3.38.3-GCCcore-11.3.0
 30) GMP/6.2.1-GCCcore-11.3.0
 31) libffi/3.4.2-GCCcore-11.3.0
 32) Python/3.10.4-GCCcore-11.3.0

@charlesprouveur
Contributor

charlesprouveur commented Jan 18, 2024

There should be nothing wrong with your modules. We are now encountering a completely different class of problems: runtime issues. From your message (please try to format it if you can; EDIT: thanks for the formatting), it crashes while computing a scalar diag.

First, what test case are you trying to run? What diags are in the namelist?
Please post the output file as well; we are missing a lot of info.

EDIT: are you using the latest version of Smilei? Post-November we added some fixes.

@spadova-a
Author

I tried to run two of the basic tutorials: thermal plasma (the error in my previous comment comes from this one) and laser propagation in vacuum (this one failed at the Fields diagnostics).
I git cloned the new version yesterday, so it should be the latest one.
Now, this is the output file:
smilei.out.txt

@charlesprouveur
Contributor

charlesprouveur commented Jan 18, 2024

In the smilei.out.txt you just provided, the reason for the failure is clear: you do not have the numpy package in the python module that is loaded. Make sure you have the packages required, as in the doc:

sphinx, h5py, numpy, matplotlib, pint (you can also add scipy).
You can do that with pip install sphinx h5py numpy matplotlib pint ffmpeg if your cluster allows it, or ask your support (they may have an anaconda package with everything already).
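As a quick sanity check that the python module you load actually provides the packages (a minimal sketch; run it with the same modules loaded as at runtime):

python -c "import numpy, h5py; print(numpy.__version__, h5py.__version__)"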

For the other tutorial that failed (thermal plasma, I think), please provide the exact input and output files. You may want to do that after you have installed the python packages and run it again.

@spadova-a
Author

yeah sorry, I loaded the wrong module. Sending the current error file
smilei.out.txt

@beck-llr
Contributor

Does it still occur with non-frozen species? It could be that the time-frozen option is not supported on GPU.

@charlesprouveur
Contributor

> yeah sorry, I loaded the wrong module. Sending the current error file smilei.out.txt

I'd like to look at your input file as well to check.
Also, the machine file you used was for execution on an A100; can you confirm this is the hardware you are trying to run Smilei on?
Finally, what does your slurm script look like?

@spadova-a
Author

input file: input.txt
yes, the cluster has NVIDIA A100 (link to the website: https://docs.it4i.cz/karolina/compute-nodes/)
slurm script: srun.txt (I also tried to run it as an interactive job allocating one gpu node)

@charlesprouveur
Contributor

charlesprouveur commented Jan 18, 2024

You are running a test case in 1D, which is not currently supported on GPU :) (it might be soon-ish; check the list of currently supported features here).
Edit: an additional comment: trying such a small test case on 8 GPUs might be an issue (here you would have 4 points plus the ghost cells for each patch, with one patch per GPU); in theory it should be ok, but...

@spadova-a
Author

spadova-a commented Jan 18, 2024

ok, that was a pretty silly mistake... I tried another case (input file: input.txt, https://github.com/SmileiPIC/Smilei/files/13980089/input.txt)
but it is still not working (output: out.txt)

@charlesprouveur
Contributor

charlesprouveur commented Jan 18, 2024

So we are back to the CUDA device error.

In your slurm script I don't see you loading the environment you used at compile time. Typically mine looks like this:

#!/bin/bash
#SBATCH --job-name=smilei            # Job name
#SBATCH -A account
#SBATCH --partition=YOUR_GPU_PARTITION_NAME            # Partition to use
##SBATCH --qos=YOUR_QUEUE
#SBATCH --ntasks=8                   # total Number of MPI processes (= total number of GPU)
#SBATCH --ntasks-per-node=8    # number of MPI rank per node 
#SBATCH --gres=gpu:8                 # GPU number per node
#SBATCH --cpus-per-task=6           
#SBATCH --hint=nomultithread         
#SBATCH --time=00:10:00             
#SBATCH --output=output        # Name of the output file
#SBATCH --error=error         # Name of the error file

# Smilei specific env
source smilei_gpu_env_23.1.sh

set -x

# execution with binding via bind_gpu.sh : 1 GPU per MPI.
srun /gpfslocalsup/pub/idrtools/bind_gpu.sh  ./smilei input.py

while bind_gpu.sh looks like this (it might not be required here though):

#!/bin/bash

LOCAL_RANK=${MPI_LOCALRANKID} # mpirun Intel MPI
if [ -z "${LOCAL_RANK}" ]; then LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK}; fi # mpirun OpenMPI
if [ -z "${LOCAL_RANK}" ]; then LOCAL_RANK=${SLURM_LOCALID}; fi  # srun 

export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}

"$@"

Try again with sourcing the compilation environment in your slurm script; you might just be missing that.
If that does not work:
Doing a bit of googling (https://forums.developer.nvidia.com/t/cudalaunchkernel-returned-status-98-invalid-device-function/169958), this seems to confirm my suspicion that something could have gone wrong with the machine file. Can you do make clean and recompile + execute with the new binary, just to be sure?
If that does not work, share the machine file you are currently using, and also look in nvcc -h at what --gpu-architecture shows you (as in, what sm options are available).
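For reference, recent CUDA toolkits can also list the supported targets directly, which is quicker than digging through nvcc -h (assuming your nvcc is new enough to have these flags):

nvcc --list-gpu-arch    # supported compute_XX (virtual) architectures
nvcc --list-gpu-code    # supported sm_XX (real) architectures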

@spadova-a
Author

Hi, sorry, I have never used an environment for the compilation before; I just loaded the modules and did the same thing in the submission script. Therefore, I don't really know what an environment should look like; I did some googling but it did not help me much... Could you please provide me with an example or some guidelines?

@charlesprouveur
Contributor

In your slurm script I can only see:

ml purge
ml load HDF5/1.14.0-nvompi-2022.07
ml Python/3.10.4-GCCcore-11.3.0

ergo, unless the running environment includes nvhpc, cuda & openmpi by default, I don't see how your executable can access its dependencies.

Can you add "module list" in your slurm script and run it so we can see what is available at runtime?
What i call an environment is simply the module & environment variables available to your executable. Usually one uses a script to load the appropriate modules at runtime (or lists the 'module load' commands in the slurm script)
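For example, a minimal environment script for your case could look like this (a sketch built from the module names you listed earlier; the script name is arbitrary, so adjust to what your cluster provides):

#!/bin/bash
# smilei_env.sh -- source this both before compiling and in the slurm script
ml purge
ml HDF5/1.14.0-nvompi-2022.07     # pulls in NVHPC, CUDA and OpenMPI as dependencies
ml Python/3.10.4-GCCcore-11.3.0
export HDF5_ROOT_DIR=${EBROOTHDF5}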

Also, in your latest output I see one MPI process and 8 patches. Are you trying to run on 1 or 8 GPUs?

@spadova-a
Author

Here is the output file with the module list (the HDF5 module loads a lot of other modules as its dependencies): out.txt

I am trying to run on 8 GPUs, as I am only able to allocate a full node, which has 8 GPUs. I also had 8 MPI processes in the slurm script and got the same error, with every process printing the same error message in the output file, so for testing purposes I set only 1 MPI process so the output file wouldn't be so long.

@mccoys
Contributor

mccoys commented Jan 25, 2024

You should load NVHPC when you compile Smilei

@charlesprouveur
Contributor

That seems to be the case, although the fact that there is another cuda module loaded is not great.

@spadova-a Can you do make clean and recompile + execute with the new binary, just to be sure?
If that does not work, share the machine file you are currently using, and also look in nvcc -h at what --gpu-architecture shows you (as in, what sm options are available).

@spadova-a
Author

Hi, sorry for the inactivity, right now I have a lot of work to do. I will give the installation a new try soon.

@Horymir001

Dear colleagues,
I have observed this discussion for some time. Karolina underwent an upgrade recently and there are new modules available now, so I tried to compile Smilei with GPU acceleration as well. However, I have not succeeded so far.

I tried the compilation on an accelerated node with 8 A100 GPUs.

(base) [it4i-vojtech@login2.karolina Smilei]$ salloc -A DD-23-157 -p qgpu_exp -N 1 --ntasks-per-node 16 --gpus 8 -t 00:40:00
salloc: Granted job allocation 1032620
salloc: Waiting for resource configuration
salloc: Nodes acn17 are ready for job
(base) [it4i-vojtech@acn17.karolina Smilei]$ 

There, I tried to load the proper modules, and there is a good candidate indeed.


(base) [it4i-vojtech@acn17.karolina Smilei]$ module spider HDF5

...
     Versions:
        HDF5/1.12.1-gompi-2021b
        HDF5/1.12.2-gompi-2022a
        HDF5/1.12.2-iimpi-2022a
        HDF5/1.14.0-gompi-2023a
        HDF5/1.14.0-iimpi-2022b-serial
        HDF5/1.14.0-iimpi-2022b
        HDF5/1.14.3-gompi-2023b
        HDF5/1.14.3-iimpi-2023b
        HDF5/1.14.3-NVHPC-24.1-CUDA-12.4.0
        HDF5/1.14.3-NVHPC-24.3-CUDA-12.3.0

...

Let's try the last one then.

(base) [it4i-vojtech@acn17.karolina Smilei]$ ml HDF5/1.14.3-NVHPC-24.3-CUDA-12.3.0

These are all the loaded modules:

(base) [it4i-vojtech@acn17.karolina Smilei]$ module list

Currently Loaded Modules:
  1) GCCcore/12.2.0
  2) zlib/1.2.12-GCCcore-12.2.0
  3) binutils/2.39-GCCcore-12.2.0
  4) numactl/2.0.16-GCCcore-12.2.0
  5) CUDA/12.3.0
  6) NVHPC/24.3-CUDA-12.3.0
  7) XZ/5.2.7-GCCcore-12.2.0
  8) libxml2/2.10.3-GCCcore-12.2.0
  9) libpciaccess/0.17-GCCcore-12.2.0
 10) hwloc/2.8.0-GCCcore-12.2.0
 11) libevent/2.1.12-GCCcore-12.2.0
 12) UCX/1.16.0-GCCcore-12.2.0
 13) GDRCopy/2.4.1-GCCcore-12.2.0
 14) UCX-CUDA/1.16.0-GCCcore-12.2.0-CUDA-12.3.0
 15) libfabric/1.16.1-GCCcore-12.2.0
 16) PMIx/4.2.2-GCCcore-12.2.0
 17) UCC/1.3.0-GCCcore-12.2.0
 18) NCCL/2.21.5-GCCcore-12.2.0-CUDA-12.3.0
 19) UCC-CUDA/1.3.0-GCCcore-12.2.0-CUDA-12.3.0
 20) OpenMPI/4.1.6-NVHPC-24.3-CUDA-12.3.0
 21) Szip/2.1.1-GCCcore-12.2.0
 22) HDF5/1.14.3-NVHPC-24.3-CUDA-12.3.0

We have OpenMPI,

 (base) [it4i-vojtech@acn17.karolina Smilei]$ which mpicc
/apps/all/OpenMPI/4.1.6-NVHPC-24.3-CUDA-12.3.0/bin/mpicc
(base) [it4i-vojtech@login2.karolina Smilei]$ ls /apps/all/OpenMPI/4.1.6-NVHPC-24.3-CUDA-12.3.0/bin/
aggregate_profile.pl  mpif90        ortecc       oshcc           shmemc++
mpic++                mpifort       orte-clean   oshCC           shmemcc
mpicc                 mpirun        orted        oshcxx          shmemCC
mpiCC                 ompi-clean    orte-info    oshfort         shmemcxx
mpicxx                ompi_info     orterun      oshmem_info     shmemfort
mpiexec               ompi-server   orte-server  oshrun          shmemrun
mpif77                opal_wrapper  oshc++       profile2mat.pl

and proper python

 (base) [it4i-vojtech@acn17.karolina Smilei]$ python
Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>>

I created a primitive machine file "karolina" containing just two lines:

SMILEICXX_DEPS = g++
CXXFLAGS += -gpu=cc80 -acc

Then I tried

(base) [it4i-vojtech@acn17.karolina Smilei]$ make machine="karolina" config="gpu_nvidia" 
Compiling src/Checkpoint/Checkpoint.cpp
"src/Tools/H5.h", line 11: catastrophic error: #error directive: "HDF5 was not built with --enable-parallel option"
  #error "HDF5 was not built with --enable-parallel option"
   ^

1 catastrophic error detected in the compilation of "src/Checkpoint/Checkpoint.cpp".
Compilation terminated.
make: *** [makefile:369: build/src/Checkpoint/Checkpoint.o] Error 2

It seems to me that despite the name of the module HDF5/1.14.3-NVHPC-24.3-CUDA-12.3.0, HDF5 was not built properly. Do you think that is possible?

I might eventually try to compile HDF5 myself according to your instructions as well.

@charlesprouveur
Contributor

charlesprouveur commented May 8, 2024

Hi,
It is very likely HDF5 was not properly built.

Preface: no test has been done with the latest nvhpc versions (i.e. 24.0 and above), but it "should" work.

Here is an example of how I do it on my machine with nvhpc 23.11 that you can use as a reference:
( note that in your case "/.../YOUR_DIRECTORY/modulefiles/nvhpc/23.11" should be replaced with NVHPC/24.3-CUDA-12.3.0 )

cd YOUR_DIRECTORY
mkdir tools
cd tools
 
wget https://github.com/HDFGroup/hdf5/releases/download/hdf5-1_14_2/hdf5-1_14_2.tar.gz
 
tar xzfv hdf5-1_14_2.tar.gz
cd hdfsrc/
mkdir build
cd build
module load /.../YOUR_DIRECTORY/modulefiles/nvhpc/23.11 cmake
cmake -DCMAKE_C_COMPILER=`which mpicc` -DCMAKE_INSTALL_PREFIX=/gpfswork/rech/YOUR_DIRECTORY/tools/hdfsrc/install -DHDF5_ENABLE_PARALLEL=ON ..
make
make install

It seems the person who installed your HDF5 module did not include the "-DHDF5_ENABLE_PARALLEL=ON" option in their install script.
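(If you want to double-check an existing install, one option is the libhdf5.settings file that HDF5 ships in its lib directory; the path below is just an example:)

grep -i parallel /apps/all/HDF5/1.14.3-NVHPC-24.3-CUDA-12.3.0/lib/libhdf5.settings
# a parallel build reports a line like:  Parallel HDF5: yes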

Once your hdf5 install is finished you should

export HDF5_ROOT_DIR=YOUR_DIRECTORY/tools/hdfsrc/install
export LD_LIBRARY_PATH=YOUR_DIRECTORY/tools/hdfsrc/install/lib/:$LD_LIBRARY_PATH

at compile time and runtime.
At compile time you might need to change your machine file:

SMILEICXX_DEPS = g++ -I/YOUR_DIRECTORY/tools/hdfsrc/install/include/

GPU_COMPILER = nvcc -I/YOUR_DIRECTORY/tools/hdfsrc/install/include/ 

@Horymir001

Hi Charles,
thanks for your advice.
I compiled HDF5 in the following way:

salloc -A DD-23-157 -p qgpu_exp -N 1 --ntasks-per-node 16 --gpus 8 -t 00:40:00
ml OpenMPI/4.1.6-NVHPC-23.11-CUDA-12.2.0 # SAME NVHPC YOU RECOMMEND
ml CMake/3.24.3-GCCcore-12.2.0

cd
mkdir myHDF5
cd myHDF5
mkdir tools
cd tools
 wget https://github.com/HDFGroup/hdf5/releases/download/hdf5-1_14_2/hdf5-1_14_2.tar.gz
 tar xzfv hdf5-1_14_2.tar.gz
cd hdfsrc/
mkdir build
cd build
cmake -DCMAKE_C_COMPILER=`which mpicc` -DCMAKE_INSTALL_PREFIX=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install/ -DHDF5_ENABLE_PARALLEL=ON ..
make -j 50
make install
export HDF5_ROOT_DIR=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install
export LD_LIBRARY_PATH=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install/lib/:$LD_LIBRARY_PATH

This installation was successful.

Then I prepared this machine file karolina.

SMILEICXX_DEPS = g++ -I//home/it4i-vojtech/myHDF5/tools/hdfsrc/install/include/
GPU_COMPILER = nvcc -I//home/it4i-vojtech/myHDF5/tools/hdfsrc/install/include/ 
CXXFLAGS += -gpu=cc80 -acc

Then I attempted to compile Smilei

make clean
make -j 50 machine="karolina" config="gpu_nvidia" > output.log 2> error.log

error.log
output.log

Typical errors are:

src/Projector/Projector2D2OrderGPUKernelCUDAHIP.cu(1186): error: calling a constexpr __device__ function("Params::getGPUClusterWidth(int)") from a __host__ function("currentDepositionKernel2D") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
          do { if( !( Params::getGPUClusterWidth( 2 ) != -1 && Params::getGPUClusterGhostCellBorderWidth( 2 ) != -1 ) ) { {{{std::string line = " "; for (int __ic =0; __ic < 80 ; __ic++) line += "-"; std::cerr << "\033[1;31m" << line << "\n [" << "ERROR" << "] " << "src/Projector/Projector2D2OrderGPUKernelCUDAHIP.cu" << ":" << 1186 << " (" << __FUNCTION__ << ") " << "Params::getGPUClusterWidth( 2 ) != -1 && Params::getGPUClusterGhostCellBorderWidth( 2 ) != -1" << "\n" << line << "\033[0m" << std::endl;}; raise(

and

src/Particles/nvidiaParticles.cu(697): error: calling a constexpr __device__ function("_ZN6Params18getGPUClusterWidthE1?") from a __host__ function("computeParticleClusterKey") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
                                       Cluster3D<Params::getGPUClusterWidth( 3 )>{ parameters.res_space[0],
                                                 ^

"/apps/all/CUDA/12.2.0/include/crt/host_defines.h", line 86: warning: incompatible redefinition of macro "__forceinline__" (declared at line 39 of "/apps/all/CUDA/12.2.0/include/cuda/std/detail/__config") [bad_macro_redef]
  #define __forceinline__ \
          ^

8 errors detected in the compilation of "src/Particles/nvidiaParticles.cu".
make: *** [makefile:374: build/src/Particles/nvidiaParticles.o] Error 2
NVC++-W-1053-External and Static

I think I need to specify the flags better; however, I do not know how.

@mccoys
Contributor

mccoys commented May 9, 2024

Try to add --expt-relaxed-constexpr in the variable GPU_COMPILER_FLAGS

@Horymir001

Done. Different errors popped up:
error.log
output.log

They are of this kind

src/Projector/Projector3D2OrderGPUKernelCUDAHIP.cu(85): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (double *, double)
                      ::atomicAdd( a_pointer, a_value );
src/Profiles/Function.h(655): warning #611-D: overloaded virtual function "Function::valueAt" is only partially overridden in class "Function_Polygonal2D"
  class Function_Polygonal2D : public Function
        ^

make: *** [makefile:374: build/src/Projector/Projector3D2OrderGPUKernelCUDAHIP.o] Error 1

@charlesprouveur
Contributor

Assuming you did a "make clean" before compiling again, I am thinking you do not have the -arch option specified in your machine file for GPU_COMPILER_FLAGS.
As an example in my machine file:

(...)
CXXFLAGS += -w
CXXFLAGS += -acc=gpu -gpu=cc86,fastmath -std=c++14  -lcurand # do not put -cuda here

GPU_COMPILER_FLAGS += -O2 --std c++14 $(DIRS:%=-I%) 

GPU_COMPILER_FLAGS += --expt-relaxed-constexpr
GPU_COMPILER_FLAGS += $(shell $(PYTHONCONFIG) --includes)
GPU_COMPILER_FLAGS += -arch=sm_86 #native #--generate-code arch=compute_86,code=sm_86  
CXXFLAGS        += -Minfo=accel # what is offloaded/copied

LDFLAGS += -acc=gpu -gpu=cc86  -cudalib=curand  # ccnative also works
CXXFLAGS += -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1  -std=c++14 

@Horymir001

Horymir001 commented May 9, 2024

Thank you both. We can move forward, as the compilation was successful. It failed at runtime though.

  1. Compilation of HDF5 as in my post 4 hours ago.
  2. Machine file karolina
SMILEICXX_DEPS = g++ -I//home/it4i-vojtech/myHDF5/tools/hdfsrc/install/include/
GPU_COMPILER = nvcc -I//home/it4i-vojtech/myHDF5/tools/hdfsrc/install/include/ --expt-relaxed-constexpr
CXXFLAGS += -gpu=cc80 -acc
CXXFLAGS += -w
CXXFLAGS += -acc=gpu -gpu=cc80,fastmath -std=c++14  -lcurand # do not put -cuda here
GPU_COMPILER_FLAGS += -O2 --std c++14 $(DIRS:%=-I%) 
GPU_COMPILER_FLAGS += --expt-relaxed-constexpr
GPU_COMPILER_FLAGS += $(shell $(PYTHONCONFIG) --includes)
GPU_COMPILER_FLAGS += -arch=sm_80 #native #--generate-code arch=compute_80,code=sm_80  
CXXFLAGS        += -Minfo=accel # what is offloaded/copied
LDFLAGS += -acc=gpu -gpu=cc80  -cudalib=curand  # ccnative also works
CXXFLAGS += -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1  -std=c++14 
  3. Compilation of Smilei GPU
export HDF5_ROOT_DIR=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install
export LD_LIBRARY_PATH=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install/lib/:$LD_LIBRARY_PATH
make clean
make -j 50 machine="karolina" config="gpu_nvidia"
  4. Launching attempt

I took a slightly modified example file for 2D LWFA with GPU computing on.
tst2d_04_laser_wake.py.txt

salloc -A DD-23-157 -p qgpu_exp -N 1 --ntasks-per-node 16 --gpus 8 -t 00:40:00

ml OpenMPI/4.1.6-NVHPC-23.11-CUDA-12.2.0 # SAME NVHPC YOU RECOMMEND
ml CMake/3.24.3-GCCcore-12.2.0
export HDF5_ROOT_DIR=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install
export LD_LIBRARY_PATH=/home/it4i-vojtech/myHDF5/tools/hdfsrc/install/lib/:$LD_LIBRARY_PATH
srun /home/it4i-vojtech/Smilei/smilei tst2d_04_laser_wake.py > output_smilei.log 2> error_smilei.log

It runs for half a minute, writes some outputs, and then fails. I watched the nvidia-smi output over time; it ran on up to three GPUs (of 8). Here are the outputs:

error_smilei.log
output_smilei.log

I think it is only a question of proper submission now. Accelerated nodes at Karolina have 128 cores and 8 x NVIDIA A100, i.e. 16 cores per GPU. For some reason, 16 processes run with the run command shown above.

@mccoys
Contributor

mccoys commented May 9, 2024

You probably want -arch=sm_80 instead of -arch=sm_86, as suggested by the error.

@Horymir001

> You probably want -arch=sm_80 instead of -arch=sm_86, as suggested by the error.

I edited the previous reply.

@mccoys
Contributor

mccoys commented May 9, 2024

It could be a memory issue. Try with fewer particles?

@Horymir001

I tried now even with one particle per cell. Still the same error.

@charlesprouveur
Contributor

As far as I can see, the input file contains not-yet-supported features, such as the filter and the load balancing, for instance.

@Horymir001

Oh, I did not think about it! Could you please recommend some safe input for a test?

@charlesprouveur
Contributor

charlesprouveur commented May 9, 2024

Here is a namelist that I used to benchmark an A100 (note that this is in 3D with no moving window; also, we use one patch as that is best for GPUs, and for multiple GPUs you have to increase the number of patches proportionally):


import math as m
import numpy as np
import os

c = 299792458
lambdar = 1e-6                  # reference wavelength
wr = 2*m.pi*c/lambdar

temperature   = 100./511.                               # electron & ion temperature in me c^2

density  = 0.01

# plasma wavelength
lambdap = 2*m.pi/density

# Debye length in units of c/\omega_{pe}
Lde = m.sqrt(temperature)

dx = 0.5*Lde
dy = dx
dz = dx

dt  = 0.5 * dx /m.sqrt(3.)              # timestep (0.95 x CFL)

Lx = 128*dx
Ly = 128*dy
Lz = 128*dz

# Simulation time
simulation_time  = 100*dt

particles_per_cell = 8

number_of_patches = [1,1,1]

position_initialization = 'random'

gpu_computing = True
vectorization = "off"

Main(
    geometry = "3Dcartesian",

    interpolation_order = 2,

    timestep = dt,
    simulation_time = simulation_time,

    cell_length  = [dx,dy,dz],
    grid_length = [Lx,Ly,Lz],

    number_of_patches = number_of_patches,

    EM_boundary_conditions = [ ["periodic"] ],

    print_every = 100,

    gpu_computing = gpu_computing,

    random_seed = smilei_mpi_rank,
)

Vectorization(
   mode=vectorization,
)

Species(
    name = "proton",
    position_initialization = position_initialization,
    momentum_initialization = "mj",
    particles_per_cell = particles_per_cell,
    c_part_max = 1.0,
    mass = 1836.0,
    charge = 1.0,
    charge_density = density,
    mean_velocity = [0., 0.0, 0.0],
    temperature = [temperature],
    pusher = "boris",
    boundary_conditions = [
        ["periodic", "periodic"],
        ["periodic", "periodic"],
        ["periodic", "periodic"],
    ],
)
Species(
    name = "electron",
    position_initialization = "proton",
    momentum_initialization = "mj",
    particles_per_cell = particles_per_cell,
    c_part_max = 1.0,
    mass = 1.0,
    charge = -1.0,
    charge_density = density,
    mean_velocity = [0., 0.0, 0.0],
    temperature = [temperature],
    pusher = "boris",
    boundary_conditions = [
        ["periodic", "periodic"],
        ["periodic", "periodic"],
        ["periodic", "periodic"],
    ],
)

DiagScalar(every = 10)

fields = ["Ex", "Ey", "Ez", "Jx","Jy","Jz","Rho"]

diag_species_list = ["Jx","Jy","Jz","Rho"]
species_list = ["electron", "proton"]

for diag in diag_species_list:
    for species in species_list:
        fields.append(diag + "_" + species)

DiagFields(
    #name = "my field diag",
    every = 50,
    fields = fields,
    #subgrid = None
)

DiagParticleBinning(
    deposited_quantity = "weight",
    every = 50,
    time_average = 1,
    species = ["electron"],
    axes = [
        ["x", 0., Lx, 128],
        ["y", 0., Ly, 128],
        ["z", 0., Lz, 128]
    ]
)

DiagParticleBinning(
    deposited_quantity = "weight",
    every = 50,
    time_average = 1,
    species = ["proton"],
    axes = [
        ["x", 0., Lx, 128],
        ["y", 0., Ly, 128],
        ["z", 0., Lz, 128]
    ]
)

DiagParticleBinning(
    deposited_quantity = "weight_ekin",
    every = 50,
    time_average = 1,
    species = ["electron"],
    axes = [
        ["x", 0., Lx, 128],
        ["y", 0., Ly, 128]
    ]
)

DiagProbe(
    #name = "my_probe",
    every    = 50,
    origin   = [0., 0., 0.5*Lz],
    corners  = [
        [Lx,0.,0.5*Lz],
        [0.,Ly,0.5*Lz],
    ],
    number   = [32, 32],
    fields   = fields,
)

@Horymir001

Great, this one runs to the end!
I increased the number of timesteps and observed the output of nvidia-smi. It seems it uses only one GPU out of the 8 available. Do you have an idea how to improve this?

(base) [it4i-vojtech@acn17.karolina ~]$ nvidia-smi 
Thu May  9 16:15:43 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   46C    P0            182W /  400W |    7347MiB /  40960MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off |   00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0             65W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off |   00000000:48:00.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   31C    P0             67W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          Off |   00000000:88:00.0 Off |                    0 |
| N/A   28C    P0             62W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   31C    P0             64W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          Off |   00000000:C8:00.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          Off |   00000000:CB:00.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |     425MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei             7338MiB |
|    1   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A     66146      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
+-----------------------------------------------------------------------------------------+
(base) [it4i-vojtech@acn17.karolina ~]$ 

@mccoys
Contributor

mccoys commented May 9, 2024

You must define a binding between processes and GPUs, typically using a binding file, or using the proper options for your queue manager (such as slurm); see the sketch below.
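A sketch of two common options (exact slurm flag availability depends on your site's configuration; the wrapper approach uses the bind_gpu.sh script shown earlier in this thread):

# let slurm hand each MPI rank its own GPU (recent slurm versions):
srun --ntasks=8 --gpus-per-task=1 ./smilei input.py
# or wrap the executable so each local rank sets CUDA_VISIBLE_DEVICES itself:
srun --ntasks=8 ./bind_gpu.sh ./smilei input.py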

@Horymir001

Well, thank you. I am not sure I am capable of figuring it out myself. I guess I should try to discuss it with the cluster user support.

@charlesprouveur
Contributor

charlesprouveur commented May 9, 2024

The fact that it ran on one GPU is what we asked for in the input file, since there was only one patch.

As for the binding script, it may not be necessary in your case; simply change


Lx = 128*dx
Ly = 128*dy
Lz = 128*dz

# Simulation time
simulation_time  = 100*dt

particles_per_cell = 8

number_of_patches = [1,1,1]

to


Lx = 256*dx
Ly = 256*dy
Lz = 256*dz

# Simulation time
simulation_time  = 100*dt

particles_per_cell = 8

number_of_patches = [2,2,2]

(increasing the size of the problem and the number of patches to have an equivalent load on each GPU)

and in your slurm command you would specify something like:

#SBATCH --ntasks=8                   # Number of MPI processes (= total number of GPU)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8          #  MPI tasks  per node (= number of GPU per node)
#SBATCH --gres=gpu:8                 # number of GPU per node
#SBATCH --cpus-per-task=4           # number of  CPU core per task

See if that crashes / works

@Horymir001

I could not do #SBATCH --cpus-per-task=4. But otherwise, it seems fine to me so far!

I will try to do some more testing tomorrow! Thanks.

(base) [it4i-vojtech@acn33.karolina ~]$ nvidia-smi
Thu May  9 16:46:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   35C    P0             81W /  400W |   11067MiB /  40960MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off |   00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0             68W /  400W |   11067MiB /  40960MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off |   00000000:48:00.0 Off |                    0 |
| N/A   33C    P0            142W /  400W |   11067MiB /  40960MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   36C    P0             70W /  400W |   11067MiB /  40960MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          Off |   00000000:88:00.0 Off |                    0 |
| N/A   34C    P0             67W /  400W |   10757MiB /  40960MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   36C    P0             66W /  400W |   11067MiB /  40960MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          Off |   00000000:C8:00.0 Off |                    0 |
| N/A   35C    P0             93W /  400W |   11067MiB /  40960MiB |     63%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          Off |   00000000:CB:00.0 Off |                    0 |
| N/A   36C    P0            160W /  400W |   11067MiB /  40960MiB |     89%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    0   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    0   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    1   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    1   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    2   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    2   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    3   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    3   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei             7794MiB |
|    4   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    4   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    5   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    5   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    6   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
|    6   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879260      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879261      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879262      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879263      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879264      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879265      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879266      C   /home/it4i-vojtech/Smilei/smilei              416MiB |
|    7   N/A  N/A    879267      C   /home/it4i-vojtech/Smilei/smilei             8104MiB |
+-----------------------------------------------------------------------------------------+
(base) [it4i-vojtech@acn33.karolina ~]$ 

@charlesprouveur
Contributor

Glad we could help :)
Considering that the original issue is solved, I think we can close this one unless @spadova-a has further questions.

@mccoys
Contributor

mccoys commented May 9, 2024

Side note: we really need to explain, in the GPU documentation, that there should be 1 process per GPU, and that it is best to have about 1 patch per GPU.

@charlesprouveur
Contributor

For one patch per GPU: it is here (hidden in Parallelization & optimization).
For the "one MPI rank per GPU" part, it should indeed be added there.

@mccoys
Contributor

mccoys commented May 9, 2024

Ok, I think we really need one page dedicated to GPU, with links to other places if necessary.

@charlesprouveur
Contributor

Agreed

@mccoys mccoys closed this as completed May 9, 2024
@Horymir001

Hi, could you include the machine file in the code? Here are my suggestions; the comments include the installation description.
karolina.txt

@charlesprouveur
Contributor

We might add it in /scripts/compile_tools/machine/ with the other machine scripts, likely under "karolina_gpu".
