
HPC3 (UCI): Fix ADIOS2 HDF5 Build #4836

Merged 1 commit into ECP-WarpX:development on Apr 8, 2024

Conversation

@ax3l (Member) commented Apr 8, 2024

Disable building examples and tests for ADIOS2 for speed. Do not build the HDF5 bindings of ADIOS2 due to an incompatibility in this version.

cc @erny123 @floresv299 @Aquios7 @jinze-liu
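
For reference, the change boils down to extra CMake options when building ADIOS2 in the HPC3 install script. The exact diff is not reproduced here; the sketch below uses the standard ADIOS2/CTest options, a hypothetical checkout path, and the install prefix that appears in the profile later in this thread.

# sketch of the adjusted ADIOS2 configure/build step (ADIOS2_USE_HDF5, ADIOS2_BUILD_EXAMPLES,
# and BUILD_TESTING are standard options; $HOME/src/adios2 is a hypothetical checkout path)
cmake -S $HOME/src/adios2 -B $HOME/src/adios2-build \
  -DADIOS2_USE_HDF5=OFF \
  -DADIOS2_BUILD_EXAMPLES=OFF \
  -DBUILD_TESTING=OFF \
  -DCMAKE_INSTALL_PREFIX=${HOME}/sw/hpc3/gpu/adios2-2.8.3
cmake --build $HOME/src/adios2-build --target install -j 8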

@ax3l added the labels bug (Something isn't working), install, component: third party (Changes in WarpX that reflect a change in a third-party library), and machine / system (Machine or system-specific issue) on Apr 8, 2024
@ax3l (Member, Author) commented Apr 8, 2024

@erny123 @floresv299 @Aquios7 @jinze-liu please let me know if anything else needs an update in the HPC3 (UCI) documentation. I do not personally have access to this machine and rely on your updates, so that you can share a working solution with each other through our docs. Thank you! :)

@ax3l merged commit 9f5be94 into ECP-WarpX:development on Apr 8, 2024 (42 of 45 checks passed)
@ax3l deleted the fix-hpc3-uci-adios2-no-hdf5 branch on April 8, 2024, 22:34
@jinze-liu commented

@ax3l My cluster consists of 9 NVIDIA DGX-A100 high-performance computing servers. Each server is equipped with dual AMD EPYC 7742 (Rome) 64C/128T processors, 1 TB DDR4 memory, 8 NVIDIA Tesla A100 40 GB SXM4 acceleration cards, 8 single-port 200 Gb HDR high-speed network interfaces, 1 dual-port 100 Gb EDR high-speed network interface, and 19 TB of all-SSD storage space. The platform has 1152 CPU cores and 72 GPUs in total, with theoretical FP32 and FP64 computing capabilities exceeding 1404 TFLOPS and 702 TFLOPS, respectively, and a total storage capacity of over 170 TB.

You recommended against building the HDF5 bindings of ADIOS2; however, I did not follow that advice. I modified my script based on the HPC3 (UCI) example, and my script is:

#!/bin/bash
export proj="ljz_gpu"


export MY_PROFILE=$(cd $(dirname $BASH_SOURCE) && pwd)/$(basename $BASH_SOURCE)


module load gcc/11.3.0-gcc-9.4.0
module load cmake/3.25.2-gcc-4.8.5  
module load cuda/11.8.0-gcc-4.8.5  
#module load openmpi/4.1.5-gcc-9.4.0  
module load intel-oneapi-mpi/2021.8.0-gcc-4.8.5
module load intel-oneapi-compilers/2021.4.0-gcc-4.8.5
module load intel-oneapi-mkl/2021.4.0-gcc-4.8.5
#module load nvhpc/22.11-gcc-4.8.5

module load boost/1.80.0-gcc-9.4.0 

# optional: for openPMD and PSATD+RZ support
module load openblas/0.3.21-gcc-9.4.0
#module load  hdf5/1.14.0-gcc-9.4.0  
export PATH=/ShareData1/App/abinit-dependence/hdf5-1.10.6/bin:$PATH

export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/c-blosc-1.21.1:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/blaspp-master:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=${HOME}/sw/hpc3/gpu/lapackpp-master:$CMAKE_PREFIX_PATH

export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/c-blosc-1.21.1/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/blaspp-master/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${HOME}/sw/hpc3/gpu/lapackpp-master/lib64:$LD_LIBRARY_PATH

export PATH=${HOME}/sw/hpc3/gpu/adios2-2.8.3/bin:${PATH}


module load python/3.10.6-gcc-4.8.5  


if [ -d "${HOME}/sw/hpc3/gpu/venvs/warpx-gpu" ]
then
  source ${HOME}/sw/hpc3/gpu/venvs/warpx-gpu/bin/activate
fi

# an alias to request an interactive batch node for 30 minutes
#   for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -N 1 -t 0:30:00 --gres=gpu:A100:1 -p free-gpu"
# an alias to run a command on a batch node for up to 30min
#   usage: runNode <command>
alias runNode="srun -N 1 -t 0:30:00 --gres=gpu:A100:1 -p free-gpu"


export AMREX_CUDA_ARCH=8.0

# compiler environment hints
export CXX=$(which g++)
export CC=$(which gcc)
export FC=$(which gfortran)
export CUDACXX=$(which nvcc)
export CUDAHOSTCXX=${CXX}
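
For context, configuring and building WarpX against this profile would look roughly like the sketch below; the exact commands used are not shown in this comment, the profile filename is assumed, and the flags mirror the WarpX CMake options quoted later in this thread (the source path is consistent with the EXE path in the job script below).

# hedged sketch: build the 2D CUDA WarpX executable after sourcing the profile above
source $HOME/my_warpx.profile        # placeholder name for the modified profile
cd $HOME/src/warpx                   # matches the EXE path used in the job script below
rm -rf build
cmake -S . -B build -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON -DWarpX_DIMS=2
cmake --build build -j 4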

This did not result in any errors during the compilation process. Afterwards, I ran the full HPC3 documentation script to install the dependencies and also installed the Python module. However, when I tested the Ohm Solver: Magnetic Reconnection example, I ran into issues such as insufficient memory. My job submission script is:

#!/bin/bash -l

# Copyright 2023 The WarpX Community
#
# This file is part of WarpX.
#
# Authors: Axel Huebl, Victor Flores
# License: BSD-3-Clause-LBNL

#SBATCH --time=08:00:00
#SBATCH --nodes=1
##SBATCH --nodelist=gpu010
#SBATCH -J WarpX
#S BATCH -A <proj>
#SBATCH -p gpup1
# use all four GPUs per node
##SBATCH --ntasks-per-node=8
##SBATCH --gres=gpu:A100:1
##SBATCH --cpus-per-task=10
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j
ulimit -m unlimited
ulimit -d unlimited
ulimit -s unlimited
#ulimit -p unlimited
cd /public/home/ljz_gpu/warpx_sim
# executable & inputs file or python interpreter & PICMI script here
EXE=/public/home/ljz_gpu/src/warpx/build/bin/warpx.2d
INPUTS=PICMI_inputs.py

# OpenMP threads
#export OMP_NUM_THREADS=16

# run
#srun --ntasks=4 bash -c "
#mpirun --oversubscribe -np 28  bash -c "
#    export CUDA_VISIBLE_DEVICES=\${SLURM_LOCALID};
#    ${EXE} ${INPUTS}" \
#  > output.txt
mpirun --oversubscribe -np 28  /public/home/ljz_gpu/src/warpx/build/bin/warpx.2d PICMI_inputs.py > output.txt

The error file is:
error.txt

@Aquios7 commented Apr 10, 2024

I set -DADIOS2_USE_HDF5=OFF and recompiled WarpX to set up build_py using the instructions on readthedocs.
I have tried to install the PICMI WarpX version onto HPC3 (UCI) using the base hpc3_gpu_warpx.profile without modifications (other than the project name), and I can compile the section for pywarpx:

rm -rf build_py

cmake -S . -B build_py -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON -DWarpX_QED_TABLE_GEN=ON -DWarpX_APP=OFF -DWarpX_PYTHON=ON -DWarpX_DIMS="1;2;RZ;3"
cmake --build build_py -j 8 --target pip_install

But I get an error on installation:

[  2%] Building CXX object _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/IterationEncoding.cpp.o
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
gmake[3]: *** [_deps/fetchedamrex-build/Src/CMakeFiles/amrex_3d.dir/build.make:90: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_3d.dir/Base/AMReX.cpp.o] Error 9
gmake[3]: *** Waiting for unfinished jobs....
[  3%] Building CXX object _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/Mesh.cpp.o
gmake[2]: *** [CMakeFiles/Makefile2:1988: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_3d.dir/all] Error 2
gmake[2]: *** Waiting for unfinished jobs....
[  3%] Building CXX object _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/Record.cpp.o
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
gmake[3]: *** [_deps/fetchedamrex-build/Src/CMakeFiles/amrex_1d.dir/build.make:90: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_1d.dir/Base/AMReX.cpp.o] Error 9
gmake[2]: *** [CMakeFiles/Makefile2:1936: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_1d.dir/all] Error 2
[  3%] Building CXX object _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/RecordComponent.cpp.o
[  3%] Building CUDA object _deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_Utility.cpp.o
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
gmake[3]: *** [_deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/build.make:188: _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/Mesh.cpp.o] Error 1
gmake[3]: *** Waiting for unfinished jobs....
[  3%] Building CUDA object _deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_VisMF.cpp.o
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
gmake[3]: *** [_deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/build.make:258: _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/src/RecordComponent.cpp.o] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:2742: _deps/fetchedopenpmd-build/CMakeFiles/openPMD.dir/all] Error 2
[  6%] Building CUDA object _deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_FabArrayBase.cpp.o
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
gmake[3]: *** [_deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/build.make:678: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_MultiFab.cpp.o] Error 9
gmake[3]: *** Waiting for unfinished jobs....
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
gmake[3]: *** [_deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/build.make:636: _deps/fetchedamrex-build/Src/CMakeFiles/amrex_2d.dir/Base/AMReX_FArrayBox.cpp.o] Error 9

I deleted and recompiled build_py again, but I am getting the same error code.

@ax3l (Member, Author) commented Apr 15, 2024

@Aquios7 Thank you very much for testing the HPC3 updates!

nvcc error   : 'cicc' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)

Luckily, this only means that we are using too many resources during compilation.

Reduce the parallelism -j to, e.g., 4 processes:

cmake --build build_py -j 4 --target pip_install

or even fewer, to fix this.
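
A hedged sketch of how one might cap the build parallelism by available memory and build on an interactive node (the getNode alias comes from the profile earlier in this thread; the roughly 4 GB-per-job heuristic is an assumption, not a WarpX requirement):

# request an interactive GPU node first (alias from the profile), then build with a memory-capped -j
getNode                                          # or: salloc -N 1 -t 0:30:00 --gres=gpu:A100:1 -p free-gpu
mem_gb=$(free -g | awk '/^Mem:/ {print $7}')     # "available" memory in GB
jobs=$(( mem_gb / 4 )); [ "$jobs" -lt 1 ] && jobs=1
cmake --build build_py -j ${jobs} --target pip_install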

@ax3l (Member, Author) commented Apr 15, 2024

My cluster consists of 9 NVIDIA DGX-A100 [...] I modified my script based on the HPC3 (UCI) example
@jinze-liu Oh, I see. You just used HPC3 as a template for your system.

Your error mostly shows me a segfault without a backtrace file, etc.

What I would start with: Note that WarpX uses 1 MPI rank per GPU. So for your job script above, where you use 1 node, this should read:

mpirun -np 8  /public/home/ljz_gpu/src/warpx/build/bin/warpx.2d PICMI_inputs.py > output.txt

Do not oversubscribe; we do not support that.

If this still segfaults, then please repeat with a single MPI rank and also post the backtrace files. Please comment on your original discussion with further updates and I will respond there: #4845
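
For illustration, a hedged sketch of a corrected single-node job script with one MPI rank per GPU; the partition, paths, and executable are copied from the script above, while the Slurm directives for requesting all eight GPUs are assumptions about this specific cluster:

#!/bin/bash -l
# sketch: one MPI rank per GPU on a single 8-GPU DGX-A100 node
#SBATCH --time=08:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH -p gpup1
#SBATCH -J WarpX
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j

cd /public/home/ljz_gpu/warpx_sim
EXE=/public/home/ljz_gpu/src/warpx/build/bin/warpx.2d
INPUTS=PICMI_inputs.py

# one rank per GPU, no oversubscription
mpirun -np 8 ${EXE} ${INPUTS} > output.txt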

@Aquios7 commented Apr 16, 2024

@Aquios7 Thank you very much for testing the HPC3 updates! [...] Reduce the parallelism -j to, e.g., 4 processes [...] or even fewer, to fix this.

Thanks for the reply! I'll be trying this out today and come back with any more issues that pop up.

@Aquios7 commented Apr 18, 2024

Reducing the build parallelism to -j 2 works; I am no longer getting the nvcc error.
I am now getting a failure when the build attempts to link the executable in build/bin:

[ 30%] Built target lib_1d
[ 30%] Building CUDA object CMakeFiles/app_1d.dir/Source/main.cpp.o
[ 31%] Linking CXX executable bin/warpx.1d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES
/opt/apps/gcc/11.2.0/lib/gcc/x86_64-pc-linux-gnu/11.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: libcuda.so.1, needed by /data/homezvol2/~/sw/hpc3/gpu/adios2-2.8.3/lib64/libadios2_core.so.2, not found (try using -rpath or -rpath-link)
/opt/apps/gcc/11.2.0/lib/gcc/x86_64-pc-linux-gnu/11.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: lib/libwarpx.1d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES.a(BreitWheelerEngineWrapper.cpp.o): in function `long double boost::math::detail::gamma_imp<long double, boost::math::policies::policy<boost::math::policies::promote_float<false>, boost::math::policies::promote_double<false>, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy>, boost::math::lanczos::lanczos17m64>(long double, boost::math::policies::policy<boost::math::policies::promote_float<false>, boost::math::policies::promote_double<false>, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy> const&, boost::math::lanczos::lanczos17m64 const&) [clone .isra.0]':
tmpxft_00374626_00000000-6_BreitWheelerEngineWrapper.cudafe1.cpp:(.text+0x4b36): undefined reference to `boost::assertion_failed_msg(char const*, char const*, char const*, char const*, long)'
/opt/apps/gcc/11.2.0/lib/gcc/x86_64-pc-linux-gnu/11.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: lib/libwarpx.1d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES.a(QuantumSyncEngineWrapper.cpp.o): in function `long double boost::math::detail::gamma_imp<long double, boost::math::policies::policy<boost::math::policies::promote_float<false>, boost::math::policies::promote_double<false>, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy>, boost::math::lanczos::lanczos17m64>(long double, boost::math::policies::policy<boost::math::policies::promote_float<false>, boost::math::policies::promote_double<false>, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy, boost::math::policies::default_policy> const&, boost::math::lanczos::lanczos17m64 const&) [clone .isra.0]':
tmpxft_00374645_00000000-6_QuantumSyncEngineWrapper.cudafe1.cpp:(.text+0x4e36): undefined reference to `boost::assertion_failed_msg(char const*, char const*, char const*, char const*, long)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [CMakeFiles/app_1d.dir/build.make:110: bin/warpx.1d.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.GENQEDTABLES] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1288: CMakeFiles/app_1d.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

Do you think it is a problem with the boost module, or have I skipped something in the install?

@Aquios7 commented Apr 26, 2024

I've gotten help from HPC3, and they sent me a guide on how to set up WarpX. It turns out I wasn't moving to a compute node when I logged in remotely, and I needed to set the Boost directory manually. Here is the guide I was sent via Nadya.
[Attached screenshot: "Screenshot from 2024-04-25 14-50-27" — the HPC3 WarpX setup guide]
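
The guide itself is only in the attached screenshot and is not reproduced here; as a hedged sketch, pointing the WarpX configure step at a specific Boost install on a compute node typically looks like the following (the module name and Boost prefix are placeholders, not the values from the HPC3 guide):

# sketch: make CMake find the cluster's Boost explicitly before configuring WarpX
module load boost                                 # placeholder; use the exact module name on HPC3
export BOOST_ROOT=/path/to/boost/prefix           # placeholder; see `module show boost`
cmake -S . -B build_py -DWarpX_COMPUTE=CUDA -DWarpX_PSATD=ON -DWarpX_QED_TABLE_GEN=ON \
      -DWarpX_APP=OFF -DWarpX_PYTHON=ON -DWarpX_DIMS="1;2;RZ;3" \
      -DBoost_ROOT=${BOOST_ROOT}
cmake --build build_py -j 2 --target pip_install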
