HPC3 (UCI): Fix ADIOS2 HDF5 Build #4836
Conversation
Disable building examples and tests for ADIOS2 for speed. Do not build the HDF5 bindings of ADIOS2 due to an incompatibility in this version.
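For reference, a minimal sketch of what the corresponding ADIOS2 configure step could look like, assuming ADIOS2's standard CMake options (`BUILD_TESTING`, `ADIOS2_BUILD_EXAMPLES`, `ADIOS2_USE_HDF5`); the source, build, and install paths are placeholders:

```bash
# Sketch: configure ADIOS2 without examples, tests, or HDF5 bindings.
# Paths below are placeholders; adapt them to the machine's install script.
cmake -S $HOME/src/adios2 -B $HOME/src/adios2-build  \
      -DBUILD_TESTING=OFF                            \
      -DADIOS2_BUILD_EXAMPLES=OFF                    \
      -DADIOS2_USE_HDF5=OFF                          \
      -DCMAKE_INSTALL_PREFIX=$HOME/sw/adios2
cmake --build $HOME/src/adios2-build --target install -j 8
```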
@erny123 @floresv299 @Aquios7 @jinze-liu please let me know if anything else needs an update in the HPC3 (UCI) documentation. I do not personally have access to this machine and rely on your updates, so that you can share a well-working solution with each other through our docs. Thank you! :)
@ax3l My cluster consists of 9 NVIDIA DGX A100 high-performance computing servers. Each server is equipped with dual AMD EPYC 7742 (Rome) 64-core/128-thread processors, 1 TB of DDR4 memory, 8 NVIDIA Tesla A100 40 GB SXM4 accelerator cards, 8 single-port 200 Gb HDR high-speed network interfaces, 1 dual-port 100 Gb EDR high-speed network interface, and 19 TB of all-SSD storage. In total, the platform has 1152 CPU cores, 72 GPUs, theoretical FP32 and FP64 compute exceeding 1404 TFLOPS and 702 TFLOPS, respectively, and a total storage capacity of over 170 TB. You recommended against building the HDF5 bindings of ADIOS2; however, I did not follow that advice. I modified my script based on the HPC3 (UCI) example, and my script is:
This did not result in any errors during compilation. Afterwards, I ran the full dependency-installation script from the HPC3 documentation and also installed the Python module. However, when I tried to run the Ohm Solver: Magnetic Reconnection example, I encountered issues such as insufficient memory. My job submission script is:
The error file is:
I set -DADIOS2_USE_HDF5=OFF and recompiled WarpX, setting up build_py using the instructions on the readthedocs.
But I get an error on installation:
I deleted build_py and recompiled it again, but I am getting the same error code.
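For context, a minimal sketch of the clean Python rebuild described above, assuming the documented WarpX CMake workflow (the `WarpX_COMPUTE`/`WarpX_PYTHON` options and the `pip_install` target come from the WarpX docs); the build directory name and `-j` value are placeholders:

```bash
# Sketch: remove the old Python build tree and rebuild/install the WarpX Python bindings.
cd $HOME/src/warpx
rm -rf build_py
cmake -S . -B build_py -DWarpX_COMPUTE=CUDA -DWarpX_PYTHON=ON
cmake --build build_py -j 8 --target pip_install
```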
@Aquios7 Thank you very much for testing the HPC3 updates!
Luckily, this error only means that we are using too many resources during compilation. Reduce the build parallelism (a lower -j value, or even less) to fix it.
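A sketch of what a reduced-parallelism rebuild could look like; the build directory name follows the docs pattern referenced above, and the `-j` value is the knob to lower:

```bash
# Sketch: rebuild with fewer parallel jobs so the compiler does not exhaust the node's memory.
cmake --build build_py -j 2 --target pip_install
```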
Your error mostly shows me a segfault without a backtrace file, etc. What I would start with: Note that WarpX uses 1 MPI rank per GPU. So for your job script above, where you use 1 node, this should read:
Do not oversubscribe; we do not support that. If this still segfaults, then please repeat with a single MPI rank and also post the backtrace files. Please comment on your original discussion with further updates and I will respond there: #4845
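As an illustration, a hedged sketch of the relevant job-script lines for one node with 8 GPUs and one MPI rank per GPU; the Slurm option names are standard, but partition/account settings and the input script name are placeholders:

```bash
#!/bin/bash
# Sketch: 1 node, 8 GPUs, 1 MPI rank per GPU -- do not oversubscribe.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8

# Placeholder input: replace with the actual PICMI script or WarpX inputs file.
srun python3 PICMI_inputs.py
```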
Thanks for the reply! I'll be trying this out today and come back with any more issues that pop up.
Running the install with the parallelism reduced to 2 is working; I am no longer getting the nvcc error.
Do you think it is a problem with the boost module, or have I skipped something in the install?