Lassen: Work-Around MPI Allgatherv
The implementation of AllGatherv in IBM's MPI optimization library
"libcollectives" is broken and leads to HDF5 crashes for multi-node
runs.

IBM Spectrum MPI, Version 10 Release 1, User's Guide:
  https://www.ibm.com/docs/en/SSZTET_EOS/eos/guide_101.pdf
ax3l committed Feb 18, 2022
1 parent bc6991e commit fdb887c
Showing 2 changed files with 21 additions and 0 deletions.
17 changes: 17 additions & 0 deletions Docs/source/install/hpc/lassen.rst
@@ -88,3 +88,20 @@ regime), the following set of parameters provided good performance:
node)

* **Two `128x128x128` grids per GPU**, or **one `128x128x256` grid per GPU**.


.. _building-lassen-issues:

Known System Issues
-------------------

.. warning::

   Feb 17th, 2022 (INC0211698):
   The implementation of AllGatherv in IBM's MPI optimization library "libcollectives" is broken and leads to HDF5 crashes for multi-node runs.

   Our batch script templates above `apply this work-around <https://github.com/ECP-WarpX/WarpX/pull/...>`__ *before* the call to ``jsrun``, which avoids the broken routines from IBM and trades them for an OpenMPI implementation of collectives:

   .. code-block:: bash

      export OMPI_MCA_coll_ibm_skip_allgatherv=true
4 changes: 4 additions & 0 deletions Tools/machines/lassen-llnl/lassen.bsub
@@ -22,5 +22,9 @@
# https://github.com/open-mpi/ompi/issues/7795
export OMPI_MCA_io=ompio

# Work-around for broken IBM "libcollectives" MPI_Allgatherv
# ...
export OMPI_MCA_coll_ibm_skip_allgatherv=true

export OMP_NUM_THREADS=1
jsrun -r 4 -a 1 -g 1 -c 7 -l GPU-CPU -d packed -b rs -M "-gpu" <path/to/executable> <input file> > output.txt
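
The work-around only helps if the variable is in the environment at the moment ``jsrun`` is launched. The following is a minimal sketch of a guard that a custom batch script could run just before ``jsrun``. It is not part of this commit. The function name ``check_allgatherv_workaround`` is hypothetical, and the sketch assumes the job inherits the environment of the submitting shell, as the template above relies on.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the commit): succeeds only if the
# work-around variable is exported to child processes with the value "true".
# printenv reads the exported environment, so a plain, unexported shell
# variable of the same name would not pass this check.
check_allgatherv_workaround() {
    [ "$(printenv OMPI_MCA_coll_ibm_skip_allgatherv)" = "true" ]
}

# Same setting as in the batch script template.
export OMPI_MCA_coll_ibm_skip_allgatherv=true

if check_allgatherv_workaround; then
    echo "MPI_Allgatherv work-around is active"
else
    echo "MPI_Allgatherv work-around is missing" >&2
fi
```

A script adapted from the template could call the helper and abort instead of printing a message, so that a multi-node HDF5 run does not start without the setting.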
