
Omega run fails with >1 node when using OMEGA_MPI_ON_DEVICE=ON on Frontier GPUs #196

Closed
mark-petersen opened this issue Jan 29, 2025 · 5 comments
Labels: bug

mark-petersen commented Jan 29, 2025

I get numerous cxil_map: write error messages and then a seg fault when building with -DOMEGA_MPI_ON_DEVICE=ON and running on more than one node of Frontier GPUs, with today's head of omega develop (cc4eb05) and COMPILER=crayclanggpu.

All cases work for -DOMEGA_MPI_ON_DEVICE=OFF. My run command is

srun -N 2 -n 8 --ntasks-per-gpu=1 --gpu-bind=closest -c 1 ./omega.exe

With -N 1 it always works. I tried a variety of values for -n and --ntasks-per-gpu and it didn't seem to matter.

mark-petersen added the bug label Jan 29, 2025

mark-petersen (Author) commented

Note: I also tested changing

--- a/components/omega/src/base/IO.cpp
+++ b/components/omega/src/base/IO.cpp
@@ -198,7 +198,7 @@ int init(const MPI_Comm &InComm // [in] MPI communicator to use
    // extern int SysID;

    FileFmt DefaultFileFmt = FileFmtFromString("netcdf4c");
-   int NumIOTasks         = 1;
+   int NumIOTasks         = 56;

since this setting caused problems in the past. It did not affect the problem above.

mark-petersen (Author) commented Jan 29, 2025

Here is my exact test sequence. I am using polaris to set up the test case.

# choose one of:
COMPILER=gnu        # CPU
COMPILER=crayclang  # CPU
export COMPILER=gnugpu
export COMPILER=crayclanggpu

CODEDIR=opr

export DATE=`date +"%y%m%d"`
export r=/lustre/orion/cli115/scratch/mpetersen/runs
export RUNDIR=$r/${DATE}_omega_${CODEDIR}_${COMPILER}

source /ccs/home/mpetersen/repos/polaris/main/load_dev_polaris_0.5.0-alpha.2_frontier_${COMPILER}_mpich.sh
export PARMETIS_ROOT=/ccs/proj/cli115/software/polaris/frontier/spack/dev_polaris_0_5_0_${COMPILER}_mpich/var/spack/environments/dev_polaris_0_5_0_${COMPILER}_mpich/.spack-env/view

rm -rf $RUNDIR 
mkdir -p ${RUNDIR}/build
cd $RUNDIR/build

module load cmake
cmake \
   -DOMEGA_CIME_COMPILER=${COMPILER} \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_MACHINE=frontier \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -DOMEGA_MPI_ON_DEVICE=ON \
   -Wno-dev \
   -S /ccs/home/mpetersen/repos/E3SM/${CODEDIR}/components/omega \
   -B .
./omega_build.sh

polaris --list
  59: ocean/planar/manufactured_solution/convergence_space/default
  60: ocean/planar/manufactured_solution/convergence_time/default
  61: ocean/planar/manufactured_solution/convergence_both/default
  62: ocean/planar/manufactured_solution/convergence_both/del2
  63: ocean/planar/manufactured_solution/convergence_both/del4

polaris setup -p $RUNDIR/build  --model=omega -w $RUNDIR -n 61 

# choose one of:
salloc -A cli115 -J inter -t 40:00 -q debug -N 1 -S 0  # CPU
salloc -A cli115 -J inter -t 1:00:00 -q debug -N 4 -p batch  #GPU


source /ccs/home/mpetersen/repos/polaris/main/load_dev_polaris_0.5.0-alpha.2_frontier_${COMPILER}_mpich.sh
cd $RUNDIR
polaris serial # runs the full suite

# test individually:
cd ocean/planar/manufactured_solution/default/forward/100km_150s/
srun -N 2 -n 8 --ntasks-per-gpu=1 --gpu-bind=closest -c 1 ./omega.exe

mark-petersen (Author) commented Jan 29, 2025

This exact error is discussed on a user page here:
"cxil_map: write error" when doing inter-node GPU-aware MPI communication

They are also using Cray MPICH. For their specific application, they advised a workaround of not using inter-node GPU-aware MPI, but they say it is no longer needed, so this might be a bug in a specific version of Cray MPICH.

brian-oneill commented

GPU-aware MPI with MPICH requires MPICH_GPU_SUPPORT_ENABLED=1 in the environment during execution. In our Omega build, we append this to the environment script since it is not in the CIME machine configs for Frontier. The full suite for the manufactured solution completes successfully if export MPICH_GPU_SUPPORT_ENABLED=1 is added before running.
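
A minimal sketch of that workaround applied to the run command from this issue (same allocation, build, and test case as in the sequence above):

# enable GPU-aware MPI support in Cray MPICH at run time
export MPICH_GPU_SUPPORT_ENABLED=1
# same launch command as in the original report
srun -N 2 -n 8 --ntasks-per-gpu=1 --gpu-bind=closest -c 1 ./omega.exe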

mark-petersen (Author) commented

Thank you @brian-oneill. I ran the same Polaris test, added export MPICH_GPU_SUPPORT_ENABLED=1 by hand, and it passes. I am requesting this change in E3SM-Project/polaris#275.
