Unit tests failing on NOAA Acorn #3048

Open
AlexanderRichert-NOAA opened this issue Sep 21, 2024 · 6 comments

@AlexanderRichert-NOAA

I'm trying to run MAPL unit tests via Spack installation on Acorn (WCOSS2 TDS). I'm happy to provide whatever details are helpful; for now I'll upload the CTest log. It's failing on tests 12 and 24, for both the 1g and 2g cases. I've tried it with 2.46.2, 2.47.2, and head of develop (e600653).
mapl_acorn_LastTest.log

@tclune
Collaborator

tclune commented Sep 23, 2024

@AlexanderRichert-NOAA Please let us know which compiler, MPI stack, and version of ESMF you are using.

A quick investigation shows that the code is failing at an ALLOCATE statement. This is probably a zero-sized allocation, since case12 intentionally uses a coarse grid that results in 0 DEs on some PETs. case12 often fails because the test environment does not support 216 PETs, but the error would be different in that situation, and it would not explain the issue with case24.
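To illustrate how a coarse grid can leave some PETs with nothing to allocate, here is a minimal sketch of a 1-D block decomposition. It is not MAPL's or ESMF's actual decomposition logic. The function name `block_counts` and the sizes (4 cells, 6 PETs) are made up for illustration; case12's real grid dimensions are not stated in this thread.

```python
def block_counts(n_cells: int, n_pets: int) -> list[int]:
    """Split n_cells into n_pets contiguous blocks.

    Leftover cells go to the lowest ranks, one each. This is a generic
    block distribution, assumed here for illustration only.
    """
    base, extra = divmod(n_cells, n_pets)
    return [base + (1 if rank < extra else 0) for rank in range(n_pets)]


# Hypothetical sizes: a grid dimension with fewer cells than PETs.
counts = block_counts(n_cells=4, n_pets=6)

# PETs whose local block is empty; any local array sized from this
# count would be a zero-sized allocation on those ranks.
empty_pets = [rank for rank, n in enumerate(counts) if n == 0]

print(counts)
print(empty_pets)
```

Whenever a dimension has fewer cells than PETs, the trailing ranks end up with a local size of 0, which is the situation described above.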

@tclune added the 🪲 Bugfix label, and added and then removed the ❗ High Priority label, on Sep 23, 2024
@mathomp4
Member

@AlexanderRichert-NOAA Yeah, as @tclune says, case12 is one we don't regularly run because it uses 216 processes.

As for case24, I have seen it have issues with ifx as the Fortran compiler (see #2880 and #2881), but from your log, I'm not sure you are running ifx.

Now, I do see you might be building with Intel 19:

/lfs/h1/emc/nceplibs/noscrub/alexander.richert/spack-stack-1.8.0/envs/mapl-unit-tests/install/intel/19.1.3.304/cmake-3.27.9-efdkmum/bin/cmake

If so, I'm not sure we have built MAPL with that compiler in a looooong time. I am honestly impressed that more tests didn't fail if that was the ifort version.
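The compiler guess above comes from reading the install path. As a minimal sketch, assuming the install tree is laid out as `.../install/<compiler>/<version>/<package>-...` (this matches the path above, but the layout is configurable in Spack, so treat it as an assumption), the compiler name and version can be pulled out like this. The function name `compiler_from_spack_path` is hypothetical.

```python
import re


def compiler_from_spack_path(path: str):
    """Return (compiler, version) from an install path, or None.

    Assumes the two directory levels right after 'install/' are the
    compiler name and compiler version, as in the path in this thread.
    """
    match = re.search(r"/install/([^/]+)/([^/]+)/", path)
    return match.groups() if match else None


cmake_path = (
    "/lfs/h1/emc/nceplibs/noscrub/alexander.richert/spack-stack-1.8.0/"
    "envs/mapl-unit-tests/install/intel/19.1.3.304/"
    "cmake-3.27.9-efdkmum/bin/cmake"
)

print(compiler_from_spack_path(cmake_path))
```

This only tells you which compiler the CMake package was installed under; the compiler actually used for MAPL should still be confirmed from the build configuration.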

@AlexanderRichert-NOAA
Author

@tclune I'm using Intel Classic 19.1.3.304 (with Cray wrappers), Cray MPICH 8.1.9, and ESMF 8.6.1.

If I can work through some issues of it not finding mpirun/mpiexec I can try with another compiler version...

@mathomp4
Member

mathomp4 commented Sep 24, 2024

Forgot to put this here, but for testing, running either make tests or ctest -L ESSENTIAL should be plenty. These are our "quick" tests and should avoid the big ones that need more than 6 processes (most MAPL tests use MPI).

Now case24 will still be part of this, but at least big momma case12 will be avoided 😄

ETA: Note that make tests is sort of a shorthand to run the essential tests. It's an additional target we defined for that since we weren't too familiar with ctest back in the day.
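If you drive the tests from a script rather than by hand, the essential subset can be selected the same way. This is a minimal sketch: the helper names are hypothetical, `--output-on-failure` is an optional extra not mentioned above, and the build directory is whatever you configured MAPL in.

```python
import shutil
import subprocess


def essential_ctest_command(extra_args=()):
    """Build the argv for running only tests labeled ESSENTIAL."""
    return ["ctest", "-L", "ESSENTIAL", "--output-on-failure", *extra_args]


def run_essential_tests(build_dir: str):
    """Run the essential tests in build_dir.

    Returns ctest's exit code, or None if ctest is not on PATH.
    """
    if shutil.which("ctest") is None:
        return None
    result = subprocess.run(essential_ctest_command(), cwd=build_dir, check=False)
    return result.returncode


print(" ".join(essential_ctest_command()))
```

Calling `run_essential_tests("/path/to/build")` (path is a placeholder) is then equivalent to running ctest -L ESSENTIAL in that directory.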

@mathomp4
Member

Well, I was able to build Baselibs with Intel 19 as well as build MAPL2 with it (which surprised me!)

However, although the build worked, Intel 19 was NOT happy with our unit tests:

0% tests passed, 33 tests failed out of 33

I'm not sure that compiler and the associated MPI stack like our system or network anymore. Indeed, for me it looks like case24 does run to the end (the profiler report is printed), but then it aborts in MPI_Finalize:

63:    EXTDATA: VAR3D updated R bracket with: case2.2004.nc4 at time index  12
63:  TestDriver Date: 2004/11/25  Time: 21:00:00    3.5%Memory Committed
63:  TestDriver Date: 2004/11/26  Time: 21:00:00    3.5%Memory Committed
63:   profiler: Report on process: 0
63:   profiler:                                                                Inclusive        Exclusive
63:   profiler:                                                             ================ ================
63:   profiler: Name                                               #-cycles  T (sec)    %     T (sec)    %
63:   profiler:                                                    -------- --------- ------ --------- ------
63:   profiler: All                                                       1     0.453 100.00     0.248  54.68
63:   profiler: --Root                                                   20     0.049  10.92     0.049  10.92
63:   profiler: --HIST                                                   20     0.100  22.00     0.100  22.00
63:   profiler: --EXTDATA                                                20     0.056  12.40     0.056  12.40
63: Abort(806969615) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
63: PMPI_Finalize(214)...............: MPI_Finalize failed
63: PMPI_Finalize(159)...............:
63: MPID_Finalize(1280)..............:
63: MPIDI_OFI_mpi_finalize_hook(1882): OFI domain close failed (ofi_init.c:1882:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
63: Abort(806969615) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
63: PMPI_Finalize(214)...............: MPI_Finalize failed
63: PMPI_Finalize(159)...............:
63: MPID_Finalize(1280)..............:
63: MPIDI_OFI_mpi_finalize_hook(1882): OFI domain close failed (ofi_init.c:1882:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
63: CMake Error at /discover/swdev/mathomp4/Models/MAPL2-SLES12-Intel19/Tests/ExtData_Testing_Framework/run_extdata.cmake:33 (message):
63:   Error running case24
63: Call Stack (most recent call first):
63:   /discover/swdev/mathomp4/Models/MAPL2-SLES12-Intel19/Tests/ExtData_Testing_Framework/run_extdata.cmake:36 (run_case)
63:
63:
2/2 Test #63: ExtData2G_case24 .................***Failed    0.95 sec

I looked with @bena-nasa, and we do call MPI_Finalize in these tests, so I guess this has something to do with old MPI stacks.
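For anyone triaging a long CTest log like the one above, here is a minimal sketch that pulls out the failed tests and the ranks that aborted inside an MPI routine. The regular expressions are written against the line shapes in the log above and are an assumption about the format in general; the function names are hypothetical, and the "Passed" line in the sample is made up to show that it is ignored.

```python
import re

# Matches ctest result lines flagged with '***', e.g. '***Failed'.
TEST_RESULT = re.compile(r"Test\s+#(\d+):\s+(\S+)\s+\.*\*\*\*(\w+)")
# Matches MPICH-style abort lines and captures the rank and routine.
MPI_ABORT = re.compile(
    r"Abort\(\d+\) on node \d+ \(rank (\d+) in comm \d+\): Fatal error in (\w+)"
)


def failed_tests(text: str):
    """Return (test number, test name, status) for each flagged test."""
    return [(int(num), name, status) for num, name, status in TEST_RESULT.findall(text)]


def mpi_aborts(text: str):
    """Return (rank, MPI routine) for each abort message."""
    return [(int(rank), routine) for rank, routine in MPI_ABORT.findall(text)]


sample = """\
1/2 Test #62: Hypothetical_passing_case ........   Passed    0.50 sec
63: Abort(806969615) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
63: Abort(806969615) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
2/2 Test #63: ExtData2G_case24 .................***Failed    0.95 sec
"""

print(failed_tests(sample))
print(mpi_aborts(sample))
```

If every abort is in PMPI_Finalize and the profiler report appears before it, the test body itself ran and the failure is in MPI shutdown, which is consistent with the reading above.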


This issue has been automatically marked as stale because it has not had activity in the last 60 days. If there are no updates within 7 days, it will be closed. You can add the ":hourglass: Long Term" label to prevent the stale action from closing this issue.

@github-actions bot added the ❄️ Stale label on Nov 27, 2024
@mathomp4 removed the ❄️ Stale label on Nov 27, 2024
@mathomp4 removed the 🪲 Bugfix label on Jan 17, 2025