
Runtime issues on anvil #1184
Closed
jonbob opened this issue Dec 15, 2016 · 22 comments

Comments

@jonbob
Contributor

jonbob commented Dec 15, 2016

I have been testing a new mpas grid on anvil and experienced non-replicable runtime errors. I had initially thought they could be due to the new grid, but now am concerned that there are issues with anvil.

My tests have used both the intel and gnu compilers. I'll outline my tests and results below:

A_WCYCL1850S.ne30_oECv2_ICG_gnu (new mpas grid)

  • dies at 06-02-3_10 (replicable)
  • original error reported by mpas-cice state checker
  • however, source of error tracked to CAM with warnings in CLUBB:
    xm_wpxp band solver: singular matrix
    wp2_wp3 band solver: singular matrix
  • after help from @wlin7, I turned on CAM state checking (see the namelist sketch after this list) and tracked the source of this error to:
    ERROR: shr_assert_in_domain: state%t has invalid value NaN at location: 14 1
    Expected value to be a number.
    ERROR: NaN produced in physics_state by package micro_mg.
  • I could never find anything in the cpl history files that showed NaNs being passed to CAM, though there were some suspiciously large positive evaporative fluxes from lnd.
  • Ran through in DEBUG mode
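
(Side note on the state-checking step above: a minimal sketch of how it can be enabled from the case directory, assuming this CAM version exposes the state_debug_checks namelist flag; these checks are what produce the shr_assert_in_domain message quoted above.)

    # in the case directory: turn on CAM's physics_state validity checks
    # (state_debug_checks is assumed to be available in this CAM version)
    echo "state_debug_checks = .true." >> user_nl_cam
    # regenerate namelists / rebuild before the next submission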

A_WCYCL2000S.ne30_oECv2_ICG_gnu (new mpas grid)

A_WCYCL2000S.ne30_oECv2_ICG_intel (new mpas grid)

  • non-replicable failures, though the first three were all due to NaNs caught by the mpas-o state checker:
    05-08-01_01
    05-06-27_19
    05-04-01_01
  • after the failures following month boundaries, I wondered if there were issues in the mpas-o analysis member output, so I turned all analysis member output off (see the namelist sketch after this list)
  • with analysis member output off, the following run successfully completed two years
  • however, that run would not restart from year 07 and failed during initialization in CAM:
    *** halting in modal_aero_lw after nerr_dopaer = 1000
    after many warnings about "Aerosol optical depth is unreasonably high in this layer."
  • restarted successfully from year 06
  • ran to year 10 and resubmitted
  • died writing the CAM restart file at the end of year 11:
    cesm.exe 0000000002D6ED83 nf_mod_mp_inq_var 921 nf_mod.F90
    cesm.exe 0000000002E7613D piodarray_mp_writ 499 piodarray.F90.in
    cesm.exe 0000000002E76059 piodarray_mp_writ 223 piodarray.F90.in
    cesm.exe 0000000002E75AC6 piodarray_mp_writ 293 piodarray.F90.in
    cesm.exe 000000000065AC3A restart_physics_m 402 restart_physics.F90
    cesm.exe 000000000051E14F cam_restart_mp_ca 244 cam_restart.F90
    cesm.exe 00000000004D90B9 cam_comp_mp_cam_r 394 cam_comp.F90
    cesm.exe 00000000004C9317 atm_comp_mct_mp_a 509 atm_comp_mct.F90
    cesm.exe 000000000042D834 component_mod_mp_ 1049 component_mod.F90
    cesm.exe 0000000000417AD7 cesm_comp_mod_mp_ 3266 cesm_comp_mod.F90
    cesm.exe 000000000042B083 MAIN__ 107 cesm_driver.F90
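
(For reference, the analysis member output mentioned above is toggled through the MPAS-O namelist; a sketch, assuming the era's user_nl_mpas-o mechanism and the usual config_AM_<member>_enable flags -- the member names below are illustrative, not the full list. Some of the output is also driven by the streams file, which may need matching edits.)

    ! in user_nl_mpas-o: disable analysis member computation/output
    ! (flag names are version-dependent; these are examples only)
    config_AM_globalStats_enable = .false.
    config_AM_surfaceAreaWeightedAverages_enable = .false.
    config_AM_waterMassCensus_enable = .false.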

A_WCYCL2000S.ne30_oEC_ICG_gnu (old mpas grid)

  • decided to see if I could run the water cycle experiment that @golaz pushed out past 100 years
  • failed at 01-10-13_00 in CAM with warnings about
    xm_wpxp band solver: singular matrix
    wp2_wp3 band solver: singular matrix
    and
    ab matrix
    1 0.0000000 0.0000000 NaN NaN 0.0000000
    2 0.0000000 NaN NaN -0.0675287 0.0000000
    3 0.0000000 -0.1158149 1.1139185 -0.0176432 0.0000000
    4 0.0000000 -0.0463899 1.0277619 -0.0025146 0.0000000
    5 0.0000000 -0.0101187 1.0064288 0.0000000 -0.0368544
    6 0.0000000 0.0000000 NaN 0.0000000 0.0000000
    7 -0.0039143 -33971.2521576 NaN -1.1737137 0.0000000

A_WCYCL2000S.ne30_oEC_ICG_intel (old mpas grid)

  • died at 01-05-19_10 with system error?:
    rank 865: MPI error (MPI_File_sync) : Other I/O error , error stack:
    ADIOI_GEN_FLUSH(26): Other I/O error Host is down

A_WCYCL2000S.ne30_oEC_ICG_gnu_blues (old mpas grid)

  • died at 01-12-27_16 with system IO error?:
    [mpiexec@b5] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
    [mpiexec@b5] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:169): unable to write data to proxy
    [mpiexec@b5] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
    [mpiexec@b5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
    [mpiexec@b5] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
    [mpiexec@b5] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

A_WCYCL2000S.ne30_oECv2_ICG_intel-openmpi (new mpas grid)

  • failing during ocn initialization with MPI errors (see the locked-memory check after this section):

A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: b562
Local device: mlx5_0
Queue pair type: Reliable connected (RC)

[b562:46696] *** An error occurred in MPI_Isend
[b562:46696] *** reported by process [47788120604673,47785806136274]
[b562:46696] *** on communicator MPI COMMUNICATOR 33 DUP FROM 0
[b562:46696] *** MPI_ERR_OTHER: known error not in list
[b562:46696] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b562:46696] *** and potentially your MPI job)

  • This issue is being tracked as LCRC INC0037350
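
(The Open MPI FAQ linked in the error text points at registered/locked memory limits as a common cause of queue-pair allocation failures; a quick first diagnostic is to check the locked-memory ulimit on a compute node from inside a batch or interactive job, and the result is worth attaching to the LCRC ticket either way.)

    # on an anvil compute node, inside the job environment:
    ulimit -l    # max locked memory; "unlimited" (or a very large value) is expected on InfiniBand systems
    ulimit -a    # full resource limits, useful context for LCRC INC0037350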
@jayeshkrishna
Contributor

This looks like a CAM issue.

@jayeshkrishna
Contributor

jayeshkrishna commented Dec 15, 2016

Pinging @wlin7 / @worleyph

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

Looks like a whole bunch of issues!

@jayeshkrishna
Contributor

All of the issues above involve NaNs + CAM. @jonbob: can you also include information on the version of the code that you are running for these experiments?

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

These tests were all run with v1.0.0-beta-3-g6fb9bf7 -- which should be a branch off the beta0 tag to include some new grids
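
(For anyone matching that identifier: it has the git describe form tag-commits-g<sha>, so it should correspond to the v1.0.0-beta tag plus 3 commits, ending at commit 6fb9bf7.)

    git describe --tags
    # -> v1.0.0-beta-3-g6fb9bf7   (tag v1.0.0-beta, 3 commits ahead, at commit 6fb9bf7)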

@jayeshkrishna
Contributor

Is the branch on the remote ACME-Climate/ACME (if so, what is the name of the branch)?

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

it's a branch on ACME-Climate/ACME -- mark-petersen/mpas/new_grids_EC60to30v2_RRS30to10v2

and thanks

@rljacob
Member

rljacob commented Dec 15, 2016

@jonbob can you try these on blues itself? (-mach blues). Same gnu version but the intel compiler is older. Turnaround will be slower.

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

Sure - does it make any sense to try gnu then? Or just intel?

@rljacob
Member

rljacob commented Dec 15, 2016

Try intel to start.

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

OK, thanks

@rljacob
Member

rljacob commented Dec 15, 2016

Since you're using a branch off of beta0, you could do "-mach blues -compiler gcc-5.2"; anvil has 5.3.
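
(For concreteness, a sketch of a blues case created with those flags, using the beta0-era create_newcase options; the case name, grid alias, and project are illustrative and simply mirror the cases listed above.)

    # from the scripts directory of the checkout (names/paths illustrative)
    ./create_newcase -case A_WCYCL2000S.ne30_oEC_ICG_gnu_blues \
        -compset A_WCYCL2000S -res ne30_oEC_ICG \
        -mach blues -compiler gcc-5.2 \
        -project <lcrc-project>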

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

It looks like blues has both gcc-5.2 and gcc-5.3. My first test with gcc-5.2 won't even build gptl, but it's working with 5.3. Is there a reason to push on 5.2?

@rljacob
Member

rljacob commented Dec 16, 2016

No. That's it for 5.2.

@jayeshkrishna
Contributor

FYI: In issue #1181 we found that there are issues with the mvapich installation on Anvil (in that issue the problems were with intel+mvapich) and had no issues with openmpi.
So it might be worthwhile trying out your case with openmpi to see if you hit the same issue.

@jonbob
Contributor Author

jonbob commented Dec 19, 2016

Thanks @jayeshkrishna. I just looked and it has been pretty well hard-wired to mvapich for all runs. I'm not using any threading -- can you tell if the problem has been limited to threaded runs? I'll try openmpi regardless.
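
(A sketch of switching the MPI library, assuming the MPILIB machinery of this era is wired up for anvil; a clean build is needed after changing it.)

    # at case creation time:
    ./create_newcase ... -mach anvil -compiler intel -mpilib openmpi
    # or for an existing case, before cleaning and rebuilding:
    ./xmlchange -file env_build.xml -id MPILIB -val openmpi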

@jayeshkrishna
Contributor

Yes, the issues identified in #1181 are all related to threaded runs (single-threaded runs completed successfully in our tests without any crashes). However, we are not yet sure why mvapich would fail for threaded runs, so we cannot say for certain whether your problem is related.

@jayeshkrishna
Contributor

jayeshkrishna commented Dec 20, 2016

Were you able to get these cases running on blues?
(Update: Ok, just saw the output from A_WCYCL2000S.ne30_oEC_ICG_gnu_blues at the top.)

@jayeshkrishna
Contributor

@jonbob: It would be really useful to try your cases on another machine (other than Anvil/blues, with the same PE layout). The issues that you see with these new configs (compset+grid) may or may not be related to the issues with mvapich on Anvil (issue #1181).

@jonbob
Contributor Author

jonbob commented Dec 21, 2016

@jayeshkrishna - exactly -- I started one of the runs up on our local machines yesterday. Great minds and all...

@jayeshkrishna
Contributor

@jonbob: Were you able to debug this issue further (running on a different machine, etc.)?

@jonbob
Contributor Author

jonbob commented Mar 3, 2017

@jayeshkrishna - I haven't had these issues on anvil for some time, and never saw them on any other platform. So I think it's been fixed in the anvil environment and this issue can be closed.

jonbob closed this as completed Mar 3, 2017
agsalin pushed a commit that referenced this issue Apr 13, 2017
Permits each component to have its own config_archive.xml file in cime_config. This implementation permits this to be done incrementally.
Also adds the dart component to all cesm cases; since it's external and not a component, this is the safest way to ensure that it's available.

Test suite: scripts_regression_tests.py; several tests of create_newcase with comparison of env_archive.xml to previous versions; ERR.f09_g16.B1850 using cesm2_0_alpha06d.
A, B, and F compsets were created and the env_archive.xml file was checked for correctness.
Several rounds of introducing errors to env_run.xml, env_case.xml, env_build.xml, and env_archive.xml were conducted to verify that the schema check was working.

Test status: bit for bit

Code review: mvertens