
Runtime issues on anvil #1184
Closed
jonbob opened this issue Dec 15, 2016 · 22 comments

Comments

@jonbob
Contributor

jonbob commented Dec 15, 2016

I have been testing a new mpas grid on anvil and experienced non-replicable runtime errors. I had initially thought they could be due to the new grid, but now am concerned that there are issues with anvil.

My tests have used both the intel and gnu compilers. I'll outline my tests and results below:

A_WCYCL1850S.ne30_oECv2_ICG_gnu (new mpas grid)

  • dies at 06-02-3_10 (replicable)
  • original error reported by mpas-cice state checker
  • however, source of error tracked to CAM with warnings in CLUBB:
    xm_wpxp band solver: singular matrix
    wp2_wp3 band solver: singular matrix
  • after help from @wlin7, I turned on CAM state checking (see the namelist sketch after this list) and tracked the source of this error to:
    ERROR: shr_assert_in_domain: state%t has invalid value NaN at location: 14 1
    Expected value to be a number.
    ERROR: NaN produced in physics_state by package micro_mg.
  • I could never find anything in the cpl history files that showed NaNs being passed to CAM, though there were some suspiciously large positive evaporative fluxes from lnd.
  • Ran through in DEBUG mode
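
(Side note on the state-checking step above: a minimal sketch of how it can be enabled from the case directory, assuming this CAM version exposes the state_debug_checks namelist flag; these checks are what produce the shr_assert_in_domain message quoted above.)

    # in the case directory: turn on CAM's physics_state validity checks
    # (state_debug_checks is assumed to be available in this CAM version)
    echo "state_debug_checks = .true." >> user_nl_cam
    # regenerate namelists / rebuild before the next submission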

A_WCYCL2000S.ne30_oECv2_ICG_gnu (new mpas grid)

A_WCYCL2000S.ne30_oECv2_ICG_intel (new mpas grid)

  • non-replicable failures, though the first three were all due to NaNs caught by the mpas-o state checker:
    05-08-01_01
    05-06-27_19
    05-04-01_01
  • after the failures following month boundaries, I wondered if there were issues in the mpas-o analysis member output, so I turned all analysis member output off (see the namelist sketch after this list)
  • with analysis member output off, the following run successfully completed two years
  • however, that run would not restart from year 07 and failed during initialization in CAM:
    *** halting in modal_aero_lw after nerr_dopaer = 1000
    after many warnings about "Aerosol optical depth is unreasonably high in this layer."
  • restarted successfully from year 06
  • ran to year 10 and resubmitted
  • died writing the CAM restart file at the end of year 11:
    cesm.exe 0000000002D6ED83 nf_mod_mp_inq_var 921 nf_mod.F90
    cesm.exe 0000000002E7613D piodarray_mp_writ 499 piodarray.F90.in
    cesm.exe 0000000002E76059 piodarray_mp_writ 223 piodarray.F90.in
    cesm.exe 0000000002E75AC6 piodarray_mp_writ 293 piodarray.F90.in
    cesm.exe 000000000065AC3A restart_physics_m 402 restart_physics.F90
    cesm.exe 000000000051E14F cam_restart_mp_ca 244 cam_restart.F90
    cesm.exe 00000000004D90B9 cam_comp_mp_cam_r 394 cam_comp.F90
    cesm.exe 00000000004C9317 atm_comp_mct_mp_a 509 atm_comp_mct.F90
    cesm.exe 000000000042D834 component_mod_mp_ 1049 component_mod.F90
    cesm.exe 0000000000417AD7 cesm_comp_mod_mp_ 3266 cesm_comp_mod.F90
    cesm.exe 000000000042B083 MAIN__ 107 cesm_driver.F90
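
(For reference, the analysis member output mentioned above is toggled through the MPAS-O namelist; a sketch, assuming the era's user_nl_mpas-o mechanism and the usual config_AM_<member>_enable flags -- the member names below are illustrative, not the full list. Some of the output is also driven by the streams file, which may need matching edits.)

    ! in user_nl_mpas-o: disable analysis member computation/output
    ! (flag names are version-dependent; these are examples only)
    config_AM_globalStats_enable = .false.
    config_AM_surfaceAreaWeightedAverages_enable = .false.
    config_AM_waterMassCensus_enable = .false.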

A_WCYCL2000S.ne30_oEC_ICG_gnu (old mpas grid)

  • decided to see if I could run the water cycle experiment that @golaz pushed out past 100 years
  • failed at 01-10-13_00 in CAM with warnings about
    xm_wpxp band solver: singular matrix
    wp2_wp3 band solver: singular matrix
    and
    ab matrix
    1 0.0000000 0.0000000 NaN NaN 0.0000000
    2 0.0000000 NaN NaN -0.0675287 0.0000000
    3 0.0000000 -0.1158149 1.1139185 -0.0176432 0.0000000
    4 0.0000000 -0.0463899 1.0277619 -0.0025146 0.0000000
    5 0.0000000 -0.0101187 1.0064288 0.0000000 -0.0368544
    6 0.0000000 0.0000000 NaN 0.0000000 0.0000000
    7 -0.0039143 -33971.2521576 NaN -1.1737137 0.0000000

A_WCYCL2000S.ne30_oEC_ICG_intel (old mpas grid)

  • died at 01-05-19_10 with system error?:
    rank 865: MPI error (MPI_File_sync) : Other I/O error , error stack:
    ADIOI_GEN_FLUSH(26): Other I/O error Host is down

A_WCYCL2000S.ne30_oEC_ICG_gnu_blues (old mpas grid)

  • died at 01-12-27_16 with system IO error?:
    [mpiexec@b5] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
    [mpiexec@b5] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:169): unable to write data to proxy
    [mpiexec@b5] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
    [mpiexec@b5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
    [mpiexec@b5] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
    [mpiexec@b5] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

A_WCYCL2000S.ne30_oECv2_ICG_intel-openmpi (new mpas grid)

  • failing during ocn initialization with MPI errors (see the locked-memory check after this section):

A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: b562
Local device: mlx5_0
Queue pair type: Reliable connected (RC)

[b562:46696] *** An error occurred in MPI_Isend
[b562:46696] *** reported by process [47788120604673,47785806136274]
[b562:46696] *** on communicator MPI COMMUNICATOR 33 DUP FROM 0
[b562:46696] *** MPI_ERR_OTHER: known error not in list
[b562:46696] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b562:46696] *** and potentially your MPI job)

  • This issue is being tracked as LCRC INC0037350
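
(The Open MPI FAQ linked in the error text points at registered/locked memory limits as a common cause of queue-pair allocation failures; a quick first diagnostic is to check the locked-memory ulimit on a compute node from inside a batch or interactive job, and the result is worth attaching to the LCRC ticket either way.)

    # on an anvil compute node, inside the job environment:
    ulimit -l    # max locked memory; "unlimited" (or a very large value) is expected on InfiniBand systems
    ulimit -a    # full resource limits, useful context for LCRC INC0037350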
@jayeshkrishna
Contributor

This looks like a CAM issue.

@jayeshkrishna
Contributor

jayeshkrishna commented Dec 15, 2016

Pinging @wlin7 / @worleyph

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

Looks like a whole bunch of issues!

@jayeshkrishna
Contributor

All of the issues above involve NaNs + CAM. @jonbob: can you also include information on the version of the code that you are running for these experiments?

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

These tests were all run with v1.0.0-beta-3-g6fb9bf7 -- which should be a branch off the beta0 tag to include some new grids
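
(For anyone matching that identifier: it has the git describe form tag-commits-g<sha>, so it should correspond to the v1.0.0-beta tag plus 3 commits, ending at commit 6fb9bf7.)

    git describe --tags
    # -> v1.0.0-beta-3-g6fb9bf7   (tag v1.0.0-beta, 3 commits ahead, at commit 6fb9bf7)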

@jayeshkrishna
Contributor

Is the branch on the remote ACME-Climate/ACME (if so, what is the name of the branch)?

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

it's a branch on ACME-Climate/ACME -- mark-petersen/mpas/new_grids_EC60to30v2_RRS30to10v2

and thanks

@rljacob
Member

rljacob commented Dec 15, 2016

@jonbob can you try these on blues itself? (-mach blues). Same gnu version but the intel compiler is older. Turnaround will be slower.

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

Sure - does it make any sense to try gnu then? Or just intel?

@rljacob
Member

rljacob commented Dec 15, 2016

Try intel to start.

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

OK, thanks

@rljacob
Member

rljacob commented Dec 15, 2016

Since you're using a branch off of beta0, you could do "-mach blues -compiler gcc-5.2"; anvil has 5.3.
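
(For concreteness, a sketch of a blues case created with those flags, using the beta0-era create_newcase options; the case name, grid alias, and project are illustrative and simply mirror the cases listed above.)

    # from the scripts directory of the checkout (names/paths illustrative)
    ./create_newcase -case A_WCYCL2000S.ne30_oEC_ICG_gnu_blues \
        -compset A_WCYCL2000S -res ne30_oEC_ICG \
        -mach blues -compiler gcc-5.2 \
        -project <lcrc-project>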

@jonbob
Contributor Author

jonbob commented Dec 15, 2016

It looks like blues has both gcc-5.2 and gcc-5.3. My first test with gcc-5.2 won't even build gptl, but it's working with 5.3. Is there a reason to push on 5.2?

@rljacob
Member

rljacob commented Dec 16, 2016

No. That's it for 5.2.

@jayeshkrishna
Contributor

FYI: In issue #1181 we found that there are issues with the mvapich installation on Anvil (in that issue the problems were with intel+mvapich) and had no issues with openmpi.
So it might be worthwhile trying out your case with openmpi to see if you hit the same issue.

@jonbob
Contributor Author

jonbob commented Dec 19, 2016

Thanks @jayeshkrishna. I just looked and it has been pretty well hard-wired to mvapich for all runs. I'm not using any threading -- can you tell if the problem has been limited to threaded runs? I'll try openmpi regardless.
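
(A sketch of switching the MPI library, assuming the MPILIB machinery of this era is wired up for anvil; a clean build is needed after changing it.)

    # at case creation time:
    ./create_newcase ... -mach anvil -compiler intel -mpilib openmpi
    # or for an existing case, before cleaning and rebuilding:
    ./xmlchange -file env_build.xml -id MPILIB -val openmpi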

@jayeshkrishna
Contributor

Yes, the issues identified in #1181 are all related to threaded runs (single-threaded runs completed successfully in our tests without any crashes). However, we are not yet sure why mvapich would fail for threaded runs, so we cannot say for certain whether your problem is related.

@jayeshkrishna
Contributor

jayeshkrishna commented Dec 20, 2016

Were you able to get these cases running on blues?
(Update: Ok, just saw the output from A_WCYCL2000S.ne30_oEC_ICG_gnu_blues at the top.)

@jayeshkrishna
Contributor

@jonbob: It would be really useful to try your cases on another machine (other than Anvil/blues, with the same PE layout). The issues that you see with these new configs (compset+grid) may or may not be related to the issues with mvapich on Anvil (issue #1181).

@jonbob
Contributor Author

jonbob commented Dec 21, 2016

@jayeshkrishna - exactly -- I started one of the runs up on our local machines yesterday. Great minds and all...

@jayeshkrishna
Contributor

@jonbob: Were you able to debug this issue further (running on a different machine, etc.)?

@jonbob
Contributor Author

jonbob commented Mar 3, 2017

@jayeshkrishna - I haven't had these issues on anvil for some time, and never saw them on any other platform. So I think it's been fixed in the anvil environment and this issue can be closed.

jonbob closed this as completed Mar 3, 2017
agsalin pushed a commit that referenced this issue Apr 13, 2017
Permits each component to have its own config_archive.xml file in cime_config. This implementation permits this to be done incrementally.
Also adds the dart component to all cesm cases; since it's external and not a component, this is the safest way to ensure that it's available.

Test suite: scripts_regression_tests.py; several tests of create_newcase with comparison of env_archive.xml to previous versions; ERR.f09_g16.B1850 using cesm2_0_alpha06d.
A, B, and F compsets were created and the env_archive.xml file was checked for correctness.
Several rounds of introducing errors to env_run.xml, env_case.xml, env_build.xml, and env_archive.xml were conducted to verify that the schema check was working.

Test status: bit for bit

Code review: mvertens