Runtime issues on anvil #1184
Comments
This looks like a CAM issue.
Looks like a whole bunch of issues!
All of the issues above involve (NaNs + CAM). @jonbob: can you also include information on the version of the code you are running for these experiments?
These tests were all run with v1.0.0-beta-3-g6fb9bf7 -- which should be a branch off the beta0 tag that adds some new grids.
Is the branch on the remote ACME-Climate/ACME (and if so, what is the name of the branch)?
It's a branch on ACME-Climate/ACME -- mark-petersen/mpas/new_grids_EC60to30v2_RRS30to10v2 -- and thanks.
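For reference, that version string has the form of git describe output, so something like the following should recover the same code state -- a minimal sketch, assuming a local clone with ACME-Climate/ACME configured as the origin remote (only the branch name and version string come from the comments above):

    # Fetch and check out the branch named above, then confirm the version string.
    # The remote name "origin" is an assumption.
    git fetch origin
    git checkout mark-petersen/mpas/new_grids_EC60to30v2_RRS30to10v2
    git describe    # expected to report v1.0.0-beta-3-g6fb9bf7 or similar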
@jonbob can you try these on blues itself (-mach blues)? Same gnu version, but the intel compiler is older. Turnaround will be slower.
Sure - does it make any sense to try gnu then? Or just intel?
Try intel to start.
OK, thanks.
Since you're using a branch off of beta0, you could do "-mach blues -compiler gcc-5.2"; anvil has 5.3.
It looks like blues has both gcc-5.2 and gcc-5.3. My first test with gcc-5.2 won't even build gptl, but it's working with 5.3. Is there a reason to push on 5.2? |
No. That's it for 5.2. |
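A minimal sketch of the blues cross-check suggested above -- only the -mach and -compiler values come from this exchange; the create_newcase invocation, case name, compset, and resolution are assumptions pieced together from the case names in the issue description below:

    # Hedged sketch: set up one of the failing configurations on blues with gcc-5.3.
    # Everything except -mach/-compiler is an assumption.
    ./create_newcase -case wcycl2000_oECv2_blues_gnu \
                     -compset A_WCYCL2000S -res ne30_oECv2_ICG \
                     -mach blues -compiler gcc-5.3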
FYI: in issue #1181 we found that there are problems with the mvapich installation on Anvil (in that issue we saw failures with intel+mvapich) and had no issues with openmpi.
Thanks @jayeshkrishna. I just looked, and it has been pretty well hard-wired to mvapich for all runs, so I'll try openmpi. I'm not using any threading -- can you tell if the problem has been limited to threaded runs? I'll try openmpi regardless.
Yes, the issues identified in #1181 are all related to threaded runs (single-threaded runs completed successfully in our tests without any crashes). However, we are not yet sure why mvapich would fail for threaded runs, so we cannot say for sure whether your problem is related.
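One way that switch is typically made on an existing case -- a hedged sketch only: the MPILIB variable and the xmlchange call reflect typical CIME-era cases and are not confirmed anywhere in this thread:

    # Hedged sketch: point an existing case at openmpi instead of mvapich.
    # $CASEROOT, MPILIB, and the xmlchange syntax are assumptions here.
    cd $CASEROOT
    ./xmlchange -file env_build.xml -id MPILIB -val openmpi
    # ...then clean and rebuild the case so the new MPI library is picked up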
Were you able to get these cases running on blues? |
@jayeshkrishna - exactly -- I started one of the runs up on our local machines yesterday. Great minds and all...
@jonbob: Were you able to debug this issue further (e.g., by running on a different machine)?
@jayeshkrishna - I haven't had these issues on anvil for some time, and never saw them on any other platform. So I think it's been fixed in the anvil environment and this issue can be closed.
I have been testing a new mpas grid on anvil and have experienced non-replicable runtime errors. I had initially thought they could be due to the new grid, but now I am concerned that there are issues with anvil itself.
My tests have used both the intel and gnu compilers. I'll outline my tests and results below:
A_WCYCL1850S.ne30_oECv2_ICG_gnu (new mpas grid)
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
ERROR: shr_assert_in_domain: state%t has invalid value NaN at location: 14 1
Expected value to be a number.
ERROR: NaN produced in physics_state by package micro_mg.
A_WCYCL2000S.ne30_oECv2_ICG_gnu (new mpas grid)
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
although the run died with the following error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
and trace:
Backtrace for this error:
#0 0x2B5A1FEACC17
#1 0x2B5A1FEABE10
#2 0x2B5A208CF65F
#3 0x7F13D2 in __rrtmg_sw_taumol_MOD_taumol_sw
#4 0x7EBE9B in __rrtmg_sw_spcvmc_MOD_spcvmc_sw
#5 0x7E8323 in __rrtmg_sw_rad_MOD_rrtmg_sw
#6 0x55B196 in __radsw_MOD_rad_rrtmg_sw
#7 0x54F992 in __radiation_MOD_radiation_tend
#8 0x52C1B6 in __physpkg_MOD_tphysbc at physpkg.F90:0
#9 0x53199E in __physpkg_MOD_phys_run1
#10 0x483201 in __cam_comp_MOD_cam_run1
#11 0x47DEA5 in __atm_comp_mct_MOD_atm_run_mct
#12 0x41EA90 in __component_mod_MOD_component_run
#13 0x40F344 in __cesm_comp_mod_MOD_cesm_run
A_WCYCL2000S.ne30_oECv2_ICG_intel (new mpas grid)
05-08-01_01
05-06-27_19
05-04-01_01
*** halting in modal_aero_lw after nerr_dopaer = 1000
after many warnings about "Aerosol optical depth is unreasonably high in this layer."
cesm.exe 0000000002D6ED83 nf_mod_mp_inq_var 921 nf_mod.F90
cesm.exe 0000000002E7613D piodarray_mp_writ 499 piodarray.F90.in
cesm.exe 0000000002E76059 piodarray_mp_writ 223 piodarray.F90.in
cesm.exe 0000000002E75AC6 piodarray_mp_writ 293 piodarray.F90.in
cesm.exe 000000000065AC3A restart_physics_m 402 restart_physics.F90
cesm.exe 000000000051E14F cam_restart_mp_ca 244 cam_restart.F90
cesm.exe 00000000004D90B9 cam_comp_mp_cam_r 394 cam_comp.F90
cesm.exe 00000000004C9317 atm_comp_mct_mp_a 509 atm_comp_mct.F90
cesm.exe 000000000042D834 component_mod_mp_ 1049 component_mod.F90
cesm.exe 0000000000417AD7 cesm_comp_mod_mp_ 3266 cesm_comp_mod.F90
cesm.exe 000000000042B083 MAIN__ 107 cesm_driver.F90
A_WCYCL2000S.ne30_oEC_ICG_gnu (old mpas grid)
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
and
ab matrix
1 0.0000000 0.0000000 NaN NaN 0.0000000
2 0.0000000 NaN NaN -0.0675287 0.0000000
3 0.0000000 -0.1158149 1.1139185 -0.0176432 0.0000000
4 0.0000000 -0.0463899 1.0277619 -0.0025146 0.0000000
5 0.0000000 -0.0101187 1.0064288 0.0000000 -0.0368544
6 0.0000000 0.0000000 NaN 0.0000000 0.0000000
7 -0.0039143 -33971.2521576 NaN -1.1737137 0.0000000
A_WCYCL2000S.ne30_oEC_ICG_intel (old mpas grid)
rank 865: MPI error (MPI_File_sync) : Other I/O error , error stack:
ADIOI_GEN_FLUSH(26): Other I/O error Host is down
A_WCYCL2000S.ne30_oEC_ICG_gnu_blues (old mpas grid)
[mpiexec@b5] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
[mpiexec@b5] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:169): unable to write data to proxy
[mpiexec@b5] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@b5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@b5] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@b5] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
A_WCYCL2000S.ne30_oECv2_ICG_intel-openmpi (new mpas grid)
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: b562
Local device: mlx5_0
Queue pair type: Reliable connected (RC)
[b562:46696] *** An error occurred in MPI_Isend
[b562:46696] *** reported by process [47788120604673,47785806136274]
[b562:46696] *** on communicator MPI COMMUNICATOR 33 DUP FROM 0
[b562:46696] *** MPI_ERR_OTHER: known error not in list
[b562:46696] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b562:46696] *** and potentially your MPI job)
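The Open MPI message above points at the ib-locked-pages FAQ, which mostly concerns registered ("locked") memory limits on the compute nodes. A minimal sketch of the first check that FAQ suggests; the actual limits and any permanent fix are site-specific and would be up to the anvil administrators:

    # Hedged sketch: inspect the locked-memory limit on a compute node, per the
    # Open MPI FAQ linked in the error text above.
    ulimit -l              # ideally "unlimited" on InfiniBand nodes
    ulimit -l unlimited    # raise it for this shell, if the system allows it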