Yellowstone tests often abort with affinity oversubscribe message #2059

billsacks · 2017-11-14T15:01:02Z

Single-node tests on yellowstone often abort with the following message:

Execute poe command line: poe  /glade/p/cesmdata/cseg/tools/bin/launch /glade/scratch/sacks/scripts_regression_test.20171113_211728/IRT_N2.f19_g16_rx1.A.yellowstone_intel.fake_testing_only_20171113_220546/bld/cesm.exe
ATTENTION: 0031-393  Ignoring -resd/MP_RESD specified for batch job
ATTENTION: 0031-408  15 tasks allocated by Resource Manager, continuing...
ATTENTION: 0031-606 Unrecognized environment variable, MP_EAGER_LIMIT_LOCAL.
ERROR: 0031-758 AFFINITY: [pronghorn14] Oversubscribe: 15 tasks in total, each task requires 1 resource, but there are only 1 available resource. Affinity can not be applied
ERROR: 0031-161  EOF on socket connection with node pronghorn14-ib
INFO: 0031-639  Exit status from pm_respond = -1

I'm not sure if there's something that could be changed in the job setup to prevent this. Alternatively: We used to only assign jobs to the caldera queue if they used 8 or fewer cores; I don't see a way to specify this with the new config_batch (it seems to only let you specify requirements in terms of full nodes, not number of cores).

A workaround is to get rid of this line in config_batch.xml when running scripts_regression_tests:

diff --git a/config/cesm/machines/config_batch.xml b/config/cesm/machines/config_batch.xml
index 945c443..8be0c9a 100644
--- a/config/cesm/machines/config_batch.xml
+++ b/config/cesm/machines/config_batch.xml
@@ -444,7 +444,6 @@

   <batch_system MACH="yellowstone" type="lsf" version="9.1">
     <queues>
-      <queue walltimemax="24:00" nodemin="1" nodemax="1" >caldera</queue>
       <queue walltimemax="12:00" nodemin="9" nodemax="546" default="true">regular</queue>
       <queue walltimemax="12:00" nodemin="16385" nodemax="2184">capability</queue>
       <queue walltimemax="12:00" nodemin="1" nodemax="546">premium</queue>

But I'm not sure we want to actually commit that change: it might break things that really need to run on caldera.
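To make the effect of the workaround concrete, here is a minimal sketch of node-count-based queue selection. It is not CIME's actual implementation: it assumes the first queue whose `nodemin`/`nodemax` range contains the job's node count wins, and the `Queue` type, the `select_queue` function, and the `coremax` field are all hypothetical names introduced for illustration. The queue limits are copied from the diff above. The `coremax` field shows what a core-based cap (the old "8 or fewer cores" rule) might look like if config_batch supported one.

```python
from typing import NamedTuple, Optional, Sequence


class Queue(NamedTuple):
    name: str
    nodemin: int
    nodemax: int
    # Hypothetical per-queue core cap; config_batch.xml as shown has no such attribute.
    coremax: Optional[int] = None


def select_queue(queues: Sequence[Queue], num_nodes: int, num_cores: int) -> Optional[str]:
    """Return the first queue that accepts the job (assumed first-match rule)."""
    for q in queues:
        if not (q.nodemin <= num_nodes <= q.nodemax):
            continue
        if q.coremax is not None and num_cores > q.coremax:
            continue
        return q.name
    return None


# Queue limits copied from the config_batch.xml excerpt in this issue.
QUEUES = [
    Queue("caldera", 1, 1),
    Queue("regular", 9, 546),
    Queue("capability", 16385, 2184),
    Queue("premium", 1, 546),
]

# The failing job: 15 tasks on a single node.
with_caldera = select_queue(QUEUES, num_nodes=1, num_cores=15)

# Workaround from this issue: drop the caldera entry.
without_caldera = select_queue(QUEUES[1:], num_nodes=1, num_cores=15)

# Hypothetical alternative: keep caldera but cap it at 8 cores.
capped = [Queue("caldera", 1, 1, coremax=8)] + QUEUES[1:]
large_job_with_cap = select_queue(capped, num_nodes=1, num_cores=15)
small_job_with_cap = select_queue(capped, num_nodes=1, num_cores=8)
```

Under these assumptions, the 15-task single-node job lands on caldera with the stock queue list and on premium once the caldera entry is removed. With the hypothetical core cap, only jobs of 8 or fewer cores would still be routed to caldera.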

billsacks · 2017-11-14T15:01:34Z

Since yellowstone won't be around much longer, I'm not sure it's worth fixing this. I wanted to open it to document the workaround, but now I'm closing it as a wontfix.

…ests Add unit tests of compare_two's handling of pass/fail in comparison

Previously, I had figured that it was sufficient to ensure that _component_compare_test was actually being called, figuring that the tests of _component_compare_test belong elsewhere. But, since this is such a critical aspect of the system_test_compare_two infrastructure, I'm adding some tests covering _component_compare_test here.

These new tests cover the logic related to in-test comparisons in system_tests_compare_two and system_tests_common. However, they will NOT cover the logic in hist_utils: I'm stubbing out the actual call into hist_utils, under the assumption that this is - or should be - covered by other tests. But I'm not sure if hist_utils is actually covered sufficiently by unit tests. If it isn't, it should be.

This also required some minor refactoring of system_tests_common in order to allow stubbing out the call into hist_utils. (This refactoring would not have been needed if we allowed use of the mock module: see #2056.)

Test suite: scripts_regression_tests on yellowstone
Passes other than the issues documented in #2057 and #2059
Also ensured that comparison failures are still reported correctly by running a REP test that forces a comparison failure (due to missing cprnc).

Test baseline: n/a

Test namelist changes: none

Test status: bit for bit

Fixes: Partially addresses #1640

User interface changes?: none

Update gh-pages html (Y/N)?: N

Code review:

billsacks added the Responsibility: CESM, tp: config, ty: Bug, and st: wontfix labels Nov 14, 2017

billsacks closed this as completed Nov 14, 2017

billsacks mentioned this issue Nov 14, 2017

Add unit tests of compare_two's handling of pass/fail in comparison #2060

Merged
