Yellowstone tests often abort with affinity oversubscribe message #2059

Closed
billsacks opened this issue Nov 14, 2017 · 1 comment
Labels
Responsibility: CESM (Responsibility to manage and accomplish this issue is through CESM), st: wontfix, tp: config, ty: Bug

Comments

@billsacks
Member

Single-node tests on yellowstone often abort with the following message:

Execute poe command line: poe  /glade/p/cesmdata/cseg/tools/bin/launch /glade/scratch/sacks/scripts_regression_test.20171113_211728/IRT_N2.f19_g16_rx1.A.yellowstone_intel.fake_testing_only_20171113_220546/bld/cesm.exe
ATTENTION: 0031-393  Ignoring -resd/MP_RESD specified for batch job
ATTENTION: 0031-408  15 tasks allocated by Resource Manager, continuing...
ATTENTION: 0031-606 Unrecognized environment variable, MP_EAGER_LIMIT_LOCAL.
ERROR: 0031-758 AFFINITY: [pronghorn14] Oversubscribe: 15 tasks in total, each task requires 1 resource, but there are only 1 available resource. Affinity can not be applied
ERROR: 0031-161  EOF on socket connection with node pronghorn14-ib
INFO: 0031-639  Exit status from pm_respond = -1
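
For anyone scanning test logs for this failure, the 0031-758 line carries the useful numbers (tasks requested versus resources available). Below is a minimal sketch of pulling them out with a regular expression. The function name `find_oversubscribe` and the dictionary keys are hypothetical and not part of CIME; the pattern is based only on the single message quoted above, so other wordings of the message would not match.

```python
import re

# Pattern derived from the one 0031-758 message quoted in this issue.
# Assumption: other occurrences use the same wording.
OVERSUBSCRIBE_RE = re.compile(
    r"0031-758 AFFINITY: \[(?P<node>[^\]]+)\] Oversubscribe: "
    r"(?P<tasks>\d+) tasks in total, each task requires "
    r"(?P<per_task>\d+) resource, but there are only "
    r"(?P<available>\d+) available resource"
)


def find_oversubscribe(log_text):
    """Return details of the first oversubscribe error in log_text, or None."""
    match = OVERSUBSCRIBE_RE.search(log_text)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "tasks": int(match.group("tasks")),
        "per_task": int(match.group("per_task")),
        "available": int(match.group("available")),
    }


SAMPLE_LINE = (
    "ERROR: 0031-758 AFFINITY: [pronghorn14] Oversubscribe: 15 tasks in total, "
    "each task requires 1 resource, but there are only 1 available resource. "
    "Affinity can not be applied"
)

if __name__ == "__main__":
    print(find_oversubscribe(SAMPLE_LINE))
```

For the message above this reports 15 tasks against 1 available resource on pronghorn14.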

I'm not sure if there's something that could be changed in the job setup to prevent this. Alternatively, we used to assign jobs to the caldera queue only if they used 8 or fewer cores; I don't see a way to specify this with the new config_batch, which seems to let you specify requirements only in terms of full nodes, not number of cores.
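
To illustrate the difference between node-based and core-based queue limits, here is a rough sketch. It is not CIME's actual queue-selection code: it assumes a simplified "first queue whose node range fits, else the default queue" model, and the `max_tasks` field is a hypothetical per-queue core cap that does not exist in config_batch as far as this issue describes. The queue names and node ranges come from the yellowstone entries in config_batch.xml (the capability queue is left out for brevity).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    name: str
    nodemin: int
    nodemax: int
    default: bool = False
    # Hypothetical core cap; NOT an attribute that config_batch.xml supports.
    max_tasks: Optional[int] = None


def select_queue(queues: List[Queue], num_nodes: int, num_tasks: int) -> Optional[str]:
    """Pick the first queue that fits; fall back to the default queue.

    Simplified model for illustration only, not CIME's real logic.
    """
    for queue in queues:
        if not (queue.nodemin <= num_nodes <= queue.nodemax):
            continue
        if queue.max_tasks is not None and num_tasks > queue.max_tasks:
            continue
        return queue.name
    for queue in queues:
        if queue.default:
            return queue.name
    return None


# Node-based limits only, as in the current config_batch.xml entries.
NODE_ONLY = [
    Queue("caldera", nodemin=1, nodemax=1),
    Queue("regular", nodemin=9, nodemax=546, default=True),
    Queue("premium", nodemin=1, nodemax=546),
]

# Same queues, with the hypothetical 8-core cap on caldera.
WITH_CORE_CAP = [
    Queue("caldera", nodemin=1, nodemax=1, max_tasks=8),
    Queue("regular", nodemin=9, nodemax=546, default=True),
    Queue("premium", nodemin=1, nodemax=546),
]

if __name__ == "__main__":
    # A single-node, 15-task job like the failing test above.
    print(select_queue(NODE_ONLY, num_nodes=1, num_tasks=15))
    print(select_queue(WITH_CORE_CAP, num_nodes=1, num_tasks=15))
```

Under this model, a single-node 15-task job is sent to caldera when only node counts are checked, while a core cap of 8 would route it past caldera to the next queue whose node range fits.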

A workaround is to get rid of this line in config_batch.xml when running scripts_regression_tests:

diff --git a/config/cesm/machines/config_batch.xml b/config/cesm/machines/config_batch.xml
index 945c443..8be0c9a 100644
--- a/config/cesm/machines/config_batch.xml
+++ b/config/cesm/machines/config_batch.xml
@@ -444,7 +444,6 @@

   <batch_system MACH="yellowstone" type="lsf" version="9.1">
     <queues>
-      <queue walltimemax="24:00" nodemin="1" nodemax="1" >caldera</queue>
       <queue walltimemax="12:00" nodemin="9" nodemax="546" default="true">regular</queue>
       <queue walltimemax="12:00" nodemin="16385" nodemax="2184">capability</queue>
       <queue walltimemax="12:00" nodemin="1" nodemax="546">premium</queue>

But I'm not sure we want to actually commit that change: it might break things that really need to run on caldera.
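
If you would rather apply the workaround to a scratch copy than hand-edit the file, something like the following could work. It is only a sketch: the helper name `drop_queue` is hypothetical, the `config_batch` root element in the sample is an assumption about the file layout, and the sample XML is a cut-down stand-in for the real config_batch.xml. Note that ElementTree drops XML comments and may change formatting when it re-serializes, so this is best kept to a temporary copy used for scripts_regression_tests.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for config_batch.xml; the root element name is an assumption.
SAMPLE = """<config_batch>
  <batch_system MACH="yellowstone" type="lsf" version="9.1">
    <queues>
      <queue walltimemax="24:00" nodemin="1" nodemax="1" >caldera</queue>
      <queue walltimemax="12:00" nodemin="9" nodemax="546" default="true">regular</queue>
      <queue walltimemax="12:00" nodemin="1" nodemax="546">premium</queue>
    </queues>
  </batch_system>
</config_batch>"""


def drop_queue(xml_text, machine, queue_name):
    """Return xml_text with the named queue removed for the given machine."""
    root = ET.fromstring(xml_text)
    for batch in root.iter("batch_system"):
        if batch.get("MACH") != machine:
            continue
        for queues in batch.findall("queues"):
            # Copy the list first so removal does not disturb iteration.
            for queue in list(queues.findall("queue")):
                if (queue.text or "").strip() == queue_name:
                    queues.remove(queue)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(drop_queue(SAMPLE, "yellowstone", "caldera"))
```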

@billsacks added the labels Responsibility: CESM (Responsibility to manage and accomplish this issue is through CESM), tp: config, ty: Bug, st: wontfix on Nov 14, 2017
@billsacks
Member Author

Since yellowstone won't be around much longer, I'm not sure it's worth fixing this. I wanted to open it to document the workaround, but now I'm closing it as a wontfix.

jgfouca added a commit that referenced this issue Nov 14, 2017
…ests

Add unit tests of compare_two's handling of pass/fail in comparison

Previously, I had figured that it was sufficient to ensure that
_component_compare_test was actually being called, figuring that the
tests of _component_compare_test belong elsewhere. But, since this is
such a critical aspect of the system_test_compare_two infrastructure,
I'm adding some tests covering _component_compare_test here.

These new tests cover the logic related to in-test comparisons in
system_tests_compare_two and system_tests_common. However, they will
NOT cover the logic in hist_utils: I'm stubbing out the actual call into
hist_utils, under the assumption that this is - or should be - covered
by other tests. But I'm not sure if hist_utils is actually covered
sufficiently by unit tests. If it isn't, it should be.

This also required some minor refactoring of system_tests_common in
order to allow stubbing out the call into hist_utils. (This refactoring
would not have been needed if we allowed use of the mock module: see
#2056.)

Test suite: scripts_regression_tests on yellowstone
Passes other than the issues documented in #2057 and #2059

Also ensured that comparison failures are still reported correctly by
running a REP test that forces a comparison failure (due to missing
cprnc).
Test baseline: n/a
Test namelist changes: none
Test status: bit for bit

Fixes: Partially addresses #1640

User interface changes?: none

Update gh-pages html (Y/N)?: N

Code review: