Implement NCR with SystemTestsCompareTwo #1744

mfdeakin-sandia · 2017-07-17T16:57:36Z

This reimplements the NCR test with the SystemTestsCompareTwo infrastructure to significantly simplify the test. ACME currently fails the diff test; but at least it builds unlike on master.

Test suite: scripts_regression_tests - does not actually test NCR (AFAICT); so just going by the B_CheckCode test
Test baseline:
Test namelist changes:
Test status: PASS

Fixes #444

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review: @jgfouca @jedwards4b @billsacks

mfdeakin-sandia · 2017-07-17T16:59:07Z

Note that I was unable to compare the output to the original NCR test, as I was getting sharedlib build failures in that one.
Also note that ACME currently fails this one for me, though other system tests that actually are tested nightly fail for me on master as well.

jgfouca

Much better, thanks!

jedwards4b

minor change in code order, otherwise looks fine

jedwards4b · 2017-07-17T19:09:05Z

scripts/lib/CIME/SystemTests/ncr.py

+                self._case.set_value("NTASKS_{}".format(comp), ntasks // 2)
+        case_setup(self._case, test_mode = True, reset = True)
+
+    def _case_one_setup(self):


Can you put case_one first - just for continuity when you read the file.


billsacks

Unless I'm just being dense, it looks to me like there is complete inconsistency between various parts of this implementation, in terms of which case does what.

First, both cases set NINST to 1. Shouldn't one of them set NINST to 2? (And as an aside, I don't think you need str(1) here - simply 1 should suffice, I think.)

In addition, I see:

- The docstring says that the second case is the one that runs 2 instances.
- In __init__, the description of each case agrees with the docstring, but the run_two_suffix is "singleinst" rather than "multiinst".
- The implementations of case_one_setup and case_two_setup make case 1 look like the one that is supposed to be multi-inst, disagreeing with the above documentation (though agreeing with run_two_suffix).

If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.

See also some of my inline comments.
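A minimal sketch of the NCK-style layout requested in this review, with case one single-instance and case two multi-instance. Plain dicts stand in for the real case object, and the `COMPONENTS` list and helper names are illustrative assumptions rather than the code in ncr.py:

```python
# Hypothetical stand-in for a CIME case: a plain dict of XML variable values.
COMPONENTS = ["ATM", "LND", "OCN"]  # illustrative subset, not the full component list


def case_one_setup(case):
    """Case one: a single instance of every component."""
    for comp in COMPONENTS:
        case["NINST_{}".format(comp)] = 1  # a plain int, no str() wrapper


def case_two_setup(case):
    """Case two: two instances of every component."""
    for comp in COMPONENTS:
        case["NINST_{}".format(comp)] = 2


case_one, case_two = {}, {}
case_one_setup(case_one)
case_two_setup(case_two)
print(case_one["NINST_ATM"], case_two["NINST_ATM"])
```

With this ordering, a run_two_suffix of "multiinst" would match what case two actually does.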

billsacks · 2017-07-17T21:03:52Z

scripts/lib/CIME/SystemTests/ncr.py

+            ntasks = self._case.get_value("NTASKS_{}".format(comp))
+            ntasks_sum += ntasks * 2
+            self._case.set_value("NTASKS_{}".format(comp), ntasks * 2)
+        case_setup(self._case, test_mode = False, reset = True)


Why is test_mode False here? Please add a comment describing this unintuitive setting.

test_mode is False here because the test_mode flag indicates whether or not the case.test file should be updated (if test_mode is true it is not updated)

@jedwards4b Could you clarify this for me? All I know is that it was running with the wrong number of nodes when it was True.

@jedwards4b is there a general rule for when the case.test file should or should not be updated?

I'm especially wondering why test_mode is False here despite being True for both cases in the NCK test.

If the two test runs have different pe counts as this one does then the run with the larger value should write the case.test file and the other one should not overwrite it. That way the batch system will allocate enough resources for the test.

Ah, great, thanks @jedwards4b . @mfdeakin-sandia can you please add Jim's comments as comments in the code?
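The rule described in this thread can be sketched as a small decision helper. `plan_test_mode` is a hypothetical function written only for illustration (it is not part of CIME); it assumes, as stated above, that test_mode = False lets case_setup update the case.test file and test_mode = True leaves it alone:

```python
def plan_test_mode(ntasks_case_one, ntasks_case_two):
    """Return (test_mode_one, test_mode_two) for the two case_setup calls.

    The case with the larger task count gets test_mode=False so that it
    writes case.test; the other gets test_mode=True so it does not
    overwrite that file with a smaller resource request.
    """
    one_is_larger = ntasks_case_one >= ntasks_case_two
    return (not one_is_larger, one_is_larger)


# Case one has more tasks, so case one should write case.test.
print(plan_test_mode(8, 4))
# Case two has more tasks, so case two should write case.test.
print(plan_test_mode(4, 8))
```

When both counts are equal the sketch arbitrarily lets case one write the file; the thread does not say which case should write it in that situation.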

billsacks · 2017-07-17T21:04:54Z

scripts/lib/CIME/SystemTests/ncr.py

+Build two executables for this test:
+The first is a default build
+The second runs two instances for each component with the same total number of tasks,
+and runs each of them concurrently


Given that (as you said) you haven't been able to fully test this, can you please add a comment here - and ideally also in the config_tests.xml file - noting that this is untested and may not be working quite right?

billsacks · 2017-07-17T21:16:06Z

scripts/lib/CIME/SystemTests/ncr.py

+            self._case.set_value("ROOTPE_{}".format(comp), 0)
+            ntasks = self._case.get_value("NTASKS_{}".format(comp))
+            if ntasks > 1:
+                self._case.set_value("NTASKS_{}".format(comp), ntasks // 2)


I think it's wrong to both halve the number of tasks here and double the number of tasks in case_one. I think you want to do one or the other. My preference would be for following what's done in the NCK test: halving ntasks in both initially, then doubling them in the case that runs 2 instances - so that the case that runs 2 instances ends up with the original task count (though slightly different, by design, if the original task count was odd).

Ah, I misunderstood the old code - I was treating it like it had two separate env_run.xml files with the same initial parameters, as is the case with SystemTestsCompareTwo. But this makes much more sense and explains my confusion.
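A short worked example of the halve-then-double scheme preferred in this thread, showing why an odd original task count ends up slightly different. `nck_style_task_counts` is a hypothetical helper for illustration only, not a CIME function:

```python
def nck_style_task_counts(original_ntasks):
    """Return (single_inst_ntasks, multi_inst_ntasks) for one component.

    Both cases start from the halved count; only the case that runs two
    instances doubles it again.
    """
    # Halve only when there is more than one task, as in the quoted diff.
    halved = original_ntasks // 2 if original_ntasks > 1 else original_ntasks
    return halved, halved * 2


for n in (8, 7, 1):
    print(n, nck_style_task_counts(n))
```

For an even count such as 8, the multi-instance case gets the original 8 tasks back; for 7, integer division gives 3 and then 6, one fewer than the original.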

rljacob · 2017-07-17T21:34:19Z

This comment deserves more discussion: "ACME currently fails the diff test; but at least it builds unlike on master."

Does CESM run the NCR test in its suites? Because it sounds like it never worked prior to this PR.

mfdeakin-sandia · 2017-07-17T21:47:01Z

@rljacob The error I was getting was that acme.exe didn't exist in the build directory after SHAREDLIBBUILD, which seems like it would fail for CESM as well.
But given I'm having issues with other tests (though these actually pass this step), it might?

billsacks · 2017-07-17T21:52:13Z

@rljacob and @mfdeakin-sandia as far as I can tell, the NCR test is currently unused.

rljacob · 2017-07-17T22:05:48Z

How does CESM test multi-instance? NCK?

billsacks · 2017-07-17T22:11:38Z

How does CESM test multi-instance? NCK?

Yes. NCR tested a different multi-instance configuration, which would probably be useful to test, but I guess we hadn't been testing it for a while.

mfdeakin-sandia · 2017-07-18T16:59:06Z

I think I've fixed everything; is there anything else?

billsacks

Thanks for these fixes. One remaining inconsistency, noted inline.

However, I'd prefer if you reversed case one and case two here, following my earlier comment:

If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.

-- it seems confusing to have NCR work in the reverse way from NCK in this respect. But I won't hold up this PR on that account.

billsacks · 2017-07-19T15:46:58Z

scripts/lib/CIME/SystemTests/ncr.py

+        SystemTestsCompareTwo.__init__(self, case,
+                                       separate_builds = True,
+                                       run_two_suffix = "singleinst",
+                                       run_one_description = "default build",


run_one_description and run_two_description are reversed, I think

I have them reversed because I'm uncertain it'll work otherwise; but I'll try it and see if it works

Nope; it crashes if I have case_two set up this way. I'll just fix the descriptions.

I wonder why this doesn't work: In general, tests implemented on top of system_test_compare_two shouldn't care which is case one and which is case two. The one thing I can think of is that maybe this is tied in with the need to set test_mode to False in case_one - e.g., maybe case one needs to be the one with more tasks.

Anyway, I'm fine with your just fixing the description here.

Yes; that's exactly the reason. I've added a comment explaining this

Add comments warning that this test is currently unused and may not work Add clarifying comments

mfdeakin-sandia added 3 commits July 12, 2017 17:17

Initial implementation of the NCR test with SystemTestsCompareTwo

00c106a

Initial work on converting NCR to use the SystemTestsCompareTwo object

278baf7

Fix pylint errors

d74308f

mfdeakin-sandia self-assigned this Jul 17, 2017

mfdeakin-sandia requested review from jedwards4b, billsacks and jgfouca July 17, 2017 16:57

mfdeakin-sandia added the in progress label Jul 17, 2017

jgfouca approved these changes Jul 17, 2017


jedwards4b requested changes Jul 17, 2017


Remove empty _common_setup. Also change order of case_one_setup and case_two_setup

8847693

jedwards4b approved these changes Jul 17, 2017


billsacks requested changes Jul 17, 2017


mfdeakin-sandia force-pushed the mfdeakin-sandia/systemtestscomparetwo/ncr branch from ad810a2 to 8b6e3d5 Compare July 18, 2017 16:57

billsacks requested changes Jul 19, 2017


Fix number of instances and number of tasks for the test.

8fbbb03

Add comments warning that this test is currently unused and may not work Add clarifying comments

mfdeakin-sandia force-pushed the mfdeakin-sandia/systemtestscomparetwo/ncr branch from 8b6e3d5 to 8fbbb03 Compare July 19, 2017 18:27

mfdeakin-sandia merged commit e1db2f1 into master Jul 19, 2017

mfdeakin-sandia removed the in progress label Jul 19, 2017

mfdeakin-sandia deleted the mfdeakin-sandia/systemtestscomparetwo/ncr branch July 19, 2017 22:18
