-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement NCR with SystemTestsCompareTwo #1744
Implement NCR with SystemTestsCompareTwo #1744
Conversation
Note that I was unable to compare the output to the original NCR test, as I was getting sharedlib build failures in that one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor change in code order, otherwise looks fine
scripts/lib/CIME/SystemTests/ncr.py
Outdated
self._case.set_value("NTASKS_{}".format(comp), ntasks // 2) | ||
case_setup(self._case, test_mode = True, reset = True) | ||
|
||
def _case_one_setup(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put case_one first - just for continuity when you read the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm just being dense, it looks to me like there is complete inconsistency between various parts of this implementation, in terms of which case does what.
First, both cases set NINST to 1. Shouldn't one of them set NINST to 2? (And as an aside, I don't think you need str(1)
here - simply 1
should suffice, I think.)
In addition, I see:
-
The docstring says that the second case is the one that runs 2 instances
-
In
__init__
, the description of each case agrees with the docstring, but the run_two_suffix is "singleinst" rather than "multiinst" -
The implementation of case_one_setup and case_two_setup make case 1 look like the one that is supposed to be multi-inst - disagreeing with the above documentation (though agreeing with run_two_suffix)
If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.
See also some my inline comments.
scripts/lib/CIME/SystemTests/ncr.py
Outdated
ntasks = self._case.get_value("NTASKS_{}".format(comp)) | ||
ntasks_sum += ntasks * 2 | ||
self._case.set_value("NTASKS_{}".format(comp), ntasks * 2) | ||
case_setup(self._case, test_mode = False, reset = True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is test_mode False here? Please add a comment describing this unintuitive setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_mode is False here because the test_mode flag indicates whether or not the case.test file should be updated (if test_mode is true it is not updated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedwards4b Could you clarify this for me? All I know is that it was running with the wrong number of nodes when it was True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedwards4b is there a general rule for when the case.test file should or should not be updated?
I'm especially wondering why test_mode is False here despite being True for both cases in the NCK test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the two test runs have different pe counts as this one does then the run with the larger value should write the case.test file and the other one should not overwrite it. That way the batch system will allocate enough resources for the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, great, thanks @jedwards4b . @mfdeakin-sandia can you please add Jim's comments as comments in the code?
scripts/lib/CIME/SystemTests/ncr.py
Outdated
Build two exectuables for this test: | ||
The first is a default build | ||
The second runs two instances for each component with the same total number of tasks, | ||
and runs each of them concurrently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that (as you said) you haven't been able to fully test this, can you please add a comment here - and ideally also in the config_tests.xml file - noting that this is untested and may not be working quite right?
scripts/lib/CIME/SystemTests/ncr.py
Outdated
self._case.set_value("ROOTPE_{}".format(comp), 0) | ||
ntasks = self._case.get_value("NTASKS_{}".format(comp)) | ||
if ntasks > 1: | ||
self._case.set_value("NTASKS_{}".format(comp), ntasks // 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's wrong to both halve the number of tasks here and double the number of tasks in case_one. I think you want to do one or the other. My preference would be for following what's done in the NCK test: halving ntasks in both initially, then doubling them in the case that runs 2 instances - so that the case that runs 2 instances ends up with the original task count (though slightly different, by design, if the original task count was odd).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I misunderstood the old code - I was treating it like it had two separate env_run.xml files with the same initial paramters, as is the case with SystemTestsCompareTwo. But this makes much more sense and explains my confusion.
This comment deserves more discussion: " ACME currently fails the diff test; but at least it builds unlike on master." Does CESM run the NCR test in its suites? Because it sounds like it never worked prior to this PR. |
@rljacob The error I was getting was that acme.exe didn't exist in the build directory after SHAREDLIBBUILD, which seems like it would fail for CESM as well. |
@rljacob and @mfdeakin-sandia as far as I can tell, the NCR test is currently unused. |
How does CESM test multi-instance? NCK? |
Yes. NCK tested a different multi-instance configuration, which would probably be useful to test, but I guess we hadn't been testing it for a while. |
ad810a2
to
8b6e3d5
Compare
I think I've fixed everything, is there anything else? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these fixes. One remaining inconsistency, noted inline.
However, I'd prefer if you reversed case one and case two here, following my earlier comment:
If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.
-- it seems confusing to have NCR work in the reverse way from NCK in this respect. But I won't hold up this PR on that account.
scripts/lib/CIME/SystemTests/ncr.py
Outdated
SystemTestsCompareTwo.__init__(self, case, | ||
separate_builds = True, | ||
run_two_suffix = "singleinst", | ||
run_one_description = "default build", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run_one_description and run_two_description are reversed, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have them reversed because I'm uncertain it'll work otherwise; but I'll try it and see if it works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope; it crashes if I have case_two setup this way. I'll just fix the decriptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this doesn't work: In general, tests implemented on top of system_test_compare_two shouldn't care which is case one and which is case two. The one thing I can think of is that maybe this is tied in with the need to set test_mode to False in case_one - e.g., maybe case one needs to be the one with more tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I'm fine with your just fixing the description here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; that's exactly the reason. I've added a comment explaining this
Add comments warning that this test is currently unused and may not work Add clarifying comments
8b6e3d5
to
8fbbb03
Compare
This reimplements the NCR test with the SystemTestsCompareTwo infrastructure to significantly simplify the test. ACME currently fails the diff test; but at least it builds unlike on master.
Test suite: scripts_regression_tests - does not actually test NCR (AFAICT); so just going by the B_CheckCode test
Test baseline:
Test namelist changes:
Test status: PASS
Fixes #444
User interface changes?: N
Update gh-pages html (Y/N)?: N
Code review: @jgfouca @jedwards4b @billsacks