Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement NCR with SystemTestsCompareTwo #1744

Merged
merged 5 commits into from
Jul 19, 2017

Conversation

mfdeakin-sandia
Copy link
Contributor

This reimplements the NCR test with the SystemTestsCompareTwo infrastructure to significantly simplify the test. ACME currently fails the diff test; but at least it builds unlike on master.

Test suite: scripts_regression_tests - does not actually test NCR (AFAICT); so just going by the B_CheckCode test
Test baseline:
Test namelist changes:
Test status: PASS

Fixes #444

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review: @jgfouca @jedwards4b @billsacks

@mfdeakin-sandia
Copy link
Contributor Author

mfdeakin-sandia commented Jul 17, 2017

Note that I was unable to compare the output to the original NCR test, as I was getting sharedlib build failures in that one.
Also note that ACME currently fails this one for me; though other actually nightly tested system tests fail for me on master as well

Copy link
Contributor

@jgfouca jgfouca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thanks!

Copy link
Contributor

@jedwards4b jedwards4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change in code order, otherwise looks fine

self._case.set_value("NTASKS_{}".format(comp), ntasks // 2)
case_setup(self._case, test_mode = True, reset = True)

def _case_one_setup(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put case_one first - just for continuity when you read the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm just being dense, it looks to me like there is complete inconsistency between various parts of this implementation, in terms of which case does what.

First, both cases set NINST to 1. Shouldn't one of them set NINST to 2? (And as an aside, I don't think you need str(1) here - simply 1 should suffice, I think.)

In addition, I see:

  • The docstring says that the second case is the one that runs 2 instances

  • In __init__, the description of each case agrees with the docstring, but the run_two_suffix is "singleinst" rather than "multiinst"

  • The implementation of case_one_setup and case_two_setup make case 1 look like the one that is supposed to be multi-inst - disagreeing with the above documentation (though agreeing with run_two_suffix)

If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.

See also some my inline comments.

ntasks = self._case.get_value("NTASKS_{}".format(comp))
ntasks_sum += ntasks * 2
self._case.set_value("NTASKS_{}".format(comp), ntasks * 2)
case_setup(self._case, test_mode = False, reset = True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is test_mode False here? Please add a comment describing this unintuitive setting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_mode is False here because the test_mode flag indicates whether or not the case.test file should be updated (if test_mode is true it is not updated)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b Could you clarify this for me? All I know is that it was running with the wrong number of nodes when it was True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b is there a general rule for when the case.test file should or should not be updated?

I'm especially wondering why test_mode is False here despite being True for both cases in the NCK test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the two test runs have different pe counts as this one does then the run with the larger value should write the case.test file and the other one should not overwrite it. That way the batch system will allocate enough resources for the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, great, thanks @jedwards4b . @mfdeakin-sandia can you please add Jim's comments as comments in the code?

Build two exectuables for this test:
The first is a default build
The second runs two instances for each component with the same total number of tasks,
and runs each of them concurrently
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that (as you said) you haven't been able to fully test this, can you please add a comment here - and ideally also in the config_tests.xml file - noting that this is untested and may not be working quite right?

self._case.set_value("ROOTPE_{}".format(comp), 0)
ntasks = self._case.get_value("NTASKS_{}".format(comp))
if ntasks > 1:
self._case.set_value("NTASKS_{}".format(comp), ntasks // 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's wrong to both halve the number of tasks here and double the number of tasks in case_one. I think you want to do one or the other. My preference would be for following what's done in the NCK test: halving ntasks in both initially, then doubling them in the case that runs 2 instances - so that the case that runs 2 instances ends up with the original task count (though slightly different, by design, if the original task count was odd).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misunderstood the old code - I was treating it like it had two separate env_run.xml files with the same initial paramters, as is the case with SystemTestsCompareTwo. But this makes much more sense and explains my confusion.

@rljacob
Copy link
Member

rljacob commented Jul 17, 2017

This comment deserves more discussion: " ACME currently fails the diff test; but at least it builds unlike on master."

Does CESM run the NCR test in its suites? Because it sounds like it never worked prior to this PR.

@mfdeakin-sandia
Copy link
Contributor Author

mfdeakin-sandia commented Jul 17, 2017

@rljacob The error I was getting was that acme.exe didn't exist in the build directory after SHAREDLIBBUILD, which seems like it would fail for CESM as well.
But given I'm having issues with other tests (though these actually pass this step), it might?

@billsacks
Copy link
Member

@rljacob and @mfdeakin-sandia as far as I can tell, the NCR test is currently unused.

@rljacob
Copy link
Member

rljacob commented Jul 17, 2017

How does CESM test multi-instance? NCK?

@billsacks
Copy link
Member

How does CESM test multi-instance? NCK?

Yes. NCK tested a different multi-instance configuration, which would probably be useful to test, but I guess we hadn't been testing it for a while.

@mfdeakin-sandia mfdeakin-sandia force-pushed the mfdeakin-sandia/systemtestscomparetwo/ncr branch from ad810a2 to 8b6e3d5 Compare July 18, 2017 16:57
@mfdeakin-sandia
Copy link
Contributor Author

I think I've fixed everything, is there anything else?

Copy link
Member

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these fixes. One remaining inconsistency, noted inline.

However, I'd prefer if you reversed case one and case two here, following my earlier comment:

If you want to be consistent with NCK, then case 1 should be single-inst, case 2 multi-inst.

-- it seems confusing to have NCR work in the reverse way from NCK in this respect. But I won't hold up this PR on that account.

SystemTestsCompareTwo.__init__(self, case,
separate_builds = True,
run_two_suffix = "singleinst",
run_one_description = "default build",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_one_description and run_two_description are reversed, I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have them reversed because I'm uncertain it'll work otherwise; but I'll try it and see if it works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope; it crashes if I have case_two setup this way. I'll just fix the decriptions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this doesn't work: In general, tests implemented on top of system_test_compare_two shouldn't care which is case one and which is case two. The one thing I can think of is that maybe this is tied in with the need to set test_mode to False in case_one - e.g., maybe case one needs to be the one with more tasks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I'm fine with your just fixing the description here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; that's exactly the reason. I've added a comment explaining this

Add comments warning that this test is currently unused and may not work
Add clarifying comments
@mfdeakin-sandia mfdeakin-sandia force-pushed the mfdeakin-sandia/systemtestscomparetwo/ncr branch from 8b6e3d5 to 8fbbb03 Compare July 19, 2017 18:27
@mfdeakin-sandia mfdeakin-sandia merged commit e1db2f1 into master Jul 19, 2017
@mfdeakin-sandia mfdeakin-sandia deleted the mfdeakin-sandia/systemtestscomparetwo/ncr branch July 19, 2017 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants