
Provide default for the number of create_test parallel jobs #3115

Merged
merged 2 commits into master from jhkennedy/test-concurrent-tasks on May 22, 2019

Conversation

jhkennedy
Contributor

@jhkennedy jhkennedy commented May 20, 2019

E3SM's e3sm_developer test suite will launch a large number of parallel builds on the login node unless create_test is explicitly passed the number of parallel jobs (-j/--parallel-jobs) it should use. This is because the current default is set by the MAX_MPITASKS_PER_NODE machine/env config variable, which for Cori-knl is 64.

This commit:

  • sets the default number of parallel jobs to 3
  • adds a machine config (XML or env) variable, NTEST_PARALLEL_JOBS,
    which can be set to override the default on a per-machine basis

The parallel jobs setting priority is now, from highest to lowest (a sketch of this resolution order follows the list):

  1. -j/--parallel-jobs command line argument
  2. NTEST_PARALLEL_JOBS config_machines.xml or environment variable
  3. the default value
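
A minimal sketch of that resolution order (assuming a machine object with the get_value accessor used in test_scheduler.py; the function name and structure here are illustrative, not the actual implementation):

    import os

    def resolve_parallel_jobs(cli_jobs, machobj, num_tests, default_jobs=3):
        # 1. the -j/--parallel-jobs command line argument always wins
        if cli_jobs is not None:
            return cli_jobs
        # 2. a per-machine NTEST_PARALLEL_JOBS override, from the environment
        #    or from config_machines.xml via the machine object
        override = (os.environ.get("NTEST_PARALLEL_JOBS")
                    or machobj.get_value("NTEST_PARALLEL_JOBS"))
        if override is not None:
            return int(override)
        # 3. the global default, never launching more jobs than there are tests
        return min(num_tests, default_jobs)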

Test suite: scripts_regression_tests.py on Cori-knl
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes E3SM-Project/E3SM#2923
User interface changes?: N
Update gh-pages html (Y/N)?: N

@jhkennedy
Contributor Author

This will likely affect CESM testing, so I'd like some feedback on the chosen default parallel-jobs value of 3 (suggested by @ndkeen).

@rljacob
Member

rljacob commented May 20, 2019

CESM uses a different system for their testing so this may have no effect.

Contributor

@jedwards4b jedwards4b left a comment

@rljacob Although we use a different test system, this code is common to both e3sm and cesm.

Contributor

@jgfouca jgfouca left a comment

Looks good, but we will need well-thought-out entries for this value on important testing machines, and I suspect CESM will want some as well. It would be a bit of a disaster to limit melvin to 3 concurrent tests, for example.

@jhkennedy
Contributor Author

jhkennedy commented May 21, 2019

@jgfouca I agree, but I don't have a good feel for the values on the testing machines. Do you have a recommendation? One option would be to just set them to the same value as MAX_MPITASKS_PER_NODE in config_machines.xml.

@jedwards4b
Contributor

@jhkennedy I think setting them to MAX_MPITASKS_PER_NODE is a good default - that would be the same as before the change, but now we allow for customization.

@jhkennedy jhkennedy requested a review from jgfouca May 21, 2019 17:26
@jhkennedy
Contributor Author

I've bumped the default value of NTEST_PARALLEL_JOBS to 4 based on E3SM testing experience and set the value for melvin and sandiatoss3 in e3sm/machines/config_machines.xml to their MAX_MPITASKS_PER_NODE values.
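
For reference, a hypothetical config_machines.xml entry (the element name NTEST_PARALLEL_JOBS comes from this PR, but the value and its placement inside the <machine> block are illustrative):

    <machine MACH="melvin">
      <!-- illustrative: set to this machine's MAX_MPITASKS_PER_NODE value -->
      <NTEST_PARALLEL_JOBS>48</NTEST_PARALLEL_JOBS>
    </machine>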

@jhkennedy jhkennedy removed the request for review from ndkeen May 21, 2019 17:33
Member

@billsacks billsacks left a comment

@jedwards4b and @jgfouca both approved this, but it looks to me like it doesn't yet do the suggestion of making NTEST_PARALLEL_JOBS default to MAX_MPITASKS_PER_NODE; should it?

@jhkennedy
Contributor Author

@billsacks -- For E3SM, most of our Jenkins test scripts were limiting NTEST_PARALLEL_JOBS to 4 via the -j/--parallel-jobs command line option, so the default of 4 works well for us. The exceptions are melvin and sandiatoss3, which were using the MAX_MPITASKS_PER_NODE value; that is now reflected in the e3sm/machines/config_machines.xml file.

For CESM testing, I assume the default of 4 is fine, as @jedwards4b approved the PR before that suggestion was made (so the suggestion was E3SM specific). I can go through cesm/machines/config_machines.xml and set NTEST_PARALLEL_JOBS=${MAX_MPITASKS_PER_NODE} if that's desired, though.

@billsacks
Member

For CESM testing, I assume the default of 4 is fine as @jedwards4b approved the pr before that suggestion was made (so was E3SM specific). I can go through cesm/machines/config_machines.xml and set NTEST_PARALLEL_JOBS=${MAX_MPITASKS_PER_NODE} if that's desired though.

@jedwards4b what do you think should be done here?

@jedwards4b
Contributor

It really seems to me that the safest default to set is the one that was there prior to the PR.
And it can be done in test_scheduler.py. For systems where 4 makes the most sense, use config_machines.xml.

@jhkennedy
Contributor Author

@billsacks and @jedwards4b , I'm not sure I follow the logic of using MAX_MPITASKS_PER_NODE (the number of PEs on the compute nodes) as the default value for the number of parallel test jobs to start. At least for E3SM, users and our Jenkins tests are running ./create_test on login nodes, and the number of parallel tests (and builds especially!) that are launched should be significantly less than MAX_MPITASKS_PER_NODE (see E3SM-Project/E3SM#2923).

So, the logic of this setup (A) is:

  1. Provide a safe default for any node
  2. Expand the default number for machines where it makes sense to do so (2 for E3SM)
  3. Allow users ultimate control via the command line option

To me, your suggestion (B) sounds like:

  1. Set the default to the max on the compute nodes
  2. Limit it in config_machines.xml for the machines users are using (all but 2 for E3SM),
    or require users to (know how and when to) control it via the command line (not realistically viable)
  3. Allow users ultimate control via the command line option

An alternative (C) would be:

  1. Require an NTEST_PARALLEL_JOBS entry in config_machines.xml and use it as the default.
  2. Allow users ultimate control via the command line option

From an E3SM perspective, (A) is definitely the best option, and if we're going to have to set that value for all our machines I'd rather it be required (C > B).

So some questions:

  1. How is this value being used in CESM?
  2. Is create_test being run on login nodes by your users or workflow tools?
  3. Do you want to default to using the max the machine can handle?

@billsacks
Member

To be honest, I've always been confused about this, so will defer to @jedwards4b

@jgfouca
Contributor

jgfouca commented May 21, 2019

@jhkennedy

Prior to this PR, we used MAX_MPITASKS_PER_NODE as the default -j for create_test:

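            # pre-PR default: one job per test, capped at MAX_MPITASKS_PER_NODE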
            self._parallel_jobs = min(len(test_names),
                                      self._machobj.get_value("MAX_MPITASKS_PER_NODE"))

My selection of this value reflects my desktop-centric perspective (I am almost always on melvin). In most HPC contexts, using MAX_MPITASKS results in wildly oversubscribing the login node, which is why we have had to set -j to a much lower number in all our Jenkins testing.

I think maybe it would make sense to check if the machine is a batch machine before trying to use MAX_MPITASKS as a default.
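
A sketch of that batch-machine check (assuming the machine object exposes the BATCH_SYSTEM setting from config_machines.xml; treat the exact query and the fallback value of 4 as assumptions, not the final implementation):

    batch_system = self._machobj.get_value("BATCH_SYSTEM")
    if batch_system is None or batch_system == "none":
        # workstation-style machine: oversubscribing only affects this user
        default_jobs = self._machobj.get_value("MAX_MPITASKS_PER_NODE")
    else:
        # batch machine: create_test typically runs on a shared login node
        default_jobs = 4
    self._parallel_jobs = min(len(test_names), default_jobs)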

@rljacob
Member

rljacob commented May 21, 2019

It's hard to say which case is more common in the union of all our testing platforms. cori-knl is particularly bad because of its high max AND the large number of users on a login node. JimE is saying to only change the machines that we know need to be changed.

@jhkennedy
Contributor Author

@rljacob I get that. Maybe I can make this discussion simpler. There are 2 options here:

  1. Leave the default value MAX_MPITASKS_PER_NODE and use NTEST_PARALLEL_JOBS if given for the machine. That means for:

    • CESM: Works the same as before
    • E3SM: We need to (or at least should) set the NTEST_PARALLEL_JOBS value for all our machines with batch systems.
  2. (This PR) Set a small default that is login-node safe for all machines. That means for:

    • E3SM: Set NTEST_PARALLEL_JOBS for 2 of our testing machines
    • CESM:
      a. If CESM is subscribing jobs up to the MAX_MPITASKS_PER_NODE level, they will need to set the NTEST_PARALLEL_JOBS value for all their machines or incur a performance hit.
      b. If CESM is not subscribing jobs up to the MAX_MPITASKS_PER_NODE level, they may not need to set NTEST_PARALLEL_JOBS values on any, or very many, of their machines (like E3SM).

(1) is the less "(user-)safe/user-friendly" but more "(dev-)safe" option -- as in E3SM-Project/E3SM#2923, we'll only learn we need to reduce the default value on a machine when a user hammers a login node, gets a nasty email, and opens an issue. And depending on how CESM is using the test scheduler, it may leave them open to issues similar to E3SM-Project/E3SM#2923. But we're only changing behavior where we directly intend to. (And because this will possibly be a thing every time a machine is added, I'd argue NTEST_PARALLEL_JOBS should be a required config value.)

(2) is overall more user-friendly because it prevents users from hammering login nodes unless they directly intend to (or the machine is configured to), but we may or may not be changing behavior on a swath of machines -- this depends on how the test scheduler is being used by CESM.

My PR is based on the premise of (2.b), but that might not be the case, hence my questions:

  • How is this value being used in CESM? That is, are large sets of tests being queued by the test scheduler (i.e., a suite of tests being passed to ./create_test)?
  • If so, is create_test being run on login nodes by your users or workflow tools?

@jedwards4b
Contributor

Again - the best option is the one that will have no effect on current CESM usage: 1.

@jhkennedy
Contributor Author

Wilco -- just wanted to make sure that everyone understands the full implications of this decision.

…machines

This resets the default value of NTEST_PARALLEL_JOBS to MAX_MPITASKS_PER_NODE
so as to not make any behavioral changes to CESM.

Warning: This is not a safe value on machines with batch systems whose
login nodes are more limited than the compute nodes; therefore,
NTEST_PARALLEL_JOBS should be set on these systems.

@jgfouca found via E3SM testing that limiting to 4 parallel jobs was
required for many of the testing machines with batch systems to prevent
hammering login nodes. Therefore, we set that value for these E3SM
machines:
* cori-haswell
* cori-knl
* blues
* anvil
* bebop
* theta
* titan
* summit

Warning: Non-test machines for E3SM that have a batch system may still
oversubscribe parallel test jobs.
@jhkennedy jhkennedy force-pushed the jhkennedy/test-concurrent-tasks branch from 5e6de80 to cae6b73 on May 22, 2019 13:45
@jhkennedy
Contributor Author

jhkennedy commented May 22, 2019

Alright, I've set the default to be MAX_MPITASKS_PER_NODE.
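
(Users can still cap concurrency explicitly from the command line regardless of the machine default, e.g.:

    ./create_test e3sm_developer -j 4
)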

@jgfouca , based on E3SM testing, I've set NTEST_PARALLEL_JOBS=4 on these E3SM machines:

  • cori-haswell
  • cori-knl
  • blues
  • anvil
  • bebop
  • theta
  • titan
  • summit

These E3SM machines may still be susceptible to E3SM-Project/E3SM#2923 -- do you think we should limit any of them as well?

  • edison
  • stampede2
  • snl-white
  • snl-blake
  • ghost
  • cetus
  • syrah
  • quartz
  • mira
  • jlse
  • sooty
  • cascade
  • constance
  • compy
  • oic5
  • cades
  • eos
  • grizzly
  • badger
  • mesabi
  • itasca
  • lawrencium-lr2
  • lawrencium-lr3
  • lawrencium-lr6
  • summitdev

Also, I didn't limit sandiatoss3 because of testing on/with skybridge, though for your testing you limit it on/with chama.


@jedwards4b and @billsacks you'll have to say whether or not any CESM machines should be limited since I don't know your testing setup.

Contributor

@jedwards4b jedwards4b left a comment

Thanks, we will tune these settings for CESM on a case-by-case basis when/if we encounter problems.

Member

@billsacks billsacks left a comment

Thanks a lot!

@jedwards4b jedwards4b merged commit a224635 into master May 22, 2019
@jhkennedy jhkennedy deleted the jhkennedy/test-concurrent-tasks branch May 22, 2019 19:24