Provide default for the number of create_test parallel jobs #3115
E3SM's `e3sm_developer` test suite will launch a large number of parallel builds on the login node unless `create_test` is explicitly passed the number of parallel jobs (`-j`/`--parallel-jobs`) it should use (see E3SM-Project/E3SM#2923). This is because the current default is set by the `MAX_MPITASKS_PER_NODE` machine/env config variable, which for Cori-knl is 64.
This commit:
* sets the default number of parallel jobs to 3
* adds a possible machine config (xml or env) variable, `NTEST_PARALLEL_JOBS`, which can be set to override the default number on a per-machine basis
The parallel jobs setting priority is now (highest to lowest):
1. `-j`/`--parallel-jobs` command line argument
2. `NTEST_PARALLEL_JOBS` `config_machines.xml` or environment variable
3. the default value
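The priority order described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `resolve_parallel_jobs`, its parameters, and the plain-dict machine config are hypothetical names, and checking the machine config before the environment is an assumption, since the description does not say which of the two wins.

```python
import os

# Default proposed in this description (later changed in the discussion).
DEFAULT_PARALLEL_JOBS = 3


def resolve_parallel_jobs(cli_value=None, machine_config=None, environ=None):
    """Hypothetical helper: pick the number of parallel jobs for create_test."""
    machine_config = machine_config or {}
    environ = os.environ if environ is None else environ

    # 1. An explicit -j/--parallel-jobs argument always wins.
    if cli_value is not None:
        return int(cli_value)

    # 2. NTEST_PARALLEL_JOBS from the machine config or the environment.
    #    (Assumed order: machine config first, then environment.)
    for source in (machine_config, environ):
        value = source.get("NTEST_PARALLEL_JOBS")
        if value is not None:
            return int(value)

    # 3. Fall back to the default value.
    return DEFAULT_PARALLEL_JOBS
```

With this sketch, a caller that passes `cli_value=8` gets 8 regardless of any `NTEST_PARALLEL_JOBS` setting, and a machine with no setting at all falls through to the default.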
This will likely affect CESM testing, so I'd like some feedback on the chosen default parallel jobs value of |
CESM uses a different system for their testing so this may have no effect. |
@rljacob Although we use a different test system, this code is common to both e3sm and cesm.
Looks good, but we will need well-thought-out entries for this value on important testing machines, and I suspect CESM will want some as well. It would be a bit of a disaster to limit melvin to 3 concurrent tests, for example.
@jgfouca I agree, but I don't have a good feel for the values on the testing machines. Do you have a recommendation? One option is I could just set them to the same value as |
@jhkennedy I think setting them to MAX_MPITASKS_PER_NODE is a good default - that would be the same as before the change, but now we allow for customization. |
I've bumped the default value |
@jedwards4b and @jgfouca both approved this, but it looks to me like it doesn't yet do the suggestion of making NTEST_PARALLEL_JOBS default to MAX_MPITASKS_PER_NODE; should it?
@billsacks -- For E3SM, most of our Jenkins test scripts were limiting the For CESM testing, I assume the default of 4 is fine as @jedwards4b approved the PR before that suggestion was made (so was E3SM specific). I can go through |
@jedwards4b what do you think should be done here? |
It really seems to me that the safest default to set is the one that was there prior to the PR. |
@billsacks and @jedwards4b , I'm not sure I follow the logic of using So, the logic of this setup (A) is:
To me, your suggestion (B) sounds like:
An alternative (C) would be:
From an E3SM perspective, (A) is definitely the best option, and if we're going to have to set that value for all our machines I'd rather it be required (C > B). So some questions:
|
To be honest, I've always been confused about this, so will defer to @jedwards4b |
Prior to this PR, we used MAX_MPITASKS_PER_NODE as the default -j for create_test:
My selection of this value reflects my desktop-centric perspective (I am almost always on melvin). In most HPC contexts, using MAX_MPITASKS results in wildly oversubscribing the login node, which is why we have had to set -j to a much lower number in all our Jenkins testing. I think maybe it would make sense to check if the machine is a batch machine before trying to use MAX_MPITASKS as a default. |
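The batch-machine check suggested above might look something like this. It is a sketch of the idea only: `default_parallel_jobs`, `has_batch_system`, and `batch_limit` are hypothetical names, and the cap of 4 is borrowed from the value mentioned elsewhere in this thread rather than from any existing code.

```python
def default_parallel_jobs(max_mpitasks_per_node, has_batch_system, batch_limit=4):
    """Hypothetical default for create_test's -j when nothing else is set."""
    if has_batch_system:
        # Builds run on a shared login node, so cap the job count instead of
        # using the compute-node task count (batch_limit is an assumed value).
        return min(batch_limit, max_mpitasks_per_node)
    # Desktop-style machine without a batch system: keep the previous
    # behavior of using MAX_MPITASKS_PER_NODE.
    return max_mpitasks_per_node
```

Under these assumptions a batch machine with `MAX_MPITASKS_PER_NODE` of 64 would default to 4 jobs, while a machine without a batch system would keep its full value.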
It's hard to say which case is more common in the union of all our testing platforms. cori-knl is particularly bad because of its high max AND the large number of users on a login node. JimE is saying only change the machines that we know need to be changed. |
@rljacob I get that. Maybe I can make this discussion simpler. There are 2 options here:
(1) Is the less "(user)safe/user friendly" but more "(dev)safe" option -- as in E3SM-Project/E3SM#2923, we'll learn we need to reduce the default value on a machine by a user hammering a login node, getting a nasty email, and opening an issue. And depending on how CESM is using the test scheduler, it may leave them open to issues similar to E3SM-Project/E3SM#2923. But we're only changing behavior where we directly intend to. (And because this will possibly be a thing every time a machine is added, I'd argue
(2) Overall is more user friendly because it prevents users from hammering login nodes unless they directly intended to (or the machine was configured to), but we may or may not be changing behavior on a swath of machines -- this depends on how the test scheduler is being used by CESM.
My PR is based on the premise of (2.b), but that might not be the case, hence my questions:
|
Again - the best option is the one that will have no effect on current CESM usage. 1. |
Wilco -- just wanted to make sure that everyone understands the full implications of this decision. |
…machines
This resets the default value of NTEST_PARALLEL_JOBS to MAX_MPITASKS_PER_NODE so as not to make any behavioral changes to CESM.
Warning: This is not a safe value on machines with batch systems whose login nodes are more limited than the compute nodes; NTEST_PARALLEL_JOBS should therefore be set on these systems.
@jgfouca found via E3SM testing that limiting to 4 parallel jobs was required on many of the testing machines with batch systems to prevent hammering login nodes. Therefore, we set that value for these E3SM machines:
* cori-haswell
* cori-knl
* blues
* anvil
* bebop
* theta
* titan
* summit
Warning: Non-test E3SM machines that have a batch system may still oversubscribe parallel test jobs.
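As a rough picture of the behavior this commit message describes: the listed machines get 4, and every other machine falls back to its MAX_MPITASKS_PER_NODE. The dict and the `ntest_parallel_jobs` function below are hypothetical stand-ins for the per-machine config entries, not the actual implementation.

```python
# Machines named in the commit message as limited to 4 parallel jobs.
E3SM_LIMITED_MACHINES = (
    "cori-haswell", "cori-knl", "blues", "anvil",
    "bebop", "theta", "titan", "summit",
)

# Hypothetical stand-in for per-machine NTEST_PARALLEL_JOBS config entries.
NTEST_PARALLEL_JOBS_OVERRIDES = {name: 4 for name in E3SM_LIMITED_MACHINES}


def ntest_parallel_jobs(machine, max_mpitasks_per_node):
    """Return the machine's override, else default to MAX_MPITASKS_PER_NODE."""
    return NTEST_PARALLEL_JOBS_OVERRIDES.get(machine, max_mpitasks_per_node)
```

For cori-knl, whose MAX_MPITASKS_PER_NODE is 64, this gives 4; a machine that is not in the table keeps whatever its MAX_MPITASKS_PER_NODE is, which matches the pre-PR behavior.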
5e6de80 to cae6b73
Alright, I've set the default to be @jgfouca , based on E3SM testing, I've set
These E3SM machines may still be susceptible to E3SM-Project/E3SM#2923 -- do you think we should limit any of them as well?
Also, I didn't limit sandiatoss3 because of testing on/with skybridge, but for testing you limit it on/with chama. @jedwards4b and @billsacks you'll have to say whether or not any CESM machines should be limited since I don't know your testing setup. |
Thanks, we will tune these settings for CESM on a case by case basis when/if we encounter problems.
Thanks a lot!
Default MPI module has been updated on Summit. Old module is still available but hidden in listing. A SIGSEGV error was encountered using older module with F-case. This module update fixes that issue. Fixes #3114 [BFB]
E3SM's `e3sm_developer` test suite will launch a large number of parallel builds on the login node unless `create_test` is explicitly passed the number of parallel jobs (`-j`/`--parallel-jobs`) it should use. This is because the current default is set by the `MAX_MPITASKS_PER_NODE` machine/env config variable, which for Cori-knl is 64.
This commit:
* `NTEST_PARALLEL_JOBS`, which can be set to override the default number on a per-machine basis
The parallel jobs setting priority is now (highest to lowest):
1. `-j`/`--parallel-jobs` command line argument
2. `NTEST_PARALLEL_JOBS` `config_machines.xml` or environment variable
Test suite: scripts_regression_tests.py on Cori-knl
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes E3SM-Project/E3SM#2923
User interface changes?: N
Update gh-pages html (Y/N)?: N