Fix COST_PES and TOTALPES for titan (aprun) #1990
Conversation
COST_PES should not take spare nodes into consideration. Most fields that get set in case.initialize_derived_attributes are impacted by aprun. Fix string search/replace that capitalized regular stack variables by mistake.
Let me try to add @worleyph as a reviewer.
It might not let me do it until he accepts the invitation to CIME. In any event, he should have gotten emails regarding this PR.
 executable = env_mach_spec.get_mpirun(self, mpi_attribs, job="case.run", exe_only=True)[0]
 if executable is not None and "aprun" in executable:
-    self.num_nodes = get_aprun_cmd_for_case(self, "acme.exe")[1]
+    _, self.num_nodes, self.total_tasks, self.tasks_per_node, self.thread_count = get_aprun_cmd_for_case(self, "acme.exe")
Important change.
scripts/lib/CIME/case_setup.py (outdated diff)

 case.initialize_derived_attributes()

 cost_per_node = 16 if case.get_value("MACH") == "yellowstone" else case.get_value("MAX_MPITASKS_PER_NODE")
 case.set_value("COST_PES", (case.num_nodes - case.spare_nodes) * cost_per_node)
Important change.
Perhaps we should not attempt to automatically set COST_PES - since it often ends up incorrect. Instead maybe we should consider adding a COST_PER_NODE variable to the machine definition in config_machines.xml?
Hi @jgfouca , I'm not competent to review this. I don't know what these are used for, nor can I read the Python with any degree of confidence. Even though these are currently often wrong on Titan, they do not seem to prevent correct execution. My ACME GitHub issue was just to point out that they are either incorrect or the descriptions are incorrect (or both), in case anyone cares. Based on the description for COST_PES ("pes or cores used relative to PES_PER_NODE for accounting"), perhaps the spare nodes should be accounted for, since they will be charged for? Again, I do not know how these are used.
@jedwards4b @worleyph @rljacob @mvertens : the presence of aprun forces us to think about concepts like num_threads and thread_count on a per-component basis rather than as global values. This is fundamentally incompatible with CIME's current implementation. This is why, without a major refactor, things like COST_PES and other values in getTiming etc. will always be wrong on Titan for some pe-layouts. We've currently gotten things to kind of work on Titan by hacking our way around this problem with special code that generates aprun commands for Titan. I think this PR puts us in slightly better shape, since it at least ensures TOTALPES is correct on Titan.
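To make the per-component point concrete, here is a minimal standalone sketch (not CIME code; the component layouts, the 16-core node size, and the ceil-division node packing are all illustrative assumptions) of why a single global tasks_per_node or thread_count cannot describe an aprun launch:

    # An aprun command such as
    #   aprun -n 128 -N 16 -d 1 acme.exe : -n 32 -N 4 -d 4 acme.exe
    # gives each component its own (mpi_tasks, threads_per_task) layout.
    CORES_PER_NODE = 16  # assumed node size, e.g. Titan's 16 CPU cores

    components = [(128, 1), (32, 4)]  # hypothetical (ntasks, nthreads) pairs

    total_tasks = sum(ntasks for ntasks, _ in components)       # 160 -> TOTALPES
    num_nodes = sum(-(-ntasks * nthreads // CORES_PER_NODE)     # ceil division
                    for ntasks, nthreads in components)         # 8 + 8 = 16

    print(total_tasks, num_nodes)
    # No single tasks_per_node (16 vs 4) or thread_count (1 vs 4) fits both
    # components, so any global value CIME stores is wrong for one of them.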
@jgfouca , please put this into an ACME branch when you get the chance and I will try it on Titan. The aprun command is correct currently on Titan, and I want to exercise your update to ensure that this continues to hold. Thanks. (I haven't checked yet, but Theta at ALCF also uses aprun. Anyone know whether it also supports different process count/thread count per process for each node? If so, this will be important on more than just Titan.)
@worleyph I'm pretty confident in the aprun command since it is well protected by unit tests. This change will be brought in during the next ESMCI merge to ACME and we can test it then.
Okay. Thanks.
Maybe COST_PES is OK since it uses num_nodes (which is correct for aprun too) and just multiplies it by MAX_MPITASKS_PER_NODE. Maybe having case.thread_count and case.tasks_per_node be wrong for aprun is irrelevant as long as the aprun command is being generated correctly?
FYI: Anyone with a valid GitHub account can comment on issues on an open repo like CIME. Accepting an invitation is only necessary for write access to the repo itself.
@jgfouca , COST_PES is not correct in my one example described in ACME GitHub issue #1426 (unless you are saying the fix in this PR is correct). Again, from what I can tell TOTAL_PES is an accurate count of the number of MPI tasks, though that is not what the description implies it should be, and I am not sure that this is the information needed in get_timing.py either.
Most of the PR is about changing the case of max_mpitasks_per_node to follow variable naming conventions. I have no problem with that. The current state of COST_PES is less than perfect, but I don't see an improvement in this change. If we are reserving spare nodes on titan and not using them, they should be included in the cost. On cori-knl we have 68 tasks per node but only use 64. I think we might consider an additional variable in the machine definition, COST_PER_NODE: this fixed value, multiplied by the total number of nodes consumed, should give an accurate result on all systems.
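To spell out the cori-knl mismatch with a small worked example (the node count and the comparison are illustrative assumptions; only the 68-vs-64 figures come from the comment above):

    num_nodes = 10               # assumed job size
    max_mpitasks_per_node = 68   # what the current heuristic multiplies by
    tasks_used_per_node = 64     # what such a layout actually uses

    print(num_nodes * max_mpitasks_per_node)  # 680, the heuristic's COST_PES
    print(num_nodes * tasks_used_per_node)    # 640, the tasks actually run
    # Neither number is a fixed per-node charge; a COST_PER_NODE defined per
    # machine would sidestep the ambiguity.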
@jedwards4b , I'm fine with continuing to include spare nodes in the COST_PES as long as @worleyph is OK with that. I would just need some guidance on how to change getTiming to report something useful.
What about adding COST_PER_NODE to the config_machines.xml file?
@jedwards4b , I'm struggling with this PR since I have never used COST_PES for anything. I would be fine with anything both you and @worleyph would be happy with.
@jgfouca , as I indicated in my first comment, I'm not competent to review the code. I was just pointing out that the values and/or descriptions of COST_PES and TOTAL_PES are often incorrect on Titan. I can verify the fix once it is put into a version of ACME that I can try out, and once it is clarified what these values are supposed to be. I agree with @jedwards4b that, based on its name, it is reasonable for COST_PES to include the spare nodes. It depends on whether this is an evaluation of the model run cost as a performance metric (in which case it would not include the spare nodes?) or the true charged cost for the run (in which case it should include them?).
@worleyph , that's fine, don't worry about doing a full review. I believe this PR fixes TOTALPES, since it's now computed in the aprun module for titan (assuming TOTALPES = total number of MPI tasks). If you are OK with COST_PES including spare nodes, then, basically, COST_PES = NUM_NODES * MAX_MPITASKS_PER_NODE. If that's OK then I think we're good to go. I'll just push a small update to add spare nodes.
I don't use COST_PES and TOTAL_PES for anything either. That said, I don't know if they are used by anything that I do use.
@jgfouca COST_PES = NUM_NODES * MAX_MPITASKS_PER_NODE is not always correct - yellowstone and cori-knl do not generally follow this rule. This is why I want to introduce COST_PER_NODE.
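A minimal sketch of what that could look like. COST_PER_NODE does not exist yet, so the lookup, the fallback to today's heuristic, and the inclusion of spare nodes are all assumptions drawn from this thread, not an agreed design:

    def compute_cost_pes(case):
        # Proposed: a fixed per-machine charge factor from config_machines.xml.
        cost_per_node = case.get_value("COST_PER_NODE")
        if cost_per_node is None:
            # Fall back to the current heuristic for machines that do not define it.
            cost_per_node = (16 if case.get_value("MACH") == "yellowstone"
                             else case.get_value("MAX_MPITASKS_PER_NODE"))
        # Spare nodes are charged by the scheduler, so they count toward cost.
        return case.num_nodes * cost_per_node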
@jedwards4b , OK, sounds like all parties would be OK with that. Can you give me some guidance on how to set COST_PER_NODE?
@jgfouca , if TOTALPES is now the total number of MPI tasks then the description needs to be modified. It is currently

    total number of tasks and threads (setup automatically - DO NOT EDIT)

which would imply a different value.
@worleyph , OK I will fix.
Looking over the code, TOTALPES was clearly not intended to convey any information about threads.
TOTALPES should include the threads.
I'm going to hold off further communication about this PR until the CIME call this afternoon. @worleyph , if you want to join it's at 1:00PM MDT, https://global.gotomeeting.com/join/243702717
@@ -2047,7 +2047,7 @@
   <default_value>0</default_value>
   <group>mach_pes_last</group>
   <file>env_mach_pes.xml</file>
-  <desc>total number of tasks and threads (setup automatically - DO NOT EDIT)</desc>
+  <desc>total number of MPI tasks (setup automatically - DO NOT EDIT)</desc>
Now I'm confused - this should be the total number of physical cores used; that would include threads unless the threads are using SMT.
That is what I would have thought TOTALPES meant, but it has been returning the number of MPI tasks recently.
@jedwards4b , TOTALPES seems to be widely used throughout CIME as the -np argument to mpiexec. That would mean it should not include threads, right?
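As an illustration of that constraint (the launch line below is a generic example, not CIME's actual run template, and the numbers are assumed): if TOTALPES carried tasks times threads, -np would start too many ranks, since threading is conveyed separately.

    total_pes = 128   # MPI ranks only
    thread_count = 2  # OpenMP threads per rank, passed via the environment

    launch = "OMP_NUM_THREADS={} mpiexec -np {} ./acme.exe".format(
        thread_count, total_pes)
    print(launch)  # OMP_NUM_THREADS=2 mpiexec -np 128 ./acme.exe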
I've always found the various PES, TASKS, and other variables like that in CIME frustratingly difficult to keep straight. What does CIME need to know about a node to provide its services? The number of physical cores; how many MPI tasks you can put on those cores (sometimes not 1-1); how many nodes/cores you need to ask for in a submission; how many nodes/cores you are going to run on; how many nodes/cores/tasks/threads are going to be given to each major component. There's probably more.
Discussed in meeting: TOTALPES is the number of MPI tasks and does not include thread info. We will make a follow-on PR to do something about COST_PES, COST_PER_NODE.
Then should we change TOTALPES to TOTALTASKS? A Processor Element could be assigned a thread or a task.
COST_PES should not take spare nodes into consideration.
Most fields that get set in case.initialize_derived_attributes are impacted
by aprun.
Fix string search/replace that capitalized regular stack variables by mistake.
Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes [CIME Github issue #]
User interface changes?: N
Update gh-pages html (Y/N)?: N
Code review: @jedwards4b @worleyph