
Fix COST_PES and TOTALPES for titan (aprun) #1990

Merged: 5 commits merged into master on Oct 25, 2017

Conversation

@jgfouca (Contributor) commented Oct 24, 2017

COST_PES should not take spare nodes into consideration.

Most fields that get set in case.initialize_derived_attributes are impacted
by aprun.

Fix string search/replace that capitalized regular stack variables by mistake.

Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes [CIME Github issue #]

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review: @jedwards4b @worleyph

@jgfouca (Contributor, author) commented Oct 24, 2017

Let me try to add @worleyph as a reviewer.

@jgfouca (Contributor, author) commented Oct 24, 2017

Might not let me do it until he accepts invitation to CIME. In any event, he should have gotten emails regarding this PR.

    executable = env_mach_spec.get_mpirun(self, mpi_attribs, job="case.run", exe_only=True)[0]
    if executable is not None and "aprun" in executable:
-       self.num_nodes = get_aprun_cmd_for_case(self, "acme.exe")[1]
+       _, self.num_nodes, self.total_tasks, self.tasks_per_node, self.thread_count = get_aprun_cmd_for_case(self, "acme.exe")
@jgfouca (Contributor, author) commented on the diff above:

Important change.
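
For readers following along, here is a minimal, runnable sketch of the idea in the hunk above. The stub get_aprun_cmd_for_case_stub, the CaseSketch class, and the numbers returned are hypothetical stand-ins for CIME's env_mach_spec.get_mpirun / get_aprun_cmd_for_case; only the attribute-unpacking pattern mirrors the actual change.

```python
# Hypothetical sketch, not CIME code: the stub and its return values are
# made up; only the unpacking of derived attributes mirrors the diff above.

def get_aprun_cmd_for_case_stub(exe_name):
    # Pretend layout: (aprun_cmd, num_nodes, total_tasks, tasks_per_node, thread_count)
    return ("aprun -n 64 -N 16 -d 2 ./{}".format(exe_name), 4, 64, 16, 2)

class CaseSketch(object):
    def initialize_derived_attributes(self, executable):
        if executable is not None and "aprun" in executable:
            # Before this PR only num_nodes came from the aprun helper; now
            # every derived attribute is taken from the generated command.
            (_, self.num_nodes, self.total_tasks,
             self.tasks_per_node, self.thread_count) = get_aprun_cmd_for_case_stub("acme.exe")

case = CaseSketch()
case.initialize_derived_attributes("aprun")
print(case.num_nodes, case.total_tasks, case.tasks_per_node, case.thread_count)  # 4 64 16 2
```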


case.initialize_derived_attributes()

cost_per_node = 16 if case.get_value("MACH") == "yellowstone" else case.get_value("MAX_MPITASKS_PER_NODE")
case.set_value("COST_PES", (case.num_nodes - case.spare_nodes) * cost_per_node)
@jgfouca (Contributor, author) commented on the diff above:

Important change.
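
For concreteness, a small illustration of the arithmetic in the snippet above. The node counts are invented; only the formula (spare nodes subtracted before computing COST_PES) comes from the diff.

```python
# Illustration only: the counts below are made up; the formula is the one
# in the snippet above, with spare nodes excluded from COST_PES.
mach = "titan"
num_nodes = 12                # nodes the job will run on
spare_nodes = 2               # extra nodes requested as insurance, not computed on
max_mpitasks_per_node = 16

cost_per_node = 16 if mach == "yellowstone" else max_mpitasks_per_node
cost_pes = (num_nodes - spare_nodes) * cost_per_node
print(cost_pes)  # 160 rather than 192: the 2 spare nodes are not counted
```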

@jedwards4b (Contributor) commented on the diff above:

Perhaps we should not attempt to automatically set COST_PES - since it often ends up incorrect. Instead maybe we should consider adding a COST_PER_NODE variable to the machine definition in config_machines.xml?

@worleyph (Contributor) commented:

Hi @jgfouca , I'm not competent to review this. I don't know what these are used for, nor can I read the python with any degree of confidence. Even though these are currently often wrong on Titan, they do not seem to prevent correct execution. My ACME github issue was just to point out that they are either incorrect or the descriptions are incorrect (or both), in case anyone cares.

Based on the description for COST_PES ("pes or cores used relative to PES_PER_NODE for accounting"), perhaps the spare nodes should be accounted for, since they will be charged for? Again, I do not know how these are used.

@jgfouca (Contributor, author) commented Oct 24, 2017

@jedwards4b @worleyph @rljacob @mvertens : the presence of aprun forces us to think about concepts like num_threads and thread_count on a per-component basis rather than as global values. This is fundamentally incompatible with CIME's current implementation, which is why, without a major refactor, COST_PES and other values used in getTiming etc. will always be wrong on titan for some pe-layouts. We've currently gotten things to mostly work on titan by hacking around the problem with special code that generates the aprun commands.

I think this PR puts us in slightly better shape since it at least ensures TOTALPES is correct on titan.
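
For illustration, here is a sketch (not CIME's actual get_aprun_cmd_for_case) of why a single global tasks_per_node or thread_count stops being well defined once components get their own layouts: aprun expresses heterogeneous layouts as colon-separated segments, each with its own -n/-N/-d. The component layouts and numbers below are invented.

```python
# Invented example layouts; the point is only that no single global
# tasks_per_node or thread_count describes both segments.
layouts = [
    {"ntasks": 64, "tasks_per_node": 16, "nthreads": 1},   # e.g. one component
    {"ntasks": 16, "tasks_per_node": 4,  "nthreads": 4},   # e.g. another component
]

segments = ["-n {ntasks} -N {tasks_per_node} -d {nthreads} ./acme.exe".format(**lay)
            for lay in layouts]
print("aprun " + " : ".join(segments))
# aprun -n 64 -N 16 -d 1 ./acme.exe : -n 16 -N 4 -d 4 ./acme.exe
```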

@worleyph (Contributor) commented:

@jgfouca , please put this into an ACME branch when you get the chance and I will try it on Titan. The aprun command is correct currently on Titan, and I want to exercise your update to ensure that this continues to hold. Thanks.

(I haven't checked yet, but Theta at ALCF also uses aprun. Anyone know whether it also supports different process count/thread count per process for each node? If so, this will be important on more than just Titan.)

@jgfouca (Contributor, author) commented Oct 24, 2017

@worleyph I'm pretty confident in the aprun command since it is well-protected by unit-tests. This change will be brought in during the next ESMCI merge to ACME and we can test it then.

@worleyph (Contributor) commented:

Okay. Thanks.

@jgfouca (Contributor, author) commented Oct 24, 2017

Maybe COST_PES is OK since it uses num_nodes (which is correct for aprun too) and just multiplies it by MAX_MPITASKS_PER_NODE.

Maybe having case.thread_count and case.tasks_per_node be wrong for aprun is irrelevant as long as the aprun command is being generated correctly?

@rljacob (Member) commented Oct 24, 2017

FYI: Anyone with a valid github account can comment on issues on an open repo like CIME. Accepting an invitation is only necessary for write access to the repo itself.

@worleyph (Contributor) commented:

@jgfouca , COST_PES is not correct in my one example described in ACME github issue #1426 (unless you are saying the fix in this PR is correct).

Again, from what I can tell TOTAL_PES is an accurate count of the number of MPI tasks, though that is not what the description implies it should be, and I am not sure that this is the information needed in get_timing.py either.

@jedwards4b (Contributor) left a review:

Most of the PR is about changing the case of max_mpitasks_per_node to follow variable naming conventions; I have no problem with that. The current state of COST_PES is less than perfect, but I don't see an improvement in this change. If we are reserving spare nodes on titan and not using them, they should be included in the cost. On cori-knl we have 68 tasks per node but only use 64. I think we might consider an additional variable in the machine definition, COST_PER_NODE: this fixed value, multiplied by the total number of nodes consumed, should give an accurate result on all systems.



@jgfouca (Contributor, author) commented Oct 25, 2017

@rljacob , I invited @worleyph to ESMCI so I could request a review from him for this PR.

@jgfouca (Contributor, author) commented Oct 25, 2017

@jedwards4b , I'm fine with continuing to include spare nodes in the COST_PES as long as @worleyph is OK with that. I would just need some guidance on how to change getTiming to report something useful.

@jedwards4b (Contributor) commented:

What about adding COST_PER_NODE to the config_machines.xml file?

@jgfouca (Contributor, author) commented Oct 25, 2017

@jedwards4b , I'm struggling with this PR since I have never used COST_PES for anything. I would be fine with anything both you and @worleyph would be happy with.

@worleyph (Contributor) commented:

@jgfouca , as I indicated in my first comment, I'm not competent to review the code. I was just pointing out that the values and/or descriptions of COST_PES and TOTAL_PES are often incorrect on Titan. I can verify the fix when it is put into a version of ACME that I can try it out with, and once it is clarified what these values are supposed to be.

I agree with @jedwards4b that, based on its name, it is reasonable for COST_PES to include the spare nodes. It depends on whether this is an evaluation of the model run cost as a performance metric (in which case it would not include the spare nodes?) or the true charged cost for the run (in which case it should include the spare nodes?).

@jgfouca (Contributor, author) commented Oct 25, 2017

@worleyph , that's fine, don't worry about doing a full review.

I believe this PR fixes TOTALPES, since it's now computed in the aprun module for titan (assuming TOTALPES = total number of MPI tasks).

If you are OK with COST_PES including spare nodes, then, basically, COST_PES = NUM_NODES * MAX_MPITASKS_PER_NODE. If that's OK then I think we're good to go. I'll just push a small update to add spare nodes.

@worleyph (Contributor) commented:

I don't use COST_PES and TOTAL_PES for anything either. That said, I don't know if they are used by anything that I do use.

@jedwards4b (Contributor) commented:

@jgfouca COST_PES = NUM_NODES * MAX_MPITASKS_PER_NODE is not always correct - yellowstone and cori-knl do not generally follow this rule. This is why I want to introduce COST_PER_NODE.

@jgfouca (Contributor, author) commented Oct 25, 2017

@jedwards4b , OK, sounds like all parties would be OK with that. Can you give me some guidance on how to set COST_PER_NODE?

@worleyph (Contributor) commented Oct 25, 2017

@jgfouca , if TOTALPES is now the total number of MPI tasks then the description needs to be modified. It is currently

 <desc>total number of tasks and threads (setup automatically - DO NOT EDIT)</desc>

which would imply a different value.

@jgfouca (Contributor, author) commented Oct 25, 2017

@worleyph , OK, I will fix.

Commit message added to the PR: "Looking over code, TOTALPES was clearly not intended to convey any information about threads."
@jedwards4b (Contributor) commented:

TOTALPES should include the threads.
COST_PER_NODE is the number of physical cores on a node and should be set in config_machines.xml. COST_PER_NODE for yellowstone is 16 and for cori-knl is 68.
In most other cases it is MAX_MPITASKS_PER_NODE.
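
A minimal sketch of this proposal, assuming a per-machine COST_PER_NODE table (no such variable exists in CIME at the time of this PR); the yellowstone/cori-knl values come from the comment above, and the function name compute_cost_pes is hypothetical.

```python
# Hypothetical: COST_PER_NODE is the proposed per-machine constant
# (physical cores per node); it is not an existing CIME variable here.
COST_PER_NODE = {
    "yellowstone": 16,
    "cori-knl": 68,
}

def compute_cost_pes(machine, num_nodes, max_mpitasks_per_node):
    # Fall back to MAX_MPITASKS_PER_NODE where no explicit value is defined,
    # per "In most other cases it is MAX_MPITASKS_PER_NODE".
    cores_per_node = COST_PER_NODE.get(machine, max_mpitasks_per_node)
    return num_nodes * cores_per_node

print(compute_cost_pes("cori-knl", 10, 64))     # 680: charged for 68 cores/node even if only 64 run MPI tasks
print(compute_cost_pes("yellowstone", 10, 16))  # 160
```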

@jgfouca (Contributor, author) commented Oct 25, 2017

I'm going to hold off further communication about this PR until the CIME call this afternoon. @worleyph , if you want to join it's at 1:00PM MDT, https://global.gotomeeting.com/join/243702717

@@ -2047,7 +2047,7 @@
     <default_value>0</default_value>
     <group>mach_pes_last</group>
     <file>env_mach_pes.xml</file>
-    <desc>total number of tasks and threads (setup automatically - DO NOT EDIT)</desc>
+    <desc>total number of MPI tasks (setup automatically - DO NOT EDIT)</desc>
A reviewer (Contributor) commented on the diff above:

Now I'm confused - this should be the total number of physical cores used; that would include threads unless the threads are using SMT.

Another contributor replied:

That is what I would have thought TOTALPES meant, but it has been returning the number of MPI tasks recently.

@jgfouca (Contributor, author) commented Oct 25, 2017

@jedwards4b , TOTALPES seems to be widely used throughout CIME as the -np argument to mpiexec. That would mean it should not include threads, right?

@rljacob (Member) commented Oct 25, 2017

I've always found the various PES, TASKS and other variables like that in CIME frustratingly difficult to keep straight: what they are and what they mean. What does CIME need to know about a node to provide its services?

- the number of physical cores
- how many MPI tasks you can put on those cores (sometimes not 1-1)
- how many nodes/cores you need to ask for in a submission
- how many nodes/cores you are actually going to run on
- how many nodes/cores/tasks/threads are going to be given to each major component

There's probably more.

@jgfouca (Contributor, author) commented Oct 25, 2017

Discussed in the meeting: TOTALPES is the number of MPI tasks and does not include thread info. We will make a follow-on PR to do something about COST_PES and COST_PER_NODE.

@rljacob (Member) commented Oct 25, 2017

Then should we change TOTALPES to TOTALTASKS? A Processor Element could be assigned a thread or a task.

@jedwards4b merged commit e959bdc into master on Oct 25, 2017.
@jgfouca deleted the jgfouca/titan_fixes_and_misc branch on October 26, 2017.