
Update config_machines.xml for Feb 18 software update on Frontier. #6977

Closed

Conversation

trey-ornl (Contributor)

Various changes to config_machines.xml in preparation for system software updates on Frontier on 2025-02-18. Within the frontier-scream-gpu machine:

  • Change lmod paths from /usr/share to /opt/cray/pe because the /usr/share software is not maintained and is not available on internal OLCF test computers.
  • Update craygnuamdgpu to the latest version of cpe available on Frontier, cpe/24.11.
  • Add explicit versions to the other modules in craygnuamdgpu. I used to think this wasn't needed, since module load cpe/24.11 should set all the appropriate defaults. I recently discovered that the defaults are only updated if module load cpe/24.11 is issued as a separate command before the other module loads; CIME uses a single module load with a list of modules, which does not update the defaults (see the sketch after this list). I selected the explicit versions based on the cpe/24.11 defaults, including the default for rocm, rocm/6.2.4.
  • Remove the libfabric module from craygnuamdgpu. On Feb 18, the default libfabric will change from libfabric/1.20.1 to libfabric/1.22.0, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the libfabric module from craygnuamdgpu so that it will stay with the default when it changes.
  • Change from libfabric/1.15.2.0 to libfabric/1.20.1 for crayclang-scream. On Feb 18, libfabric/1.15.2.0 will disappear from Frontier, and the new default, libfabric/1.22.0, does not support Cray MPI versions before cray-mpich/8.1.28. Because crayclang-scream is stuck back at cpe/22.12 and cray-mpich/8.1.26, we have to use libfabric/1.20.1. We previously switched from libfabric/1.20.1 to libfabric/1.15.2.0 because the former had a serious performance regression; that is supposed to be fixed. And, yes, this does mean that crayclang-scream will be using a version of libfabric that isn't officially supported with the libfabric/1.22.0 drivers, but such is life.
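
For reference, a minimal sketch of the module-load behavior described above; the non-cpe modules shown here are illustrative, not the exact list CIME loads:

```sh
# Loading cpe/24.11 as its own command first refreshes the module defaults,
# so a later unversioned load picks up the cpe/24.11 version (e.g. rocm/6.2.4).
module load cpe/24.11
module load rocm

# A single load with a list of modules (roughly what CIME issues) does not
# refresh the defaults mid-command, so versions must be pinned explicitly.
module load cpe/24.11 rocm/6.2.4
```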

I ran an 8-node test of NE30 on Frontier and on our internal system (for libfabric/1.22.0) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in e3sm_timing.t.*.

Compiler          cpe    rocm   libfabric  CPL:ATM_RUN
crayclang-scream  22.12  5.4.0  1.15.2.0   15.855: 16.571
crayclang-scream  22.12  5.4.0  1.20.1     15.885: 16.683
craygnuamdgpu     24.07  6.2.0  1.15.2.0   14.850: 15.594
craygnuamdgpu     24.11  6.2.4  1.15.2.0   14.727: 15.644
craygnuamdgpu     24.11  6.2.4  1.20.1     14.729: 15.447
craygnuamdgpu     24.11  6.2.4  1.22.0     14.990: 15.601

It would probably be good to confirm with NE1024 runs.

@grnydawn (Contributor) commented Feb 6, 2025

@jgfouca, just to check, do you think crayclang-scream needs any changes due to these system software updates on Frontier?

@jgfouca (Member) commented Feb 6, 2025

@grnydawn my guess, looking at the module changes, is no.

@ndkeen (Contributor) commented Feb 6, 2025

Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

@trey-ornl (Contributor, Author)

> Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

The table above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test libfabric/1.22.0 on the internal system.

I compared BFB using bfbhash> 234 on the NE30 runs (a grep/diff sketch of this kind of comparison follows the list below).

  • Changing crayclang-scream from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu from cpe/24.07+rocm/6.2.0 to cpe/24.11+rocm/6.2.4, both with libfabric/1.15.2.0, was not BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.20.1 on Frontier to libfabric/1.22.0 on our internal computer was not BFB.
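
For illustration, a sketch of this kind of bfbhash comparison, assuming the hashes appear on bfbhash> lines in the atm log files; the run directories and log paths are placeholders:

```sh
# Extract the bfbhash> lines from two runs and diff them; identical output
# means the runs are BFB.
grep -h "bfbhash>" run_a/run/atm.log.* > hashes_a.txt
grep -h "bfbhash>" run_b/run/atm.log.* > hashes_b.txt
diff hashes_a.txt hashes_b.txt && echo "BFB"
```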

@ndkeen (Contributor) commented Feb 7, 2025

I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

@grnydawn (Contributor) commented Feb 7, 2025

@jgfouca, @rljacob, could you assign at least one reviewer for this PR?

@rljacob (Member) commented Feb 7, 2025

I would like us to settle the machine/compiler definition issue (#6773) first and then apply these changes to the unified entry.

rljacob requested a review from jgfouca on February 7, 2025 at 20:44
@xylar (Contributor) commented Feb 7, 2025

Would this also be a good chance to set the missing MPICH_GPU_SUPPORT_ENABLED variables (a minimal sketch follows the links below)? See
E3SM-Project#198
E3SM-Project/polaris#275
E3SM-Project#196
E3SM-Project/mache#231
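
For context, a minimal sketch of what those issues ask for, under the assumption that the variable simply needs to be exported for GPU-aware MPI runs (in config_machines.xml it would live in the machine's environment-variables section):

```sh
# Hypothetical: enable GPU-aware cray-mpich for runs that pass GPU buffers to MPI.
export MPICH_GPU_SUPPORT_ENABLED=1
```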

@trey-ornl (Contributor, Author)

@rljacob, @xylar This pull request has some urgency, as it is needed before the Frontier software changes scheduled for February 18.

@rljacob (Member) commented Feb 7, 2025

I'm aware, but 11 days seems like enough time to settle the machine/compiler name issue.

@xylar (Contributor) commented Feb 7, 2025

Okay, the MPICH_GPU_SUPPORT_ENABLED issue can be addressed separately. It just seemed like it would fit in and spare us another PR. This was done for Perlmutter nearly three years ago: #4833

@trey-ornl (Contributor, Author)

> I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

I am trying an NE1024 test from @whannah1, and I got a seg fault for the new compiler setting. I'm investigating with NE30 and NE256 versions of the same test. To be continued...

@trey-ornl (Contributor, Author)

Testing of NE1024 revealed an issue with cray-mpich/8.1.31, the default for cpe/24.11, where the maximum MPI tag value dropped below what EAMxx needs. I worked around this by using cray-mpich/8.1.30. (A quick way to check the tag limit is sketched below.)
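
One quick way to check the tag limit under a given cray-mpich module is to query MPI_TAG_UB. This is a generic sketch, not part of the PR; cc is the Cray compiler wrapper, and the srun invocation is illustrative:

```sh
# Build and run a one-off program that prints the maximum allowed MPI tag.
cat > tag_ub.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
  int *ub = NULL, flag = 0;
  MPI_Init(&argc, &argv);
  /* For predefined attributes, MPI returns a pointer to the value. */
  MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &ub, &flag);
  if (flag) printf("MPI_TAG_UB = %d\n", *ub);
  MPI_Finalize();
  return 0;
}
EOF
cc tag_ub.c -o tag_ub
srun -n 1 ./tag_ub
```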

Then I found that performance takes a slight hit moving to cpe/24.11 and its default rocm/6.2.4. All the following use cray-mpich/8.1.30.

cpe    rocm   libfabric  CPL:ATM_RUN       Description
24.07  6.2.0  1.15.2.0   676.298: 732.867  Current master. Will fail starting 2025-02-18. And defaults will change.
24.11  6.2.4  1.20.1     699.482: 756.622  Upcoming defaults for 2025-02-18, except cray-mpich/8.1.30.
24.07  6.1.3  1.20.1     675.430: 732.025  Hardcoded defaults for cpe/24.07 and updated libfabric.

I just pushed a commit that uses the last row. We can use this for now until we resolve the performance loss or get new versions.
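
For clarity, the last row corresponds roughly to the following module set in the craygnuamdgpu block (illustrative; the authoritative list is in config_machines.xml, and other modules in that block are omitted here):

```sh
# Hardwired cpe/24.07 defaults plus the updated libfabric and the
# cray-mpich/8.1.30 workaround for the MPI tag limit.
module load cpe/24.07 rocm/6.1.3 cray-mpich/8.1.30 libfabric/1.20.1
```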

@trey-ornl (Contributor, Author)

To summarize, this pull request makes the following changes to config_machines.xml in preparation for 2025-02-18 changes on Frontier.

  • A required change in crayclang-scream away from libfabric/1.15.2.0.
  • Hardwired versions matching cpe/24.07, because the new default versions will no longer match cpe/24.07.

bartgol added a commit that referenced this pull request Feb 17, 2025
…to next (PR #7007)

Fixes an issue encountered during testing of PR #6977,
where the tags we used went above the max tag allowed by the MPI distribution

[BFB]
@trey-ornl (Contributor, Author)

I will replace this with two separate pull requests after the changes of #6990: one for the minimal libfabric change needed for the old craycray build and one with all the latest software in craygnu, once I test it.

@trey-ornl closed this on Feb 17, 2025
@rljacob (Member) commented Feb 19, 2025

Replaced partly by #7021
