
Update config_machines.xml for Feb 18 software update on Frontier. #6977

Closed

Conversation

trey-ornl (Contributor)

Various changes to config_machines.xml in preparation for system software updates on Frontier on 2025-02-18. Within the frontier-scream-gpu machine:

  • Change lmod paths from /usr/share to /opt/cray/pe because the /usr/share software is not maintained and is not available on internal OLCF test computers.
  • Update craygnuamdgpu to the latest version of cpe available on Frontier, cpe/24.11.
  • Add explicit versions to the other modules in craygnuamdgpu. I used to think this wasn't needed, since module load cpe/24.11 should set all the appropriate defaults. I recently discovered that the defaults are only updated if module load cpe/24.11 is issued as a separate command before the other module loads; CIME uses a single module load with a list of modules, which does not update the defaults (see the sketch after this list). I selected the explicit versions based on the cpe/24.11 defaults, including the default for rocm, rocm/6.2.4.
  • Remove the libfabric module from craygnuamdgpu. On Feb 18, the default libfabric will change from libfabric/1.20.1 to libfabric/1.22.0, but the latter is not yet available on Frontier. Only the default version on a given computer is officially supported by HPE. I removed the libfabric module from craygnuamdgpu so that it will stay with the default when it changes.
  • Change from libfabric/1.15.2.0 to libfabric/1.20.1 for crayclang-scream. On Feb 18, libfabric/1.15.2.0 will disappear from Frontier, and the new default, libfabric/1.22.0, does not support Cray MPI versions before cray-mpich/8.1.28. Because crayclang-scream is stuck back at cpe/22.12 and cray-mpich/8.1.26, we have to use libfabric/1.20.1. We previously switched from libfabric/1.20.1 to libfabric/1.15.2.0 because the former had a serious performance regression; that is supposed to be fixed. And, yes, this does mean that crayclang-scream will be using a version of libfabric that isn't officially supported with the libfabric/1.22.0 drivers, but such is life.
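
For reference, a minimal sketch of the module-load behavior described above; the non-cpe modules shown here are illustrative, not the exact list CIME loads:

```sh
# Loading cpe/24.11 as its own command first refreshes the module defaults,
# so a later unversioned load picks up the cpe/24.11 version (e.g. rocm/6.2.4).
module load cpe/24.11
module load rocm

# A single load with a list of modules (roughly what CIME issues) does not
# refresh the defaults mid-command, so versions must be pinned explicitly.
module load cpe/24.11 rocm/6.2.4
```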

I ran an 8-node test of NE30 on Frontier and on our internal system (for libfabric/1.22.0) to confirm that the changes work and probably don't seriously impact performance. Here are timing results as reported in e3sm_timing.t.*.

Compiler          cpe    rocm   libfabric  CPL:ATM_RUN
crayclang-scream  22.12  5.4.0  1.15.2.0   15.855: 16.571
crayclang-scream  22.12  5.4.0  1.20.1     15.885: 16.683
craygnuamdgpu     24.07  6.2.0  1.15.2.0   14.850: 15.594
craygnuamdgpu     24.11  6.2.4  1.15.2.0   14.727: 15.644
craygnuamdgpu     24.11  6.2.4  1.20.1     14.729: 15.447
craygnuamdgpu     24.11  6.2.4  1.22.0     14.990: 15.601

It would probably be good to confirm with NE1024 runs.

@grnydawn (Contributor) commented Feb 6, 2025

@jgfouca, just to check, do you think crayclang-scream needs any changes due to these system software updates on Frontier?

@jgfouca (Member) commented Feb 6, 2025

@grnydawn my guess, looking at the module changes, is no.

@ndkeen (Contributor) commented Feb 6, 2025

Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

@trey-ornl (Contributor, Author)

> Have you verified that tests are BFB? and similar performance? Including ne256 and ne1024?

The table above shows that the various configurations have similar performance for a short NE30 run using 8 nodes. If you have NE256 and NE1024 runs to suggest, then I'm happy to try them, though they may be too big to test libfabric/1.22.0 on the internal system.

I compared BFB using bfbhash> 234 on the NE30 runs (a grep/diff sketch of this kind of comparison follows the list below).

  • Changing crayclang-scream from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu from cpe/24.07+rocm/6.2.0 to cpe/24.11+rocm/6.2.4, both with libfabric/1.15.2.0, was not BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.15.2.0 to libfabric/1.20.1 was BFB.
  • Changing craygnuamdgpu with cpe/24.11+rocm/6.2.4 from libfabric/1.20.1 on Frontier to libfabric/1.22.0 on our internal computer was not BFB.
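
For illustration, a sketch of this kind of bfbhash comparison, assuming the hashes appear on bfbhash> lines in the atm log files; the run directories and log paths are placeholders:

```sh
# Extract the bfbhash> lines from two runs and diff them; identical output
# means the runs are BFB.
grep -h "bfbhash>" run_a/run/atm.log.* > hashes_a.txt
grep -h "bfbhash>" run_b/run/atm.log.* > hashes_b.txt
diff hashes_a.txt hashes_b.txt && echo "BFB"
```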

@ndkeen (Contributor) commented Feb 7, 2025

I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

@grnydawn (Contributor) commented Feb 7, 2025

@jgfouca, @rljacob, could you assign at least one reviewer for this PR?

@rljacob (Member) commented Feb 7, 2025

I would like us to settle the machine/compiler definition issue (#6773) first and then apply these changes to the unified entry.

rljacob requested a review from jgfouca on February 7, 2025 at 20:44
@xylar (Contributor) commented Feb 7, 2025

Would this also be a good chance to set the missing MPICH_GPU_SUPPORT_ENABLED variables (a minimal sketch follows the links below)? See
E3SM-Project#198
E3SM-Project/polaris#275
E3SM-Project#196
E3SM-Project/mache#231
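
For context, a minimal sketch of what those issues ask for, under the assumption that the variable simply needs to be exported for GPU-aware MPI runs (in config_machines.xml it would live in the machine's environment-variables section):

```sh
# Hypothetical: enable GPU-aware cray-mpich for runs that pass GPU buffers to MPI.
export MPICH_GPU_SUPPORT_ENABLED=1
```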

@trey-ornl (Contributor, Author)

@rljacob, @xylar This pull request has some urgency, as it is needed before the Frontier software changes scheduled for February 18.

@rljacob (Member) commented Feb 7, 2025

I'm aware, but 11 days seems like enough time to settle the machine/compiler name issue.

@xylar (Contributor) commented Feb 7, 2025

Okay, the MPICH_GPU_SUPPORT_ENABLED issue can be addressed separately. It just seemed like it would fit in and spare us another PR. This was done for Perlmutter nearly three years ago: #4833

@trey-ornl (Contributor, Author)

> I might suggest at least trying a day or two of ne1024 (with any recent production script) to ensure no issues before merging.

I am trying an NE1024 test from @whannah1, and I got a seg fault for the new compiler setting. I'm investigating with NE30 and NE256 versions of the same test. To be continued...

@trey-ornl (Contributor, Author)

Testing of NE1024 revealed an issue with cray-mpich/8.1.31, the default for cpe/24.11, where the maximum MPI tag value dropped below what EAMxx needs. I worked around this by using cray-mpich/8.1.30. (A quick way to check the tag limit is sketched below.)
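
One quick way to check the tag limit under a given cray-mpich module is to query MPI_TAG_UB. This is a generic sketch, not part of the PR; cc is the Cray compiler wrapper, and the srun invocation is illustrative:

```sh
# Build and run a one-off program that prints the maximum allowed MPI tag.
cat > tag_ub.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
  int *ub = NULL, flag = 0;
  MPI_Init(&argc, &argv);
  /* For predefined attributes, MPI returns a pointer to the value. */
  MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &ub, &flag);
  if (flag) printf("MPI_TAG_UB = %d\n", *ub);
  MPI_Finalize();
  return 0;
}
EOF
cc tag_ub.c -o tag_ub
srun -n 1 ./tag_ub
```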

Then I found that performance takes a slight hit moving to cpe/24.11 and its default rocm/6.2.4. All the following use cray-mpich/8.1.30.

cpe    rocm   libfabric  CPL:ATM_RUN       Description
24.07  6.2.0  1.15.2.0   676.298: 732.867  Current master. Will fail starting 2025-02-18. And defaults will change.
24.11  6.2.4  1.20.1     699.482: 756.622  Upcoming defaults for 2025-02-18, except cray-mpich/8.1.30.
24.07  6.1.3  1.20.1     675.430: 732.025  Hardcoded defaults for cpe/24.07 and updated libfabric.

I just pushed a commit that uses the last row. We can use this for now until we resolve the performance loss or get new versions.
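
For clarity, the last row corresponds roughly to the following module set in the craygnuamdgpu block (illustrative; the authoritative list is in config_machines.xml, and other modules in that block are omitted here):

```sh
# Hardwired cpe/24.07 defaults plus the updated libfabric and the
# cray-mpich/8.1.30 workaround for the MPI tag limit.
module load cpe/24.07 rocm/6.1.3 cray-mpich/8.1.30 libfabric/1.20.1
```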

@trey-ornl (Contributor, Author)

To summarize, this pull request makes the following changes to config_machines.xml in preparation for 2025-02-18 changes on Frontier.

  • A required change in crayclang-scream away from libfabric/1.15.2.0.
  • Hardwired versions matching cpe/24.07, because the new default versions will no longer match cpe/24.07.

bartgol added a commit that referenced this pull request Feb 17, 2025
…to next (PR #7007)

Fixes an issue encountered during testing of PR #6977,
where the tags we used went above the max tag allowed by the MPI distribution

[BFB]
@trey-ornl (Contributor, Author)

I will replace this with two separate pull requests after the changes of #6990: one for the minimal libfabric change needed for the old craycray build and one with all the latest software in craygnu, once I test it.

@trey-ornl closed this on Feb 17, 2025
@rljacob (Member) commented Feb 19, 2025

Replaced partly by #7021
