Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EAMxx: avoid MPI tags in GridImportExport gather/scatter methods #7007

Merged
merged 4 commits into from
Feb 17, 2025

Conversation

bartgol
Copy link
Contributor

@bartgol bartgol commented Feb 13, 2025

Fixes an issue encountered during testing of PR #6977 (see this comment), where the tags we used went above the max tag allowed by the MPI distribution

[BFB]

Instead of sending one msg per grid point, we follow the common pack/unpack paradigm, and send/recv a single message per PID, allowing us to just always use 0 as a tag. It's worth reminding that the gather/scatter routines are used only during initialization, to set up remappers weights (we read map files "evenly" across ranks, and then ship the CRS matrix triplets to the ranks that need them).

@trey-ornl This PR should fix the tags issue. The unit tests for this struct pass on my laptop, but I won't claim it works until I run some tests where coarsening remappers are used (our CI runs ne4 only, so no coarsened output). But if you want to give it a shot, you could try to integrate it on the branch of the Frontier software update, and give it a try on Frontier...

@tcclevenger I don't know if you've ever taken a look at this part of the code. Let me know if you're not familiar with it, and you'd rather walk through it together.

…MPI tags

* Tags were 1-1 with grid GIDs, which at large resolutions
  can exceed the TAG limit for certain MPI distributions
* Use a pack/unpack approach to send a single msg per pid
@bartgol bartgol added the EAMxx PRs focused on capabilities for EAMxx label Feb 13, 2025
@bartgol bartgol requested a review from tcclevenger February 13, 2025 05:07
@bartgol bartgol self-assigned this Feb 13, 2025
@bartgol
Copy link
Contributor Author

bartgol commented Feb 14, 2025

I ran an ne30 test with output remapped to ne4, and cmp against what I get with master, and the two are BFB. This PR is ready to integrate upon review.

@ndkeen
Copy link
Contributor

ndkeen commented Feb 15, 2025

I was finally able to verify that this fix allows a ne1024 case to run one day using cray-mpich/8.1.31 (where without this change we hit the error suggesting tag too large with this cray-mpich version).

Amazing work to find/fix this so quickly!

Copy link
Contributor

@tcclevenger tcclevenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a minor comment about a comment. It's in a test so feel free to merge without addressing.

@bartgol
Copy link
Contributor Author

bartgol commented Feb 17, 2025

The only CI fails are unrelated, and are currently being dealt with. Merging.

bartgol added a commit that referenced this pull request Feb 17, 2025
…to next (PR #7007)

Fixes an issue encountered during testing of PR #6977,
where the tags we used went above the max tag allowed by the MPI distribution

[BFB]
@bartgol bartgol added the BFB PR leaves answers BFB label Feb 17, 2025
@bartgol bartgol merged commit 752a110 into master Feb 17, 2025
14 of 19 checks passed
@bartgol bartgol deleted the bartgol/eamxx/grid-import-export-no-mpi-tags branch February 17, 2025 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB EAMxx PRs focused on capabilities for EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants