Add a new MachEnv class to hold various machine and messaging parameters #26

Merged: philipwjones merged 7 commits into E3SM-Project:develop from omega/mach-env on Aug 2, 2023

Conversation

philipwjones

@philipwjones philipwjones commented Jul 25, 2023

This PR adds a new class, MachEnv, to hold parameters that describe the machine environment, including MPI communicators and task info, the number of threads if threaded, the vector length for CPUs, and potentially other GPU and node-level configuration info. The class is described in the included documentation for the User's and Developer's Guides.

This has been tested on Chrysalis using the provided unit test. The unit test has not yet been integrated into CMake, but will be in an upcoming PR. In the meantime, you can build the unit test in the test/base directory using:
mpicc (or other MPI-wrapped compiler) -I../../src/base -DOMEGA_VECTOR_LENGTH=16 ../../src/base/MachEnv.cpp MachEnvTest.cpp -lstdc++
and run it using:
srun (or other MPI launcher) -n 8 ./a.out
Note that the unit test requires at least 8 MPI tasks.

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

}

// Set task 0 as master
NewEnv.MasterTask = 0;
Member

Just wondering if we should make master rank an optional argument with a default value of zero?

Author

Sure, I will add.
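
A minimal sketch of the suggestion above, assuming a simplified stand-in for the class. `EnvSketch`, `makeEnv`, and `InMasterTask` are hypothetical names used only for illustration; the real MachEnv also holds an MPI communicator and other fields, and its actual interface may differ.

```cpp
// Hypothetical, simplified stand-in for the environment class discussed here.
struct EnvSketch {
   int MyTask;     // ID of this task
   int NumTasks;   // total number of tasks
   int MasterTask; // ID of the master task
   bool IsMaster;  // true if this task is the master
};

// The master task defaults to 0 when the caller does not choose one.
EnvSketch makeEnv(int MyTask, int NumTasks, int InMasterTask = 0) {
   EnvSketch NewEnv;
   NewEnv.MyTask     = MyTask;
   NewEnv.NumTasks   = NumTasks;
   NewEnv.MasterTask = InMasterTask;
   NewEnv.IsMaster   = (MyTask == InMasterTask);
   return NewEnv;
}
```

Existing call sites such as `makeEnv(3, 8)` keep task 0 as master, while `makeEnv(3, 8, 3)` selects task 3 instead.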

MPI_Group_range_incl(InGroup, NRanges, Range, &NewGroup);

// Create the communicator for the new group
MPI_Comm_create(InComm, NewGroup, &(NewEnv.Comm));
Member

Perhaps we can add an error check for MPI_Comm_create so that it exits here rather than somewhere further downstream.

Author

I can add it, but it turns out that there are few instances where an actual error is returned. Most often, the call will return MPI_COMM_NULL, which can happen for a variety of valid reasons.
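
The distinction drawn above can be sketched as a small decision helper. This is an illustration under stated assumptions, not the check that was added in this PR: `CommResult` and `classifyCommCreate` are hypothetical names, and the MPI call is shown only in a comment so the block compiles without MPI. Note also that MPI's default error handler on communicators is MPI_ERRORS_ARE_FATAL, so a return code is only observable if the handler has been set to MPI_ERRORS_RETURN.

```cpp
// Possible outcomes for one task after a communicator-creation call.
enum class CommResult { Member, NotMember, Failed };

// Hypothetical helper. Succeeded is whether the call returned MPI_SUCCESS;
// IsNull is whether the resulting communicator equals MPI_COMM_NULL.
CommResult classifyCommCreate(bool Succeeded, bool IsNull) {
   if (!Succeeded)
      return CommResult::Failed;    // genuine error reported by MPI
   if (IsNull)
      return CommResult::NotMember; // valid: this task is not in the new group
   return CommResult::Member;       // this task belongs to the new communicator
}

// With MPI available, a call site might look like:
//   int Err = MPI_Comm_create(InComm, NewGroup, &NewComm);
//   CommResult R =
//       classifyCommCreate(Err == MPI_SUCCESS, NewComm == MPI_COMM_NULL);
```

Only the `Failed` case would justify exiting at this point; `NotMember` is the expected result on tasks excluded from the new group.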

//------------------------------------------------------------------------------
// Set task ID for the master task (if not 0)

int MachEnv::setMasterTask(const int TaskID) {
Member

I guess the expectation is that we can dynamically change master even after comm creation. Would be great for adaptive/dynamic load balancing if we can make it work throughout the model.

Author

Yup, this could be part of load-balancing, especially if the default master is particularly over-loaded.
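
A rough sketch of what changing the master after creation involves, assuming a simplified stand-in type. `TaskInfoSketch` and its members are hypothetical, and the range check and return convention are assumptions rather than the behavior of the real `MachEnv::setMasterTask`.

```cpp
// Hypothetical, simplified stand-in holding only the task bookkeeping.
struct TaskInfoSketch {
   int MyTask     = 0;    // ID of this task
   int NumTasks   = 1;    // total number of tasks
   int MasterTask = 0;    // ID of the current master task
   bool IsMaster  = true; // true if this task is the master

   // Returns 0 on success, nonzero if TaskID is outside [0, NumTasks).
   int setMasterTask(const int TaskID) {
      if (TaskID < 0 || TaskID >= NumTasks)
         return 1; // invalid ID: leave the current master unchanged
      MasterTask = TaskID;
      IsMaster   = (MyTask == MasterTask); // keep the flag consistent
      return 0;
   }
};
```

The point of the sketch is that any cached flag derived from the master ID has to be recomputed whenever the master changes, on every task.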

#ifdef OMEGA_VECTOR_LENGTH
constexpr int VecLength = OMEGA_VECTOR_LENGTH;
#else
constexpr int VecLength = 1;
Member

I'm overthinking this (not enough coffee) for future-proofing GPU architectures but should we make an equivalent GPU_VECTOR_LENGTH and set to 1?
Ignore if that's a trivial change anyway.


Good to think about, but we can probably hold off on that level of granular configuration for GPUs until needed.

Author

Since this vector length will often appear as vector blocking of the inner loop, we will generally want it to be 1 for GPUs, and a separate GPU_VECTOR_LENGTH would be redundant (basically like the pack they're doing in the atmosphere). Though I think Trey was exploring the use of non-unity values to match warp/thread group sizes, and we could set this vector length to an appropriate value in that case. I do think we eventually want some GPU-related optimization parameters, but Brian is correct that we need to figure out what those should be and can add them later.
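
To illustrate what "vector blocking of the inner loop" can look like, here is a minimal sketch. Only `OMEGA_VECTOR_LENGTH` and `VecLength` come from this PR; the function `sumSquaresBlocked` is a hypothetical example, and real model loops would be structured differently.

```cpp
#include <cstddef>
#include <vector>

#ifdef OMEGA_VECTOR_LENGTH
constexpr int VecLength = OMEGA_VECTOR_LENGTH;
#else
constexpr int VecLength = 1;
#endif

// Sum of squares with the loop split into blocks of at most VecLength
// elements. With VecLength == 1 this reduces to a plain element-wise loop.
double sumSquaresBlocked(const std::vector<double> &X) {
   const std::size_t N = X.size();
   double Sum = 0.0;
   for (std::size_t Start = 0; Start < N; Start += VecLength) {
      // The last block may be shorter than VecLength
      const std::size_t End = (Start + VecLength < N) ? Start + VecLength : N;
      // Inner loop with a short trip count bounded by a compile-time constant
      for (std::size_t I = Start; I < End; ++I)
         Sum += X[I] * X[I];
   }
   return Sum;
}
```

Compiling with `-DOMEGA_VECTOR_LENGTH=16` changes only the block size, which is why a value of 1 is a natural default where the blocking offers no benefit.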

Member

@sarats sarats left a comment

Overall, looks good to me.

@brian-oneill brian-oneill left a comment

Built and ran the unit test on Chrysalis, looks good.

@philipwjones
Author

I've made the changes suggested in the reviews, namely adding an optional argument for selecting a different master task and adding error checking on MPI_Comm_create. The former uncovered a problem with the implementation, so I've had to refactor a bit. I also added a print function that was useful for debugging. The documentation has been updated to reflect these changes. Everything still passes the unit tests, including a new unit test for the added optional argument.

@philipwjones philipwjones merged commit 6004902 into E3SM-Project:develop Aug 2, 2023
@philipwjones philipwjones deleted the omega/mach-env branch December 4, 2023 23:57