Add a new MachEnv class to hold various machine and messaging parameters #26

philipwjones · 2023-07-25T19:25:47Z

This adds a new class, MachEnv, to hold parameters that describe the machine environment, including MPI communicators and task info, the number of threads (if threaded), the vector length for CPUs, and potentially other GPU and node-level configuration info. The class is described in the included documentation for the User's Guide and Developer's Guide.

This has been tested on Chrysalis using the provided unit test. The unit test has not yet been integrated into CMake, but will be in an upcoming PR. In the meantime, you can build the unit test in the test/base directory using:
mpicc (or another MPI-wrapped compiler) -I../../src/base -DOMEGA_VECTOR_LENGTH=16 ../../src/base/MachEnv.cpp MachEnvTest.cpp -lstdc++
and run it using:
srun (or another MPI launcher) -n 8 ./a.out
Note that the unit test requires at least 8 MPI tasks.

Checklist

Documentation:
- Design document has been generated and added to the docs
- User's Guide has been updated
- Developer's Guide has been updated
- Documentation has been built locally and changes look as expected
Testing:
- A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
- CTest unit tests for new features have been added per the approved design.
- Unit tests have passed. Please provide a relevant CDash build entry for verification.

sarats · 2023-07-26T17:24:32Z

components/omega/src/base/MachEnv.cpp

+   }
+
+   // Set task 0 as master
+   NewEnv.MasterTask = 0;


Just wondering if we should make master rank an optional argument with a default value of zero?

philipwjones · 2023-07-31

Sure, I will add that.
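The suggestion above, an optional master task that defaults to zero, could be sketched roughly as follows. `EnvSketch` and `makeEnv` are hypothetical names used only for illustration, and the range check is an assumption; the actual MachEnv interface is not shown in this thread.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the environment object; not the real MachEnv.
struct EnvSketch {
   int NumTasks;   // number of tasks in the communicator
   int MasterTask; // task ID designated as the master
};

// Build an environment with an optional master task that defaults to 0.
// Rejects a master task outside the valid range [0, NumTasks).
EnvSketch makeEnv(int NumTasks, int InMasterTask = 0) {
   if (NumTasks < 1)
      throw std::invalid_argument("NumTasks must be positive");
   if (InMasterTask < 0 || InMasterTask >= NumTasks)
      throw std::out_of_range("master task must be in [0, NumTasks)");
   return EnvSketch{NumTasks, InMasterTask};
}
```

Callers that are happy with task 0 write `makeEnv(8)`, while callers that want a different master write `makeEnv(8, 3)`, so existing call sites would not need to change.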

sarats · 2023-07-26T17:26:10Z

components/omega/src/base/MachEnv.cpp

+   MPI_Group_range_incl(InGroup, NRanges, Range, &NewGroup);
+
+   // Create the communicator for the new group
+   MPI_Comm_create(InComm, NewGroup, &(NewEnv.Comm));


Perhaps we can add an error check for MPI_Comm_create so that it exits here rather than somewhere further downstream.

philipwjones · 2023-07-31

I can add that, but it turns out there are few instances where an actual error is returned. Most often it will return MPI_COMM_NULL, which can happen for a variety of valid reasons.

sarats · 2023-07-26T17:28:30Z

components/omega/src/base/MachEnv.cpp

+//------------------------------------------------------------------------------
+// Set task ID for the master task (if not 0)
+
+int MachEnv::setMasterTask(const int TaskID) {


I guess the expectation is that we can dynamically change the master even after comm creation. It would be great for adaptive/dynamic load balancing if we can make it work throughout the model.

philipwjones · 2023-07-31

Yup, this could be part of load balancing, especially if the default master is particularly overloaded.
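As a rough illustration of changing the master after construction, the sketch below mirrors the `setMasterTask(const int TaskID)` signature from the diff. Everything else (the class name `TaskEnvSketch`, the other members, and the convention of returning 0 on success) is an assumption made for the example, not the actual MachEnv implementation.

```cpp
// Hypothetical sketch; only the setMasterTask signature comes from the diff.
class TaskEnvSketch {
 public:
   TaskEnvSketch(int InMyTask, int InNumTasks)
       : MyTask(InMyTask), NumTasks(InNumTasks), MasterTask(0) {}

   // Reassign the master task. Returns 0 on success and a nonzero value
   // if TaskID is out of range, leaving the current master unchanged.
   int setMasterTask(const int TaskID) {
      if (TaskID < 0 || TaskID >= NumTasks)
         return 1;
      MasterTask = TaskID;
      return 0;
   }

   int getMasterTask() const { return MasterTask; }

   // True when the local task is currently the master
   bool isMasterTask() const { return MyTask == MasterTask; }

 private:
   int MyTask;     // ID of the local task
   int NumTasks;   // total number of tasks in the environment
   int MasterTask; // ID of the current master task
};
```

In a real MPI run, presumably every task in the communicator would need to make the same call so that all tasks agree on which one is the master.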

sarats · 2023-07-26T17:30:21Z

components/omega/src/base/MachEnv.h

+#ifdef OMEGA_VECTOR_LENGTH
+constexpr int VecLength = OMEGA_VECTOR_LENGTH;
+#else
+constexpr int VecLength = 1;


I'm overthinking this (not enough coffee) for future-proofing GPU architectures, but should we make an equivalent GPU_VECTOR_LENGTH and set it to 1?
Ignore if that's a trivial change anyway.

brian-oneill · 2023-07-27

Good to think about, but we can probably hold off on that level of granular configuration for GPUs until it is needed.

philipwjones · 2023-07-31

Since this vector length will often appear as vector blocking of the inner loop, we will generally want it to be 1 for GPUs, so a separate GPU_VECTOR_LENGTH would be redundant (basically like the pack approach used in the atmosphere). That said, I think Trey was exploring non-unity values to match warp/thread-group sizes, and we could set this vector length to an appropriate value in that case. I do think we eventually want some GPU-related optimization parameters, but Brian is correct that we need to figure out what those should be first and can add them later.
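To make "vector blocking of the inner loop" concrete, here is a minimal sketch of how a compile-time `VecLength` might be used. The `VecLength` definition follows the MachEnv.h excerpt above; the kernel `axpyBlocked` is a hypothetical example and does not come from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same pattern as the MachEnv.h excerpt: compile-time length, default 1
#ifdef OMEGA_VECTOR_LENGTH
constexpr int VecLength = OMEGA_VECTOR_LENGTH;
#else
constexpr int VecLength = 1;
#endif

// Hypothetical kernel: Y = A*X + Y, processed in blocks of VecLength
void axpyBlocked(double A, const std::vector<double> &X,
                 std::vector<double> &Y) {
   constexpr std::size_t VL = VecLength;
   const std::size_t N       = std::min(X.size(), Y.size());
   const std::size_t NBlocks = N / VL;

   for (std::size_t B = 0; B < NBlocks; ++B) {
      const std::size_t Start = B * VL;
      // Fixed trip count known at compile time, which helps vectorization
      for (std::size_t V = 0; V < VL; ++V)
         Y[Start + V] += A * X[Start + V];
   }

   // Remainder elements that do not fill a whole block
   for (std::size_t I = NBlocks * VL; I < N; ++I)
      Y[I] += A * X[I];
}
```

With `VecLength` equal to 1 the blocked loop reduces to a plain element-by-element loop, which is the behavior described above as generally desirable for GPUs.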

sarats

Overall, looks good to me.

brian-oneill

Built and ran the unit test on Chrysalis, looks good.

philipwjones · 2023-08-02T21:18:04Z

I've made the changes suggested in the reviews, namely adding an optional argument for selecting a different master task and adding error checking on Comm_create. The former uncovered a problem with the implementation so I've had to refactor a bit. And I added a print function that was useful for debugging. Documentation has been changed to reflect these changes. All still passes the unit tests, including a new unit test for the added optional argument.

philipwjones added 3 commits July 25, 2023 10:58

add MachEnv code for MPI and other machine environment variables

9bcac37

includes source and a unit test

changes to MachEnv formatting to comply with style

25f8a4a

add documentation of MachEnv class

e5da693

philipwjones added the Omega label Jul 25, 2023

philipwjones requested review from sarats and brian-oneill July 25, 2023 19:25

philipwjones self-assigned this Jul 25, 2023

sarats reviewed Jul 26, 2023

sarats approved these changes Jul 26, 2023

brian-oneill approved these changes Jul 27, 2023

philipwjones added 4 commits August 2, 2023 14:55

added optional arg to change master task on construction

91bcb1c

this change uncovered a problem with the original implementation that required some refactoring. Also added a print function for debugging

fixed the developer guide section to reflect implementation change

af8eb94

format changes to MachEnv to satisfy linter

48b2d42

added error checking on Comm_create calls

900d341

philipwjones merged commit 6004902 into E3SM-Project:develop Aug 2, 2023

philipwjones deleted the omega/mach-env branch December 4, 2023 23:57

