Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coll/acoll: A few miscellaneous bugfixes #12985

Merged
merged 2 commits into from
Jan 15, 2025

Conversation

amd-nithyavs
Copy link
Contributor

This PR has some bug fixes and addition of a few command line arguments for configurability.

Bug fixes

  • Selection logic and associated parameter assignments fixed for socket level and node level subcommunicators (bcast.c, reduce.c, utils.h)
  • Calling appropriate bcast function from acoll's allgather/allreduce for subcommunicators (allreduce.c, allgather.c)
  • Proper free and deregister for xpmem related memory (component.c, utils.h)

Command-line arguments

  • To enable/disable use of NUMA and socket based subcommunicators (component.c, bcast.c)

In addition, the algorithm selection logic for multinode bcast is modified for better performance after the bugfixes.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for the minor comment, this looks good to me.

} else if (total_dsize <= dsize_thresh[thr_ind][2]) {
*sg_cnt = sg_size;
*sg_cnt = sg_size;
if (num_nodes == 2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: Can you please swap the arguments around and have the constant first? i.e. this should be

if (2 == num_nodes) {

(this is just to follow the Open MPI coding guidelines). There are a number of similar instances in this routine, I will not point them out, but please apply the same rule throughout the patch.

(Note: you are using this rule already at many other places)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in all the relevant places in the latest update.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mshanthagit
Copy link
Contributor

@lrbison could you please review this patch as well? Assigned it to you as might be familiar with the original acoll patch.

Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I am only somewhat familiar with the acoll component, so some of the logic escapes me, in particular the ranks and roots during init at various layers.

One additional note: I don't think your merge commit would be required if you rebased your changes onto the tip of main.

amd-nithyavs and others added 2 commits January 15, 2025 11:35
A few bugfixes (mostly applicable for multinode) and some extra command
line arguments for easier configurability.

Signed-off-by: Nithya V S <Nithya.VS@amd.com>
A hash table, as part of the acoll modules struct, is used to track the
rcache registrations done as part of the register_and_cache api called
from acoll collective components. This hash table is then iterated over
during module destruct and each rcache registration is deregistered to
ensure that the rcache module destroy proceeds correctly.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for removing the merge commit.

@mshanthagit mshanthagit merged commit 398b8d4 into open-mpi:main Jan 15, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants