coll/acoll: A few miscellaneous bugfixes #12985
Except for the minor comment, this looks good to me.
} else if (total_dsize <= dsize_thresh[thr_ind][2]) {
    *sg_cnt = sg_size;
    if (num_nodes == 2) {
Minor comment: Can you please swap the arguments around and have the constant first? i.e. this should be
if (2 == num_nodes) {
(This is just to follow the Open MPI coding guidelines.) There are a number of similar instances in this routine. I will not point them all out, but please apply the same rule throughout the patch.
(Note: you are using this rule already at many other places)
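For readers unfamiliar with the constant-first convention requested above, here is a minimal sketch of what it looks like in practice. The helper `is_two_node_job` is hypothetical and is not part of the acoll code. The commonly cited rationale, stated here as an assumption about why such guidelines exist, is that mistyping `==` as `=` turns the constant-first form into a compile error instead of a silent assignment.

```c
#include <assert.h>

/* Hypothetical helper for illustration only; not taken from the acoll patch.
 * Returns 1 when the job spans exactly two nodes, 0 otherwise. */
static int is_two_node_job(int num_nodes)
{
    assert(num_nodes > 0); /* a job is assumed to span at least one node */

    /* Constant on the left: if '==' were mistyped as '=', this would read
     * "2 = num_nodes", which is not valid C and is rejected by the compiler.
     * The variable-first form "num_nodes = 2" would compile and always be true. */
    if (2 == num_nodes) {
        return 1;
    }
    return 0;
}
```

Calling `is_two_node_job(2)` returns 1, and any other positive node count returns 0.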
Done in all the relevant places in the latest update.
Force-pushed from 8d8338e to e1a70f6.
LGTM
@lrbison could you please review this patch as well? I assigned it to you as you might be familiar with the original acoll patch.
Overall LGTM. I am only somewhat familiar with the acoll component, so some of the logic escapes me, in particular the ranks and roots during init at various layers.
One additional note: I don't think your merge commit would be required if you rebased your changes onto the tip of main.
A few bugfixes (mostly applicable for multinode) and some extra command line arguments for easier configurability.
Signed-off-by: Nithya V S <Nithya.VS@amd.com>
A hash table, as part of the acoll modules struct, is used to track the rcache registrations done as part of the register_and_cache api called from acoll collective components. This hash table is then iterated over during module destruct and each rcache registration is deregistered to ensure that the rcache module destroy proceeds correctly.
Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
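The track-then-release pattern described in the commit message above can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the acoll implementation: `reg_table_t`, `reg_table_track`, and `reg_table_release_all` are hypothetical names, the chained table stands in for whatever hash table the module actually uses, and the integer `handle` is a placeholder for an rcache registration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define REG_BUCKETS 16 /* hypothetical bucket count */

/* One tracked registration; 'handle' is a placeholder for a real registration. */
typedef struct reg_entry {
    void *key;              /* buffer address used as the lookup key */
    int handle;             /* stand-in for the registration handle */
    struct reg_entry *next; /* next entry in the same bucket */
} reg_entry_t;

typedef struct {
    reg_entry_t *buckets[REG_BUCKETS];
    int live; /* registrations not yet released */
} reg_table_t;

static size_t reg_hash(const void *key)
{
    return (size_t) (((uintptr_t) key >> 4) % REG_BUCKETS);
}

/* Record a registration so it can be found again at teardown.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int reg_table_track(reg_table_t *t, void *key, int handle)
{
    assert(NULL != t);
    reg_entry_t *e = malloc(sizeof(*e));
    if (NULL == e) {
        return -1;
    }
    size_t b = reg_hash(key);
    e->key = key;
    e->handle = handle;
    e->next = t->buckets[b];
    t->buckets[b] = e;
    t->live++;
    return 0;
}

/* Intended to run from the module destructor: visit every entry, release it,
 * and return how many entries were released. */
static int reg_table_release_all(reg_table_t *t)
{
    assert(NULL != t);
    int released = 0;
    for (size_t b = 0; b < REG_BUCKETS; b++) {
        reg_entry_t *e = t->buckets[b];
        while (NULL != e) {
            reg_entry_t *next = e->next;
            /* Real code would deregister e->handle from the rcache here. */
            t->live--;
            free(e);
            released++;
            e = next;
        }
        t->buckets[b] = NULL;
    }
    return released;
}
```

The point of the design is that every registration made during a collective is recorded at the moment it is created, so the destructor does not need to know which collectives ran; it only walks the table and releases whatever is left before the rcache module itself is destroyed.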
Force-pushed from 5fcb328 to 3e67dd1.
LGTM, thanks for removing the merge commit.
This PR contains some bug fixes and adds a few command-line arguments for configurability.
Bug fixes
Command-line arguments
In addition, the algorithm selection logic for multinode bcast is modified for better performance after the bugfixes.