-
Notifications
You must be signed in to change notification settings - Fork 69
Conversation
This commit fixes an issue that can occur if a target gets overwhelmed with requests. This can cause osc/pt2pt to go into deep recursion with a stack like req_complete_cb -> ompi_osc_pt2pt_callback -> start -> req_complete_cb -> ... . At small scale this is fine as the recursion depth stays small but at larger scale we can quickly exhaust the stack processing frag requests. To fix the issue the request callback now simply puts the request on a list and returns. The osc/pt2pt progress function then handles the processing and reposting of the request. As part of this change osc/pt2pt can now post multiple fragment receive requests per window. This should help prevent a target from being overwhelmed. Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit open-mpi/ompi@7589a25) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The original lock_all algorithm in osc/pt2pt sent a lock message to each peer in the communicator even if the peer is never the target of an operation. Since this scales very poorly the implementation has been replaced by one that locks the remote peer on first communication after a call to MPI_Win_lock_all. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit open-mpi/ompi@9444df1) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This PR fixes 1) a stack smash from deep recursion, and 2) horribly bad lock-all scaling. Can be bumped to 2.0.2 if really necessary. Want this for 2.0.1 though. |
Test PASSed. |
@regrant Can you please review this in the Real Near Future? 😄 |
@jsquyres you read my mind, I was reviewing this already, almost done. |
int rc; | ||
|
||
module->recv_frag_count = mca_osc_pt2pt_component.receive_count; | ||
if (module->recv_frag_count < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hjelmn I don't see where/how this count can ever be negative. Am I missing a intialization somewhere? I searched and couldn't find it. So this might be dead code.
Did you mean module->recv_frag_count<=0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, good catch. The initialization didn't make it. The fun of transcribing patches. Fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err, hold on. It is there. It can be negative because the user can be negative. That does need to be fixed.
(cherry picked from commit open-mpi/ompi@8faa1ed) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This receive_count MCA variable should never be negative. Change it to an unsigned int. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit open-mpi/ompi@40b7088)
@regrant Should be better now. |
@hjelmn Looks good so far, once the checks complete successfully I'll mark as reviewed. |
Test PASSed. |
Per @regrant 👍 |
@hppritcha Gave a verbal "good to go" |
:bot:milestone:v2.0.1
:bot🏷️bug
:bot:assign: @regrant