Skip to content
This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

v2.x osc/pt2pt fixes #1321

Merged
merged 4 commits into from
Aug 17, 2016
Merged

v2.x osc/pt2pt fixes #1321

merged 4 commits into from
Aug 17, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Aug 16, 2016

:bot:milestone:v2.0.1
:bot🏷️bug
:bot:assign: @regrant

hjelmn and others added 2 commits August 16, 2016 08:02
This commit fixes an issue that can occur if a target gets overwhelmed with
requests. This can cause osc/pt2pt to go into deep recursion with a stack
like req_complete_cb -> ompi_osc_pt2pt_callback -> start -> req_complete_cb
-> ... . At small scale this is fine as the recursion depth stays small but
at larger scale we can quickly exhaust the stack processing frag requests.
To fix the issue the request callback now simply puts the request on a
list and returns. The osc/pt2pt progress function then handles the
processing and reposting of the request.

As part of this change osc/pt2pt can now post multiple fragment receive
requests per window. This should help prevent a target from being overwhelmed.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>

(cherry picked from commit open-mpi/ompi@7589a25)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The original lock_all algorithm in osc/pt2pt sent a lock message to
each peer in the communicator even if the peer is never the target of
an operation. Since this scales very poorly the implementation has
been replaced by one that locks the remote peer on first communication
after a call to MPI_Win_lock_all.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@9444df1)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@ompiteam-bot ompiteam-bot added this to the v2.0.1 milestone Aug 16, 2016
@hjelmn
Copy link
Member Author

hjelmn commented Aug 16, 2016

This PR fixes 1) a stack smash from deep recursion, and 2) horribly bad lock-all scaling. Can be bumped to 2.0.2 if really necessary. Want this for 2.0.1 though.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2092/ for details.

@jsquyres
Copy link
Member

@regrant Can you please review this in the Real Near Future? 😄

@regrant
Copy link
Contributor

regrant commented Aug 16, 2016

@jsquyres you read my mind, I was reviewing this already, almost done.

int rc;

module->recv_frag_count = mca_osc_pt2pt_component.receive_count;
if (module->recv_frag_count < 0) {
Copy link
Contributor

@regrant regrant Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjelmn I don't see where/how this count can ever be negative. Am I missing a intialization somewhere? I searched and couldn't find it. So this might be dead code.

Did you mean module->recv_frag_count<=0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, good catch. The initialization didn't make it. The fun of transcribing patches. Fixing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, hold on. It is there. It can be negative because the user can be negative. That does need to be fixed.

ggouaillardet and others added 2 commits August 17, 2016 08:08
(cherry picked from commit open-mpi/ompi@8faa1ed)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This receive_count MCA variable should never be negative. Change it
to an unsigned int.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@40b7088)
@hjelmn
Copy link
Member Author

hjelmn commented Aug 17, 2016

@regrant Should be better now.

@regrant
Copy link
Contributor

regrant commented Aug 17, 2016

@hjelmn Looks good so far, once the checks complete successfully I'll mark as reviewed.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2097/ for details.

@jsquyres
Copy link
Member

Per @regrant 👍

@jsquyres
Copy link
Member

@hppritcha Gave a verbal "good to go"

@jsquyres jsquyres merged commit 0958bc4 into open-mpi:v2.x Aug 17, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants