progression discussion [work in progress] #3640

Closed

@markalle markalle commented Jun 1, 2017

We have a progression issue in openib when we're out of resources and
a completion function wants to send another message.

Currently the completion function blocks (spins calling progression),
expecting its send to happen eventually. But in the out-of-resources case
the send is queued, and the resources it needs won't be freed until we
return from the completion function, because only then can processing
continue through the rest of the completed work list.

I can only see three categories of fix:

  1. don't let completion functions block like that
  2. allow recursive handle_wc() while already inside a completion function
  3. have some reserved resources for completion functions only so they can always complete
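As a rough illustration of the third category, here is a minimal sketch of a credit pool that holds back a few credits for completion functions. Everything in it (`credit_pool_t`, `credit_acquire`, `credit_release`, the notion of "credits") is hypothetical and invented for this example; it is not how the openib BTL tracks its resources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical credit pool; not an openib BTL structure. */
typedef struct {
    int capacity;   /* total number of credits */
    int available;  /* credits currently free */
    int reserved;   /* credits that only completion functions may take */
} credit_pool_t;

/* Try to take one send credit. Ordinary senders must leave `reserved`
 * credits untouched; completion functions may drain the pool to zero. */
static bool credit_acquire(credit_pool_t *pool, bool in_completion)
{
    int floor = in_completion ? 0 : pool->reserved;
    if (pool->available <= floor) {
        return false;  /* nothing left for this kind of caller */
    }
    pool->available--;
    return true;
}

/* Return one credit to the pool. */
static void credit_release(credit_pool_t *pool)
{
    assert(pool->available < pool->capacity);
    pool->available++;
}
```

A reserve like this only helps if a completion function never needs more credits than are held back, so the reserve size would have to be chosen with that in mind.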

This is an example implementation of option 2. It includes a conditional
recursion that gets less frequent the deeper the call stack becomes.
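One possible way to make recursion rarer at greater depth is sketched below. The policy shown (recurse on every 2^depth-th spin iteration, with a hard depth cap) is an assumption made for illustration and is not necessarily what this PR does; `should_recurse` and `MAX_RECURSION_DEPTH` are hypothetical names.

```c
#include <stdbool.h>

#define MAX_RECURSION_DEPTH 8  /* hypothetical cap, not taken from the PR */

/* Decide whether a blocked completion function at nesting level `depth`
 * should re-enter completion handling on its `spin_count`-th spin
 * iteration. At depth d only every 2^d-th iteration recurses, so deeper
 * call stacks recurse less often, and past the cap they never recurse. */
static bool should_recurse(unsigned depth, unsigned long spin_count)
{
    if (depth >= MAX_RECURSION_DEPTH) {
        return false;
    }
    unsigned long period = 1UL << depth;
    return (spin_count % period) == 0;
}
```

With this policy a top-level completion function (depth 0) recurses on every iteration, while one nested three levels deep recurses on one iteration in eight.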

The biggest risk I see for option 2 is that, as written, there is no longer a
strict ordering of completion functions. They would still be initiated in the
same order they are seen in ibv_poll_cq(), but you can't be sure WC1.cbfunc
has finished before WC2.cbfunc starts. I'm not sure whether that's a
requirement for correctness.
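The ordering concern can be seen in a toy model with no InfiniBand calls at all. In the sketch below, `toy_handle_wc`, `toy_cbfunc`, `toy_run` and the `credit_freed` flag are all hypothetical stand-ins: the callback for WC1 waits for a resource that only WC2's callback frees, and re-enters completion handling while it waits.

```c
#include <assert.h>
#include <stddef.h>

enum { EV_START, EV_END };
typedef struct { int wc; int kind; } event_t;

#define NUM_WC 2
static event_t trace[2 * NUM_WC];  /* one start and one end event per WC */
static size_t trace_len = 0;
static size_t next_wc = 0;         /* number of completions dispatched so far */
static int credit_freed = 0;       /* set by WC2's callback */

static void record(int wc, int kind)
{
    assert(trace_len < 2 * NUM_WC);
    trace[trace_len].wc = wc;
    trace[trace_len].kind = kind;
    trace_len++;
}

static void toy_handle_wc(void);

/* Toy completion function. WC1 needs a resource that only WC2 frees. */
static void toy_cbfunc(int wc)
{
    record(wc, EV_START);
    if (wc == 1) {
        while (!credit_freed && next_wc < NUM_WC) {
            toy_handle_wc();  /* re-enter instead of spinning forever */
        }
    } else {
        credit_freed = 1;
    }
    record(wc, EV_END);
}

/* Dispatch the next completion, if any. Completions are numbered from 1. */
static void toy_handle_wc(void)
{
    if (next_wc < NUM_WC) {
        int wc = (int)(++next_wc);
        toy_cbfunc(wc);
    }
}

/* Drain the whole toy work list. */
static void toy_run(void)
{
    while (next_wc < NUM_WC) {
        toy_handle_wc();
    }
}
```

After `toy_run()`, the trace reads WC1 start, WC2 start, WC2 end, WC1 end: the callbacks begin in work-list order, but WC2's callback completes while WC1's is still on the stack.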

Signed-off-by: Mark Allen <markalle@us.ibm.com>


markalle commented Jun 1, 2017

The issue report for this PR, including a testcase and some discussion, is here:
#3616

@markalle markalle closed this Jun 12, 2017