Propagate the error from the generalized request free callback to the user #11683
The behavior in the case of the user's function returning non-SUCCESS is a little odd:
Meaning: if the user calls the freeing function a 2nd time (e.g., MPI_REQUEST_FREE):
Is that intended? |
It is what makes sense to me. I assume that calling the free function a second time would produce the same outcome as the first call (i.e., return an error), and this would result in the resource never being released. With the approach implemented here, the second call to free goes directly into our object management (bypassing the user function) and releases the OMPI objects. At the same time, the user request will be set to |
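To make the two-step behavior described above concrete, here is a minimal, MPI-free sketch. Every name in it (`mock_greq`, `mock_greq_start`, `mock_greq_free`, `MOCK_SUCCESS`, `MOCK_ERR_OTHER`, `failing_free`) is hypothetical and only stands in for the real Open MPI objects; it is an illustration of the idea under discussion, not the code in grequest.c. A `NULL` handle plays the role of `MPI_REQUEST_NULL`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT Open MPI types or constants. */
enum { MOCK_SUCCESS = 0, MOCK_ERR_OTHER = 1 };

typedef int (*mock_free_fn)(void *extra_state);

typedef struct mock_greq {
    mock_free_fn user_free;  /* user-supplied free callback, may be NULL */
    void *extra_state;       /* opaque pointer handed back to the callback */
} mock_greq;

/* Allocate a mock request; returns NULL if the allocation fails. */
static mock_greq *mock_greq_start(mock_free_fn fn, void *extra_state)
{
    mock_greq *r = malloc(sizeof(*r));
    if (r != NULL) {
        r->user_free = fn;
        r->extra_state = extra_state;
    }
    return r;
}

/* Two-step free: if the user callback fails, keep the request alive but
 * clear the callback, so that a second call bypasses the user function
 * and only releases the library-side object. */
static int mock_greq_free(mock_greq **req)
{
    int rc = MOCK_SUCCESS;

    assert(req != NULL && *req != NULL);
    if ((*req)->user_free != NULL) {
        rc = (*req)->user_free((*req)->extra_state);
    }
    if (rc == MOCK_SUCCESS) {
        free(*req);
        *req = NULL;               /* plays the role of MPI_REQUEST_NULL */
    } else {
        (*req)->user_free = NULL;  /* never invoke the callback twice */
    }
    return rc;
}

/* Example callback that always fails and counts how often it ran. */
static int failing_free(void *extra_state)
{
    int *calls = extra_state;
    ++*calls;
    return MOCK_ERR_OTHER;
}
```

With `failing_free` installed, the first `mock_greq_free` call returns the callback's error and leaves the handle valid; the second call returns success, resets the handle, and does not run the callback again.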
@bosilca Gotcha. I think that this is new and interesting behavior -- and I think it's valid behavior for an MPI implementation (i.e., call REQUEST_FREE more than once on the same request when an error occurs). I guess the app usage would need to be something like:
// Or the equivalent in sessions
MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
int err = MPI_Request_free(&req);
if (MPI_SUCCESS != err && req_is_generalized_request) {
    // Try again, because we might be in the case where
    // the user-defined free function failed
    err = MPI_Request_free(&req);
}
if (MPI_SUCCESS != err) {
    // handle error
}
At a minimum, we'd need to document this in the MPI_REQUEST_FREE man page. This would set a new precedent for how to handle errors; are you thinking that this is a strategic direction in which Open MPI should go (w.r.t. handling errors)? |
@jsquyres the code is correct for
// Or the equivalent in sessions
MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
int err = MPI_Wait(&req, MPI_STATUS_IGNORE);
if (MPI_SUCCESS != err && req_is_generalized_request) {
    // Try again, because we might be in the case where
    // the user-defined free function failed
    err = MPI_Request_free(&req);
}
if (MPI_SUCCESS != err) {
    // handle error
}
It does set a precedent in the sense that for generalized requests it gives us a way to release OMPI resources, something we are totally lacking today. 👍 for the documentation. |
Ok. My example was calling Your example called Did you mean to call |
Yes, the 2nd call should always be a call to |
Should we return a specific error code to indicate that the reason the 1st completion function failed was because the user's free function failed? E.g., we could fail the 1st completion function for a different reason, and it may not be appropriate to call MPI_REQUEST_FREE. Perhaps something like this:
// Or the equivalent in sessions
MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
int err = MPI_Wait(&req, MPI_STATUS_IGNORE);
if (MPIX_GREQUEST_USER_FREE_FUNC_FAILED == err) {
    // Try again, because we **are** in the case where
    // the user-defined free function failed
    err = MPI_Request_free(&req);
}
if (MPI_SUCCESS != err) {
    // handle error
} |
I am sure that would not be compliant with the current MPI standard. Read |
I'm not disagreeing there -- I'm just wondering if we should return a specific error code so that users can tell that this specific error case is exactly what happened, and that they therefore should call MPI_REQUEST_FREE to actually free the resources. That being said, if what this PR is doing is:
Is there a reason we don't just tell the user that some error occurred in the completion function, and also release the resources? I.e., why force the 2nd step? Specifically, instead of:
if (OMPI_SUCCESS == rc) {
    OBJ_RELEASE(*req);
    *req = MPI_REQUEST_NULL;
} else {
    /* Make sure we will not be calling the grequest free function
     * a second time when we release the request.
     */
    greq->greq_free.c_free = NULL;
}
return rc;
have this:
OBJ_RELEASE(*req);
*req = MPI_REQUEST_NULL;
return rc;
Put differently: is there something to be gained by forcing the user to call MPI_REQUEST_FREE? |
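For comparison, the one-step alternative proposed above (always release the request, but still return the callback's error) can be sketched without MPI as follows. All identifiers (`sketch_req`, `sketch_req_start`, `sketch_req_free_once`, `SKETCH_SUCCESS`, `SKETCH_ERR_OTHER`, `counting_failing_free`) are hypothetical stand-ins rather than Open MPI names, and a `NULL` handle again plays the role of `MPI_REQUEST_NULL`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT Open MPI types or constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_OTHER = 1 };

typedef int (*sketch_free_fn)(void *extra_state);

typedef struct sketch_req {
    sketch_free_fn user_free;  /* user-supplied free callback, may be NULL */
    void *extra_state;         /* opaque pointer handed back to the callback */
} sketch_req;

/* Allocate a mock request; returns NULL if the allocation fails. */
static sketch_req *sketch_req_start(sketch_free_fn fn, void *extra_state)
{
    sketch_req *r = malloc(sizeof(*r));
    if (r != NULL) {
        r->user_free = fn;
        r->extra_state = extra_state;
    }
    return r;
}

/* One-step free: run the user callback exactly once, release the
 * library-side object regardless of the outcome, and propagate the
 * callback's return code to the caller. */
static int sketch_req_free_once(sketch_req **req)
{
    int rc = SKETCH_SUCCESS;

    assert(req != NULL && *req != NULL);
    if ((*req)->user_free != NULL) {
        rc = (*req)->user_free((*req)->extra_state);
    }
    free(*req);
    *req = NULL;  /* plays the role of MPI_REQUEST_NULL */
    return rc;
}

/* Example callback that always fails and counts how often it ran. */
static int counting_failing_free(void *extra_state)
{
    int *calls = extra_state;
    ++*calls;
    return SKETCH_ERR_OTHER;
}
```

In this variant the caller sees the error and a reset handle after a single call, so there is nothing left to retry; the trade-off is that the caller cannot inspect the request after a failed callback.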
|
@bosilca I did a little testing; I'm not sure this patch is right. I first tried to find out what MPICH does:
#include <stdio.h>
#include <mpi.h>

static int query_fn (void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int free_fn (void *ctx) { return MPI_ERR_OTHER; } // <-- RETURN WITH FAILURE !!!
static int cancel_fn (void *ctx, int c) { return MPI_SUCCESS; }

static void test1(void)
{
    int ret;
    MPI_Status status;
    MPI_Request request;
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);
    MPI_Grequest_complete(request);
    ret = MPI_Wait(&request, &status);
    printf("Test 1: ret=%d, request==REQUEST_NULL: %d\n",
           ret, request == MPI_REQUEST_NULL);
}

static void test2(void)
{
    int ret;
    MPI_Request request, req_copy;
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);
    req_copy = request;
    ret = MPI_Request_free(&request);
    printf("Test 2: MPI_Request_free: ret=%d, request==REQUEST_NULL: %d\n",
           ret, request == MPI_REQUEST_NULL);
    ret = MPI_Grequest_complete(req_copy);
    printf("Test 2: MPI_Grequest_complete: ret=%d\n",
           ret);
}

static void test3(void)
{
    int ret;
    MPI_Status status;
    MPI_Request request, req_copy;
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);
    req_copy = request;
    ret = MPI_Grequest_complete(request);
    printf("Test 3: MPI_Grequest_complete: ret=%d\n",
           ret);
    ret = MPI_Request_free(&request);
    printf("Test 3: MPI_Request_free: ret=%d, request==REQUEST_NULL: %d\n",
           ret, request == MPI_REQUEST_NULL);
    ret = MPI_Wait(&request, &status);
    printf("Test 3: MPI_Wait: ret=%d, request==REQUEST_NULL: %d\n",
           ret, request == MPI_REQUEST_NULL);
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
    test1();
    test2();
    test3();
    MPI_Finalize();
    return 0;
}
Here's the output I get:
In the first test, we do get In test 2, I'm not sure what it means that it apparently called the generalized free function before I called Test 3 is different from test 2 only because it calls Grequest_complete before Request_free. But we still see that Hence, we're seeing different behavior here:
|
Make sure to reset the generalized request to guarantee not to call the free callback a second time. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
a67ac80 to ac3647e
Did some updates, but I'm still puzzled by the intent of the generalized requests. In any case, @jsquyres I don't think your test3 is legal. The standard clearly states that once
|
@bosilca and I chatted about this on the phone. We're pretty convinced that this latest commit is correct. @bosilca is going to add a comment in grequest.c to explain a subtlety in the error path of ompi_grequest_free(), which inadvertently led to the lengthy discussion about error handling.
The whole conversation prior to this about the user needing to call MPI_REQUEST_FREE after an error occurs is now moot (i.e., it is not necessary). So there's really no new handling of errors here, no new precedent, etc. It's just a subtlety in how the base request is actually freed in the error path. @bosilca's comment will explain.
I'll approve when the new commit gets here with the comment.
bot:ibm:retest |
Hey @bosilca -- can you add the comment as was described in #11683 (review)? Then we can get this PR merged. |
Sadly, neither @bosilca nor I remember what the subtle issue is/was 😦 and we kinda need this PR now. Sooo... let's merge. @bosilca said in Slack:
So there's @bosilca's promise to figure it out if we need it again. 😉 |
Make sure to reset the generalized request to guarantee not to call the free callback a second time.
Fixes #11681.