Parallel rank0 deadlock fixes #1183
Changes from 2 commits
@@ -304,8 +304,10 @@ H5AC__broadcast_candidate_list(H5AC_t *cache_ptr, unsigned *num_entries_ptr, had
      * are used to receiving from process 0, and also load it
      * into a buffer for transmission.
      */
-    if (H5AC__copy_candidate_list_to_buffer(cache_ptr, &chk_num_entries, &haddr_buf_ptr) < 0)
-        HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
+    if (H5AC__copy_candidate_list_to_buffer(cache_ptr, &chk_num_entries, &haddr_buf_ptr) < 0) {
+        /* Push an error, but still participate in following MPI_Bcast */
+        HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate buffer.")
+    }
     HDassert(chk_num_entries == num_entries);
     HDassert(haddr_buf_ptr != NULL);

Review comment: call to HDONE_ERROR before done: flag?

Review comment: Also, if we can't copy the candidate list to buffer, I expect we will crash and burn regardless. Thus I don't see much point in participating in the Bcast.

Reply: Yes, it will almost certainly crash. On the other hand, it will probably crash in a useful way from a debugging perspective. The alternative would be a hang, which seemed the worse of the two options, but I could be convinced either way.
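The hang these changes prevent has the same shape at every sync point in this file: rank 0 hits an error, HGOTO_ERROR jumps to done:, and rank 0 never reaches the MPI_Bcast that the other ranks are already blocked in. Below is a minimal, self-contained sketch of that failure mode and of the push-an-error-but-keep-going fix; the function and variable names are illustrative, not the actual HDF5 code.

    #include <mpi.h>

    /* Minimal illustration (not HDF5 code) of the hang being fixed: an early
     * exit on rank 0 skips a collective call that every other rank is
     * already blocked in.
     */
    static int
    sync_point(MPI_Comm comm, int rank0_failed)
    {
        int mpi_rank;
        int status = 0;

        MPI_Comm_rank(comm, &mpi_rank);

        if (mpi_rank == 0 && rank0_failed) {
            /* Old behavior (HGOTO_ERROR): jump straight to cleanup, leaving
             * the other ranks stuck forever in the MPI_Bcast below.
             *
             * New behavior (HDONE_ERROR): record the failure but fall
             * through, so the collective call still matches on all ranks.
             */
            status = -1;
        }

        /* Collective operation that every rank must reach */
        MPI_Bcast(&status, 1, MPI_INT, 0, comm);

        return status;
    }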
@@ -428,18 +430,23 @@ H5AC__broadcast_clean_list(H5AC_t *cache_ptr)

     /* allocate a buffer to store the list of entry base addresses in */
     buf_size = sizeof(haddr_t) * num_entries;
-    if (NULL == (addr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size)))
-        HGOTO_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for addr buffer")
-
-    /* Set up user data for callback */
-    udata.aux_ptr = aux_ptr;
-    udata.addr_buf_ptr = addr_buf_ptr;
-    udata.u = 0;
-
-    /* Free all the clean list entries, building the address list in the callback */
-    /* (Callback also removes the matching entries from the dirtied list) */
-    if (H5SL_free(aux_ptr->c_slist_ptr, H5AC__broadcast_clean_list_cb, &udata) < 0)
-        HGOTO_ERROR(H5E_CACHE, H5E_CANTFREE, FAIL, "Can't build address list for clean entries")
+    if (NULL == (addr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size))) {
+        /* Push an error, but still participate in following MPI_Bcast */
+        HDONE_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for addr buffer")
+    }
+    else {
+        /* Set up user data for callback */
+        udata.aux_ptr = aux_ptr;
+        udata.addr_buf_ptr = addr_buf_ptr;
+        udata.u = 0;
+
+        /* Free all the clean list entries, building the address list in the callback */
+        /* (Callback also removes the matching entries from the dirtied list) */
+        if (H5SL_free(aux_ptr->c_slist_ptr, H5AC__broadcast_clean_list_cb, &udata) < 0) {
+            /* Push an error, but still participate in following MPI_Bcast */
+            HDONE_ERROR(H5E_CACHE, H5E_CANTFREE, FAIL, "Can't build address list for clean entries")
+        }
+    }

     /* Now broadcast the list of cleaned entries */
     if (MPI_SUCCESS !=

Review comment: This is a case where the MPI implementation may crash when MPI_Bcast is called with a NULL addr_buf_ptr. At least, this seems to be the case for OpenMPI. However, if memory allocation fails here, there isn't a whole lot else we can do, and rank 0 still needs to participate in the Bcast in a similar manner as the other ranks.

Reply: Per comment above, I don't see much point in worrying about the Bcast if rank 0 can't create the buffer. If you want to recover, the better solution would be to allocate the buffer before we Bcast the number of entries in it, and pretend entry_count is zero if the malloc fails. That said, if rank 0 can't allocate a smallish buffer, the computation is hosed regardless.
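The reviewer's suggestion above is worth spelling out: instead of broadcasting a count that rank 0 may not be able to back with a buffer, allocate first and advertise a count of zero on failure. A rough sketch of that alternative follows; it is not what this PR implements, and the types and helper names are simplified stand-ins for the HDF5 ones.

    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch of the alternative suggested in review: rank 0 allocates the
     * address buffer before the entry count is broadcast and pretends the
     * count is zero if the allocation fails, so all ranks agree on an
     * (empty) list and no NULL buffer is ever handed to MPI_Bcast.
     */
    static int
    broadcast_clean_list_sketch(MPI_Comm comm, unsigned num_entries)
    {
        int       ret_value = 0;
        int       mpi_rank;
        uint64_t *addr_buf  = NULL;   /* stands in for the haddr_t buffer */

        MPI_Comm_rank(comm, &mpi_rank);

        if (mpi_rank == 0 && num_entries > 0 &&
            NULL == (addr_buf = malloc(num_entries * sizeof(uint64_t)))) {
            ret_value   = -1;   /* push an error ...                 */
            num_entries = 0;    /* ... and pretend the list is empty */
        }

        /* Every rank learns the (possibly zeroed) count */
        MPI_Bcast(&num_entries, 1, MPI_UNSIGNED, 0, comm);

        if (num_entries > 0) {
            if (mpi_rank != 0)
                addr_buf = malloc(num_entries * sizeof(uint64_t));
            /* (rank 0 would fill addr_buf here.)  Receiving-side allocation
             * failure is a separate problem, touched on in the
             * H5AC__receive_haddr_list hunk below; assume it succeeds here.
             */
            MPI_Bcast(addr_buf, (int)num_entries, MPI_UINT64_T, 0, comm);
        }

        free(addr_buf);
        return ret_value;
    }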
@@ -1448,8 +1455,10 @@ H5AC__receive_haddr_list(MPI_Comm mpi_comm, unsigned *num_entries_ptr, haddr_t *

     /* allocate buffers to store the list of entry base addresses in */
     buf_size = sizeof(haddr_t) * num_entries;
-    if (NULL == (haddr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size)))
-        HGOTO_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for haddr buffer")
+    if (NULL == (haddr_buf_ptr = (haddr_t *)H5MM_malloc(buf_size))) {
+        /* Push an error, but still participate in following MPI_Bcast */
+        HDONE_ERROR(H5E_CACHE, H5E_CANTALLOC, FAIL, "memory allocation failed for haddr buffer")
+    }

     /* Now receive the list of candidate entries */
     if (MPI_SUCCESS !=
@@ -1800,10 +1809,14 @@ H5AC__rsp__dist_md_write__flush_to_min_clean(H5F_t *f)

     if (evictions_enabled) {
         /* construct candidate list -- process 0 only */
-        if (aux_ptr->mpi_rank == 0)
+        if (aux_ptr->mpi_rank == 0) {
+            /* If constructing candidate list fails, push an error but still participate
+             * in collective operations during following candidate list propagation
+             */
             if (H5AC__construct_candidate_list(cache_ptr, aux_ptr, H5AC_SYNC_POINT_OP__FLUSH_TO_MIN_CLEAN) <
                 0)
-                HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate list.")
+                HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't construct candidate list.")
+        }

         /* propagate and apply candidate list -- all processes */
         if (H5AC__propagate_and_apply_candidate_list(f) < 0)
@@ -1899,15 +1912,21 @@ H5AC__rsp__p0_only__flush(H5F_t *f)
         aux_ptr->write_permitted = FALSE;

         /* Check for error on the write operation */
-        if (result < 0)
-            HGOTO_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't flush.")
-
-        /* this code exists primarily for the test bed -- it allows us to
-         * enforce POSIX semantics on the server that pretends to be a
-         * file system in our parallel tests.
-         */
-        if (aux_ptr->write_done)
-            (aux_ptr->write_done)();
+        if (result < 0) {
+            /* If write operation fails, push an error but still participate
+             * in collective operations during following cache entry
+             * propagation
+             */
+            HDONE_ERROR(H5E_CACHE, H5E_CANTFLUSH, FAIL, "Can't flush.")
+        }
+        else {
+            /* this code exists primarily for the test bed -- it allows us to
+             * enforce POSIX semantics on the server that pretends to be a
+             * file system in our parallel tests.
+             */
+            if (aux_ptr->write_done)
+                (aux_ptr->write_done)();
+        }
     } /* end if */

     /* Propagate cleaned entries to other ranks. */
@@ -2019,15 +2038,21 @@ H5AC__rsp__p0_only__flush_to_min_clean(H5F_t *f)
         aux_ptr->write_permitted = FALSE;

         /* Check for error on the write operation */
-        if (result < 0)
-            HGOTO_ERROR(H5E_CACHE, H5E_SYSTEM, FAIL, "H5C_flush_to_min_clean() failed.")
-
-        /* this call exists primarily for the test code -- it is used
-         * to enforce POSIX semantics on the process used to simulate
-         * reads and writes in t_cache.c.
-         */
-        if (aux_ptr->write_done)
-            (aux_ptr->write_done)();
+        if (result < 0) {
+            /* If write operation fails, push an error but still participate
+             * in collective operations during following cache entry
+             * propagation
+             */
+            HDONE_ERROR(H5E_CACHE, H5E_SYSTEM, FAIL, "H5C_flush_to_min_clean() failed.")
+        }
+        else {
+            /* this call exists primarily for the test code -- it is used
+             * to enforce POSIX semantics on the process used to simulate
+             * reads and writes in t_cache.c.
+             */
+            if (aux_ptr->write_done)
+                (aux_ptr->write_done)();
+        }
     } /* end if */

     if (H5AC__propagate_flushed_and_still_clean_entries_list(f) < 0)
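The two process-0-only flush routines above share the same sync-point shape: rank 0 performs the flush while the other ranks wait, then the set of cleaned entries is propagated collectively. A compact sketch of why the rank-0 branch must not bail out early; do_flush() and propagate_clean_entries() are hypothetical stand-ins, not HDF5 functions.

    #include <mpi.h>

    /* Hypothetical stand-ins for the flush and the collective propagation step */
    extern int do_flush(void);
    extern int propagate_clean_entries(MPI_Comm comm);

    static int
    p0_only_flush_sketch(MPI_Comm comm)
    {
        int ret_value = 0;
        int mpi_rank;

        MPI_Comm_rank(comm, &mpi_rank);

        if (mpi_rank == 0)
            if (do_flush() < 0)
                ret_value = -1;   /* record the error, do not skip the collective step */

        /* Collective: every rank, including a failing rank 0, must get here */
        if (propagate_clean_entries(comm) < 0)
            ret_value = -1;

        return ret_value;
    }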
@@ -1397,9 +1397,7 @@ H5CX_set_apl(hid_t *acspl_id, const H5P_libclass_t *libclass,

     /* If parallel is enabled and the file driver used is the MPI-IO
      * VFD, issue an MPI barrier for easier debugging if the API function
-     * calling this is supposed to be called collectively. Note that this
-     * happens only when the environment variable H5_COLL_BARRIER is set
-     * to non 0.
+     * calling this is supposed to be called collectively.
      */
     if (H5_coll_api_sanity_check_g) {
         MPI_Comm mpi_comm; /* File communicator */

Review comment: Just a minor documentation issue here I noticed while fixing these other issues: the H5_COLL_BARRIER environment variable doesn't exist.
@@ -1456,9 +1454,7 @@ H5CX_set_loc(hid_t

     /* If parallel is enabled and the file driver used is the MPI-IO
      * VFD, issue an MPI barrier for easier debugging if the API function
-     * calling this is supposed to be called collectively. Note that this
-     * happens only when the environment variable H5_COLL_BARRIER is set
-     * to non 0.
+     * calling this is supposed to be called collectively.
      */
     if (H5_coll_api_sanity_check_g) {
         MPI_Comm mpi_comm; /* File communicator */
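For context on the comment being corrected above: when the collective-API sanity check is enabled, these functions issue a barrier on the file's communicator purely as a debugging aid for API routines that are supposed to be called collectively. A tiny sketch of that pattern; the flag mirrors the role of H5_coll_api_sanity_check_g, the rest is illustrative.

    #include <mpi.h>
    #include <stdbool.h>

    /* Illustrative global flag, mirroring the role of H5_coll_api_sanity_check_g */
    static bool coll_api_sanity_check = false;

    /* When the sanity check is enabled and the API routine is supposed to be
     * collective, all ranks synchronize on the file communicator before
     * proceeding, which makes mismatched calls easier to spot in a debugger.
     */
    static void
    collective_api_sanity_barrier(MPI_Comm file_comm, bool api_is_collective)
    {
        if (coll_api_sanity_check && api_is_collective)
            MPI_Barrier(file_comm);
    }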
@@ -997,6 +997,9 @@ H5C__read_cache_image(H5F_t *f, H5C_t *cache_ptr)
 #endif /* H5_HAVE_PARALLEL */

         /* Read the buffer (if serial access, or rank 0 of parallel access) */
+        /* NOTE: if this block read is being performed on rank 0 only, throwing
+         * an error here will cause other ranks to hang in the following MPI_Bcast.
+         */
         if (H5F_block_read(f, H5FD_MEM_SUPER, cache_ptr->image_addr, cache_ptr->image_len,
                            cache_ptr->image_buffer) < 0)
             HGOTO_ERROR(H5E_CACHE, H5E_READERROR, FAIL, "Can't read metadata cache image block")

Review comment: For now, I'm just leaving a note here as properly fixing the potential hang is a little tricky and very messy with #ifdefs.
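The note added above concerns the same class of hang: in a parallel build only rank 0 reads the cache image and then broadcasts it, so an error path that returns before the broadcast strands the other ranks, and carrying the error across the broadcast cleanly is awkward inside the existing #ifdef blocks. A simplified sketch of that shape; block_read() is a hypothetical stand-in for H5F_block_read(), not the actual H5C__read_cache_image code.

    #include <mpi.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the underlying block read */
    extern int block_read(void *buf, size_t len);

    static int
    read_cache_image_sketch(MPI_Comm comm, void *image_buf, size_t image_len)
    {
        int mpi_rank;

        MPI_Comm_rank(comm, &mpi_rank);

        /* Only rank 0 touches the file ... */
        if (mpi_rank == 0)
            if (block_read(image_buf, image_len) < 0)
                return -1;   /* <-- other ranks are left waiting in the Bcast below */

        /* ... and every rank expects to receive the image collectively */
        if (MPI_SUCCESS != MPI_Bcast(image_buf, (int)image_len, MPI_BYTE, 0, comm))
            return -1;

        return 0;
    }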
Review comment: I was of the impression that HDONE_ERROR is only used after the done: tag. Is this correct?

Reply: In general, yes. However, the main upside to the macro is just that it doesn't do the typical goto. It still pushes an error message to the stack, sets the err_occurred variable, and sets the return value. The HERROR macro only pushes an error message to the stack, without setting those needed variables. The HCOMMON_ERROR macro might be a good alternative, but the header comment for it says that it shouldn't need to be used outside H5Eprivate.h.

Review comment: If I understand what is going on here, you are trying to flag an error but not jump to done so as to participate in the upcoming sync point. Using HDONE_ERROR doesn't seem the right way to do this. For starters, we have comments expressly stating that HDONE_ERROR is only to be used after the done: flag. Breaking this rule without explanation is bound to cause confusion. I note that we do this already in H5Fint.c in H5F_flush_phase1, H5F_flush_phase2, and H5F_dest. However, we add the comment "/* Push error, but keep going */". At a minimum, we should have this. It would be good to explain why. One could also argue that we should only be using HDONE_ERROR in this context in parallel builds. I'm not sure that this is a good idea, but it might be worth thinking about.

Reply: In general, I tried to add a comment to all these places to the same effect, mentioning that we push an error but keep participating in the collective operation. If any of those comments aren't clear on that, please let me know so I can clarify. I do agree that HDONE_ERROR isn't necessarily meant for this, but I believe an alternative macro would be implemented exactly the same way, just with a different name. Perhaps "DONE_ERROR" was not the best name, because there should be no harm in calling it outside the done: tag. It seems to me the reasoning was just that calling HGOTO_ERROR after the done: tag would of course result in an infinite loop.
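Since the thread above turns on what the macros actually do, a greatly simplified illustration of the control-flow difference may help. The real macros live in H5Eprivate.h and push full records onto the HDF5 error stack; the bodies below are deliberately reduced to the behavior discussed in the thread and are not the real definitions.

    #include <stdio.h>

    /* Simplified stand-in for pushing a record onto the error stack */
    static void push_error(const char *msg) { fprintf(stderr, "ERROR: %s\n", msg); }

    /* HDONE_ERROR-like: record the failure and fall through            */
    #define SKETCH_HDONE_ERROR(err, msg) do { push_error(msg); ret_value = (err); } while (0)
    /* HGOTO_ERROR-like: record the failure and jump to the done: label */
    #define SKETCH_HGOTO_ERROR(err, msg) do { push_error(msg); ret_value = (err); goto done; } while (0)

    static int
    example(int fail_early)
    {
        int ret_value = 0;

        if (fail_early)
            SKETCH_HDONE_ERROR(-1, "record the failure but keep going");

        /* ... collective MPI calls that every rank must still reach ... */

        if (ret_value < 0)
            SKETCH_HGOTO_ERROR(-1, "now it is safe to bail out");

    done:
        return ret_value;
    }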