-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osc/rdma: Fixes when running with btl/tcp #8756
Changes from all commits
5e3219f
34763df
95b1039
532cd9e
9dd1996
691037a
d57a944
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -428,10 +428,6 @@ static int ompi_osc_rdma_refresh_dynamic_region (ompi_osc_rdma_module_t *module, | |
} | ||
peer->regions = temp; | ||
|
||
/* lock the region */ | ||
ompi_osc_rdma_lock_acquire_shared (module, &peer->super, 1, offsetof (ompi_osc_rdma_state_t, regions_lock), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you say a word why you removed the lock? AFAICS, the owner of the region takes an exclusive lock in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm seeing a hang in this code path. From what I can see, because this process already has an exclusive lock on this peer, it can't acquire the shared lock (it thinks someone else owns it), and will loop for infinity trying to acquire this lock. To me this looks like a double-lock scenerio. @hjelmn can probably say if that is intended or not, perhaps there is a different fix needed for this case. |
||
OMPI_OSC_RDMA_LOCK_EXCLUSIVE); | ||
|
||
source_address = (uint64_t)(intptr_t) peer->super.state + offsetof (ompi_osc_rdma_state_t, regions); | ||
ret = ompi_osc_get_data_blocking (module, peer->super.state_btl_index, peer->super.state_endpoint, | ||
source_address, peer->super.state_handle, peer->regions, region_len); | ||
|
@@ -440,9 +436,6 @@ static int ompi_osc_rdma_refresh_dynamic_region (ompi_osc_rdma_module_t *module, | |
return ret; | ||
} | ||
|
||
/* release the region lock */ | ||
ompi_osc_rdma_lock_release_shared (module, &peer->super, -1, offsetof (ompi_osc_rdma_state_t, regions_lock)); | ||
|
||
/* update cached region ids */ | ||
peer->region_id = region_id; | ||
peer->region_count = region_count; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would't it be better to release the lock after
OBj_RELEASE(group)
, right before the return statement? Any chance multiple threads would compete to decrement the group's reference count?Maybe this is totally irrelevant, I do not know this codebase. In that case just ignore my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems reasonable to me. I'll make the change. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user provided group is unnecessary at this point, releasing it as early as possible seems correct.
In fact, the user provided group is only necessary early on while building the peers array. As we refcount each peer proc explicitly, technically we don't even need to keep a pointer to the group, because this leads to double refcounting the peers. From a performance perspective, I would rather keep and refcount only the group than refcounting each individual procs (because the procs are already refcounted by the group). We could save 2*peers atomic operations per epoch.