Potential issue in osc/sm #10175
Comments
This looks more like a question about the info subsystem than osc, and I know nothing about the info subsystem, so I'm not the right person to own this ticket. @hjelmn and @jsquyres may be able to help. It looks like (ignoring the info questions), the SM OSC component is working properly. Unless the SM component detects that the info key "alloc_shared_noncontig" is "true", the component will allocate one shared memory segment of size SUM(rank_sizes), and the baseptr for every process will be non-NULL, because the algorithm for assigning base pointers is essentially:
However, when the non-contig case is true, the SM component is rightly not setting all baseptrs to non-NULL, because every rank is locally allocating memory (and calling malloc(0) generally causes unhappiness). So this really comes down to a usage behavior in info retrieval.
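To make the two cases above concrete, here is a minimal sketch of how base pointers could be assigned. It is not the osc/sm source: `assign_bases` and its parameters are hypothetical names, and any padding or alignment applied in the non-contiguous case is left out. The condition mirrors the one described in this thread: a rank gets a pointer when its size is non-zero or when `noncontig` is false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only, not Open MPI code).
 * segment:   start of the shared mapping
 * sizes[i]:  window size requested by rank i
 * noncontig: value of the "alloc_shared_noncontig" info key
 * bases[i]:  output, base pointer reported for rank i
 */
void assign_bases(char *segment, const size_t *sizes, int nranks,
                  bool noncontig, char **bases)
{
    size_t offset = 0;

    assert(segment != NULL);
    for (int i = 0; i < nranks; ++i) {
        if (sizes[i] > 0 || !noncontig) {
            /* Contiguous case: every rank, even a zero-size one,
             * gets a pointer at its offset into the one segment. */
            bases[i] = segment + offset;
            offset += sizes[i];
        } else {
            /* Non-contiguous case with a zero-size window. */
            bases[i] = NULL;
        }
    }
}
```

With sizes {8, 0, 8} and `noncontig` false, the zero-size rank shares the address of the rank that follows it; with `noncontig` true, its base pointer is NULL.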
The registered callback The correct way in this case is to check the info key first and then subscribe to it, e.g., like in osc/ucx: https://github.com/open-mpi/ompi/blob/main/ompi/mca/osc/ucx/osc_ucx_component.c#L404, followed by the info subscription in https://github.com/open-mpi/ompi/blob/main/ompi/mca/osc/ucx/osc_ucx_component.c#L542
@devreal is there actually any reason to subscribe to an info key like "alloc_shared_noncontig"? The behavior in question can't change once WIN_ALLOCATE_SHARED returns, since the memory is allocated at that point. Can't we simplify by just checking the value, doing our allocations, and moving on? Is there anything we should be doing to reject an attempt to set an info key after the window is created in this case?
Good question. Looking at the infosubscriber code, it seems like any attempt to update the key is ignored if there are no subscribers. Dropping the subscription in osc/sm should be safe then: the key won't be changed.
Remove the info subscribe for both the alloc_shared_noncontig and blocking_fence info keys. This change was inspired by issue open-mpi#10175, which highlighted that we were not properly following the non-contig info key (our behavior was standards compliant, but not particularly helpful), because the info subscription was overwriting the provided value. In the investigation, it became clear that there is really no advantage to subscribing to a key that can't be changed, so drop the subscription code for the two keys that can't be changed and fix a bug and remove code all at the same time. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Remove the info subscribe for both the alloc_shared_noncontig and blocking_fence info keys. This change was inspired by issue open-mpi#10175, which highlighted that we were not properly following the non-contig info key (our behavior was standards compliant, but not particularly helpful), because the info subscription was overwriting the provided value. In the investigation, it became clear that there is really no advantage to subscribing to a key that can't be changed, so drop the subscription code for the two keys that can't be changed and fix a bug and remove code all at the same time. Signed-off-by: Brian Barrett <bbarrett@amazon.com> (cherry picked from commit 38940b3)
The PRs are merged. I believe this can be closed?
During one-sided debugging we observed strange behavior with ompi-tests/ibm/onesided/c_win_shared_noncontig_put when running with ppn > 1.
The issue happens when ranks call Win_shared_query() on windows of size 0 and check the returned base_ptr.
Interestingly, it seems that in the osc/sm code, the following calls may impact the `module->noncontig` variable, which is used in a conditional statement that sets `base_ptr`:

ompi/ompi/mca/osc/sm/osc_sm_component.c, line 222 in 9d94a14

ompi/ompi/mca/osc/sm/osc_sm_component.c, lines 276 to 280 in 9d94a14

After line 277, `noncontig` will be `false`, so the condition in the following lines will evaluate to `true`, and the `base_ptr` is set:

ompi/ompi/mca/osc/sm/osc_sm_component.c, lines 369 to 374 in 9d94a14

However, if we call `opal_info_get_bool()` before `opal_infosubscribe_subscribe()`, the `noncontig` flag will remain `true`, and the `base_ptr` will not be set. To be clear, we're not removing the original `opal_info_get_bool()` on line 277, only adding an extra call prior to the subscribe call, and the flag will remain unchanged.
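The ordering effect described here can be illustrated with a toy model. This is a sketch under loud assumptions: `toy_info`, `toy_subscribe`, and `toy_get_bool` are hypothetical stand-ins, not the `opal_info` or `opal_infosubscribe` API, and the model reduces the reported behavior to "subscribing replaces the stored value with the default passed to the subscription".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy info object holding one key's value; NOT the opal_info type. */
struct toy_info {
    char value[16];
};

/* Hypothetical subscribe: models only the reported side effect,
 * i.e. the stored value is replaced by the subscription default. */
void toy_subscribe(struct toy_info *info, const char *default_value)
{
    assert(info != NULL && default_value != NULL);
    strncpy(info->value, default_value, sizeof(info->value) - 1);
    info->value[sizeof(info->value) - 1] = '\0';
}

/* Hypothetical getter: true only when the stored value is "true". */
bool toy_get_bool(const struct toy_info *info)
{
    assert(info != NULL);
    return strcmp(info->value, "true") == 0;
}

/* Order from the report: subscribe first, then read the flag. */
bool noncontig_subscribe_then_read(struct toy_info *info)
{
    toy_subscribe(info, "false");
    return toy_get_bool(info);
}

/* Suggested order: read the user's value first, then subscribe. */
bool noncontig_read_then_subscribe(struct toy_info *info)
{
    bool noncontig = toy_get_bool(info);
    toy_subscribe(info, "false");
    return noncontig;
}
```

Starting from a stored value of "true", the first ordering yields `false` and the second yields `true`, which matches the difference in `noncontig` reported above. Whether the real subscription overwrites the key in exactly this way is not shown in this thread.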