Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opal hotel: only delete events that have not yet fired #1291

Merged
merged 1 commit into from
Jan 13, 2016

Conversation

jsquyres
Copy link
Member

@jsquyres jsquyres commented Jan 8, 2016

The opal_hotel_checkout() function can be called manually or it can be invoked by the eviction callback. When it is invoked by the eviction callback, we do not want to opal_event_del() the eviction callback.

@rhc54 Please review -- I think that this is a bug in the hotel code.

@jsquyres jsquyres added the bug label Jan 8, 2016
@jsquyres jsquyres added this to the v1.10.3 milestone Jan 8, 2016
@annu13
Copy link
Contributor

annu13 commented Jan 8, 2016

@jsquyres - The same change is also needed in function: opal_hotel_checkout_and_return_occupant.

@jsquyres
Copy link
Member Author

jsquyres commented Jan 8, 2016

@annu13 Good catch -- I had that in another branch, but missed it here in the official PR. Thanks! Fixing now...

@rhc54
Copy link
Contributor

rhc54 commented Jan 8, 2016

@jsquyres Just to be clear: we don't want to event_del the event because libevent automatically deletes it upon callback, yes?

@jsquyres
Copy link
Member Author

jsquyres commented Jan 8, 2016

@rhc54 I'm 99% certain of this -- I saw weird things when I was testing some usnic stuff last night and noticed this. I couldn't fully trace it through libevent to confirm the issue, but when I added this check, it seemed to fix the issue. I had hoped you would look at this (since you're much more familiar with libevent than me) and provide a sanity check/confirmation in the form of "Oh, right! We should absolutely not be deleting the event from that event's own callback!" 😄 And also confirm that opal_event_evtimer_pending() is the Right way to check for that... 😄 😄

@rhc54
Copy link
Contributor

rhc54 commented Jan 8, 2016

@jsquyres I should have been clearer. libevent definitely deletes the event once the callback returns, so you should never delete an event from within a callback. I didn't realize that opal_hotel was doing so or I would have flagged it.

I'll take a closer look a little later, once I've put out some other fires.

@hppritcha
Copy link
Member

@jsquyres this needs to go in to 2.0.0?

@rhc54
Copy link
Contributor

rhc54 commented Jan 13, 2016

@jsquyres I've been scratching my head over this one, and think there may be something more subtle here that needs to be clarified. Because we have an async thread that is progressing the event thread, and we have an eviction timer in that thread, you cannot just call a function and modify a room occupant because the thread can fire at any time. You can only do it from inside an event.

So if we are inside an event, then we can only delete the eviction timer if we are not in the eviction event itself. Since we have a specific callback function for eviction, you automatically know if you are in that function, and thus you know that you are in the eviction event as opposed to some other one. The only time this would be confusing is if you pointed your manual-remove event to the same function as the eviction event - and that would be a really bad design decision.

I guess my bottom-line is that I don't think the hotel class itself should be deleting any event as it just isn't safe. You need to ensure that you only remove an occupant while in an event, and you should ensure that you point the manual-remove event to a different function than the eviction event so you know what event can/should be deleted.

@jsquyres
Copy link
Member Author

I think the issue here is that the hotel currently isn't async-thread safe...

In the usnic BTL use case, the hotel evbase is the synchronous thread, so it's all safe.

We should probably talk on the phone about the implications of an async thread evbase / multiple threads interacting with a hotel. We might either need a per-hotel-instance switch to enable/disable locks on a hotel, or perhaps a 2 hotel classes: one with locks, and one without...? Fodder for discussion.

@jsquyres
Copy link
Member Author

@rhc54 and I talked on the phone. Current solution on this PR isn't correct: the _pending() call does not guarantee that this thread is the one that is in the callback for this event. I.e., there's no public way to know that. The simplest solution is likely to copy the relevant checkout bookkeeping code into the eviction callback glue. I'll update this PR.

The eviction callback, for convenience (and to avoid code
duplication), use to call opal_hotel_checkout().  However,
opal_hotel_checkout() deletes the eviction event -- which is fine to
do when opal_hotel_checkout() is invoked by the application.  But when
it's invoked by the same event that it's deleting, it can cause Bad
Things to happen.

For simplicity, instead of invoking opal_hotel_checkout() from the
eviction callback, just duplicate the checkout logic into the eviction
callback function (and skip the delete-the-evict-event part).

For good measure, put a comment in all three places where the checkout
logic occurs (because it's inlined): don't change this logic without
changing all 3 places.

Finally, also add a line in the docs for opal_hotel_init() warning
users from calling opal_hotel_checkout() from their eviction
callback.
@jsquyres
Copy link
Member Author

@rhc54 Please review the new commit I just pushed. Thanks.

@rhc54
Copy link
Contributor

rhc54 commented Jan 13, 2016

👍 thx!

jsquyres added a commit that referenced this pull request Jan 13, 2016
opal hotel: only delete events that have not yet fired
@jsquyres jsquyres merged commit e5cf2db into open-mpi:master Jan 13, 2016
@jsquyres jsquyres deleted the pr/hotel-fix branch January 13, 2016 19:51
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 25, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 26, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 28, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 29, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 30, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett added a commit to bwbarrett/ompi that referenced this pull request Mar 31, 2022
PMIx Changes:
fbc2513 build: Delay adding DELAYED_LIBS even more
a83d564 Minor correction to client-get logic
4aa95ff In the case where the server is using a different dstore component, check that as well.
8a41b35 Remove the "<option> help" support
76850f6 Fix signed comparison warning
65df23e Ensure we check peer's gds for data
b13e7d4 Silence Coverity warning
2a34472 build: Clean up delayed flags
a1a5935 build: Use MCA env management for common/sse
e15de4b build: Wrapper/pkg-config improvements
680326c build: check_package static improvements
330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream
a819eeb build: Fix output stream bug in summary

PRRTE Changes:
c49eafc3ac Protect against proxy confusion
86bddb2abf Add some missing help verbiage
4e9029ccea slurm: fix breakage owing to rlm refactor
514553dedf Correctly determine when to daemonize backend prted
0db143da11 Some really minor cleanups
c784c22947 build: check_package static improvements
e0810859d7 Add missing CLI option and parsing
a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream
94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py
fabb57117b build: Fix output stream bug in summary
17e6d86cbf Fix a problem with the "canonicalize_path" function

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants