-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opal hotel: only delete events that have not yet fired #1291
Conversation
@jsquyres - The same change is also needed in function: opal_hotel_checkout_and_return_occupant. |
@annu13 Good catch -- I had that in another branch, but missed it here in the official PR. Thanks! Fixing now... |
@jsquyres Just to be clear: we don't want to event_del the event because libevent automatically deletes it upon callback, yes? |
@rhc54 I'm 99% certain of this -- I saw weird things when I was testing some usnic stuff last night and noticed this. I couldn't fully trace it through libevent to confirm the issue, but when I added this check, it seemed to fix the issue. I had hoped you would look at this (since you're much more familiar with libevent than me) and provide a sanity check/confirmation in the form of "Oh, right! We should absolutely not be deleting the event from that event's own callback!" 😄 And also confirm that |
@jsquyres I should have been clearer. libevent definitely deletes the event once the callback returns, so you should never delete an event from within a callback. I didn't realize that opal_hotel was doing so or I would have flagged it. I'll take a closer look a little later, once I've put out some other fires. |
@jsquyres this needs to go in to 2.0.0? |
@jsquyres I've been scratching my head over this one, and think there may be something more subtle here that needs to be clarified. Because we have an async thread that is progressing the event thread, and we have an eviction timer in that thread, you cannot just call a function and modify a room occupant because the thread can fire at any time. You can only do it from inside an event. So if we are inside an event, then we can only delete the eviction timer if we are not in the eviction event itself. Since we have a specific callback function for eviction, you automatically know if you are in that function, and thus you know that you are in the eviction event as opposed to some other one. The only time this would be confusing is if you pointed your manual-remove event to the same function as the eviction event - and that would be a really bad design decision. I guess my bottom-line is that I don't think the hotel class itself should be deleting any event as it just isn't safe. You need to ensure that you only remove an occupant while in an event, and you should ensure that you point the manual-remove event to a different function than the eviction event so you know what event can/should be deleted. |
I think the issue here is that the hotel currently isn't async-thread safe... In the usnic BTL use case, the hotel evbase is the synchronous thread, so it's all safe. We should probably talk on the phone about the implications of an async thread evbase / multiple threads interacting with a hotel. We might either need a per-hotel-instance switch to enable/disable locks on a hotel, or perhaps a 2 hotel classes: one with locks, and one without...? Fodder for discussion. |
@rhc54 and I talked on the phone. Current solution on this PR isn't correct: the |
The eviction callback, for convenience (and to avoid code duplication), use to call opal_hotel_checkout(). However, opal_hotel_checkout() deletes the eviction event -- which is fine to do when opal_hotel_checkout() is invoked by the application. But when it's invoked by the same event that it's deleting, it can cause Bad Things to happen. For simplicity, instead of invoking opal_hotel_checkout() from the eviction callback, just duplicate the checkout logic into the eviction callback function (and skip the delete-the-evict-event part). For good measure, put a comment in all three places where the checkout logic occurs (because it's inlined): don't change this logic without changing all 3 places. Finally, also add a line in the docs for opal_hotel_init() warning users from calling opal_hotel_checkout() from their eviction callback.
5de4ab8
to
270cc11
Compare
@rhc54 Please review the new commit I just pushed. Thanks. |
👍 thx! |
opal hotel: only delete events that have not yet fired
coll/hcoll mpi datatypes support
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
PMIx Changes: fbc2513 build: Delay adding DELAYED_LIBS even more a83d564 Minor correction to client-get logic 4aa95ff In the case where the server is using a different dstore component, check that as well. 8a41b35 Remove the "<option> help" support 76850f6 Fix signed comparison warning 65df23e Ensure we check peer's gds for data b13e7d4 Silence Coverity warning 2a34472 build: Clean up delayed flags a1a5935 build: Use MCA env management for common/sse e15de4b build: Wrapper/pkg-config improvements 680326c build: check_package static improvements 330d2d4 Merge pull request open-mpi#2529 from bwbarrett/bugfix/summary_stream a819eeb build: Fix output stream bug in summary PRRTE Changes: c49eafc3ac Protect against proxy confusion 86bddb2abf Add some missing help verbiage 4e9029ccea slurm: fix breakage owing to rlm refactor 514553dedf Correctly determine when to daemonize backend prted 0db143da11 Some really minor cleanups c784c22947 build: check_package static improvements e0810859d7 Add missing CLI option and parsing a9e2303807 Merge pull request open-mpi#1292 from bwbarrett/bugfix/summary_stream 94b57788ca Merge pull request open-mpi#1291 from rhc54/topic/py fabb57117b build: Fix output stream bug in summary 17e6d86cbf Fix a problem with the "canonicalize_path" function Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The opal_hotel_checkout() function can be called manually or it can be invoked by the eviction callback. When it is invoked by the eviction callback, we do not want to opal_event_del() the eviction callback.
@rhc54 Please review -- I think that this is a bug in the hotel code.