Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figure out a better way to handle expiration of sessions on fabric deletion #9642

Closed
bzbarsky-apple opened this issue Sep 13, 2021 · 11 comments · Fixed by #19910
Closed

Figure out a better way to handle expiration of sessions on fabric deletion #9642

bzbarsky-apple opened this issue Sep 13, 2021 · 11 comments · Fixed by #19910

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

#9571 introduces code that waits for an exchange to close by overriding the exchange delegate. This is fragile and we should have a better solution.

Proposed Solution

Some options:

  1. Have a more generic system for registering to be notified when exchanges close.
  2. Change the remove fabric command to just not send a response, and fix CommandHandler to correctly handle no response being sent (which apparently right now it does not handle correctly). If we do this, then we can expire the connection immediately.
  3. Change the response-sending code to handle the connection being expired gracefully (which apparently it does not right now).
  4. Add a way to flag on the CommandHandler that it should expire secure sessions for the accessing fabric once it's done sending the response and has closed the exchange.

I personally thing option 1 is too overdesigned for what we want here. Option 4 is pretty simple, but scatters the logic for the "deleting current fabric" and "deleting other fabric" case around a bit, unless we make CommandHandler expire sessions for whatever fabric is deleted, not the accessing fabric. Options 2 and 3 both mean the command sender will not get a response, which the spec allows but might not be ideal.

Thoughts?

@pan-apple @msandstedt @mrjerryjohns

@msandstedt
Copy link
Contributor

Specifically to #9571 (comment), isn't the easiest thing to do to send the response without reliability (or equivalently to terminate the tcp connection immediately after sending the response)? Why does the exchange need to live on in the first place?

So this is similar to option 2, but instead of not sending the response, we do send it but also tear down everything immediately after sending it.

@msandstedt
Copy link
Contributor

In fact, now that I'm thinking about it, isn't number two as written (send no response) already what's in the spec? It makes sense. How can you continue to communicate after the fabric is removed? You can't, or at least shouldn't be able to.

@bzbarsky-apple
Copy link
Contributor Author

isn't the easiest thing to do to send the response without reliability (or equivalently to terminate the tcp connection immediately after sending the response)?

This sounds like my option 3? Right now this does not work, apparently; I was not told what exactly does not work about it.

@msandstedt
Copy link
Contributor

isn't the easiest thing to do to send the response without reliability (or equivalently to terminate the tcp connection immediately after sending the response)?

This sounds like my option 3? Right now this does not work, apparently; I was not told what exactly does not work about it.

Ah I see, yes that sounds like 3.

And I guess the spec isn't totally clear on this:

All secure sessions, exchanges and interaction model constructs related to the Operational Identity under the given Fabric SHALL also be removed. Since this operation involves the removal of the secure session data that may underpin the current set of exchanges, the Node invoking the command SHOULD NOT expect a response before terminating its secure session with the target.

This implies that no response (option 2) is a possible way to handle this. I'm not entirely sure that the spec is requiring that though. Sending a response and then immediately tearing down the session (option 3) seems much nicer ergonomically if you care about this operation being at all reliable. Else, you would be forced to infer success by trying to establish a new session and verifying you receive a no-trusted-root error response.

@bzbarsky-apple
Copy link
Contributor Author

Yes, the spec allows but does not require sending a response.

Yes, sending a response would be nice. Option 3 would be good if we can figure out what's not working about that approach right now.

@bzbarsky-apple
Copy link
Contributor Author

So the specific things that don't work with option 3 are:

  1. emberAfSendImmediateDefaultResponse does not in fact send anything immediately. It just queues up the response on the CommandHandler. The send, because of the whole "multiple invokes per packet" business, happens only when al command invokes are done.
  2. The actual response-sending code does handle the session being gone, in the sense that it sends nothing, looks like.
  3. As we then try to close the exchange we crash.

Why do we crash? Because we can't send any messages on the exchange after we shut down the session. So when we go to close the exchange we can't send the remaining standalone ack, never reach the SetAckPending(false) call (due to checking for a live session and taking an early return before we get that far), and then hit this bit in the ExchangeContext destructor:

    VerifyOrDie(!IsAckPending());

@bzbarsky-apple
Copy link
Contributor Author

Note that not sending a response would not help with the above.

@msandstedt
Copy link
Contributor

Conceptually, I would have expected this to be simple. Omitting protocol matching, we have a stack that could look something like this:

TransportMgrHandler(msg) {
  SessionMgrHandler(transportCtx, msg) {
    rv = ExchangeMgrHandler(sessionCtx, msg) {
      rv = CHIP_NO_ERROR
      foreach ex in ExchangeDelegates do {
        if (ex->session == sessionCtx and msg->ExchangeId == ex->ExchangeId) {
          rv = ev->MsgHandler(msg)
          break
        }   
      }   
      if (rv == TERMINAL) {
        foreach ex in ExchangeDelegates do {
          if (ex->session == sessionCtx) {
            ex->close
          }   
        }   
      }   
      return rv; 
    }   
    if (rv == TERMINAL) {
      sessionCtx->close
    }   
  }
}

What about the structure of our code prevents such an approach? Is fabric deletion async? Why can't we propagate the close-session action up the stack with function return values?

@mrjerryjohns
Copy link
Contributor

Alternatively, why not just invoke a Post method to asynchronously handle clean-up of the accessing session? That way, in the code flow that is handling the message on the exchange, it can safely send a response back on that session (immediately though), and then in a async code path due to the post, actually clean-up the session.

@bzbarsky-apple
Copy link
Contributor Author

@msandstedt Sorry for the lag. We could push the session expiration further down into the networking stack so it doesn't actually get expired until after we have done the message send. It's all doable; it would just need to change the API contracts all up the callstack to until we get out of the exchange message handling....

@mrjerryjohns We want to clean up the session ASAP, not async, so no more messages will come in on it after fabric deletion. Or at least that's my impression based on why #9571 happened. As in, once we have removed the fabric, things on that fabric should not be able to change the node's state anymore.

@bzbarsky-apple
Copy link
Contributor Author

Now that #19328 is fixed, we should be able to do this much more sanely.

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 23, 2022
This prevents establishment of new sessions for the fabric, and shuts down other
communication on the sessions that do exist, which our old setup did not use to
do.

Fixes project-chip#9642
woody-apple pushed a commit that referenced this issue Jun 23, 2022
This prevents establishment of new sessions for the fabric, and shuts down other
communication on the sessions that do exist, which our old setup did not use to
do.

Fixes #9642
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants