-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heal rework #1106
Comments
We have few different approaches for the heal rework. Heal as a chain element
Heal as an outstanding component
|
@edwarnicke @denis-tingaikin |
I like the idea with Heal as a chain element. |
@edwarnicke Can we start with Heal as a chain element? |
@Bolodya1997 If we put heal as a chain element before begin... how does its state get cleaned up in the event of a timeout? |
I think we just keep in the heal chain element an initial request state that should be used in case of heal, so it looks like we don't need to store some state and cleanup it. |
I'm going to reformat your excellent analysis here @Bolodya1997 because it deserves to be read in its full glory :) message Connection {
string id = 1;
string network_service = 2;
Mechanism mechanism = 3;
connectioncontext.ConnectionContext context = 4;
map<string, string> labels = 5;
Path path = 6;
string network_service_endpoint_name = 7;
string payload = 8;
State state = 9;
}
[6] can lead to a following problem:
such path is incorrect
and so I suppose we have currently some problems with the following scenario (I don’t remember any actions related to this):
if |
@Bolodya1997 This is an excellent analysis... I'll start with talking about Path clearing.
Putting aside for a moment p2mp... I think we could pretty easily detect whether an endpoint is part of a passthrough and have updatetoken (or updatepath) trim the path. @denis-tingaikin What is the current thinking on p2mp and Path ? |
This is about the ConnectionContext. I would maintain we shouldn't diff or cleanup the ConnectionContext. Here's why:
I agree that this will uncover some bugs (see the ipam situation). |
I think you are right on Mechanism, but only if we are selecting a new NSE ( see 4a and 4b)
We definitely have two cases here that need to be handled. |
I've been thinking a bit on this. we could add an option to
Would cause you to to here: sdk/pkg/networkservice/common/begin/event_factory.go Lines 93 to 99 in e4c4722
It would then be fairly easy to write a heal chain elements. For the case where you want reselect, you could pass the option, for the case you don't, you don't. Thoughts? Update: Maybe like this: #1107 |
It doesn't look like we should perform some cleanup on timeout in
Here is a problem I can see - we already have 2 entry points for the healing:
So with
And with
Diffs actually solve 2 different problems:
Here is a big problem - if we do healing in the chain, we should assume that |
For p2p we are go as we do now. |
Can we start with simply replace Connection to some empty/intial/diff state and just call close and request? I think we can consider datapath recover later and now focus on heal architecture. |
We may probably want to look into few NSM use cases. |
Case 1Description
Steps
What should be in the heal Request
|
Case 2Description
Steps
What should be in the first update RequestUse returned request (R case)We decide that Client monitors connection updates by itself.
Use initial request (I case)We decide that Client doesn't monitor connection updates (or at least doesn't use monitor updates results to update connection).
What should be in the first heal Request
|
@edwarnicke |
That's an interesting thought. My concern is that its the opposite of the semantics of Close. Close implies we are cleaning up. That's part of why adding an option to Request made sense... because Request is "I want a connection".
I don't quite follow how this relates to onidle and timeout, as those are server side chain elements, and heal is coming from pure clients. I also don't see why monitor stops working when heal starts. Heal is initiated from a non-passthrough client. If that client is in the middle of a heal... there's nothing to monitor from inside the chain. |
I don't think of refresh or timeout as heal semantically. refresh is just making sure downstream knows the client is still alive. timeout is just the cleanup of last resort of there's no indication of liveliness from the client. |
That's the point of Request(WithReselect) ... it's semantically just asking for a Request to a different NSE. Request(WithRequest) naturally internally closes out the existing Connection. But that's not the semantics the heal chain element deals with. So mechanically its quite like Refresh (though semantically different) but with a distinct trigger. |
This is a real problem. Need to think a bit more on this. I would maintain though this is simply more of an existing problem, because the conn in main is always going to be out of date by virtue of Refresh... |
On ConnectionContext... I think part of the confusion is the question of 'Client desires' Clearly, if the Client is explicit in ConnectionContext, that says something about what it wants. That said... I would maintain implicitly the client wants the connection to remain as stable as possible. It would prefer ConnectionContext unchanged over time, if possible. The NSE may or may not be able to provide that... but to have any hope of doing so... we need to pass on the ConnectionContext the client currently has. |
I would be suggesting to split work into two directions:
For
@edwarnicke Thoughts? |
Internally Ed discussion notes:
|
@Bolodya1997 I just pushed an update to #1107 with an explanation of where we are so far. |
@edwarnicke
Total estimation for heal is |
Why? Why do we need WithBlock and WaitForReady? |
I am super curious to know more about this :) |
Note: we don't need to add them strictly. We are expecting that everything should be stable with them and without them. As we can see at this moment these options can trigger tests failing. So this is not a good symptom. |
As I know current healing is a bit complicated and with your proposed healing some things going to be simpler. Vlad means that in examples https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/heal/local-nse-death will be used the same IPs as before heal. |
Ah... so this is really 'fix the ipam problems' rather than? |
Not actually, we expect new behavior from heal because we are not cleaning up path on healing, so it is only 'fix the tests'. We have 2 kinds of the IPAM problems with old healing: Healing with the same NSE
Healing with NSE change
Healing with the same NSE is no more a problem, because we are no more deleting old path and it is the thing that should be changed in integration tests. Healing with NSE change is still a problem and it probably should somehow be solved in IPAM(?) sometime in the future. |
Updated plan:
|
I don't think this is the real answer... it sort of papers over the problem... lets open an issue to really fix the point2pointipam chain element to handle this correct. |
@edwarnicke Can we close this? |
The text was updated successfully, but these errors were encountered: