
Heal rework #1106

Closed
Bolodya1997 opened this issue Oct 13, 2021 · 36 comments

@Bolodya1997

Vladimir Popov  8:07 PM
hi Ed
here is a problem with putting heal after begin in the chain
heal sends Close with begin
begin becomes invalid for the Connection
heal tries to send Request with begin
…
8:08
so here is a question - what should be the Request after Close?

Ed Warnicke  8:08 PM
I was just contemplating this earlier as well
8:08
Lets talk a bit about how this needs to resolve
8:09
I tentatively think (and welcome to critique here) that heal looks roughly like this:
8:09
1. Originating Client detects vWire is down (first hint is update from monitor, but we could enhance to check the dataplane as well)
8:10
2. Originating Client Closes
8:10
3. Originating Client sends a Request to recreate the Connection
8:10
So the interesting question is: What does that Request in #3 look like?

Vladimir Popov  8:11 PM
what is Originating Client here?
some chain element
or some entity outside of the chain

Ed Warnicke  8:12 PM
Think of the Originating Client as the NSC itself for example
8:12
(we will get to chain elements in a bit :slightly_smiling_face: )

Vladimir Popov  8:13 PM
if we have such code:
func main() {
   ...
   c := client.NewClient()
   ...
   conn, err := c.Request(request)
   ...
}
heal happens in c, not in …?

Ed Warnicke  8:14 PM
My thought was it happens in c … but I’m open to other thoughts :slightly_smiling_face:

Vladimir Popov  8:15 PM
if heal happens in c, we have a problem - conn is updated (endpoint name, path), but in main we don’t know it
so if … further wants to update Connection, there will be a problem (edited) 
8:15
or … shouldn’t update Connection?

Ed Warnicke  8:16 PM
That is a valid concern.  What do you think we’d want to do with conn in main?

Vladimir Popov  8:20 PM
we may probably have some additional data that we want to share with Endpoint and this data can possibly be updated

Ed Warnicke  8:21 PM
OK… so it sounds like you are thinking of a heal mechanism outside the c
8:22
Hmm… there’s another problem

Vladimir Popov  8:25 PM
we probably can store endpoint name and path in heal client and update them in further requests
it is how we do it currently in heal

Ed Warnicke  8:27 PM
So if we are keeping the endpoint name and path the same… why not just use refresh?
8:27
(from inside the chain)

Vladimir Popov  8:28 PM
endpoint name should be cleaned up before heal
8:28
and path, I suppose

Ed Warnicke  8:28 PM
That was kind of what I was thinking
8:29
Which brings us back to the ‘What should request in step 3 look like?’

Vladimir Popov  8:31 PM
it can be either last original or last successful Request
but in both cases we probably need to clean some data like IPContext

Ed Warnicke  8:31 PM
Perhaps

Vladimir Popov  8:31 PM
we definitely should clean endpoint name and path, and probably we need to clean something else

Ed Warnicke  8:31 PM
Yeah… that was kind of what I was thinking as well
8:32
Looping back to your question of whether to do heal in chain or outside of chain (i.e., in the c or the …) my thought is, by virtue of request, the conn returned in the main is always going to be out of date, and so if the main() wants to change something, it's going to have to monitor to get the most recent anyway

Vladimir Popov  8:35 PM
I think that in such case we need WithMonitorEventConsumers option for the NewClient
because if we are making gRPC connection inside of c we cannot make monitor client in main

Ed Warnicke  8:35 PM
We can…
8:36
Because the client is figuring out what to connect to either WithClientURL or WithClientConn
8:36
In either case, that’s coming from the main

Vladimir Popov  8:36 PM
yes, but WithClientURL doesn’t support healing
8:36
oops

Ed Warnicke  8:36 PM
How so?

Vladimir Popov  8:36 PM
WithClientConn

Ed Warnicke  8:36 PM
It certainly does in an NSC
8:37
Keep in mind… passthrough NSEs don’t heal

Vladimir Popov  8:37 PM
if gRPC connection breaks, we cannot recreate it (edited) 

Ed Warnicke  8:37 PM
We can connect to the same entity

Vladimir Popov  8:37 PM
if we create client WithClientConn

Ed Warnicke  8:38 PM
If main passes in a WithClientConn … it got that from somewhere (probably it dialed it itself) and can get another one to the same place

Vladimir Popov  8:39 PM
and if we have WithClientURL and we create gRPC connection in main, we should manually create new monitor on every gRPC connection failure

Ed Warnicke  8:39 PM
In main, yes

Vladimir Popov  8:40 PM
so both cases actually look like healing in … :slightly_smiling_face:

Ed Warnicke  8:41 PM
Hmm… here’s an interesting idea
8:41
What if healing is purely a Monitor function
8:41
So in … you do something similar to what we do in the ‘monitor’ nsc container already
8:42
Currently, on our nsc ‘monitor’ container, it uses monitor to discover existing Connections and resends them (to recreate the chains for itself).
8:43
What if we did something similar, but if it detects ‘down’ it Closes and then Requests with a properly cleaned up variation of the last Connection it received via monitor
8:43
Thoughts?

Vladimir Popov  8:49 PM
so in such case we have some entity like:
type Entity interface {
   Request(ctx, updateRequest())
   Close(ctx, id)
}
which takes c and performs monitoring and healing
8:49
is it so?
8:49
if we want to update Connection from main, we use Entity instead of pure c

Ed Warnicke  8:50 PM
So I think we are getting close…
8:50
But I don’t see why we’d have a different interface
8:50
Oh… I see.. similar to how the begin works

Vladimir Popov  8:53 PM
it would be actually very useful if we can decide what Connection fields should be set and edited on Client side and what on the not Client side

Ed Warnicke  8:53 PM
Could you say more?

Vladimir Popov  8:54 PM
in such case we just can use simple NetworkServiceClient interface and replace all set-by-remote fields with the values received from monitor

Ed Warnicke  8:57 PM
It might be more useful to think about things that are authoritative from the downstream NSE
8:57
NSC is authoritative on MechanismPreferences
8:58
NSE is authoritative on Mechanism selected
8:58
NSE provides the ConnectionContext (though the NSC has some interest in its continuity)
8:59
Path (after the NSC) is coming from the NSE (though again, the NSC has some interest in its continuity)

Vladimir Popov  9:02 PM
NSC can set some fields in ConnectionContext I suppose

Ed Warnicke  9:03 PM
It can
9:03
ConnectionContext is sort of an ‘end to end’ aspect of the vWire

Vladimir Popov  9:03 PM
so does healing mean cleaning up ConnectionContext?

Ed Warnicke  9:04 PM
I don’t think so
9:04
Because the NSC would prefer it doesn’t change
9:04
Now the NSE may choose not to honor that request
9:04
But that’s the desire of the NSC

Vladimir Popov  9:07 PM
here is a problem with SrcAddresses and DstAddresses - ipam in healing doesn’t actually know if it should add a new address or replace an existing one (edited)

Ed Warnicke  9:10 PM
Why isn’t that a problem in Refresh?

Vladimir Popov  9:11 PM
because in refresh we have the same Connection.ID and so we can decide that we have already set these addresses

Ed Warnicke  9:11 PM
And Heal doesn’t have the same Connection.ID?

Vladimir Popov  9:11 PM
if we heal to another endpoint - no

Ed Warnicke  9:13 PM
Why not?

Vladimir Popov  9:18 PM
I mean that we have the following:
func Request() {
   if _, ok := loadAddr(ctx); ok {
      return
   }
   conn.addrs = append(conn.addrs, newAddr)
   storeAddr(ctx, newAddr)
}
during initial request we set newAddr
during refresh request we have ok
if we change endpoint, it is again an initial request (edited) 

Ed Warnicke  9:18 PM
Ah… I see… its not that we have a different Connection.ID … its that the new Endpoint doesn’t have state for the Connection.ID

Vladimir Popov  9:19 PM
yes, it is a new Connection.ID for the Endpoint

Ed Warnicke  9:21 PM
Got it
9:21
But still the same old Connection.ID for the NSC

Vladimir Popov  9:22 PM
so I think that the easiest way to implement healing would be the following:
we implement Entity
it stores initial request
every updateRequest updates both stored inital request and request we send
on heal we send initial request
9:23
in such case we know what are the fields set by NSC, so all other fields can be erased

Ed Warnicke  9:23 PM
Why not just use begin for this
9:23
Its already doing 90%+ of what we need
9:24
You’d just need to add an option to begin.FromContext(ctx).Request(…)

Vladimir Popov  9:25 PM
how should we decide to clean up initial Requests from begin?

Ed Warnicke  9:25 PM
(basically… we have the machinery)

Vladimir Popov  9:26 PM
OK, actually Close originating from begin and Close coming to begin are 2 different closes
so we can clean up initial Request on Close coming to begin
9:26
here is a problem with updating Connection from main
9:27
we cannot pass option to c.Request()
9:27
so we still need to perform some monitoring and it probably will race with healing

Ed Warnicke  9:27 PM
Yes… in this case, we’d be back to putting a chain element in to monitor

Vladimir Popov  9:29 PM
here is a problem:
1| conn := getFromMonitor()
2| request.Conn = updateConn(conn)
3| c.Request(request, ctx)
heal can change path and endpoint name between 1 and 3 lines
9:29
so main will request with outdated data
9:29
and Request will fail, because there is no such endpoint name
9:31
so if we don’t want heal to race with monitor, we should perform updates with updateConn passed to the thing that does healing

Vladimir Popov  10:06 PM
or we can just think of updates coming from main in terms of diff, so:
request := makeRequest()
conn, err := c.Request(request) --> request is sent, request is stored for healing, request-new is stored for updates
...
request = updateRequest(request)
c.Request(request) --> request-new + diff is sent, request + diff is stored for healing, request-new-2 is stored for updates
10:06
so in such case we can just make everything in the chain and so we don’t need any monitoring in main

Ed Warnicke  10:07 PM
This is an interesting idea… what kinds of diffs were you thinking of ? (edited) 

Vladimir Popov  10:09 PM
everything changed from request to updateRequest(request)

Ed Warnicke  10:09 PM
Right… but clearly only some things
10:09
Else why take the updateRequest at all
10:10
Ah I see

Vladimir Popov  10:10 PM
because we cannot pass updateRequest to the client chain

Ed Warnicke  10:10 PM
You are diffing request-new and request-update
10:10
To figure out intent to change
10:10
That’s a clever idea
10:10
Not entirely sure it’s the right one per se… but it’s clever
10:10
My instinct would be to see if we can think of something simpler, but I’m not sure we can.
10:11
But its also kind of analogous to how ‘begin’ handles Close
10:11
It basically ignores everything about your Close except the Connection.ID
10:11
On the theory that the important part of Closing is clearing out the resources
10:12
And so the last Connection from a successful request is the right one to use to Close
10:12
It sounds like you are getting to something analogous for Request
10:12
With your diff idea

Vladimir Popov  10:12 PM
yes

Ed Warnicke  10:13 PM
So just to talk through some simple examples
10:13
If the Connection.NSE is unchanged, we should probably simply update the path with whatever we have in begin
10:14
We should probably always accept the MechanismPreferences in their entirety

Vladimir Popov  10:15 PM
are you currently talking about the changes coming from main?

Ed Warnicke  10:16 PM
Yes
10:16
MechanismPreferences from main should just be accepted
10:16
If ‘main’ hasn’t changed the NSE, then begin should probably just use its most recent Path

Vladimir Popov  10:17 PM
I suppose that main shouldn’t change NSE, because it can race with heal
10:17
and so same for the path
10:18
OK, main can possibly change NSE name, but it should mean that it wants to work only with this NSE

Ed Warnicke  10:18 PM
Yes
10:18
We do and will have use cases like this
10:18
The NSC is allowed to do its own NSE selection
10:19
Most just won’t

Vladimir Popov  10:22 PM
so we can compute diff for every field like:
value - replace value
slice - diff with old slice
map - diff with old map
path probably should be ignored

Ed Warnicke  10:22 PM
It’s probably best to keep it simple

Vladimir Popov  10:23 PM
we can actually have even a SrcAddr set by client, if it is vl3 NSE acting as a client to reach another vl3 NSE
10:24
and so it will have a set of routes set both for the src and for the dst sides
10:25
It’s probably best to keep it simple
do you mean it is simple and it is OK or we should decide how to make it more simple?

Ed Warnicke  10:29 PM
I mean we should think about how to make it simpler :slightly_smiling_face:

Vladimir Popov  10:41 PM
am I right that we are OK with the following solution?
1. begin stores last input request and last returned request
   a. begin -> Request uses last returned request if it exists, or last input if not, and updates last returned request
   b. begin -> Close uses last returned request and cleans up last returned request
   c. Close -> begin uses last returned request and cleans up both last input and last returned requests
   d. Request -> begin - updates last input and ???
2. heal is located somewhere in the chain and does exactly the following:
   a. receive down event
   b. call begin -> Close
   c. until timeout or success call begin -> Request
(edited)

Ed Warnicke  10:43 PM
I don’t quite understand #c and #d

Vladimir Popov  10:44 PM
I mean main invoking Request and Close on c chain
10:44
or is your question about something else?

Ed Warnicke  10:44 PM
OK, in #c what do you mean by ‘cleans up all requests’ ?

Vladimir Popov  10:45 PM
both last input and last returned for the Connection.ID

Ed Warnicke  10:46 PM
So.. current begin behavior when Close is called from outside the chain is to use the last successfully completed Request (whether from outside or inside the chain)

Vladimir Popov  10:46 PM
yes, exactly
I mean after it cleans up both requests

Ed Warnicke  10:47 PM
What do you mean by cleans up both requests
10:47
that part I still don’t follow :slightly_smiling_face:

Vladimir Popov  10:49 PM
so it’s like we have last returned request and last input request stored for the connection
if main closes the connection, we perform a Close with last returned request and after clean up all data stored for the connection

Ed Warnicke  10:49 PM
Got it :slightly_smiling_face:
10:49
You are 100% correct, and just phrased it differently than I would have, which is fine :slightly_smiling_face:

Vladimir Popov  10:50 PM
sorry for my English :slightly_smiling_face:

Ed Warnicke  10:50 PM
1.  heal is located somewhere in the chain and does exactly the following:
receive down event
call begin -> Close
until timeout or success call begin -> Request
10:51
So… begin.Close will have cleaned up the state in begin.Request… which is a very solvable issue, but one we do need to think about

Vladimir Popov  10:52 PM
it will clean up not all the state, but only last returned request
10:52
so it is actually needed to start requesting with last input request
10:53
so here I want to make a distinction between:
main originates Close = Close -> begin
begin originates Close = begin -> Close

Ed Warnicke  10:54 PM
So that worries me a bit
10:54
Because of dangling state

Vladimir Popov  10:56 PM
there is no other chain element except heal that can try to start using begin, because Close should end all actions for them

Ed Warnicke  10:57 PM
Lets think about this in terms of ‘Events’
10:57
The event from heal isn’t quite a ‘Request’ or ‘Close’ event
10:57
Its sort of something else

Vladimir Popov  10:59 PM
why is it different?
we are closing an old “broken” connection and requesting a new one

Ed Warnicke  10:59 PM
midchain ‘Request’ events don’t originate currently

Vladimir Popov  11:01 PM
refresh originates midchain Request events
or are you talking about something different?

Ed Warnicke  11:03 PM
Yes, refresh originates midchain Request events
11:03
timeout originates midchain Close events
11:03
After midchain Close, all state is gone
11:03
(currently)
11:03
What we need is a bit different
11:04
We need a midchain Close that has state lingering for a midchain Request to follow
11:06
Perhaps that looks like an option on Close…

Vladimir Popov  11:25 PM
I prefer thinking of this like we have default state and additional state
so midchain Close removes additional state
11:27
yes, probably a Close option is exactly what we need
11:28
because we want timeout to fully cleanup the state
11:28
we don’t actually need default state for the passthrough clients

Ed Warnicke  11:31 PM
begin is smart enough to put itself at the beginning of the passthrough :slightly_smiling_face:
11:31
So begin is on the server side in passthroughs
11:31
But still comes through to the client chain elements
11:31
(and we also aren’t doing heal in passthroughs)

Vladimir Popov  11:37 PM
we may probably want refresh timeout on client to trigger endpoint reselection (a.k.a. heal)
11:37
probably heal should be an option for the begin.Close? :slightly_smiling_face:
11:38
not actually for the begin.Close but for the begin itself
11:38
so with healing enabled any midchain Close originated in the client chain should trigger heal
11:39
and begin itself can make such healing
11:41
so we will have the following:
heal client listens for down event and calls begin.Close
refresh client retries until timeout and calls begin.Close
begin.Close starts request loop
(edited)

Ed Warnicke  11:43 PM
I was thinking a bit differently
11:44
Not making it part of begin per se
11:44
But user can choose to include a heal chain element
11:44
That can use begin
11:44
As to refresh timeout triggering endpoint reselection…
11:44
We do need an exit clause from refresh
11:44
But its not clear to me that we want that to be heal

Vladimir Popov  11:45 PM
it is just an idea

Ed Warnicke  11:46 PM
I like that you are throwing out ideas :slightly_smiling_face:
11:46
Exploring the space is part of getting to a good solution :slightly_smiling_face:

Vladimir Popov  11:46 PM
for refresh I mean that if we are failing to refresh old connection, we may probably want to create a new one

Ed Warnicke  11:47 PM
Hmm…
11:47
I could see that

Vladimir Popov  11:47 PM
and it is actually a problem that we are closing old gRPC connection on refresh
11:47
so there will be no monitoring
11:47
and so no heal started

Ed Warnicke  11:47 PM
Hmm…
11:47
So if Refresh is failing
11:48
Is Monitor still working?

Vladimir Popov  11:48 PM
no, because we have closed gRPC connection
11:48
and there was an error response for the new one
11:49
it is a corner case of per-request gRPC connections and monitoring - if remote error happens during the refresh, monitor can’t get it
11:53
Ed, will you be online after 9-10 hours?
I feel like we really need to reach some result, but it is too late for me :slightly_smiling_face:

Ed Warnicke  11:56 PM
good point
11:56
Go to sleep
11:56
We can follow up later :slightly_smiling_face:
11:56
Ping me when you get back on line :slightly_smiling_face:

Vladimir Popov  11:56 PM
OK

Ed Warnicke  3:56 AM
We don’t have to trim the path: https://github.com/networkservicemesh/sdk/blob/4807755c335d44ddf0ff9685e5fe4694cb8be292/pkg/networkservice/common/updatepath/common.go#L83-L87
common.go
    if nextIndex < len(path.GetPathSegments()) && path.GetPathSegments()[nextIndex].Name != segmentName {
        // 2.1 path has next segment available, but next name is not equal to segmentName
        path.PathSegments[nextIndex].Name = segmentName
        path.PathSegments[nextIndex].Id = uuid.New().String()
    }

Ed Warnicke  4:13 AM
Looking here:
4:13
https://github.com/networkservicemesh/api/blob/9a36433d7d6e99699bbdd553249ae65d0ddd57cd/pkg/api/networkservice/connection.proto#L50-L60
connection.proto
message Connection {
  string id = 1;
  string network_service = 2;
  Mechanism mechanism = 3;
  connectioncontext.ConnectionContext context = 4;
4:14
I suspect the only things that need to be cleared for heal are Mechanism, and possibly endpoint name

Vladimir Popov  9:34 AM
message Connection {
  string id = 1;
  string network_service = 2;
  Mechanism mechanism = 3;
  connectioncontext.ConnectionContext context = 4;
  map<string, string> labels = 5;
  Path path = 6;
  string network_service_endpoint_name = 7;
  string payload = 8;
  State state = 9;
}
1. ---
2. ---
3. should be cleaned up, because we are possibly selecting a new endpoint or establishing a new type of connection with the old one
4. should be diffed, because we are possibly selecting a new endpoint
   a. without any cleanup it leads to a new endpoint ipam problem
   b. with full cleanup (not diff) we can accidentally drop some Client preferences
5. ---
6. should be cleaned up, because we may have a different path length
7. should be diffed, because we may probably want to heal to another endpoint
   a. without any cleanup we can get stuck in heal
   b. with full cleanup (not diff) we can accidentally drop Client endpoint selection
8. should be cleaned up, because we are no longer DOWN
[6] can lead to the following problem:
path: NSC -> L-NSMgr -> L-Fwd -> R-NSMgr -> R-Fwd -> NSE-1
healing to local NSE
path: NSC -> L-NSMgr -> L-Fwd -> NSE-2 -> R-Fwd -> NSE-1
such path is incorrect
we can add some trimpath chain element to the end of all non-passthrough endpoints, but it can possibly lead to problems with building p2mp paths in the future
NSC -> L-NSMgr -> L-Fwd -> NSE-1 -> L-Fwd -> NSE-2
                           ^^^^^ - should I clean the path after?
and so I suppose we have currently some problems with the following scenario (I don’t remember any actions related to this):
NSC -> ... -> A -> ... -> NSE
[NSC, A] chain was closed during the heal - healing request will be initial request
[A, NSE] chain wasn’t closed during the heal - healing request will be refresh request (if we select the same forwarder and endpoint)
if L-Fwd is in [1] and R-Fwd is in [2] we have remote mechanisms not working - probably it is currently solved, but I don’t remember anyone working on this

Vladimir Popov  9:46 AM
I have one more idea about begin, what if we will have following for the client chain:
updatepath -> heal -> begin -> metadata -> ... -> refresh -> down -> ...
heal stores last input request without computing any diffs
begin stores both last input and last returned requests and computes diffs (or ???)
we create separate package for originating midchain events - it has simple methods like put in context and get from context for Request and Close originators
heal put Close originator in request context, which does exactly the following:
close next
until timeout or success request next with last input request
begin inserts Request / Close originator only if there is no Request / Close originator in context
on any Close begin cleans up all data related to the connection
9:47
in such case begin just acts as a begin
9:48
refresh and down chain elements will originate Close events and it will trigger heal instead of begin
@Bolodya1997
Author

Bolodya1997 commented Oct 13, 2021

We have a few different approaches for the heal rework.

Heal as a chain element

(diagram)

  1. heal stores last successful input request.
  2. begin stores last successful output and input requests.
  3. begin computes diffs between input and last successful input requests and uses it for updating last successful output request.
  4. heal inserts midchain Close originator.
  5. refresh calls Close on timeout.
  6. down calls Close on DOWN event received.
  7. main requests only with fields set by itself.
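A rough sketch of the shape such a heal chain element could take (hypothetical names; only the storage of the last successful input request from item 1 is shown, the Close originator and retry loop are omitted):

package heal

import (
	"context"
	"sync"

	"github.com/golang/protobuf/ptypes/empty"
	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"

	"github.com/networkservicemesh/sdk/pkg/networkservice/core/next"
)

// healClient is a sketch, not the final element: it only remembers the last
// successful input request per Connection.ID so that a later heal can replay it.
type healClient struct {
	requests sync.Map // Connection.ID -> *networkservice.NetworkServiceRequest
}

func NewClient() networkservice.NetworkServiceClient {
	return &healClient{}
}

func (h *healClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, opts ...grpc.CallOption) (*networkservice.Connection, error) {
	// Remember what the caller asked for before the rest of the chain mutates it.
	candidate := request.Clone()
	conn, err := next.Client(ctx).Request(ctx, request, opts...)
	if err != nil {
		return nil, err
	}
	// Only successful input requests are kept - this is what heal would replay.
	h.requests.Store(conn.GetId(), candidate)
	return conn, nil
}

func (h *healClient) Close(ctx context.Context, conn *networkservice.Connection, opts ...grpc.CallOption) (*empty.Empty, error) {
	h.requests.Delete(conn.GetId())
	return next.Client(ctx).Close(ctx, conn, opts...)
}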

Heal as a standalone component

(diagram)

  1. processor manages its own monitoring.
  2. processor stores last successful input, output requests.
  3. main requests with update function, not with request.
  4. processor uses update function for updating last successful input, output requests.
  5. refresh calls Close on timeout - it will trigger DOWN event.
  6. processor starts healing with last successful input requests on DOWN event.
  7. processor.Get returns last successful request connection.
  8. processor updates last successful request connection with monitor events.
  9. processor.Subscribe returns events on last successful request updates.
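A hypothetical sketch of the processor surface described above (names are illustrative only, not a proposed API):

package processor

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
)

// UpdateFunc lets main express what it wants changed without ever holding a full,
// possibly stale, Connection (item 3 above).
type UpdateFunc func(request *networkservice.NetworkServiceRequest)

type Processor interface {
	// Request applies update to the stored last successful input request and re-requests the chain.
	Request(ctx context.Context, update UpdateFunc) (*networkservice.Connection, error)
	// Close closes the connection using the last successful output request.
	Close(ctx context.Context, id string) error
	// Get returns the connection from the last successful request (item 7 above).
	Get(id string) *networkservice.Connection
	// Subscribe returns events on last successful request updates (item 9 above).
	Subscribe(id string) <-chan *networkservice.Connection
}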

@Bolodya1997
Author

@edwarnicke @denis-tingaikin
Thoughts?

@denis-tingaikin
Member

I like the idea with Heal as a chain element.

@denis-tingaikin
Member

@edwarnicke Can we start with Heal as a chain element?
Please share your thoughts.

@edwarnicke
Member

@Bolodya1997 If we put heal as a chain element before begin... how does its state get cleaned up in the event of a timeout?

@denis-tingaikin
Member

how does its state get cleaned up in the event of a timeout?

I think we just keep in the heal chain element an initial request state that should be used in case of heal, so it looks like we don't need to store extra state and clean it up.

@edwarnicke
Member

I'm going to reformat your excellent analysis here @Bolodya1997 because it deserves to be read in its full glory :)

message Connection {
  string id = 1;
  string network_service = 2;
  Mechanism mechanism = 3;
  connectioncontext.ConnectionContext context = 4;
  map<string, string> labels = 5;
  Path path = 6;
  string network_service_endpoint_name = 7;
  string payload = 8;
  State state = 9;
}
  1. ---
  2. ---
  3. should be cleaned up, because we are possibly selecting a new endpoint or establishing a new type of connection with the old one
  4. should be diffed, because we are possibly selecting a new endpoint
    a. without any cleanup it leads to a new endpoint ipam problem
    b. with full cleanup (not diff) we can accidentally drop some Client preferences
  5. ---
  6. should be cleaned up, because we may have different path length
  7. should be diffed, because we may probably want to heal to another endpoint
    a. without any cleanup we can get stuck in heal
    b. with full cleanup (not diff) we can accidentally drop Client endpoint selection
  8. should be cleaned up, because we are no longer DOWN

[6] can lead to the following problem:

path: NSC -> L-NSMgr -> L-Fwd -> R-NSMgr -> R-Fwd -> NSE-1
healing to local NSE
path: NSC -> L-NSMgr -> L-Fwd -> NSE-2 -> R-Fwd -> NSE-1

such path is incorrect
we can add some trimpath chain element to the end of all non-passthrough endpoints, but it can possibly lead to problems with building p2mp paths in the future

NSC -> L-NSMgr -> L-Fwd -> NSE-1 -> L-Fwd -> NSE-2
                           ^^^^^ - should I clean the path after?

and so I suppose we have currently some problems with the following scenario (I don’t remember any actions related to this):

NSC -> ... -> A -> ... -> NSE
  1. [NSC, A] chain was closed during the heal - healing request will be initial request
  2. [A, NSE] chain wasn’t closed during the heal - healing request will be refresh request (if we select the same forwarder and endpoint)

if L-Fwd is in [1] and R-Fwd is in [2] we have remote mechanisms not working - probably it is currently solved, but I don’t remember anyone working on this

@edwarnicke
Member

@Bolodya1997 This is an excellent analysis... I'll start with talking about Path clearing.

should be cleaned up, because we may have different path length
[6] can lead to a following problem:

path: NSC -> L-NSMgr -> L-Fwd -> R-NSMgr -> R-Fwd -> NSE-1
healing to local NSE
path: NSC -> L-NSMgr -> L-Fwd -> NSE-2 -> R-Fwd -> NSE-1

such path is incorrect
we can add some trimpath chain element to the end of all non-passthrough endpoints, but it can possibly lead to problems with building p2mp paths in the future

NSC -> L-NSMgr -> L-Fwd -> NSE-1 -> L-Fwd -> NSE-2
                           ^^^^^ - should I clean the path after?

Putting aside for a moment p2mp... I think we could pretty easily detect whether an endpoint is part of a passthrough and have updatetoken (or updatepath) trim the path.
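For illustration, such trimming might look roughly like this inside updatepath/updatetoken (a sketch under that assumption, not current sdk code):

// After the current element has set the segment at nextIndex, drop any stale
// segments left over from a longer, pre-heal path.
if nextIndex+1 < len(path.GetPathSegments()) {
    path.PathSegments = path.PathSegments[:nextIndex+1]
}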

@denis-tingaikin What is the current thinking on p2mp and Path ?

@edwarnicke
Member

edwarnicke commented Oct 14, 2021

@Bolodya1997

  4. should be diffed, because we are possibly selecting a new endpoint
    a. without any cleanup it leads to a new endpoint ipam problem
    b. with full cleanup (not diff) we can accidentally drop some Client preferences

This is about the ConnectionContext. I would maintain we shouldn't diff or cleanup the ConnectionContext.

Here's why:

  1. Any NSE has to be able to handle the assertion by an NSC of ConnectionContext and decide for itself whether it's going to respect it or not. In other words, we are going to have to fix the ipam problem no matter what.
  2. If I have multiple NSEs providing a function, and I get reselected to a new one, it may want to respect the ConnectionContext choices made by its peer. It's easier to do that if we maintain them.

I agree that this will uncover some bugs (see the ipam situation).

@edwarnicke
Member

edwarnicke commented Oct 14, 2021

  3. should be cleaned up, because we are possibly selecting a new endpoint or establishing a new type of connection with the old one

I think you are right on Mechanism, but only if we are selecting a new NSE ( see 4a and 4b)

  4. should be diffed, because we are possibly selecting a new endpoint
    a. without any cleanup it leads to a new endpoint ipam problem
    b. with full cleanup (not diff) we can accidentally drop some Client preferences

We definitely have two cases here that need to be handled.

@edwarnicke
Member

edwarnicke commented Oct 14, 2021

I've been thinking a bit on this.

we could add an option to

begin.FromContext(ctx).Request(begin.WithReselect())

begin.WithReselect():

Would cause you to do the following here:

ctx, cancel := f.ctxFunc()
defer cancel()
conn, err := f.client.Request(ctx, f.request, f.opts...)
if err == nil && f.request != nil {
    f.request.Connection = conn
}
ch <- err

  1. call f.client.Close(ctx,f.request.GetConnection(),f.opts...) (to close the connection)
  2. clear f.request.GetConnection().Mechanism and f.request.GetConnection().NetworkServiceEndpointName
  3. continue with the request as normal

It would then be fairly easy to write a heal chain element. For the case where you want reselect, you could pass the option; for the case where you don't, you don't.
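For illustration, such an element's Request might capture begin's event factory and use it when it later sees a down event (a sketch assuming begin.WithReselect lands roughly as proposed here; watchForDown and the surrounding healClient are hypothetical; begin is github.com/networkservicemesh/sdk/pkg/networkservice/common/begin):

func (h *healClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, opts ...grpc.CallOption) (*networkservice.Connection, error) {
	conn, err := next.Client(ctx).Request(ctx, request, opts...)
	if err != nil {
		return nil, err
	}
	// Capture the event factory so a monitor goroutine can later re-trigger Request
	// for this Connection from mid-chain.
	eventFactory := begin.FromContext(ctx)
	go h.watchForDown(conn.Clone(), eventFactory) // hypothetical helper watching for DOWN
	return conn, nil
}

// Inside the hypothetical watchForDown, on a down event:
//	eventFactory.Request()                      // re-request, keeping the current NSE
//	eventFactory.Request(begin.WithReselect())  // re-request with NSE reselection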

Thoughts?

Update: Maybe like this: #1107

@Bolodya1997
Author

Bolodya1997 commented Oct 14, 2021

@edwarnicke

If we put heal as a chain element before begin... how does its state get cleaned up in the event of a timeout?

It doesn't look like we should perform any cleanup on timeout in heal. I actually suggest the following behavior:

  1. Any midchain Close triggers heal.
  2. If heal fails, it finally closes the connection and so main can be further notified with onidle (Create timeout and onidle clients #1092).
    • We really need onidle here, because monitor stops working when heal starts, so main has no idea whether the connection is still healing or already closed.
  1. call f.client.Close(ctx,f.request.GetConnection(),f.opts...) (to close the connection)
  2. clear f.request.GetConnection().Mechanism and f.request.GetConnection().NetworkServiceEndpointName
  3. continue with the request as normal

It would then be fairly easy to write a heal chain element. For the case where you want reselect, you could pass the option; for the case where you don't, you don't.

Here is a problem I can see - we already have 2 entry points for the healing:

  1. down client triggering healing on DOWN event.
  2. refresh (or timeout) client triggering healing on timeout.

So with the WithReselect approach we would need to create some heal tool and place it in all the places where we want to start healing.
And it actually mostly looks like healing should be the following:

  1. Do Close.
  2. Until success or timeout do Request.

And with WithReselect it is:

  1. Until success or timeout do Close + do Request.

I would maintain we shouldn't diff or cleanup the ConnectionContext.

Diffs actually solve 2 different problems:

  1. What the chain should Request on healing.
    • I agree, this probably can be done with a simple cleanup of a few fields (but we also already have a few bugs related to this).
  2. What main should Request on update, and what the chain should Request on update.

Here is a big problem - if we do healing in the chain, we should assume that conn in main is always invalid, so main can only send updates of the fields set by itself. In such a case we should compute a diff of the fields changed by main inside the chain (in begin or right after begin) to get a valid Request.
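As a rough illustration of that diffing for a slice-valued field set by main (a hypothetical helper, not a proposed API):

// diffStrings reports what main added to and removed from a slice-valued field
// (for example extra routes) between its previous and its new input request.
func diffStrings(previous, current []string) (added, removed []string) {
	prevSet := make(map[string]bool, len(previous))
	for _, s := range previous {
		prevSet[s] = true
	}
	currSet := make(map[string]bool, len(current))
	for _, s := range current {
		currSet[s] = true
	}
	for _, s := range current {
		if !prevSet[s] { // present now, not before: added by main
			added = append(added, s)
		}
	}
	for _, s := range previous {
		if !currSet[s] { // present before, not now: removed by main
			removed = append(removed, s)
		}
	}
	return added, removed
}

begin (or an element right after it) could then apply added/removed on top of the last returned request instead of trusting the stale Connection held by main.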

@denis-tingaikin
Member

@edwarnicke

@denis-tingaikin What is the current thinking on p2mp and Path ?

For p2p we proceed as we do now.
For p2mp Path is dependent on p2mp forwarder implementation.

@denis-tingaikin
Member

@edwarnicke

Can we start with simply replacing Connection with some empty/initial/diff state and just calling close and request? I think we can consider datapath recovery later and focus on heal architecture for now.

@Bolodya1997
Author

We probably want to look into a few NSM use cases.

@Bolodya1997
Author

Case 1

Description

  1. Client requests NSM for ns service over one of kernel, memif mechanism.
  2. NSM provides it with remote nse-1: 1.0.0.0/16 over kernel.
  3. After some time nse-1 dies and NSM heals connection to local nse-2: 2.0.0.0/16 over memif.

Steps

  1. Client requests NSM:
request: {
    preferences: { kernel, memif },
    connection: {
        id: nsc-1,
        service: ns,
    },
}
  2. NSM responds:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { 1.0.0.1/32 },
        srcRoutes: { 1.0.0.2/32 },
        dstAddrs: { 1.0.0.2/32 },
        dstRoutes: { 1.0.0.1/32 },
    },
    endpoint: nse-1,
    path: { nsc, l-nsmgr, l-fwd, r-nsmgr, r-fwd, nse-1 },
}
  3. nse-1 dies, heal requests NSM:
...
  4. Expected NSM response:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { memif },
    context: {
        srcAddrs: { 2.0.0.1/32 },
        srcRoutes: { 2.0.0.2/32 },
        dstAddrs: { 2.0.0.2/32 },
        dstRoutes: { 2.0.0.1/32 },
    },
    endpoint: nse-2,
    path: { nsc, l-nsmgr, l-fwd, nse-2 },
}

What should be in the heal Request

mechanism

If we don't clean up mechanism, a new one cannot be selected from the preferences, so we need it to be cleaned up.

context

If we don't clean up context we will have the ipam problem. Probably the endpoint can decide that it should clean up existing addrs/routes if it is not able to use them.

  1. Cleanup context in heal client.
  2. Make decision in nse-2.

endpoint

If we don't clean up endpoint, NSMgr cannot select a new one, so we need it to be cleaned up.

path

path cannot be simply overwritten, because it has a different length, so we need to do one of the following:

  1. Cleanup path in heal client.
  2. Trim path in nse-2.

request

request: {
    preferences: { kernel, memif },
    connection: {
        id: nsc-1,
        service: ns,
        context(1): {},
        context(2): {
            srcAddrs: { 1.0.0.1/32 },
            srcRoutes: { 1.0.0.2/32 },
            dstAddrs: { 1.0.0.2/32 },
            dstRoutes: { 1.0.0.1/32 },
        },
        path(1): {}
        path(2): { nsc, l-nsmgr, l-fwd, r-nsmgr, r-fwd, nse-1 },
    },
}

@Bolodya1997
Author

Case 2

Description

  1. Client requests NSM for ns service over kernel mechanism. Further it needs to provide ns to some a.a.a.a/x subnet.
  2. NSM provides it with local nse-1: 1.0.0.0/16 over kernel.
  3. Client application additionally needs to provide ns to b.b.b.b/x subnet.
  4. After some time nse-1 dies and NSM heals connection to local nse-2: 2.0.0.0/16 over kernel.
  5. During the healing Client application decides that it no longer needs to provide ns to a.a.a.a/x, but needs to provide it to c.c.c.c/x.
  6. After some time nse-2 dies and NSM heals connection to local nse-3: 3.0.0.0/16 over kernel.

Steps

  1. Client requests NSM:
request: {
    preferences: { kernel },
    connection: {
        id: nsc-1,
        service: ns,
        context: {
            srcAddrs: { a.a.a.1/x },
            dstRoutes: { a.a.a.a/x },
        },
    },
}
  2. NSM responds:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { a.a.a.1/x, 1.0.0.1/32 },
        srcRoutes: { 1.0.0.2/32 },
        dstAddrs: { 1.0.0.2/32 },
        dstRoutes: { a.a.a.a/x, 1.0.0.1/32 },
    },
    endpoint: nse-1,
    path: { nsc, nsmgr, fwd, nse-1 },
}
  3. Client requests NSM with update:
...
  4. Expected NSM response:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { a.a.a.1/x, 1.0.0.1/32, b.b.b.1/x },
        srcRoutes: { 1.0.0.2/32 },
        dstAddrs: { 1.0.0.2/32 },
        dstRoutes: { a.a.a.a/x, 1.0.0.1/32, b.b.b.b/x },
    },
    endpoint: nse-1,
    path: { nsc, nsmgr, fwd, nse-1 },
}
  5. nse-1 dies, heal requests NSM:
...
  6. Expected NSM response:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { a.a.a.1/x, b.b.b.1/x, 2.0.0.1/32 },
        srcRoutes: { 2.0.0.2/32 },
        dstAddrs: { 2.0.0.2/32 },
        dstRoutes: { a.a.a.a/x, b.b.b.b/x, 2.0.0.1/32 },
    },
    endpoint: nse-2,
    path: { nsc, nsmgr, fwd, nse-2 },
}
  7. Client requests NSM with update:
...
  8. Expected NSM response:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { b.b.b.1/x, 2.0.0.1/32, c.c.c.1/x },
        srcRoutes: { 2.0.0.2/32 },
        dstAddrs: { 2.0.0.2/32 },
        dstRoutes: { b.b.b.b/x, 2.0.0.1/32, c.c.c.c/x },
    },
    endpoint: nse-2,
    path: { nsc, nsmgr, fwd, nse-2 },
}
  9. nse-2 dies, heal requests NSM:
...
  10. Expected NSM response:
connection: {
    id: nsc-1,
    service: ns,
    mechanism: { kernel },
    context: {
        srcAddrs: { b.b.b.1/x, c.c.c.1/x, 3.0.0.1/32 },
        srcRoutes: { 3.0.0.2/32 },
        dstAddrs: { 3.0.0.2/32 },
        dstRoutes: { b.b.b.b/x, c.c.c.c/x, 3.0.0.1/32 },
    },
    endpoint: nse-3,
    path: { nsc, nsmgr, fwd, nse-3 },
}

What should be in the first update Request

Use returned request (R case)

We decide that Client monitors connection updates by itself.

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { a.a.a.1/x, 1.0.0.1/32, b.b.b.1/x },
            srcRoutes: { 1.0.0.2/32 },
            dstAddrs: { 1.0.0.2/32 },
            dstRoutes: { a.a.a.a/x, 1.0.0.1/32, b.b.b.b/x },
        },
        endpoint: nse-1,
        path: { nsc, nsmgr, fwd, nse-1 },
    },
}

Use initial request (I case)

We decide that Client doesn't monitor connection updates (or at least doesn't use monitor updates results to update connection).

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { a.a.a.1/x, b.b.b.1/x },
            dstRoutes: { a.a.a.a/x, b.b.b.b/x },
        },
    },
}

What should be in the first heal Request

context

It is almost impossible for the endpoint to decide which of a.a.a.a/x, 1.0.0.1/32 it should clean up and which it should preserve.
So making the decision on the endpoint side no longer looks like an option.

request

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { a.a.a.1/x, b.b.b.1/x },
            dstRoutes: { a.a.a.a/x, b.b.b.b/x },
        },
        path(1): {},
        path(2): { nsc, nsmgr, fwd, nse-1 },
    },
}

R case

It is almost impossible to make request from last input or last returned request.
The only option I can see is to use first input request for making heal request.

I case

It is almost last input request. So we just need to clean path [1] or preserve it [2].

What should be in the second update Request

R case

Here is a problem, because we haven't yet received a monitor update from healing, so we can only send the following:

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { a.a.a.1/x, 1.0.0.1/32, b.b.b.1/x },  # !
            srcRoutes: { 1.0.0.2/32 },                       # !
            dstAddrs: { 1.0.0.2/32 },                        # !
            dstRoutes: { a.a.a.a/x, 1.0.0.1/32, b.b.b.b/x }, # !
        },
        endpoint: nse-1,                                     # !
        path: { nsc, nsmgr, fwd, nse-1 },                    # !
    },
}

I case

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { b.b.b.1/x, c.c.c.1/x },
            dstRoutes: { b.b.b.b/x, c.c.c.c/x },
        },
    },
}

What should be in the second heal Request

request

request: {
    preferences: { kernel },
        connection: {
        id: nsc-1,
        service: ns,
        mechanism: { kernel },
        context: {
            srcAddrs: { b.b.b.1/x, c.c.c.1/x },
            dstRoutes: { b.b.b.b/x, c.c.c.c/x },
        },
        path(1): {},
        path(2): { nsc, nsmgr, fwd, nse-2 },
    },
}

R case

It is almost impossible to make request from last input or last returned request.
So it is also almost impossible to make request from first input request.

I case

It is almost last input request. So we just need to clean path [1] or preserve it [2].

@Bolodya1997
Author

Bolodya1997 commented Oct 14, 2021

@edwarnicke
It looks like we most probably want to make some input request diffs for both update and heal cases, and so it also looks like there is a problem with cleaning up connection context on the NSE side.

@edwarnicke
Member

edwarnicke commented Oct 14, 2021

Any midchain Close triggers heal.

That's an interesting thought. My concern is that it's the opposite of the semantics of Close. Close implies we are cleaning up. That's part of why adding an option to Request made sense... because Request is "I want a connection".

If heal fails, it finally closes the connection and so main can be further notified with onidle (Create timeout and onidle clients #1092).
We really need here onidle, because monitor stops working when heal starts, so main has no idea if connection is still healing, or it is already closed.

I don't quite follow how this relates to onidle and timeout, as those are server side chain elements, and heal is coming from pure clients.

I also don't see why monitor stops working when heal starts. Heal is initiated from a non-passthrough client. If that client is in the middle of a heal... there's nothing to monitor from inside the chain.

@edwarnicke
Member

Here is a problem I can see - we already have 2 entry points for the healing:

down client triggering healing on DOWN event.
refresh (or timeout) client triggering healing on timeout.

I don't think of refresh or timeout as heal semantically. refresh is just making sure downstream knows the client is still alive. timeout is just the cleanup of last resort if there's no indication of liveness from the client.

@edwarnicke
Member

edwarnicke commented Oct 14, 2021

@Bolodya1997

So with WithReselect approach we should create some heal tool and locate it in all places where we want to start healing.
And it actually mostly looks like healing should be the following:

Do Close.
Until success or timeout do Request.
And with WithReselect it is:

Until success or timeout do Close + do Request.

That's the point of Request(WithReselect) ... it's semantically just asking for a Request to a different NSE. Request(WithReselect) naturally internally closes out the existing Connection. But that's not the semantics the heal chain element deals with. So mechanically it's quite like Refresh (though semantically different) but with a distinct trigger.

@edwarnicke
Member

Here is a big problem - if we do healing in the chain, we should assume that conn in the main is always invalid, so main can only send updates of the fields set by itself. In such case we should compute diff of the changed by-main fields inside of the chain (in begin or right after the begin) to get valid Request.

This is a real problem. Need to think a bit more on this. I would maintain though this is simply more of an existing problem, because the conn in main is always going to be out of date by virtue of Refresh...

@edwarnicke
Member

On ConnectionContext... I think part of the confusion is the question of 'Client desires'

Clearly, if the Client is explicit in ConnectionContext, that says something about what it wants.

That said... I would maintain implicitly the client wants the connection to remain as stable as possible. It would prefer ConnectionContext unchanged over time, if possible. The NSE may or may not be able to provide that... but to have any hope of doing so... we need to pass on the ConnectionContext the client currently has.

@denis-tingaikin
Member

denis-tingaikin commented Oct 16, 2021

I would suggest splitting the work into two directions:

  1. Add initial heal for NSM release v1.1.0 (that will work no worse than in v1.0.0, but super simple). -- Focus on this for now.
  2. Continue to investigate healing problems and improve heal for v1.2.0.

For v1.1.0 we can start with simply adding a heal chain element for NSM clients that will follow this logic:

  1. Receive down event.
  2. Do close for the actual connection.
  3. Do request with the initial request.
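
For illustration, that minimal logic might look roughly like this, using the existing MonitorConnections API (a sketch: selector details, retries, backoff and cancellation are all omitted, and heal here is a plain function rather than a chain element):

package heal

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"
)

// heal waits for a down/delete event for the connection and then does Close + Request
// with the initial request, as in steps 1-3 above.
func heal(ctx context.Context, cc grpc.ClientConnInterface, client networkservice.NetworkServiceClient, initial *networkservice.NetworkServiceRequest) {
	stream, err := networkservice.NewMonitorConnectionClient(cc).MonitorConnections(ctx, &networkservice.MonitorScopeSelector{
		PathSegments: []*networkservice.PathSegment{{Id: initial.GetConnection().GetId()}},
	})
	if err != nil {
		return
	}
	for {
		event, err := stream.Recv()
		if err != nil {
			return
		}
		for _, conn := range event.GetConnections() {
			down := event.GetType() == networkservice.ConnectionEventType_DELETE ||
				conn.GetState() == networkservice.State_DOWN
			if !down {
				continue
			}
			_, _ = client.Close(ctx, conn)              // 2. close the actual connection
			_, _ = client.Request(ctx, initial.Clone()) // 3. request with the initial request
		}
	}
}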

@edwarnicke Thoughts?

@denis-tingaikin
Member

denis-tingaikin commented Oct 17, 2021

Notes from an internal discussion with Ed:

  1. We should focus on integration tests from v1.0.0 working with v1.1.0
  2. @edwarnicke has started work on heal in Add begin.WithReselect option #1107
  3. @edwarnicke will stop work on heal on Monday (17.10.21) and then we (@Bolodya1997, @denis-tingaikin, @Mixaster995, @ThetaDR, @DVEfremov) will continue Add begin.WithReselect option #1107 to achieve item 1.
  4. For release v1.2.0 we should consider more complex cases like datapath recovery, user connection updates, complex cases SDK testing and so on. @edwarnicke will start to think about it from Monday and will provide a Spec document or GitHub issue.

@edwarnicke
Member

@Bolodya1997 I just pushed an update to #1107 with an explanation of where we are so far.

@Bolodya1997
Author

@edwarnicke
Looks like we only need to fix sandbox tests and make sure that integration tests are working, so we have the following plan + estimations:

  1. Take a deep dive into the new heal implementation - 1d.
  2. Integration tests (1):
    i. Fix (change 172.16.1.10{2|3} to 172.16.1.10{0|1} in several tests), run and collect failures and logs - 1d.
    ii. Check for the VPP Forwarder remote mechanisms failures - 2d.
    iii. If [ii] happens and we fail to fix it in 2d, clean path instead of using trimpath - 1d.
  3. sdk
    i. Set back grpc.WithBlock, grpc.WaitForReady and fix sandbox tests - 1.5d.
    ii. Fix local NSMgr heal test - 1.5d.
    iii. Make sure that everything works - 2d.
  4. Integration tests (2):
    i. Run and fix failures - 3d.
  5. Additionally check heal with p2mp - 1d.

The total estimate for heal is 14d.
We may additionally need some common time buffer of about 5d to validate that all release updates work with each other without problems.

@edwarnicke
Member

Set back grpc.WithBlock, grpc.WaitForReady and fix sandbox tests

Why? Why do we need WithBlock and WaitForReady?

@edwarnicke
Member

Fix (change 172.16.1.10{2|3} to 172.16.1.10{0|1} in several tests), run and collect failures and logs

I am super curious to know more about this :)

@denis-tingaikin
Member

Why? Why do we need WithBlock and WaitForReady?

Note: we don't strictly need to add them. We expect that everything should be stable both with and without them. As we can see, at this moment these options can trigger test failures, which is not a good symptom.

@denis-tingaikin
Member

I am super curious to know more about this :)

As far as I know, the current healing is a bit complicated, and with your proposed healing some things are going to be simpler. Vlad means that in the examples https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/heal/local-nse-death the same IPs will be used as before the heal.

@edwarnicke
Member

Ah... so this is really 'fix the ipam problems' rather than?

@Bolodya1997
Author

Bolodya1997 commented Oct 19, 2021

Ah... so this is really 'fix the ipam problems' rather than?

Not actually, we expect new behavior from heal because we are not cleaning up path on healing, so it is only 'fix the tests'.

We have 2 kinds of IPAM problems with the old healing:

Healing with the same NSE

  1. NSC-1 requests NSE for NS.
  2. NSE responds with srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32 and stores them in metadata.
  3. Healing happens.
  4. NSC-1 requests NSE with cleaned up path and srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32 set.
  5. NSE allocates srcAddrs: a.a.a.3/32, dstAddrs: a.a.a.4/32, because {1|2} are busy and existing metadata entry is for the old Connection.ID.
  6. Forwarder uses all of srcAddrs: a.a.a.{1|3}/32, dstAddrs: a.a.a.{2|4}/32 for the connection interfaces.
  7. Timeout happens for the [1-2] connection and NSE cleans up {1|2} metadata entry.
  8. NSC-2 requests NSE for NS.
  9. NSE responds with srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32 and stores them in metadata.
  10. Forwarder fails to set these addresses because they are already in use.

Healing with NSE change

  1. NSC-1 requests NSE-1 for NS.
  2. NSE-1 responds with srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32.
  3. NSC-2 requests NSE-1 for NS.
  4. NSE-1 responds with srcAddrs: a.a.a.3/32, dstAddrs: a.a.a.4/32.
  5. Healing happens.
  6. NSC-2 requests NSE-2 for NS with srcAddrs: a.a.a.3/32, dstAddrs: a.a.a.4/32 set.
  7. NSE-2 allocates srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32.
  8. Forwarder uses all of srcAddrs: a.a.a.{1|3}/32, dstAddrs: a.a.a.{2|4}/32 for the connection interfaces.
  9. NSC-1 requests NSE-2 for NS with srcAddrs: a.a.a.1/32, dstAddrs: a.a.a.2/32 set.
  10. NSE-2 allocates srcAddrs: a.a.a.3/32, dstAddrs: a.a.a.4/32.
  11. Forwarder fails to set these addresses because they are already in use.

Healing with the same NSE is no longer a problem, because we are no longer deleting the old path, and that is the thing that should be changed in the integration tests.

Healing with NSE change is still a problem and it probably should somehow be solved in IPAM(?) sometime in the future.
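
For context, the pattern that produces this is roughly the one sketched in the chat above: the endpoint keys its ipam state by Connection.ID and appends fresh addresses whenever it has no record for that ID, even if the (healed) connection already carries addresses from the previous endpoint. A simplified, hypothetical sketch (not the actual point2pointipam code):

package ipamsketch

import "github.com/networkservicemesh/api/pkg/api/networkservice"

// ipamServer is a hypothetical stand-in for a point2pointipam-like element.
type ipamServer struct {
	allocated map[string]struct{}      // Connection.ID -> addresses already assigned
	nextPair  func() (src, dst string) // hypothetical allocator of the next free /32 pair
}

func (s *ipamServer) assign(conn *networkservice.Connection) {
	if _, known := s.allocated[conn.GetId()]; known {
		return // refresh on the same endpoint: addresses were already assigned
	}
	src, dst := s.nextPair()
	s.allocated[conn.GetId()] = struct{}{}
	ipContext := conn.GetContext().GetIpContext()
	// On heal to a new endpoint these appends land next to the old endpoint's addresses,
	// and the forwarder ends up programming both the old and the new pair.
	ipContext.SrcIpAddrs = append(ipContext.SrcIpAddrs, src)
	ipContext.DstIpAddrs = append(ipContext.DstIpAddrs, dst)
}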

@glazychev-art
Contributor

Updated plan:

  • 1. Take a deep dive into the new heal implementation - 1d.
  • 2. Integration tests (1):
    • i. Fix (change 172.16.1.10{2|3} to 172.16.1.10{0|1} in several tests), run and collect failures and logs - 1d.
    • ii. Check for the VPP Forwarder remote mechanisms failures - 2d. + 1d. (If we fail to fix it in 2d, clean path instead of using trimpath)
  • 3. sdk
    • i. Fix local NSMgr heal test (doesn't work without WithBlock) - 3d.
    • ii. Make sure that everything works - 2d.
  • 4. Integration tests (2):
    • i. Run and fix failures - 3d.
  • 5. Additionally check heal with p2mp - 1d.

@edwarnicke
Member

Fix (change 172.16.1.10{2|3} to 172.16.1.10{0|1} in several tests), run and collect failures and logs

I don't think this is the real answer... it sort of papers over the problem... let's open an issue to really fix the point2pointipam chain element to handle this correctly.

@denis-tingaikin
Member

@edwarnicke Can we close this?
